--- title: "A basic workflow for sitePath" author: "Chengyang Ji" package: sitePath output: BiocStyle::html_document: toc_float: true abstract: > Detection of sites with fixation of amino acid substitutions in protein evolution. vignette: > %\VignetteIndexEntry{A basic workflow for sitePath} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` # Introduction The `sitePath` package does hierarchical search for fixation events given multiple sequence alignment and phylogenetic tree. These fixation events can be specific to a phylogenetic lineages or shared by multiple lineages. This is achieved by three major steps: 1. Import tree and sequence alignment 2. Resolve phylogenetic lineages 3. Hierarchical search for lineage-dependent fixation events # Import tree and sequence alignment There're various R packages for parsing phylogenetic tree and multiple sequence alignment files. For now, `sitepath` accepts `phylo` object and `alignment` object. Functions from `ggtree` and `seqinr` are able to handle most file formats. ## Parse phylogenetic tree The S3 phylo class is a common data structure for phylogenetic analysis in R. The CRAN package [ape](https://cran.r-project.org/web/packages/ape/index.html) provides basic parsing function for reading tree files. The Bioconductor package [ggtree](https://bioconductor.org/packages/release/bioc/html/ggtree.html) provides more comprehensive parsing utilities. ```{r import_tree, message=FALSE} library(ape) tree <- read.tree(system.file("extdata", "ZIKV.newick", package = "sitePath")) ``` It is highly recommended that the file stores a rooted tree as R would consider the tree is rooted by default and re-rooting the tree in R is difficult. ## Parse and add sequence alignment Most multiple sequence alignment format can be parsed by [seqinr](https://cran.r-project.org/web/packages/seqinr/index.html). sitePath has a wrapper function for parsing and adding the sequence alignment. ```{r add_alignment, message=FALSE} library(sitePath) alignment_file <- system.file("extdata", "ZIKV.fasta", package = "sitePath") tree <- addMSA(tree, alignment_file, "fasta") ``` # Resolve phylogenetic lineages The names in tree and aligment must be matched. We use a **tip-to-root** algorithm to trim tree leaves/tips to expose the major branches. Before finding putative phylogenetic lineages, there involves a few more steps to evalute the impact of threshold on result. ## The impact of threshold on resolving lineages In the current version, the resolving function only takes sequence similarity as one single threshold. The impact of threshold depends on the tree topology hence there is no universal choice. The function `sneakPeak` samples thresholds and calculates the resulting number of paths. *The use of this function can be of great help in choosing the threshold.* ```{r sneakPeek_plot} preassessment <- sneakPeek(tree) plot(preassessment$similarity, preassessment$pathNum) ``` ## Choose a threshold by yourself Use the function `lineagePath` to get a S3 sitePath class object^[The S3 sitePath object contains the nodes to represent phylogenetic lineages] for downstream analysis. The choice of the threshold really depends. You can use the result from `sneakPeak` as a reference for threhold choosing. Here the default value is used. ```{r get_lineagePath} paths <- lineagePath(tree) paths ``` You can visualize the result. ```{r plot_paths} plot(paths, no.margin = TRUE) ``` # Hierarchical search for fixation events Now you're ready to find fixation events. ## Fixation mutations The hierarchical search is done by `fixationSites` function. The function outputs a list of mutations with the sequence names involved before and after the fixation. ```{r find_fixations} fixations <- fixationSites(paths) fixations ``` If you want to retrieve the tip names involved in the fixation of a site, you can pass the result of `fixationSites` and the site index to `extractSite` function. The output is a `sitePath` object which stores the tip names. ```{r get_sitePath} extractSite(fixations, 139) ``` It is also possible to retrieve the tips involved in the fixation of the site. ```{r get_tipNames} extractTips(fixations, 139) ``` ## Visualize the result Use `plot` to visualize each fixation mutation. You'll have to specify which mutation to be visualized. Because the `names(fixationSites)` are the mutation names, you can also use one of them as function input ```{r plot_fixations} plotSingleSite(fixations, 139) ``` # Additional functions This part is extra and experimental but might be useful when pre-assessing your data. We'll use an example to demonstrate. ## Inspect one site The `plotSingleSite` function will color the tree according to amino acids if you use the output of `lineagePath` function. ```{r plot_sites} plotSingleSite(paths, 139) plotSingleSite(paths, 763) ``` ## SNP sites An SNP site could potentially undergo fixation event. The `SNPsites` function predicts possible SNP sites and the result could be what you'll expect to be fixation mutation. ```{r find_SNP} snps <- SNPsites(tree) plotSingleSite(paths, snps[4]) plotSingleSite(paths, snps[5]) ``` ## Use just the trimming function The function `groupTips` does the same on the tree as the function `lineagePath` does but it outputs clusters of tree tips instead of paths. This will give you a feel of the behavior of **tip-to-root** trimming algorithm. The `names` of the output is the internal node number in `phylo` object. ```{r group_tips} grouping <- groupTips(tree) grouping ``` # Session info {.unnumbered} ```{r session_info} sessionInfo() ```