--- title: "dagLogo Vignette" author: "Jianhong Ou, Haibo Liu, Lihua Julie Zhu" date: "`r doc_date()`" package: "`r pkg_ver('dagLogo')`" bibliography: bibliography.bib abstract: > Sequence logo has been widely used as a graphical representation of nucleic acid motifs and conserved amino acid (AA) usage. We have developed a R/Bioconductor package _dagLogo_ to facilitate the identification and visualization of significant differential usage of AAs between experimental sets and various background sets, with or without grouping based on AA properties. vignette: > %\VignetteIndexEntry{dagLogo Vignette} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} output: BiocStyle::html_document --- ```{r echo=FALSE, results='hide', warning=FALSE, message=FALSE} suppressPackageStartupMessages({ library(dagLogo) library(biomaRt) library(UniProt.ws) library(motifStack) library(Biostrings) library(grDevices) }) ``` # Introduction A sequence logo has been widely used as a graphical representation of nucleic acid sequence motifs. There is a R package _seqlogo_[@Oliver2006] for drawing sequence logos for a single DNA motif. There is another R package _motifStack_[@ou2018motifstack] for depicting individual sequence motif as well as multiple motifs for amino acid (AA), DNA and RNA sequences. _IceLogo_[@Colaert2009] is a tool developed in Java for visualizing significantly conserved sequence patterns in a list of aligned AA sequences against a set of background sequences. Compared to _webLogo_[@Crooks2004], which relies on information theory, _IceLogo_ builds on probability theory. It is reported that _IceLogo_ has a more dynamic nature and is more appropriate for analysis of conserved sequence patterns. However, _IceLogo_ can only compare conserved sequences to reference sequences at the individual amino acid level. As we know, some conserved sequence patterns are not conserved at the individual amino acid level, but conserved at the level of amino acid group characteristic of their physical and chemical properties, such as charge and hydrophobicity. Here we developed a R/Bioconductor package _dagLogo_, for visualizing significantly conserved sequence patterns relative to a proper background set of sequences, with or without grouping amino acid residuals based on their physical and chemical properties. Figure 1 shows the flowchart of performing analysis using dagLogo. Comparing to existing tools, _dagLogo_ allows aligned or not aligned subsequences of different length as input; Provides more options and functions to generate various background sets that can be tailored to fit the experimental design; Both significantly over- and under-represented amino acid residues can be plotted; AA residues can be grouped and statistical significance test can be performed at the group level. ![Figure 1. Flowchart of performing analysis using _dagLogo_. Two ways to prepare an object of `Proteome` are colored in greenish and yellowish, while two alternative ways to build an object of `dagPeptides` are colored in blue and red.](dagLogo_flowchart.png). Figure 1. Flowchart of performing analysis using _dagLogo_. Two ways to prepare an object of `Proteome` are colored in greenish and yellowish, while two alternative ways to build an object of `dagPeptides` are colored in blue and red. # Step-by-step guide on using _dagLogo_ ## First load the library _dagLogo_ ```{r} library(dagLogo) ``` ## Step 1: Fetching peptide sequences from BioMart ### Case 1: Fetch sequences using the `fetchSequence` function in _biomaRt_ package given a list of gene identifiers and the corresponding positions of the anchoring AA. ```{r fetchSequences1, results='hide'} ##just in case biomaRt server does not response if (interactive()) { try({ mart <- useMart("ensembl") fly_mart <- useDataset(mart = mart, dataset = "dmelanogaster_gene_ensembl") dat <- read.csv(system.file("extdata", "dagLogoTestData.csv", package = "dagLogo")) seq <- fetchSequence(IDs = as.character(dat$entrez_geneid), anchorPos = as.character(dat$NCBI_site), mart = fly_mart, upstreamOffset = 7, downstreamOffset = 7) head(seq@peptides) }) } ``` ### Case 2: Fetch sequences using the `fetchSequence` function in _biomaRt_ package given a list of gene identifiers and the corresponding peptide subsequences of interest with the anchoring AA marked such as asterisks or lower case of one or more AA letters. ```{r fetchSequences2, results='hide'} if (interactive()) { try({ mart <- useMart("ensembl") fly_mart <- useDataset(mart = mart, dataset = "dmelanogaster_gene_ensembl") dat <- read.csv(system.file("extdata", "dagLogoTestData.csv", package = "dagLogo")) seq <- fetchSequence(IDs = as.character(dat$entrez_geneid), anchorAA = "*", anchorPos = as.character(dat$peptide), mart = fly_mart, upstreamOffset = 7, downstreamOffset = 7) head(seq@peptides) }) } ``` In the following example, the anchoring AA is marked as lower case "s" for amino acid serine. ```{r fetchSequences3, results='hide'} if(interactive()){ try({ dat <- read.csv(system.file("extdata", "peptides4dagLogo.csv", package = "dagLogo")) ## cleanup the data dat <- unique(cleanPeptides(dat, anchors = "s")) mart <- useMart("ensembl") human_mart <- useDataset(mart = mart, dataset = "hsapiens_gene_ensembl") seq <- fetchSequence(IDs = toupper(as.character(dat$symbol)), type = "hgnc_symbol", anchorAA = "s", anchorPos = as.character(dat$peptides), mart = human_mart, upstreamOffset = 7, downstreamOffset = 7) head(seq@peptides) }) } ``` The function `cleanPeptides` can be used to select a subset of data to analyze when input data contains multiple anchoring AAs, represented as lower case of AAs. ```{r fetchSequences4} if(interactive()){ dat <- read.csv(system.file("extdata", "peptides4dagLogo.csv", package="dagLogo")) dat <- unique(cleanPeptides(dat, anchors = c("s", "t"))) mart <- useMart("ensembl", "hsapiens_gene_ensembl") seq <- fetchSequence(toupper(as.character(dat$symbol)), type="hgnc_symbol", anchorAA=as.character(dat$anchor), anchorPos=as.character(dat$peptides), mart=mart, upstreamOffset=7, downstreamOffset=7) head(seq@peptides) } ``` Similarly, peptide sequences can be fetched from an object of Proteome. ### Case 3: Prepare an object of `dagPeptides` using `prepareProteome` and `formatSequence` functions sequentially given a list of unaligned/aligned peptide sequences. ```{r formatSequence, results='hide'} dat <- unlist(read.delim(system.file("extdata", "grB.txt", package = "dagLogo"), header = F, as.is = TRUE)) ##prepare proteome from a fasta file proteome <- prepareProteome(fasta = system.file("extdata", "HUMAN.fasta", package = "dagLogo"), species = "Homo sapiens") ##prepare an object of dagPeptides seq <- formatSequence(seq = dat, proteome = proteome, upstreamOffset = 14, downstreamOffset = 15) ``` ## Step 2: Building background models Once you have an object of `dagPeptides` in hand, you can start to build a background model for statistical significance test. The background could be a set of random subsequences of a whole proteome or your inputs. To build a background model from a whole proteome, an object of Proteome is required. Sequences provided by a fasta file or downloaded from the UniProt database can be used to prepare a `Proteome` object. Case 3 in step 1 shows how to prepare a `Proteome` object from a fasta file. The following code snippet shows how to prepare an object of `Proteome` using the _UniProt.ws_ package. ```{r prepareProteome0} if(interactive()){ proteome <- prepareProteome("UniProt", species = "Homo sapiens") } ``` The prepared `Proteome` object can be used as a background model for the following statistical significance test using Fisher’s exact test or *Z*-test. ```{r buildBackgroundModel} bg_fisher <- buildBackgroundModel(seq, background = "wholeProteome", proteome = proteome, testType = "fisher") bg_ztest <- buildBackgroundModel(seq, background = "wholeProteome", proteome = proteome, testType = "ztest") ``` ## Step 3: Statistical significance test for differential usage of amino acids with or without grouping Statistical significance test can be performed at the AA level without making any change to the formatted and aligned amino acids. Alternatively, statistical significance test can be performed at the AA group level, where amino acids are grouped based on their physical or chemical properties. To group the AAs, the formatted and aligned AA symbols are replaced by a new set of symbols representing their corresponding groups. For example, if AA charge is of your interest, then group symbols "P", "N" and "U" are used to replace amino acids with positive charge, negative charge and no charge respectively. A few pre-built grouping schemes have been made available for you to specify as follows. ```{r testDAU} ## no grouping t0 <- testDAU(seq, dagBackground = bg_ztest) ## grouping based on chemical properties of AAs. t1 <- testDAU(dagPeptides = seq, dagBackground = bg_ztest, groupingScheme = "chemistry_property_Mahler_group") ## grouping based on the charge of AAs. t2 <- testDAU(dagPeptides = seq, dagBackground = bg_ztest, groupingScheme = "charge_group") ## grouping based on the consensus similarity. t3 <- testDAU(dagPeptides = seq, dagBackground = bg_ztest, groupingScheme = "consensus_similarity_SF_group") ## grouping based on the hydrophobicity. t4 <- testDAU(dagPeptides = seq, dagBackground = bg_ztest, groupingScheme = "hydrophobicity_KD") ## In addition, dagLogo allows users to use their own grouping ## Scheme. The following example shows how to supply a customized ## scheme. Please note that the user-supplied grouping is named ## as "custom_group" internally. ## Add a grouping scheme based on the level 3 of BLOSUM50 color = c(LVIMC = "#33FF00", AGSTP = "#CCFF00", FYW = '#00FF66', EDNQKRH = "#FF0066") symbol = c(LVIMC = "L", AGSTP = "A", FYW = "F", EDNQKRH = "E") group = list( LVIMC = c("L", "V", "I", "M", "C"), AGSTP = c("A", "G", "S", "T", "P"), FYW = c("F", "Y", "W"), EDNQKRH = c("E", "D", "N", "Q", "K", "R", "H")) addScheme(color = color, symbol = symbol, group = group) t5 <- testDAU(dagPeptides = seq, dagBackground = bg_ztest, groupingScheme = "custom_group") ``` ## Step 4: Visualize significance test results We can use a heatmap or logo to display the statistical significance test results. ```{r dagHeatmap,fig.cap="DAG heatmap",fig.width=8,fig.height=6} ##Plot a heatmap to show the results dagHeatmap(t0) ``` ```{r dagLogo0,fig.cap="ungrouped results",fig.width=8,fig.height=6} ## dagLogo showing differentially used AAs without grouping dagLogo(t0) ``` ```{r dagLogo1,fig.cap="classic grouping",fig.width=8,fig.height=6} ## dagLogo showing AA grouped based on properties of individual a amino acid. dagLogo(t1, groupingSymbol = getGroupingSymbol(t1@group), legend = TRUE) ``` ```{r dagLogo2,fig.cap="grouped on charge",fig.width=8,fig.height=6} ## grouped on the basis of charge. dagLogo(t2, groupingSymbol = getGroupingSymbol(t2@group), legend = TRUE) ``` ```{r dagLogo3,fig.cap="grouped on chemical property",fig.width=8,fig.height=6} ## grouped on the basis of consensus similarity. dagLogo(t3, groupingSymbol = getGroupingSymbol(t3@group), legend = TRUE) ``` ```{r dagLogo4,fig.cap="grouped on hydrophobicity",fig.width=8,fig.height=6} ## grouped on the basis of hydrophobicity. dagLogo(t4, groupingSymbol = getGroupingSymbol(t4@group), legend = TRUE) ``` # Session Info ```{r sessionInfo, echo=TRUE} sessionInfo() ```