--- title: "1. Usage of YAPSA" author: "Daniel Huebschmann" date: "26/08/2015" vignette: > %\VignetteIndexEntry{1. Usage of YAPSA} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} output: BiocStyle::html_document: toc: yes references: - author: - family: Alexandrov given: LB - family: Nik-Zainal given: S - family: Wedge given: DC - family: Aparicio given: SA - family: Behjati given: S - family: Biankin given: AV - family: Bignell given: GR - family: Bolli given: N - family: Borg given: A - family: Borresen-Dale given: AL - family: Boyault given: S - family: Burkhardt given: B - family: Butler given: AP - family: Caldas given: C - family: Davies given: HR - family: Desmedt given: C - family: Eils given: R - family: Eyfjörd given: JE - family: Greaves given: M - family: Hosoda given: F - family: Hutter given: B - family: Ilicic given: T - family: Imbeaud given: S - family: Imielinski given: M - family: Jäger given: N - family: Jones given: DT - family: Jones given: D - family: Knappskog given: S - family: Kool given: M - family: Lakhani given: SR - family: Lopez-Otin given: C - family: Martin given: S - family: Munshi given: NC - family: Nakamura given: H - family: Northcott given: PA - family: Pajic given: M - family: Papaemmanuil given: E - family: Paradiso given: A - family: Pearson given: JV - family: Puente given: XS - family: Raine given: K - family: Ramakrishna given: M - family: Richardson given: AL - family: Richter given: J - family: Rosenstiel given: P - family: Schlesner given: M - family: Schumacher given: TN - family: Span given: PN - family: Teague given: JW - family: Tokoti given: Y - family: Tutt given: AN - family: Valdes-Mas given: R - family: van Buuren given: MM - family: van't Veer given: L - family: Vincent-Salomon given: A - family: Waddell given: N - family: Yates given: LR - family: Australian Pancreatic Cancer Initiative given: - family: ICGC Breast Cancer Consortium given: - family: ICGC MMML-Seq Consortium given: - family: ICGC PedBrain given: - family: Zucman-Rossi given: J - family: Futreal given: PA - family: McDermott given: U - family: Lichter given: P - family: Meyerson given: M - family: Grimmond given: SM - family: Siebert given: R - family: Campo given: E - family: Shibata given: T - family: Pfister given: SM - family: Campbell given: PJ - family: Stratton given: MR container-title: Nature id: Alex2013 issued: month: 8 volume: 500 year: 2013 publisher: Nature Publishing Group title: 'Signatures of Mutational Processes in Cancer' - author: - family: Alexandrov given: LB id: Alex_package2012 issued: year: 2012 title: 'WTSI Mutational Signature Framework' - author: - family: Alexandrov given: LB - family: Nik-Zainal given: S - family: Wedge given: DC - family: Campbell given: PJ - family: Stratton given: MR container-title: Cell Reports id: Alex_CellRep2013 issued: year: 2013 title: > Deciphering signatures of mutational processes operative in human cancer - author: - family: Gehring given: Julian - family: Fischer given: Bernd - family: Lawrence given: Michael - family: Huber given: Wolfgang container-title: Bioinformatics id: Gehring_article2015 issued: year: 2015 publisher: Oxford Journals title: > SomaticSignatures: inferring mutational signatures from single-nucleotide variants - author: - family: Rosenthal given: Rachel - family: McGranahan given: Nicholas - family: Herrero given: Javier - family: Taylor given: Barry S. - family: Swanton given: Charles container-title: Genome Biology id: Rosenthal_2016 issued: year: 2016 publisher: BioMed Central title: > DeconstructSigs: delineating mutational processes in single tumors distinguishes DNA repair deficiencies and patterns of carcinoma evolution. - author: - family: Wang given: Yong - family: Lawson given: Charles L. - family: Hanson given: Richard J container-title: R-package version 1.1-1 id: lsei_package2015 issued: year: 2015 title: > lsei: Solving Least Squares Problems under Equality/Inequality Constraints - title: 'Genome Level Trellis Layout' author: - family: Gu given: Zuguang container-title: R package version 1.0.0 id: gtrellis2015 issued: year: 2015 - author: - family: Gu given: Z container-title: R package version 1.4.4 id: ComplexHeatmap2015 issued: year: 2015 title: 'ComplexHeatmap: Making Complex Heatmaps' - author: - family: Raue given: Andreas - family: Kreutz given: C. - family: Maiwald given: T. - family: Bachmann given: J. - family: Schilling given: M. - family: Klingmüller given: U. - family: Timmer given: J. container-title: Bioinformatics id: Raue_Bioinformatics2009 issued: year: 2009 title: > 'Structural and practical identifiability analysis of partially observed dynamical models by exploiting the profile likelihood.' - author: - family: Alexandrov given: LB - family: Kim given: J - family: Haradhvala given: NJ - family: Huang given: MN - family: Ng given: AW - family: Boot given: A - family: Covington given: KR - family: Gordenin given: DA - family: Bergstrom given: E - family: Lopez-Bigas given: N - family: Klimczak given: LJ - family: McPherson given: JR - family: Morganella given: S - family: Sabarinathan given: R - family: Wheeler given: DA - family: Mustonen given: V - family: Getz given: G - family: Rozen given: SG - family: Stratton given: MR container-title: bioarxiv id: Alex2018 issued: month: 05 year: 2018 publisher: bioarxiv title: 'The Repertoire of Mutational Signatures in Human Cancer' --- ```{r load_style, warning=FALSE, message=FALSE, results="hide"} library(BiocStyle) ``` # Introduction {#introduction} The concept of mutational signatures was introduced in a series of papers by Ludmil Alexandrov et al. [@Alex2013] and [@Alex_CellRep2013]. A computational framework was published [@Alex_package2012] with the purpose to detect a limited number of mutational processes which then describe the whole set of SNVs (single nucleotide variants) in a cohort of cancer samples. The general approach [@Alex2013] is as follows: 1. The SNVs are categorized by their nucleotide exchange. In total there are $4 \times 3 = 12$ different nucleotide exchanges, but if summing over reverse complements only $12 / 2 = 6$ different categories are left. For every SNV detected, the motif context around the position of the SNV is extracted. This may be a trinucleotide context if taking one base upstream and one base downstream of the position of the SNV, but larger motifs may be taken as well (e.g. pentamers). Taking into account the motif context increases combinatorial complexity: in the case of the trinucleotide context, there are $4 \times 6 \times 4 = 96$ different variant categories. These categories are called **features** in the following text. The number of features will be called $n$. 2. A cohort consists of different samples with the number of samples denoted by $m$. For each sample we can count the occurences of each feature, yielding an $n$-dimensional vector ($n$ being the number of features) per sample. For a cohort, we thus get an $n \times m$ -dimensional matrix, called the **mutational catalogue** $V$. It can be understood as a summary indicating which sample has how many variants of which category, but omitting the information of the genomic coordinates of the variants. 3. The mutational catalogue $V$ is quite big and still carries a lot of complexity. For many analyses a reduction of complexity is desirable. One way to achieve such a complexity reduction is a matrix decomposition: we would like to find two smaller matrices $W$ and $H$ which, if multiplied, would span a high fraction of the complexity of the big matrix $V$ (the mutational catalogue). Remember that $V$ is an $n \times m$ -dimensional matrix, $n$ being the number of features and $m$ being the number of samples. $W$ in this setting is an $n \times l$ -dimensional matrix and $H$ is an $l \times m$ -dimensional matrix. According to the nomeclature defined in [@Alex2013], the columns of $W$ are called the **mutational signatures** and the columns of $H$ are called **exposures**. $l$ denotes the number of mutational signatures. Hence the signatures are $n$-dimensional vectors (with $n$ being the number of features), while the exposures are $l$-dimensional vectors ($l$ being the number of signatures). Note that as we are dealing with count data, we would like to have only positive entries in $W$ and $H$. A mathematical method which is able to do such a decomposition is the **NMF** (**non-negative matrix factorization**). It basically solves the problem as illustrated in the following figure (image taken from ): ![NMF](NMF.png) Note that the NMF itself solves the above problem for a given number of signatures $l$. In order to achieve a reduction in complexity, the number of signatures has to be smaller than the number of features ($l < $n), as indicated in the above figure. The framework of Ludmil Alexandrov et al. [@Alex2013] performs not only the NMF decomposition itself, but also identifies the appropriate number of signatures by an iterative procedure. Another software, the Bioconductor package `r Biocpkg("SomaticSignatures")` to perform analyses of mutational signatures, is available [@Gehring_article2015]. It allows the matrix decomposition to be performed by NMF and alternatively by PCA (principal component analysis). Both methods have in common that they can be used for **discovery**, i.e. for the **extraction of new signatures**. However, they only work well if the analyzed data set has sufficient statistical power, i.e. a sufficient number of samples and sufficient numbers of counts per feature per sample. The package YAPSA introduced here is complementary to these existing software packages. It is designed for a supervised analysis of mutational signatures, i.e. an analysis with already known signatures $W$, and with much lower requirements on statistical power of the input data. # The YAPSA package {#YAPSA_package} In a context where mutational signatures $W$ are already known (because they were decribed and published as in [@Alex2013] or they are available in a database as under ), we might want to just find the exposures $H$ for these known signatures in the mutational catalogue $V$ of a given cohort. Mathematically, this is a different and potentially simpler task. The **YAPSA**-package (Yet Another Package for Signature Analysis) presented here provides the function `LCD` (**l**inear **c**ombination **d**ecomposition) to perform this task. The advantage of this method is that there are **no constraints on the cohort size**, so `LCD` can be run for as little as one sample and thus be used e.g. for signature analysis in personalized oncology. In contrast to NMF, `LCD` is very fast and requires very little computational resources. YAPSA has some other unique functionalities, which are briefly mentioned below and described in detail in separate vignettes. ## Linear Combination Decomposition (LCD) {#LCD} In the following, we will denote the columns of $V$ by $V_{(\cdot j)}$, which corresponds to the mutational catalogue of sample $j$. Analogously we denote the columns of $H$ by $H_{(\cdot j)}$, which is the exposure vector of sample $j$. Then `LCD` is designed to solve the optimization problem: (@LCD_formula) $$ \begin{aligned} \min_{H_{(\cdot j)} \in \mathbb{R}^l}||W \cdot H_{(\cdot j)} - V_{(\cdot j)}|| \quad \forall j \in \{1...m\} \\ \textrm{under the constraint of non-negativity:} \quad H_{(ij)} >= 0 \quad \forall i \in \{1...l\} \quad \forall j \in \{1...m\} \end{aligned} $$ Remember that $j$ is the index over samples, $m$ is the number of samples, $i$ is the index over signatures and $l$ is the number of signatures. `LCD` uses a non-negative least squares (NNLS) algorithm (from the R package `r CRANpkg("nnls")` ) to solve this optimization problem. Note that the optimization procedure is carried out for every $V_{(\cdot j)}$, i.e. for every column of $V$ separately. Of course $W$ is constant, i.e. the same for every $V_{(\cdot j)}$. This procedure is highly sensitive: as soon as a signature has a contribution or an exposure in at least one sample of a cohort, it will be reported (within the floating point precision of the operating system). This might blur the picture and counteracts the initial purpose of complexity reduction. Therefore there is a function `LCD_complex_cutoff`. This function takes as a second argument a cutoff (a value between zero and one). In the analysis, it will keep only those signatures which have a cumulative (over the cohort) normalized exposure greater than this cutoff. In fact it runs the LCD-procedure twice: once to find initial exposures, summing over the cohort and excluding the ones with too low a contribution as described just above, and a second time doing the analysis only with the signatures left over. Beside the exposures $H$ corresponding to this reduced set of signatures, the function `LCD_complex_cutoff` also returns the reduced set of signatures itself. Another R package for the supervised analysis of mutational signatures is available: `r CRANpkg("deconstructSigs")` [@Rosenthal_2016]. One difference between `LCD_complex_cutoff` as described here in `YAPSA` and the corresponding function `whichSignatures` in `r CRANpkg("deconstructSigs")` is that `LCD_complex_cutoff` accepts different cutoffs and signature-specific cutoffs (accounting for potentially different detectability of different signatures), whereas in `whichSignatures` in `r CRANpkg("deconstructSigs")` a general fixed cutoff is set to be 0.06. In the following, we briefly mention other features of the software package YAPSA and refer to the corresponding vignettes for detailed descriptions. ## Signature-specific cutoffs One special characteristic of YAPSA is that it provides the opportunity to perform analyses of mutational signtures with signature-specific cutoffs. Different signatures have different detectability. Those with high detectability will occur as false positive calls more often. In order to account for the different detectability, we introduced the concept of signature-specific cutoffs: a signature which leads to many false positive calls has to cross a higher threshold than a signature which rarely leads to false positive calls. While this vignette introduces [how to work with signature-specific cutoffs in general](#sigSpecCutoffs), optimal signature-specific cutoffs are presented in [2. Signature-specific cutoffs](http://bioconductor.org/packages/3.11/bioc/vignettes/YAPSA/inst/doc/vignette_signature_specific_cutoffs.html). ## Confidence intervals for mutational signature exposures In order to evaluate the confidence of computed exposures to mutational signatures, YAPSA provides 95% confidence intervals (CIs). The computation relies on the concept of profile likelihood [@Raue_Bioinformatics2009]. Details can be found in [3. Confidence Intervals](http://bioconductor.org/packages/3.11/bioc/vignettes/YAPSA/inst/doc/vignette_confidenceIntervals.html). ## Stratification of the Mutational Catalogue (SMC) For some questions it is useful to assign the SNVs detected in the samples of a cohort to categories. We call an analysis of mutational signatures which takes into account these strata a *stratified* analysis, which has the potential to reveal enrichment and depletion patterns. Of note, this is different from performing completely separate and independent NNLS analyses of mutational signatures on the different strata. Instead, the results of the unstratified analysis are used as input for a constrained analysis for the strata. Details can be found in [4. Stratified Analysis of Mutational Signatures](http://bioconductor.org/packages/3.11/bioc/vignettes/YAPSA/inst/doc/vignette_stratifiedAnalysis.html) ## Indel signatures Recently a new and extended set of mutational signatures was published by the Pan Cancer Analysis of Whole Genomes (PCAWG) consortium [@Alex2018]. In addition to an extended set of SNV mutational signatures, that analysis for the first time had sufficient statistical power to also extract 17 Indel signatures, based on a classification of Indels into 83 categories or features. YAPSA also offers functionality to perform supervised analyses of mutational signatures on these Indel signatures, details can be found in [5. Indel signature analysis](http://bioconductor.org/packages/3.11/bioc/vignettes/YAPSA/inst/doc/vignettes_Indel.html) # Example: a cohort of B-cell lymphomas We will now apply some functions of the YAPSA package to Whole Genome Sequencing datasets published in Alexandrov et al. [-@Alex2013]. First we have to load this data and get an overview ([first subsection](#example-data)). Then we will load data on published signatures ([second subsection](#loading-the-signature-information)). Only in the [third subsection](#performing-an-lcd-analysis) we will actually start using the YAPSA functions. ```{r, warning=FALSE, message=FALSE} library(YAPSA) library(knitr) opts_chunk$set(echo=TRUE) opts_chunk$set(fig.show='asis') ``` ## Example data In the following, we will load and get an overview of the data used in the analysis by Alexandrov et al. [@Alex2013] ### Loading example data ```{r load_lymphoma} data("lymphoma_Nature2013_raw") ``` This creates a dataframe with `r dim(lymphoma_Nature2013_raw_df)[1]` rows. It is equivalent to executing the R code ```{r load_lymphoma_ftp, eval=FALSE} lymphoma_Nature2013_ftp_path <- paste0( "ftp://ftp.sanger.ac.uk/pub/cancer/AlexandrovEtAl/", "somatic_mutation_data/Lymphoma B-cell/", "Lymphoma B-cell_clean_somatic_mutations_", "for_signature_analysis.txt") lymphoma_Nature2013_raw_df <- read.csv(file=lymphoma_Nature2013_ftp_path, header=FALSE,sep="\t") ``` The format is inspired by the vcf format with one line per called variant. Note that the files provided at that URL have no header information, therefore we have to add some. We will also slightly adapt the data structure: ```{r adapt_data} names(lymphoma_Nature2013_raw_df) <- c("PID","TYPE","CHROM","START", "STOP","REF","ALT","FLAG") lymphoma_Nature2013_df <- subset(lymphoma_Nature2013_raw_df,TYPE=="subs", select=c(CHROM,START,REF,ALT,PID)) names(lymphoma_Nature2013_df)[2] <- "POS" kable(head(lymphoma_Nature2013_df)) ``` Here, we have selected only the variants characterized as `subs` (those are the SNVs we are interested in for the mutational signatures analysis, small indels are filtered out by this step), so we are left with `r dim(lymphoma_Nature2013_df)[1]` variants or rows. Note that there are `r length(unique(lymphoma_Nature2013_df$PID))` different samples: ```{r PID_info} unique(lymphoma_Nature2013_df$PID) ``` For convenience later on, we annotate subgroup information to every variant (indirectly through the sample it occurs in). For reasons of simplicity, we also restrict the analysis to the Whole Genome Sequencing (WGS) datasets: ```{r annotate_subgroup, warning=FALSE, message=FALSE} lymphoma_Nature2013_df$SUBGROUP <- "unknown" DLBCL_ind <- grep("^DLBCL.*",lymphoma_Nature2013_df$PID) lymphoma_Nature2013_df$SUBGROUP[DLBCL_ind] <- "DLBCL_other" MMML_ind <- grep("^41[0-9]+$",lymphoma_Nature2013_df$PID) lymphoma_Nature2013_df <- lymphoma_Nature2013_df[MMML_ind,] data(lymphoma_PID) for(my_PID in rownames(lymphoma_PID_df)) { PID_ind <- which(as.character(lymphoma_Nature2013_df$PID)==my_PID) lymphoma_Nature2013_df$SUBGROUP[PID_ind] <- lymphoma_PID_df$subgroup[which(rownames(lymphoma_PID_df)==my_PID)] } lymphoma_Nature2013_df$SUBGROUP <- factor(lymphoma_Nature2013_df$SUBGROUP) unique(lymphoma_Nature2013_df$SUBGROUP) ``` ### Displaying example data Rainfall plots provide a quick overview of the mutational load of a sample. To this end we have to compute the intermutational distances. But first we still do some reformatting... ```{r intermut_distances, warning=FALSE, message=FALSE, results="hide"} lymphoma_Nature2013_df <- translate_to_hg19(lymphoma_Nature2013_df,"CHROM") lymphoma_Nature2013_df$change <- attribute_nucleotide_exchanges(lymphoma_Nature2013_df) lymphoma_Nature2013_df <- lymphoma_Nature2013_df[order(lymphoma_Nature2013_df$PID, lymphoma_Nature2013_df$CHROM, lymphoma_Nature2013_df$POS),] lymphoma_Nature2013_df <- annotate_intermut_dist_cohort(lymphoma_Nature2013_df, in_PID.field="PID") data("exchange_colour_vector") lymphoma_Nature2013_df$col <- exchange_colour_vector[lymphoma_Nature2013_df$change] ``` Now we can select one sample and make the rainfall plot. The plot function used here relies on the package `r Biocpkg("gtrellis")` by Zuguang Gu [@gtrellis2015]. ```{r make_rainfall_plot, fig.cap="Rainfall plot in a trellis structure."} choice_PID <- "4121361" PID_df <- subset(lymphoma_Nature2013_df,PID==choice_PID) #trellis_rainfall_plot(PID_df,in_point_size=unit(0.5,"mm")) ``` This shows a rainfall plot typical for a lymphoma sample with clusters of increased mutation density e.g. at the immunoglobulin loci. \newpage ## Loading the signature information As stated [above](#LCD), one of the functions in the YAPSA package (`LCD`) is designed to do mutational signatures analysis with known signatures. There are (at least) two possible sources for signature data: i) the ones published initially by Alexandrov et al. [@Alex2013], and ii) an updated and curated current set of mutational signatures is maintained by Ludmil Alexandrov at . The following three subsections describe how you can load the data from these resources. Alternatively, you can bypass the three following subsections because the signature datasets are also included in this package: ```{r, load_stored_sig_data} data(sigs) ``` ### Loading the initial set of signatures We first load the (older) set of signatures as published in Alexandrov et al. [@Alex2013]: ```{r process_old_sigs} Alex_signatures_path <- paste0("ftp://ftp.sanger.ac.uk/pub/cancer/", "AlexandrovEtAl/signatures.txt") AlexInitialArtif_sig_df <- read.csv(Alex_signatures_path,header=TRUE,sep="\t") kable(AlexInitialArtif_sig_df[c(1:9),c(1:4)]) ``` We will now reformat the dataframe: ```{r reformat_old_sigs} Alex_rownames <- paste(AlexInitialArtif_sig_df[,1], AlexInitialArtif_sig_df[,2],sep=" ") select_ind <- grep("Signature",names(AlexInitialArtif_sig_df)) AlexInitialArtif_sig_df <- AlexInitialArtif_sig_df[,select_ind] number_of_Alex_sigs <- dim(AlexInitialArtif_sig_df)[2] names(AlexInitialArtif_sig_df) <- gsub("Signature\\.","A", names(AlexInitialArtif_sig_df)) rownames(AlexInitialArtif_sig_df) <- Alex_rownames kable(AlexInitialArtif_sig_df[c(1:9),c(1:6)], caption="Exemplary data from the initial Alexandrov signatures.") ``` This results in a dataframe for signatures, containing `r number_of_Alex_sigs` signatures as column vectors. It is worth noting that in the initial publication, only a subset of these `r number_of_Alex_sigs` signatures were validated by an orthogonal sequencing technology. So we can filter down: ```{r filter_validated_signatures} AlexInitialValid_sig_df <- AlexInitialArtif_sig_df[,grep("^A[0-9]+", names(AlexInitialArtif_sig_df))] number_of_Alex_validated_sigs <- dim(AlexInitialValid_sig_df)[2] ``` We are left with `r number_of_Alex_validated_sigs` signatures. ### Loading the updated set of mutational signatures An updated and curated set of mutational signatures is maintained by Ludmil Alexandrov at . We will use this set for the following analysis: ```{r load_sigs} Alex_COSMIC_signatures_path <- paste0("http://cancer.sanger.ac.uk/cancergenome/", "assets/signatures_probabilities.txt") AlexCosmicValid_sig_df <- read.csv(Alex_COSMIC_signatures_path, header=TRUE,sep="\t") Alex_COSMIC_rownames <- paste(AlexCosmicValid_sig_df[,1], AlexCosmicValid_sig_df[,2],sep=" ") COSMIC_select_ind <- grep("Signature",names(AlexCosmicValid_sig_df)) AlexCosmicValid_sig_df <- AlexCosmicValid_sig_df[,COSMIC_select_ind] number_of_Alex_COSMIC_sigs <- dim(AlexCosmicValid_sig_df)[2] names(AlexCosmicValid_sig_df) <- gsub("Signature\\.","AC", names(AlexCosmicValid_sig_df)) rownames(AlexCosmicValid_sig_df) <- Alex_COSMIC_rownames kable(AlexCosmicValid_sig_df[c(1:9),c(1:6)], caption="Exemplary data from the updated Alexandrov signatures.") ``` This results in a dataframe containing `r number_of_Alex_COSMIC_sigs` signatures as column vectors. For reasons of convenience and comparability with the initial signatures, we reorder the features. To this end, we adhere to the convention chosen in the initial publication by Alexandrov et al. [@Alex2013] for the initial signatures. ```{r reorder_features} COSMIC_order_ind <- match(Alex_rownames,Alex_COSMIC_rownames) AlexCosmicValid_sig_df <- AlexCosmicValid_sig_df[COSMIC_order_ind,] kable(AlexCosmicValid_sig_df[c(1:9),c(1:6)], caption=paste0("Exemplary data from the updated Alexandrov ", "signatures, rows reordered.")) ``` Note that the order of the features, i.e. nucleotide exchanges in their trinucleotide content, is changed from the fifth line on as indicated by the row names. ### Preparation for later analysis For every set of signatures, the functions in the YAPSA package require an additional dataframe containing meta information about the signatures. In that dataframe you can specify the order in which the signatures are going to be plotted and the colours asserted to the different signatures. In the following subsection we will set up such a dataframe. However, the respective dataframes are also stored in the package. If loaded by `data(sigs)` the following code block can be bypassed. ```{r, build_sig_ind_df} signature_colour_vector <- c("darkgreen","green","pink","goldenrod", "lightblue","blue","orangered","yellow", "orange","brown","purple","red", "darkblue","magenta","maroon","yellowgreen", "violet","lightgreen","sienna4","deeppink", "darkorchid","seagreen","grey10","grey30", "grey50","grey70","grey90") bio_process_vector <- c("spontaneous deamination","spontaneous deamination", "APOBEC","BRCA1_2","Smoking","unknown", "defect DNA MMR","UV light exposure","unknown", "IG hypermutation","POL E mutations","temozolomide", "unknown","APOBEC","unknown","unknown","unknown", "unknown","unknown","unknown","unknown","unknown", "nonvalidated","nonvalidated","nonvalidated", "nonvalidated","nonvalidated") AlexInitialArtif_sigInd_df <- data.frame(sig=colnames(AlexInitialArtif_sig_df)) AlexInitialArtif_sigInd_df$index <- seq_len(dim(AlexInitialArtif_sigInd_df)[1]) AlexInitialArtif_sigInd_df$colour <- signature_colour_vector AlexInitialArtif_sigInd_df$process <- bio_process_vector COSMIC_signature_colour_vector <- c("green","pink","goldenrod", "lightblue","blue","orangered","yellow", "orange","brown","purple","red", "darkblue","magenta","maroon", "yellowgreen","violet","lightgreen", "sienna4","deeppink","darkorchid", "seagreen","grey","darkgrey", "black","yellow4","coral2","chocolate2", "navyblue","plum","springgreen") COSMIC_bio_process_vector <- c("spontaneous deamination","APOBEC", "defect DNA DSB repair hom. recomb.", "tobacco mutatgens, benzo(a)pyrene", "unknown", "defect DNA MMR, found in MSI tumors", "UV light exposure","unknown","POL eta and SHM", "altered POL E", "alkylating agents, temozolomide", "unknown","APOBEC","unknown", "defect DNA MMR","unknown","unknown", "unknown","unknown", "associated w. small indels at repeats", "unknown","aristocholic acid","unknown", "aflatoxin","unknown","defect DNA MMR", "unknown","unknown","tobacco chewing","unknown") AlexCosmicValid_sigInd_df <- data.frame(sig=colnames(AlexCosmicValid_sig_df)) AlexCosmicValid_sigInd_df$index <- seq_len(dim(AlexCosmicValid_sigInd_df)[1]) AlexCosmicValid_sigInd_df$colour <- COSMIC_signature_colour_vector AlexCosmicValid_sigInd_df$process <- COSMIC_bio_process_vector ``` YAPSA can also perform analyses based on other sets of mutational signatures. Details can be found in additional vignettes on [signature-specific cutoffs](http://bioconductor.org/packages/3.11/bioc/vignettes/YAPSA/inst/doc/vignette_signature_specific_cutoffs.html) and [Indel signatures](http://bioconductor.org/packages/3.11/bioc/vignettes/YAPSA/inst/doc/vignettes_Indel.html). ## Performing an LCD analysis Now we can start using the functions from the YAPSA package. We will start with a mutational signatures analysis using known signatures (the ones we loaded in the above paragraph). For this, we will use the functions `LCD` and `LCD_complex_cutoff`. ### Building a mutational catalogue This section uses functions which are to a large extent wrappers for functions in the package SomaticSignatures by Julian Gehring [@Gehring_article2015]. ```{r load_libraries, warning=FALSE, message=FALSE} library(BSgenome.Hsapiens.UCSC.hg19) ``` ```{r build_mutational_catalogue, results="hide"} word_length <- 3 lymphomaNature2013_mutCat_list <- create_mutation_catalogue_from_df( lymphoma_Nature2013_df, this_seqnames.field = "CHROM", this_start.field = "POS", this_end.field = "POS", this_PID.field = "PID", this_subgroup.field = "SUBGROUP", this_refGenome = BSgenome.Hsapiens.UCSC.hg19, this_wordLength = word_length) ``` The function `create_mutation_catalogue_from_df` returns a list object with several entries. We will use the one called `matrix`. ```{r display_structure_mutation_catalogue_list} names(lymphomaNature2013_mutCat_list) lymphomaNature2013_mutCat_df <- as.data.frame( lymphomaNature2013_mutCat_list$matrix) kable(lymphomaNature2013_mutCat_df[c(1:9),c(5:10)]) ``` \newpage ### LCD analysis without any cutoff The `LCD` function performs the decomposition of a mutational catalogue into a priori known signatures and the respective exposures to these signatures as described in the second section of this vignette. We use the signatures from [@Alex2013] from the COSMIC website (). ```{r LCD, warning=FALSE} current_sig_df <- AlexCosmicValid_sig_df current_sigInd_df <- AlexCosmicValid_sigInd_df lymphomaNature2013_COSMICExposures_df <- LCD(lymphomaNature2013_mutCat_df,current_sig_df) ``` Some adaptation (extracting and reformatting the information which sample belongs to which subgroup): ```{r make_subgroups_df_LCD} COSMIC_subgroups_df <- make_subgroups_df(lymphoma_Nature2013_df, lymphomaNature2013_COSMICExposures_df) ``` The resulting signature exposures can be plotted using custom plotting functions. First as absolute exposures: ```{r caption_barplot, echo=FALSE} cap <- "Absoute exposures of the COSMIC signatures in the lymphoma mutational catalogues, no cutoff for the LCD (Linear Combination Decomposition)" ``` ```{r exposures_barplot_LCD, fig.width=6, fig.height=5, fig.cap=cap} exposures_barplot( in_exposures_df = lymphomaNature2013_COSMICExposures_df, in_subgroups_df = COSMIC_subgroups_df) ``` Here, as no colour information was given to the plotting function `exposures_barplot`, the identified signatures are coloured in a rainbow palette. If you want to assign colours to the signatures, this is possible via a data structure of type `sigInd_df`. ```{r caption_barplot_2, echo=FALSE} cap <- "Absoute exposures of the COSMIC signatures in the lymphoma mutational catalogues, no cutoff for the LCD (Linear Combination Decomposition)" ``` ```{r exposures_barplot_LCD_sig_ind, fig.width=6, fig.height=5, fig.cap=cap} exposures_barplot( in_exposures_df = lymphomaNature2013_COSMICExposures_df, in_signatures_ind_df = current_sigInd_df, in_subgroups_df = COSMIC_subgroups_df) ``` This figure has a colour coding which suits our needs, but there is one slight inconsistency: colour codes are assigned to all `r dim(current_sig_df)[2]` provided signatures, even though some of them might not have any contributions in this cohort: ```{r rowSums_in_cohort} rowSums(lymphomaNature2013_COSMICExposures_df) ``` This can be overcome by using `LCD_complex_cutoff`. It requires an additional parameter: `in_cutoff_vector`; this is already the more general framework which will be explained in more detail in the following section. ```{r LCD_complex_cutoff_cutoffZero} zero_cutoff_vector <- rep(0,dim(current_sig_df)[2]) CosmicValid_cutoffZero_LCDlist <- LCD_complex_cutoff( in_mutation_catalogue_df = lymphomaNature2013_mutCat_df, in_signatures_df = current_sig_df, in_cutoff_vector = zero_cutoff_vector, in_sig_ind_df = current_sigInd_df) ``` We can re-create the subgroup information (even though this is identical to the already determined one): ```{r apply_make_subgroups_df_cutoffZero} COSMIC_subgroups_df <- make_subgroups_df(lymphoma_Nature2013_df, CosmicValid_cutoffZero_LCDlist$exposures) ``` And if we plot this, we obtain: ```{r caption_barplot_3, echo=FALSE} cap="Absoute exposures of the COSMIC signatures in the lymphoma mutational catalogues, no cutoff for the LCD (Linear Combination Decomposition)." ``` ```{r exposures_barplot_abs_cutoffZero, fig.width=6, fig.height=5, fig.cap=cap} exposures_barplot( in_exposures_df = CosmicValid_cutoffZero_LCDlist$exposures, in_signatures_ind_df = CosmicValid_cutoffZero_LCDlist$out_sig_ind_df, in_subgroups_df = COSMIC_subgroups_df) ``` Note that this time, only the `r dim(CosmicValid_cutoffZero_LCDlist$exposures)[1]` signatures which actually have a contribution to this cohort are displayed in the legend. Of course, also relative exposures may be plotted: ```{r caption_barplot_4, echo=FALSE} cap="Relative exposures of the COSMIC signatures in the lymphoma mutational catalogues, no cutoff for the LCD (Linear Combination Decomposition)." ``` ```{r exposures_barplot_rel_cutoffZero, fig.width=6, fig.height=5, fig.cap=cap} exposures_barplot( in_exposures_df = CosmicValid_cutoffZero_LCDlist$norm_exposures, in_signatures_ind_df = CosmicValid_cutoffZero_LCDlist$out_sig_ind_df, in_subgroups_df = COSMIC_subgroups_df) ``` ### LCD analysis with a generalized cutoff Now let's rerun the analysis with a cutoff to discard signatures with insufficient cohort-wide contribution. ```{r definition_of_cutoff} my_cutoff <- 0.06 ``` The cutoff of `r my_cutoff` means that a signature is kept if it's exposure represents at least `r my_cutoff*100`% of all SNVs in the cohort. We will use the function `LCD_complex_cutoff` instead of `LCD`. ```{r LCD_complex_cutoff_cutoffGen, warning=FALSE} general_cutoff_vector <- rep(my_cutoff,dim(current_sig_df)[2]) CosmicValid_cutoffGen_LCDlist <- LCD_complex_cutoff( in_mutation_catalogue_df = lymphomaNature2013_mutCat_df, in_signatures_df = current_sig_df, in_cutoff_vector = general_cutoff_vector, in_sig_ind_df = current_sigInd_df) ``` At the chosen cutoff of `r my_cutoff`, we are left with `r dim(CosmicValid_cutoffGen_LCDlist$out_sig_ind_df)[1]` signatures. We can look at these signatures in detail and their attributed biological processes: ```{r} kable(CosmicValid_cutoffGen_LCDlist$out_sig_ind_df, row.names=FALSE, caption=paste0("Signatures with cohort-wide exposures > ",my_cutoff)) ``` Again we can plot absolute exposures: ```{r caption_barplot_5, echo=FALSE} cap="Absoute exposures of the COSMIC signatures in the lymphoma mutational catalogues, cutoff of 6% for the LCD (Linear Combination Decomposition)." ``` ```{r exposures_barplot_abs_cutoffGen, fig.width=6, fig.height=4, fig.cap=cap} exposures_barplot( in_exposures_df = CosmicValid_cutoffGen_LCDlist$exposures, in_signatures_ind_df = CosmicValid_cutoffGen_LCDlist$out_sig_ind_df, in_subgroups_df = COSMIC_subgroups_df) ``` And relative exposures: ```{r caption_barplot_6, echo=FALSE} cap="Relative exposures of the COSMIC signatures in the lymphoma mutational catalogues, cutoff of 6% for the LCD (Linear Combination Decomposition)." ``` ```{r exposures_barplot_rel_cutoffGen, fig.width=6, fig.height=4, fig.cap=cap} exposures_barplot( in_exposures_df = CosmicValid_cutoffGen_LCDlist$norm_exposures, in_signatures_ind_df = CosmicValid_cutoffGen_LCDlist$out_sig_ind_df, in_subgroups_df = COSMIC_subgroups_df) ``` ### LCD analysis with signature-specific cutoffs {#sigSpecCutoffs} When using `LCD_complex_cutoff`, we have to supply a vector of cutoffs with as many entries as there are signatures. In the analysis carried out above, these were all equal, but this is not a necessary condition. Indeed it may make sense to provide different cutoffs for different signatures. ```{r make_specific_cutoff_vector} specific_cutoff_vector <- general_cutoff_vector specific_cutoff_vector[c(1,5)] <- 0 specific_cutoff_vector ``` In this example, the cutoff for signatures AC1 and AC5 is thus set to 0, whereas the cutoffs for all other signatures remains at 0.06. Running the function `LCD_complex_cutoff` is completely analogous: ```{r LCD_complex_cutoff_cutoffSpec, warning=FALSE} CosmicValid_cutoffSpec_LCDlist <- LCD_complex_cutoff( in_mutation_catalogue_df = lymphomaNature2013_mutCat_df, in_signatures_df = current_sig_df, in_cutoff_vector = specific_cutoff_vector, in_sig_ind_df = current_sigInd_df) ``` Plotting absolute exposures for visualization: ```{r caption_barplot_7, echo=FALSE} cap="Absoute exposures of the COSMIC signatures in the lymphoma mutational catalogues, cutoff of 6% for the LCD (Linear Combination Decomposition)." ``` ```{r exposures_barplot_abs_cutoffSpec, fig.width=6, fig.height=4, fig.cap=cap} exposures_barplot( in_exposures_df = CosmicValid_cutoffSpec_LCDlist$exposures, in_signatures_ind_df = CosmicValid_cutoffSpec_LCDlist$out_sig_ind_df, in_subgroups_df = COSMIC_subgroups_df) ``` And relative exposures: ```{r caption_barplot_8, echo=FALSE} cap="Relative exposures of the COSMIC signatures in the lymphoma mutational catalogues, cutoff of 6% for the LCD (Linear Combination Decomposition)." ``` ```{r exposures_barplot_rel_cutoffSpec, fig.width=6, fig.height=4, fig.cap=cap} exposures_barplot( in_exposures_df = CosmicValid_cutoffSpec_LCDlist$norm_exposures, in_signatures_ind_df = CosmicValid_cutoffSpec_LCDlist$out_sig_ind_df, in_subgroups_df = COSMIC_subgroups_df) ``` Note that the signatures extracted with the signature-specific cutoffs are the same in the example displayed here. Depending on the analyzed cohort and the choice of cutoffs, the extracted signatures may vary considerably. A unique feature of YAPSA is that it also provides optimal signature-specific cutoffs, a topic explained in a separate [vignette](http://bioconductor.org/packages/3.11/bioc/vignettes/YAPSA/inst/doc/vignette_signature_specific_cutoffs.html). ## Cluster samples based on their signature exposures To identify groups of samples which were exposed to similar mutational processes, the exposure vectors of the samples can be compared. The YAPSA package provides a custom function for this task: `complex_heatmap_exposures`, which uses the package `r Biocpkg("ComplexHeatmap")` by Zuguang Gu [@ComplexHeatmap2015]. It produces output as follows: ```{r caption_heatmap, echo=FALSE} cap="Clustering of Samples and Signatures based on the relative exposures of the COSMIC signatures in the lymphoma mutational catalogues." ``` ```{r apply_heatmap_exposures, fig.width=6, fig.height=4, fig.cap=cap} complex_heatmap_exposures(CosmicValid_cutoffGen_LCDlist$norm_exposures, COSMIC_subgroups_df, CosmicValid_cutoffGen_LCDlist$out_sig_ind_df, in_data_type="norm exposures", in_subgroup_colour_column="col", in_method="manhattan", in_subgroup_column="subgroup") ``` If you are interested only in the clustering and not in the heatmap information, you could also use `hclust_exposures`: ```{r caption_hclust, echo=FALSE} cap="Clustering of the Samples based on the relative exposures of the COSMIC signatures in the lymphoma mutational catalogues." ``` ```{r apply_hclust_exposures, fig.width=6, fig.height=3, fig.cap=cap} hclust_list <- hclust_exposures(CosmicValid_cutoffGen_LCDlist$norm_exposures, COSMIC_subgroups_df, in_method="manhattan", in_subgroup_column="subgroup") ``` The dendrogram produced by either the function `complex_heatmap_exposures` or the function `hclust_exposures` can be cut to yield signature exposures specific to subgroups of PIDs. ```{r caption_heatmap_2, echo=FALSE} cap=paste0("PIDs labelled by the clusters extracted from the signature analysis.") ``` ```{r cluster_PIDs_sig_info, fig.width=6, fig.height=4, fig.cap=cap} cluster_vector <- cutree(hclust_list$hclust,k=4) COSMIC_subgroups_df$cluster <- cluster_vector subgroup_colour_vector <- rainbow(length(unique(COSMIC_subgroups_df$cluster))) COSMIC_subgroups_df$cluster_col <- subgroup_colour_vector[factor(COSMIC_subgroups_df$cluster)] complex_heatmap_exposures(CosmicValid_cutoffGen_LCDlist$norm_exposures, COSMIC_subgroups_df, CosmicValid_cutoffGen_LCDlist$out_sig_ind_df, in_data_type="norm exposures", in_subgroup_colour_column="cluster_col", in_method="manhattan", in_subgroup_column="cluster") ``` # References