--- title: "2. Signature-specific cutoffs" author: "Lea Jopp-Saile and Daniel Hübschmann" date: "27/11/2019" vignette: > %\VignetteIndexEntry{2. Signature-specific cutoffs} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} output: BiocStyle::html_document: toc: yes references: - author: - family: Alexandrov given: LB - family: Nik-Zainal given: S - family: Wedge given: DC - family: Aparicio given: SA - family: Behjati given: S - family: Biankin given: AV - family: Bignell given: GR - family: Bolli given: N - family: Borg given: A - family: Borresen-Dale given: AL - family: Boyault given: S - family: Burkhardt given: B - family: Butler given: AP - family: Caldas given: C - family: Davies given: HR - family: Desmedt given: C - family: Eils given: R - family: Eyfjörd given: JE - family: Greaves given: M - family: Hosoda given: F - family: Hutter given: B - family: Ilicic given: T - family: Imbeaud given: S - family: Imielinski given: M - family: Jäger given: N - family: Jones given: DT - family: Jones given: D - family: Knappskog given: S - family: Kool given: M - family: Lakhani given: SR - family: Lopez-Otin given: C - family: Martin given: S - family: Munshi given: NC - family: Nakamura given: H - family: Northcott given: PA - family: Pajic given: M - family: Papaemmanuil given: E - family: Paradiso given: A - family: Pearson given: JV - family: Puente given: XS - family: Raine given: K - family: Ramakrishna given: M - family: Richardson given: AL - family: Richter given: J - family: Rosenstiel given: P - family: Schlesner given: M - family: Schumacher given: TN - family: Span given: PN - family: Teague given: JW - family: Tokoti given: Y - family: Tutt given: AN - family: Valdes-Mas given: R - family: van Buuren given: MM - family: van't Veer given: L - family: Vincent-Salomon given: A - family: Waddell given: N - family: Yates given: LR - family: Australian Pancreatic Cancer Initiative given: - family: ICGC Breast Cancer Consortium given: - family: ICGC MMML-Seq Consortium given: - family: ICGC PedBrain given: - family: Zucman-Rossi given: J - family: Futreal given: PA - family: McDermott given: U - family: Lichter given: P - family: Meyerson given: M - family: Grimmond given: SM - family: Siebert given: R - family: Campo given: E - family: Shibata given: T - family: Pfister given: SM - family: Campbell given: PJ - family: Stratton given: MR container-title: Nature id: Alex2013 issued: month: 08 volume: 500 year: 2013 publisher: Nature Publishing Group title: 'Signatures of Mutational Processes in Cancer' - author: - family: Alexandrov given: LB - family: Kim given: J - family: Haradhvala given: NJ - family: Huang given: MN - family: Ng given: AW - family: Boot given: A - family: Covington given: KR - family: Gordenin given: DA - family: Bergstrom given: E - family: Lopez-Bigas given: N - family: Klimczak given: LJ - family: McPherson given: JR - family: Morganella given: S - family: Sabarinathan given: R - family: Wheeler given: DA - family: Mustonen given: V - family: Getz given: G - family: Rozen given: SG - family: Stratton given: MR container-title: bioarxiv id: Alex2018 issued: month: 05 year: 2018 publisher: bioarxiv title: 'The Repertoire of Mutational Signatures in Human Cancer' - author: - family: Alexandrov given: LB - family: Nik-Zainal given: S - family: Wedge given: DC - family: Campbell given: PJ - family: Stratton given: MR container-title: Cell Reports id: Alex_CellRep2013 issued: year: 2013 title: > Deciphering signatures of mutational processes operative in human cancer - author: - family: Wang given: Yong - family: Lawson given: Charles L. - family: Hanson given: Richard J container-title: R-package version 1.1-1 id: lsei_package2015 issued: year: 2015 title: > lsei: Solving Least Squares Problems under Equality/Inequality Constraints - author: - family: Sing given: Tobias - family: Sander given: Oliver - family: Beerenwinkel given: Niko - family: Lengauer given: Thomas container-title: Bioinfomatics id: ROCR_package2005 issued: year: 2005 title: > ROCR: visualizing classifier performance in R --- ```{r load_style, warning=FALSE, echo=FALSE, message=FALSE, results="hide"} library(BiocStyle) ``` ```{r packages, include=FALSE} library(YAPSA) library(knitr) library(gridExtra) library(dplyr) opts_chunk$set(echo=TRUE) opts_chunk$set(fig.show='asis') ``` # Introduction {#introduction} The vignette [1. Usage of YAPSA](http://bioconductor.org/packages/3.11/bioc/vignettes/YAPSA/inst/doc/YAPSA.html) describes the general usage of YAPSA as a tool to perform supervised analyses of mutational signatures. Calling of mutational signature in YAPSA is performed with a Non-Negative Least Square (NNLS) algorithm implemented in the function **l**inear **c**ombination **d**ecomposition (`LCD`) function based on the function `lsei` from the package `r CRANpkg("lsei")` [@lsei_package2015]. The function `LCD` can take signature-specific cutoffs as optional arguments. When choosing optimal cutoffs, this can lead to an increase in sensitivity and specificity as compared to NNLS alone. In such a context, `LCD` then determines the exposure, i.e., the contribution of signature to the mutational load of a samples, in a three-step computational process: 1. Find an NNLS solution with all supplied signatures using LCD 1. Identify those signatures which have a contribution higher than a signature-specific cutoff 1. Rerun NNLS based on the selected subset of signatures Mutational signatures as provided by http://cancer.sanger.ac.uk/cosmic/signatures have been extracted by Non-negative Matrix Factorization (NMF). Due to the constraint of non-negativity, these signatures cannot be orthogonal. However, orthogonality is a prerequisite or at least beneficial for least squares algorithms. This lack of orthogonality leads to potential overlapping of the signature in a deconvolution. Different signatures have different detectability. Those with high detectability will occur as false positive calls more often. In order to account for the different detectability, we introduced the concept of signature-specific cutoffs: a signature which leads to many false positive calls has to cross a higher threshold than a signature which rarely leads to false positive calls. For YAPSA, we have computed optimal signature cutoffs by a modfied Reciever Operating Characteristic (ROC) analysis on the same data as it had been used for the initial NMF analyses with the extraction of the published mutational signatures ([@Alex2013] and [@Alex2018]). Numeric values for these cutoffs are stored in dataframes accessible after loading the package YAPSA. In the implementation of the package `r CRANpkg("ROCR")` [@ROCR_package2005], a modified ROC analysis is parametrized by defining cost terms for punishing false-negative ($cost_{FN}$) and false-positive ($cost_{FP}$) findings separately. In the following, we will call the ratio between the cost for a false-negative finding divided by the cost for a false positive finding the $cost_{factor}$ ($cost_{factor} = cost_{FN}/cost_{FP}$). Separate ROC analyses were performed for every signature. For every signature, the global minimum of the cost function defines the optimal value for the signature-specific cutoff. However, the shape of the cost function and hence the position of its minimum depend on $cost_{factor}$. Therefore there is one set of optimal signature-specific cutoffs per chosen value of $cost_{factor}$ ([cf. below](#OptimalCutoffs)). After having computed the optimal signature-specific cutoffs for a range of values for $cost_{factor}$, we applied an additional criterion in order to get an optimal value for $cost_{factor}$: minimization of the overall number of false attributions. Using this criterion, for the Alexandrov COSMIC signatures (`AC1 - AC30`), the optimal value for $cost_{factor}$ was determined to be $6$. # Overview of the data {#DataOverview} ## Signatures In YAPSA, the patterns of nucleotide exchanges in their triplet context constituting the mutational signatures are stored in data frames which can be loaded as follows (more information provided by the help function, cf. ([1. Usage of YAPSA](http://bioconductor.org/packages/3.11/bioc/vignettes/YAPSA/inst/doc/YAPSA.html)) for an overview of how these data frames can be re-created from data available online): ```{r load signatures} data(sigs) data(sigs_pcawg) ``` The first command, `data(sigs)`, loads eight dataframes into the workspace. Among these, four contain the patterns of nucleotide exchanges in their triplet context, i.e., the mutational signatures themselves, and the remaining four dataframes contain additional information on the signatures, including a naming and numbering convention, a description of the asserted underlying mutational process and a choice of generic colour coding. In particular, these eight signature dataframes are: * Two signature and two additonal information dataframes containing SNV Alexandrov COSMIC signatures (available at , downloaded on January 15th, 2016): * `AlexCosmicArtif_sig_df` and `AlexCosmicArtif_sigInd_df`, * `AlexCosmicValid_sig_df` and `AlexCosmicValid_sigInd_df` * Two signature and two additonal information dataframes containing SNV Alexadrov Initial signatures (the signatures presented in the initial publication [@Alex2013]): * `AlexInitialArtif_sig_df` and `AlexInitialArtif_sigInd_df`, * `AlexInitialValid_sig_df` and `AlexInitialValid_sigInd_df` Of note, all data provided in YAPSA follows a consistent naming convention: * Names containing `Valid` or `Real` refer to those subsets of signatures which have been identified by NMF in the respective discovery analyses and validated using orthogonal sequencing technologies. * Names containing `Artif` refer to extended sets of signatures including also those signatures which have later been ascribed to be artifact signatures (reflecting, among others, sequencing errors). The second command in the above code block, `data(sigs_pcawg)`, loads six additional dataframes into the workspace, three dataframes containing mutational signatures themselves and three dataframes containing additional information on the signatures. In particular, these six dataframes are: * Two signature and two additional information dataframes containing SNV Alexandrov PCAWG signatures (available at , downloaded on August 16th, 2018): * `PCAWG_SP_SBS_sigs_Artif_df` and `PCAWG_SP_SBS_sigInd_Artif_df`, * `PCAWG_SP_SBS_sigs_Real_df` and `PCAWG_SP_SBS_sigInd_Real_df` * One signature and one additional information dataframe containing Indel Alexandrov PCAWG signatures (also available at , downloaded on August 16th, 2018): * `PCAWG_SP_ID_sigs_df` and `PCAWG_SP_ID_sigInd_df` As of december 2019 and to our knowledge, it is a unique feature of YAPSA to include PCAWG mutational signatures. Of note, a [separate vignette](http://bioconductor.org/packages/3.11/bioc/vignettes/YAPSA/inst/doc/vignettes_Indel.html) describes the usage of and analysis with Indel signatures. ## Optimal cutoffs {#OptimalCutoffs} For each of the signature dataframes, corresponding dataframes storing numerical values of the signature-specific cutoffs are available. The cutoff dataframes can be loaded to the workspace analogously to the signatures themselves: ```{r load cutoffs} data(cutoffs) data(cutoffs_pcawg) ``` The command `data(cutoffs)` loads cutoff dataframes corresponding to the four Alexandrov COSMIC and Alexandrov initial sets of signatures. In the following, an additional naming convention is introduced: * Absolute (abs) signature-specific cutoffs were computed using an ROC analysis performed on absolute exposures as provided by the initial NMF analysis * Relative (rel) signature-specific cutoffs were computed using on ROC analysis performed on relative, i.e., normalized exposures as provided by the initial NMF analysis. This double training has proven to be both useful and necessary in order to account for the heterogeneity of the underlying training data: on one hand, a training with absolute exposures weighs all mutations equally, however a whole genome sequenced (WGS) sample gets a much higher weight due to the high number of mutations detected in it than a whole exome sequenced (WES sample). On the other hand, a training with relative or normalized signature exposures weighs all samples equally, but at the cost of having different weights for mutations originating from samples of different mutation count. Based on these considerations, cutoffs derived from the training with absolute exposures should be used for the supervised analysis of WGS data, whereas cutoffs trained on relative exposures are better suited for the analysis of WES data. The two different sets of cutoffs thus account for the different maximal feature occurrence between WGS and WES data. The correspondence between signatures and cutoffs is as follows: * For `AlexCosmicValid_sig_df`, the corresponding cutoff dataframes are `cutoffCOSMICValid_abs_df` and `cutoffCOSMICValid_rel_df`. * For `AlexCosmicArtif_sig_df`, the corresponding cutoff dataframes are `cutoffCOSMICArtif_abs_df` and `cutoffCOSMICArtif_rel_df`. With `data(cutoffs_pcawg)` the signature-specific cutoff dataframes for the PCAWG signatures are loaded. In this case, only one dataframe with cutoffs per signature dataframe is available. Signature-specific cutoffs are valid for WGS and WES data analysis as differences in the feature occurrence is taken into account during the training procedure. In the provided dataframes with numerical values for optimal signature-specific cutoffs, the columns correspond to the different signatures of the chosen signature set, whereas the rows encode different values of $cost_{factor}$ (cf. [introduction](#introduction)). For an actual analysis of mutational signatures, only one optimal cutoff value per signature is required, i.e., the user has to choose one row from the chosen cutoff dataframe. As already explained above, optimal values for $cost_{factor}$ have also been determined. These are listed in the table below. Unless a user has a specific wish to use a different $cost_{factor}$, it is generally recommended to use these default parameter choices. cutoff dataframe | optimal cost factor ------- | ----- cutoffInitialValid_abs_df | 6 cutoffInitialValid_rel_df | 6 cutoffInitialArtif_abs_df | 6 cutoffInitialArtif_rel_df | 6 cutoffCosmicValid_abs_df | 6 cutoffCosmicValid_rel_df | 6 cutoffCosmicArtif_abs_df | 6 cutoffCosmicArtif_rel_df | 6 cutoffPCAWG_SBS_WGSWES_realPid_df | 10 cutoffPCAWG_SBS_WGSWES_artiflPid_df | 16 cutoffPCAWG_ID_WGS_Pid_df | 3 : (\#tab:Table:) Overview of signature specific cutoffs and the cost factor to obtain the optimal cutoffs. Thus, if a user wants to do an analysis with the Alexandrov COSMIC signatures, discarding the artifact signatures, on WGS data, he/she would use the following code snippet: ```{r cutoff example} data(cutoffs) current_cutoff_vector <- cutoffCosmicValid_abs_df[6, ] current_cutoff_vector ``` # Application example The vignette [1. Usage of YAPSA](http://bioconductor.org/packages/3.11/bioc/vignettes/YAPSA/inst/doc/YAPSA.html) shows how to obtain a mutational catalog ($V$) from a vcf file. Here, for the sake of simplicits, we just use an example data set stored in the software package. ```{r load data} data(lymphomaNature2013_mutCat_df) ``` The mutational catalog $V$ together with the signatures $V$ and the corresponding signature-specific cutoffs can be used to determine the signature exposures (corresponding to the matrix $H$ in the naming convention influenced by NMF) per sample or cohort. This is done by the function `LCD_complex_cutoff_combined`. In the examples provided in [1. Usage of YAPSA](http://bioconductor.org/packages/3.11/bioc/vignettes/YAPSA/inst/doc/YAPSA.html), signature exposures were determined (i) with a zero cutoff vector, i.e. without signature-specific cutoffs or (ii) with a manually chosen but not necessarily optimal cutoff. In contrast, in the following, we will perform an analysis of mutational signatures using optimal cutoffs and compare the results to an analysis without any cutoffs. ```{r set variables } current_sig_df <- AlexCosmicValid_sig_df current_sigInd_df <- AlexCosmicValid_sigInd_df ``` ## Comparison: analysis without any cutoffs {#comparisonZeroCutoffs} For the purpose of comparison, we first create a zero cutoff vector: ```{r cutoff vector} current_cutoff_vector <- rep(0, dim(AlexCosmicValid_sig_df)[2]) ``` Next, we compute the exposures to the mutational signatures using LCD analysis and this zero cutoff vector: ```{r lymphoma_cohort_LCD_results} lymphoma_COSMIC_zero_listsList <- LCD_complex_cutoff_combined( in_mutation_catalogue_df = lymphomaNature2013_mutCat_df, in_cutoff_vector = current_cutoff_vector, in_signatures_df = current_sig_df, in_sig_ind_df = current_sigInd_df) ``` The function returns an object with the results per cohort, per PID and the consensus between both analyses. Annotation of a subgroup to each PID allows to group the samples per subgroup upon visualization: ```{r subrgroup annotation} data(lymphoma_PID) colnames(lymphoma_PID_df) <- "SUBGROUP" lymphoma_PID_df$PID <- rownames(lymphoma_PID_df) COSMIC_subgroups_df <- make_subgroups_df(lymphoma_PID_df, lymphoma_COSMIC_zero_listsList$cohort$exposures) ``` ```{r caption_barplot_2, echo=FALSE} cap <- "Absoute exposures of the COSMIC signatures in the lymphoma mutational catalogs, signature-specific cutoffs with a cost factor of 6 used for the LCD" ``` We select the cohort-wide analysis for visualization and make use of the custom plotting function `exposures_barplot()` in order to display signature exposures as stacked barplots. Note the occurrence of multiple signatures with partially very low exposures. ```{r exposures_zero, warning=FALSE, fig.width=8, fig.height=6, fig.cap= cap} result_cohort <- lymphoma_COSMIC_zero_listsList$cohort exposures_barplot( in_exposures_df = result_cohort$exposures, in_signatures_ind_df = result_cohort$out_sig_ind_df, in_subgroups_df = COSMIC_subgroups_df) ``` ## Making use of optimal signature-specific cutoffs {#optimalSpecificCutoffs} The zero-cutoff vector will now be replaced by the signature-specific cutoff vector. After the initial LCD analysis only signatures that contribute higher then their signature specific cutoff value will be considered, leading to a reduction of the overall number of detected signatures. Based on \@ref(tab:Table1) we know that for the validated COSMIC Signatures the optimal cutoffs are the once computed with a cost factor of six. ```{r set signature-specific cutoff} current_cutoff_df <- cutoffCosmicValid_abs_df current_cost_factor <- 6 current_cutoff_vector <- current_cutoff_df[current_cost_factor,] ``` We again compute signature exposures, but this time using the vector of optimal signature-specific cutoffs: ```{r LCD with cutoffs} lymphoma_COSMIC_listsList <- LCD_complex_cutoff_combined( in_mutation_catalogue_df = lymphomaNature2013_mutCat_df, in_cutoff_vector = current_cutoff_vector, in_signatures_df = current_sig_df, in_sig_ind_df = current_sigInd_df) ``` And again proceeding to a visualization (of note, the sample subgrouping information can be used as in the previous subsection): ```{r caption_barplot, echo=FALSE} cap <- "Absolute exposures of the COSMIC signatures in the lymphoma mutational catalogs, signature-specific cutoffs with a cost factor of 6 used for the LCD" ``` ```{r exposures_cutoffs, warning=FALSE, fig.width=8, fig.height=6, fig.cap= cap} result_cohort <- lymphoma_COSMIC_listsList$cohort exposures_barplot( in_exposures_df = result_cohort$exposures, in_signatures_ind_df = result_cohort$out_sig_ind_df, in_subgroups_df = COSMIC_subgroups_df) ``` We note that the number of identified signatures is smaller. Those signatures which had been identified in [the analysis without any cutoffs](#comparisonZeroCutoffs) but not in [the analysis with optimal cutoffs](#optimalSpecificCutoffs) can be considered false positive calls. # Reference