%\VignetteIndexEntry{TNBC.CMS: prediction of TNBC consensus molecular subtype} %\VignetteDepends{e1071, quadprog, SummarizedExperiment} %\VignetteImports{GSVA, pheatmap, grDevices, RColorBrewer, pracma} %\VignettePackage{TNBC.CMS} %\VignetteEngine{knitr::knitr} \documentclass{article} \usepackage{graphicx} \usepackage{microtype} \usepackage[T1]{fontenc} \usepackage[latin1]{inputenc} \usepackage{lmodern} \usepackage{geometry} \usepackage{authblk} \usepackage{float} \geometry{verbose,tmargin=2.5cm,bmargin=2.5cm,lmargin=2.5cm,rmargin=2.5cm} \usepackage[table]{xcolor} \bibliographystyle{unsrt} \begin{document} <>= library(knitr) opts_chunk$set(concordance=TRUE) options(scipen = 1, digits = 2, warn = -1) @ \title{\texttt{TNBC.CMS}: prediction of TNBC consensus molecular subtype} \author[1]{Doyeong Yu} \author[1]{Jihyun Kim} \author[2]{In Hae Park} \author[1]{Charny Park} \affil[1]{Clinical Genomics Analysis Branch, Research Institute, National Cancer Center, Gyeonggi-do, Republic of Korea} \affil[2]{Center for Breast Cancer, Hospital, National Cancer Center, Gyeonggi-do, Republic of Korea} \date{March 10, 2019} \maketitle \tableofcontents \pagebreak %------------------------------------------------------------ \section{Introduction} %------------------------------------------------------------ TNBC.CMS is a package for molecular subtype classification of triple-negative breast cancer (TNBC). While various classification strategies have been proposed, absence of precise subtype classifier was a limitation of patient diagnosis and TNBC studies. Our machine learning-based classifier model was derived from gene expression profiles of 957 TNBC patients. The TNBC.CMS package classifies patients into four consensus molecular subtypes (CMS): mesenchymal-like (MSL), immunomodulatory (IM), luminal AR (LAR) and stem-like (SL). It also provides a summary of genomic and clinical characteristics including survival, hazard ratio, pathway activities and drug responses. %------------------------------------------------------------ \section{Loading package and dataset for case studies} %------------------------------------------------------------ In this vignette, we walk through a case study of the breast cancer microarray dataset GSE25055 [1] to demonstrate the practical use of our package. The GSE25055 dataset was obtained from the \texttt{curatedBreastData} package. We filtered out samples which seemed to be positive for ER, PR, and HER2 based on immunohistochemistry results and the distribution of gene expression.\paragraph{} First, we load the package and the processed expression data. The dataset is contained in a \texttt{SummarizedExperiment} object, which includes expression profiles of 4,746 genes and 73 samples. Note that rows and columns correspond to genes and samples, and row names must be gene symbols. <>= library(TNBC.CMS) data("GSE25055") dim(assays(GSE25055)[[1]]) assays(GSE25055)[[1]][1:5, 1:5] @ \pagebreak %------------------------------------------------------------ \section{Case study: CMS classification} %------------------------------------------------------------ The \texttt{predictCMS} function assigns consensus molecular subtypes to TNBC samples based on input matrix or \texttt{SummarizedExperiment} object. If input is a \texttt{SummarizedExperiment} object, the first element in the assays list should be a matrix of gene expression. In any case, gene expression profiles should neither be scaled nor log-transformed. Class probabilities can be retrieved by accessing the \texttt{probabilities} attribute. <>= predictions <- predictCMS(GSE25055) table(predictions) head(attr(predictions, "probabilities")) @ %------------------------------------------------------------ \section{Case study: summary of genomic and clinical characteristics} %------------------------------------------------------------ The \texttt{TNBC.CMS} package includes several functions for studying genomic and clinical characteristics of the consensus molecular subtypes. In this section, we apply these functions to the GSE25055 datasets of gene expression and clinical features.\paragraph{} The \texttt{computeGES} function calculates signature scores for the following 7 gene expression signatures: EMT (epithelial-mesencymal transition), stromal, immune, microenvironment, stemness, hormone, and CIN (chromosomal instability) [2-6]. For more details about the gene expression signatures, please see the manual page for \texttt{computeGES} function.\paragraph{} As shown in Figure 1, this function also draws boxplots of signature scores with p-values of comparison among the subtypes. Stromal, immune, hormone, and stemness scores are significantly higher in MSL, IM, LAR, and SL subgroups than in other subgroups, respectively. <>= resultGES <- t(as.matrix(colData(GSE25055)[,3:9])) @ <>= resultGES <- computeGES(expr = GSE25055, pred = predictions, rnaseq = FALSE) <>= resultGES[,1:4] @ <>= library(grid) library(ggpubr) CMS.palette <- c("MSL" = "brown2", "IM" = "gold2", "LAR" = "yellowgreen", "SL" = "midnightblue") pval1 <- 8.1e-7 pval2 <- 3.6e-6 pval3 <- 0.016 pval4 <- 0.00034 TITLE_SIZE <- 12 SUBTITLE_SIZE <- 10 sigdat <- data.frame(t(resultGES)) sigdat$CMS <- predictions sub1 <- bquote(paste("Wilcoxon (MSL vs. others) ", italic(p), " = ", .(format(pval1, digits = 2)))) sub2 <- bquote(paste("Wilcoxon (MSL vs. others) ", italic(p), " = ", .(format(pval2, digits = 2)))) sub3 <- bquote(paste("Wilcoxon (MSL vs. others) ", italic(p), " = ", .(format(pval3, digits = 2)))) sub4 <- bquote(paste("Wilcoxon (MSL vs. others) ", italic(p), " = ", .(format(pval4, digits = 2)))) p1 <- ggboxplot(sigdat, x = "CMS", y = "Stromal", fill = "CMS", palette = CMS.palette) + theme_bw() + labs(title = "Stromal", subtitle = sub1) + theme(legend.position = "none", axis.title = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank(), panel.grid = element_blank(), plot.title = element_text(size = TITLE_SIZE, hjust = 0.5, face = "bold"), plot.subtitle = element_text(size= SUBTITLE_SIZE, hjust = 0.5)) p2 <- ggboxplot(sigdat, x = "CMS", y = "Immune", fill = "CMS", palette = CMS.palette) + theme_bw() + labs(title = "Immune", subtitle = sub2) + theme(legend.position = "none", axis.title = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank(), panel.grid = element_blank(), plot.title = element_text(size = TITLE_SIZE, hjust = 0.5, face = "bold"), plot.subtitle = element_text(size= SUBTITLE_SIZE, hjust = 0.5)) p3 <- ggboxplot(sigdat, x = "CMS", y = "Hormone", fill = "CMS", palette = CMS.palette) + theme_bw() + labs(title = "Hormone", subtitle = sub3) + theme(legend.position = "none", axis.title = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank(), panel.grid = element_blank(), plot.title = element_text(size = TITLE_SIZE, hjust = 0.5, face = "bold"), plot.subtitle = element_text(size= SUBTITLE_SIZE, hjust = 0.5)) p4 <- ggboxplot(sigdat, x = "CMS", y = "Stemness", fill = "CMS", palette = CMS.palette) + theme_bw() + labs(title = "Stem", subtitle = sub4) + theme(legend.position = "none", axis.title = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank(), panel.grid = element_blank(), plot.title = element_text(size = TITLE_SIZE, hjust = 0.5, face = "bold"), plot.subtitle = element_text(size= SUBTITLE_SIZE, hjust = 0.5)) g1 <- ggplotGrob(p1) g2 <- ggplotGrob(p2) g3 <- ggplotGrob(p3) g4 <- ggplotGrob(p4) g <- cbind(rbind(g1, g3, size = "first"), rbind(g2, g4, size = "first"), size = "first") grid.newpage() grid.draw(g) @ The \texttt{performGSVA} function performs gene set variation analysis on gene sets and produces a heatmap representing GSVA enrichment scores [7]. If gene sets are not given, the hallmark pathway gene sets are used [8]. The user can also choose a kernel for estimating the cumulative distribution function of expression values by setting the \texttt{gsva.kcdf} argument, which is set to \texttt{"Gaussian"} by default. If expression levels are integer counts, the \texttt{"Poisson"} is recommended.\paragraph{} Figure 2 shows differential activation of the hallmark pathways across the subtypes. The MSL subtype has high levels of EMT and P53 pathway activation, and the IM subtype shows the high interferon gamma response. AR and ER response pathways are highly activated in the LAR subtpe, and the expression of cell cycle associated pathway genes is up-regulated in the SL subtype. <>= resultGSVA <- performGSVA(expr = GSE25055, pred = predictions, gene.set = NULL) head(resultGSVA[,1:4]) @ The \texttt{TNBC.CMS} package provides two functions for survival analysis: \texttt{plotKM} and \texttt{plotHR}. Here, we use the survival data from the GSE25055 dataset to study the association between overall survival and the consensus molecular subtypes. The survival data is also included in the \texttt{SummarizedExperiment} object and can be accessed using the \texttt{colData} function. \paragraph{} The \texttt{plotKM} function produces a Kaplan-Meier curve for each consensus molecular subtype like Figure 3. The SL group showed the worst prognosis, which is consistent with our previous study. <>= time <- colData(GSE25055)$DFS.month event <- colData(GSE25055)$DFS.status plotKM(pred = predictions, time = time, event = event) @ The \texttt{plotHR} produces a forest plot of hazard ratios for genes that the user provides. For each input gene, samples are divided into high and low groups based on its expression level and the 95\% confidence interval for the hazard ratio is calculated. We selected 10 genes significantly associated with overall survival and generated a forest plot (Figure 4). <>= library(survival) #Test for difference of survival between low and high expression groups surv <- Surv(time, event) GSE25055.exprs <- assays(GSE25055)[[1]] chisq <- apply(GSE25055.exprs, 1, function(x) survdiff(surv ~ (x > median(x)))$chisq) pval <- 1 - pchisq(chisq, 1) #Select 10 genes with lowest p-values for the log-rank test gs <- names(sort(pval)[1:10]) gs plotHR(expr = GSE25055, gene.symbol = gs, pred = predictions, time = time, event = event, by.subtype = FALSE) @ Also, as shown in Figure 5, subtype-specific hazard ratios for genes of interest can be computed by setting the \texttt{by.subtype} argument. <>= plotHR(expr = GSE25055, gene.symbol = gs[1:4], pred = predictions, time = time, event = event, by.subtype = TRUE) @ %------------------------------------------------------------ \section{Case study: drug response investigation} %------------------------------------------------------------ The \texttt{TNBC.CMS} package also provides a function for predicting drug responses. The \texttt{computeDS} function computes drug signature scores for the corresponding gene sets in the MSigDB CGP (chemical and genetic perturbations) collection [9] and draws a heatmap of the signature scores. Drug signature scores are calculated as the difference between the average expression values of gene sets associated with drug response and resistance. The higher a signature score is, the more likely a patient is to be responsive. The user can provide their own gene sets via the \texttt{gene.set} argument. Note that names of gene sets must follow the format of [DRUG NAME]{\_}[RESPONSE/RESISTANCE]{\_}[UP/DN] (e.g. CISPLATIN{\_}RESISTANCE{\_}UP).\paragraph{} Figure 6 shows a heatmap of drug signature scores for each sample. The MSL and SL subtypes appear to be resistant to dasatinib and doxorubicin, respectively. Also, the IM and LAR subtypes show higher levels of signature scores for androgen agonist and SB216763 (an inhibitor of GSK3B) than other subtypes, respectively. <>= resultDS <- computeDS(expr = GSE25055, pred = predictions) head(resultDS[,1:4]) @ \pagebreak %------------------------------------------------------------ \section{Saving results} %------------------------------------------------------------ For future analysis, it is useful to save the results of subtype assignment and characterization into a data frame and save it into a text file. <>= dfCMS <- data.frame(row.names = colnames(GSE25055.exprs), CMS = predictions, t(resultGES), t(resultDS), stringsAsFactors = FALSE) head(dfCMS) write.table(dfCMS, file = "GSE25055_CMS.txt") @ %------------------------------------------------------------ \section{Session Info} %------------------------------------------------------------ <>= sessionInfo() @ \renewcommand{\refname}{\section{References}} \begin{thebibliography}{1} \bibitem{gse25055} Hatzis, C. et al. (2011). \newblock A genomic predictor of response and survival following taxane-anthracycline chemotherapy for invasive breast cancer \newblock {\em JAMA\/}, {\bf 305}, 1873--81. \bibitem{emt} Tan, T.Z. et al. (2014). \newblock Epithelial-mesenchymal transition spectrum quantification and its efficacy in deciphering survival and drug responses of cancer patients. \newblock {\em EMBO molecular medicine\/}, {\bf 6}, 1279--93. \bibitem{estimate} Yoshihara, K. et al. (2013). \newblock Inferring tumour purity and stromal and immune cell admixture from expression data. \newblock {\em Nature communications\/}, {\bf 4}, 2612. \bibitem{xcell} Aran, D. et al. (2017). \newblock xCell: digitally portraying the tissue cellular heterogeneity landscape. \newblock {\em Genome biology\/}, {\bf 18}, 220. \bibitem{stem} Malta, T.M. et al. (2018). \newblock Machine learning identifies stemness features associated with oncogenic dedifferentiation. \newblock {\em Cell\/}, {\bf 173}, 338--354. \bibitem{cin} Carter, S.L. et al. (2006). \newblock A signature of chromosomal instability inferred from gene expression profiles predicts clinical outcome in multiple human cancers. \newblock {\em Nature genetics\/}, {\bf 38}, 1043. \bibitem{gsva} Hanzelmann, S. et al. (2013). \newblock GSVA: gene set variation analysis for microarray and RNA-Seq data. \newblock {\em BMC Bioinformatics\/}, {\bf 14}, 7. \bibitem{hallmark} Liberzon, A. et al. (2015). \newblock The molecular signatures database hallmark gene set collection. \newblock {\em Cell systems\/}, {\bf 1}, 417--425. \bibitem{msigdb} Liberzon, A. et al. (2011). \newblock Molecular signatures database (MSigDB) 3.0. \newblock {\em Bioinformatics\/}, {\bf 27}, 1739--40. \end{thebibliography} \end{document}