%\VignetteIndexEntry{TNBC.CMS: prediction of TNBC consensus molecular subtype}
%\VignetteDepends{e1071, quadprog, SummarizedExperiment}
%\VignetteImports{GSVA, pheatmap, grDevices, RColorBrewer, pracma}
%\VignettePackage{TNBC.CMS}
%\VignetteEngine{knitr::knitr}

\documentclass{article}
\usepackage{graphicx}
\usepackage{microtype}
\usepackage[T1]{fontenc}
\usepackage[latin1]{inputenc}
\usepackage{lmodern}
\usepackage{geometry}
\usepackage{authblk}
\usepackage{float}
\geometry{verbose,tmargin=2.5cm,bmargin=2.5cm,lmargin=2.5cm,rmargin=2.5cm}
\usepackage[table]{xcolor}
\bibliographystyle{unsrt}

\begin{document}

<<include=FALSE>>=
library(knitr)
opts_chunk$set(concordance=TRUE)
options(scipen = 1, digits = 2, warn = -1)
@


\title{\texttt{TNBC.CMS}: prediction of TNBC consensus molecular subtype}

\author[1]{Doyeong Yu}
\author[1]{Jihyun Kim}
\author[2]{In Hae Park}
\author[1]{Charny Park}

\affil[1]{Clinical Genomics Analysis Branch, Research Institute,
National Cancer Center, Gyeonggi-do, Republic of Korea}
\affil[2]{Center for Breast Cancer, Hospital, National Cancer Center,
Gyeonggi-do, Republic of Korea}

\date{March 10, 2019}
\maketitle
\tableofcontents
\pagebreak
%------------------------------------------------------------
\section{Introduction}
%------------------------------------------------------------
TNBC.CMS is a package for molecular subtype classification of
triple-negative breast cancer (TNBC). While various classification strategies
have been proposed, absence of precise subtype classifier was a limitation of
patient diagnosis and TNBC studies. Our machine learning-based classifier model
was derived from gene expression profiles of 957 TNBC patients.
The TNBC.CMS package classifies patients into four consensus molecular subtypes
(CMS): mesenchymal-like (MSL), immunomodulatory (IM), luminal AR (LAR) and
stem-like (SL). It also provides a summary of genomic and clinical
characteristics including survival, hazard ratio, pathway activities and
drug responses.
%------------------------------------------------------------
\section{Loading package and dataset for case studies}
%------------------------------------------------------------
In this vignette, we walk through a case study of the breast cancer microarray
dataset GSE25055 [1] to demonstrate the practical use of our package.
The GSE25055 dataset was obtained from the \texttt{curatedBreastData} package.
We filtered out samples which seemed to be positive for ER, PR,
and HER2 based on immunohistochemistry results and the distribution of
gene expression.\paragraph{}
First, we load the package and the processed expression data. The dataset is
contained in a \texttt{SummarizedExperiment} object, which includes expression
profiles of 4,746 genes and 73 samples. Note that rows and columns
correspond to genes and samples, and row names must be gene symbols.
<<loadPackagesAndData,message=FALSE>>=
library(TNBC.CMS)
data("GSE25055")
dim(assays(GSE25055)[[1]])
assays(GSE25055)[[1]][1:5, 1:5]
@

\pagebreak

%------------------------------------------------------------
\section{Case study: CMS classification}
%------------------------------------------------------------
The \texttt{predictCMS} function assigns consensus molecular subtypes
to TNBC samples based on input matrix or \texttt{SummarizedExperiment} object.
If input is a \texttt{SummarizedExperiment} object, the first element in the
assays list should be a matrix of gene expression. In any case, gene expression
profiles should neither be scaled nor log-transformed. Class probabilities
can be retrieved by accessing the \texttt{probabilities} attribute.
<<prediction>>=
predictions <- predictCMS(GSE25055)
table(predictions)
head(attr(predictions, "probabilities"))
@

%------------------------------------------------------------
\section{Case study: summary of genomic and clinical characteristics}
%------------------------------------------------------------
The \texttt{TNBC.CMS} package includes several functions for studying
genomic and clinical characteristics of the consensus molecular subtypes.
In this section, we apply these functions to the GSE25055 datasets of
gene expression and clinical features.\paragraph{}
The \texttt{computeGES} function calculates signature scores for the
following 7 gene expression signatures: EMT (epithelial-mesencymal
transition), stromal, immune, microenvironment, stemness, hormone, and
CIN (chromosomal instability) [2-6]. For more details about the gene expression
signatures, please see the manual page for \texttt{computeGES}
function.\paragraph{}
As shown in Figure 1, this function also draws boxplots of signature scores
with p-values of comparison among the subtypes. Stromal, immune, hormone,
and stemness scores are significantly higher in MSL, IM, LAR, and SL subgroups
than in other subgroups, respectively.

<<include=FALSE>>=
resultGES <- t(as.matrix(colData(GSE25055)[,3:9]))
@

<<eval=FALSE>>=
resultGES <- computeGES(expr = GSE25055, pred = predictions,
                        rnaseq = FALSE)
<<eval=TRUE>>=
resultGES[,1:4]
@

<<label=gesplot,echo=FALSE,fig.height=4,fig.cap="Gene expression signature scores",fig.pos="H",message=FALSE>>=
library(grid)
library(ggpubr)
CMS.palette <- c("MSL" = "brown2", "IM" = "gold2",
                   "LAR" = "yellowgreen", "SL" = "midnightblue")
pval1 <- 8.1e-7
pval2 <- 3.6e-6
pval3 <- 0.016
pval4 <- 0.00034
TITLE_SIZE <- 12
SUBTITLE_SIZE <- 10

sigdat <- data.frame(t(resultGES))
sigdat$CMS <- predictions

sub1 <- bquote(paste("Wilcoxon (MSL vs. others) ",
                         italic(p), " = ", .(format(pval1, digits = 2))))
sub2 <- bquote(paste("Wilcoxon (MSL vs. others) ",
                         italic(p), " = ", .(format(pval2, digits = 2))))
sub3 <- bquote(paste("Wilcoxon (MSL vs. others) ",
                         italic(p), " = ", .(format(pval3, digits = 2))))
sub4 <- bquote(paste("Wilcoxon (MSL vs. others) ",
                         italic(p), " = ", .(format(pval4, digits = 2))))

p1 <- ggboxplot(sigdat, x = "CMS", y = "Stromal",
                fill = "CMS", palette = CMS.palette) +
  theme_bw() + labs(title = "Stromal", subtitle = sub1) +
  theme(legend.position = "none", axis.title = element_blank(),
        axis.text.x = element_blank(), axis.ticks.x = element_blank(),
        panel.grid = element_blank(),
        plot.title = element_text(size = TITLE_SIZE,
                                  hjust = 0.5, face = "bold"),
        plot.subtitle = element_text(size=  SUBTITLE_SIZE, hjust = 0.5))

p2 <- ggboxplot(sigdat, x = "CMS", y = "Immune",
                fill = "CMS", palette = CMS.palette) +
  theme_bw() + labs(title = "Immune", subtitle = sub2) +
  theme(legend.position = "none", axis.title = element_blank(),
        axis.text.x = element_blank(), axis.ticks.x = element_blank(),
        panel.grid = element_blank(),
        plot.title = element_text(size = TITLE_SIZE,
                                  hjust = 0.5, face = "bold"),
        plot.subtitle = element_text(size=  SUBTITLE_SIZE, hjust = 0.5))

p3 <- ggboxplot(sigdat, x = "CMS", y = "Hormone",
                fill = "CMS", palette = CMS.palette) +
  theme_bw() + labs(title = "Hormone", subtitle = sub3) +
  theme(legend.position = "none", axis.title = element_blank(),
        axis.text.x = element_blank(), axis.ticks.x = element_blank(),
        panel.grid = element_blank(),
        plot.title = element_text(size = TITLE_SIZE,
                                  hjust = 0.5, face = "bold"),
        plot.subtitle = element_text(size=  SUBTITLE_SIZE, hjust = 0.5))

p4 <- ggboxplot(sigdat, x = "CMS", y = "Stemness",
                fill = "CMS", palette = CMS.palette) +
  theme_bw() + labs(title = "Stem", subtitle = sub4) +
  theme(legend.position = "none", axis.title = element_blank(),
        axis.text.x = element_blank(), axis.ticks.x = element_blank(),
        panel.grid = element_blank(),
        plot.title = element_text(size = TITLE_SIZE,
                                  hjust = 0.5, face = "bold"),
        plot.subtitle = element_text(size=  SUBTITLE_SIZE, hjust = 0.5))

g1 <- ggplotGrob(p1)
g2 <- ggplotGrob(p2)
g3 <- ggplotGrob(p3)
g4 <- ggplotGrob(p4)
g <- cbind(rbind(g1, g3, size = "first"),
           rbind(g2,  g4, size = "first"), size = "first")
grid.newpage()
grid.draw(g)
@

The \texttt{performGSVA} function performs gene set variation analysis
on gene sets and produces a heatmap representing GSVA enrichment scores [7].
If gene sets are not given, the hallmark pathway gene sets are used [8].
The user can also choose a kernel for estimating the cumulative distribution
function of expression values by setting the \texttt{gsva.kcdf} argument,
which is set to \texttt{"Gaussian"} by default. If expression levels are
integer counts, the \texttt{"Poisson"} is recommended.\paragraph{}
Figure 2 shows differential activation of the hallmark pathways across
the subtypes. The MSL subtype has high levels of EMT and P53 pathway
activation, and the IM subtype shows the high interferon gamma response.
AR and ER response pathways are highly activated in the LAR subtpe, and
the expression of cell cycle associated pathway genes is up-regulated
in the SL subtype.
<<label=gsa,fig.height=6,fig.cap="GSVA enrichment scores",fig.pos="H",fig.show="hold">>=
resultGSVA <- performGSVA(expr = GSE25055, pred = predictions,
                          gene.set = NULL)
head(resultGSVA[,1:4])
@

The \texttt{TNBC.CMS} package provides two functions for survival
analysis: \texttt{plotKM} and \texttt{plotHR}. Here, we use the
survival data from the GSE25055 dataset to study the association between
overall survival and the consensus molecular subtypes. The survival data is also
included in the \texttt{SummarizedExperiment} object and can be accessed using
the \texttt{colData} function. \paragraph{}
The \texttt{plotKM} function produces a Kaplan-Meier curve for each
consensus molecular subtype like Figure 3. The SL group showed the worst
prognosis, which is consistent with our previous study.
<<label=os,fig.height=3,fig.width=5,fig.cap="Overall survival",fig.pos="H",fig.show="hold",fig.align="center">>=
time <- colData(GSE25055)$DFS.month
event <- colData(GSE25055)$DFS.status
plotKM(pred = predictions, time = time, event = event)
@

The \texttt{plotHR} produces a forest plot of hazard ratios for genes
that the user provides. For each input gene, samples are divided into
high and low groups based on its expression level and the 95\% confidence
interval for the hazard ratio is calculated. We selected 10 genes significantly
associated with overall survival and generated a forest plot (Figure 4).
<<label=hr,fig.height=3,fig.cap="Forest plot of hazard ratios",fig.pos="H",fig.show="hold">>=
library(survival)

#Test for difference of survival between low and high expression groups
surv <- Surv(time, event)
GSE25055.exprs <- assays(GSE25055)[[1]]
chisq <- apply(GSE25055.exprs, 1, function(x) survdiff(surv ~ (x > median(x)))$chisq)
pval <- 1 - pchisq(chisq, 1)

#Select 10 genes with lowest p-values for the log-rank test
gs <- names(sort(pval)[1:10])
gs
plotHR(expr = GSE25055, gene.symbol = gs, pred = predictions, time = time,
       event = event, by.subtype = FALSE)
@

Also, as shown in Figure 5, subtype-specific hazard ratios for genes
of interest can be computed by setting the \texttt{by.subtype} argument.
<<label=hrsub,fig.height=4,fig.cap="Forest plot of subtype-specific hazard ratios",fig.pos="H",fig.show="hold">>=
plotHR(expr = GSE25055, gene.symbol = gs[1:4], pred = predictions,
       time = time, event = event, by.subtype = TRUE)
@

%------------------------------------------------------------
\section{Case study: drug response investigation}
%------------------------------------------------------------
The \texttt{TNBC.CMS} package also provides a function for predicting
drug responses. The \texttt{computeDS} function computes drug signature
scores for the corresponding gene sets in the MSigDB CGP (chemical and
genetic perturbations) collection [9] and draws a heatmap of the signature
scores. Drug signature scores are calculated as the difference between
the average expression values of gene sets associated with drug response
and resistance. The higher a signature score is, the more likely a patient
is to be responsive. The user can provide their own gene sets via
the \texttt{gene.set} argument. Note that names of gene sets must follow
the format of [DRUG NAME]{\_}[RESPONSE/RESISTANCE]{\_}[UP/DN]
(e.g. CISPLATIN{\_}RESISTANCE{\_}UP).\paragraph{}
Figure 6 shows a heatmap of drug signature scores for each sample.
The MSL and SL subtypes appear to be resistant to dasatinib and doxorubicin,
respectively. Also, the IM and LAR subtypes show higher levels of signature
scores for androgen agonist and SB216763 (an inhibitor of GSK3B)
than other subtypes, respectively.
<<label=ds,fig.height=3,fig.cap="Drug signature scores",fig.pos="H",fig.show="hold">>=
resultDS <- computeDS(expr = GSE25055, pred = predictions)
head(resultDS[,1:4])
@

\pagebreak

%------------------------------------------------------------
\section{Saving results}
%------------------------------------------------------------
For future analysis, it is useful to save the results of subtype assignment
and characterization into a data frame and save it into a text file.
<<saveResults>>=
dfCMS <- data.frame(row.names = colnames(GSE25055.exprs), CMS = predictions, t(resultGES),
                    t(resultDS), stringsAsFactors = FALSE)
head(dfCMS)
write.table(dfCMS, file = "GSE25055_CMS.txt")
@

%------------------------------------------------------------
\section{Session Info}
%------------------------------------------------------------
<<sessionInfo>>=
sessionInfo()
@

\renewcommand{\refname}{\section{References}}
\begin{thebibliography}{1}

\bibitem{gse25055}
Hatzis, C. et al. (2011).
\newblock A genomic predictor of response and survival following
taxane-anthracycline chemotherapy for invasive breast cancer
\newblock {\em JAMA\/}, {\bf 305}, 1873--81.

\bibitem{emt}
Tan, T.Z. et al. (2014).
\newblock Epithelial-mesenchymal transition spectrum
quantification and its efficacy in deciphering survival and drug responses
of cancer patients.
\newblock {\em EMBO molecular medicine\/}, {\bf 6}, 1279--93.

\bibitem{estimate}
Yoshihara, K. et al. (2013).
\newblock Inferring tumour purity and stromal and
immune cell admixture from
expression data.
\newblock {\em Nature communications\/}, {\bf 4}, 2612.

\bibitem{xcell}
Aran, D. et al. (2017).
\newblock xCell: digitally portraying the tissue cellular
heterogeneity landscape.
\newblock {\em Genome biology\/}, {\bf 18}, 220.

\bibitem{stem}
Malta, T.M. et al. (2018).
\newblock Machine learning identifies stemness features
associated with oncogenic dedifferentiation.
\newblock {\em Cell\/}, {\bf 173}, 338--354.

\bibitem{cin}
Carter, S.L. et al. (2006).
\newblock A signature of chromosomal instability
inferred from gene expression profiles predicts clinical outcome in multiple
human cancers.
\newblock {\em Nature genetics\/}, {\bf 38}, 1043.

\bibitem{gsva}
Hanzelmann, S. et al. (2013).
\newblock GSVA: gene set variation analysis for microarray and RNA-Seq data.
\newblock {\em BMC Bioinformatics\/}, {\bf 14}, 7.

\bibitem{hallmark}
Liberzon, A. et al. (2015).
\newblock The molecular signatures database hallmark gene set collection.
\newblock {\em Cell systems\/}, {\bf 1}, 417--425.

\bibitem{msigdb}
Liberzon, A. et al. (2011).
\newblock Molecular signatures database (MSigDB) 3.0.
\newblock {\em Bioinformatics\/}, {\bf 27}, 1739--40.

\end{thebibliography}

\end{document}