--- title: "PCOSP: Pancreatic Cancer Overall Survival Predictor" author: - name: Vandana Sandhu - name: Heewon Seo - name: Christopher Eeles affiliation: - &pm Bioinformatics and Computational Genomics Laboratory, Princess Margaret Cancer Center, University Health Network, Toronto, Ontario, Canada email: christopher.eeles@uhnresearch.ca - name: Benjamin Haibe-Kains affiliation: - *pm - &mbp Department of Medical Biophysics, University of Toronto, Toronto, Canada email: benjamin.haibe.kains@utoronto.ca date: 2021-02-01 output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{PCOSP: Pancreatic Cancer Overall Survival Predictor} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # Pancreatic Cancer Overall Survival Predictor As an example of the utility of the PDATK package, we provide code replicating the analysis published in Sandhu *et al.* (2019). While the code presented here is run on a subset of the original data to ensure that the PDATK installation size is not too large, the full data from that study can is available via the `MetaGxPancreas` Bioconductor data package. ```{r include=FALSE, echo=FALSE, eval=TRUE} library(BiocParallel) if (Sys.info()['sysname'] == 'windows') { BiocParallel::register(BiocParallel::SerialParam()) } ``` ```{r load_dependencies, message=FALSE, warning=FALSE, include=FALSE} library(PDATK) library(msigdbr) library(data.table) ``` ```{r load_sample_data} data(sampleCohortList) sampleCohortList ``` ## Split Training and Validation Data To get started using PDATK, place each of your patient cohorts into `SurvivalExperiment` objects and then assemble those into a master `CohortList`, which holds the training and validation data for use with the various `SurvivalModel`s in this package. ```{r subset_and_split_data} commonGenes <- findCommonGenes(sampleCohortList) # Subsets all list items, with subset for specifying rows and select for # specifying columns cohortList <- subset(sampleCohortList, subset=commonGenes) ICGCcohortList <- cohortList[grepl('ICGC', names(cohortList), ignore.case=TRUE)] validationCohortList <- cohortList[!grepl('icgc', names(cohortList), ignore.case=TRUE)] ``` Since we are interested in predicting survival, it is necessary to remove patients with insufficient data to be useful. In general, we want to remove patients who did not have an event in our period of interest. As such we remove samples who dropped out of a study, but did not pass away before the first year. ```{r drop_not_censored_patients} validationCohortList <- dropNotCensored(validationCohortList) ICGCcohortList <- dropNotCensored(ICGCcohortList) ``` We have now split our patient cohorts into training and validation data. For this analysis we will be training using the ICGC cohorts, which includes one cohort with RNA micro-array data and another with RNA sequencing data. When using multiple cohorts to train a model, it is required that those cohorts share samples. As a result we will take as training data all patients shared between the two cohorts and leave the remainder of patients as part of our validationData. 
```{r split_train_test}
# find common samples between our training cohorts in a cohort list
commonSamples <- findCommonSamples(ICGCcohortList)

# split into shared samples for training, the rest for testing
ICGCtrainCohorts <- subset(ICGCcohortList, select=commonSamples)
ICGCtestCohorts <- subset(ICGCcohortList, select=commonSamples, invert=TRUE)

# merge our training cohort test data into the rest of the validation data
validationCohortList <- c(ICGCtestCohorts, validationCohortList)

# drop ICGCSEQ from the validation data, because it only has 7 patients
validationCohortList <- validationCohortList[names(validationCohortList) != 'ICGCSEQ']
```

## Setup a `PCOSP` Model Object

We now have patient molecular data, annotated with the number of days survived since treatment and the survival status, and are ready to apply a `SurvivalModel` to these data. In this example, we apply a Pancreatic Cancer Overall Survival Predictor (PCOSP) model, as described in Sandhu *et al.* (2019). This class uses the `switchBox` package to create an ensemble of binary classifiers, whose votes are then tallied into a PCOSP score. A PCOSP score is simply the proportion of models predicting good survival out of the total number of models in the ensemble.

```{r build_PCOSP_model, message=FALSE}
set.seed(1987)
PCOSPmodel <- PCOSP(ICGCtrainCohorts, minDaysSurvived=365, randomSeed=1987)

# view the model parameters; these make your model run reproducible
metadata(PCOSPmodel)$modelParams
```

## Training a PCOSP Model

To simplify working with different `SurvivalModel` sub-classes, we have implemented a standard work-flow that is valid for all `SurvivalModel`s. This involves first training the model, then using it to predict risk/risk-classes for a set of validation cohorts, and finally assessing performance on the validation data.

To train a model, the `trainModel` method is used. This function abstracts away the implementation of model training, allowing end-users to focus on applying the `SurvivalModel` to make predictions without needing to understand the model internals. We hope this will make the package useful for those unfamiliar with, or uninterested in, the details of survival prediction methods.

For training a PCOSP model there are two parameters. First, `numModels` is the number of models to train for use in the ensemble classifier that predicts PCOSP scores. To keep computation brief, we train only 15 models here; however, we recommend a minimum of 1000 for real world applications. The second parameter is `minAccuracy`, the minimum accuracy a trained model must reach to be included in the final model ensemble. Paradoxically, setting this too high can actually decrease the overall performance of the PCOSP model. We recommend 0.6 as a sweet spot between random chance and over-fitting, but encourage experimentation to see what works best with your data.

```{r train_PCOSP_model, message=FALSE, warning=FALSE}
trainedPCOSPmodel <- trainModel(PCOSPmodel, numModels=15, minAccuracy=0.6)
metadata(trainedPCOSPmodel)$modelParams
```

We can see that after training, the additional model parameters are added to the `modelParams` item in the model `metadata`. The goal is to ensure that your model training, prediction and validation are fully reproducible by capturing the parameters relevant to a specific model.

## Risk Prediction with a PCOSP Model

After training, a model can be used with new data to make risk predictions and classify samples into 'good' or 'bad' survival groups. To do this, the standard `predictClasses` method is used.
Similar to `trainModel`, we have abstracted away the implementation details to provide users with a simple, consistent interface for using `SurvivalModel` sub-classes to make patient risk predictions.

```{r PCOSP_predictions, message=FALSE, warning=FALSE}
PCOSPpredValCohorts <- predictClasses(validationCohortList, model=trainedPCOSPmodel)
```

The returned `CohortList` object now indicates that each of the cohorts has predictions. This information is available in the `elementMetadata` slot of the cohort list and can be accessed with the `mcols` function from `S4Vectors`.

```{r predicted_elementMetadata}
mcols(PCOSPpredValCohorts)
```

Predicting risk with a specific model adds a corresponding metadata column to the object's `colData`. In the case of a `PCOSP` model, the new column is called `PCOSP_prob_good` and represents the proportion of models in the ensemble which predicted good survival for a given sample.

```{r risk_column}
knitr::kable(head(colData(PCOSPpredValCohorts[[1]])))
```

Additionally, binary predictions of good or bad survival can be found in the `PCOSPpredictions` item of each `SurvivalExperiment`'s `metadata`. This contains the raw predictions from the model for each classifier in the ensemble, ordered by classifier accuracy. This data is not important for end users, but is used internally when calculating validation statistics for the model. For users wishing to classify samples rather than estimate risks, we recommend a PCOSP cut-off of >0.5 for a good survival prognosis.

```{r raw_predictions}
knitr::kable(metadata(PCOSPpredValCohorts[[1]])$PCOSPpredictions[1:5, 1:5])
```

## Validating a PCOSP Model

The final step in the standard `SurvivalModel` work-flow is to compute model performance statistics on the validation data. This can be accomplished using the `validateModel` method, which adds the statistics to the `validationStats` slot of a `SurvivalModel` object and the validation data to its `validationData` slot.

```{r validate_PCOSP_model, message=FALSE, warning=FALSE}
validatedPCOSPmodel <- validateModel(trainedPCOSPmodel, valData=PCOSPpredValCohorts)
```

```{r validationStats}
knitr::kable(head(validationStats(validatedPCOSPmodel)))
```

Examining the `data.table` from the `validationStats` slot, we can see that three model performance statistics have been calculated for each of the validation cohorts. Additionally, aggregate statistics have been calculated by molecular data type and for all cohorts combined. This table can be used to generate model performance plots; we have included several functions for examining model performance.

## Plotting Model Performance

```{r PCOSP_D_index_forestplot, fig.width=8, fig.height=8, fig.wide=TRUE}
PCOSPdIndexForestPlot <- forestPlot(validatedPCOSPmodel, stat='log_D_index')
PCOSPdIndexForestPlot
```

```{r PCOSP_concordance_index_forestplot, fig.width=8, fig.height=8, fig.wide=TRUE}
PCOSPconcIndexForestPlot <- forestPlot(validatedPCOSPmodel, stat='concordance_index')
PCOSPconcIndexForestPlot
```

```{r PCOSP_ROC_curve, fig.height=8, fig.width=8, fig.wide=TRUE, message=FALSE}
cohortROCplots <- plotROC(validatedPCOSPmodel, alpha=0.05)
cohortROCplots
```

# Permutation Testing for a PCOSP Model

To compare the performance of a PCOSP model to random chance, we have included two model classes which permute either the patient prognosis labels or the feature names. These models can be used to evaluate whether a PCOSP model performs better than random chance.
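Before fitting the permuted models, it may help to see why label shuffling provides a fair null reference. The short, self-contained simulation below (plain base R, independent of any PDATK objects) shows that a score which tracks the outcome has an AUC well above 0.5, while the same score evaluated against shuffled labels concentrates around 0.5.

```{r permutation_null_intuition}
set.seed(1987)
# Toy data: a binary outcome and a score that partially tracks it
outcome <- rbinom(100, size=1, prob=0.5)
score <- outcome + rnorm(100)

# AUC as the probability that a random positive out-scores a random negative
auc <- function(score, labels) {
    mean(outer(score[labels == 1], score[labels == 0], `>`))
}

observedAUC <- auc(score, outcome)
# Null distribution: recompute the AUC against shuffled outcome labels
nullAUCs <- replicate(1000, auc(score, sample(outcome)))

c(observed=observedAUC, null_mean=mean(nullAUCs),
  null_q975=unname(quantile(nullAUCs, 0.975)))
```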
## Random Label Shuffling Model

The `RLSModel` class is a `SurvivalModel` that uses the same risk prediction algorithm as a `PCOSP` model, but randomizes the patient prognosis labels in each of the individual KTSP classifiers in the classification ensemble. Given this random prognosis label shuffling, we expect the classification results for this model to be no better than random chance. The work-flow for this model follows the standard `SurvivalModel` sub-class work-flow.

### Construct the Model Object

```{r RLSModel_constructor}
# Merge to a single SurvivalExperiment
ICGCtrainCohorts <- merge(ICGCtrainCohorts[[1]], ICGCtrainCohorts[[2]],
    cohortNames=names(ICGCtrainCohorts))

RLSmodel <- RLSModel(ICGCtrainCohorts, minDaysSurvived=365, randomSeed=1987)
```

### Train the Model

```{r RLSModel_training}
trainedRLSmodel <- trainModel(RLSmodel, numModels=15)
```

### Predict the Classes

```{r RLSModel_predictions}
RLSpredCohortList <- predictClasses(validationCohortList, model=trainedRLSmodel)
```

### Validate Model Performance

```{r RLSModel_validation}
validatedRLSmodel <- validateModel(trainedRLSmodel, RLSpredCohortList)
```

### Compare with a PCOSP Model

To compare the performance of a permuted model to that of a PCOSP model, we have included the `densityPlotModelComparison` method, which plots a density plot of model AUCs per molecular data type and overall. The dotted line represents the mean AUC of the PCOSP model. From this plot we can see that the PCOSP model performs significantly better than the random label shuffling model.

```{r RLSModel_PCOSP_comparison_plot, fig.width=8, fig.height=8, fig.wide=TRUE}
RLSmodelComparisonPlot <- densityPlotModelComparison(validatedRLSmodel,
    validatedPCOSPmodel, title='Random Label Shuffling vs PCOSP',
    mDataTypeLabels=c(rna_seq='Sequencing-based', rna_micro='Array-based',
        combined='Overall'))
RLSmodelComparisonPlot
```

## Random Gene Assignment Model

A `RandomGeneAssignmentModel` (aliased as `RGAModel` for user convenience) is similar to an `RLSModel`, except that the gene labels are randomized for each KTSP classifier in the ensemble. Again, we expect this model to perform no better than random chance.

### Construct the Model Object

```{r RGAModel_constructor}
RGAmodel <- RGAModel(ICGCtrainCohorts, randomSeed=1987)
```

### Train the Model

```{r RGAModel_training}
trainedRGAmodel <- trainModel(RGAmodel, numModels=15)
```

### Predict the Classes

```{r RGAModel_predictions}
RGApredCohortList <- predictClasses(validationCohortList, model=trainedRGAmodel)
```

### Validate Model Performance

```{r RGAModel_validation}
validatedRGAmodel <- validateModel(trainedRGAmodel, RGApredCohortList)
```

### Compare RGAModel to PCOSP

The density plot for the `RGAModel` also shows that the `PCOSP` model does significantly better than random chance.

```{r RGAModel_PCOSP_comparison_plot, fig.width=8, fig.height=8, fig.wide=TRUE}
RGAmodelComparisonPlot <- densityPlotModelComparison(validatedRGAmodel,
    validatedPCOSPmodel, title='Random Gene Assignment vs PCOSP',
    mDataTypeLabels=c(rna_seq='Sequencing-based', rna_micro='Array-based',
        combined='Overall'))
RGAmodelComparisonPlot
```

# Pathway Analysis of a PCOSP Model

Confident that the PCOSP model performs better than random chance at predicting patient survival, we can now look into the features that are most relevant to these predictions. Once we have extracted these features, we will use the `runGSEA` method to evaluate which pathways they are representative of.
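Pathway over-representation tests of this kind (here performed internally via `piano::runGSAhyper`) essentially ask whether the model's genes overlap a gene set more than expected by chance, which boils down to a hypergeometric (one-sided Fisher's exact) test per gene set. For intuition, here is a minimal base-R version of that calculation with made-up counts; the real analysis below uses the package methods.

```{r enrichment_intuition}
# Hypothetical counts, for illustration only: 40 genes from the top models,
# 15 of which fall in a 300-gene pathway, out of 20000 background genes.
nSelected <- 40; nPathway <- 300; nBackground <- 20000; nOverlap <- 15

contingency <- matrix(c(
    nOverlap,                                       # selected, in pathway
    nSelected - nOverlap,                           # selected, not in pathway
    nPathway - nOverlap,                            # not selected, in pathway
    nBackground - nSelected - nPathway + nOverlap   # neither
), nrow=2)

# One-sided Fisher's exact test for over-representation
fisher.test(contingency, alternative='greater')$p.value
```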
## Get the Top Predictive Features

We can get the most predictive features from a `SurvivalModel` object using the `getTopFeatures` method. By default this retrieves the gene names from the top 10% of models in the `SurvivalModel`. Users can customize the number of top models to extract genes from using the `numModels` parameter.

```{r PCOSP_get_top_features}
topFeatures <- getTopFeatures(validatedPCOSPmodel)
topFeatures
```

## Querying Gene Sets for Enriched Pathways

To perform pathway analysis, which is done internally using the `piano::runGSAhyper` function, we first need pathway data to query against. We recommend using the `msigdbr` R package to fetch pathways of interest. Because of the small number of genes included in the sample patient cohorts, this section will not find any enriched pathways and therefore this code is not run.

```{r msigdbr_get_pathways, eval=FALSE}
allHumanGeneSets <- msigdbr()
allGeneSets <- as.data.table(allHumanGeneSets)
geneSets <- allGeneSets[grepl('^GO.*|.*CANONICAL.*|^HALLMARK.*', gs_name),
    .(gene_symbol, gs_name)]
```

```{r PCOSP_runGSEA, eval=FALSE}
GSEAresultDT <- runGSEA(validatedPCOSPmodel, geneSets)
```

# Clinical Model Comparison

Equipped with a risk prediction model that performs better than random chance, the next pertinent question is: does it outperform simpler models? To answer this question we have included the `ClinicalModel` class, which leverages the `glm` function from the `stats` package to fit a generalized linear model based on a set of clinical variables in the patient metadata. To do this, we need to specify a model formula based on the column names available in the `colData` slot of our training data `SurvivalExperiment` objects. All formula parameters must be valid column names in the `colData` of the training cohorts, and the same patient metadata must be present in any validation cohorts used to make risk predictions.

## Build the Model

We can see there are a number of clinical variables related to patient survival in the patient metadata. For our model, we will use patient sex, age and tumour grade along with the TNM staging of the tumour.

```{r training_data_patient_metadata}
knitr::kable(head(colData(ICGCtrainCohorts)))
```

```{r ClinicalModel_constructor}
clinicalModel <- ClinicalModel(ICGCtrainCohorts,
    formula='prognosis ~ sex + age + T + N + M + grade', randomSeed=1987)
```

## Train the Model

```{r ClinicalModel_training}
trainedClinicalModel <- trainModel(clinicalModel)
```

## Predict the Classes

Clinical annotations are only available for a subset of our patient cohorts, so we subset our cohort list to ensure the model works as expected. Levels of the model variables which were not present in the training data will be dropped automatically when making predictions. To prevent this behaviour, add any missing levels to the `colData` of the training data before training the model. The easiest way to achieve this is to convert the columns in question into factors and ensure the same full set of levels is defined for those columns in both the training and validation data, as sketched below.
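For example, here is a hedged sketch of what that harmonization could look like for the `grade` column. The level set shown is hypothetical and should be replaced with the values actually observed across your cohorts, and in a real analysis this step would be done before constructing and training the `ClinicalModel`; the chunk is therefore not evaluated.

```{r harmonize_factor_levels, eval=FALSE}
# Hypothetical full set of tumour grade levels -- replace with the values
# actually present in your data
gradeLevels <- c('G1', 'G2', 'G3', 'G4')

# Apply the same level set to the training data...
trainColData <- colData(ICGCtrainCohorts)
trainColData$grade <- factor(trainColData$grade, levels=gradeLevels)
colData(ICGCtrainCohorts) <- trainColData

# ...and to each validation cohort with clinical annotations
for (cohort in c('ICGCMICRO', 'TCGA', 'PCSI', 'OUH')) {
    valColData <- colData(PCOSPpredValCohorts[[cohort]])
    valColData$grade <- factor(valColData$grade, levels=gradeLevels)
    colData(PCOSPpredValCohorts[[cohort]]) <- valColData
}
```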
```{r ClinicalModel_prediction}
hasModelParamsCohortList <- PCOSPpredValCohorts[c('ICGCMICRO', 'TCGA', 'PCSI', 'OUH')]

clinicalPredCohortList <- predictClasses(hasModelParamsCohortList,
    model=trainedClinicalModel)
```

## Validate the Model

```{r ClinicalModel_validation}
validatedClinicalModel <- validateModel(trainedClinicalModel, clinicalPredCohortList)
```

## Visualize Comparison with PCOSP Model

To get a general idea of how the `ClinicalModel` performed relative to the `PCOSP` model, the `barPlotModelComparison` method can be used to plot a chosen performance statistic for both models side by side in each validation cohort. For this example, we compare the AUC of the models in each of the validation cohorts; other statistics in the `validationStats` slot can be used by changing the `stat` argument to the plotting function.

```{r ClinicalModel_vs_PCOSP_AUC_barplot, fig.wide=TRUE, fig.width=8, fig.height=8}
clinicalVsPCOSPbarPlot <- barPlotModelComparison(validatedClinicalModel,
    validatedPCOSPmodel, stat='AUC')
clinicalVsPCOSPbarPlot
```

To reduce the number of plotting functions and simplify meta-analysis of many models of different types, we have included the `ModelComparison` object. This object aggregates the validation statistics from two models. Plotting methods can then be dispatched on this class, greatly simplifying the process of comparing several `SurvivalModel` sub-classes. You can also compare a model to an existing model comparison, and in this way build up a collection of model performance statistics which can be visualized in a forest plot to enable complex comparisons between many models. Below, we use this feature to compare the `PCOSP` and `ClinicalModel` models based on the D index and concordance index, as calculated using the `survcomp` R package.

```{r ModelComparison_object}
clinicalVsPCOSP <- compareModels(validatedClinicalModel, validatedPCOSPmodel)
```

```{r ClinicalModel_vs_PCOSP_dIndex, fig.wide=TRUE, fig.height=8, fig.width=8}
clinVsPCOSPdIndexForestPlot <- forestPlot(clinicalVsPCOSP, stat='log_D_index')
clinVsPCOSPdIndexForestPlot
```

```{r ClinicalModel_vs_PCOSP_concIndex, fig.wide=TRUE, fig.height=8, fig.width=8}
clinVsPCOSPconcIndexForestPlot <- forestPlot(clinicalVsPCOSP, stat='concordance_index')
clinVsPCOSPconcIndexForestPlot
```

# Comparing PCOSP Models to Existing Published Classifiers

To get an idea of how our PCOSP model stacks up against other classifiers from the literature, we have included the `GeneFuModel` class. This `SurvivalModel` performs risk prediction with a set of genes and corresponding coefficients using the `genefu::sig.score` function. We have included data for three published classifiers, from Chen *et al.* (2015), Birnbaum *et al.* (2017) and Haider *et al.* (2014).

## Make the Models

Since there is no training data for the published classifiers, we create empty `GeneFuModel` objects, then assign the model predictors to the `models` slot of each respective object.

```{r GeneFuModel_constructor}
chenGeneFuModel <- GeneFuModel(randomSeed=1987)
birnbaumGeneFuModel <- GeneFuModel(randomSeed=1987)
haiderGeneFuModel <- GeneFuModel(randomSeed=1987)
```

For the Haider model, we were unable to obtain the genes and coefficients; however, the authors provided their per-sample risk scores. As a result we need to do a bit of work to get the Haider `GeneFuModel` to work with the package functions.
```{r GeneFuModel_assign_models}
data(existingClassifierData)

models(chenGeneFuModel) <- SimpleList(list(chen=chen))
models(birnbaumGeneFuModel) <- SimpleList(list(birnbaum=birnbaum))
models(haiderGeneFuModel) <- SimpleList(list(haider=NA))
```

## Predict the Classes

```{r GeneFuModel_predictions}
chenClassPredictions <- predictClasses(PCOSPpredValCohorts[names(haiderSigScores)],
    model=chenGeneFuModel)
birnClassPredictions <- predictClasses(PCOSPpredValCohorts[names(haiderSigScores)],
    model=birnbaumGeneFuModel)
```

```{r GeneFuModel_haider_fix}
haiderClassPredictions <- PCOSPpredValCohorts[names(haiderSigScores)]
# Manually assign the scores to the prediction cohorts
for (i in seq_along(haiderClassPredictions)) {
    colMData <- colData(haiderClassPredictions[[i]])
    colMData$genefu_score <- NA_real_
    colMData[rownames(colMData) %in% names(haiderSigScores[[i]]), ]$genefu_score <-
        haiderSigScores[[i]][names(haiderSigScores[[i]]) %in% rownames(colMData)]
    colData(haiderClassPredictions[[i]]) <- colMData
}
# Setup the correct model metadata
mcols(haiderClassPredictions)$hasPredictions <- TRUE
metadata(haiderClassPredictions)$predictionModel <- haiderGeneFuModel
```

## Validate the Models

```{r GeneFuModel_validation}
validatedChenModel <- validateModel(chenGeneFuModel, valData=chenClassPredictions)
validatedBirnbaumModel <- validateModel(birnbaumGeneFuModel, valData=birnClassPredictions)
validatedHaiderModel <- validateModel(haiderGeneFuModel, valData=haiderClassPredictions)
```

## Model Performance Meta-Analysis

```{r comparing_GeneFuModels}
genefuModelComparisons <- compareModels(validatedChenModel, validatedBirnbaumModel,
    modelNames=c('Chen', 'Birnbaum'))
genefuModelComparisons <- compareModels(genefuModelComparisons, validatedHaiderModel,
    model2Name='Haider')
```

```{r comparing_comparisons}
allModelComparisons <- compareModels(genefuModelComparisons, validatedPCOSPmodel,
    model2Name='PCOSP')
# We are only interested in comparing the summaries, so we subset our model comparison
allModelComparisons <- subset(allModelComparisons, isSummary == TRUE)
```

```{r plotting_model_comparisons_dindex, fig.width=8, fig.height=8, fig.wide=TRUE}
allDindexComparisonForestPlot <- forestPlot(allModelComparisons, stat='log_D_index',
    colourBy='model', groupBy='mDataType')
allDindexComparisonForestPlot
```

```{r plotting_model_comparisons_concindex, fig.width=8, fig.height=8, fig.wide=TRUE}
allConcIndexComparisonForestPlot <- forestPlot(allModelComparisons,
    stat='concordance_index', colourBy='model', groupBy='mDataType')
allConcIndexComparisonForestPlot
```

From the two forest plots, we can see that the `PCOSP` model matched or outperformed all of the published classifiers, even when trained with only 15 models. It is likely that with 1000 models we would see even better separation of the `PCOSP` model; indeed, this was the result in Sandhu *et al.* (2019).
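The summary rows in these forest plots pool the per-cohort estimates across cohorts. If you would rather work with those numbers directly, one common way to pool them is fixed-effect (inverse-variance) weighting; whether PDATK uses exactly this scheme should be checked against its documentation. The sketch below assumes the `validationStats` table has columns named `statistic`, `estimate`, `se` and `isSummary` (check `colnames()` on your own object, as the exact names may differ), and so is not evaluated.

```{r pooled_d_index_sketch, eval=FALSE}
library(data.table)

statsDT <- as.data.table(validationStats(validatedPCOSPmodel))
# Assumed column names: statistic, estimate, se, isSummary
dIndexDT <- statsDT[statistic == 'log_D_index' & isSummary == FALSE]

# Fixed-effect (inverse-variance) pooling of the per-cohort log D indices
weights <- 1 / dIndexDT$se^2
pooled <- sum(weights * dIndexDT$estimate) / sum(weights)
pooledSE <- sqrt(1 / sum(weights))

c(estimate=pooled, lower=pooled - 1.96 * pooledSE, upper=pooled + 1.96 * pooledSE)
```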
# Comparing PCOSP Models By Patient Subtype

Model performance can also be examined by patient subtype. To do so, we first annotate the validation cohorts with their subtypes using `assignSubtypes`, then revalidate the trained model and group the forest plots by cohort, colouring by subtype.

```{r adding_subtypes_to_CohortList}
data(cohortSubtypeDFs)

# Add the subtypes to the prediction cohorts
subtypedPCOSPValCohorts <- assignSubtypes(PCOSPpredValCohorts, cohortSubtypeDFs)
```

```{r validateModel_with_subtypes}
subtypeValidatedPCOSPmodel <- validateModel(trainedPCOSPmodel,
    valData=subtypedPCOSPValCohorts)
```

```{r forestPlot_Dindex_subtyped_PCOSP_model, fig.width=8, fig.height=8, fig.wide=TRUE}
forestPlot(subtypeValidatedPCOSPmodel, stat='log_D_index', groupBy='cohort',
    colourBy='subtype')
```

```{r forestPlot_Cindex_subtyped_PCOSP_model, fig.width=8, fig.height=8, fig.wide=TRUE}
forestPlot(subtypeValidatedPCOSPmodel, stat='concordance_index', groupBy='cohort',
    colourBy='subtype')
```

# References

1. Sandhu V, Labori KJ, Borgida A, et al. Meta-Analysis of 1,200 Transcriptomic Profiles Identifies a Prognostic Model for Pancreatic Ductal Adenocarcinoma. JCO Clin Cancer Inform. 2019;3:1-16. doi:10.1200/CCI.18.00102

2. Chen DT, Davis-Yadley AH, Huang PY, et al. Prognostic Fifteen-Gene Signature for Early Stage Pancreatic Ductal Adenocarcinoma. PLoS One. 2015;10(8):e0133562. doi:10.1371/journal.pone.0133562

3. Birnbaum DJ, Finetti P, Lopresti A, et al. A 25-gene classifier predicts overall survival in resectable pancreatic cancer. BMC Med. 2017;15(1):170. doi:10.1186/s12916-017-0936-z

4. Haider S, Wang J, Nagano A, et al. A multi-gene signature predicts outcome in patients with pancreatic ductal adenocarcinoma. Genome Med. 2014;6(12):105. doi:10.1186/s13073-014-0105-3