%\VignetteIndexEntry{Using ssPATHS} \documentclass{article} \title{ssPATHS: Single Sample PATHway Score (version 0.1.0)} \date{2019\\ July} \author{Natalie R. Davidson, Philipp Markolin, Gunnar R{\"a}tsch \\ Biomedical Informatics, ETH Z{\"u}rich} \begin{document} \maketitle \SweaveOpts{concordance=TRUE} \SweaveOpts{keep.source=TRUE} \DefineVerbatimEnvironment{Sinput}{Verbatim} {xleftmargin=2em} \DefineVerbatimEnvironment{Soutput}{Verbatim}{xleftmargin=2em} \DefineVerbatimEnvironment{Scode}{Verbatim}{xleftmargin=2em} \fvset{listparameters={\setlength{\topsep}{0pt}}} \renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}} <>= options(width=60) @ <>= options(continue=" ") @ \section{Introduction} Precision oncology requires that a single patient can be accurately and meaningfully characterized in order to tailor treatments. Using our method, ssPATHS, we are able to estimate pathway deviations for a single patient, by first learning a discriminative gene weighting from a reference cohort. In this vignette, we use TCGA gene expression data to learn a weighting on hypoxia-related genes then apply it to PDAC cancer cell lines to estimate the level of hypoxia within each sample. \section{Formatting Data} First, we load the appropriate packages we will need. <<>>= library("ROCR") library("ggplot2") library("ssPATHS") @ Next, we will read in the TCGA data and format it appropriately. <<>>= data(tcga_expr_df) @ Let's look at the format of our data. To learn our gene weights we will need a \emph{Y} column and a \emph{sample\_id} column. <<>>= tcga_expr_df[1:6,1:5] @ Now we will need to transform the data.frame into a SummarizedExperiment object. <<>>= tcga_se <- SummarizedExperiment(t(tcga_expr_df[ , -(1:4)]), colData=tcga_expr_df[ , 2:4]) colnames(tcga_se) <- tcga_expr_df$tcga_id colData(tcga_se)$sample_id <- tcga_expr_df$tcga_id @ Since we are interested in hypoxia, we want to learn a weighting only on genes associated with hypoxia. In this package we have a helper function to retreive these genes, but other gene sets can be used for different pathways of interest. Gene sets can be easily fetched using the package \emph{msigdbr}. <<>>= hypoxia_gene_ids <- get_hypoxia_genes() hypoxia_gene_ids <- intersect(hypoxia_gene_ids, rownames(tcga_se)) @ Now we will need to identify how we want to discriminate our samples. Here, we use the assumption that normal samples are less hypoxic than tumor samples. Therefore, we will use the \emph{is\_normal} column as our \emph{Y} column. We set normal samples to 0 and tumor samples to 1. This implies that a higher score indicates a more hypoxic sample. <<>>= colData(tcga_se)$Y <- ifelse(colData(tcga_se)$is_normal, 0, 1) @ \section{Get Reference Gene Weightings } Now that our data is in the appropriate format, we can learn the weightings. Within the method \emph{get\_gene\_weights}, there is a normalization step where we scale across all the genes available in the SummarizedExperiment assay. For this to be stable and consistent, we recommend that the assay contain at least 500 genes that are consistently expressed across all samples in addition to the genes in the pathway of interest. We also assume the most of the genes in our hypoxia geneset are increasing, meaning that our genes move unidirectionally. <<>>= res <- get_gene_weights(tcga_se, hypoxia_gene_ids, unidirectional=TRUE) gene_weights <- res[[1]] sample_scores <- res[[2]] @ Now let's see how well we did in seperating the two classes defined by \emph{Y}. <>= training_res <- get_classification_accuracy(sample_scores, positive_val=1) # plot the ROC curve plot(training_res[[4]], col="blue", ylim=c(0, 1)) roc_text <- paste("AUC:", round(training_res$auc_roc,3)) legend(0.1,0.8, roc_text, border="white",cex=1,box.col = "white") @ <>= # plot the PR curve plot(training_res[[3]], col="orange", ylim=c(0, 1)) pr_text <- paste("AUC:", round(training_res$auc_pr,3)) legend(0.1,0.8, pr_text, border="white",cex=1,box.col = "white") @ \section{Apply Gene Weightings to New Samples} Now, using the gene weightings learned from the reference set, we can apply it to a new sample. <<>>= data(new_samp_df) new_samp_se <- SummarizedExperiment(t(new_samp_df[ , -(1)]), colData=new_samp_df[ , 1, drop=FALSE]) colnames(colData(new_samp_se)) <- "sample_id" new_score_df <- get_new_samp_score(gene_weights, new_samp_se) new_score_df @ Now lets see if the derived score match our experimental expectations. Samples with \textbf{hyp} or \textbf{norm} in the sample id are cell lines that were exposed to hypoxic or normoxic conditions respectively. Samples with \textbf{ctrl} or \textbf{noHIF} were samples that were able to produce a HIF-mediated hypoxic response or not, respectively. <>= plot_scores <- function(hif_scores){ # format the sample IDS hif_scores$sample_type <- substr(hif_scores$sample_id, 1, nchar((hif_scores$sample_id))-2) colnames(hif_scores)[2] <- "pathway_score" gg <- ggplot(hif_scores, aes(x=sample_type, y=pathway_score, fill=sample_type)) + geom_boxplot() + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + theme_bw() return(gg) } gg <- plot_scores(as.data.frame(new_score_df)) print(gg) @ Accordingly, we find the \textbf{hyp\_ctrl} samples have the highest pathway score. According to our labeling (tumor/hypoxic is 1 and normal/normoxic is 0), this implies that these samples are the most hypoxic. Furthermore, we see that the samples that were not able to produce a hypoxic response, even in the absence of oxygen (\textbf{hyp\_noHIF}) are found to have similar score to the normoxic samples. \end{document}