% Created 2013-10-20 Sun 18:22 \documentclass{article} <>= BiocStyle::latex() @ \usepackage[utf8]{inputenc} \usepackage[T1]{fontenc} \usepackage{fixltx2e} \usepackage{graphicx} \usepackage{longtable} \usepackage{float} \usepackage{wrapfig} \usepackage{rotating} \usepackage[normalem]{ulem} \usepackage{amsmath} \usepackage{textcomp} \usepackage{marvosym} \usepackage{wasysym} \usepackage{amssymb} \usepackage{hyperref} \tolerance=1000 \author{Julian Gehring, EMBL Heidelberg} \date{\today} \title{SomaticCancerAlterations} \hypersetup{ pdfkeywords={}, pdfsubject={}, pdfcreator={Emacs 24.1.1 (Org mode 8.2.1)}} \begin{document} \maketitle \tableofcontents %\VignetteEngine{knitr::knitr} %\VignetteIndexEntry{SomaticCancerAlterations - PDF} %\VignettePackage{SomaticCancerAlterations} \section{Motivation} \label{sec-1} Over the last years, large efforts have been taken to characterize the somatic landscape of cancers. Many of the conducted studies make their results publicly available, providing a valuable resource for investigating beyond the level of individual cohorts. The \Biocpkg{SomaticCancerAlterations} package collects mutational data of several tumor types, currently focusing on the TCGA calls sets, and aims for a tight integration with \R{} and \Bioconductor{} workflows. In the following, we will illustrate how to access this data and give examples for use cases. <>= library(SomaticCancerAlterations) library(GenomicRanges) library(ggbio) @ %def \section{Data Sets} \label{sec-2} The Cancer Genome Atlas (TCGA)\footnote{\url{http://cancergenome.nih.gov}} is a consortium effort to analyze a variety of tumor types, including gene expression, methylation, copy number changes, and somatic mutations\footnote{\url{https://wiki.nci.nih.gov/display/TCGA/TCGA+Home}}. With the \Biocpkg{SomaticCancerAlterations} package, we provide the callsets of somatic mutations for all publically available TCGA studies. Over time, more studies will be added, as they become available and unrestriced in their usage. To get started, we get a list of all available data sets and access the metadata associated with each study. <>= all_datasets = scaListDatasets() print(all_datasets) meta_data = scaMetadata() print(meta_data) @ %def Next, we load a single dataset with the \Rfunction{scaLoadDataset} function. <>= ov = scaLoadDatasets("ov_tcga", merge = TRUE) @ %def \section{Exploring Mutational Data} \label{sec-3} The somatic variants of each study are represented as a object, ordered by genomic positions. Additional columns describe properties of the variant and relate it the the affected gene, sample, and patient. <>= head(ov, 3) @ %def <>= with(mcols(ov), table(Variant_Classification, Variant_Type)) @ %def With such data at hand, we can identify the samples and genes haboring the most mutations. <<>>= head(sort(table(ov$Sample_ID), decreasing = TRUE)) head(sort(table(ov$Hugo_Symbol), decreasing = TRUE), 10) @ %def \section{Exploring Multiple Studies} \label{sec-4} Instead of focusing on an individual study, we can also import several at once. The results are stored as a \Rclass{GRangesList} in which each element corresponds to a single study. This can be merged into a single \Rclass{GRanges} object with \texttt{merge = TRUE}. <>= three_studies = scaLoadDatasets(all_datasets[1:3]) print(elementLengths(three_studies)) class(three_studies) @ %def <>= merged_studies = scaLoadDatasets(all_datasets[1:3], merge = TRUE) class(merged_studies) @ %def We then compute the number of mutations per gene and study: <>= gene_study_count = with(mcols(merged_studies), table(Hugo_Symbol, Dataset)) gene_study_count = gene_study_count[order(apply(gene_study_count, 1, sum), decreasing = TRUE), ] gene_study_count = addmargins(gene_study_count) head(gene_study_count) @ %def Further, we can subset the data by regions of interests, and compute descriptive statistics only on the subset. <>= tp53_region = GRanges("17", IRanges(7571720, 7590863)) tp53_studies = subsetByOverlaps(merged_studies, tp53_region) @ %def For example, we can investigate which type of somatic variants can be found in TP53 throughout the studies. <>= addmargins(table(tp53_studies$Variant_Classification, tp53_studies$Dataset)) @ %def To go further, how many patients have mutations in TP53 for each cancer type? <>= fraction_mutated_region = function(y, region) { s = subsetByOverlaps(y, region) m = length(unique(s$Patient_ID)) / metadata(s)$Number_Patients return(m) } mutated_fraction = sapply(three_studies, fraction_mutated_region, tp53_region) mutated_fraction = data.frame(name = names(three_studies), fraction = mutated_fraction) @ %def <>= library(ggplot2) p = ggplot(mutated_fraction) + ggplot2::geom_bar(aes(x = name, y = fraction, fill = name), stat = "identity") + ylim(0, 1) + xlab("Study") + ylab("Ratio") + theme_bw() print(p) @ %def \section{Data Provenance} \label{sec-5} \subsection{TCGA Data} \label{sec-5-1} When importing the mutation data from the TCGA servers, we checked the data for consistency and fix common ambiguities in the annotation. \subsubsection{Processing} \label{sec-5-1-1} \begin{enumerate} \item Selection of the most recent somatic variant calls for each study. These were stored as \texttt{*.maf} files in the TCGA data directory\footnote{\url{https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/}}. If both manually curated and automatically generated variant calls were available, the curated version was chosen. \item Importing of the \texttt{*.maf} files into \R{} and checking for consistency with the TCGA MAF specifications\footnote{\url{https://wiki.nci.nih.gov/display/TCGA/Mutation+Annotation+Format+(MAF)}+Specification}. Please note that these guidelines are currently only suggestions and most TCGA files violate some of these. \item Transformation of the imported variants into a \Robject{GRanges} object, with one row for each reported variant. Only columns related to the genomic origin of the somatic variant were stored, additional columns describing higher-level effects, such as mutational consequences and alterations at the protein level, were dropped. The \Robject{seqlevels} information defining the chromosomal ranges were taken from the 1000genomes phase 2 reference assembly\footnote{\url{ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/}}. \item The patient barcode was extracted from the sample barcode. \item Metadata describing the design and analysis of the study was extracted. \item The processed variants were written to disk, with one file for each study. The metadata for all studies were stored as a single, separate object. \end{enumerate} \subsubsection{Selection Criteria of Data Sets} \label{sec-5-1-2} We included data sets in the package that were \begin{itemize} \item conducted by the Broad Institute. \item cleared for unrestricted access and usage\footnote{\url{http://cancergenome.nih.gov/abouttcga/policies/publicationguidelines}}. \item sequenced with Illumina platforms. \end{itemize} \subsubsection{Consistency Check} \label{sec-5-1-3} According to the TCGA specifications for the \texttt{MAF} files, we screened and corrected for common artifacts in the data regarding annotation. This included: \begin{itemize} \item Transfering of all genomic coordinates to the NCBI 37 reference notation (with the chromosome always depicted as 'MT') \item Checking of the entries against all allowed values for this field (currently for the columns \texttt{Hugo\_Symbol}, \texttt{Chromosome}, \texttt{Strand}, \texttt{Variant\_Classification}, \texttt{Variant\_Type}, \texttt{Reference\_Allele}, \texttt{Tumor\_Seq\_Allele1}, \texttt{Tumor\_Seq\_Allele2}, \texttt{Verification\_Status}, \texttt{Validation\_Status}, \texttt{Sequencer}). \end{itemize} \section{Alternatives} \label{sec-6} The TCGA data sets can be accessed in different ways. First, the TCGA itself offers access to certain types of its collected data\footnote{\url{https://tcga-data.nci.nih.gov/tcga/tcgaDownload.jsp}}. Another approach has been taken by the cBioPortal for Cancer Genomics\footnote{\url{http://www.cbioportal.org/public-portal}} which has performed high-level analyses of several TCGA data sources, such as gene expression and copy number changes. This summarized data can be queried through an \R{} interface\footnote{\url{http://www.cbioportal.org/public-portal/cgds_r.jsp}}. \section{Session Info} \label{sec-7} <>= sessionInfo() @ %def % Emacs 24.1.1 (Org mode 8.2.1) \end{document}