MultiAssayExperiment
curatedTCGAData
cBioPortalData
docker run -e PASSWORD=bioc -p 8787:8787 mr148/multiassayworkshop:latest
The workshop uses a Docker container with Bioconductor devel version
3.14
. If you would like to install Bioconductor on your
computer at a later date, see the Bioconductor
installation instructions.
Here is a list of packages that we will be using:
library(MultiAssayExperiment)
library(curatedTCGAData)
library(cBioPortalData)
library(TCGAutils)
library(SingleCellMultiModal)
library(UpSetR)
library(GenomicDataCommons)
Please use these citations (Ramos et al.
2017) and (Ramos et al. 2020) when
using MultiAssayExperiment
, curatedTCGAData
,
or cBioPortalData
. Your citations are very much
appreciated!
Package | Description |
---|---|
MultiAssayExperiment | Infrastructure package to represent multi-omics data |
curatedTCGAData | Downloads TCGA data from ExperimentHub in MultiAssayExperiment form |
cBioPortalData | Access over 300 study datasets from the cBio Genomics Portal |
TCGAutils | Make use of utility functions for working with TCGA data |
SingleCellMultiModal | Obtain single cell data from various multi-modality studies |
MultiAssayExperiment
SummarizedExperiment
representation
for expression datacBioPortalData
getStudies()
for a list of available studiescuratedTCGAData
RTCGAToolbox
,
GenomicDataCommons
, TCGAbiolinks
,
cBioPortal
website, Broad GDAC Firehose, and morehg19
dataMultiAssayExperiment
representationsAvailable Studies
– (curatedTCGAData section) A list
of available cancer studies from
TCGAutils::diseaseCodes
.
OmicsTypes
– A descriptive table of ’omics types in
curatedTCGAData
(thanks to Ludwig G.
@lgeistlinger
)
SingleCellMultiModal
MultiAssayExperiment
data representations. Some
representations are out of memory using the HDF5
format as
well as the MTX
format.TCGAutils
curatedTCGAData
outputsIt provides convenience / helper functions in three major areas:
For the cheatsheet reference table, see the TCGAutils Cheatsheet.
To better understand how it all fits together, this schematic shows the relationship among all as part of the curatedTCGAData pipeline.
This section summarizes three fundamental data classes for the representation of multi-omics experiments.
(Ranged)SummarizedExperiment
rowData
)colData
)metadata
).RangedSummarizedExperiment
associates a
GRanges
or GRangesList
vector with the
rowsNote. Many other classes for experimental data are actually
derived from SummarizedExperiment
(e.g.,
SingleCellExperiment
for single-cell RNA sequencing
experiments)
library(SingleCellExperiment)
extends("SingleCellExperiment")
## [1] "SingleCellExperiment" "RangedSummarizedExperiment"
## [3] "SummarizedExperiment" "RectangularData"
## [5] "Vector" "Annotated"
## [7] "vector_OR_Vector"
RaggedExperiment
.vcf
files, and other ragged array
schema for genomic location data.GRangesList
class in
GenomicRanges
showClass("RaggedExperiment")
## Class "RaggedExperiment" [package "RaggedExperiment"]
##
## Slots:
##
## Name: assays rowidx colidx metadata
## Class: GRangesList integer integer list
##
## Extends: "Annotated"
RaggedExperiment
provides a flexible set of _*Assay_
methods to support transformation of ranged list data to matrix
format.
MultiAssayExperiment
GRanges
objects,
such as gene expression or copy number.matrix
: the most basic class for ID-based datasets,
could be used for example for gene expression summarized per-gene,
microRNA, metabolomics, or microbiome data.SummarizedExperiment
and derived methods: described
above, could be used for miRNA, gene expression, proteomics, or any
matrix-like data where measurements are represented by IDs.RangedSummarizedExperiment
: described above, could be
used for gene expression, methylation, or other data types referring to
genomic positions.ExpressionSet
: Another rich representation for ID-based
datasets, supported only for legacy reasonsRaggedExperiment
: described above, for non-rectangular
(ragged) ranged-based datasets such as segmented copy number, where
segmentation of copy number alterations occurs and different genomic
locations in each sample.RangedVcfStack
: For VCF archives broken up by
chromosome (see VcfStack
class defined in the
GenomicFiles
package)DelayedMatrix
: An on-disk representation of matrix-like
objects for large datasets. It reduces memory usage and optimizes
performance with delayed operations. This class is part of the
DelayedArray
package.Note. Data classes that support row and column naming and
subsetting may be used in a MultiAssayExperiment
.
MatchedAssayExperiment
MultiAssayExperiment
# coercion
as(x, "MatchedAssayExperiment")
# construction from MAE
MatchedAssayExperiment(mae)
Note. The MultiAssayExperiment
package provides
functionality to merge replicate profiles for a single patient
(mergeReplicates()
).
Key points
MultiAssayExperiment
coordinates different Bioconductor
classes into one unified objectMultiAssayExperiment
is an infrastructure package while
curatedTCGAData
and cBioPortalData
provide
data on cancer studies including TCGAMultiAssayExperiment
miniACC
DemoGet started by trying out MultiAssayExperiment
using a
subset of the TCGA adrenocortical carcinoma (ACC) dataset provided with
the package. This dataset provides five assays on 92 patients, although
all five assays were not performed for every patient:
data("miniACC")
miniACC
## A MultiAssayExperiment object of 5 listed
## experiments with user-defined names and respective classes.
## Containing an ExperimentList class object of length 5:
## [1] RNASeq2GeneNorm: SummarizedExperiment with 198 rows and 79 columns
## [2] gistict: SummarizedExperiment with 198 rows and 90 columns
## [3] RPPAArray: SummarizedExperiment with 33 rows and 46 columns
## [4] Mutations: matrix with 97 rows and 90 columns
## [5] miRNASeqGene: SummarizedExperiment with 471 rows and 80 columns
## Functionality:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample coordination DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
## exportClass() - save data to flat files
shiny
DemoClick Here to
open the shiny
tutorial.
Key points
MultiAssayExperiment
objectexperiments
which extracts the
ExperimentList
The MultiAssayExperiment
constructor function accepts
three arguments:
experiments
- An ExperimentList
or
list
of rectangular datacolData
- A DataFrame
describing the
patients (or cell lines, or other biological units)sampleMap
- A DataFrame
of
assay
, primary
, and colname
identifiersThe miniACC object can be reconstructed as follows:
MultiAssayExperiment(
experiments = experiments(miniACC),
colData = colData(miniACC),
sampleMap = sampleMap(miniACC),
metadata = metadata(miniACC)
)
## A MultiAssayExperiment object of 5 listed
## experiments with user-defined names and respective classes.
## Containing an ExperimentList class object of length 5:
## [1] RNASeq2GeneNorm: SummarizedExperiment with 198 rows and 79 columns
## [2] gistict: SummarizedExperiment with 198 rows and 90 columns
## [3] RPPAArray: SummarizedExperiment with 33 rows and 46 columns
## [4] Mutations: matrix with 97 rows and 90 columns
## [5] miRNASeqGene: SummarizedExperiment with 471 rows and 80 columns
## Functionality:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample coordination DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
## exportClass() - save data to flat files
prepMultiAssay
The prepMultiAssay
function allows the user to diagnose
typical problems when creating a MultiAssayExperiment
object. See ?prepMultiAssay
for more details.
c
functionThe c
function allows the user to concatenate an
additional experiment to an existing MultiAssayExperiment
.
The optional sampleMap
argument allows concatenating an
assay whose column names do not match the row names of
colData
. For convenience, the mapFrom argument
allows the user to map from a particular experiment
provided that the order of the
colnames is in the same. A warning
will be
issued to make the user aware of this assumption. For example, to
concatenate a matrix of log2-transformed RNA-seq results:
miniACC2 <- c(
miniACC,
log2rnaseq = log2(assays(miniACC)$RNASeq2GeneNorm),
mapFrom=1L
)
## Warning: Assuming column order in the data provided
## matches the order in 'mapFrom' experiment(s) colnames
experiments(miniACC2)
## ExperimentList class object of length 6:
## [1] RNASeq2GeneNorm: SummarizedExperiment with 198 rows and 79 columns
## [2] gistict: SummarizedExperiment with 198 rows and 90 columns
## [3] RPPAArray: SummarizedExperiment with 33 rows and 46 columns
## [4] Mutations: matrix with 97 rows and 90 columns
## [5] miRNASeqGene: SummarizedExperiment with 471 rows and 80 columns
## [6] log2rnaseq: matrix with 198 rows and 79 columns
This slot is a DataFrame
describing the characteristics
of biological units, for example clinical data for patients. In the
prepared datasets from The
Cancer Genome Atlas, each row is one patient and each column is a
clinical, pathological, subtype, or other variable. The $
function provides a shortcut for accessing or setting
colData
columns.
colData(miniACC)[1:4, 1:4]
## DataFrame with 4 rows and 4 columns
## patientID years_to_birth vital_status days_to_death
## <character> <integer> <integer> <integer>
## TCGA-OR-A5J1 TCGA-OR-A5J1 58 1 1355
## TCGA-OR-A5J2 TCGA-OR-A5J2 44 1 1677
## TCGA-OR-A5J3 TCGA-OR-A5J3 23 0 NA
## TCGA-OR-A5J4 TCGA-OR-A5J4 23 1 423
table(miniACC$race)
##
## asian black or african american white
## 2 1 78
Note. MultiAssayExperiment
supports both missing
observations and replicate observations, i.e., one row of
colData
can map to 0, 1, or more columns of any of the
experimental data matrices. One could therefore treat replicate
observations as one or multiple rows of colData
. This can
result in different subsetting, duplicated()
, and
wideFormat()
behaviors.
Note. Multiple time points, or distinct biological replicates, can be
separate rows of the colData
.
Key points
Experimental datasets can be input as either a base list
or ExperimentList
object for the set of samples collected.
To see the experiments use the experiments
getter
function.
experiments(miniACC)
## ExperimentList class object of length 5:
## [1] RNASeq2GeneNorm: SummarizedExperiment with 198 rows and 79 columns
## [2] gistict: SummarizedExperiment with 198 rows and 90 columns
## [3] RPPAArray: SummarizedExperiment with 33 rows and 46 columns
## [4] Mutations: matrix with 97 rows and 90 columns
## [5] miRNASeqGene: SummarizedExperiment with 471 rows and 80 columns
Note. Each matrix column must correspond to exactly one row in
colData
. In other words, each patient or cell line must be
traceable. However, multiple columns can come from the same patient, or
there can be no data for that patient.
Key points:
ExperimentList
elements can be genomic range-based
(e.g. SummarizedExperiment
or
RaggedExperiment
) or ID-based dataNote. Any data class can be included in the
ExperimentList
, as long as it supports: single-bracket
subsetting ([
), dimnames
, and
dim
. Most data classes defined in Bioconductor meet these
requirements.
sampleMap
is a graph representation of the relationship
between biological units and experimental results. In simple cases where
the column names of ExperimentList
data matrices match the
row names of colData
, the user won’t need to specify or
think about a sample map, it can be created automatically by the
MultiAssayExperiment
constructor. sampleMap
is
a simple three-column DataFrame
:
assay
column: the name of the assay, and found in the
names of ExperimentList
list namesprimary
column: identifiers of patients or biological
units, and found in the row names of colData
colname
column: identifiers of assay results, and found
in the column names of ExperimentList
elements Helper
functions are available for creating a map from a list. See
?listToMap
sampleMap(miniACC)
## DataFrame with 385 rows and 3 columns
## assay primary colname
## <factor> <character> <character>
## 1 RNASeq2GeneNorm TCGA-OR-A5J1 TCGA-OR-A5J1-01A-11R..
## 2 RNASeq2GeneNorm TCGA-OR-A5J2 TCGA-OR-A5J2-01A-11R..
## 3 RNASeq2GeneNorm TCGA-OR-A5J3 TCGA-OR-A5J3-01A-11R..
## 4 RNASeq2GeneNorm TCGA-OR-A5J5 TCGA-OR-A5J5-01A-11R..
## 5 RNASeq2GeneNorm TCGA-OR-A5J6 TCGA-OR-A5J6-01A-31R..
## ... ... ... ...
## 381 miRNASeqGene TCGA-PA-A5YG TCGA-PA-A5YG-01A-11R..
## 382 miRNASeqGene TCGA-PK-A5H8 TCGA-PK-A5H8-01A-11R..
## 383 miRNASeqGene TCGA-PK-A5H9 TCGA-PK-A5H9-01A-11R..
## 384 miRNASeqGene TCGA-PK-A5HA TCGA-PK-A5HA-01A-11R..
## 385 miRNASeqGene TCGA-PK-A5HB TCGA-PK-A5HB-01A-11R..
Key points:
colnames
) to
colData
Metadata can be used to keep additional information about patients,
assays performed on individuals or on the entire cohort, or features
such as genes, proteins, and genomic ranges. There are many options
available for storing metadata. First, MultiAssayExperiment
has its own metadata for describing the entire experiment:
metadata(miniACC)
## $title
## [1] "Comprehensive Pan-Genomic Characterization of Adrenocortical Carcinoma"
##
## $PMID
## [1] "27165744"
##
## $sourceURL
## [1] "http://s3.amazonaws.com/multiassayexperiments/accMAEO.rds"
##
## $RPPAfeatureDataURL
## [1] "http://genomeportal.stanford.edu/pan-tcga/show_target_selection_file?filename=Allprotein.txt"
##
## $colDataExtrasURL
## [1] "http://www.cell.com/cms/attachment/2062093088/2063584534/mmc3.xlsx"
Additionally, the DataFrame
class used by
sampleMap
and colData
, as well as the
ExperimentList
class, similarly support metadata. Finally,
many experimental data objects that can be used in the
ExperimentList
support metadata. These provide flexible
options to users and to developers of derived classes.
curatedTCGAData
Most unrestricted TCGA data (from 33 cancer types) are available as
MultiAssayExperiment
objects from the
curatedTCGAData
package. This represents a lot of
harmonization!
Here we list the available data for the Adrenocortical Carcinoma
("ACC"
) cancer type:
library(curatedTCGAData)
curatedTCGAData("ACC", version = "2.0.1", dry.run = TRUE)
## ah_id title file_size
## 1 EH4737 ACC_CNASNP-20160128 0.8 Mb
## 2 EH4738 ACC_CNVSNP-20160128 0.2 Mb
## 3 EH4740 ACC_GISTIC_AllByGene-20160128 0.2 Mb
## 4 EH4741 ACC_GISTIC_Peaks-20160128 0 Mb
## 5 EH4742 ACC_GISTIC_ThresholdedByGene-20160128 0.2 Mb
## 6 EH4744 ACC_Methylation-20160128_assays 239.2 Mb
## 7 EH4745 ACC_Methylation-20160128_se 6 Mb
## 8 EH4746 ACC_miRNASeqGene-20160128 0.1 Mb
## 9 EH4747 ACC_Mutation-20160128 0.7 Mb
## 10 EH4748 ACC_RNASeq2Gene-20160128 2.7 Mb
## 11 EH4749 ACC_RNASeq2GeneNorm-20160128 4 Mb
## 12 EH4750 ACC_RPPAArray-20160128 0.1 Mb
## rdataclass rdatadateadded rdatadateremoved
## 1 RaggedExperiment 2021-01-27 <NA>
## 2 RaggedExperiment 2021-01-27 <NA>
## 3 SummarizedExperiment 2021-01-27 <NA>
## 4 RangedSummarizedExperiment 2021-01-27 <NA>
## 5 SummarizedExperiment 2021-01-27 <NA>
## 6 SummarizedExperiment 2021-01-27 <NA>
## 7 RaggedExperiment 2021-01-27 <NA>
## 8 SummarizedExperiment 2021-01-27 <NA>
## 9 SummarizedExperiment 2021-01-27 <NA>
## 10 SummarizedExperiment 2021-01-27 <NA>
## 11 DFrame 2021-01-27 <NA>
## 12 SummarizedExperiment 2021-01-27 <NA>
We then download the data with dry.run
set to
FALSE
.
acc <- curatedTCGAData(
diseaseCode = "ACC",
assays = c(
"miRNASeqGene", "RPPAArray", "Mutation", "RNASeq2GeneNorm", "CNVSNP"
),
version = "2.0.1",
dry.run = FALSE
)
acc
## A MultiAssayExperiment object of 5 listed
## experiments with user-defined names and respective classes.
## Containing an ExperimentList class object of length 5:
## [1] ACC_CNVSNP-20160128: RaggedExperiment with 21052 rows and 180 columns
## [2] ACC_miRNASeqGene-20160128: SummarizedExperiment with 1046 rows and 80 columns
## [3] ACC_Mutation-20160128: RaggedExperiment with 20166 rows and 90 columns
## [4] ACC_RNASeq2GeneNorm-20160128: SummarizedExperiment with 20501 rows and 79 columns
## [5] ACC_RPPAArray-20160128: SummarizedExperiment with 192 rows and 46 columns
## Functionality:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample coordination DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
## exportClass() - save data to flat files
These objects contain most unrestricted TCGA assay and clinical / pathological data, as well as some curated data from the supplements of published TCGA primary papers at the end of the colData columns:
dim(colData(acc))
## [1] 92 822
tail(colnames(colData(acc)), 10)
## [1] "MethyLevel" "miRNA.cluster" "SCNA.cluster" "protein.cluster"
## [5] "COC" "OncoSign" "purity" "ploidy"
## [9] "genome_doublings" "ADS"
cBioPortalData
To date, the cBio Genomics Portal provides access to more than 300 datasets collected and curated from different instutions.
There are two main ways of accessing this data:
cBioDataPack
- tarball (.tar.gz
) data
filescBioPortalData
- data from the APINote. pkgdown
reference website here: https://waldronlab.io/cBioPortalData/
First, we create an API object using the cBioPortal
function. This will allow us to subsequently generate queries for the
service.
cbio <- cBioPortal()
## Warning in .service_validate_md5sum(api_reference_url, api_reference_md5sum, : service version differs from validated version
## service url: https://www.cbioportal.org/api/api-docs
## observed md5sum: 456be3b7a6b5871a1b41f1233ecbae85
## expected md5sum: 1615443badbeaada68463859c34f15f8
getStudies(cbio)
## # A tibble: 347 × 13
## name description publicStudy groups status importDate allSampleCount
## <chr> <chr> <lgl> <chr> <int> <chr> <int>
## 1 Adrenocortic… "TCGA Adre… TRUE "PUBL… 0 2022-03-0… 92
## 2 Bladder Canc… "Whole exo… TRUE "" 0 2022-03-0… 34
## 3 Basal Cell C… "Whole-exo… TRUE "PUBL… 0 2022-03-0… 293
## 4 Acute Lympho… "Comprehen… TRUE "PUBL… 0 2022-03-0… 93
## 5 Ampullary Ca… "Exome seq… TRUE "PUBL… 0 2022-03-0… 160
## 6 Bladder Urot… "Whole exo… TRUE "PUBL… 0 2022-03-0… 50
## 7 Bladder Canc… "Comprehen… TRUE "PUBL… 0 2022-03-0… 97
## 8 Bladder Urot… "Whole-exo… TRUE "PUBL… 0 2022-03-0… 99
## 9 Bladder Canc… "Genomic P… TRUE "PUBL… 0 2022-03-0… 109
## 10 Hypodiploid … "Whole gen… TRUE "" 0 2022-03-0… 44
## # … with 337 more rows, and 6 more variables: readPermission <lgl>,
## # studyId <chr>, cancerTypeId <chr>, referenceGenome <chr>, pmid <chr>,
## # citation <chr>
We can also see the build status of the datasets using the
buildReport
argument.
getStudies(cbio, buildReport = TRUE)
## # A tibble: 347 × 15
## name description publicStudy groups status importDate allSampleCount
## <chr> <chr> <lgl> <chr> <int> <chr> <int>
## 1 Adrenocortic… "TCGA Adre… TRUE "PUBL… 0 2022-03-0… 92
## 2 Bladder Canc… "Whole exo… TRUE "" 0 2022-03-0… 34
## 3 Basal Cell C… "Whole-exo… TRUE "PUBL… 0 2022-03-0… 293
## 4 Acute Lympho… "Comprehen… TRUE "PUBL… 0 2022-03-0… 93
## 5 Ampullary Ca… "Exome seq… TRUE "PUBL… 0 2022-03-0… 160
## 6 Bladder Urot… "Whole exo… TRUE "PUBL… 0 2022-03-0… 50
## 7 Bladder Canc… "Comprehen… TRUE "PUBL… 0 2022-03-0… 97
## 8 Bladder Urot… "Whole-exo… TRUE "PUBL… 0 2022-03-0… 99
## 9 Bladder Canc… "Genomic P… TRUE "PUBL… 0 2022-03-0… 109
## 10 Hypodiploid … "Whole gen… TRUE "" 0 2022-03-0… 44
## # … with 337 more rows, and 8 more variables: readPermission <lgl>,
## # studyId <chr>, cancerTypeId <chr>, referenceGenome <chr>, pmid <chr>,
## # citation <chr>, api_build <lgl>, pack_build <lgl>
This adds two additional columns to the end of the dataset reporting
the status of the builds. Not all studies can be converted to
MultiAssayExperiment
. Some studies require additonal
cleaning to be represented with MultiAssayExperiment
.
library(cBioPortalData)
(uvm <- cBioDataPack("uvm_tcga"))
## A MultiAssayExperiment object of 9 listed
## experiments with user-defined names and respective classes.
## Containing an ExperimentList class object of length 9:
## [1] cna_hg19.seg: RaggedExperiment with 7618 rows and 80 columns
## [2] CNA: SummarizedExperiment with 24776 rows and 80 columns
## [3] linear_CNA: SummarizedExperiment with 24776 rows and 80 columns
## [4] methylation_hm450: SummarizedExperiment with 15469 rows and 80 columns
## [5] mutations_extended: RaggedExperiment with 2174 rows and 80 columns
## [6] mutations_mskcc: RaggedExperiment with 2174 rows and 80 columns
## [7] RNA_Seq_v2_expression_median: SummarizedExperiment with 20531 rows and 80 columns
## [8] RNA_Seq_v2_mRNA_median_all_sample_Zscores: SummarizedExperiment with 20531 rows and 80 columns
## [9] RNA_Seq_v2_mRNA_median_Zscores: SummarizedExperiment with 20440 rows and 80 columns
## Functionality:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample coordination DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
## exportClass() - save data to flat files
(
urcc <- cBioPortalData(
cbio, studyId = "urcc_mskcc_2016", genePanelId = "IMPACT341"
)
)
## A MultiAssayExperiment object of 2 listed
## experiments with user-defined names and respective classes.
## Containing an ExperimentList class object of length 2:
## [1] urcc_mskcc_2016_cna: SummarizedExperiment with 214 rows and 62 columns
## [2] urcc_mskcc_2016_mutations: RangedSummarizedExperiment with 147 rows and 53 columns
## Functionality:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample coordination DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
## exportClass() - save data to flat files
Key points
curatedTCGAData
provides TCGA data with some curation
including tumor subtype informationcBioPortalData
has two main functions, one for
downloading pre-packaged data and another for sending queries through
the cBioPortal APIAside from the available reshaping functions already included in the
MultiAssayExperiment
package, the TCGAutils
package provides additional helper functions for working with TCGA
data.
A number of helper functions are available for managing datasets from
curatedTCGAData
. These include:
SummarizedExperiment
to
RangedSummarizedExperiment
based on
TxDb.Hsapiens.UCSC.hg19.knownGene
for:
mirToRanges()
: microRNAsymbolsToRanges()
: gene symbolsqreduceTCGA()
: convert RaggedExperiment
objects to RangedSummarizedExperiment
with one row per gene
symbol, for:
curatedTCGAData
objectsmirToRanges
microRNA assays obtained from curatedTCGAData
have
annotated sequences that can be converted to genomic ranges using the
mirbase.db
package. The function looks up all sequences and
converts them to (‘hg19’) ranges. For those rows that cannot be found,
an ‘unranged’ assay is introduced in the resulting MultiAssayExperiment
object.
mirToRanges(acc)
## Warning in (function (seqlevels, genome, new_style) : cannot switch some of
## hg19's seqlevels from UCSC to NCBI style
## Warning: 'experiments' dropped; see 'metadata'
## harmonizing input:
## removing 80 sampleMap rows not in names(experiments)
## A MultiAssayExperiment object of 6 listed
## experiments with user-defined names and respective classes.
## Containing an ExperimentList class object of length 6:
## [1] ACC_CNVSNP-20160128: RaggedExperiment with 21052 rows and 180 columns
## [2] ACC_Mutation-20160128: RaggedExperiment with 20166 rows and 90 columns
## [3] ACC_RNASeq2GeneNorm-20160128: SummarizedExperiment with 20501 rows and 79 columns
## [4] ACC_RPPAArray-20160128: SummarizedExperiment with 192 rows and 46 columns
## [5] ACC_miRNASeqGene-20160128_ranged: RangedSummarizedExperiment with 1002 rows and 80 columns
## [6] ACC_miRNASeqGene-20160128_unranged: SummarizedExperiment with 44 rows and 80 columns
## Functionality:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample coordination DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
## exportClass() - save data to flat files
qreduceTCGA
The qreduceTCGA
function converts
RaggedExperiment
mutation data objects to
RangedSummarizedExperiment
using org.Hs.eg.db
and the qreduceTCGA
utility function from
RaggedExperiment
to summarize ‘silent’ and ‘non-silent’
mutations based on a ‘Variant_Classification’ metadata column in the
original object.
## Update build metadata to "hg19"
genome(acc[["ACC_Mutation-20160128"]]) <- "NCBI37"
seqlevelsStyle(acc[["ACC_Mutation-20160128"]]) <- "UCSC"
gnome <- genome(acc[["ACC_Mutation-20160128"]])
gnome <- translateBuild(gnome)
genome(acc[["ACC_Mutation-20160128"]]) <- gnome
qreduceTCGA(acc)
##
## 403 genes were dropped because they have exons located on both strands
## of the same reference sequence or on more than one reference sequence,
## so cannot be represented by a single genomic range.
## Use 'single.strand.genes.only=FALSE' to get all the genes in a
## GRangesList object, or use suppressMessages() to suppress this message.
## Warning in (function (seqlevels, genome, new_style) : cannot switch some of
## hg19's seqlevels from UCSC to NCBI style
## 'select()' returned 1:1 mapping between keys and columns
## Warning in .normarg_seqlevelsStyle(value): more than one seqlevels style
## supplied, using the 1st one only
## Warning in .normarg_seqlevelsStyle(value): cannot switch some of hg19's
## seqlevels from UCSC to NCBI style
## Warning: 'experiments' dropped; see 'metadata'
## harmonizing input:
## removing 270 sampleMap rows not in names(experiments)
## A MultiAssayExperiment object of 5 listed
## experiments with user-defined names and respective classes.
## Containing an ExperimentList class object of length 5:
## [1] ACC_miRNASeqGene-20160128: SummarizedExperiment with 1046 rows and 80 columns
## [2] ACC_RNASeq2GeneNorm-20160128: SummarizedExperiment with 20501 rows and 79 columns
## [3] ACC_RPPAArray-20160128: SummarizedExperiment with 192 rows and 46 columns
## [4] ACC_Mutation-20160128_simplified: RangedSummarizedExperiment with 22911 rows and 90 columns
## [5] ACC_CNVSNP-20160128_simplified: RangedSummarizedExperiment with 22911 rows and 180 columns
## Functionality:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample coordination DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
## exportClass() - save data to flat files
symbolsToRanges
In the cases where row annotations indicate gene symbols, the
symbolsToRanges
utility function converts genes to genomic
ranges and replaces existing assays with
RangedSummarizedExperiment
objects. Gene annotations are
given as ‘hg19’ genomic regions.
symbolsToRanges(acc)
## 403 genes were dropped because they have exons located on both strands
## of the same reference sequence or on more than one reference sequence,
## so cannot be represented by a single genomic range.
## Use 'single.strand.genes.only=FALSE' to get all the genes in a
## GRangesList object, or use suppressMessages() to suppress this message.
## Warning in (function (seqlevels, genome, new_style) : cannot switch some of
## hg19's seqlevels from UCSC to NCBI style
## 'select()' returned 1:1 mapping between keys and columns
## Warning: 'experiments' dropped; see 'metadata'
## harmonizing input:
## removing 79 sampleMap rows not in names(experiments)
## A MultiAssayExperiment object of 6 listed
## experiments with user-defined names and respective classes.
## Containing an ExperimentList class object of length 6:
## [1] ACC_CNVSNP-20160128: RaggedExperiment with 21052 rows and 180 columns
## [2] ACC_miRNASeqGene-20160128: SummarizedExperiment with 1046 rows and 80 columns
## [3] ACC_Mutation-20160128: RaggedExperiment with 20166 rows and 90 columns
## [4] ACC_RPPAArray-20160128: SummarizedExperiment with 192 rows and 46 columns
## [5] ACC_RNASeq2GeneNorm-20160128_ranged: RangedSummarizedExperiment with 17103 rows and 79 columns
## [6] ACC_RNASeq2GeneNorm-20160128_unranged: SummarizedExperiment with 3398 rows and 79 columns
## Functionality:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample coordination DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
## exportClass() - save data to flat files
simplifyTCGA
The simplifyTCGA
function combines all of the above
operations to create a more managable MultiAssayExperiment
object and using RangedSummarizedExperiment
assays where
possible.
TCGAutils::simplifyTCGA(acc)
## 403 genes were dropped because they have exons located on both strands
## of the same reference sequence or on more than one reference sequence,
## so cannot be represented by a single genomic range.
## Use 'single.strand.genes.only=FALSE' to get all the genes in a
## GRangesList object, or use suppressMessages() to suppress this message.
## Warning in (function (seqlevels, genome, new_style) : cannot switch some of
## hg19's seqlevels from UCSC to NCBI style
## 'select()' returned 1:1 mapping between keys and columns
## Warning in .normarg_seqlevelsStyle(value): more than one seqlevels style
## supplied, using the 1st one only
## Warning in .normarg_seqlevelsStyle(value): cannot switch some of hg19's
## seqlevels from UCSC to NCBI style
## Warning: 'experiments' dropped; see 'metadata'
## harmonizing input:
## removing 270 sampleMap rows not in names(experiments)
## Warning in (function (seqlevels, genome, new_style) : cannot switch some of
## hg19's seqlevels from UCSC to NCBI style
## Warning in (function (seqlevels, genome, new_style) : 'experiments' dropped; see
## 'metadata'
## harmonizing input:
## removing 80 sampleMap rows not in names(experiments)
## 403 genes were dropped because they have exons located on both strands
## of the same reference sequence or on more than one reference sequence,
## so cannot be represented by a single genomic range.
## Use 'single.strand.genes.only=FALSE' to get all the genes in a
## GRangesList object, or use suppressMessages() to suppress this message.
## Warning in (function (seqlevels, genome, new_style) : cannot switch some of
## hg19's seqlevels from UCSC to NCBI style
## 'select()' returned 1:1 mapping between keys and columns
## Warning: 'experiments' dropped; see 'metadata'
## harmonizing input:
## removing 79 sampleMap rows not in names(experiments)
## A MultiAssayExperiment object of 7 listed
## experiments with user-defined names and respective classes.
## Containing an ExperimentList class object of length 7:
## [1] ACC_RPPAArray-20160128: SummarizedExperiment with 192 rows and 46 columns
## [2] ACC_Mutation-20160128_simplified: RangedSummarizedExperiment with 22911 rows and 90 columns
## [3] ACC_CNVSNP-20160128_simplified: RangedSummarizedExperiment with 22911 rows and 180 columns
## [4] ACC_miRNASeqGene-20160128_ranged: RangedSummarizedExperiment with 1002 rows and 80 columns
## [5] ACC_miRNASeqGene-20160128_unranged: SummarizedExperiment with 44 rows and 80 columns
## [6] ACC_RNASeq2GeneNorm-20160128_ranged: RangedSummarizedExperiment with 17103 rows and 79 columns
## [7] ACC_RNASeq2GeneNorm-20160128_unranged: SummarizedExperiment with 3398 rows and 79 columns
## Functionality:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample coordination DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
## exportClass() - save data to flat files
Solution
The sampleTables
function gives you an overview of
samples in each assay:
sampleTables(acc)
## $`ACC_CNVSNP-20160128`
##
## 01 10 11
## 90 85 5
##
## $`ACC_miRNASeqGene-20160128`
##
## 01
## 80
##
## $`ACC_Mutation-20160128`
##
## 01
## 90
##
## $`ACC_RNASeq2GeneNorm-20160128`
##
## 01
## 79
##
## $`ACC_RPPAArray-20160128`
##
## 01
## 46
Interpretation of sample codes:
data("sampleTypes")
head(sampleTypes)
## Code Definition Short.Letter.Code
## 1 01 Primary Solid Tumor TP
## 2 02 Recurrent Solid Tumor TR
## 3 03 Primary Blood Derived Cancer - Peripheral Blood TB
## 4 04 Recurrent Blood Derived Cancer - Bone Marrow TRBM
## 5 05 Additional - New Primary TAP
## 6 06 Metastatic TM
splitAssays
: separate the data from different tissue
typesTCGA datasets include multiple -omics for solid tumors, adjacent
normal tissues, blood-derived cancers and normals, and other tissue
types, which may be mixed together in a single dataset. The
MultiAssayExperiment
object generated here has one patient
per row of its colData
, but each patient may have two or
more -omics profiles by any assay, whether due to assaying of different
types of tissues or to technical replication. splitAssays
separates profiles from different tissue types (such as tumor and
adjacent normal) into different assays of the
MultiAssayExperiment
by taking a vector of sample codes,
and partitioning the current assays into assays with an appended sample
code:
split_acc <- splitAssays(acc, c("01", "11"))
## Warning: 'splitAssays' is deprecated.
## Use 'TCGAsplitAssays' instead.
## See help("Deprecated")
## Warning: Some 'sampleCodes' not found in assays
## Warning in .checkBarcodes(barcodes): Inconsistent barcode lengths: 28, 27
Only about 43 participants have data across all experiments.
Is there subtype data available in the
MultiAssayExperiment
obtained from
curatedTCGAData
?
Solution
The getSubtypeMap
function will show actual variable
names found in colData
that contain subtype information.
This can only be obtained from MultiAssayExperiment
objects
provided by curatedTCGAData
.
getSubtypeMap(acc)
## ACC_annotations ACC_subtype
## 1 Patient_ID patientID
## 2 histological_subtypes Histology
## 3 mrna_subtypes C1A/C1B
## 4 mrna_subtypes mRNA_K4
## 5 cimp MethyLevel
## 6 microrna_subtypes miRNA cluster
## 7 scna_subtypes SCNA cluster
## 8 protein_subtypes protein cluster
## 9 integrative_subtypes COC
## 10 mutation_subtypes OncoSign
head(colData(acc)$Histology)
## [1] "Usual Type" "Usual Type" "Usual Type" "Usual Type" "Usual Type"
## [6] "Usual Type"
TCGAutils
provides a number of ID translation functions.
These allow the user to translate from either file or case UUIDs to TCGA
barcodes and back. These functions work by querying the Genomic Data
Commons API via the GenomicDataCommons
package (thanks to
Sean Davis). These include:
UUIDtoBarcode()
- UUID to TCGA barcodeHere we have a known case UUID that we want to translate into a TCGA barcode.
UUIDtoBarcode("ae55b2d3-62a1-419e-9f9a-5ddfac356db4", from_type = "case_id")
## case_id submitter_id
## 1 ae55b2d3-62a1-419e-9f9a-5ddfac356db4 TCGA-B0-5117
In cases where we want to translate a known file UUID to the
associated TCGA patient barcode, we can use
UUIDtoBarcode
.
UUIDtoBarcode("b4bce3ff-7fdc-4849-880b-56f2b348ceac", from_type = "file_id")
## file_id associated_entities.entity_submitter_id
## 1 b4bce3ff-7fdc-4849-880b-56f2b348ceac TCGA-B0-5094-11A-01D-1421-08
barcodeToUUID()
- TCGA barcode to UUIDHere we translate the first two TCGA barcodes of the previous copy-number alterations dataset to UUID:
(xbarcode <- head(colnames(acc)[["ACC_CNVSNP-20160128"]], 4L))
## [1] "TCGA-OR-A5J1-01A-11D-A29H-01" "TCGA-OR-A5J1-10A-01D-A29K-01"
## [3] "TCGA-OR-A5J2-01A-11D-A29H-01" "TCGA-OR-A5J2-10A-01D-A29K-01"
barcodeToUUID(xbarcode)
## submitter_aliquot_ids aliquot_ids
## 18 TCGA-OR-A5J1-01A-11D-A29H-01 1387b6c7-48fe-4961-86a7-0bdcbd3fef92
## 16 TCGA-OR-A5J1-10A-01D-A29K-01 cb537629-6a01-4d67-84ea-dbf130bd59c7
## 8 TCGA-OR-A5J2-01A-11D-A29H-01 6f0290b0-4cb4-4f72-853e-9ac363bd2c3b
## 11 TCGA-OR-A5J2-10A-01D-A29K-01 4bf2e4ac-399f-4a00-854b-8e23b561bb4d
UUIDtoUUID()
- file and case IDsWe can also translate from file UUIDs to case UUIDs and vice versa as
long as we know the input type. We can use the case UUID from the
previous example to get the associated file UUIDs using
UUIDtoUUID
. Note that this translation is a one to many
relationship, thus yielding a data.frame
of file UUIDs for
a single case UUID.
head(UUIDtoUUID("ae55b2d3-62a1-419e-9f9a-5ddfac356db4", to_type = "file_id"))
## case_id files.file_id
## 1 ae55b2d3-62a1-419e-9f9a-5ddfac356db4 0285b91e-591f-466d-a762-88db3585e302
## 2 ae55b2d3-62a1-419e-9f9a-5ddfac356db4 ff27ebf9-598b-477e-98db-d16946b0af77
## 3 ae55b2d3-62a1-419e-9f9a-5ddfac356db4 e4d818ee-c9fc-44c8-914c-ab32f387d42e
## 4 ae55b2d3-62a1-419e-9f9a-5ddfac356db4 e8f4e519-4d84-401c-a624-1fa477196f84
## 5 ae55b2d3-62a1-419e-9f9a-5ddfac356db4 e5fc0ead-8403-4494-be9c-43336e5fa7c6
## 6 ae55b2d3-62a1-419e-9f9a-5ddfac356db4 784fb164-43e6-4239-b3e2-27b0b806016b
One possible way to verify that file IDs are matching case UUIDS is
to browse to the Genomic Data Commons webpage with the specific file
UUID. Here we look at the first file UUID entry in the output
data.frame
:
https://portal.gdc.cancer.gov/files/0b4acc9e-3933-4d74-916a-a53c4a0665e6
In the page we check that the case UUID matches the input.
filenameToBarcode()
- Using file names as inputfquery <- files() |>
GenomicDataCommons::filter(~ cases.project.project_id == "TCGA-ACC" &
data_category == "Copy Number Variation" &
data_type == "Copy Number Segment")
fnames <- head(results(fquery)$file_name)
filenameToBarcode(fnames)
## file_name
## 1 BLAIN_p_TCGA_282_304_b2_N_GenomeWideSNP_6_D07_1348598.grch38.seg.v2.txt
## 2 BLAIN_p_TCGA_282_304_b2_N_GenomeWideSNP_6_D01_1348590.grch38.seg.v2.txt
## 3 BLAIN_p_TCGA_282_304_b2_N_GenomeWideSNP_6_F11_1348558.grch38.seg.v2.txt
## 4 GAMED_p_TCGA_B_312_313_314_NSP_GenomeWideSNP_6_D11_1361642.grch38.seg.v2.txt
## 5 AQUAE_p_TCGA_112_304_b2_N_GenomeWideSNP_6_F05_1348420.grch38.seg.v2.txt
## 6 BLAIN_p_TCGA_282_304_b2_N_GenomeWideSNP_6_A11_1348562.grch38.seg.v2.txt
## file_id
## 1 c3c194cf-9f00-4b46-8e17-1b5671e9fa5e
## 2 45aabf4e-8601-41f1-9ea0-bd5b34e582ef
## 3 8d8a7df6-7dfa-441b-b298-e038b6919fd5
## 4 766dcd23-c2ad-495c-bc0e-60f9c2a1f11b
## 5 4418dca4-9d93-41db-9d7e-26019ceb9911
## 6 e80d0cc7-535b-44a4-958f-bacf35fe88a7
## cases.samples.portions.analytes.aliquots.submitter_id
## 1 TCGA-OR-A5J7-01A-11D-A29H-01
## 2 TCGA-OR-A5J7-10A-01D-A29K-01
## 3 TCGA-OR-A5K6-01A-11D-A29H-01
## 4 TCGA-OR-A5KB-01A-11D-A309-01
## 5 TCGA-OR-A5L5-01A-11D-A29H-01
## 6 TCGA-OR-A5JP-10A-01D-A29K-01
See the TCGAutils
vignette page for more details.
Key points
TCGAutils
provides users additional tools for modifying
row and column metadata[
subsettingIn the pseudo code below, the subsetting operations work on the rows of the following indices:
list
or List
inputs)multiassayexperiment[i = rownames, j = primary or colnames, k = assay]
Subsetting operations always return another
MultiAssayExperiment
. For example, the following will
return any rows named “MAPK14” or “IGFBP2”, and remove any assays where
no rows match:
miniACC[c("MAPK14", "IGFBP2"), , ]
The following will keep only patients of pathological stage IV, and all their associated assays:
stg4 <- miniACC$pathologic_stage == "stage iv"
# remove NA values from vector
miniACC[, stg4 & !is.na(stg4), ]
And the following will keep only the RNA-seq dataset, and only patients for which this assay is available:
miniACC[, , "RNASeq2GeneNorm"]
## Warning: 'experiments' dropped; see 'metadata'
## harmonizing input:
## removing 306 sampleMap rows not in names(experiments)
## removing 13 colData rownames not in sampleMap 'primary'
If any ExperimentList objects have features represented by genomic
ranges (e.g. RangedSummarizedExperiment
,
RaggedExperiment
), then a GRanges
object in
the first subsetting position will subset these objects as in
GenomicRanges::findOverlaps()
. Any non-ranged
ExperimentList
element will be subset to zero rows.
[[
subsettingThe “double bracket” method ([[
) is a convenience
function for extracting a single element of the
MultiAssayExperiment
ExperimentList
. It avoids
the use of experiments(mae)[[1L]]
. For example, both of the
following extract the ExpressionSet
object containing
RNA-seq data:
miniACC[[1L]]
## class: SummarizedExperiment
## dim: 198 79
## metadata(3): experimentData annotation protocolData
## assays(1): exprs
## rownames(198): DIRAS3 MAPK14 ... SQSTM1 KCNJ13
## rowData names(0):
## colnames(79): TCGA-OR-A5J1-01A-11R-A29S-07 TCGA-OR-A5J2-01A-11R-A29S-07
## ... TCGA-PK-A5HA-01A-11R-A29S-07 TCGA-PK-A5HB-01A-11R-A29S-07
## colData names(0):
## equivalently
miniACC[["RNASeq2GeneNorm"]]
## class: SummarizedExperiment
## dim: 198 79
## metadata(3): experimentData annotation protocolData
## assays(1): exprs
## rownames(198): DIRAS3 MAPK14 ... SQSTM1 KCNJ13
## rowData names(0):
## colnames(79): TCGA-OR-A5J1-01A-11R-A29S-07 TCGA-OR-A5J2-01A-11R-A29S-07
## ... TCGA-PK-A5HA-01A-11R-A29S-07 TCGA-PK-A5HB-01A-11R-A29S-07
## colData names(0):
To preserve the colData
during the extraction, see
?getWithColData
.
complete.cases()
shows which patients have complete data
for all assays:
summary(complete.cases(miniACC))
## Mode FALSE TRUE
## logical 49 43
The above logical vector could be used for patient subsetting. More
simply, intersectColumns()
will select complete cases and
rearrange each ExperimentList
element so its columns
correspond exactly to rows of colData
in the same
order:
accmatched <- intersectColumns(miniACC)
Note, the column names of the assays in accmatched
are
not the same because of assay-specific identifiers, but they have been
automatically re-arranged to correspond to the same patients. In these
TCGA assays, the first three -
delimited positions
correspond to patient, ie the first patient is
TCGA-OR-A5J2:
colnames(accmatched)
## CharacterList of length 5
## [["RNASeq2GeneNorm"]] TCGA-OR-A5J2-01A-11R-A29S-07 ...
## [["gistict"]] TCGA-OR-A5J2-01A-11D-A29H-01 ... TCGA-PK-A5HA-01A-11D-A29H-01
## [["RPPAArray"]] TCGA-OR-A5J2-01A-21-A39K-20 ... TCGA-PK-A5HA-01A-21-A39K-20
## [["Mutations"]] TCGA-OR-A5J2-01A-11D-A29I-10 ... TCGA-PK-A5HA-01A-11D-A29I-10
## [["miRNASeqGene"]] TCGA-OR-A5J2-01A-11R-A29W-13 ...
intersectRows()
keeps only rows that are common to each
assay, and aligns them in identical order. For example, to keep only
genes where data are available for RNA-seq, GISTIC copy number, and
somatic mutations:
accmatched2 <- intersectRows(miniACC[, ,
c("RNASeq2GeneNorm", "gistict", "Mutations")])
## Warning: 'experiments' dropped; see 'metadata'
## harmonizing input:
## removing 126 sampleMap rows not in names(experiments)
rownames(accmatched2)
## CharacterList of length 3
## [["RNASeq2GeneNorm"]] DIRAS3 G6PD KDR ERBB3 AKT1S1 ... RET CDKN2A MACC1 CHGA
## [["gistict"]] DIRAS3 G6PD KDR ERBB3 AKT1S1 ... PREX1 RET CDKN2A MACC1 CHGA
## [["Mutations"]] DIRAS3 G6PD KDR ERBB3 AKT1S1 ... PREX1 RET CDKN2A MACC1 CHGA
The assay
and assays
methods follow
SummarizedExperiment
convention. The assay
(singular) method will extract the first element of the
ExperimentList
and will return a matrix
.
class(assay(miniACC))
## [1] "matrix" "array"
The assays
(plural) method will return a
SimpleList
of the data with each element being a
matrix
.
assays(miniACC)
## List of length 5
## names(5): RNASeq2GeneNorm gistict RPPAArray Mutations miRNASeqGene
Key point:
[[
returned an experiment in its original
class, assay()
and assays()
convert the assay
data to matrix format.Slot in the MultiAssayExperiment
can be accessed or set
using the respective accessor functions:
Slot | Accessor |
---|---|
ExperimentList |
experiments() |
colData |
colData() and $ * |
sampleMap |
sampleMap() |
metadata |
metadata() |
__*__ The $
operator on a
MultiAssayExperiment
returns a single column of the
colData
.
The longFormat
or wideFormat
functions will
“reshape” and combine experiments with each other and with
colData
into one DataFrame
. These functions
provide compatibility with most of the common R/Bioconductor functions
for regression, machine learning, and visualization.
longFormat
In long format a single column provides all assay results,
with additional optional colData
columns whose values are
repeated as necessary. Here assay is the name of the
ExperimentList element, primary is the patient identifier
(rowname of colData), rowname is the assay rowname (in this
case genes), colname is the assay-specific identifier (column
name), value is the numeric measurement (gene expression, copy
number, presence of a non-silent mutation, etc), and following these are
the vital_status and days_to_death colData columns
that have been added:
longFormat(miniACC[c("TP53", "CTNNB1"), , ],
colDataCols = c("vital_status", "days_to_death"))
## harmonizing input:
## removing 126 sampleMap rows not in names(experiments)
## DataFrame with 518 rows and 7 columns
## assay primary rowname colname value
## <character> <character> <character> <character> <numeric>
## 1 RNASeq2GeneNorm TCGA-OR-A5J1 TP53 TCGA-OR-A5J1-01A-11R.. 563.401
## 2 RNASeq2GeneNorm TCGA-OR-A5J1 CTNNB1 TCGA-OR-A5J1-01A-11R.. 5634.467
## 3 RNASeq2GeneNorm TCGA-OR-A5J2 TP53 TCGA-OR-A5J2-01A-11R.. 165.481
## 4 RNASeq2GeneNorm TCGA-OR-A5J2 CTNNB1 TCGA-OR-A5J2-01A-11R.. 62658.391
## 5 RNASeq2GeneNorm TCGA-OR-A5J3 TP53 TCGA-OR-A5J3-01A-11R.. 956.303
## ... ... ... ... ... ...
## 514 Mutations TCGA-PK-A5HA CTNNB1 TCGA-PK-A5HA-01A-11D.. 0
## 515 Mutations TCGA-PK-A5HB TP53 TCGA-PK-A5HB-01A-11D.. 0
## 516 Mutations TCGA-PK-A5HB CTNNB1 TCGA-PK-A5HB-01A-11D.. 0
## 517 Mutations TCGA-PK-A5HC TP53 TCGA-PK-A5HC-01A-11D.. 0
## 518 Mutations TCGA-PK-A5HC CTNNB1 TCGA-PK-A5HC-01A-11D.. 0
## vital_status days_to_death
## <integer> <integer>
## 1 1 1355
## 2 1 1355
## 3 1 1677
## 4 1 1677
## 5 0 NA
## ... ... ...
## 514 0 NA
## 515 0 NA
## 516 0 NA
## 517 0 NA
## 518 0 NA
wideFormat
In wide format, each feature from each assay goes in a separate column, with one row per primary identifier (patient). Here, each variable becomes a new column:
wideFormat(miniACC[c("TP53", "CTNNB1"), , ],
colDataCols = c("vital_status", "days_to_death"))
## harmonizing input:
## removing 126 sampleMap rows not in names(experiments)
## DataFrame with 92 rows and 9 columns
## primary vital_status days_to_death RNASeq2GeneNorm_TP53
## <character> <integer> <integer> <numeric>
## 1 TCGA-OR-A5J1 1 1355 563.401
## 2 TCGA-OR-A5J2 1 1677 165.481
## 3 TCGA-OR-A5J3 0 NA 956.303
## 4 TCGA-OR-A5J4 1 423 NA
## 5 TCGA-OR-A5J5 1 365 1169.636
## ... ... ... ... ...
## 88 TCGA-PK-A5H9 0 NA 890.866
## 89 TCGA-PK-A5HA 0 NA 683.572
## 90 TCGA-PK-A5HB 0 NA 237.370
## 91 TCGA-PK-A5HC 0 NA NA
## 92 TCGA-P6-A5OG 1 383 815.345
## RNASeq2GeneNorm_CTNNB1 gistict_TP53 gistict_CTNNB1 Mutations_TP53
## <numeric> <numeric> <numeric> <numeric>
## 1 5634.47 0 0 0
## 2 62658.39 0 1 1
## 3 6337.43 0 0 0
## 4 NA 1 0 0
## 5 5979.06 0 0 0
## ... ... ... ... ...
## 88 5258.99 0 0 0
## 89 8120.17 -1 0 0
## 90 5257.81 -1 -1 0
## 91 NA 1 1 0
## 92 6390.10 -1 1 NA
## Mutations_CTNNB1
## <numeric>
## 1 0
## 2 1
## 3 0
## 4 0
## 5 0
## ... ...
## 88 0
## 89 0
## 90 0
## 91 0
## 92 NA
Key points
MultiAssayExperiment
object is
important to be able to restrict observations and measurements to
particular phenotypes or sample typeslongFormat
and
wideFormat
are helpful for downstream analysis functions
that require a certain type of input formatSingleCellMultiModal
is an ExperimentHub
package that serves multiple datasets obtained from GEO and other
sources and represents them as MultiAssayExperiment
objects. We provide several multi-modal datasets including scNMT, 10X
Multiome, seqFISH, CITEseq, SCoPE2, and others. The scope of the package
is is to provide data for benchmarking and analysis.
Users can access the data for a particular technology with the
appropriate function. For example, to obtain a small data.frame of what
data is available for scNMT
, the user would enter:
scNMT("mouse_gastrulation", dry.run = TRUE, version = '2.0.0')
## snapshotDate(): 2022-04-26
## ah_id mode file_size rdataclass rdatadateadded rdatadateremoved
## 1 EH3753 acc_cgi 21.1 Mb matrix 2020-09-03 <NA>
## 2 EH3754 acc_CTCF 1.2 Mb matrix 2020-09-03 <NA>
## 3 EH3755 acc_DHS 16.2 Mb matrix 2020-09-03 <NA>
## 4 EH3756 acc_genebody 60.1 Mb matrix 2020-09-03 <NA>
## 5 EH3757 acc_p300 0.2 Mb matrix 2020-09-03 <NA>
## 6 EH3758 acc_promoter 33.8 Mb matrix 2020-09-03 <NA>
## 7 EH3760 met_cgi 12.1 Mb matrix 2020-09-03 <NA>
## 8 EH3761 met_CTCF 0.1 Mb matrix 2020-09-03 <NA>
## 9 EH3762 met_DHS 3.9 Mb matrix 2020-09-03 <NA>
## 10 EH3763 met_genebody 33.9 Mb matrix 2020-09-03 <NA>
## 11 EH3764 met_p300 0.1 Mb matrix 2020-09-03 <NA>
## 12 EH3765 met_promoter 18.7 Mb matrix 2020-09-03 <NA>
## 13 EH3766 rna 43.5 Mb matrix 2020-09-03 <NA>
We can see that there are a few assays for each modality.
The user can also use the help function to get a list of all the functions available in the package:
help(package = "SingleCellMultiModal", help_type = "html")
library(MultiAssayExperiment)
library(survival)
library(survminer)
library(pheatmap)
These provide exercises and solutions using the miniACC
dataset.
miniACC
samples have data for each combination of
assays?Solution
The built-in upsetSamples
creates an “upset” Venn
diagram to answer this question:
data("miniACC")
upsetSamples(miniACC)
In this dataset only 43 samples have all 5 assays, 32 are missing reverse-phase protein (RPPAArray), 12 have only mutations and gistict, 2 are missing Mutations, 1 is missing gistict, etc.
Create a Kaplan-meier plot, using pathology_N_stage as a stratifying variable.
Solution
The colData provides clinical data for things like a Kaplan-Meier plot for overall survival stratified by nodal stage.
Surv(miniACC$days_to_death, miniACC$vital_status)
## [1] 1355 1677 NA+ 423 365 NA+ 490 579 NA+ 922 551 1750
## [13] NA+ 2105 NA+ 541 NA+ NA+ 490 NA+ NA+ 562 NA+ NA+
## [25] NA+ NA+ NA+ NA+ 289 NA+ NA+ NA+ 552 NA+ NA+ NA+
## [37] 994 NA+ NA+ 498 NA+ NA+ 344 NA+ NA+ NA+ NA+ NA+
## [49] NA+ NA+ NA+ NA+ NA+ 391 125 NA+ 1852 NA+ NA+ NA+
## [61] NA+ NA+ NA+ NA+ 1204 159 1197 662 445 NA+ 2385 436
## [73] 1105 NA+ 1613 NA+ NA+ 2405 NA+ NA+ NA+ NA+ NA+ 207
## [85] 0 NA+ NA+ NA+ NA+ NA+ NA+ 383
And remove any patients missing overall survival information:
miniACCsurv <-
miniACC[, complete.cases(miniACC$days_to_death, miniACC$vital_status), ]
miniACCsurv
## A MultiAssayExperiment object of 5 listed
## experiments with user-defined names and respective classes.
## Containing an ExperimentList class object of length 5:
## [1] RNASeq2GeneNorm: SummarizedExperiment with 198 rows and 28 columns
## [2] gistict: SummarizedExperiment with 198 rows and 33 columns
## [3] RPPAArray: SummarizedExperiment with 33 rows and 14 columns
## [4] Mutations: matrix with 97 rows and 33 columns
## [5] miRNASeqGene: SummarizedExperiment with 471 rows and 29 columns
## Functionality:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample coordination DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
## exportClass() - save data to flat files
fit <- survfit(
Surv(days_to_death, vital_status) ~ pathology_N_stage,
data = colData(miniACCsurv)
)
ggsurvplot(fit, data = colData(miniACCsurv), risk.table = TRUE)
Choose the EZH2 gene for demonstration. This subsetting will drop assays with no row named EZH2:
wideacc = wideFormat(miniACC["EZH2", , ],
colDataCols=c("vital_status", "days_to_death", "pathology_N_stage"))
## harmonizing input:
## removing 216 sampleMap rows not in names(experiments)
wideacc$y = Surv(wideacc$days_to_death, wideacc$vital_status)
head(wideacc)
## DataFrame with 6 rows and 7 columns
## primary vital_status days_to_death pathology_N_stage
## <character> <integer> <integer> <character>
## 1 TCGA-OR-A5J1 1 1355 n0
## 2 TCGA-OR-A5J2 1 1677 n0
## 3 TCGA-OR-A5J3 0 NA n0
## 4 TCGA-OR-A5J4 1 423 n1
## 5 TCGA-OR-A5J5 1 365 n0
## 6 TCGA-OR-A5J6 0 NA n0
## RNASeq2GeneNorm_EZH2 gistict_EZH2 y
## <numeric> <numeric> <Surv>
## 1 75.8886 0 1355:1
## 2 326.5332 1 1677:1
## 3 190.1940 1 NA:0
## 4 NA -2 423:1
## 5 366.3826 1 365:1
## 6 30.7371 1 NA:0
Perform a multivariate Cox regression with EZH2 copy number (gistict), log2-transformed EZH2 expression (RNASeq2GeneNorm), and nodal status (pathology_N_stage) as predictors:
coxph(
Surv(days_to_death, vital_status) ~
gistict_EZH2 + log2(RNASeq2GeneNorm_EZH2) + pathology_N_stage,
data=wideacc
)
## Call:
## coxph(formula = Surv(days_to_death, vital_status) ~ gistict_EZH2 +
## log2(RNASeq2GeneNorm_EZH2) + pathology_N_stage, data = wideacc)
##
## coef exp(coef) se(coef) z p
## gistict_EZH2 -0.03723 0.96345 0.28205 -0.132 0.894986
## log2(RNASeq2GeneNorm_EZH2) 0.97731 2.65729 0.28105 3.477 0.000506
## pathology_N_stagen1 0.37749 1.45862 0.56992 0.662 0.507743
##
## Likelihood ratio test=16.28 on 3 df, p=0.0009942
## n= 26, number of events= 26
## (66 observations deleted due to missingness)
We see that EZH2 expression is significantly associated with overal survival (p < 0.001), but EZH2 copy number and nodal status are not. This analysis could easily be extended to the whole genome for discovery of prognostic features by repeated univariate regressions over columns, penalized multivariate regression, etc.
For further detail, see the main MultiAssayExperiment
vignette.
Part 1
For all genes where there is both recurrent copy number (gistict assay) and RNA-seq, calculate the correlation between log2(RNAseq + 1) and copy number. Create a histogram of these correlations. Compare this with the histogram of correlations between all unmatched gene - copy number pairs.
Solution
First, narrow down miniACC
to only the assays
needed:
subacc <- miniACC[, , c("RNASeq2GeneNorm", "gistict")]
## Warning: 'experiments' dropped; see 'metadata'
## harmonizing input:
## removing 216 sampleMap rows not in names(experiments)
Align the rows and columns, keeping only samples with both assays available:
subacc <- intersectColumns(subacc)
subacc <- intersectRows(subacc)
Create a list of numeric matrices:
subacc.list <- assays(subacc)
Log-transform the RNA-seq assay:
subacc.list[[1]] <- log2(subacc.list[[1]] + 1)
Transpose both, so genes are in columns:
subacc.list <- lapply(subacc.list, t)
Calculate the correlation between columns in the first matrix and columns in the second matrix:
corres <- cor(subacc.list[[1]], subacc.list[[2]])
And finally, create the histograms:
hist(diag(corres))
hist(corres[upper.tri(corres)])
Part 2
For the gene with highest correlation to copy number, make a box plot of log2 expression against copy number.
Solution
First, identify the gene with highest correlation between expression and copy number:
which.max(diag(corres))
## EIF4E
## 91
You could now make the plot by taking the EIF4E columns from each element of the list subacc.list list that was extracted from the subacc MultiAssayExperiment, but let’s do it by subsetting and extracting from the MultiAssayExperiment:
df <- wideFormat(subacc["EIF4E", , ])
head(df)
## DataFrame with 6 rows and 3 columns
## primary RNASeq2GeneNorm_EIF4E gistict_EIF4E
## <character> <numeric> <numeric>
## 1 TCGA-OR-A5J1 460.615 0
## 2 TCGA-OR-A5J2 371.225 0
## 3 TCGA-OR-A5J3 516.072 0
## 4 TCGA-OR-A5J5 1129.357 1
## 5 TCGA-OR-A5J6 822.078 0
## 6 TCGA-OR-A5J7 344.565 -1
boxplot(RNASeq2GeneNorm_EIF4E ~ gistict_EIF4E,
data=df, varwidth=TRUE,
xlab="GISTIC Relative Copy Number Call",
ylab="RNA-seq counts")