--- title: "2. HPAanalyze in-depth: Working with Human Protein Atlas (HPA) data in R" author: - name: Anh N. Tran affiliation: Northwestern University, IL, USA email: trannhatanh89@gmail.com date: 6/7/2019 output: BiocStyle::html_document: toc: true toc_depth: 2 toc_float: true number_sections: true vignette: > %\VignetteIndexEntry{"2. HPAanalyze in-depth: Working with Human Protein Atlas (HPA) data in R"} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( collapse=TRUE, comment="#>", warning=FALSE, message=FALSE, error=FALSE, crop = NULL ) ``` ```{r library} library(BiocStyle) library(HPAanalyze) library(tibble) library(dplyr) library(ggplot2) ``` # Summary * **Background:** The Human Protein Atlas program aims to map human proteins via multiple technologies including imaging, proteomics and transcriptomics. * **Results:** `HPAanalyze` is an R package for retreiving and performing exploratory data analysis from HPA. It provides functionality for importing data tables and xml files from HPA, exporting and visualizing data, as well as download all staining images of interest. The package is free, open source, and available via Github. * **Conclusions:** `HPAanalyze` intergrates into the R workflow via the `tidyverse` philosophy and data structures, and can be used in combination with Bioconductor packages for easy analysis of HPA data. **Keywords:** Human Protein Atlas, Proteomics, Homo Sapiens, Visualization, Software # Background The Human Protein Atlas (HPA) is a comprehensive resource for exploration of human proteome which contains a vast amount of proteomics and transcriptomics data generated from antibody-based tissue micro-array profiling and RNA deep-sequencing. The program has generated protein expression profiles in human normal tissues with cell type-specific expression patterns, cancer and cell lines via an innovative immunohistochemistry-based approach. These profiles are accompanied by a large collection of high quality histological staining images, annotated with clinical data and quantification. The database also includes classification of protein into both functional classes (such as transcription factors or kinases) and project-related classes (such as candidate genes for cancer). Starting from version 4.0, the HPA includes subcellular location profiles generated based on confocal images of immunofluorescent stained cells. Together, these data provide a detailed picture of protein expression in human cells and tissues, facilitating tissue-based diagnostic and research. Data from the HPA are freely available via proteinatlas.org, allowing scientists to access and incorporate the data into their research. Previously, the R package *hpar* has been created for fast and easy programmatic access of HPA data. Here, we introduce *HPAanalyze*, an R package aims to simplify exploratory data analysis from those data, as well as provide other complementary functionality to *hpar*. ## The different HPA data formats The Human Protein Atlas project provides data via two main mechanisms: *Full datasets* in the form of downloadable compressed tab-separated files (.tsv) and *individual entries* in XML, RDF and TSV formats. The full downloadable datasets includes normal tissue, pathology (cancer), subcellular location, RNA gene and RNA isoform data. For individual entries, the XML format is the most comprehensive, providing information on the target protein, antibodies, summary for each tissue and detailed data from each sample including clinical data, IHC scoring and image download links. ## `HPAanalyze` overview `HPAanalyze` is designed to fullfill 3 main tasks: (1) Import, subsetting and export downloadable datasets; (2) Visualization of downloadable datasets for exploratory analysis; and (3) Working with the individual XML files. This package aims to serve researchers with little programming experience, but also allow power users to use the imported data as desired. ```{r echo=FALSE, fig.cap="HPAanalyze workflow.", out.width = '100%'} knitr::include_graphics("figures/workflow.png") ``` ### Obtaining `HPAanalyze` The stable version of *HPAanalyze* should be downloaded from Bioconductor: ``` if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("HPAanalyze") ``` The development version of *HPAanalyze* is available on Github can be installed with: ``` devtools::install_github("trannhatanh89/HPAanalyze") ``` Please cite: __Tran AN, Dussaq AM, Kennell T, Willey C, Hjelmeland A. _HPAanalyze: An R Package that Facilitates the Retrieval and Analysis of The Human Protein Atlas Data_. bioRxiv 355032; doi: https://doi.org/10.1101/355032 __ # Full dataset import, subsetting and export The `hpaDownload()` function downloads full datasets from HPA (*specifically, the .tsv format described above*) and imports them into R as a list of tibbles, the standard object of `tidyverse`, which can subsequently be subset with `hpaSubset()` and export into .xmlx files with `hpaExport()`. The standard object allow the imported data to be further processed in a traditional R workflow. The ability to quickly subset and export data gives researchers the option to use other non-R downstream tools, such as GraphPad for creating publication-quality graphics, or share a subset of data containing only proteins of interest. **You can skip this whole section if you only care about visualization, unless you need a specific version of the HPA datasets, or the RNA expression datasets.** ## Download and import data with `hpaDownload()` This function should be the first thing you use. It give you a list of data frames containing the datasets you specified, which you can then feed into other functions in this package. ```{r downloadedData, eval=FALSE} # this gives you the latest everything, which is nice to keep but not really necessary downloadedData <- hpaDownload(downloadList='all') summary(downloadedData) #> Length Class Mode #> normal_tissue 6 tbl_df list #> pathology 11 tbl_df list #> subcellular_location 11 tbl_df list #> rna_tissue 5 tbl_df list #> rna_cell_line 5 tbl_df list #> transcript_rna_tissue 4 tbl_df list #> transcript_rna_cell_line 4 tbl_df list ``` ### The "histology" datasets Most of the time, you will only need the "histology" datasets, which contain `normal_tissue`, `pathology` (basically cancers) and `subcellular_location`. ```{r histology} downloadedData <- hpaDownload(downloadList='histology', version='example') # version = "example" will load the HPA v18 datasets came with this package. That's sufficient for normal usage, and save you some time. ``` The `normal_tissue` dataset contains information about protein expression profiles in human tissues based on IHC staining. The datasets contain six columns: `ensembl` (Ensembl gene identifier); `gene` (HGNC symbol), `tissue` (tissue name); `cell_type` (annotated cell type); `level` (expression value); `reliability` (the gene reliability of the expression value). ``` {r normal_tissue} tibble::glimpse(downloadedData$normal_tissue, give.attr=FALSE) ``` The `pathology` dataset contains information about protein expression profiles in human tumor tissue based on IHC staining. The datasets contain eleven columns: `ensembl` (Ensembl gene identifier); `gene` (HGNC symbol); `cancer` (cancer type); `high`, `medium`, `low`, `not_detected` (number of patients annotated for different staining levels); `prognostic_favorable`, `unprognostic_favorable`, `prognostic_unfavorable`, `unprognostic_unfavorable` (log-rank p values for patient survival and mRNA correlation). ```{r pathology} tibble::glimpse(downloadedData$pathology, give.attr=FALSE) ``` The `subcellular_location` dataset contains information about subcellular localization of proteins based on IF stanings of normal cells. The datasets contain eleven columns: `ensembl` (Ensembl gene identifier); `gene` (HGNC symbol); `reliability` (gene reliability score); `enhanced` (enhanced locations); `supported` (supported locations); `approved` (approved locations); `uncertain` (uncertain locations); `single_cell_var_intensity` (locations with single-cell variation in intensity); `single_cell_var_spatial` (locations with spatial single-cell variation); `cell_cycle_dependency` (locations with observed cell cycle dependency); `go_id` (Gene Ontology Cellular Component term identifier). ```{r subcellular_location} tibble::glimpse(downloadedData$subcellular_location, give.attr=FALSE) ``` ### The "RNA" datasets The `rna_tissue` and `rna_cell_line` datasets contain RNA expression levels of 37 tissues and 64 cell lines based on RNA-seq. These datasets contain four columns each: `ensembl` (Ensembl gene identifier); `gene` (HGNC symbol); `tissue`/`cell_line` (type of sample); `value` + `unit` (expression level measured by transcripts per million). ```{r rna1, warning=FALSE, message=FALSE} downloadedData <- hpaDownload(downloadList='rna', version='v18') tibble::glimpse(downloadedData$rna_tissue, give.attr=FALSE) ``` ```{r rna2, warning=FALSE, message=FALSE} tibble::glimpse(downloadedData$rna_cell_line, give.attr=FALSE) ``` Similarly, the `transcript_rna_tissue` and `transcript_rna_cell_line` datasets contain RNA isoform levels. These datasets contain four columns each: `ensembl` (Ensembl gene identifier); `transcript` (Ensembl transcript identifier); `tissue`/`cell_line` (type of sample); `value` (expression level measured by transcripts per million). Note that these datasets are significantly larger than others and should only be downloaded when necessary. ```{r isoform1, eval=FALSE} downloadedData <- hpaDownload(downloadList='isoform', version='v18') # version = "v18" is an example of how you may download different versions of the HPA datasets. Just change the number. Note that not all versions are available from the HPA website. tibble::glimpse(downloadedData$transcript_rna_tissue, give.attr=FALSE) #> Observations: 27,535,996 #> Variables: 4 #> $ ensembl "ENSG00000000003", "ENSG00000000003", "ENSG0000000... #> $ transcript "ENST00000373020", "ENST00000494424", "ENST0000049... #> $ tissue "adipose tissue.V1", "adipose tissue.V1", "adipose... #> $ value 27.3577003, 0.0000000, 1.9341500, 1.6059300, 0.000... ``` ```{r isoform2, eval=FALSE} tibble::glimpse(downloadedData$transcript_rna_cell_line, give.attr=FALSE) #> Observations: 20,972,183 #> Variables: 4 #> $ ensembl "ENSG00000000003", "ENSG00000000003", "ENSG0000000... #> $ transcript "ENST00000373020", "ENST00000494424", "ENST0000049... #> $ cell_line "A-431.C35", "A-431.C35", "A-431.C35", "A-431.C35"... #> $ value 29.406799, 0.000000, 0.992916, 0.398387, 0.239204,... ``` ## List available parameter for subsetting with `hpaListParam()` To see what parameters are available for subsequent subsetting/visualizing, HPAanalyze includes the function `hpaListParam()`. The input for this function is the output of `hpaDownload`. *If you leave the argument blank, this function will give you the results for version 18.* ```{r list_param, eval=FALSE} ## If you use the output from hpaDownload() downloadedData <- hpaDownload(downloadList='all') str(hpaListParam(downloadedData)) #> List of 6 #> $ normal_tissue : chr [1:58] "adrenal gland" "appendix" "bone marrow" "breast" ... #> $ normal_cell : chr [1:82] "glandular cells" "lymphoid tissue" "hematopoietic cells" "adipocytes" ... #> $ cancer : chr [1:20] "breast cancer" "carcinoid" "cervical cancer" "colorectal cancer" ... #> $ subcellular_location: chr [1:32] "Cytosol" "Mitochondria" "Aggresome" "Plasma membrane" ... #> $ normal_tissue_rna : chr [1:37] "adipose tissue" "adrenal gland" "appendix" "bone marrow" ... #> $ cell_line_rna : chr [1:64] "A-431" "A549" "AF22" "AN3-CA" ... ``` ```{r list_param_2} ## If you use leave the argument blank str(hpaListParam()) ``` ## Subset data with `hpaSubset()` `hpaSubset()` filters the output of `hpaDownload()` for desirable target genes, tissues, cell types, cancer, and cell lines. The data will be subset only where applicable (i.e. `normal_tissue` will not be subset by *cancer*). The main purpose of `hpaSubset` is to prepare a manageable set of data to be exported. However, this function may also be useful for other data table manipulation purposes. The input for `targetGene` argument is a vector of strings of HGNC symbols. *If you leave the `data` argument blank, this function will automatically subset the bundled version 18 dataset, which may not contain all of the columns available if you download the data with `hpaDownload()`.* ```{r subset1, message=FALSE, warning=FALSE} downloadedData <- hpaDownload(downloadList='histology', version='example') sapply(downloadedData, nrow) ``` ```{r subset2, message=FALSE, warning=FALSE} geneList <- c('TP53', 'EGFR', 'CD44', 'PTEN', 'IDH1', 'IDH2', 'CYCS') tissueList <- c('breast', 'cerebellum', 'skin 1') cancerList <- c('breast cancer', 'glioma', 'melanoma') cellLineList <- c('A-431', 'A549', 'AF22', 'AN3-CA') subsetData <- hpaSubset(data=downloadedData, targetGene=geneList, targetTissue=tissueList, targetCancer=cancerList, targetCellLine=cellLineList) sapply(subsetData, nrow) ``` ## Export data with `hpaExport()` As the name suggests, `hpaExport()` exports the output of `hpaSubset()` to one .xlsx file. Each dataset is placed in a separate sheet. More formats such as .csv and .tsv might be added in future release; hence, the `fileType` argument is included. ```{r eval=FALSE} hpaExport(subsetData, fileName='subset.xlsx', fileType='xlsx') ``` # Visualization `HPAanalyze` provides the ability to quickly visualize data from downloaded HPA datasets with the `hpaVis` function family. The goal of these functions is to aid exploratory analysis of a group of target genes, which maybe particularly useful for gaining insights into pathways or gene signatures of interest. The `hpaVis` functions share a common syntax, where the input (`data` argument) is the output of `hpaDownload()` or `hpaSubset()` (although they do their own subseting so it is not necessary to use `hpaSubset()` unless you want to reduce the size of your data object). Depending on the function, the `target` arguments will let you choose to visualize your vectors of genes, tissue, cell types, etc. (See the help files for more details.) All of `hpaVis` functions generate standard `ggplot2` plots, which allow you to further customize colors and themes. Colors maybe changed via the `color` argument, while the default theme maybe overriden by setting the `customTheme` argument to `FALSE`. Currently, the `normal_tissue`, `pathology` and `subcellular_location` data can be visualized, with more functions planned for future releases. *For all functions in the `hpaVis` family, if you leave the `data` argument blank, they will plot version 18 by default.* ```{r visData, echo=FALSE, warning=FALSE, message=FALSE} downloadedData <- hpaDownload('histology', 'example') ``` ## Unbrella function `hpaVis()` `hpaVis` will plot all available plots by default. See the quick-start vignette for details. ```{r hpaVis_eg} hpaVis(targetGene = c("GCH1", "PTS", "SPR", "DHFR"), targetTissue = c("cerebellum", "cerebral cortex", "hippocampus"), targetCancer = c("glioma")) ``` ## Visualize tissue data with `hpaVisTissue()` `hpaVisTissue()` generates a "heatmap", in which the expression of proteins of interest (quantified IHC staining) are plotted for each cell type of each tissue. ```{r visTissue} geneList <- c('TP53', 'EGFR', 'CD44', 'PTEN', 'IDH1', 'IDH2', 'CYCS') tissueList <- c('breast', 'cerebellum', 'skin 1') hpaVisTissue(downloadedData, targetGene=geneList, targetTissue=tissueList) ``` ## Visualize expression in cancer with `hpaVisPatho()` `hpaVisPatho()` generates an arrays of column graphs showing the expression of proteins of interest in each cancer. This example also demonstrate how the colors of the graphs could be customized, which is a common functionality of the `hpaVis` family. ```{r visPatho} geneList <- c('TP53', 'EGFR', 'CD44', 'PTEN', 'IDH1', 'IDH2', 'CYCS') cancerList <- c('breast cancer', 'glioma', 'lymphoma', 'prostate cancer') colorGray <- c('slategray1', 'slategray2', 'slategray3', 'slategray4') hpaVisPatho(downloadedData, targetGene=geneList, targetCancer=cancerList, color=colorGray) ``` ## Visualize subcellular location data with `hpaVisSubcell()` `hpaVisSubcell()` generates a tile chart showing the subcellular locations (approved and supported) of proteins of interest. This example also demonstrate the customization of the output plot with ggplot2 functions, which is applicable to all `hpaVis` functions. Notice that the `customTheme` argument is set to `TRUE`. ```{r visSubcell} geneList <- c('TP53', 'EGFR', 'CD44', 'PTEN', 'IDH1', 'IDH2', 'CYCS') hpaVisSubcell(downloadedData, targetGene=geneList, customTheme=TRUE) + ggplot2::theme_minimal() + ggplot2::ylab('Subcellular locations') + ggplot2::xlab('Protein') + ggplot2::theme(axis.text.x=element_text(angle=45, hjust=1)) + ggplot2::theme(legend.position="none") + ggplot2::coord_equal() ``` # Individual xml import and image downloading The `hpaXml` function family import and extract data from individual XML entries from HPA. The `hpaXmlGet()` function downloads and imports data as "xml_document"/"xml_node" object, which can subsequently be processed by other `hpaXml` functions. The XML format from HPA contains a wealth of information that may not be covered by this package. However, users can extract any data of interest from the imported XML file using the xml2 package. A typical workflow for working with XML files includes the following steps: (1) Download and import XML file with `hpaXmlGet()`; (2) Extract the desired information with other `hpaXml` functions; and (3) Download histological staining pictures, which is currently supported by the `hpaXmlTissurExpr()` and `hpaXmlTissueExprSum()` functions. ## The umbrella function `hpaXml` `hpaXml` will take an Ensembl gene id (start with *ENSG*) and extract all availble information. You can also feed the ourput of hpaXmlGet to it. See the quick-start vignette for more details. ```{r eval=FALSE} EGFR <- hpaXml(inputXml='ENSG00000146648') names(EGFR) #> [1] "ProtClass" "TissueExprSum" "Antibody" "TissueExpr" ``` ## Import xml file with `hpaXmlGet()` The `hoaXmlGet()` function takes an Ensembl gene id (start with *ENSG*) and import the perspective XML file into R. This function calls the `xml2::read_xml()` under the hood, hence the resulting object may be processed further with *xml2* functions if desired. ```{r XmlGet, eval=FALSE} EGFRxml <- hpaXmlGet('ENSG00000146648') ``` ## View protein classes with `hpaXmlProtClass()` Protein class of queried protein can be extracted from the imported XML with `hpaXmlProtClass()`. The output of this function is a tibble of 4 columns: `id`, `name`, `parent_id` and `source` ```{r XmlProtClass, eval=FALSE} hpaXmlProtClass(EGFRxml) #> # A tibble: 40 x 4 #> id name parent_id source #> #> 1 Ez Enzymes #> 2 Ec ENZYME proteins Ez ENZYME #> 3 Et Transferases Ec ENZYME #> 4 Ki Kinases Ez UniProt #> 5 Kt Tyr protein kinases Ki UniProt #> 6 Ma Predicted membrane proteins MDM #> 7 Md Membrane proteins predicted by MDM MDM #> 8 Me MEMSAT3 predicted membrane proteins MEMSAT3 #> 9 Mf MEMSAT-SVM predicted membrane proteins MEMSAT-SVM #> 10 Mg Phobius predicted membrane proteins Phobius #> # ... with 30 more rows ``` ## Get summary and images of tissue expression with `hpaXmlTissueExprSum()` The function `hpaXmlTissueExprSum()` extract the summary of expression of protein of interest in normal tissue. The output of this function is a list of (1) a string contains one-sentence summary and (2) a dataframe of all tissues in which the protein was stained positive and a histological stain images of those tissue. ```{r XmlTissueExprSum, eval=FALSE} hpaXmlTissueExprSum(EGFRxml) #> $summary #> [1] "Cytoplasmic and membranous expression in several tissues, most abundant in placenta." #> #> $img #> tissue #> 1 cerebral cortex #> 2 lymph node #> 3 liver #> 4 colon #> 5 kidney #> 6 testis #> 7 placenta #> imageUrl #> 1 http://v18.proteinatlas.org/images/18530/41191_B_7_5_rna_selected.jpg #> 2 http://v18.proteinatlas.org/images/18530/41191_A_7_8_rna_selected.jpg #> 3 http://v18.proteinatlas.org/images/18530/41191_A_7_4_rna_selected.jpg #> 4 http://v18.proteinatlas.org/images/18530/41191_A_9_3_rna_selected.jpg #> 5 http://v18.proteinatlas.org/images/18530/41191_A_9_5_rna_selected.jpg #> 6 http://v18.proteinatlas.org/images/18530/41191_A_6_6_rna_selected.jpg #> 7 http://v18.proteinatlas.org/images/18530/41191_A_1_7_rna_selected.jpg ``` Those images can be downloaded automatically by setting the `downloadImg` argument to `TRUE`. Eg. `hpaXmlTissueExprSum(CCNB1xml, downloadImg=TRUE)` ## Get details of individual IHC samples with `hpaXmlAntibody()` and `hpaXmlTissueExpr()` More importantly, the XML files are the only format of HPA programmatically accesible data which contains information about each antibody and each tissue sample used in the project. `hpaXmlAntibody()` extract the antibody information and return a tibble with one row for each antibody. ```{r XmlAntibody, eval=FALSE} hpaXmlAntibody(EGFRxml) #> # A tibble: 5 x 4 #> id releaseDate releaseVersion RRID #> #> 1 CAB000035 2006-03-13 1.2 #> 2 HPA001200 2008-02-15 3.1 AB_1078723 #> 3 HPA018530 2008-12-03 4.1 AB_1848044 #> 4 CAB068186 2014-11-06 13 AB_2665679 #> 5 CAB073534 2015-10-16 14 ``` `hpaXmlTissueExpr()` extract information about all samples for each antibody above and return a list of tibbles. If antibody has not been used for IHC staining, the returned tibble with be empty. ```{r XmlTissueExpr1, eval = FALSE} tissueExpression <- hpaXmlTissueExpr(EGFRxml) summary(tissueExpression) #> Length Class Mode #> [1,] 18 tbl_df list #> [2,] 18 tbl_df list #> [3,] 18 tbl_df list #> [4,] 18 tbl_df list #> [5,] 18 tbl_df list ``` Each tibble contain clinical data (`patientid`, `age`, `sex`), tissue information (`snomedCode`, `tissueDescription`), staining results (`staining`, `intensity`, `location`) and one `imageUrl` for each sample. However, due to the large amount of data and the relatively large size of each image, `hpaXmlTissueExpr` does not provide an automated download option. ```{r XmlTissueExpr2, eval = FALSE} tissueExpression[[1]] #> # A tibble: 327 x 18 #> patientId age sex staining intensity quantity location imageUrl #> #> 1 1653 53 Male http://~ #> 2 1721 60 Fema~ http://~ #> 3 1725 57 Male http://~ #> 4 4 25 Male http://~ #> 5 512 34 Fema~ http://~ #> 6 2664 74 Fema~ http://~ #> 7 2665 88 Fema~ http://~ #> 8 1391 54 Fema~ http://~ #> 9 1447 45 Fema~ http://~ #> 10 1452 44 Fema~ http://~ #> # ... with 317 more rows, and 10 more variables: snomedCode1 , #> # snomedCode2 , snomedCode3 , snomedCode4 , #> # snomedCode5 , tissueDescription1 , tissueDescription2 , #> # tissueDescription3 , tissueDescription4 , #> # tissueDescription5 ``` `hpaTissueExprSum` and `hpaTissueExpr` provide download links to download relevant staining images, with the former function also gives the options to automate the downloading process. # Compatibility with `hpar` Bioconductor package Functionality| hpar | HPAanalyze ------------:|:--------------------------------|:-------------------------------- Datasets | Included in package | Download from server or use built-in dataset Query | Ensembl id | HGNC symbol for datasets, Ensembl id for XML Data version | One stable version | Latest by default, option to download older Release info | Access via functions | N/A View relevant browser page | Via `getHPA` function | N/A Visualization| N/A | Exploratory via `hpaVis` functions XML | N/A | Download and import via `hpaXml` functions Histology image| View by loading browser page | Extract links via `hpaXml` functions : (\#tab:table) Complementary functionality between hpar and HPAanalyze # Acknowledgements We appreciate the support of the National institutes of Health National Cancer Institute R01 CA151522 and funds from the Department of Cell, Developmental and Integrative Biology at the University of Alabama at Birmingham. # Copyright ```{r child = 'data/copyright'} ```