---
title: "1. HPAanalyze quick start guide: Acquire and visualize the Human Protein Atlas (HPA) data in one function"
author:
- name: Anh N. Tran
  affiliation: Northwestern University, Illinois, USA
  email: trannhatanh89@gmail.com
date: 6/7/2019
output: 
    BiocStyle::html_document:
        toc: true
        toc_depth: 2
        toc_float: true
        number_sections: true
vignette: >
  %\VignetteIndexEntry{"1. HPAanalyze quick start guide: Acquire and visualize the Human Protein Atlas (HPA) data in one function"}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse=TRUE,
  comment="#>",
  warning=FALSE,
  error=FALSE,
  crop = NULL
)
```

```{r library, message=FALSE, warning=FALSE, error=FALSE}
library(BiocStyle)
library(HPAanalyze)
library(dplyr)
```

# Background 

The Human Protein Atlas (HPA) is a comprehensive resource for exploration of human proteome which contains a vast amount of proteomics and transcriptomics data generated from antibody-based tissue micro-array profiling and RNA deep-sequencing.

The program has generated protein expression profiles in human normal tissues with cell type-specific expression patterns, cancer and cell lines via an innovative immunohistochemistry-based approach. These profiles are accompanied by a large collection of high quality histological staining images, annotated with clinical data and quantification. The database also includes classification of protein into both functional classes (such as transcription factors or kinases) and project-related classes (such as candidate genes for cancer). Starting from version 4.0, the HPA includes subcellular location profiles generated based on confocal images of immunofluorescent stained cells. Together, these data provide a detailed picture of protein expression in human cells and tissues, facilitating tissue-based diagnostic and research.

Data from the HPA are freely available via proteinatlas.org, allowing scientists to access and incorporate the data into their research. Previously, the R package *hpar* has been created for fast and easy programmatic access of HPA data. Here, we introduce *HPAanalyze*, an R package aims to simplify exploratory data analysis from those data, as well as provide other complementary functionality to *hpar*.

## The different HPA data formats

The Human Protein Atlas project provides data via two main mechanisms: *Full datasets* in the form of downloadable compressed tab-separated files (.tsv) and *individual entries* in XML, RDF and TSV formats. The full downloadable datasets includes normal tissue, pathology (cancer), subcellular location, RNA gene and RNA isoform data. For individual entries, the XML format is the most comprehensive, providing information on the target protein, antibodies, summary for each tissue and detailed data from each sample including clinical data, IHC scoring and image download links.

## `HPAanalyze` overview

`HPAanalyze` is designed to fullfill 3 main tasks: (1) Import, subsetting and export downloadable datasets; (2) Visualization of downloadable datasets for exploratory analysis; and (3) Working with the individual XML files. This package aims to serve researchers with little programming experience, but also allow power users to use the imported data as desired.

# Visualize protein expression data

Currently, this is available for the normal tissue, pathology (cancers) and subcellular location datasets. The fastest and easiest way is to use the defaults of `hpaVis`.

```{r hpaVis_eg}
hpaVis(targetGene = c("GCH1", "PTS", "SPR", "DHFR"),
       targetTissue = c("cerebellum", "cerebral cortex", "hippocampus"),
       targetCancer = c("glioma"))
```

Of course, we cannot visualize everything in those big datasets, so some defauts will be used and you will receive some warning messages.

```{r hpaVis_eg2, eval = FALSE}
hpaVis()

# No data provided. Use version 18.
# targetGene variable not specified, default to TP53, RB1, MYC, KRAS and EGFR.
# targetTissue variable not specified, default to breast.
# targetCellType variable not specified, visualize all.
# targetCancer variable not specified, default to breast cancer
# Use hpaListParam() to list possible values for target variables.

```

You can also use `hpaVis` to show just one or two of the three graphs.

```{r hpaVis_eg3}
hpaVis(visType = "Patho",
       targetGene = c("GCH1", "PTS", "SPR", "DHFR"),
       targetCancer = c("glioma", "breast cancer"))
```

One exception though, if you want to plot all cancers, use `hpaVisPatho` with `targetCancer = NULL` (default).

```{r hpaVisPatho_eg}
hpaVisPatho(targetGene = c("GCH1", "PTS", "SPR", "DHFR"))
```

There are many ways you could customize your plots. Please see the documentation for more details.

```{r doc_ex1, eval = FALSE}
?hpaVis # the easy umbrella to visualize protein expression levels
?hpaVisTissue # in normal tissue
?hpaVisSubcell # in subcellular compartments
?hpaVisPatho # in cancers
```

If you want to know what kind of data you can visualize, use `hpaListParam`. You will receive a list of parameters you can use. Please note that if you ask the functions to plot anything that is not on this list, they will just ignore it and plot what are available.

```{r listParam_ex, eval = FALSE}
hpaListParam()
```
```{r listParam_ex2, echo= FALSE}
hpaListParam() %>% glimpse()
```

# Acquiring individual sample data from the Human Protein Atlas

HPA provide data in two different formats: the more convenient summarized data tables that are used for the `hpaVis` functions above, and detailed annotated data for every sample and every antibody in the xml format. There is one xml for each protein, which contain all data generated by HPA. However, extracting information from these xml files into a tidy format is a challenge. The `hpaXml` functions are designed to help you easily access this.

The easiest way is to use the umbrella `hpaXml()` function. Please note that at the moment this function only accept Ensembl gene id.

```{r eval=FALSE}
EGFR <- hpaXml(inputXml='ENSG00000146648')
names(EGFR)

#> [1] "ProtClass"     "TissueExprSum" "Antibody"      "TissueExpr"   
```

This function will return a list. The first item `ProtClass` is the known and predicted class of the queried protein. The second item `TissueExprSum` give you the quick summary of protein expression in normal tissue and image url to download one representative image. The third item `Antibody` provide information on all antibodied used by HPA to stain for the queried protein. 

The last item `TissueExpr` is a list of data frame, one for each of the antibodies. Each row will have everything available regarding a sample stained with an antibody (clinical data, protein expression via IHC and the Url to download the original image). If you find an empty data frame, that means the antibody was used only for IF staining to determine subcellular locations. Currently HPAanalyze does not support extraction of that data.

To better understand the output, please read the documentation for other `hpaXml` functions, which was called under the hood by `hpaXml()`.

```{r doc_ex2, eval = FALSE}
?hpaXmlGet # import the xml file as "xml_document"
?hpaXmlProtClass
?hpaXmlTissueExprSum
?hpaXmlAntibody
?hpaXmlTissueExpr
```

# Copyright
```{r child = 'data/copyright'}
```