--- title: "ALPS - AnaLysis routines for ePigenomicS data" author: Venu Thatikonda date: "`r Sys.Date()`" output: html_document: toc: true toc_depth: 3 theme: united number_sections: true highlight: tango bibliography: ALPS-ref.bib vignette: > %\VignetteIndexEntry{ALPS-vignette} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", dpi = 72, fig.align = "center" ) ``` # ALPS-Introduction **ALPS** (**A**na**L**ysis routines for e**P**igenomic**S** data) is an R package [@ALPS] that provides tools for the analysis and to produce publication-ready visualizations, mainly aimed at genome-wide epigenomics data, e.g. ChIP-seq, ATAC-seq etc. ## Bigwig files Bigwig files evolved to be a multi-purpose compressed binary format to store genome-wide data at base pair level. Bigwig files are mostly used to store genome-wide quantitative data such as ChIP-seq, ATAC-seq, WGBS, GRO-seq *etc*. Following figure illsutrates the important usecases with bigwig files.

## Generate bigwig files There are multiple ways one can generate bigwig files from BAM files, using [UCSC kent utils](https://genome.ucsc.edu/goldenpath/help/bigWig.html) [@kentUtils] or with the [deeptools bamCoverage](https://deeptools.readthedocs.io/en/develop/content/tools/bamCoverage.html) function [@deeptools_pub], which is the easiest way. Once the normalized bigwig files are generated and peaks are identified from BAM files, one would seldom use BAM files again in the entire workflow. The requirements of all downstream processes can be satisified with normalized bigwig files, e.g quantifying normalized read counts at peaks or promoters, visualizing enrichments in genome broswer or igv. After the peaks are identified, the immediate steps would be to quantify normalized read counts at the identified peaks in order to perform explorative data analysis (EDA), PCA, unsupervised clustering to identify patterns among samples under consideration and generate novel biological insights. ## ALPS - workflow `ALPS` package is designed in a way to start with a minimal set of input and to reach a rich source of insights from the data. At the most, most functions in `ALPS` require a data table with paths to bigwig files and associated sample meta information. Various functions will utilize this data table and generate downstream outputs. The package produces publication quality visualizations, of which most can be customized within R using `ggplot2` ecosystem. Following is the overview of the `ALPS` workflow and available functions

# Installation Install the `ALPS` package with the following code ```{r eval=FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("ALPS") ``` # Example usecases with ALPS To demonstrate the utility of different functions, `ALPS` package comes with a set of example files that were taken from TCGA consortium's ATAC-seq data from [here](https://gdc.cancer.gov/about-data/publications/ATACseq-AWG) published at [@TCGA_ATAC]. Following steps walk you through loading the example data and how to use different function and how to integrate function's output with other R/bioconductor packages to ease the workflow process. ```{r setup} ## load the library library(ALPS) ``` ## Calculate enrichments at genomic regions Most of the explorative analyses in epigenomics starts with quantifying enrichments or methylations at a set of genomic regions e.g. promoter regions or identified peak regions. This quantifications will be used as an input to downstream analyses such as PCA, clustering. The function `multiBigwig_summary` takes sample data table with bigwig paths and corresponding bed file paths calculates enrichments at genomic regions. This function is a wrapper around `rtracklayer` bigwig utilities [@rtracklayer_pub]. The function simultaneously generates the consensus peak-set from all the bed files present in input data table before calculating enrichments. Read data table from `ALPS` package ```{r} chr21_data_table <- system.file("extdata/bw", "ALPS_example_datatable.txt", package = "ALPS", mustWork = TRUE) ## attach path to bw_path and bed_path d_path <- dirname(chr21_data_table) chr21_data_table <- read.delim(chr21_data_table, header = TRUE) chr21_data_table$bw_path <- paste0(d_path, "/", chr21_data_table$bw_path) chr21_data_table$bed_path <- paste0(d_path, "/", chr21_data_table$bed_path) chr21_data_table %>% head ``` Now run the function `multiBigwig_summary` to calculate enrichments from all bigwig files by simultaneosly preparing consensus peak-set from all bed files in the column `bed_path` ```{r} enrichments <- multiBigwig_summary(data_table = chr21_data_table, summary_type = "mean", parallel = FALSE) enrichments %>% head ``` With little teaking, the output from `multiBigwig_summary` can be very easily integrated with other R/bioconductor packages for explorative analysis, PCA or clustering. The function `get_variable_regions` takes the output of `multiBigwig_summary` or a similar format and returns a `n` number of scaled variable regions, which can directly be used with tools such as [`ComplexHeatmap`](https://bioconductor.org/packages/release/bioc/html/ComplexHeatmap.html) [@complexheatmap]. Following is an example on how to integrate `multiBigwig_summary` output to `ComplexHeatmap` via `get_variable_regions` ```{r fig.align="center", fig.width=7} enrichments_matrix <- get_variable_regions(enrichments_df = enrichments, log_transform = TRUE, scale = TRUE, num_regions = 100) suppressPackageStartupMessages(require(ComplexHeatmap)) suppressPackageStartupMessages(require(circlize)) Heatmap(enrichments_matrix, name = "enrichments", col = colorRamp2(c(-1, 0, 1), c("green", "white", "red")), show_row_names = FALSE, show_column_names = TRUE, show_row_dend = FALSE, column_names_gp = gpar(fontsize = 8)) ``` ## Perform correlation among replicates/groups It is often of high interest in genome-wide quantitative data to check the correlations among replicates within a subgroup to identify specific patterns in the data. `plot_correlation` function is designed for such use cases. The function is compatible with the output of `multiBigwig_summary` and also with other tools output with similar format ```{r fig.align="center"} plot_correlation(enrichments_df = enrichments, log_transform = TRUE, plot_type = "replicate_level", sample_metadata = chr21_data_table) ``` Instead of correlations of replicates within and across groups, one can also do group level correlations after averaging all samples within a group. The argument `plot_type = "group_level"` in `plot_correlation` exactly does this. ```{r fig.align="center"} ## group_level plot_correlation(enrichments_df = enrichments, log_transform = TRUE, plot_type = "group_level", sample_metadata = chr21_data_table) ``` Either `replicate_level` or `group_level` plot appearance can be further modified with arguments that passed to [`corrplot::corrplot`](https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html) or [`GGally::ggpairs`](https://ggobi.github.io/ggally/#ggallyggpairs) respectively. ## Plot enrichments across groups Once the group-specific genomic regions (or peaks) identified with various differential enrichments packages, e.g. DESeq2, diffBind, QSEA, one would be interested to visualize enrichment qunatities across all samples of all groups to show magnitude of differnce in enrichments. `plot_enrichments` function takes a data.frame of enrichments, either the output from `multiBigwig_summary` or a similar format and plots enrichments in a combination of box and violin plots. The function is motivated by the paper [@raincloud] and a ggplot2 extension `gghalves` [@gghalves]. There are two ways one can plot enrichment differences, one way is to directly plot group level enrichments after averaging all samples within a group for each region and the other way is plotting paired conditions for each group, e.g. untreated, treated enrichments for a transcription factor. In both cases function needs a `sample_metadata` table along with the `enrichments` data.frame. Following example illustrates the uses of `plot_enrichments` function uses in different settings. If `plot_type = "separate"`, function plots group level enrichments ```{r fig.pos="center"} ## plot_type == "separate" plot_enrichments(enrichments_df = enrichments, log_transform = TRUE, plot_type = "separate", sample_metadata = chr21_data_table) ``` If `plot_type = "overlap"`, function plots box plots along with overlap violins to show the distributions in paired conditions. The `sample_metadata` for these plots require one more additional column which describes sample status. See the following example ```{r} ## plot_type == "overlap" enrichemnts_4_overlapviolins <- system.file("extdata/overlap_violins", "enrichemnts_4_overlapviolins.txt", package = "ALPS", mustWork = TRUE) enrichemnts_4_overlapviolins <- read.delim(enrichemnts_4_overlapviolins, header = TRUE) ## metadata associated with above enrichments data_table_4_overlapviolins <- system.file("extdata/overlap_violins", "data_table_4_overlapviolins.txt", package = "ALPS", mustWork = TRUE) data_table_4_overlapviolins <- read.delim(data_table_4_overlapviolins, header = TRUE) ## enrichments table enrichemnts_4_overlapviolins %>% head ## metadata table data_table_4_overlapviolins %>% head ``` ```{r fig.align="center"} plot_enrichments(enrichments_df = enrichemnts_4_overlapviolins, log_transform = FALSE, plot_type = "overlap", sample_metadata = data_table_4_overlapviolins, overlap_order = c("untreated", "treated")) ``` There are additional arguments available for both `separate` and `overlap` to modify the appearance (please check `?plot_enrichments`), moreover the function returns a `ggplot2` object which enables the user to change additional components of the plot. ## Plot UCSC genome browser like plots In any genome-wide epigenomic analyses, it is often interesting to check enrichments at certain genomic loci, e.g. various histone modifications at a genomic region that define a chromatin state or co-binding of different transcription factors at a promoter or enhancer element. The classical way to acheive this task is to load all bigwig files into IGV or create a data hub at UCSC and navigate to the region of interest. This is not always practical and needs a substantial manual effort, in addition one requires a UCSC genome browser server in order to get this task done with the unpublished data. To circumvent this problem, several R/bioconductor packages were designed (e.g. `Gviz`, `karyoploteR`). Even within R environment, one needs to put a significant effort to create UCSC genome browser like figures. The function `plot_broswer_tracks` in `ALPS` package requires a minimal input of a data table and a genomic region and produces a publication quality browser like plot. The function uses utilities within `Gviz` package to generate the visualizations [@gviz]. Following code snippet illustrates how one can use this function ```{r fig.height=10, fig.width=7, fig.align="center"} ## gene_range gene_range = "chr21:45643725-45942454" plot_browser_tracks(data_table = chr21_data_table, gene_range = gene_range, ref_gen = "hg38") ``` ## Annotate genomic regions One of the usual tasks in genome-wide epigenomic analyses is to identify the genomic locations of peaks/binding sites. This gives an overview of where a particular transcription factor frequently binds or where a particular type of histone modifications are observed. The function `get_genomic_annotations` utilizes the above data table and returns the percentage of peaks or binding sites found in each of the genomic features such as promoters, UTRs, intergenetic regions etc. This function is a wrapper around `ChIPseeker`'s `annotatePeak` function [@chipseeker]. Function also offers an option with `merge_level` to merge overlaping peaks from different samples at different levels. * `all` creates a consensus peak set by merging overlaping peaks from all samples present in the `data_table` * `group_level` creates a group level consensus peak set. Meaning overlaping peaks from all samples of each group will be merged * `none` does not create any consensus peak set. Per-sample genomic annotations will be returned ```{r} g_annotations <- get_genomic_annotations(data_table = chr21_data_table, ref_gen = "hg38", tss_region = c(-1000, 1000), merge_level = "group_level") g_annotations %>% head ``` ## Plot genomic annotations The results returned from `get_genomic_annotations` can directly be passed to the function `plot_genomic_annotations` to visualize the percentages of peaks in each feature. The function can produce visualizations either as bar plot or heatmap ```{r fig.height=5, fig.width=5.5, fig.align="center"} plot_genomic_annotations(annotations_df = g_annotations, plot_type = "heatmap") ``` ```{r fig.height=4, fig.width=8, fig.align="center"} plot_genomic_annotations(annotations_df = g_annotations, plot_type = "bar") ``` ## Plot motif representations Transcription factors bind to DNA sequences with particular nucleotide stretches. A collection of all binding sites for a transcription factor can be represented by a sequence motif. Typically, transcription factor enrichment analyses utilizes these motif information and provide whether a given transcription factor is enriched in a given set of genomic regions, e.g. enhancers. Currently, there are number of different databases which provide transcription factor motifs in very different format. The function `plot_motif_logo` takes input from various databases e.g. MEME, TRANSFAC, JASPAR, HOMER or a simple PFM and plots binding site representations either as a sequence logo or a barplot. Barplot representation has a several advantages over sequence logo which are described [here](http://www.ensembl.info/2018/10/15/new-ensembl-motif-features/). The logo plot utilizes the function `ggseqlogo` from [`ggseqlogo`](https://github.com/omarwagih/ggseqlogo) package [@ggseqlogo]. Following example illustrates how to use the function to plot motif representations ```{r fig.align="center"} myc_transfac <- system.file("extdata/motifs", "MA0147.2.transfac", package = "ALPS", mustWork = TRUE) ## bar plot plot_motif_logo(motif_path = myc_transfac, database = "transfac", plot_type = "bar") ## logo plot plot_motif_logo(motif_path = myc_transfac, database = "transfac", plot_type = "logo") ``` # Acknowledgements `ALPS` package benefited suggestions from * [Gyan Prakash Mishra](https://twitter.com/gprakash047), Institute of Life Sciences * [Kevin Blighe](https://www.biostars.org/u/41557), Visiting professor * [Devon P Ryan](https://twitter.com/dpryan79), Max Plank Institute of Immunobiology and Epigenetics * Michael Fletcher, DKFZ/German Cancer Research Center * [Sequence logo as barplot](https://stackoverflow.com/questions/1611215/remove-a-git-commit-which-has-not-been-pushed) # Session info ```{r} sessionInfo() ``` # References