--- title: "`chipenrich`: Gene Set Enrichment For ChIP-seq Peak Data" author: "Ryan P. Welch, Chee Lee, Raymond G. Cavalcante, Chris Lee, Kai Wang, Laura J. Scott, Maureen A. Sartor" date: "`r Sys.Date()`" output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{chipenrich_vignette} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # Introduction Gene set enrichment (GSE) testing enables the interpretation of lists of genes (e.g. from RNA-seq), or lists of genomic regions (e.g. from ChIP-seq), in terms of pathways and other biologically meaningful sets of genes. The `chipenrich` package is designed for GSE testing of large sets of genomic regions with different properties. The primary innovation of `chipenrich` is its accounting for biases that are known to affect the Type I error of such testing, and other properties of the data. It offers many options for gene set databases, and several other methods/options: tests for wide versus narrow genomic regions, a thousand versus > a million genomic regions, regions in promoters vs enhancers vs gene bodies, or accounting for the strength/score of each genomic region. The `chipenrich` package includes different enrichment methods for different use cases: * `broadenrich()` is designed for use with broad peaks that may intersect multiple gene loci, and cumulatively cover greater than 5% of the genome. For example, most ChIP-seq experiments for histone modifications. * `chipenrich()` is designed for use with 1,000s or 10,000s of narrow genomic regions which results in a relatively small percent of genes being assigned a genomic region. For example, many ChIP-seq experiments for transcription factors. * `polyenrich()` is also designed for narrow peaks, but for experiments with > ~50,000 genomic regions, or in cases where the number of binding sites per gene is thought to be important. If unsure whether to use chipenrich or polyenrich, then we recommend hybridenrich. * `hybridenrich()` is used when one is unsure of which method, between ChIP-Enrich or Poly-Enrich, is the optimal method. It is slightly more conservative than either test for an individual gene set, but is usually more powerful overall. We recommend using polyenrich(method='polyenrich-weighted', multiAssign=TRUE, .) with any enhancer locus definition. See example below for more details. # Concepts and Usage ```{r} library(chipenrich) ``` Due to several required dependencies, installation may take some time if the dependencies are not already installed. ## Peaks A ChIP-seq peak is a genomic region that represents a transcription factor binding event or the presence of a histone complex with a particular histone modification. Typically peaks are called with a peak caller (such as [MACS2](https://github.com/taoliu/MACS) or [PePr](https://github.com/shawnzhangyx/PePr/)) and represent relative enrichment of reads in a sample where the antibody is present versus input. Typically, peaks are output by a peak caller in [`BED`-like](https://genome.ucsc.edu/FAQ/FAQformat.html) format. User input for `chipenrich()`, `broadenrich()`, or `polyenrich()` may be peaks called from a ChIP-seq or ATAC-seq experiment or other large sets of genomic regions (e.g. from a family of repetitive elements or hydroxymethylation experiments) but we shall continue to refer to the input genomic regions as 'peaks' for simplicity. Peaks can be input as either a file path or a `data.frame`. If a file path, the following formats are fully supported via their file extensions: `.bed`, `.broadPeak`, `.narrowPeak`, `.gff3`, `.gff2`, `.gff`, and `.bedGraph` or `.bdg`. BED3 through BED6 files are supported under the `.bed` extension ([BED specification](https://genome.ucsc.edu/FAQ/FAQformat.html)). Files without these extensions are supported under the conditions that the first 3 columns correspond to `chr`, `start`, and `end` and that there is either no header column, or it is commented out. Files may be compressed with `gzip`, and so might end in `.narrowPeak.gz`, for example. For files with extension support, the `rtracklayer::import()` function is used to read peaks, so adherence to the mentioned file formats is necessary. If peaks are already in the R environment as a `data.frame`, the `GenomicRanges::makeGRangesFromDataFrame()` function is used to convert to a `GRanges` object. For the acceptable column names needed for correct interpretation, see `?GenomicRanges::makeGRangesFromDataFrame`. For the purpose of the vignette, we'll load some ChIP-seq peaks from the `chipenrich.data` companion package: ```{r, warning = FALSE, message = FALSE} data(peaks_E2F4, package = 'chipenrich.data') data(peaks_H3K4me3_GM12878, package = 'chipenrich.data') head(peaks_E2F4) head(peaks_H3K4me3_GM12878) ``` ## Genomes Genomes for fly, human, mouse, rat, and zebrafish are supported. Particular supported genome builds are given by: ```{r} supported_genomes() ``` ## Locus Definitions A gene locus definition is a way of defining the gene regulatory regions, and enables us to associate peaks with genes. The terms 'gene regulatory region' and 'gene locus' are used interchangeably in the vignette. An example is the '5kb' gene locus definition, which assigns the 5000 bp immediately up- and down- stream of transcription start sites (TSS) to the respective gene. A locus definition can also express from where one expects a transcription factor or histone modification to regulate genes. For example, a locus definition defined as 1kb upstream and downstream of a TSS (the 1kb definition) would capture TFs binding in proximal-promoter regions. ### Built-in locus definitions A number of locus definitions representing different regulatory paradigms are included in the package: * `nearest_tss`: The locus is the region spanning the midpoints between the TSSs of adjacent genes. Thus, each genomic region is assigned to the gene with the nearest TSS. * `nearest_gene`: The locus is the region spanning the midpoints between the boundaries of each gene, where a gene is defined as the region between the furthest upstream TSS and furthest downstream TES for that gene. If gene loci overlap, the midpoint of the overlap is used as a border. If a gene locus is nested in another, the larger locus is split in two. * `exon`: Each gene has multiple loci corresponding to its exons. Overlaps between different genes are allowed. * `intron`: Each gene has multiple loci corresponding to its introns. Overlaps between different genes are allowed. * `1kb`: The locus is the region within 1kb of any of the TSSs belonging to a gene. If TSSs from two adjacent genes are within 2 kb of each other, we use the midpoint between the two TSSs as the boundary for the locus for each gene. * `1kb_outside_upstream`: The locus is the region more than 1kb upstream from a TSS to the midpoint between the adjacent TSS. * `1kb_outside`: The locus is the region more than 1kb upstream or downstream from a TSS to the midpoint between the adjacent TSS. * `5kb`: The locus is the region within 5kb of any of the TSSs belonging to a gene. If TSSs from two adjacent genes are within 10 kb of each other, we use the midpoint between the two TSSs as the boundary for the locus for each gene. * `5kb_outside_upstream`: The locus is the region more than 5kb upstream from a TSS to the midpoint between the adjacent TSS. * `5kb_outside`: The locus is the region more than 5kb upstream or downstream from a TSS to the midpoint between the adjacent TSS. * `10kb`: The locus is the region within 10kb of any of the TSSs belonging to a gene. If TSSs from two adjacent genes are within 20 kb of each other, we use the midpoint between the two TSSs as the boundary for the locus for each gene. * `10kb_outside_upstream`: The locus is the region more than 10kb upstream from a TSS to the midpoint between the adjacent TSS. * `10kb_outside`: The locus is the region more than 10kb upstream or downstream from a TSS to the midpoint between the adjacent TSS. The complete listing of genome build and locus definition pairs can be listed with `supported_locusdefs()`: ```{r} # Take head because it's long head(supported_locusdefs()) ``` ### Custom locus definitions Users can create custom locus definitions for any of the `supported_genomes()`, and pass the file path as the value of the `locusdef` parameter in `broadenrich()`, `chipenrich()`, or `polyenrich()`. Custom locus definitions should be defined in a tab-delimited text file with column names `chr`, `start`, `end`, and `gene_id`. For example: ```{r, eval=FALSE} chr start end geneid chr1 839460 839610 148398 chr1 840040 840190 148398 chr1 840040 840190 57801 chr1 840800 840950 148398 chr1 841160 841310 148398 ``` ### Selecting a locus definition For a transcription factor ChIP-seq experiment, selecting a particular locus definition for use in enrichment testing implies how the TF is assumed to regulate genes. For example, selecting the `1kb` locus definition will imply that the biological processes found enriched are a result of TF regulation near the promoter. In contrast, selecting the `5kb_outside` locus definition will imply that the biological processes found enriched are a result of TF regulation distal from the promoter. Selecting a locus definition can also help reduce the noise in the enrichment tests. For example, if a TF is known to primarily regulate genes by binding around the promoter, then selecting the `1kb` locus definition can help to reduce the noise from TSS-distal peaks in the enrichment testing. The [`plot_dist_to_tss()` QC plot](#peak-distance-to-tss-distribution) displays where peak midpoints fall relative to TSSs genome-wide, and can help inform the choice of locus definition. For example, if many peaks fall far from the TSS, the `nearest_tss` locus definition may be a good choice because it will capture *all* peaks, whereas the `1kb` locus definition may not capture many of the peaks and adversely affect the enrichment testing. ## Gene Sets Gene sets are sets of genes that represent a particular biological function. ### Built-in gene sets Gene sets for fly, human, mouse, rat, and zebrafish are built in to `chipenrich`. Some organisms have gene sets that others do not, so check with: ```{r} # Take head because it's long head(supported_genesets()) ``` Descriptions of our built-in gene sets: * `GOBP`: Gene Ontology-Biological Processes, Bioconductor ver. 3.4.2. (http://www.geneontology.org/) * `GOCC`: Gene Ontology-Cell Component, Bioconductor ver. 3.4.2. (geneontology.org) * `GOMF`: Gene Ontology-Molecular Function, Bioconductor ver 3.4.2. (geneontology.org) * `biocarta_pathway`: BioCarta Pathway, Ver 6.0. (cgap.nci.nih.gov/Pathways/BioCarta_Pathways) * `ctd`: Comparative Toxicogenomics Database, Last updated June 06, 2017. Groups of genes that interact with specific chemicals to help understand enviromental exposures that affect human health. (ctdbase.org) * `cytoband`: Cytobands (NCBI). Groups of genes that reside in the same area of a chromosome. * `drug_bank`: Sets of gene that are targeted by a specific drug. Ver 5.0.7. (www.drugbank.ca) * `hallmark`: Hallmark gene sets (MSigDB). Ver 6.0. Specific biological states or processes that display coherent expression. (software.broadinstitute.org/gsea/msigdb/collections.jsp) * `immunologic`: Immunologic signatures (MSigDB). Ver 6.0. Gene sets that represent cell states within the immune system. (software.broadinstitute.org/gsea/msigdb/collections.jsp) * `kegg_pathway`: Kyoto Encyclopedia of Genes and Genomes. Ver 3.2.3. (genome.jp/kegg) * `mesh`: Gene Annotation with MeSH, the National Library of Medicine's controlled vocabulary for biology and medicine. Useful for testing hypotheses related to diseases, processes, other genes, confounders such as populations and experimental techniques, etc. based on knowledge from the literature that may not yet be formally described in any other gene sets. Last updated ~2013. (gene2mesh.ncibi.org) * `metabolite`: Metabolite concepts, defined from Edinburgh Human Metabolic Network database (Ma, et al., 2007) Contains gene sets coding for metabolic enzymes that catalyze reactions involving the respective metabolite. * `microrna`: microRNA targets (MSigDB). Ver 6.0. Gene sets containing genes with putative target sites of human mature miRNA. (software.broadinstitute.org/gsea/msigdb/collections.jsp) * `oncogenic`: Oncogenic signatures (MSigDB). Ver 6.0. Gene sets that represent signatures of pathways often disregulated in cancer. (software.broadinstitute.org/gsea/msigdb/collections.jsp) * `panther_pathway`: PANTHER Pathway. Ver 3.5. Contains primarily signaling pathways with subfamilies. (pantherdb.org/pathway) * `pfam`: Pfam Ver 31.0 (March 2017). A large collection of protein families. (pfam.xfam.org) * `protein_interaction_biogrid`: Protein Interaction from Biological General Repository for Interaction Datasets. Ver 3.4.151. (thebiogrid.org) * `reactome`: Reactome Pathway Database. Ver 61. (reactome.org) * `transcription_factors`: Transcription Factors (MSigDB). Ver 6.0. Gene sets that share upstream cis-regulatory motifs which can function as potential transcription factor binding sites. (software.broadinstitute.org/gsea/msigdb/collections.jsp) ### Custom gene sets Users can perform GSE on custom gene sets for any supported organism by passing the file path as the value of `genesets` parameter in `broadenrich()`, `chipenrich()`, `polyenrich()`, or `hybridenrich()`. Custom gene set definitions should be defined in a tab-delimited text file with a header. The first column should be the geneset ID or name, and the second column should be the Entrez IDs belonging to the geneset. For example: ```{r, eval=FALSE} gs_id gene_id GO:0006631 30 GO:0006631 31 GO:0006631 32 GO:0006631 33 GO:0006631 34 GO:0006631 35 GO:0006631 36 GO:0006631 37 GO:0006631 51 GO:0006631 131 GO:0006631 183 GO:0006631 207 GO:0006631 208 GO:0006631 215 GO:0006631 225 ``` If a gene set from another database is in the form of a list with an array of genes (e.g. EGSEA PMID: 29333246), you can convert it into the neccessary form by the following and input the path directly to the `genesets` parameter. Note: Converting from R to text back to R is quite redundant and we are planning to be able to bypass this step. ```{r eval = F} library(EGSEAdata) egsea.data("mouse") temp = Mm.H #Or some other gene sets geneset = do.call(rbind,lapply(1:length(temp), function(index) {data.frame(gs_id = rep(names(temp)[index], length(temp[[index]])),gene_id = unlist(strsplit(temp[[index]],split = " ")),stringsAsFactors = F)})) write.table(geneset, "./custom_geneset.txt", quote=F, sep="\t",row.names = F, col.names = T) ``` ## Mappability We define base pair mappability as the average read mappability of all possible reads of size K that encompass a specific base pair location, $b$. Mappability files from UCSC Genome Browser mappability track were used to calculate base pair mappability. The mappability track provides values for theoretical read mappability, or the number of places in the genome that could be mapped by a read that begins with the base pair location $b$. For example, a value of 1 indicates a Kmer read beginning at $b$ is mappable to one area in the genome. A value of 0.5 indicates a Kmer read beginning at $b$ is mappable to two areas in the genome. For our purposes, we are only interested in uniquely mappable reads; therefore, all reads with mappability less than 1 were set to 0 to indicate non-unique mappability. Then, base pair mappability is calculated as: $$ \begin{equation} M_{i} = (\frac{1}{2K-1}) \sum_{j=i-K+1}^{i+(K-1)} M_{j} \end{equation} $$ where $M_{i}$ is the mappability of base pair $i$, and $M_{j}$ is mappability (from UCSC's mappability track) of read $j$ where j is the start position of the K length read. ### Built-in mappability Base pair mappability for reads of lengths 24, 36, 40, 50, 75, and 100 base pairs for `hg19` and for reads of lengths 36, 40, 50, 75, and 100 base pairs `mm9` a included. See the complete list with: ```{r} # Take head because it's long head(supported_read_lengths()) ``` ### Custom mappability Users can use custom mappability with any built-in locus definition (if, for example, the read length needed is not present), or with a custom locus definition. Custom mappability should be defined in a tab-delimited text file with columns named `gene_id` and `mappa`. Gene IDs should be Entrez Gene IDs, and mappability should be in [0,1]. A check is performed to verify that the gene IDs in the locus definition and mappability overlap by at least 95\%. An example custom mappability file looks like: ```{r, eval=FALSE} mappa gene_id 0.8 8487 0.1 84 0.6 91 1 1000 ``` ## Testing for enrichment As stated in the introduction, the `chipenrich` package includes three classes of methods for doing GSE testing. For each method, we describe the intended use case, the model used for enrichment, and an example using the method. ### `broadenrich()` Broad-Enrich is designed for use with broad peaks that may intersect multiple gene loci, and/or cumulatively cover greater than 5\% of the genome. For example, ChIP-seq experiments for histone modifications or large sets of copy number alterations. The Broad-Enrich method uses the cumulative peak coverage of genes in its logistic regression model for enrichment: `GO ~ ratio + s(log10_length)`. Here, `GO` is a binary vector indicating whether a gene is in the gene set being tested, `ratio` is a numeric vector indicating the ratio of the gene covered by peaks, and `s(log10_length)` is a binomial cubic smoothing spline which adjusts for the relationship between gene coverage and locus length. ```{r, warning = FALSE, message = FALSE} gs_path = system.file('extdata','vignette_genesets.txt', package='chipenrich') results = broadenrich(peaks = peaks_H3K4me3_GM12878, genome = 'hg19', genesets = gs_path, locusdef = "nearest_tss", qc_plots = FALSE, out_name = NULL, n_cores=1) results.be = results$results print(results.be[1:5,1:5]) ``` ### `chipenrich()` ChIP-Enrich is designed for use with 1,000s or 10,000s of narrow genomic regions which results in a relatively small percent of genes being assigned a genomic region. For example, many ChIP-seq experiments for transcription factors. The ChIP-Enrich method uses the presence of a peak in its logistic regression model for enrichment: `peak ~ GO + s(log10_length)`. Here, `GO` is a binary vector indicating whether a gene is in the gene set being tested, `peak` is a binary vector indicating the presence of a peak in a gene, and `s(log10_length)` is a binomial cubic smoothing spline which adjusts for the relationship between the presence of a peak and locus length. ```{r, warning = FALSE, message = FALSE} # Without mappability gs_path = system.file('extdata','vignette_genesets.txt', package='chipenrich') results = chipenrich(peaks = peaks_E2F4, genome = 'hg19', genesets = gs_path,locusdef = "nearest_tss", qc_plots = FALSE, out_name = NULL, n_cores = 1) results.ce = results$results print(results.ce[1:5,1:5]) ``` ```{r, warning = FALSE, message = FALSE} # With mappability gs_path = system.file('extdata','vignette_genesets.txt', package='chipenrich') results = chipenrich(peaks = peaks_E2F4, genome = 'hg19', genesets = gs_path, locusdef = "nearest_tss", mappability=24, qc_plots = FALSE,out_name = NULL,n_cores=1) results.cem = results$results print(results.cem[1:5,1:5]) ``` ### `polyenrich()` Poly-Enrich is also designed for narrow peaks, for experiments with 100,000s of peaks, or in cases where the number of binding sites per gene affects its regulation. If unsure whether to use chipenrich or polyenrich, then we recommend hybridenrich. The Poly-Enrich method uses the number of peaks in genes in its negative binomial regression model for enrichment: `num_peaks ~ GO + s(log10_length)`. Here, `GO` is a binary vector indicating whether a gene is in the gene set being tested, `num_peaks` is a numeric vector indicating the number of peaks in each gene, and `s(log10_length)` is a negative binomial cubic smoothing spline which adjusts for the relationship between the number of peaks in a gene and locus length. ```{r, warning = FALSE, message = FALSE} gs_path = system.file('extdata','vignette_genesets.txt', package='chipenrich') results = polyenrich(peaks = peaks_E2F4, genome = 'hg19', genesets = gs_path, method = 'polyenrich', locusdef = "nearest_tss", qc_plots = FALSE, out_name = NULL, n_cores = 1) results.pe = results$results print(results.pe[1:5,1:5]) ``` Poly-Enrich can also be used for linking human distal enhancer regions to their target genes, which are not necessarily the adjacent genes. We optimized human distal enhancer to target gene locus definitions (locusdef="enhancer" or locusdef="enhancer_plus5kb"). locusdef="enhancer" uses only distal regions >5kb from a TSS, while locusdef="enhancer_plus5kb" combines distal enhancers (>5kb from a TSS) with promoters (<=5kb from a TSS) to capture all genomic regions. Poly-Enrich is strongly recommended when using either the 'enhancer' or 'enhancer_plus5kb' gene locus definition, because only polyenrich is able to properly split the weight of genomic regions that are assigned to multiple genes (multiAssign=TRUE). The performance of Poly-Enrich using the enhancer locusdefs can be found in our recent study (details in reference 5). Like chipenrich, polyenrich is designed for narrow peaks, but for experiments with > ~50,000 genomic regions, or in cases where the number of binding sites per gene is thought to be important. Poly-Enrich also allows weighting of individual genomic regions based on a score, which can be useful for differential methylation enrichment analysis and ChIP-seq. Currently the options are: `signalValue` and `logsignalValue`. `signalValue` weighs each genomic region or peak based on the Signal Value given in the narrowPeak format or a user-supplied column (column name should be `signalValue`), while `logsignalValue` takes the log of these values. To use weighted Poly-Enrich, use method = `polyenrich_weighted`. When using the `enhancer` or `enhancer_plus5kb` locus definition, we recommend selecting the `polyenrich_weighted` method, which in this case will automatically set multiAssign as `TRUE`. ```{r, warning = FALSE, message = FALSE} gs_path = system.file('extdata','vignette_genesets.txt', package='chipenrich') results = polyenrich(peaks = peaks_E2F4, genome = 'hg19', genesets = gs_path, method="polyenrich_weighted", locusdef = "enhancer_plus5kb", qc_plots = FALSE, out_name = NULL, n_cores = 1) results.pe = results$results print(results.pe[1:5,1:5]) ``` ### `hybridenrich()` The hybrid method is used when one is unsure of which method, between ChIP-Enrich or Poly-Enrich, is the optimal method, and the most statistically powerful results are desired for each gene set. The runtime for this method, however, is ~2X that of the others. The hybrid p-value is given as `2*min(chipenrich_pvalue, polyenrich_pvalue)`. This test will retain the same Type 1 level and will be a consistent test if one of chipenrich or polyenrich is consistent. This can be extended to any number of tests, but currently we only allow a hybrid test for chipenrich and polyenrich. For more information about chipenrich or polyenrich, see their respective sections. ```{r, warning = FALSE, message = FALSE} gs_path = system.file('extdata','vignette_genesets.txt', package='chipenrich') results = hybridenrich(peaks = peaks_E2F4, genome = 'hg19', genesets = gs_path, locusdef = "nearest_tss", qc_plots = F, out_name = NULL, n_cores = 1) results.hybrid = results$results print(results.hybrid[1:5,1:5]) ``` ### `proxReg()` The proximity to regulatory regions (proxReg) test is a complementary test to any gene set enrichment test on a set of genomic regions, not just exclusive to the methods in this package. It tests if the genomic regions tend to be closer to (or farther from) gene transcription start sites or enhancer regions in each gene set tested. Currently, testing proximity to enhancer regions is only compatible with the hg19 genome. The purpose of ProxReg is to provide additional information for interpreting gene set enrichment test results, as a gene set enrichment test alone does not give information about whether the genomic regions occur near promoter or enhancer regions. ProxReg first calculates the distance between the midpoints of peaks and the nearest transcription start site or the nearest enhancer region midpoint for each genomic region. Each genomic region is then assigned to the gene with the nearest transcription start site. The distances are then classified according to whether the gene is in the gene set or not, and a signed Wilcoxon rank-sum test is used to calculate if the regions are closer or farther in the gene set than average. ```{r, warning = FALSE, message = FALSE} gs_path = system.file('extdata','vignette_genesets.txt', package='chipenrich') results.prox = proxReg(peaks_E2F4, reglocation = 'tss', genome = 'hg19', genesets=gs_path, out_name=NULL) results.prox = results.prox$results print(results.prox[1:5,1:7]) ``` ## QC Plots Each enrich function outputs QC plots if `qc_plots = TRUE`. There are also stand-alone functions to make the QC plots without the need for GSE testing. The QC plots can be used to help determine which locus definition to use, or which enrichment method is more appropriate. ### Peak distance to TSS distribution This plot gives a distribution of the distance of the peak midpoints to the TSSs. It can help in [selecting a locus definition](#selecting-a-locus-definition). For example, if genes are primarily within 5kb of TSSs, then the `5kb` locus definition may be a good choice. In contrast, if most genes fall far from TSSs, the `nearest_tss` locus definition may be a good choice. ```{r, fig.align='center', fig.cap='E2F4 peak distances to TSS', fig.height=6, fig.width=6, fig.show='hold', warning = FALSE, message = FALSE} # Output in chipenrich and polyenrich plot_dist_to_tss(peaks = peaks_E2F4, genome = 'hg19') ``` ### Presence of peak versus locus length This plot visualizes the relationship between the presence of at least one peak in a gene locus and the locus length (on the log10 scale). For clarity of visualization, each point represents 25 gene loci binned after sorting by locus length. The expected fit under the assumptions of Fisher's Exact Test (horizontal line), and a binomial-based test (gray curve) are displayed to indicate how the dataset being enriched conforms to the assumption of each. The empirical spline used in the `chipenrich` method is in orange. ```{r, fig.align='center', fig.cap='E2F4 chipenrich spline without mappability', fig.height=6, fig.width=6, fig.show='hold', warning = FALSE, message = FALSE} # Output in chipenrich plot_chipenrich_spline(peaks = peaks_E2F4, locusdef = 'nearest_tss', genome = 'hg19') ``` ### Number of peaks versus locus length This plot visualizes the relationship between the number of peaks assigned to a gene and the locus length (on the log10 scale). For clarity of visualization, each point represents 25 gene loci binned after sorting by locus length. The empirical spline used in the `polyenrich` method is in orange. If many gene loci have multiple peaks assigned to them, `polyenrich` is likely an appropriate method. If there are a low number of peaks per gene, then `chipenrich()` may be the appropriate method. ```{r, fig.align='center', fig.cap='E2F4 polyenrich spline without mappability', fig.height=6, fig.width=6, fig.show='hold', warning = FALSE, message = FALSE} # Output in polyenrich plot_polyenrich_spline(peaks = peaks_E2F4, locusdef = 'nearest_tss', genome = 'hg19') ``` ### Gene coverage versus locus length This plot visualizes the relationship between proportion of the gene locus covered by peaks and the locus length (on the log10 scale). For clarity of visualization, each point represents 25 gene loci binned after sorting by locus length. ```{r, fig.align='center', fig.cap='H3K4me3 gene coverage', fig.height=6, fig.width=6, fig.show='hold', warning = FALSE, message = FALSE} # Output in broadenrich plot_gene_coverage(peaks = peaks_H3K4me3_GM12878, locusdef = 'nearest_tss', genome = 'hg19') ``` ## Output The output of `broadenrich()`, `chipenrich()`, and `polyenrich()` is a list with components corresponding to each section below. If `out_name` is not `NULL`, then a file for each component will be written to the `out_path` with prefixes of `out_name`. ### Assigned peaks Peak assignments are stored in `$peaks`. This is a peak-level summary. Each line corresponds to a peak intersecting a particular gene locus defined in the selected locus definition. In the case of `broadenrich()` peaks may be assigned to multiple gene loci. Doing `table()` on the `peak_id` column will indicate how many genes are assigned to each peak. ```{r} head(results$peaks) ``` ### Peaks-per-gene Peak information aggregated over gene loci is stored in `$peaks_per_gene`. This is a gene-level summary. Each line corresponds to aggregated peak information over the `gene_id` such as the number of peaks assigned to the gene locus or the ratio of the gene locus covered in the case of `broadenrich()`. ```{r} head(results$peaks_per_gene) ``` ### Gene set enrichment results GSE results are stored in `$results`. For convenience, gene set descriptions are provided in addition to the gene set ID (which is the same as the ID from the originating database). The `Status` column takes values of `enriched` if the `Effect` is \> 0 and `depleted` if \< 0, with `enriched` results being of primary importance. Finally, the `Geneset.Peak.Genes` column gives a list of gene IDs that had signal contributing to the test for enrichment. This list can be used to cross reference information from `$peaks` or `$peaks_per_gene` if desired. ```{r} head(results$results) ``` ## Assessing Type I Error with Randomizations Randomization of locus definitions allows for the assessment of Type I Error under the null hypothesis of no true gene set enrichment. A well-calibrated Type I Error means that the false positive rate is controlled, and the p-values reported for actual data can be trusted. In both Welch & Lee, et al. and Cavalcante, et al., we demonstrated that both `chipenrich()` and `broadenrich()` have well-calibrated Type I Error over dozens of publicly available ENCODE ChIP-seq datasets. Unpublished data suggests the same is true for `polyenrich()`. Within `chipenrich()`, `broadenrich()`, and `polyenrich()`, the `randomization` parameters can be used to assess the Type I Error for the data being analyzed. The randomization codes, and their effects are: * `NULL`: No randomizations, the default. * `complete`: Shuffle the `gene_id` and `symbol` columns of the `locusdef` together, without regard for the chromosome location, or locus length. The null hypothesis is that there is no true gene set enrichment. * `bylength`: Shuffle the `gene_id` and `symbol` columns of the `locusdef` together, within bins of 100 genes sorted by locus length. The null hypothesis is that there is no true gene set enrichment, but with preserved locus length relationship. * `bylocation`: Shuffle the `gene_id` and `symbol` columns of the `locusdef` together, within bins of 50 genes sorted by genomic location. The null hypothesis is that there is no true gene set enrichment, but with preserved genomic location. The return value of `chipenrich()`, `broadenrich()`, or `polyenrich()` with a selected randomization is the same list object described above. To assess the Type I error, the `alpha` level for the particular data set can be calculated by dividing the total number of gene sets with p-value < `alpha` by the total number of tests tested. Users may want to perform multiple randomizations for a set of peaks and take the median of the `alpha` values. ```{r, warning = FALSE, message = FALSE} # Assessing if alpha = 0.05 gs_path = system.file('extdata','vignette_genesets.txt', package='chipenrich') results = chipenrich(peaks = peaks_E2F4, genome = 'hg19', genesets = gs_path, locusdef = "nearest_tss", qc_plots = FALSE, randomization = 'complete', out_name = NULL, n_cores = 1) alpha = sum(results$results$P.value < 0.05) / nrow(results$results) # NOTE: This is for print(alpha) ``` # References 1. R.P. Welch^, C. Lee^, R.A. Smith, P. Imbriano, S. Patil, T. Weymouth, L.J. Scott, M.A. Sartor. "ChIP-Enrich: gene set enrichment testing for ChIP-seq data." Nucl. Acids Res. (2014) 42(13):e105. [doi:10.1093/nar/gku463](https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/gku463) 2. R.G. Cavalcante, C. Lee, R.P. Welch, S. Patil, T. Weymouth, L.J. Scott, and M.A. Sartor. "Broad-Enrich: functional interpretation of large sets of broad genomic regions." Bioinformatics (2014) 30(17):i393-i400 [doi:10.1093/bioinformatics/btu444](https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btu444) 3. C.T. Lee , R.G. Cavalcante, C. Lee, T. Qin, S. Patil, S. Wang, Z. Tsai, A.P. Boyle, M.A. Sartor. "Poly-Enrich: Count-based Methods for Gene Set Enrichment Testing with Genomic Regions." NAR genomics and bioinformatics 2.1 (2020): lqaa006. [doi.org/10.1093/nargab/lqaa006 ](https://academic.oup.com/nargab/article/2/1/lqaa006/5728474) 4. C.T. Lee, K. Wang, T. Qin, M.A. Sartor. "Testing proximity of genomic regions to transcription start sites and enhancers complements gene set enrichment testing." Frontiers in genetics 11 (2020): 199. [doi.org/10.3389/fgene.2020.00199](https://www.frontiersin.org/articles/10.3389/fgene.2020.00199/full) 5. T. Qin, C.T. Lee, R.G. Cavalcante, P. Orchard, H. Yao, H. Zhang, S. Wang, S. Patil, A.P. Boyle, M.A. Sartor, "Comprehensive enhancer-target gene assignments improve gene set level interpretation of genome-wide regulatory data." (2020) bioRxiv. [doi.org/10.1101/2020.10.22.351049](https://www.biorxiv.org/content/10.1101/2020.10.22.351049v1)