# Correction for multiple testing ## Overview The false discovery rate (FDR) is usually the most appropriate measure of error for high-throughput experiments. Control of the FDR can be provided by applying the Benjamini-Hochberg (BH) method [@benjamini1995] to a set of $p$-values. This is less conservative than the alternatives (e.g., Bonferroni) yet still provides some measure of error control. The most obvious approach is to apply the BH method to the set of $p$-values across all windows. This will control the FDR across the set of putative DB windows. However, the FDR across all detected windows is not necessarily the most relevant error rate. Interpretation of ChIP-seq experiments is more concerned with regions of the genome in which (differential) protein binding is found, rather than the individual windows. In other words, the FDR across all detected DB regions is usually desired. This is not equivalent to that across all DB windows as each region will often consist of multiple overlapping windows. Control of one will not guarantee control of the other [@lun2014]. To illustrate this difference, consider an analysis where the FDR across all window positions is controlled at 10\%. In the results, there are 18 adjacent window positions in one region and 2 windows in a separate region. The first set of windows is a truly DB region whereas the second set is a false positive. A window-based interpretation of the FDR is correct as only 2 of the 20 window positions are false positives. However, a region-based interpretation results in an actual FDR of 50%. To avoid misinterpretation of the FDR, *[csaw](https://bioconductor.org/packages/3.18/csaw)* provides a number of strategies to obtain region-level results. This involves defining the regions of interest - possibly from the windows themselves - and converting per-window statistics into a $p$-value for each region. Application of the BH method to the per-region $p$-values will then control the relevant FDR across regions. These strategies are demonstrated below using the NF-YA data. ## Grouping windows into regions ### Quick and dirty clustering {#sec:cluster} The `mergeWindows()` function provides a simple single-linkage algorithm to cluster windows into regions. Windows that are less than `tol` apart are considered to be adjacent and are grouped into the same cluster. The chosen `tol` represents the minimum distance at which two binding events are treated as separate sites. Large values (500 - 1000 bp) reduce redundancy and favor a region-based interpretation of the results, while smaller values (< 200 bp) allow resolution of individual binding sites.
```r #--- loading-files ---# library(chipseqDBData) tf.data <- NFYAData() tf.data bam.files <- head(tf.data$Path, -1) # skip the input. bam.files #--- counting-windows ---# library(csaw) frag.len <- 110 win.width <- 10 param <- readParam(minq=20) data <- windowCounts(bam.files, ext=frag.len, width=win.width, param=param) #--- filtering ---# binned <- windowCounts(bam.files, bin=10000, param=param) fstats <- filterWindowsGlobal(data, binned) filtered.data <- data[fstats$filter > log2(5),] #--- normalization ---# filtered.data <- normFactors(binned, se.out=filtered.data) #--- modelling ---# cell.type <- sub("NF-YA ([^ ]+) .*", "\\1", head(tf.data$Description, -1)) design <- model.matrix(~cell.type) colnames(design) <- c("intercept", "cell.type") library(edgeR) y <- asDGEList(filtered.data) y <- estimateDisp(y, design) fit <- glmQLFit(y, design, robust=TRUE) res <- glmQLFTest(fit, coef="cell.type") rowData(filtered.data) <- cbind(rowData(filtered.data), res$table) ```
```r library(csaw) merged <- mergeWindows(filtered.data, tol=1000L) merged$regions ``` ``` ## GRanges object with 3577 ranges and 0 metadata columns: ## seqnames ranges strand ## ## [1] chr1 7397901-7398110 * ## [2] chr1 9541401-9541510 * ## [3] chr1 9545301-9545360 * ## [4] chr1 10007401-10007460 * ## [5] chr1 13134451-13134510 * ## ... ... ... ... ## [3573] chrX_GL456233_random 336801-336910 * ## [3574] chrY 143051-143060 * ## [3575] chrY 259151-259210 * ## [3576] chrY 90808851-90808860 * ## [3577] chrY 90812851-90812910 * ## ------- ## seqinfo: 66 sequences from an unspecified genome ``` If many adjacent windows are present, very large clusters may be formed that are difficult to interpret. We perform a simple check below to determine whether most clusters are of an acceptable size. Huge clusters indicate that more aggressive filtering from Chapter \@ref(chap-filter) is required. This mitigates chaining effects by reducing the density of windows in the genome. ```r summary(width(merged$regions)) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 10 60 110 165 160 15660 ``` Alternatively, chaining can be limited by setting `max.width` to restrict the size of the merged intervals. Clusters substantially larger than `max.width` are split into several smaller subclusters of roughly equal size. The chosen value should be small enough so as to separate DB regions from unchanged neighbors, yet large enough to avoid misinterpretation of the FDR. Any value from 2000 to 10000 bp is recommended. This paramater can also interpreted as the maximum distance at which two binding sites are considered part of the same event. ```r merged.max <- mergeWindows(filtered.data, tol=1000L, max.width=5000L) summary(width(merged.max$regions)) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 10 60 110 164 160 4860 ``` ### Using external information Another approach is to group together windows that overlap with a pre-specified region of interest. The most obvious source of pre-specified regions is that of annotated features such as promoters or gene bodies. Alternatively, called peaks can be used provided that sufficient care has been taken to avoid loss of error control from data snooping [@lun2014]. Regardless of how they are specified, each region of interest corresponds to a group that contains all overlapping windows, as identified by the `findOverlaps` function from the *[GenomicRanges](https://bioconductor.org/packages/3.18/GenomicRanges)* package. ```r library(TxDb.Mmusculus.UCSC.mm10.knownGene) broads <- genes(TxDb.Mmusculus.UCSC.mm10.knownGene) broads <- resize(broads, width(broads)+3000, fix="end") olap <- findOverlaps(broads, rowRanges(filtered.data)) olap ``` ``` ## Hits object with 12867 hits and 0 metadata columns: ## queryHits subjectHits ## ## [1] 7 6995 ## [2] 18 8323 ## [3] 18 8324 ## [4] 18 8325 ## [5] 18 8326 ## ... ... ... ## [12863] 24521 6840 ## [12864] 24521 6841 ## [12865] 24524 6601 ## [12866] 24524 6602 ## [12867] 24524 6603 ## ------- ## queryLength: 24528 / subjectLength: 12352 ``` At this point, one might imagine that it would be simpler to just collect and analyze counts over the pre-specified regions. This is a valid strategy but will yield different results. Consider a promoter containing two separate sites that are identically DB in opposite directions. Counting reads across the promoter will give equal counts for each condition so changes within the promoter will not be detected. Similarly, imprecise peak boundaries can lead to loss of detection power due to "contamination" by reads in background regions. Window-based methods may be more robust as each interval of the promoter/peak region is examined separately [@lun2014], avoiding potential problems with peak-calling errors and incorrect/incomplete annotation. ## Obtaining per-region $p$-value ### Combining window-level $p$-values We compute a combined $p$-value for each region based on the $p$-values of the constituent windows [@simes1986]. This tests the joint null hypothesis for each region, i.e., that no enrichment is observed across any of its windows. Any DB within the region will reject the joint null and yield a low $p$-value for the entire region. The combined $p$-values are then adjusted using the BH method to control the region-level FDR. ```r tabcom <- combineTests(merged$ids, rowData(filtered.data)) is.sig.region <- tabcom$FDR <= 0.05 summary(is.sig.region) ``` ``` ## Mode FALSE TRUE ## logical 1666 1911 ``` Summarizing the direction of DB for each cluster requires some care as the direction of DB can differ between constituent windows. The `num.up.tests` and `num.down.tests` fields contain the number of windows that change in each direction, and can be used to gauge whether binding increases or decreases across the cluster. A complex DB event may be present if both `num.up.tests` and `num.down.tests` are non-zero (i.e., opposing changes within the region) or if the total number of windows is much larger than either number (e.g., interval of constant binding adjacent to the DB interval). Alternatively, the `direction` field specifies which DB direction contributes to the combined $p$-value. If `"up"`, the combined $p$-value for this cluster is driven by $p$-values of windows with positive log-fold changes. If `"down"`, the combined $p$-value is driven by windows with negative log-fold changes. If `"mixed"`, windows with both positive and negative log-fold changes are involved. This allows the dominant DB in significant clusters to be quickly summarized, as shown below. ```r table(tabcom$direction[is.sig.region]) ``` ``` ## ## down up ## 174 1737 ``` For pre-specified regions, the `combineOverlaps()` function will combine the $p$-values for all windows in each region. This is a wrapper around `combineTests()` for `Hits` objects. It returns a single combined $p$-value (and its BH-adjusted value) for each region. Regions that do not overlap any windows have values of `NA` in all fields for the corresponding rows. ```r tabbroad <- combineOverlaps(olap, rowData(filtered.data)) head(tabbroad[!is.na(tabbroad$PValue),]) ``` ``` ## DataFrame with 6 rows and 8 columns ## num.tests num.up.logFC num.down.logFC PValue FDR direction ## ## 7 1 1 0 3.88905e-05 0.000767486 up ## 18 4 4 0 2.91361e-04 0.002481887 up ## 23 3 3 0 2.90086e-02 0.051465052 up ## 25 2 0 0 7.37883e-01 0.760929460 up ## 28 3 3 0 2.06513e-04 0.002055317 up ## 36 4 4 0 6.33654e-03 0.017992101 up ## rep.test rep.logFC ## ## 7 6995 3.396579 ## 18 8326 3.446009 ## 23 315 1.331126 ## 25 9977 0.201974 ## 28 8774 3.503353 ## 36 2716 2.105493 ``` ```r is.sig.gene <- tabcom$FDR <= 0.05 table(tabbroad$direction[is.sig.gene]) ``` ``` ## ## down mixed up ## 86 29 1925 ``` ### Based on the most significant window {#sec:mostsig} Another approach is to use the single window with the strongest DB as a representative of the entire region. This is useful when a log-fold change is required for each cluster, e.g., for plotting. (In contrast, taking the average log-fold change across all windows in a region will understate the magnitude of DB, especially if the region includes some non-DB background intervals of the genome.) Identification of the most significant (i.e., "best") window is performed using the `getBestTest()` function. This reports the index of the window with the lowest $p$-value in each cluster as well as the associated statistics. ```r tab.best <- getBestTest(merged$ids, rowData(filtered.data)) head(tab.best) ``` ``` ## DataFrame with 6 rows and 8 columns ## num.tests num.up.logFC num.down.logFC PValue FDR direction ## ## 1 5 2 0 0.0226156 0.0461735 up ## 2 3 2 0 0.0144413 0.0332624 up ## 3 2 2 0 0.0334643 0.0623772 up ## 4 2 0 0 0.2113083 0.2674838 up ## 5 2 0 0 0.1716740 0.2251684 up ## 6 5 1 0 0.0470532 0.0807238 up ## rep.test rep.logFC ## ## 1 3 1.57363 ## 2 8 1.98359 ## 3 10 1.69023 ## 4 11 1.12035 ## 5 14 1.08336 ## 6 17 1.32717 ``` A Bonferroni correction is applied to the $p$-value of the best window in each region, based on the number of constituent windows in that region. This is necessary to account for the implicit multiple testing across all windows in each region. The corrected $p$-value is reported as `PValue` in `tab.best`, and can be used for correction across regions using the BH method to control the region-level FDR. In addition, it is often useful to report the start location of the best window within each cluster. This allows users to easily identify a relevant DB subinterval in large regions. For example, the sequence of the DB subinterval can be extracted for motif discovery. ```r tabcom$rep.start <- start(rowRanges(filtered.data))[tab.best$rep.test] head(tabcom[,c("rep.logFC", "rep.start")]) ``` ``` ## DataFrame with 6 rows and 2 columns ## rep.logFC rep.start ## ## 1 1.40808 7398001 ## 2 1.81146 9541501 ## 3 1.57495 9545351 ## 4 1.03347 10007401 ## 5 1.08336 13134501 ## 6 1.26970 13372551 ``` The same approach can be applied to the overlaps between windows and pre-specified regions, using the `getBestOverlaps()` wrapper function. This is demonstrated below for the broad gene body example. As with `combineOverlaps()`, regions with no windows are assigned `NA` in the output table, but these are removed here to show some actual results. ```r tab.best.broad <- getBestOverlaps(olap, rowData(filtered.data)) tabbroad$rep.start <- start(rowRanges(filtered.data))[tab.best.broad$rep.test] head(tabbroad[!is.na(tabbroad$PValue),c("rep.logFC", "rep.start")]) ``` ``` ## DataFrame with 6 rows and 2 columns ## rep.logFC rep.start ## ## 7 3.396579 32657101 ## 18 3.446009 8259301 ## 23 1.331126 92934601 ## 25 0.201974 71596101 ## 28 3.503353 4137001 ## 36 2.105493 100187601 ``` ### Wrapper functions For convenience, the steps of merging windows and computing statistics are implemented in a single wrapper function. This simply calls `mergeWindows()` followed by `combineTests()` and `getBestTest()`. ```r merge.res <- mergeResults(filtered.data, rowData(filtered.data), tol=100, merge.args=list(max.width=5000)) names(merge.res) ``` ``` ## [1] "regions" "combined" "best" ``` An equivalent wrapper function is also available for handling overlaps to pre-specified regions. This simply calls `findOverlaps()` followed by `combineOverlaps()` and `getBestOverlaps()`. ```r broad.res <- overlapResults(filtered.data, regions=broads, tab=rowData(filtered.data)) names(broad.res) ``` ``` ## [1] "regions" "combined" "best" ``` ## Squeezing out more detection power ### Integrating across multiple window sizes {#sec:bin-integrate} Repeating the analysis with different window sizes may uncover new DB events at different resolutions. Multiple sets of DB results are integrated by clustering adjacent windows together (even if they differ in size) and combining $p$-values within each of the resulting clusters. The example below uses the H3 acetylation data from Chapter \@ref(chap-norm). Some filtering is performed to avoid excessive chaining in this demonstration. Corresponding tables of DB results should also be obtained -- for brevity, mock results are used here. ```r library(chipseqDBData) ac.files <- H3K9acData()$Path ac.small <- windowCounts(ac.files, width=150L, spacing=100L, filter=25, param=param) ac.large <- windowCounts(ac.files, width=1000L, spacing=500L, filter=35, param=param) # TODO: actually do the analysis here. # In the meantime, mocking up results for demonstration purposes. ns <- nrow(ac.small) mock.small <- data.frame(logFC=rnorm(ns), logCPM=0, PValue=runif(ns)) nl <- nrow(ac.large) mock.large <- data.frame(logFC=rnorm(nl), logCPM=0, PValue=runif(nl)) ``` The `mergeResultsList()` function merges windows of all sizes into a single set of regions, and computes a combined $p$-value from the associated $p$-values for each region. Equal contributions from each window size are enforced by setting `equiweight=TRUE`, which uses a weighted version of Simes' method [@benjamini1997]. The weight assigned to each window is inversely proportional to the number of windows of that size in the same cluster. This avoids the situation where, if a cluster contains many small windows, the DB results for the analysis with the small window size contribute most to the combined $p$-value. This is not ideal when results from all window sizes are of equal interest. ```r cons.res <- mergeResultsList(list(ac.small, ac.large), tab.list=list(mock.small, mock.large), equiweight=TRUE, tol=1000) cons.res$regions ``` ``` ## GRanges object with 30486 ranges and 0 metadata columns: ## seqnames ranges strand ## ## [1] chr1 4774001-4776500 * ## [2] chr1 4784501-4787000 * ## [3] chr1 4806501-4809000 * ## [4] chr1 4856501-4860000 * ## [5] chr1 5082501-5084500 * ## ... ... ... ... ## [30482] chrY 38230601-38230750 * ## [30483] chrY 73037501-73039000 * ## [30484] chrY 75445901-75446150 * ## [30485] chrY 88935501-88937000 * ## [30486] chrY 90812501-90814000 * ## ------- ## seqinfo: 66 sequences from an unspecified genome ``` ```r cons.res$combined ``` ``` ## DataFrame with 30486 rows and 8 columns ## num.tests num.up.logFC num.down.logFC PValue FDR direction ## ## 1 4 0 0 0.1935228 0.984516 up ## 2 13 0 0 0.3175779 0.992647 down ## 3 11 0 1 0.0182334 0.953773 down ## 4 22 0 0 0.2569802 0.989281 down ## 5 9 0 0 0.1256143 0.979292 up ## ... ... ... ... ... ... ... ## 30482 1 0 0 0.140177 0.981093 up ## 30483 5 0 0 0.830126 0.999396 mixed ## 30484 2 0 0 0.549871 0.999070 down ## 30485 5 0 0 0.627884 0.999070 mixed ## 30486 2 0 0 0.702539 0.999396 down ## rep.test rep.logFC ## ## 1 238292 0.4106638 ## 2 238294 -1.2105409 ## 3 14 -0.0274603 ## 4 238301 -0.5587688 ## 5 34 0.1444136 ## ... ... ... ## 30482 238280 1.1905434 ## 30483 391248 -1.1170880 ## 30484 238284 -0.3724923 ## 30485 238286 -0.6123872 ## 30486 391253 -0.0994413 ``` Similarly, the `overlapResultsList()` function is used to merge windows of varying size that overlap pre-specified regions. ```r cons.broad <- overlapResultsList(list(ac.small, ac.large), tab.list=list(mock.small, mock.large), equiweight=TRUE, region=broads) cons.broad$regions ``` ``` ## GRanges object with 24528 ranges and 1 metadata column: ## seqnames ranges strand | gene_id ## | ## 100009600 chr9 21062393-21076096 - | 100009600 ## 100009609 chr7 84935565-84967115 - | 100009609 ## 100009614 chr10 77708457-77712009 + | 100009614 ## 100009664 chr11 45805087-45841171 + | 100009664 ## 100012 chr4 144157557-144165663 - | 100012 ## ... ... ... ... . ... ## 99889 chr3 84496093-85890516 - | 99889 ## 99890 chr3 110246109-110253998 - | 99890 ## 99899 chr3 151730922-151752960 - | 99899 ## 99929 chr3 65525410-65555518 + | 99929 ## 99982 chr4 136550540-136605723 - | 99982 ## ------- ## seqinfo: 66 sequences (1 circular) from mm10 genome ``` ```r cons.res$combined ``` ``` ## DataFrame with 30486 rows and 8 columns ## num.tests num.up.logFC num.down.logFC PValue FDR direction ## ## 1 4 0 0 0.1935228 0.984516 up ## 2 13 0 0 0.3175779 0.992647 down ## 3 11 0 1 0.0182334 0.953773 down ## 4 22 0 0 0.2569802 0.989281 down ## 5 9 0 0 0.1256143 0.979292 up ## ... ... ... ... ... ... ... ## 30482 1 0 0 0.140177 0.981093 up ## 30483 5 0 0 0.830126 0.999396 mixed ## 30484 2 0 0 0.549871 0.999070 down ## 30485 5 0 0 0.627884 0.999070 mixed ## 30486 2 0 0 0.702539 0.999396 down ## rep.test rep.logFC ## ## 1 238292 0.4106638 ## 2 238294 -1.2105409 ## 3 14 -0.0274603 ## 4 238301 -0.5587688 ## 5 34 0.1444136 ## ... ... ... ## 30482 238280 1.1905434 ## 30483 391248 -1.1170880 ## 30484 238284 -0.3724923 ## 30485 238286 -0.6123872 ## 30486 391253 -0.0994413 ``` In this manner, DB results from multiple window widths can be gathered together and reported as a single set of regions. Consolidation is most useful for histone marks and other analyses involving diffuse regions of enrichment. For such studies, the ideal window size is not known or may not even exist, e.g., if the widths of the enriched regions or DB subintervals are variable. ### Weighting windows on abundance Windows that are more likely to be DB can be upweighted to improve detection power. For example, in TF ChIP-seq data, the window of highest abundance within each enriched region probably contains the binding site. It is reasonable to assume that this window will also have the strongest DB. To improve power, the weight assigned to the most abundant window is increased relative to that of other windows in the same cluster. This means that the $p$-value of this window will have a greater influence on the final combined $p$-value. Weights are computed in a manner to minimize conservativeness relative to the optimal unweighted approaches in each possible scenario. If the strongest DB event is at the most abundant window, the weighted approach will yield a combined $p$-value that is no larger than twice the $p$-value of the most abundant window. (Here, the optimal approach would be to use the $p$-value of the most abundance window directly as a proxy for the $p$-value of the cluster.) If the strongest DB event is _not_ at the most abundant window, the weighted approach will yield a combined $p$-value that is no larger than twice the combined $p$-value without wweighting (which is optimal as all windows have equal probabilities of containing the strongest DB). All windows have non-zero weights, which ensures that any DB events in the other windows will still be considered when the $p$-values are combined. The application of this weighting scheme is demonstrated in the example below. First, the `getBestTest} function with \Rcode{by.pval=FALSE()` is used to identify the most abundant window in each cluster. Window-specific weights are then computed using the `upweightSummits} function, and supplied to \Rcode{combineTests()` to use in computing combined $p$-values. ```r tab.ave <- getBestTest(merged$id, rowData(filtered.data), by.pval=FALSE) weights <- upweightSummit(merged$id, tab.ave$rep.test) head(weights) ``` ``` ## [1] 1 5 1 1 1 1 ``` ```r tabcom.w <- combineTests(merged$id, rowData(filtered.data), weight=weights) head(tabcom.w) ``` ``` ## DataFrame with 6 rows and 8 columns ## num.tests num.up.logFC num.down.logFC PValue FDR direction ## ## 1 5 3 0 0.01279380 0.0297165 up ## 2 3 2 0 0.00962617 0.0245598 up ## 3 2 2 0 0.03160768 0.0568416 up ## 4 2 0 0 0.12888341 0.1696159 up ## 5 2 0 0 0.12875551 0.1695099 up ## 6 5 2 0 0.01693914 0.0359806 up ## rep.test rep.logFC ## ## 1 2 1.40808 ## 2 7 1.81146 ## 3 9 1.57495 ## 4 12 1.03347 ## 5 14 1.08336 ## 6 17 1.32717 ``` The weighting approach can also be applied to the clusters from the broad gene body example. This is done by replacing the call to `getBestTest} with one to \Rfunction{getBestOverlaps()`, as before. Similarly, `upweightSummit} can be replaced with \Rfunction{summitOverlaps()`. These wrappers are designed to minimize book-keeping problems when one window overlaps multiple regions. ```r broad.best <- getBestOverlaps(olap, rowData(filtered.data), by.pval=FALSE) head(broad.best[!is.na(broad.best$PValue),]) ``` ``` ## DataFrame with 6 rows and 8 columns ## num.tests num.up.logFC num.down.logFC PValue FDR direction ## ## 7 1 1 0 3.88905e-05 0.000755731 up ## 18 4 1 0 7.50992e-03 0.020295810 up ## 23 3 1 0 2.90086e-02 0.052950589 up ## 25 2 0 0 7.37883e-01 0.758205118 up ## 28 3 1 0 6.88376e-05 0.000995477 up ## 36 4 1 0 2.83839e-03 0.010705025 up ## rep.test rep.logFC ## ## 7 6995 3.396579 ## 18 8324 1.690290 ## 23 315 1.331126 ## 25 9977 0.201974 ## 28 8774 3.503353 ## 36 2717 1.957656 ``` ```r broad.weights <- summitOverlaps(olap, region.best=broad.best$rep.test) tabbroad.w <- combineOverlaps(olap, rowData(filtered.data), o.weight=broad.weights) ``` ### Filtering after testing but before correction Most of the filters in Chapter~\@ref(chap-filter) are applied before the statistical analysis. However, some of the approaches may be too aggressive, e.g., filtering to retain only local maxima or based on pre-defined regions. In such cases, it may be preferable to initially apply one of the other, milder filters. This ensures that sufficient windows are retained for stable normalization and/or EB shrinkage. The aggressive filters can then be applied after the window-level statistics have been calculated, but before clustering into regions and calculation of cluster-level statistics. This is still beneficial as it removes irrelevant windows that would increase the severity of the BH correction. It may also reduce chaining effects during clustering. ## FDR control in difficult situations ### Clustering only on DB windows for diffuse marks The clustering procedures described above rely on independent filtering to remove irrelevant windows. This ensures that the regions of interest are reasonably narrow and can be easily interpreted, which is typically the case for most protein targets, e.g., TFs, narrow histone marks. However, enriched regions may be very large for more diffuse marks. Such regions may be difficult to interpret when only the DB subinterval is of interest. To overcome this, a post-hoc analysis can be performed whereby only significant windows are used for clustering. ```r postclust <- clusterWindows(rowRanges(filtered.data), rowData(filtered.data), target=0.05, tol=100, max.width=1000) postclust$FDR ``` ``` ## [1] 0.04957 ``` ```r postclust$region ``` ``` ## GRanges object with 1977 ranges and 0 metadata columns: ## seqnames ranges strand ## ## [1] chr1 7398001-7398010 * ## [2] chr1 9541451-9541510 * ## [3] chr1 15805551-15805660 * ## [4] chr1 23256201-23256210 * ## [5] chr1 32172651-32172660 * ## ... ... ... ... ## [1973] chrX 102157051-102157110 * ## [1974] chrX 104482701-104482760 * ## [1975] chrX 106187051-106187060 * ## [1976] chrX 140456551-140456560 * ## [1977] chrX_GL456233_random 336801-336910 * ## ------- ## seqinfo: 66 sequences from an unspecified genome ``` This will define and cluster significant windows in a manner that controls the cluster-level FDR at 5%. The clustering step itself is performed using `mergeWindows()` with the specified parameters. Each cluster consists entirely of DB windows and can be directly interpreted as a DB region or a DB subinterval of a larger enriched region. This reduces the pressure on abundance filtering to obtain well-separated regions prior to clustering, e.g., for diffuse marks or in data sets with weak IP signal. That said, users should be aware that calculation of the cluster-level FDR is not entirely rigorous. As such, independent clustering and FDR control via Simes' method should be considered as the default for routine analyses. ### Using the empirical FDR for noisy data Some analyses involve comparisons of ChIP samples to negative controls. In such cases, any region exhibiting enrichment in the negative control over the ChIP samples must be a false positive. The number of significant regions that change in the "wrong" direction can be used as an estimate of the number of false positives at any given $p$-value threshold. Division by the number of discoveries changing in the "right" direction yields an estimate of the FDR, i.e., the empirical FDR [@zhang2008]. This strategy is implemented in the `empiricalFDR()` function, which controls the empirical FDR across clusters based on their combined $p$-values. Its use is demonstrated below, though the output is not meaningful in this situation as genuine changes in binding can be present in both directions. ```r empres <- empiricalFDR(merged$id, rowData(filtered.data)) ``` The empirical FDR is useful for analyses of noisy data with high levels of non-specific binding. This is because the estimate of the number of false positives adapts to the observed number of regions exhibiting enrichment in the negative controls. In contrast, the standard BH method in `combineTests()` relies on proper type I error control during hypothesis testing. As non-specific binding events tend to be condition-specific, they are indistinguishable from DB events and assigned low $p$-values, resulting in loss of FDR control. Thus, for noisy data, use of the empirical FDR may be more appropriate to control the proportion of "experimental" false positives. However, calculation of the empirical FDR is not as statistically rigorous as that of the BH method, so users are advised to only apply it when necessary. ### Detecting complex DB Complex DB events involve changes to the shape of the binding profile, not just a scaling increase/decrease to binding intensity. Such regions may contain multiple sites that change in binding strength in opposite directions, or peaks that change in width or position between conditions. This often manifests as DB in opposite directions in different subintervals of a region. Some of these events can be identified using the `mixedTests()` function. ```r tab.mixed <- mixedTests(merged$ids, rowData(filtered.data)) tab.mixed ``` ``` ## DataFrame with 3577 rows and 10 columns ## num.tests num.up.logFC num.down.logFC PValue FDR direction ## ## 1 5 5 0 0.997738 1 mixed ## 2 3 2 0 0.997593 1 mixed ## 3 2 2 0 0.991634 1 mixed ## 4 2 0 0 0.947173 1 mixed ## 5 2 0 0 0.957081 1 mixed ## ... ... ... ... ... ... ... ## 3573 3 0 3 1.000000 1 mixed ## 3574 1 0 0 0.931630 1 mixed ## 3575 2 0 0 0.853330 1 mixed ## 3576 1 1 0 0.967612 1 mixed ## 3577 2 0 0 0.749056 1 mixed ## rep.up.test rep.up.logFC rep.down.test rep.down.logFC ## ## 1 2 1.40808 3 1.57363 ## 2 7 1.81146 8 1.98359 ## 3 9 1.57495 10 1.69023 ## 4 12 1.03347 11 1.12035 ## 5 14 1.08336 14 1.08336 ## ... ... ... ... ... ## 3573 12345 -2.607002 12345 -2.607002 ## 3574 12347 -0.987658 12347 -0.987658 ## 3575 12349 -0.636963 12348 -0.440200 ## 3576 12350 1.245995 12350 1.245995 ## 3577 12351 -0.430332 12352 -0.163333 ``` `mixedTests()` converts the $p$-value for each window into two one-sided $p$-values. The one-sided $p$-values in each direction are combined using Simes' method, and the two one-sided combined $p$-values are themselves combined using an intersection-union test [@berger1996bioequivalence]. The resulting $p$-value is only low if a region contains strong DB in both directions. `combineTests()` also computes some statistics for informal detection of complex DB. For example, the `num.up.tests` and `num.down.tests` fields can be used to identify regions with changes in both directions. The `direction` field will also label some regions as `"mixed"`, though this is not comprehensive. Indeed, regions labelled as `"up"` or `"down"` in the `direction` field may also correspond to complex DB events, but will not be labelled as `"mixed"` if the significance calculations are dominated by windows changing in only one direction. ### Enforcing a minimal number of DB windows On occasion, we may be interested in genomic regions that contain at least a minimal number or proportion of DB windows. This is motivated by the desire to avoid detecting DB regions where only a small subinterval exhibits a change, instead favoring more systematic changes throughout the region that are easier to interpret. We can identify these regions using the `minimalTests()` function. ```r tab.min <- minimalTests(merged$ids, rowData(filtered.data), min.sig.n=3, min.sig.prop=0.5) tab.min ``` ``` ## DataFrame with 3577 rows and 8 columns ## num.tests num.up.logFC num.down.logFC PValue FDR direction ## ## 1 5 2 0 0.0898582 0.1634907 up ## 2 3 2 0 0.1071287 0.1854788 up ## 3 2 2 0 0.0334643 0.0858077 up ## 4 2 0 0 0.2113083 0.3035635 up ## 5 2 0 0 0.2388546 0.3316704 up ## ... ... ... ... ... ... ... ## 3573 3 0 3 1.69623e-05 0.00123825 down ## 3574 1 0 0 1.36739e-01 0.22131946 down ## 3575 2 0 0 5.86682e-01 0.68760453 down ## 3576 1 0 0 6.47760e-02 0.13112825 up ## 3577 2 0 0 1.00000e+00 1.00000000 down ## rep.test rep.logFC ## ## 1 1 1.238160 ## 2 6 1.094909 ## 3 9 1.574948 ## 4 12 1.033474 ## 5 13 0.738997 ## ... ... ... ## 3573 12346 -2.535065 ## 3574 12347 -0.987658 ## 3575 12348 -0.440200 ## 3576 12350 1.245995 ## 3577 12352 -0.163333 ``` `minimalTests()` applies a Holm-Bonferroni correction to all windows in the same cluster and picks the $x$^th^-smallest adjusted $p$-value (where $x$ is defined from `min.sig.n` and `min.sig.prop`). This tests the joint null hypothesis that the per-window null hypothesis is false for fewer than $x$ windows in the cluster. If the $x$^th^-smallest $p$-value is low, this provides strong evidence against the joint null for that cluster. As an aside, this function also has some utility outside of ChIP-seq contexts. For example, we might want to obtain a single $p$-value for a gene set based on the presence of a minimal percentage of differentially expressed genes. Alternatively, we may be interested in ranking genes in loss-of-function screens based on a minimal number of shRNA/CRISPR guides that exhibit a significant effect. These problems are equivalent to that of identifying a genomic region with a minimal number of DB windows. ## Session information {-}
``` R version 4.3.1 (2023-06-16) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 22.04.3 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.18-bioc/R/lib/libRblas.so LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C time zone: America/New_York tzcode source: system (glibc) attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] chipseqDBData_1.17.0 [2] TxDb.Mmusculus.UCSC.mm10.knownGene_3.10.0 [3] GenomicFeatures_1.54.0 [4] AnnotationDbi_1.64.0 [5] csaw_1.36.0 [6] SummarizedExperiment_1.32.0 [7] Biobase_2.62.0 [8] MatrixGenerics_1.14.0 [9] matrixStats_1.0.0 [10] GenomicRanges_1.54.0 [11] GenomeInfoDb_1.38.0 [12] IRanges_2.36.0 [13] S4Vectors_0.40.0 [14] BiocGenerics_0.48.0 [15] BiocStyle_2.30.0 [16] rebook_1.12.0 loaded via a namespace (and not attached): [1] DBI_1.1.3 bitops_1.0-7 [3] biomaRt_2.58.0 CodeDepends_0.6.5 [5] rlang_1.1.1 magrittr_2.0.3 [7] compiler_4.3.1 RSQLite_2.3.1 [9] dir.expiry_1.10.0 png_0.1-8 [11] vctrs_0.6.4 stringr_1.5.0 [13] pkgconfig_2.0.3 crayon_1.5.2 [15] fastmap_1.1.1 ellipsis_0.3.2 [17] dbplyr_2.3.4 XVector_0.42.0 [19] utf8_1.2.4 promises_1.2.1 [21] Rsamtools_2.18.0 rmarkdown_2.25 [23] graph_1.80.0 purrr_1.0.2 [25] bit_4.0.5 xfun_0.40 [27] zlibbioc_1.48.0 cachem_1.0.8 [29] jsonlite_1.8.7 progress_1.2.2 [31] blob_1.2.4 later_1.3.1 [33] DelayedArray_0.28.0 interactiveDisplayBase_1.40.0 [35] BiocParallel_1.36.0 parallel_4.3.1 [37] prettyunits_1.2.0 R6_2.5.1 [39] bslib_0.5.1 stringi_1.7.12 [41] limma_3.58.0 rtracklayer_1.62.0 [43] jquerylib_0.1.4 Rcpp_1.0.11 [45] bookdown_0.36 knitr_1.44 [47] httpuv_1.6.12 Matrix_1.6-1.1 [49] tidyselect_1.2.0 rstudioapi_0.15.0 [51] abind_1.4-5 yaml_2.3.7 [53] codetools_0.2-19 curl_5.1.0 [55] lattice_0.22-5 tibble_3.2.1 [57] withr_2.5.1 shiny_1.7.5.1 [59] KEGGREST_1.42.0 evaluate_0.22 [61] BiocFileCache_2.10.0 xml2_1.3.5 [63] ExperimentHub_2.10.0 Biostrings_2.70.0 [65] pillar_1.9.0 BiocManager_1.30.22 [67] filelock_1.0.2 generics_0.1.3 [69] RCurl_1.98-1.12 BiocVersion_3.18.0 [71] hms_1.1.3 xtable_1.8-4 [73] glue_1.6.2 metapod_1.10.0 [75] tools_4.3.1 AnnotationHub_3.10.0 [77] BiocIO_1.12.0 locfit_1.5-9.8 [79] GenomicAlignments_1.38.0 XML_3.99-0.14 [81] grid_4.3.1 edgeR_4.0.0 [83] GenomeInfoDbData_1.2.11 restfulr_0.0.15 [85] cli_3.6.1 rappdirs_0.3.3 [87] fansi_1.0.5 S4Arrays_1.2.0 [89] dplyr_1.1.3 sass_0.4.7 [91] digest_0.6.33 SparseArray_1.2.0 [93] rjson_0.2.21 memoise_2.0.1 [95] htmltools_0.5.6.1 lifecycle_1.0.3 [97] httr_1.4.7 mime_0.12 [99] statmod_1.5.0 bit64_4.0.5 ```