This vignette for the R package peakCombiner guides you through the preparation of all accepted input files and how to define the best parameters for your analysis. Our goal is to provide you with an easy to understand, transparent and modular framework to create the consensus regions file you need to address your scientific question.
Genome-wide epigenomic data sets like ChIP-seq and ATAC-seq typically use tools (e.g. MACS (Zhang et al. 2008)) to identify genomic regions of interest, so-called peaks, usually for multiple sample replicates and across experimental conditions. Many downstream analyses require a consensus set of genomic regions relevant to the experiment, but current tools within the R ecosystem to easily and flexibly create combined peak sets from conditions and replicates are limited. Here, we describe peakCombiner, a fully R-based, user-friendly, transparent, and flexible tool that allows even novice R users to create a high-quality consensus peak list. The modularity of its functions allows an easy way to filter and recenter input and output data. A broad range of accepted input data formats can be used as input to peakCombiner, and the resulting consensus regions set can be exported to a file or used immediately as the starting point for most downstream peak analyses.
The package peakCombiner contains four main functions to curate and combine genomic regions into one set of consensus regions. A short overview about each the functions is in the table below.
| Function name | What it does | Where to learn more | 
|---|---|---|
| prepareInputRegions | Transforms your data into the format needed by peakCombiner for all following steps. | See in section Prepare input data. | 
| centerExpandRegions | Modifies your genomic regions by centering and then expanding them to a uniform size. | See in section Center and expand regions. | 
| filterRegions | Allows you to filter your genomic regions. | See in section Filter regions. | 
| combineRegions | Combine overlapping genomic regions to create a single set of consensus genomic regions. | See in section Combine regions. | 
Each peakCombiner function provides the parameter outputFormat that allows you to choose the output format of the function. The accepted values are ‘tibble’, ‘GenomicRanges’, and ‘data.frame’. The default value is ‘tibble’. The output format is used to define the structure of the returned data, which is described in the section Standard genomic regions format. Additionally, the parameter showMessages that by default prints feedback messages as functions run. If you plan to use peakCombiner non-interactively in a workflow, you can silence these messages by setting showMessages = FALSE. Note that error messages will still be printed to help you to troubleshoot potential problems.
The modularity of peakCombiner is enabled by standardizing how the genomic regions and samples are represented in the input and output of all its functions. We chose to use a tibble because this allows you to easily plot and explore the data.
The columns are described in the table below. Only ‘chrom’, ‘start’, ‘end’, ‘sample_name’ are required to use peakCombiner, but you have more possibilities for centerExpandRegions and filterRegions if you can provide the optional ‘score’ and ‘center’ columns.
If you can’t provide data for the optional columns, prepareInputRegions adds sensible defaults for you.
We recommend that you use either a GenomicRanges object or a tibble using prepareInputRegions (see section “Prepare input data”), but you can also prepare it yourself (see section “Load from pre-loaded tibble”)
| Column name | Requirement | Content | Details | 
|---|---|---|---|
| ‘chrom’ | required | Name of the chromosome. | - | 
| ‘start’ | required | Start coordinate of the genomic region. | 1-based coordinate system, NOT like BED files which are 0-based. | 
| ‘end’ | required | End coordinate of the genomic region. | - | 
| ‘name’ | optional | Can be any unique identifier for a sample and genomic region, but prepareInputRegions concatenates the ‘sample_name’ and row number. | |
| ‘score’ | optional | Selected enrichment value used to rank genomic regions. | Bigger values are more important. For example, qValue from Macs2, -log10FDR from another method, or fold enrichment over background computed from your favorite method. If not provided in input data, prepareInputRegionsis creating this column and populating it with ‘0’. | 
| ‘strand’ | optional | Value to define on which strand genomic region is located. | Values are ‘+’, ‘-’, or ‘.’. If not provided in input data, prepareInputRegionsis creating this column and populating it with ‘.’. | 
| ‘center’ | optional | Absolute genomic coordinate of the center of the region. | Note that another common representation of ‘center’ is to report the number of base pairs of the summit from the ‘start’ coordinate. We call this value ‘summit’ instead of ‘center’. You can use prepareInputRegionsto create ‘center’ from ‘summit’. If not provided in input data,prepareInputRegionsis creating this column as the midpoint of each region. | 
| ‘sample_name’ | required | Unique identifier for each sample (provided in the input data). | Avoid special characters like spaces or tabs. | 
peakCombinerpeakCombiner can be installed from Bioconductor using the BiocManager package:
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("peakCombiner")library("peakCombiner")
library("ggplot2")Here you find an overview of a complete run of the peakCombiner recommended workflow.
Lets use some of the provided synthetic data sets.
utils::data("syn_data_tibble")Here you have all the function in the recommended order with all available parameters.
data_prepared <- prepareInputRegions(
  data = syn_data_tibble,
  outputFormat = "tibble",
  showMessages = FALSE
)
centerExpandRegions(
  data = data_prepared,
  centerBy = "center_column",
  expandBy = NULL,
  genome = NA,
  trim_start = TRUE,
  outputFormat = "GenomicRanges",
  showMessages = FALSE
)
#> # A tibble: 51 × 8
#>    chrom  start   end name            score strand center sample_name 
#>    <chr>  <int> <int> <chr>           <dbl> <chr>   <dbl> <chr>       
#>  1 Chr2     251   950 control_rep1|9     80 .         601 control_rep1
#>  2 chr1     -49   650 control_rep1|1     96 .         301 control_rep1
#>  3 chr1     -49   650 control_rep1|2     95 .         301 control_rep1
#>  4 chr1    -149   550 control_rep1|3     46 .         201 control_rep1
#>  5 chr1    -249   450 control_rep1|5     26 .         101 control_rep1
#>  6 chr1     -49   650 control_rep1|6     25 .         301 control_rep1
#>  7 chr10    -49   650 control_rep1|7     75 .         301 control_rep1
#>  8 chr2     -49   650 control_rep1|8     50 .         301 control_rep1
#>  9 chr4 2   -49   650 control_rep1|10    30 .         301 control_rep1
#> 10 chr4-2  -149   550 control_rep1|11    20 .         201 control_rep1
#> # ℹ 41 more rows
#> GRanges object with 51 ranges and 4 metadata columns:
#>        seqnames    ranges strand |             name     score    center
#>           <Rle> <IRanges>  <Rle> |      <character> <numeric> <numeric>
#>    [1]     Chr2   251-950      * |   control_rep1|9        80       601
#>    [2]     chr1     1-650      * |   control_rep1|1        96       301
#>    [3]     chr1     1-650      * |   control_rep1|2        95       301
#>    [4]     chr1     1-550      * |   control_rep1|3        46       201
#>    [5]     chr1     1-450      * |   control_rep1|5        26       101
#>    ...      ...       ...    ... .              ...       ...       ...
#>   [47]     chr1     1-650      * | treatment_rep3|4        97       301
#>   [48]     chr1     1-650      * | treatment_rep3|5        97       301
#>   [49]    chr10    51-750      * | treatment_rep3|6        80       401
#>   [50]     chr2    51-750      * | treatment_rep3|7        10       401
#>   [51]   chr4?2     1-650      * | treatment_rep3|8        25       301
#>           sample_name
#>           <character>
#>    [1]   control_rep1
#>    [2]   control_rep1
#>    [3]   control_rep1
#>    [4]   control_rep1
#>    [5]   control_rep1
#>    ...            ...
#>   [47] treatment_rep3
#>   [48] treatment_rep3
#>   [49] treatment_rep3
#>   [50] treatment_rep3
#>   [51] treatment_rep3
#>   -------
#>   seqinfo: 9 sequences from an unspecified genome; no seqlengths
data_center_expand <- centerExpandRegions(
  data = data_prepared,
  centerBy = "center_column",
  expandBy = NULL,
  genome = NA,
  trim_start = TRUE,
  outputFormat = "tibble",
  showMessages = FALSE
)
#> # A tibble: 51 × 8
#>    chrom  start   end name            score strand center sample_name 
#>    <chr>  <int> <int> <chr>           <dbl> <chr>   <dbl> <chr>       
#>  1 Chr2     251   950 control_rep1|9     80 .         601 control_rep1
#>  2 chr1     -49   650 control_rep1|1     96 .         301 control_rep1
#>  3 chr1     -49   650 control_rep1|2     95 .         301 control_rep1
#>  4 chr1    -149   550 control_rep1|3     46 .         201 control_rep1
#>  5 chr1    -249   450 control_rep1|5     26 .         101 control_rep1
#>  6 chr1     -49   650 control_rep1|6     25 .         301 control_rep1
#>  7 chr10    -49   650 control_rep1|7     75 .         301 control_rep1
#>  8 chr2     -49   650 control_rep1|8     50 .         301 control_rep1
#>  9 chr4 2   -49   650 control_rep1|10    30 .         301 control_rep1
#> 10 chr4-2  -149   550 control_rep1|11    20 .         201 control_rep1
#> # ℹ 41 more rows
data_filtered <- filterRegions(
  data = data_center_expand,
  includeByChromosomeName = NULL,
  excludeByBlacklist = NULL,
  includeAboveScoreCutoff = NULL,
  includeTopNScoring = NULL, 
  outputFormat = "tibble",
  showMessages = FALSE
)
consensus_peak <- combineRegions(
  data = data_filtered,
  foundInSamples = 2,
  combinedCenter = "nearest",
  annotateWithInputNames = FALSE,
  combinedSampleName = "my_new_sample_name",
  outputFormat = "tibble",
  showMessages = FALSE
)
consensus_final <- centerExpandRegions(
  data = consensus_peak,
  expandBy = 350,
  outputFormat = "tibble",
  showMessages = FALSE
)
#> # A tibble: 5 × 8
#>   chrom  start   end name                 score strand center sample_name       
#>   <chr>  <int> <int> <chr>                <dbl> <chr>   <dbl> <chr>             
#> 1 chr1     -49   650 my_new_sample_name|1   100 .         301 my_new_sample_name
#> 2 chr2     -49   650 my_new_sample_name|2    50 .         301 my_new_sample_name
#> 3 chr10    -49   650 my_new_sample_name|3    95 .         301 my_new_sample_name
#> 4 chr4 2   -49   650 my_new_sample_name|4    35 .         301 my_new_sample_name
#> 5 chr4-2   -49   650 my_new_sample_name|5    30 .         301 my_new_sample_name
consensus_final
#> # A tibble: 5 × 8
#>   chrom  start   end name                 score strand center sample_name       
#>   <chr>  <dbl> <int> <chr>                <dbl> <chr>   <dbl> <chr>             
#> 1 chr1       1   650 my_new_sample_name|1   100 .         301 my_new_sample_name
#> 2 chr2       1   650 my_new_sample_name|2    50 .         301 my_new_sample_name
#> 3 chr10      1   650 my_new_sample_name|3    95 .         301 my_new_sample_name
#> 4 chr4 2     1   650 my_new_sample_name|4    35 .         301 my_new_sample_name
#> 5 chr4-2     1   650 my_new_sample_name|5    30 .         301 my_new_sample_namePlease note that the message occurring during the filterRegions is expected if chromosome names between input sample and your provided blacklist are not absolutely identical.
Finally we export the resulting consensus regions tibble in a data format of the users choice (GenomicRanges, tibble, data.frame), which then can be saved as BED file.
rtracklayer::export.bed(consensus_final, paste0(here::here(), "/lists/consensus_regions.bed"), format = "bed")In this section, we explain how to prepare the accepted, standardized input files and how to add needed features to these files. prepareInputRegions is the mandatory first step to prepare your input data in the format needed for all of the following steps in peakCombiner (for details see section “Standard genomic regions format”). This function accepts three types of input formats:
Please note when a GenomicRanges object, a tibble or data.frame (not the sample sheet) is used as input, the package assumes the input is 1-based. When using a sample sheet, the input is expected to be a standard BED (or BED-like) file which is 0-based. The packages takes care of this for the user and all outputs are 1-based to be aligned with GenomicRanges.
Please note that the function prepareInputRegions also has a filtering step, which automatically checks for genomic regions with the same values in the columns ‘chrom’, ‘start’, ‘end’ and ‘sample_name’ and filters for the strongest enriched summit (based on the ‘score’ values) per region. Additional summits are removed. Regions from peak callers can have multiple summits annotated per enriched genomic regions. This step is not optional.
prepareInputRegions returns the data that you need for all your future work in a data format of your choice (GenomicRanges, tibble, data.frame). All these formats can be used for downstream functions of this package. For more details see the section “Standard genomic regions format”.
This is for you if you want to jump right in with some code. One of the easiest ways (and the way we recommend unless you are an advanced user) is to provide peakCombiner with the list of samples and files paths for the peak files in BED-like format, and some information about the format of the peak files.
utils::data("syn_sample_sheet", package = "peakCombiner")
syn_sample_sheet
#> # A tibble: 6 × 4
#>   sample_name file_path                              file_format score_colname
#>   <chr>       <chr>                                  <chr>       <chr>        
#> 1 control01   data-raw/syn_control_rep1.narrowPeak   narrowPeak  qValue       
#> 2 control02   data-raw/syn_control_rep2.narrowPeak   narrowPeak  qValue       
#> 3 control03   data-raw/syn_control_rep3.narrowPeak   narrowPeak  qValue       
#> 4 treatment01 data-raw/syn_treatment_rep1.narrowPeak narrowPeak  qValue       
#> 5 treatment02 data-raw/syn_treatment_rep2.narrowPeak narrowPeak  qValue       
#> 6 treatment03 data-raw/syn_treatment_rep3.narrowPeak narrowPeak  qValueThis is the expected structure of a sample_sheet.
If you already have your BED-like region files loaded into your R session, you can alternatively provide them to prepareInputRegions as a single GenomicRanges, tibble or data.frame object with genomic regions and named columns (see section “Standard genomic regions format”) that, importantly, include a unique sample identifier (‘sample_name’). Here, we show as an example the tibble object.
utils::data("syn_data_control01", package = "peakCombiner")
syn_data_control01
#> # A tibble: 11 × 6
#>    chrom  start   end score strand center
#>    <chr>  <dbl> <dbl> <dbl> <chr>   <dbl>
#>  1 chr1     301  1000    96 .         600
#>  2 chr1    3501  4400    95 .        3800
#>  3 chr1    4701  5300    46 .        4900
#>  4 chr1    4701  5300    45 .        5100
#>  5 chr1    5601  6100    26 .        5700
#>  6 chr1    6701  7400    25 .        7000
#>  7 chr10    301  1000    75 .         600
#>  8 chr2     301  1000    50 .         600
#>  9 Chr2     101   800    80 .         700
#> 10 chr4 2   301  1000    30 .         600
#> 11 chr4-2   401  1100    20 .         600utils::data("syn_data_treatment01", package = "peakCombiner")
syn_data_treatment01
#> # A tibble: 10 × 6
#>    chrom  start   end score strand center
#>    <chr>  <dbl> <dbl> <dbl> <chr>   <dbl>
#>  1 chr1     200   900   100 .         500
#>  2 chr1     201   900   100 .         500
#>  3 chr1    2601  3200    99 .        2800
#>  4 chr1    4501  5200    60 .        4700
#>  5 chr1    4501  5200    59 .        5000
#>  6 chr1    5801  6300    30 .        6100
#>  7 chr1    6701  7400    29 .        7000
#>  8 chr10    201   900    95 .         500
#>  9 chr2     201   900    50 .         500
#> 10 chr4-2   201   900    30 .         500Let’s combine the two tibbles.
combined_input <- syn_data_control01 |>
  dplyr::mutate(sample_name = "control-rep1") |>
  rbind(syn_data_treatment01 |>
    dplyr::mutate(sample_name = "treatment-rep1"))
combined_input |>
  dplyr::group_by(sample_name) |>
  dplyr::summarize(num_regions = dplyr::n())
#> # A tibble: 2 × 2
#>   sample_name    num_regions
#>   <chr>                <int>
#> 1 control-rep1            11
#> 2 treatment-rep1          10prepareInputRegions(
  data = combined_input,
  outputFormat = "tibble",
  startsAreBased = 1,
  showMessages = FALSE
)
#> # A tibble: 19 × 8
#>    chrom  start   end name              score strand center sample_name   
#>    <chr>  <dbl> <dbl> <chr>             <dbl> <chr>   <dbl> <chr>         
#>  1 Chr2     101   800 control-rep1|9       80 .         700 control-rep1  
#>  2 chr1     301  1000 control-rep1|1       96 .         600 control-rep1  
#>  3 chr1    3501  4400 control-rep1|2       95 .        3800 control-rep1  
#>  4 chr1    4701  5300 control-rep1|3       46 .        4900 control-rep1  
#>  5 chr1    5601  6100 control-rep1|5       26 .        5700 control-rep1  
#>  6 chr1    6701  7400 control-rep1|6       25 .        7000 control-rep1  
#>  7 chr10    301  1000 control-rep1|7       75 .         600 control-rep1  
#>  8 chr2     301  1000 control-rep1|8       50 .         600 control-rep1  
#>  9 chr4 2   301  1000 control-rep1|10      30 .         600 control-rep1  
#> 10 chr4-2   401  1100 control-rep1|11      20 .         600 control-rep1  
#> 11 chr1     200   900 treatment-rep1|1    100 .         500 treatment-rep1
#> 12 chr1     201   900 treatment-rep1|2    100 .         500 treatment-rep1
#> 13 chr1    2601  3200 treatment-rep1|3     99 .        2800 treatment-rep1
#> 14 chr1    4501  5200 treatment-rep1|4     60 .        4700 treatment-rep1
#> 15 chr1    5801  6300 treatment-rep1|6     30 .        6100 treatment-rep1
#> 16 chr1    6701  7400 treatment-rep1|7     29 .        7000 treatment-rep1
#> 17 chr10    201   900 treatment-rep1|8     95 .         500 treatment-rep1
#> 18 chr2     201   900 treatment-rep1|9     50 .         500 treatment-rep1
#> 19 chr4-2   201   900 treatment-rep1|10    30 .         500 treatment-rep1If you set the startsAreBased = 0, the package assumes the input is 0-based and the value in the column start is adjusted.
prepareInputRegions(
  data = combined_input,
  outputFormat = "tibble",
  startsAreBased = 0,
  showMessages = FALSE
)
#> # A tibble: 19 × 8
#>    chrom  start   end name              score strand center sample_name   
#>    <chr>  <dbl> <dbl> <chr>             <dbl> <chr>   <dbl> <chr>         
#>  1 Chr2     102   800 control-rep1|9       80 .         700 control-rep1  
#>  2 chr1     302  1000 control-rep1|1       96 .         600 control-rep1  
#>  3 chr1    3502  4400 control-rep1|2       95 .        3800 control-rep1  
#>  4 chr1    4702  5300 control-rep1|3       46 .        4900 control-rep1  
#>  5 chr1    5602  6100 control-rep1|5       26 .        5700 control-rep1  
#>  6 chr1    6702  7400 control-rep1|6       25 .        7000 control-rep1  
#>  7 chr10    302  1000 control-rep1|7       75 .         600 control-rep1  
#>  8 chr2     302  1000 control-rep1|8       50 .         600 control-rep1  
#>  9 chr4 2   302  1000 control-rep1|10      30 .         600 control-rep1  
#> 10 chr4-2   402  1100 control-rep1|11      20 .         600 control-rep1  
#> 11 chr1     201   900 treatment-rep1|1    100 .         500 treatment-rep1
#> 12 chr1     202   900 treatment-rep1|2    100 .         500 treatment-rep1
#> 13 chr1    2602  3200 treatment-rep1|3     99 .        2800 treatment-rep1
#> 14 chr1    4502  5200 treatment-rep1|4     60 .        4700 treatment-rep1
#> 15 chr1    5802  6300 treatment-rep1|6     30 .        6100 treatment-rep1
#> 16 chr1    6702  7400 treatment-rep1|7     29 .        7000 treatment-rep1
#> 17 chr10    202   900 treatment-rep1|8     95 .         500 treatment-rep1
#> 18 chr2     202   900 treatment-rep1|9     50 .         500 treatment-rep1
#> 19 chr4-2   202   900 treatment-rep1|10    30 .         500 treatment-rep1We recommend that you use a sample sheet to prepare your data, especially if you have ‘narrowPeak’ or ‘broadPeak’ files. prepareInputRegions loads in the genomic regions file and formats them so that they are ready to be used in peakCombiner’s other functions. The sample sheet has three required columns and one optional column, and is described as follows.
| Column name | Requirement | Column content | Details | 
|---|---|---|---|
| ‘sample_name’ | required | Unique name for each input sample. | The user defines this value. Please avoid special characters (‘space’, ‘-’, etc.). | 
| ‘file_path’ | required | Path to the file in which the genomic regions are stored. | For example, the path to a ‘BED’ or ‘narrowPeak’ file. Can be a relative path. | 
| ‘file_format’ | required | Recognized values are: ‘BED’, ‘narrowPeak’, and ‘broadPeak’. | Used to correctly name columns from input files. Must be the same for all samples loaded at once. | 
| ‘score_colname’ | optional | The exact name of the original column to be used as score or the number of the column having the the metric used to rank peak importance, where bigger values are more important. | If not provided, column ‘9’ will be used for ‘narrowPeak’ or ‘broadPeak’ file formats. Column ‘9’ corresponds to the qValueas described in the UCSC documentation here. | 
Please note if you load your data with a sample_sheet, the loaded BEd (BED-like) files are supposed to be in 0-based and consequently the starting coordinate will be ajusted to have the output as 1-based.
The following code illustrates how you prepare an accepted sample sheet in R.
Identify paths to file.
First, let’s get the paths to the peak files we want to use and save it.
file_names Create a sample sheet.
Next, we create a tibble (named ‘sample_sheet’) with the correct column names (‘sample_name’, ‘file_path’, ‘file_format’, ‘score_colname’) to load in our data.
utils::data("syn_sample_sheet", package = "peakCombiner")
sample_sheet <- syn_sample_sheet
sample_sheet
#> # A tibble: 6 × 4
#>   sample_name file_path                              file_format score_colname
#>   <chr>       <chr>                                  <chr>       <chr>        
#> 1 control01   data-raw/syn_control_rep1.narrowPeak   narrowPeak  qValue       
#> 2 control02   data-raw/syn_control_rep2.narrowPeak   narrowPeak  qValue       
#> 3 control03   data-raw/syn_control_rep3.narrowPeak   narrowPeak  qValue       
#> 4 treatment01 data-raw/syn_treatment_rep1.narrowPeak narrowPeak  qValue       
#> 5 treatment02 data-raw/syn_treatment_rep2.narrowPeak narrowPeak  qValue       
#> 6 treatment03 data-raw/syn_treatment_rep3.narrowPeak narrowPeak  qValueWith this step you create a new tibble containing all the required information to run prepareInputRegions.
Prepare data from the sample sheet.
Now we use the prepared tibble (sample_sheet) and add it as argument data into the function prepareInputRegions.
prepareInputRegions(
  data = sample_sheet,
  startsAreBased = 0,
  showMessages = FALSE
)This returned table is a GenomicRanges object (default) or a tibble that contains all required information formatted correctly in order to use the downstream functions within peakCombiner. For more information about its structure, go back to the “Standard genomic regions format” section.
If you have already been working with your genomic regions from within an R session, you may have them pre-loaded as a tibble (this section) or as a GenomicRanges object (next section: Load from GenomicRanges object). This could be either that per sample one tibble exists or that you ran peakCombiner before on two data sets and now want to combine these. In this section we show how you prepare your input data from pre-loaded tibbles.
We start by loading in two of our synthetic data sets in ‘narrowPeak’ format, a common format generated by peak callers like MACS. Most BED-like files (including ‘narrowPeak’ files) don’t include column names in the file so you will have to name them yourself using the standard naming conventions as described by UCSC for BED or narrowPeak or broadPeak. As we know what to expect in each column we can name the columns correctly: ‘chrom’ (X1), ‘start’ (X2), ‘end’ (X3), ‘name’ (X4), ‘score’ (X5) and the ‘strand’ (X6). To be clear, the names that you define for the columns are used by prepareInputRegions to property format the data.
Columns named ‘chrom’, ‘start’, ‘end’, ‘name’, ‘score’, ‘strand’, ‘center’ and ‘sample_name’ are maintained. Please note that within peakCombiner we call columns with relative information about the peak summit ‘summit’ and with absolute values ‘center’. So, if a column named ‘summit’ is provided, it is used to populate a column named ‘center’. All other columns are dropped at the end of prepareInputRegions.
Prepare input files
Now let’s load the first narrowPeak file. Note that the columns are named already correctly and we expect this from your data as well.
utils::data(syn_control_rep1_narrowPeak)
syn_control_rep1_narrowPeak
#> # A tibble: 11 × 10
#>    chrom  start   end name  score strand signalValue pValue qValue  peak
#>    <chr>  <dbl> <dbl> <chr> <dbl> <chr>        <dbl>  <dbl>  <dbl> <dbl>
#>  1 chr1     301  1000 .         0 .               96     -1   3.96   299
#>  2 chr1    3501  4400 .         0 .               95     -1   3.96   299
#>  3 chr1    4701  5300 .         0 .               46     -1   3.33   199
#>  4 chr1    4701  5300 .         0 .               45     -1   3.31   399
#>  5 chr1    5601  6100 .         0 .               26     -1   2.83    99
#>  6 chr1    6701  7400 .         0 .               25     -1   2.80   299
#>  7 chr10    301  1000 .         0 .               75     -1   3.75   299
#>  8 chr2     301  1000 .         0 .               50     -1   3.40   299
#>  9 Chr2     101   800 .         0 .               80     -1   3.81   599
#> 10 chr4 2   301  1000 .         0 .               30     -1   2.95   299
#> 11 chr4-2   401  1100 .         0 .               20     -1   2.60   199And the second file.
utils::data(syn_treatment_rep1_narrowPeak)
syn_treatment_rep1_narrowPeak
#> # A tibble: 10 × 10
#>    chrom  start   end name  score strand signalValue pValue qValue  peak
#>    <chr>  <dbl> <dbl> <chr> <dbl> <chr>        <dbl>  <dbl>  <dbl> <dbl>
#>  1 chr1     200   900 .         0 .              100     -1   4      300
#>  2 chr1     201   900 .         0 .              100     -1   4      299
#>  3 chr1    2601  3200 .         0 .               99     -1   3.99   199
#>  4 chr1    4501  5200 .         0 .               60     -1   3.56   199
#>  5 chr1    4501  5200 .         0 .               59     -1   3.54   499
#>  6 chr1    5801  6300 .         0 .               30     -1   2.95   299
#>  7 chr1    6701  7400 .         0 .               29     -1   2.92   299
#>  8 chr10    201   900 .         0 .               95     -1   3.96   299
#>  9 chr2     201   900 .         0 .               50     -1   3.40   299
#> 10 chr4-2   201   900 .         0 .               30     -1   2.95   299Add column ‘sample_name’
Now we add a column named ‘sample_name’ to each of our tibbles.
control <- syn_control_rep1_narrowPeak |>
  dplyr::mutate(sample_name = "control-rep1")
control
#> # A tibble: 11 × 11
#>    chrom  start   end name  score strand signalValue pValue qValue  peak
#>    <chr>  <dbl> <dbl> <chr> <dbl> <chr>        <dbl>  <dbl>  <dbl> <dbl>
#>  1 chr1     301  1000 .         0 .               96     -1   3.96   299
#>  2 chr1    3501  4400 .         0 .               95     -1   3.96   299
#>  3 chr1    4701  5300 .         0 .               46     -1   3.33   199
#>  4 chr1    4701  5300 .         0 .               45     -1   3.31   399
#>  5 chr1    5601  6100 .         0 .               26     -1   2.83    99
#>  6 chr1    6701  7400 .         0 .               25     -1   2.80   299
#>  7 chr10    301  1000 .         0 .               75     -1   3.75   299
#>  8 chr2     301  1000 .         0 .               50     -1   3.40   299
#>  9 Chr2     101   800 .         0 .               80     -1   3.81   599
#> 10 chr4 2   301  1000 .         0 .               30     -1   2.95   299
#> 11 chr4-2   401  1100 .         0 .               20     -1   2.60   199
#> # ℹ 1 more variable: sample_name <chr>treatment <- syn_treatment_rep1_narrowPeak |>
  dplyr::mutate(sample_name = "treatment-rep1")
treatment
#> # A tibble: 10 × 11
#>    chrom  start   end name  score strand signalValue pValue qValue  peak
#>    <chr>  <dbl> <dbl> <chr> <dbl> <chr>        <dbl>  <dbl>  <dbl> <dbl>
#>  1 chr1     200   900 .         0 .              100     -1   4      300
#>  2 chr1     201   900 .         0 .              100     -1   4      299
#>  3 chr1    2601  3200 .         0 .               99     -1   3.99   199
#>  4 chr1    4501  5200 .         0 .               60     -1   3.56   199
#>  5 chr1    4501  5200 .         0 .               59     -1   3.54   499
#>  6 chr1    5801  6300 .         0 .               30     -1   2.95   299
#>  7 chr1    6701  7400 .         0 .               29     -1   2.92   299
#>  8 chr10    201   900 .         0 .               95     -1   3.96   299
#>  9 chr2     201   900 .         0 .               50     -1   3.40   299
#> 10 chr4-2   201   900 .         0 .               30     -1   2.95   299
#> # ℹ 1 more variable: sample_name <chr>Combine multiple tibbles
Finally, combine the multiple input tibbles into one.
combined_input <- control |>
  rbind(treatment)
combined_input
#> # A tibble: 21 × 11
#>    chrom  start   end name  score strand signalValue pValue qValue  peak
#>    <chr>  <dbl> <dbl> <chr> <dbl> <chr>        <dbl>  <dbl>  <dbl> <dbl>
#>  1 chr1     301  1000 .         0 .               96     -1   3.96   299
#>  2 chr1    3501  4400 .         0 .               95     -1   3.96   299
#>  3 chr1    4701  5300 .         0 .               46     -1   3.33   199
#>  4 chr1    4701  5300 .         0 .               45     -1   3.31   399
#>  5 chr1    5601  6100 .         0 .               26     -1   2.83    99
#>  6 chr1    6701  7400 .         0 .               25     -1   2.80   299
#>  7 chr10    301  1000 .         0 .               75     -1   3.75   299
#>  8 chr2     301  1000 .         0 .               50     -1   3.40   299
#>  9 Chr2     101   800 .         0 .               80     -1   3.81   599
#> 10 chr4 2   301  1000 .         0 .               30     -1   2.95   299
#> # ℹ 11 more rows
#> # ℹ 1 more variable: sample_name <chr>And check how many rows we have now for each sample.combined_input |>
  dplyr::group_by(sample_name) |>
  dplyr::count(name = "number_of_entries")
#> # A tibble: 2 × 2
#> # Groups:   sample_name [2]
#>   sample_name    number_of_entries
#>   <chr>                      <int>
#> 1 control-rep1                  11
#> 2 treatment-rep1                10Both ’sample_name’s are found, so we know that we have successfully combined the data sets.
Prepare data from the pre-loaded tibble
After preparing the pre-loaded tibble, we run the function prepareInputRegions and use the tibble in the parameter data.
prepareInputRegions(
  data = combined_input,
  outputFormat = "tibble",
  startsAreBased = 1,
  showMessages = FALSE
)
#> # A tibble: 19 × 8
#>    chrom  start   end name              score strand center sample_name   
#>    <chr>  <dbl> <dbl> <chr>             <dbl> <chr>   <dbl> <chr>         
#>  1 Chr2     101   800 control-rep1|9        0 .        450. control-rep1  
#>  2 chr1     301  1000 control-rep1|1        0 .        650. control-rep1  
#>  3 chr1    3501  4400 control-rep1|2        0 .       3950. control-rep1  
#>  4 chr1    4701  5300 control-rep1|3        0 .       5000. control-rep1  
#>  5 chr1    5601  6100 control-rep1|5        0 .       5850. control-rep1  
#>  6 chr1    6701  7400 control-rep1|6        0 .       7050. control-rep1  
#>  7 chr10    301  1000 control-rep1|7        0 .        650. control-rep1  
#>  8 chr2     301  1000 control-rep1|8        0 .        650. control-rep1  
#>  9 chr4 2   301  1000 control-rep1|10       0 .        650. control-rep1  
#> 10 chr4-2   401  1100 control-rep1|11       0 .        750. control-rep1  
#> 11 chr1     200   900 treatment-rep1|1      0 .        550  treatment-rep1
#> 12 chr1     201   900 treatment-rep1|2      0 .        550. treatment-rep1
#> 13 chr1    2601  3200 treatment-rep1|3      0 .       2900. treatment-rep1
#> 14 chr1    4501  5200 treatment-rep1|4      0 .       4850. treatment-rep1
#> 15 chr1    5801  6300 treatment-rep1|6      0 .       6050. treatment-rep1
#> 16 chr1    6701  7400 treatment-rep1|7      0 .       7050. treatment-rep1
#> 17 chr10    201   900 treatment-rep1|8      0 .        550. treatment-rep1
#> 18 chr2     201   900 treatment-rep1|9      0 .        550. treatment-rep1
#> 19 chr4-2   201   900 treatment-rep1|10     0 .        550. treatment-rep1The output tibble from prepare_input_data can now be used for your next steps with peakCombiner. For details about the accepted file structure see section “Standard genomic regions format”.
In memory GenomicRanges object listing the genomic regions in a sample. This object is very similar to the tibble above, except that chrom, start, and end are instead described using the GenomicRanges nomenclature (See here for details ).
Load in a GenomicRanges object
As first step we load the provided synthetic data originating from a GenomicRanges object.
utils::data("syn_data_granges")
syn_data_granges
#>    seqnames start  end width strand score center    sample_name
#> 1      chr1   200  900   701      *   100    500 treatment_rep1
#> 2      chr1     1  900   900      *    97    500 treatment_rep3
#> 3      chr1   101  300   200      *    94    200   control_rep2
#> 4      chr1   301  900   600      *    94    500   control_rep2
#> 5      chr1   201  900   700      *   100    500 treatment_rep1
#> 6      chr1   301  900   600      *    98    600 treatment_rep2
#> 7      chr1   301 1000   700      *    96    600   control_rep1
#> 8      chr1   301 1100   800      *    93    500   control_rep3
#> 9      chr1  1301 1600   300      *    97   1400 treatment_rep3
#> 10     chr1  1901 2200   300      *    98   2000 treatment_rep2
#> 11     chr1  2501 3100   600      *    97   2800 treatment_rep3
#> 12     chr1  2501 3400   900      *    98   3000 treatment_rep2
#> 13     chr1  2601 3200   600      *    99   2800 treatment_rep1
#> 14     chr1  3501 4200   700      *    44   3800   control_rep2
#> 15     chr1  3501 4400   900      *    95   3800   control_rep1
#> 16     chr1  3601 4400   800      *    43   3900   control_rep3
#> 17     chr1  4501 5000   500      *    97   4800 treatment_rep3
#> 18     chr1  4501 5200   700      *    60   4700 treatment_rep1
#> 19     chr1  4501 5200   700      *    59   5000 treatment_rep1
#> 20     chr1  4501 5300   800      *    98   4800 treatment_rep2
#> 21     chr1  4501 5300   800      *    98   5100 treatment_rep2
#> 22     chr1  4601 5100   500      *    93   4900   control_rep3
#> 23     chr1  4601 5200   600      *    94   4800   control_rep2
#> 24     chr1  4701 5300   600      *    46   4900   control_rep1
#> 25     chr1  4701 5300   600      *    45   5100   control_rep1
#> 26     chr1  5601 6100   500      *    26   5700   control_rep1
#> 27     chr1  5701 6400   700      *    98   6200 treatment_rep2
#> 28     chr1  5801 6300   500      *    30   6100 treatment_rep1
#> 29     chr1  6701 7400   700      *    25   7000   control_rep1
#> 30     chr1  6701 7400   700      *    44   7000   control_rep2
#> 31     chr1  6701 7400   700      *    43   7000   control_rep3
#> 32     chr1  6701 7400   700      *    29   7000 treatment_rep1
#> 33     chr1  6701 7400   700      *    98   7000 treatment_rep2
#> 34     chr1  6701 7400   700      *    97   7000 treatment_rep3
#> 35    chr10   101  800   700      *    95    400   control_rep2
#> 36    chr10   101  900   800      *    80    500 treatment_rep3
#> 37    chr10   201  900   700      *    95    500 treatment_rep1
#> 38    chr10   301 1000   700      *    75    600   control_rep1
#> 39    chr10   301 1000   700      *    90    600 treatment_rep2
#> 40    chr10   301 1000   700      *    90    600   control_rep3
#> 41     chr2   101  800   700      *    30    400   control_rep2
#> 42     chr2   101  900   800      *    10    500 treatment_rep3
#> 43     chr2   201  900   700      *    50    500 treatment_rep1
#> 44     chr2   301 1000   700      *    50    600   control_rep1
#> 45     chr2   301 1000   700      *    10    600   control_rep3
#> 46     chr2   301 1000   700      *    30    600 treatment_rep2
#> 47     Chr2   101  800   700      *    80    700   control_rep1
#> 48   chr4 2   301 1000   700      *    30    600   control_rep1
#> 49   chr4 2   101  800   700      *    25    400   control_rep2
#> 50   chr4 2   301 1000   700      *    35    600   control_rep3
#> 51   chr4-2   401 1100   700      *    20    600   control_rep1
#> 52   chr4-2   201  900   700      *    30    500 treatment_rep1
#> 53   chr4?2   101  900   800      *    25    400 treatment_rep3
#> 54   chr4|2   101  800   700      *    80    400   control_rep2
#> 55    chr42   301 1000   700      *    90    600 treatment_rep2The column names are based on its original GenomicRanges file format. This allows us to easily transform it into a GenomicRanges object. Note that normally we expect you to have the GenomicRanges object pre-loaded and want to use the peakCombiner on this data set. For the purpose of showing you how a accepted GenomicRanges object has to be structured we transform it here.
GenomicRanges_data <- GenomicRanges::makeGRangesFromDataFrame(
  syn_data_granges,
  keep.extra.columns = TRUE
)
GenomicRanges_data
#> GRanges object with 55 ranges and 3 metadata columns:
#>        seqnames    ranges strand |     score    center    sample_name
#>           <Rle> <IRanges>  <Rle> | <numeric> <numeric>    <character>
#>    [1]     chr1   200-900      * |       100       500 treatment_rep1
#>    [2]     chr1     1-900      * |        97       500 treatment_rep3
#>    [3]     chr1   101-300      * |        94       200   control_rep2
#>    [4]     chr1   301-900      * |        94       500   control_rep2
#>    [5]     chr1   201-900      * |       100       500 treatment_rep1
#>    ...      ...       ...    ... .       ...       ...            ...
#>   [51]   chr4-2  401-1100      * |        20       600   control_rep1
#>   [52]   chr4-2   201-900      * |        30       500 treatment_rep1
#>   [53]   chr4?2   101-900      * |        25       400 treatment_rep3
#>   [54]   chr4|2   101-800      * |        80       400   control_rep2
#>   [55]    chr42  301-1000      * |        90       600 treatment_rep2
#>   -------
#>   seqinfo: 9 sequences from an unspecified genome; no seqlengthsPrepare input from GenomicRanges object
You can simply use your GenomicRanges object in the parameter ‘data’ and load it in. The output tibble from prepare_input_data can now be used for your next steps with peakCombiner. For details about the accepted file structure see section “Standard genomic regions format”.
prepareInputRegions(
  data = GenomicRanges_data,
  outputFormat = "GenomicRanges",
  showMessages = FALSE
)
#> GRanges object with 52 ranges and 4 metadata columns:
#>        seqnames    ranges strand |             name     score    center
#>           <Rle> <IRanges>  <Rle> |      <character> <numeric> <numeric>
#>    [1]     Chr2   101-800      * |   control_rep1|9        80       700
#>    [2]     chr1  301-1000      * |   control_rep1|1        96       600
#>    [3]     chr1 3501-4400      * |   control_rep1|2        95      3800
#>    [4]     chr1 4701-5300      * |   control_rep1|3        46      4900
#>    [5]     chr1 5601-6100      * |   control_rep1|5        26      5700
#>    ...      ...       ...    ... .              ...       ...       ...
#>   [48]     chr1 4501-5000      * | treatment_rep3|4        97      4800
#>   [49]     chr1 6701-7400      * | treatment_rep3|5        97      7000
#>   [50]    chr10   101-900      * | treatment_rep3|6        80       500
#>   [51]     chr2   101-900      * | treatment_rep3|7        10       500
#>   [52]   chr4?2   101-900      * | treatment_rep3|8        25       400
#>           sample_name
#>           <character>
#>    [1]   control_rep1
#>    [2]   control_rep1
#>    [3]   control_rep1
#>    [4]   control_rep1
#>    [5]   control_rep1
#>    ...            ...
#>   [48] treatment_rep3
#>   [49] treatment_rep3
#>   [50] treatment_rep3
#>   [51] treatment_rep3
#>   [52] treatment_rep3
#>   -------
#>   seqinfo: 9 sequences from an unspecified genome; no seqlengthsWe recommend to use the function prepareInputRegions to prepare the input data in the format needed for all of the following steps in peakCombiner. In theory you can also manually provide an expected input data when preparing your data following the descriptions in the section “Standard genomic regions format”.
Unlike ‘narrowPeak’ files, BED files typically do not include columns for summits (‘summit’) or significance (‘score’) for your peaks. For that reason, we recommend using ‘narrowPeak’ files if possible. Sometimes you may only have ‘chrom’, ‘start’, and ‘end’ and you may still use peakCombiner. Here we show you how to load it.
Lets load a BED file.
utils::data("syn_data_bed")
syn_data_bed |> 
  dplyr::arrange(sample_name)
#> # A tibble: 55 × 4
#>    chrom  start   end sample_name 
#>    <chr>  <dbl> <dbl> <chr>       
#>  1 chr1     300  1000 control_rep1
#>  2 chr1    3500  4400 control_rep1
#>  3 chr1    4700  5300 control_rep1
#>  4 chr1    4700  5300 control_rep1
#>  5 chr1    5600  6100 control_rep1
#>  6 chr1    6700  7400 control_rep1
#>  7 chr10    300  1000 control_rep1
#>  8 chr2     300  1000 control_rep1
#>  9 Chr2     100   800 control_rep1
#> 10 chr4 2   300  1000 control_rep1
#> # ℹ 45 more rowsBy default the chromosomal coordinates are 0-based.
When we pull the ‘sample_name’ column we see the different number of entries for each sample name.
syn_data_bed |>
  dplyr::group_by(sample_name) |>
  dplyr::summarize(num_regions = dplyr::n())
#> # A tibble: 6 × 2
#>   sample_name    num_regions
#>   <chr>                <int>
#> 1 control_rep1            11
#> 2 control_rep2             9
#> 3 control_rep3             7
#> 4 treatment_rep1          10
#> 5 treatment_rep2          10
#> 6 treatment_rep3           8And now we use it as input for prepareInputRegions.
prepareInputRegions(
  data = syn_data_bed,
  outputFormat = "data.frame",
   startsAreBased = 0,
  showMessages = FALSE
)
#> # A tibble: 51 × 8
#>    chrom  start   end name            score strand center sample_name 
#>    <chr>  <dbl> <dbl> <chr>           <dbl> <chr>   <dbl> <chr>       
#>  1 Chr2     101   800 control_rep1|9      0 .         450 control_rep1
#>  2 chr1     301  1000 control_rep1|1      0 .         650 control_rep1
#>  3 chr1    3501  4400 control_rep1|2      0 .        3950 control_rep1
#>  4 chr1    4701  5300 control_rep1|3      0 .        5000 control_rep1
#>  5 chr1    5601  6100 control_rep1|5      0 .        5850 control_rep1
#>  6 chr1    6701  7400 control_rep1|6      0 .        7050 control_rep1
#>  7 chr10    301  1000 control_rep1|7      0 .         650 control_rep1
#>  8 chr2     301  1000 control_rep1|8      0 .         650 control_rep1
#>  9 chr4 2   301  1000 control_rep1|10     0 .         650 control_rep1
#> 10 chr4-2   401  1100 control_rep1|11     0 .         750 control_rep1
#> # ℹ 41 more rowsPlease note here that the information messages are informing you about all missing values and with which default values these columns are populated. The ‘score’ is set to 0 as no information can be obtained from a classical BED file about enrichment values. The column ‘strand’ is populated with the value ‘.’, representing that no strand information is known. The ‘center’ is calculated based on the arithmetical midpoint of each region as no ‘summit’ input column was found. The resulting tibble can be used with all functions (centerExpandRegions, filterRegions, combineRegions) of the package but certain option are limited due to the missing information in the input.
For instance, centerExpandRegions is limited to use the ‘midpoint’ as center as no summit information is provided (See section Center and expand regions). filterRegions the options ‘includeAboveScoreCutoff’ and ‘includeTopNScoring’ do rely on values in the column ‘score’ to be populated with different values (See section Filter regions) and can not be used.
The package peakCombiner is expecting to work with 1-based values instead of 0-based or a mix setup as in BED file. For example in narrowPeak files the start coordinate is included in the region, while the end coordinate is not. This follows the classical definition of a BED file in UCSC, where the start is included but the end not, meaning the start is 0-based and the end is 1-based. As normally in displaying genomic regions browsers both coordinates are used in the 1-based way, we decided to use for peakCombiner exclusivity the 1-based approach. A good cheat sheet is linked here.
If you load your data using a sample sheet “Load from sample sheet”, peakCombiner takes care of the details for you.
After you prepare your input data, you may want to resize your peaks to a single, consistent size for your downstream analyses. In this case, we recommend that you first center and expand your genomic regions by using centerExpandRegions. Please note that this function is optional, so not required but recommended to run centerExpandRegions before you combine your regions.
The quickest way to get started is to call centerExpandRegions using just the default parameters. In this case, centerExpandRegions updates the ‘start’ and ‘end’ coordinates of each genomic region such that it is centered around ‘center’ and resized to the median of all input regions.
centerExpandRegions(
  data = data_prepared,
  centerBy = "center_column",
  expandBy = NULL,
  outputFormat = "tibble",
  showMessage = FALSE
)
#> # A tibble: 51 × 8
#>    chrom  start   end name            score strand center sample_name 
#>    <chr>  <int> <int> <chr>           <dbl> <chr>   <dbl> <chr>       
#>  1 Chr2     251   950 control_rep1|9     80 .         601 control_rep1
#>  2 chr1     -49   650 control_rep1|1     96 .         301 control_rep1
#>  3 chr1     -49   650 control_rep1|2     95 .         301 control_rep1
#>  4 chr1    -149   550 control_rep1|3     46 .         201 control_rep1
#>  5 chr1    -249   450 control_rep1|5     26 .         101 control_rep1
#>  6 chr1     -49   650 control_rep1|6     25 .         301 control_rep1
#>  7 chr10    -49   650 control_rep1|7     75 .         301 control_rep1
#>  8 chr2     -49   650 control_rep1|8     50 .         301 control_rep1
#>  9 chr4 2   -49   650 control_rep1|10    30 .         301 control_rep1
#> 10 chr4-2  -149   550 control_rep1|11    20 .         201 control_rep1
#> # ℹ 41 more rows
#> # A tibble: 51 × 8
#>    chrom  start   end name            score strand center sample_name 
#>    <chr>  <dbl> <int> <chr>           <dbl> <chr>   <dbl> <chr>       
#>  1 Chr2     251   950 control_rep1|9     80 .         601 control_rep1
#>  2 chr1       1   650 control_rep1|1     96 .         301 control_rep1
#>  3 chr1       1   650 control_rep1|2     95 .         301 control_rep1
#>  4 chr1       1   550 control_rep1|3     46 .         201 control_rep1
#>  5 chr1       1   450 control_rep1|5     26 .         101 control_rep1
#>  6 chr1       1   650 control_rep1|6     25 .         301 control_rep1
#>  7 chr10      1   650 control_rep1|7     75 .         301 control_rep1
#>  8 chr2       1   650 control_rep1|8     50 .         301 control_rep1
#>  9 chr4 2     1   650 control_rep1|10    30 .         301 control_rep1
#> 10 chr4-2     1   550 control_rep1|11    20 .         201 control_rep1
#> # ℹ 41 more rowsThe table you obtained has altered coordinates for the genomic region. The center of the region is defined as the value found in the column ‘center’ and the expansion is based on the median region size of all input regions.
The expected input can be a GenomicRanges, tibble or data.frame as described previously (See section “Standard genomic regions format”). We suggest you prepare it with the function prepareInputRegions.
Define the new center
By default, centerExpandRegions uses the existing value in the ‘center’ column. (centerBy = “center_column”) which is the recommended approach.
However, you can also overwrite the existing ‘center’ value by computing the midpoint of the genomic region (centerBy = “midpoint”). This makes sense to do only if you have loaded your data from a peak caller that contain summits or the location of the strongest enrichment (e.g., MACS narrowPeak), but you would rather use the midpoint.
Define the value to expand
By default, centerExpandRegions (expandBy = NULL) calculates a reasonable value for you from your data by using half of the median regions length of all your inputdata to expand. You can alternatively provide a numeric value. For example, when ‘expandBy’ is set to 250 (expandBy = 250), this results in regions of size 500 centered around ‘center’ (for more details see section “Expand from ‘center’”).
Center & expanding the regions
centerExpandRegions(
  data = data_prepared,
  centerBy = "center_column",
  expandBy = NULL,
  outputFormat = "tibble",
  showMessages = TRUE
)
#> ℹ Input value for `expandBy` is "NULL". Median of all input genomic regions is
#>   calculated and returned for expansion.
#> ✔ `expandBy` was calculated from the input data and set to "350".
#> ℹ Genomic regions will be expanded by 350bp in both direction.
#>   
#> ℹ Argument `outputFormat` is set to "tibble".
#> ℹ Input data `data` has no assigned genome.
#> ! Input data `data` has no assigned genome (NA).
#> ℹ Argument `genome` set to NA.
#> ℹ Only start will be trimmed.
#> ℹ Provide input `data` is a <data.frame> with three or four columns and paths
#>   to existing files.
#> → Start loading and preparing data.
#> ℹ Argument `trim_start` is TRUE.
#> → Checking <class> and "values" of all columns.
#> ✔ Structure of data was successfully checked to be an accepted input.
#>   
#> → Starting with expanding genomic regions from the column center.
#> → Genomic regions will be centered and expanded.
#> ℹ Used genome for trimming is NA.
#> → Expanding genomic regions from the column center by 350 before and 350 after
#>   the center.
#> # A tibble: 51 × 8
#>    chrom  start   end name            score strand center sample_name 
#>    <chr>  <int> <int> <chr>           <dbl> <chr>   <dbl> <chr>       
#>  1 Chr2     251   950 control_rep1|9     80 .         601 control_rep1
#>  2 chr1     -49   650 control_rep1|1     96 .         301 control_rep1
#>  3 chr1     -49   650 control_rep1|2     95 .         301 control_rep1
#>  4 chr1    -149   550 control_rep1|3     46 .         201 control_rep1
#>  5 chr1    -249   450 control_rep1|5     26 .         101 control_rep1
#>  6 chr1     -49   650 control_rep1|6     25 .         301 control_rep1
#>  7 chr10    -49   650 control_rep1|7     75 .         301 control_rep1
#>  8 chr2     -49   650 control_rep1|8     50 .         301 control_rep1
#>  9 chr4 2   -49   650 control_rep1|10    30 .         301 control_rep1
#> 10 chr4-2  -149   550 control_rep1|11    20 .         201 control_rep1
#> # ℹ 41 more rows
#> ✔ Genomic regions were successfully centered and expanded.
#>   
#> ℹ Argument `trim_start` is set to "TRUE".
#> ℹ Atgument `genome` is set to NA.
#> → Trimming start coordinates of resulting genomic regions.
#> ℹ Some newly-defined genomic regions have a start coordinate below "1".
#> → Values of name for these sites: "control_rep1|1", "control_rep1|2",
#>   "control_rep1|3", "control_rep1|5", "control_rep1|6", "control_rep1|7",
#>   "control_rep1|8", "control_rep1|10", "control_rep1|11", "control_rep2|1",
#>   "control_rep2|2", "control_rep2|3", "control_rep2|4", "control_rep2|5",
#>   "control_rep2|6", "control_rep2|7", "control_rep2|8", "control_rep2|9", …,
#>   "treatment_rep3|5", and "treatment_rep3|8".
#> ✔ These genomic regions were trimmed to get start coordinate "1".
#> ✔ Genomic regions were successfully centered and expanded.
#>   
#> ℹ Output format is set to "tibble".
#> # A tibble: 51 × 8
#>    chrom  start   end name            score strand center sample_name 
#>    <chr>  <dbl> <int> <chr>           <dbl> <chr>   <dbl> <chr>       
#>  1 Chr2     251   950 control_rep1|9     80 .         601 control_rep1
#>  2 chr1       1   650 control_rep1|1     96 .         301 control_rep1
#>  3 chr1       1   650 control_rep1|2     95 .         301 control_rep1
#>  4 chr1       1   550 control_rep1|3     46 .         201 control_rep1
#>  5 chr1       1   450 control_rep1|5     26 .         101 control_rep1
#>  6 chr1       1   650 control_rep1|6     25 .         301 control_rep1
#>  7 chr10      1   650 control_rep1|7     75 .         301 control_rep1
#>  8 chr2       1   650 control_rep1|8     50 .         301 control_rep1
#>  9 chr4 2     1   650 control_rep1|10    30 .         301 control_rep1
#> 10 chr4-2     1   550 control_rep1|11    20 .         201 control_rep1
#> # ℹ 41 more rowsYou can appreciate that values for ‘start’ and ‘end’ are changed, while the number of input regions stays the same.
Define the expansion value.
Finally some examples how you can define the expansion, either symmetrically or asymmetrically (for more details see section Expand from ‘center’).
centerExpandRegions(
  data = data_prepared,
  centerBy = "center_column",
  expandBy = c(500),
  outputFormat = "tibble",
  showMessages = FALSE
)
#> # A tibble: 51 × 8
#>    chrom  start   end name            score strand center sample_name 
#>    <chr>  <int> <int> <chr>           <dbl> <chr>   <dbl> <chr>       
#>  1 Chr2     101  1100 control_rep1|9     80 .         601 control_rep1
#>  2 chr1    -199   800 control_rep1|1     96 .         301 control_rep1
#>  3 chr1    -199   800 control_rep1|2     95 .         301 control_rep1
#>  4 chr1    -299   700 control_rep1|3     46 .         201 control_rep1
#>  5 chr1    -399   600 control_rep1|5     26 .         101 control_rep1
#>  6 chr1    -199   800 control_rep1|6     25 .         301 control_rep1
#>  7 chr10   -199   800 control_rep1|7     75 .         301 control_rep1
#>  8 chr2    -199   800 control_rep1|8     50 .         301 control_rep1
#>  9 chr4 2  -199   800 control_rep1|10    30 .         301 control_rep1
#> 10 chr4-2  -299   700 control_rep1|11    20 .         201 control_rep1
#> # ℹ 41 more rows
#> # A tibble: 51 × 8
#>    chrom  start   end name            score strand center sample_name 
#>    <chr>  <dbl> <int> <chr>           <dbl> <chr>   <dbl> <chr>       
#>  1 Chr2     101  1100 control_rep1|9     80 .         601 control_rep1
#>  2 chr1       1   800 control_rep1|1     96 .         301 control_rep1
#>  3 chr1       1   800 control_rep1|2     95 .         301 control_rep1
#>  4 chr1       1   700 control_rep1|3     46 .         201 control_rep1
#>  5 chr1       1   600 control_rep1|5     26 .         101 control_rep1
#>  6 chr1       1   800 control_rep1|6     25 .         301 control_rep1
#>  7 chr10      1   800 control_rep1|7     75 .         301 control_rep1
#>  8 chr2       1   800 control_rep1|8     50 .         301 control_rep1
#>  9 chr4 2     1   800 control_rep1|10    30 .         301 control_rep1
#> 10 chr4-2     1   700 control_rep1|11    20 .         201 control_rep1
#> # ℹ 41 more rowscenterExpandRegions(
  data = data_prepared,
  centerBy = "center_column",
  expandBy = c(100, 1000),
  outputFormat = "tibble",
  showMessages = FALSE
)
#> # A tibble: 51 × 8
#>    chrom  start   end name            score strand center sample_name 
#>    <chr>  <int> <int> <chr>           <dbl> <chr>   <dbl> <chr>       
#>  1 Chr2     501  1600 control_rep1|9     80 .         601 control_rep1
#>  2 chr1     201  1300 control_rep1|1     96 .         301 control_rep1
#>  3 chr1     201  1300 control_rep1|2     95 .         301 control_rep1
#>  4 chr1     101  1200 control_rep1|3     46 .         201 control_rep1
#>  5 chr1       1  1100 control_rep1|5     26 .         101 control_rep1
#>  6 chr1     201  1300 control_rep1|6     25 .         301 control_rep1
#>  7 chr10    201  1300 control_rep1|7     75 .         301 control_rep1
#>  8 chr2     201  1300 control_rep1|8     50 .         301 control_rep1
#>  9 chr4 2   201  1300 control_rep1|10    30 .         301 control_rep1
#> 10 chr4-2   101  1200 control_rep1|11    20 .         201 control_rep1
#> # ℹ 41 more rows
#> # A tibble: 51 × 8
#>    chrom  start   end name            score strand center sample_name 
#>    <chr>  <int> <int> <chr>           <dbl> <chr>   <dbl> <chr>       
#>  1 Chr2     501  1600 control_rep1|9     80 .         601 control_rep1
#>  2 chr1     201  1300 control_rep1|1     96 .         301 control_rep1
#>  3 chr1     201  1300 control_rep1|2     95 .         301 control_rep1
#>  4 chr1     101  1200 control_rep1|3     46 .         201 control_rep1
#>  5 chr1       1  1100 control_rep1|5     26 .         101 control_rep1
#>  6 chr1     201  1300 control_rep1|6     25 .         301 control_rep1
#>  7 chr10    201  1300 control_rep1|7     75 .         301 control_rep1
#>  8 chr2     201  1300 control_rep1|8     50 .         301 control_rep1
#>  9 chr4 2   201  1300 control_rep1|10    30 .         301 control_rep1
#> 10 chr4-2   101  1200 control_rep1|11    20 .         201 control_rep1
#> # ℹ 41 more rowsIn all cases, the input regions are altered. If you are unsure how to define this parameter or have no prior expectations, don’t specify a value for the parameter ‘expandBy’ to enable its default behavior to calculate the expansion as median of the input regions from input data.
In this section we want to go into more detail to help you define the ideal parameters for the function centerExpandRegions. The expected input data structure we described early in this section “Standard genomic regions format”.
After the preparation of the input data, the first step, we recommend to center and expand the genomic regions by using centerExpandRegions. It is useful if you want all of your peaks to be the same size for your downstream analyses or if you want to use the information about the maximum enrichment in each reach (often called ‘summit’) (stored in the column ‘center’), normally obtained by some peak callers (e.g., MACS2). The function allows you to automatically center your regions of interest on these summits to capture information about the most important part within a genomic region (e.g., TF-binding site or highest peak).
There are two concepts that are relevant to understand the function centerExpandRegions:
In general, there are two approaches to define a ‘center’ of a genomic region and the function centerExpandRegions allows you to decide which one to use.
The first option for you is to use pre-defined summit information (e.g., obtained from a peak caller like MACS2) (centerBy = center_column). Such information is provided when input regions were identified by a peak calling tool that exactly defines where in the region the maximum enriched signal is found. The second option is to calculate the arithmetic midpoint of the genomic region (centerBy = midpoint). (For details see the help for prepareInputRegions).
The second factor you have to think about is by how many nucleotides you want to expand from the center to re-define your genomic regions. The function centerExpandRegions allows you to either use the input data to calculate the expansion value or to provide one or two numeric values to expand. Your decision is strongly impacted by the type of signal you are looking for. Traditionally, ATAC-seq, transcription factor ChIP-seq and some histone marks ChIP-seq (For details see ENCODE recommendations) show a very narrow, sharp signal, and your region size and expansion value should reflect that. On the other hand, some histone marks are enriched on large domains showing broad patterns. Therefore, prior knowledge of the signal you are looking for is key to choosing the best option for this parameter. Sometimes you may not have a prior expectation of region size, so the default in peakCombiner is to choose a reasonable default value from the data (expandBy = NULL) as the median input region length / 2 for expansion.
You can also choose to expand the genomic region from the new center asymmetrically, by different lengths before and after the center position (For examples see section “[Run center and expand”) by providing a vector with two values (the ‘center’ minus the first value defines the start and the ‘center’ plus the second defines the end).
It is theoretically possible that centering and expanding could result in coordinate out side of the chromosomal boundaries. This of course would be invalid. Taking advantage of GenomicRanges function, the user can provide the information of the reference genome based on supported genomes by GenomicRanges (For details see the help for centerExpandRegions). If no genome is provided the package peakCombiner has a parameter to trim the start coordinate to 0 (trim_start = TRUE). This means that if the start coordinate is negative, it is automatically replaced with 1. If you do not want this behavior, you can set trim_start = FALSE and the function will return the original negative value. The user gets a printed feedback on how often and where this happened so you can debug if necessary. If these edge cases are critical to your downstream analyses, we suggest that you double-check this.
If you have chosen to run centerExpandRegions on your input regions, you probably also want to center and expand the consensus regions you get from combineRegions so that your final consensus set of regions all have the same length and are well-aligned at their centers. When you combine genomic regions during combineRegions, the genomic regions are alternated. This means that the median region size could be different compared to your input data region size, if you try to let centerExpandRegions choose the default region length from the data. For that reason, we suggest that you save the values of the calculated expansion (which is printed out) for the input regions and when you run the centerExpandRegions again on the consensus file to use this value for the ‘expandBy’ parameter. Doing so allows you to define the consensus region length by the exact input region length and not just a subset.
The next function filterRegions allows you to refine the genomic regions based on four different parameters in exact the order provided here:
This is an optional step that can help you retain the most high-quality genomic regions or reduce the overall number of genomic regions for each sample. You can apply filterRegions multiple times on the same data set, one after another, in order to explore the effect of each individual filtering steps.
As a quick first example, you can easily exclude blacklisted regions as follows. We recommend using the ENCODE-annotated blacklisted regions for either human ([hg38][https://www.encodeproject.org/files/ENCFF356LFX/]) or mouse ([mm10][https://www.encodeproject.org/files/ENCFF547MET/]). Alternatively, you can also use a data frame in bed file format as well.
backlist <- tibble::tibble(chrom = c("chr1"),
                           start = c(100),
                           end = c(1000))
backlist
#> # A tibble: 1 × 3
#>   chrom start   end
#>   <chr> <dbl> <dbl>
#> 1 chr1    100  1000data_filtered <- filterRegions(
  data = data_center_expand,
  excludeByBlacklist = backlist,
  outputFormat = "tibble",
  showMessages = FALSE
)
data_filtered
#> # A tibble: 21 × 8
#>    chrom  start   end name            score strand center sample_name 
#>    <chr>  <int> <int> <chr>           <dbl> <chr>   <dbl> <chr>       
#>  1 Chr2     251   950 control_rep1|9     80 .         601 control_rep1
#>  2 chr10      1   650 control_rep1|7     75 .         301 control_rep1
#>  3 chr2       1   650 control_rep1|8     50 .         301 control_rep1
#>  4 chr4 2     1   650 control_rep1|10    30 .         301 control_rep1
#>  5 chr4-2     1   550 control_rep1|11    20 .         201 control_rep1
#>  6 chr10      1   650 control_rep2|6     95 .         301 control_rep2
#>  7 chr2       1   650 control_rep2|7     30 .         301 control_rep2
#>  8 chr4 2     1   650 control_rep2|8     25 .         301 control_rep2
#>  9 chr4|2     1   650 control_rep2|9     80 .         301 control_rep2
#> 10 chr10      1   650 control_rep3|5     90 .         301 control_rep3
#> # ℹ 11 more rowsYou can filter based on four parameters:
includeByChromosomeName - Retains only chromosomes that are in the provided vector. By not including mitochondrial, sex, or non-classical chromosomes, genomic regions found on these chromosomes can be removed. By default, this filter is not applied.
excludeByBlacklist - A tibble can be provided listing the genomic regions to remove (having ‘chrom’, ‘start’, and ‘end’ column names). By default, this filter is not applied.
includeAboveScoreCutoff - Single numeric value that defines the threshold above which all genomic regions will be retained based on the values in the column ‘score’. The ‘score’ column in the peakCombiner input data must be non-zero for this parameter to be useful. Recall that prepareInputRegions sets the ‘score’ by default if possible (e.g., it takes the value of -log10(FDR) using a ‘narrowPeak’ file from MACS as input). Importantly, applying this filter retains a variable number of genomic regions per sample, all having a score greater than the includeAboveScoreCutoff parameter. By default, this filter is not applied.
includeTopNScoring - Single numeric value that defines how many of the top scoring genomic regions (using the column ‘score’) are retained. All other genomic regions are discarded. Importantly, applying this filter retains includeTopNScoring regions per sample, which means that the minimum enrichment levels may vary between samples. Note that if multiple genomic regions have the same ‘score’ cutoff value, then all of those genomic regions are included. In this case, the number of resulting regions retained may be a bit higher than the input parameter. By default, this filter is not applied.
Let’s see how what happens when all four filtering options are used at once.
filterRegions(
  data = data_center_expand,
  includeByChromosomeName = c("chr1", "chr2", "chr4"),
  excludeByBlacklist = backlist,
  includeAboveScoreCutoff = 2.5,
  includeTopNScoring = 6,
  outputFormat = "tibble",
  showMessages = TRUE
)
#> → Checking <class> and "values" of all columns.
#> ✔ Structure of data was successfully checked to be an accepted input.
#>   
#> ℹ Provide input `data` is a <data.frame> with three or four columns and paths
#>   to existing files.
#> → Start loading and preparing data.
#> ℹ Argument `outputFormat` is set to "tibble".
#> → The argument `includeByChromosomeName` is a class <character> of length 3 and
#>   will be used to retain matchhing chromsome names in chrom.
#> ! Input for `includeByChromosomeName` contains values not found in the input
#>   data.
#> ℹ The following chromosome name you entered is not used: "chr4".
#> → Is "includeByChromosomeName" correctly defined?
#> ✔ Entries in chrom with the values "chr1" and "chr2" are retained.
#> ℹ The following 7 entries in column chrom from the input data were not
#>   retained: "Chr2", "chr10", "chr4 2", "chr4-2", "chr4|2", "chr42", and
#>   "chr4?2".
#> ✔ Input data was filtered to retain regions on defined chromosome.
#>   
#> → User provied dataframe will be used for blacklist filtering.
#> ! Provided blacklist contains chromosome names (in chrom) not found in input
#>   data.
#> ℹ The following blacklist chromosome has no match: "chr2".
#> → Note to user: Please doublecheck this observation.
#> ✔ Input data was filtered by blacklist.
#>   
#> → Significance in score is filtered and all regions above 2.5 will be retained.
#> ℹ A total of 6 of 6 input regions are retained with value in score a above 2.5.
#> ✔ Input data was filtered to retain regions with a score above the defined
#>   threshold.
#>   
#> ℹ The argument `includeTopNScoring` extracted the the top 6 regions by score
#>   per sample (based on the values in sample_name).
#> → The top enriched 6 regions per sample will be retained.
#> ℹ The argument `includeTopNScoring` was defined as 6.
#> → The following "sample_names" contain less regions then defined by
#>   `includeTopNScoring`: control_rep1, control_rep2, control_rep3,
#>   treatment_rep1, treatment_rep2, and treatment_rep3
#> ! No genomic regions will be removed for such samples.
#> ✔ Input data was filtered and the top 6 enriched regions per sample are
#>   retained.
#>   
#> ✔ Filtered dataset will be returned.
#> ℹ Output format is set to "tibble".
#> # A tibble: 6 × 8
#>   chrom start   end name             score strand center sample_name   
#>   <chr> <int> <int> <chr>            <dbl> <chr>   <dbl> <chr>         
#> 1 chr2      1   650 control_rep1|8      50 .         301 control_rep1  
#> 2 chr2      1   650 control_rep2|7      30 .         301 control_rep2  
#> 3 chr2      1   650 control_rep3|6      10 .         301 control_rep3  
#> 4 chr2      1   650 treatment_rep1|9    50 .         301 treatment_rep1
#> 5 chr2      1   650 treatment_rep2|9    30 .         301 treatment_rep2
#> 6 chr2     51   750 treatment_rep3|7    10 .         401 treatment_rep3The filtering occurred in the order of the parameters and can be described as following:
The function filterRegions can help you curate the input genomic regions when your peak-calling pipeline uses the default options and you still need to refine the peaks further, for example when using an automated pipeline for annotation and peak calling. The idea here is to to provide a quick and easy way to clean-up your input data if needed. We suggest that you run the function filterRegions after the function centerExpandRegions just in case centerExpandRegions alters the genomic coordinates enough to overlap with a blacklisted region.
filterRegions allows a step-wise optimization of selection criteria of regions of interest and can be used multiple time on the same data set.
Following is more detail the filtering options in peakCombiner, what we recommend to do. All of these parameters are optional, and by default no filtering is done.
Sometimes you may need to focus your analysis on a single chromosome (e.g., to reduce the running time when testing) or set of chromosomes (just the canonical human chromosomes). In this case, include_chromosomes can help you to define the set of chromosomes you want to focus on in your analysis.
In this subsection, we provide an example to identify “canonical” human chromosomes directly from the input data, save it as vector and use this as input for the filtering step.
Extract a vector of chromosomes to include
First we extract all chromosome names from the input data.
input_chrom <-
  data_center_expand |>
  dplyr::select(chrom) |>
  unique()
input_chrom
#> # A tibble: 9 × 1
#>   chrom 
#>   <chr> 
#> 1 Chr2  
#> 2 chr1  
#> 3 chr10 
#> 4 chr2  
#> 5 chr4 2
#> 6 chr4-2
#> 7 chr4|2
#> 8 chr42 
#> 9 chr4?2Here we see that in this data set we have some unexpected values for chromosome names like “chr4 2”, “chr4|2” or “chr42”. Let’s modify this vector to only keep what we consider to be the “canonical” chromosomes. In real world human data sets, you may find names like “chr11_KI270721v1_random” or “chrUn_GL000195v1” that you might want to remove for your downstream analyses
To do so, the next step is to filter with regular expressions to maintain only wanted chromosome names.
include_chrom <- input_chrom |>
  dplyr::filter(grepl("^chr[0-9]$|^chr[1-2][0-9]$|^chr[XYM]", chrom)) |>
  dplyr::pull(chrom) |>
  unique() |>
  sort()
include_chrom
#> [1] "chr1"  "chr10" "chr2"Finally, we can use this vector of good names in filterRegions for the parameter includeByChromosomeName
data_filtered_chr <- filterRegions(
  data = data_center_expand,
  includeByChromosomeName = include_chrom,
  excludeByBlacklist = NULL,
  includeAboveScoreCutoff = NULL,
  includeTopNScoring = NULL,
  outputFormat = "tibble",
  showMessages = FALSE
)
data_filtered_chr
#> # A tibble: 42 × 8
#>    chrom start   end name           score strand center sample_name 
#>    <chr> <dbl> <int> <chr>          <dbl> <chr>   <dbl> <chr>       
#>  1 chr1      1   650 control_rep1|1    96 .         301 control_rep1
#>  2 chr1      1   650 control_rep1|2    95 .         301 control_rep1
#>  3 chr1      1   550 control_rep1|3    46 .         201 control_rep1
#>  4 chr1      1   450 control_rep1|5    26 .         101 control_rep1
#>  5 chr1      1   650 control_rep1|6    25 .         301 control_rep1
#>  6 chr10     1   650 control_rep1|7    75 .         301 control_rep1
#>  7 chr2      1   650 control_rep1|8    50 .         301 control_rep1
#>  8 chr1      1   450 control_rep2|1    94 .         101 control_rep2
#>  9 chr1      1   550 control_rep2|2    94 .         201 control_rep2
#> 10 chr1      1   650 control_rep2|3    44 .         301 control_rep2
#> # ℹ 32 more rowsLet’s confirm that we did indeed have unwanted chromosome names in the input data.
data_center_expand |>
  dplyr::group_by(sample_name) |>
  dplyr::summarize(num_regions = dplyr::n())
#> # A tibble: 6 × 2
#>   sample_name    num_regions
#>   <chr>                <int>
#> 1 control_rep1            10
#> 2 control_rep2             9
#> 3 control_rep3             7
#> 4 treatment_rep1           8
#> 5 treatment_rep2           9
#> 6 treatment_rep3           8And we can also confirm that only the good chromosomes remain after this filtering step.
data_filtered_chr |>
  dplyr::group_by(chrom) |>
  dplyr::summarize(num_regions = dplyr::n())
#> # A tibble: 3 × 2
#>   chrom num_regions
#>   <chr>       <int>
#> 1 chr1           30
#> 2 chr10           6
#> 3 chr2            6It is often recommended to exclude blacklisted genomic regions from your analyses, and the excludeByBlacklist parameter allows you to do that easily. Genomic regions that overlap a blacklisted region with at least 1 base are removed. We suggest to use blacklisted regions from ENCODE for human (hg38) and mouse (mm10). The blacklists were manually curated by ENCODE by combing several published blacklisted regions.
While we recommend to use the provided ENCODE blacklists, you can alternatively provide your own blacklist as a tibble containing genomic regions with columns named ‘chrom’, ‘start’ and ‘end’. In more general terms, you can use any file with genomic locations here to remove overlapping regions, for instance if you want to remove regions overlapping with promoters you can use this functions here to remove these.
Here is some code to directly download the file from ENCODE, save it locally and load it to your enviroment.
# Download the blacklist BED file from ENCODE and save it as "blacklist_human.bed"
download.file(
  url = "https://www.encodeproject.org/files/ENCFF356LFX/@@download/ENCFF356LFX.bed.gz",
  destfile = "blacklist_human.bed.gz",
  mode = "wb"
)
# Import the BED file
blacklist_granges <- readr::read_tsv("blacklist_human.bed.gz", col_names = c("chrom", "start", "end"))
blacklist_grangesfilterRegions(
  data = data_center_expand,
  excludeByBlacklist = blacklist_granges,
  showMessages = FALSE)An alternative is to load it via the package AnnotationHub to creates and manages a local cache for such files.
# Load library (AnnotationHub would need to be added to DESCRIPTION under Suggests if used in the vignette)
library(AnnotationHub)
#> Loading required package: BiocGenerics
#> Loading required package: generics
#> 
#> Attaching package: 'generics'
#> The following objects are masked from 'package:base':
#> 
#>     as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
#>     setequal, union
#> 
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
#>     as.data.frame, basename, cbind, colnames, dirname, do.call,
#>     duplicated, eval, evalq, get, grep, grepl, is.unsorted, lapply,
#>     mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
#>     rank, rbind, rownames, sapply, saveRDS, table, tapply, unique,
#>     unsplit, which.max, which.min
#> Loading required package: BiocFileCache
#> Loading required package: dbplyr
# Get annotation hub handle (will download index if called for the first time)
ah <- AnnotationHub::AnnotationHub()
# List ENCODE related data providers
unique(grep("ENCODE", ah$dataprovider, value = TRUE))
#> [1] "ENCODE Project"   "GENCODE"          "ENCODE SCREEN v3" "ENCODE"          
#> [5] "ENCODE cCREs"
# Query for "blacklist"
dm <- AnnotationHub::query(ah, "blacklist")
dm
#> AnnotationHub with 15 records
#> # snapshotDate(): 2025-10-17
#> # $dataprovider: GitHub, excluderanges
#> # $species: Homo sapiens, Mus musculus, Drosophila melanogaster, Caenorhabdi...
#> # $rdataclass: GRanges
#> # additional mcols(): taxonomyid, genome, description,
#> #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#> #   rdatapath, sourceurl, sourcetype 
#> # retrieve records with, e.g., 'object[["AH107304"]]' 
#> 
#>              title                        
#>   AH107304 | T2T.excluderanges            
#>   AH107307 | hg38.Boyle.hg38-Excludable.v2
#>   AH107314 | hg19.Boyle.hg19-Excludable.v2
#>   AH107321 | mm39.excluderanges           
#>   AH107322 | mm10.Boyle.mm10-Excludable.v2
#>   ...        ...                          
#>   AH107342 | T2T.Lareau.chm13v2.0_peaks   
#>   AH107343 | hg38.Lareau.hg38_peaks       
#>   AH107344 | hg19.Lareau.hg19_peaks       
#>   AH107345 | mm10.Lareau.mm10_peaks       
#>   AH107346 | mm9.Lareau.mm9_peaks
# get specific object (will be downloaded if called for the first time)
blacklist_ah <- dm[["AH107343"]]
#> loading from cache
#> require("GenomicRanges")
# get info for a specific object
AnnotationHub::recordStatus(ah, "AH107343")
#>     record status  dateadded
#> 1 AH107343 Public 2022-10-19
AnnotationHub::getInfoOnIds(ah, "AH107343")
#>           ah_id fetch_id                  title rdataclass status biocversion
#> 323240 AH107343   114089 hg38.Lareau.hg38_peaks    GRanges Public        3.16
#>        rdatadateadded rdatadateremoved file_size
#> 323240     2022-10-19             <NA>     23985
mcols(dm)["AH107343",]
#> DataFrame with 1 row and 15 columns
#>                           title dataprovider      species taxonomyid
#>                     <character>  <character>  <character>  <integer>
#> AH107343 hg38.Lareau.hg38_peaks       GitHub Homo sapiens       9606
#>               genome            description coordinate_1_based
#>          <character>            <character>          <integer>
#> AH107343        hg38 Regions of high homo..                  1
#>                      maintainer rdatadateadded preparerclass
#>                     <character>    <character>   <character>
#> AH107343 Mikhail Dozmorov <mi..     2022-10-19 excluderanges
#>                                                            tags  rdataclass
#>                                                          <AsIs> <character>
#> AH107343 FunctionalAnnotation,GenomicSequence,hg38;Lareau;NUMTs     GRanges
#>                       rdatapath              sourceurl  sourcetype
#>                     <character>            <character> <character>
#> AH107343 excluderanges/hg38.L.. https://drive.google..         RDS
mcols(dm)["AH107343","description"]
#> [1] "Regions of high homology to mtDNA (NUMT regions) defined by caleblareau/mitoblacklist"blacklist_ahfilterRegions(
  data = data_center_expand,
  excludeByBlacklist = blacklist_ah,
  showMessages = FALSE)Additionally, other R packages as excluderages do also provide regions to be used within this function.
If you would like to remove blacklisted regions from multiple sources, you can run filterRegions repeatedly, for example first to remove the ENCODE blacklisted regions and then to remove your own blacklisted sites. We show how to do this below.
Setting this parameter allows you to retain any peak that has a bigger score than the includeAboveScoreCutoff value based on the values in the column ‘score’. Recall that prepareInputRegions defines the ‘score’ by default if possible (e.g., it takes the value of -log10(FDR) using a ‘narrowPeak’ file from MACS2 as input). As scores might differ between experiments, we recommend that you look at the distribution of the values in ‘score’ to help you choose the best threshold value for includeAboveScoreCutoff for your data set.
Here is some code to do that on our small, synthetic dataset. The distribution you see will look different.
data_center_expand |>
  ggplot2::ggplot(ggplot2::aes(x = score)) +
  ggplot2::geom_histogram(binwidth = 10) +
  ggplot2::xlim(0,NA)
#> Warning: Removed 1 row containing missing values or values outside the scale range
#> (`geom_bar()`).See that we have very few sites with low values. We use a cutoff of 35 to remove lowly enriched sites and apply this value to the filtering step.
data_filtered_cutoff <- filterRegions(
  data = data_center_expand,
  includeAboveScoreCutoff = 35,
  outputFormat = "tibble",
  showMessages = FALSE
)Let’s compare the sites remaining in the input data set and after filtering.
dim(data_center_expand)
#> [1] 51  8
dim(data_filtered_cutoff)
#> [1] 38  8When we check the range of the values in ‘score’ columns we see the effects of the filtering.
range(data_center_expand |>
  dplyr::pull(score))
#> [1]  10 100
range(data_filtered_cutoff |>
  dplyr::pull(score))
#> [1]  35 100Again, we see that sites with ‘score’ 35 and below are removed.
data_filtered_cutoff |>
  ggplot2::ggplot(ggplot2::aes(x = score)) +
  ggplot2::geom_histogram(binwidth = 10) +
  ggplot2::xlim(0, NA)
#> Warning: Removed 1 row containing missing values or values outside the scale range
#> (`geom_bar()`).You can also select a fixed number of highest scoring regions per sample to extract the top enriched sites. An information message is shown if any sample does not have the required number of regions left in your input data. If your ‘score’ values vary widely between samples you may select widely different numbers of regions using the include_above_cutoff. In this case, using this approach will help you select similar numbers of regions for each sample. The exact same number of regions may not be selected for each sample because sometimes multiple genomic regions may have the same ‘score’ value. In this case, all of the tied genomic regions are retained.
data_center_expand |>
  dplyr::group_by(sample_name) |>
  dplyr::summarize(num_regions = dplyr::n())
#> # A tibble: 6 × 2
#>   sample_name    num_regions
#>   <chr>                <int>
#> 1 control_rep1            10
#> 2 control_rep2             9
#> 3 control_rep3             7
#> 4 treatment_rep1           8
#> 5 treatment_rep2           9
#> 6 treatment_rep3           8
filterRegions(
  data = data_center_expand,
  includeTopNScoring = 8,
  outputFormat = "tibble",
  showMessages = FALSE
) |>
  dplyr::group_by(sample_name) |>
  dplyr::summarize(num_regions = dplyr::n())
#> ✔ Filtered dataset will be returned.
#> ℹ Output format is set to "tibble".
#> # A tibble: 6 × 2
#>   sample_name    num_regions
#>   <chr>                <int>
#> 1 control_rep1             8
#> 2 control_rep2             8
#> 3 control_rep3             7
#> 4 treatment_rep1           8
#> 5 treatment_rep2           8
#> 6 treatment_rep3           8We requested that only the top 8 genomic regions would be retained for each sample, and we can see in the information messages that peakCombiner that one sample (‘control03’) contains less then the required 8 sites. For the remaining samples, we select the expected 8 regions per sample.
After loading and preparing the regions from your samples, you can use them to build your set of consensus regions using combineRegions.
The most important parameter that affects the final set of consensus regions is ‘foundInSamples’, which allows you to retain a genomic region based on how many samples it was found in (counting the unique entries in ‘sample_name’). It is up to you to define this parameter based on your input data and analysis goals. The default is ‘2’ samples, as we assume that each condition has three biological replicates and most users would like to focus on regions to be found in at lest two of these replicates. See the explained in detail section “Define parameter to best combine regions” for more considerations on setting this parameter.
The other parameters allow you to configure how the new consensus regions are annotated. For example, you can define how the ‘center’ column is populated (‘combinedCenter’) and what new ‘sample_name’ should be used for the new consensus regions (‘combinedSampleName’). Based on the selected value for the parameter ‘combinedCenter’, the new column ‘score’ is filled (for details see section Which center to select for the consensus regions?). If you want to trace which samples and genomic regions contributed to a new consensus region, you can use ‘annotateWithInputNames’ to create a new column ‘input_name’ describing links to the input genomic regions.
Following is a quick example that uses default parameters.
combineRegions(
  data = data_filtered,
  foundInSamples = 2,
  combinedCenter = "nearest",
  annotateWithInputNames = FALSE,
  combinedSampleName = "my_new_sample_name",
  outputFormat = "tibble",
  showMessage = FALSE
)
#> # A tibble: 4 × 8
#>   chrom  start   end name                 score strand center sample_name       
#>   <chr>  <dbl> <dbl> <chr>                <dbl> <chr>   <dbl> <chr>             
#> 1 chr2       1   650 my_new_sample_name|1    50 .         301 my_new_sample_name
#> 2 chr10      1   650 my_new_sample_name|2    95 .         301 my_new_sample_name
#> 3 chr4 2     1   650 my_new_sample_name|3    35 .         301 my_new_sample_name
#> 4 chr4-2     1   550 my_new_sample_name|4    30 .         301 my_new_sample_nameWe run the function combineRegions and give the value ‘consensus’ to the parameter combinedSampleName and we want to have the link to our input regions by setting annotate_with_input_name to ‘TRUE’.
In brief, a description what the specific parameter you can define do. For more details please see below section “Define parameter to best combine regions”.
foundInSamples - Defines in how many input samples a genomic site has to be found to be included into the consensus set.combinedCenter - To stay within the expected data structure of peakCombiner, we add new values for ‘center’ and ‘score’ at the end of this function to make it an accepted input for centerExpandRegions and filterRegions. You can choose between the options ‘nearest’, ‘strongest’ or ‘middle’ for combinedCenter.annotateWithInputNames - You can define if for each consensus region the ‘name’ values of all contributing input regions are combined and saved in a new column ‘input_names’. This allows you to link back each consensus region to the input regions.combinedSampleName - You can define the new value for the column ‘sample_name’, which is then used to create the unique ‘name’ column.consensus_peak_list <- combineRegions(
  data = data_filtered,
  foundInSamples = 2,
  combinedCenter = "nearest",
  annotateWithInputNames = TRUE,
  combinedSampleName = "consensus",
  outputFormat = "tibble",
  showMessages = TRUE
)
#> ℹ Argument `outputFormat` is set to "tibble".
#> ℹ Provide input `data` is a <data.frame> with three or four columns and paths
#>   to existing files.
#> → Start loading and preparing data.
#> → Checking <class> and "values" of all columns.
#> ✔ Structure of data was successfully checked to be an accepted input.
#>   
#> → Start with the disjoining and filtering genomic regions.
#> ✔ Disjoin and filter by `foundInSamples` of genomic regions successfully
#>   finished.
#>   
#> → Start with combining remaining genomic regions.
#> ✔ Combining remaining genomic regions was successfully finished.
#>   
#> → Start with identification of overlaps between the original summit and
#>   remaining genomic regions.
#> ℹ Remaining regions without overlap will be removed.
#> Joining with `by = join_by(ranking)`
#> ✔ Retained genomic regions with input data summit overlap was successfully
#>   finished.
#>   
#> → Information from input center will be added to output data frame.
#> ℹ Argument `combinedCenter` was defined as "nearest".
#> ℹ The mean of all input centers is calculated and the nearest input center is
#>   used
#> → Center information in center and score are added to the output data frame.
#> ✔ Output data frame columns center and score were successfully populated.
#>   
#> → The value "consensus" for column sample_name was provided.
#> ℹ Column sample_name is filled with provided value consensus.
#> ℹ Column name is created as unique identifier for each row containing
#>   sample_name and the row number.
#> ✔ The columns sample_name and name were successfully populated.
#>   
#> ℹ Argument `annotateWithInputNames` was set to "TRUE".
#> → Column input_names is added to output data frame.
#> ✔ Additional column input_names was successfully populated.
#>   
#> ✔ All required columns in output data frame were successfully populated.
#>   
#> ✔ Genomic regions were successfully combined.
#>   
#> ℹ Output format is set to "tibble".
consensus_peak_list
#> # A tibble: 4 × 9
#>   chrom  start   end name        score strand center sample_name input_names    
#>   <chr>  <dbl> <dbl> <chr>       <dbl> <chr>   <dbl> <chr>       <chr>          
#> 1 chr2       1   650 consensus|1    50 .         301 consensus   control_rep1|8…
#> 2 chr10      1   650 consensus|2    95 .         301 consensus   control_rep1|7…
#> 3 chr4 2     1   650 consensus|3    35 .         301 consensus   control_rep1|1…
#> 4 chr4-2     1   550 consensus|4    30 .         301 consensus   control_rep1|1…Here you can see the resulting consensus regions tibble with all the known column and the newly added ‘input_names’. In conclusion, we provide here multiple options for you to customize the resulting regions file based on your personal needs.
Here we briefly describe what is happening under the hood.
Binning and counting of each genomic input region:
First step is to apply the GenomicRanges function disjoin to split up peaks depending on the number of input samples contributing. These numbers are counted to get the coverage of each genomic region. These counts are then used for filtering using the parameter foundInSamples, which is defined by the user. If foundInSamples is a fraction between 0 and 1, then only include genomic regions that are found in at least this fraction of input samples. Default value is 2. Please note if foundInSamples is set to 1 all genomic regions covered in at least one sample are included, this makes this function to simply merge all input regions.
Recombine separated genomic regions:
As next step we perform reduce, another GenomicRanges operation, to combine regions adjacent to each other. If regions are separated by at least a single nucelotide, we keep them separately.
Overlap with summits:
This is a quality control function for the resulting genomic regions specifically developed for peakCombiner. To remove potential false positive sites, we designed the function combineRegions to check for an overlap between newly combined regions and input summits. If at least one overlap is found, the new region will be maintained, otherwise it is removed.
Restore data structure:
In this final step, we restore the accepted data structure for working with peakCombiner functions to ensure the modularity of the package. This allows a user to apply the functions center_expand_resgions and filterRegions to the combined consensus region file. Therefore, ‘center’, ‘score’ and ‘sample_name’ columns are created and populated based on the user defined parameters.
Here we explain the parameters you can set for the function combineRegions and the defaults we recommend.
As described in “Run to combine regions”, the default behavior is to retain genomic regions that overlap in at least two samples. When should you considering using a non-default value?
Setting foundInSamples = 1 includes every region from every sample. This is useful if you simply want a union of all peak regions regardless of how often they were found across your replicates. For example, you may already be very confident in the quality of the input genomic regions or you have previously run peakCombiner on subsets of your data to get consensus regions for replicates of the same condition.
When you have many replicates per condition, you may find it helpful to increase the stringency by increasing foundInSamples parameter value.
In general, as you increase foundInSamples, peakCombiner will return fewer and smaller consensus genomic regions.
combineRegions(
  data = data_filtered,
  foundInSamples = 2,
  outputFormat = "tibble",
  showMessages = FALSE,
  combinedSampleName = "foundInSamples_2_example"
)
#> # A tibble: 4 × 8
#>   chrom  start   end name                       score strand center sample_name 
#>   <chr>  <dbl> <dbl> <chr>                      <dbl> <chr>   <dbl> <chr>       
#> 1 chr2       1   650 foundInSamples_2_example|1    50 .         301 foundInSamp…
#> 2 chr10      1   650 foundInSamples_2_example|2    95 .         301 foundInSamp…
#> 3 chr4 2     1   650 foundInSamples_2_example|3    35 .         301 foundInSamp…
#> 4 chr4-2     1   550 foundInSamples_2_example|4    30 .         301 foundInSamp…If the parameter foundInSamples is set to ‘1’, this function basically merges all input regions.
combineRegions(
  data = data_filtered,
  foundInSamples = 1,
  outputFormat = "tibble",
  showMessages = FALSE,
  combinedSampleName = "foundInSamples_1_example"
)
#> # A tibble: 8 × 8
#>   chrom  start   end name                       score strand center sample_name 
#>   <chr>  <dbl> <dbl> <chr>                      <dbl> <chr>   <dbl> <chr>       
#> 1 chr2       1   750 foundInSamples_1_example|1    50 .         301 foundInSamp…
#> 2 chr10      1   750 foundInSamples_1_example|2    95 .         301 foundInSamp…
#> 3 chr42      1   650 foundInSamples_1_example|3    90 .         301 foundInSamp…
#> 4 chr4 2     1   650 foundInSamples_1_example|4    35 .         301 foundInSamp…
#> 5 chr4-2     1   650 foundInSamples_1_example|5    30 .         301 foundInSamp…
#> 6 chr4?2     1   650 foundInSamples_1_example|6    25 .         301 foundInSamp…
#> 7 chr4|2     1   650 foundInSamples_1_example|7    80 .         301 foundInSamp…
#> 8 Chr2     251   950 foundInSamples_1_example|8    80 .         601 foundInSamp…You only need to consider how to select the center of your new genomic region if you plan to run centerExpandRegions to update all the consensus genomic regions so that they are the same size, or if you plan to use the ‘score’ value for your downstream analyses. Each consensus region can be derived from several input regions, each of which has its own ‘center’ and ‘score’ value.
There are three ways you can define the ‘center’ of new genomic regions. Your choice also defines how the ‘score’ is calculated.
Three different option are available:
middle - the mathematical midpoint of the new region. The resulting ‘score’ for each consensus region is the mean of all input score values contributing to that site.strongest - the ‘center’ of the input region that has the the highest ‘score’ of all overlapping input regions. ‘score’ is taken from the used input center.nearest - the ‘center’ of the input region that is closest to mean of the ‘center’s of all overlapping input regions (default). ’score’ is taken from the used input center.Lets have a look at the ‘center’ and ‘score’ within the synthetic data, when setting combinedCenter to ‘strongest’.
combineRegions(
  data = data_filtered,
  combinedCenter = "strongest",
  outputFormat = "tibble",
  showMessages = FALSE
) |> dplyr::select("center", "score")
#> # A tibble: 4 × 2
#>   center score
#>    <dbl> <dbl>
#> 1    301    50
#> 2    301    95
#> 3    301    35
#> 4    301    30Or to ‘middle’.
combineRegions(
  data = data_filtered,
  combinedCenter = "middle",
  outputFormat = "tibble",
  showMessages = FALSE
) |> dplyr::select("center", "score")
#> # A tibble: 4 × 2
#>   center score
#>    <dbl> <dbl>
#> 1   326.  30  
#> 2   326.  87.5
#> 3   326.  30  
#> 4   276.  25Or to ‘nearest’.
combineRegions(
  data = data_filtered,
  combinedCenter = "nearest",
  outputFormat = "tibble",
  showMessages = FALSE
) |> dplyr::select("center", "score")
#> # A tibble: 4 × 2
#>   center score
#>    <dbl> <dbl>
#> 1    301    50
#> 2    301    95
#> 3    301    35
#> 4    301    30We recommend to use the nearest option as this will give you a best common center and associated score.
This vignette was built using:
sessionInfo()
#> R version 4.5.1 Patched (2025-08-23 r88802)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] GenomicRanges_1.61.6  Seqinfo_0.99.3        IRanges_2.43.5       
#>  [4] S4Vectors_0.47.4      AnnotationHub_3.99.6  BiocFileCache_2.99.6 
#>  [7] dbplyr_2.5.1          BiocGenerics_0.55.4   generics_0.1.4       
#> [10] ggplot2_4.0.0         peakCombiner_0.99.603 BiocStyle_2.37.1     
#> 
#> loaded via a namespace (and not attached):
#>  [1] KEGGREST_1.49.2      gtable_0.3.6         xfun_0.53           
#>  [4] bslib_0.9.0          httr2_1.2.1          Biobase_2.69.1      
#>  [7] vctrs_0.6.5          tools_4.5.1          curl_7.0.0          
#> [10] tibble_3.3.0         AnnotationDbi_1.71.2 RSQLite_2.4.3       
#> [13] blob_1.2.4           pkgconfig_2.0.3      RColorBrewer_1.1-3  
#> [16] S7_0.2.0             lifecycle_1.0.4      compiler_4.5.1      
#> [19] farver_2.1.2         stringr_1.5.2        Biostrings_2.77.2   
#> [22] htmltools_0.5.8.1    sass_0.4.10          yaml_2.3.10         
#> [25] crayon_1.5.3         pillar_1.11.1        jquerylib_0.1.4     
#> [28] tidyr_1.3.1          cachem_1.1.0         tidyselect_1.2.1    
#> [31] digest_0.6.37        stringi_1.8.7        dplyr_1.1.4         
#> [34] purrr_1.1.0          bookdown_0.45        BiocVersion_3.22.0  
#> [37] labeling_0.4.3       rprojroot_2.1.1      fastmap_1.2.0       
#> [40] grid_4.5.1           here_1.0.2           cli_3.6.5           
#> [43] magrittr_2.0.4       dichromat_2.0-0.1    utf8_1.2.6          
#> [46] withr_3.0.2          filelock_1.0.3       scales_1.4.0        
#> [49] rappdirs_0.3.3       bit64_4.6.0-1        rmarkdown_2.30      
#> [52] XVector_0.49.1       httr_1.4.7           bit_4.6.0           
#> [55] png_0.1-8            memoise_2.0.1        evaluate_1.0.5      
#> [58] knitr_1.50           rlang_1.1.6          glue_1.8.0          
#> [61] DBI_1.2.3            BiocManager_1.30.26  jsonlite_2.0.0      
#> [64] R6_2.6.1Zhang, Yong, Tao Liu, Clifford A Meyer, Jérôme Eeckhoute, David S Johnson, Bradley E Bernstein, Chad Nusbaum, et al. 2008. “Model-Based Analysis of Chip-Seq (MACS).” Genome Biology 9 (9): R137. https://doi.org/doi:10.1186/gb-2008-9-9-r137.