Analyzing MeRIP-seq data with TRESS

Zhenxing Guo* and Hao Wu**

*zhenxing.guo@emory.edu
**hao.wu@emory.edu

25 April 2023

Abstract

This vignette introduces the use of the package TRESS (${\it T}$oobox for m${\it R}$NA ${\it E}$pigenetics ${\it S}$equencing analysi${\it S}$), which is designed for analysis of MeRIP-seq data. It provides functionalities for m6A methylation identification and differential m6A methylation identification. TRESS utilizes Bayesian hierarchical negative binomial models to extract signals in raw read counts, and then conducts Wald tests to detect m6A methylation and differential methylation.

Package

TRESS 1.6.0

1 Introduction
- 1.1 Background
- 1.2 Installation
2 Input data preparation
- 2.1 Peak calling
- 2.2 Differential peak calling
3 Example dataset
4 Detection of transcriptome wide peaks
5 Detection of transcriptome wide differential peaks
- 5.1 Divide genome and obtain candidate DMR regions
- 5.2 Call DMRs from candidates
  - 5.2.1 Two-group comparison
  - 5.2.2 General experiment design
Session info

1 Introduction

1.1 Background

The post-transcriptional epigenetic modiﬁcation on mRNA is an emerging ﬁeld to study the gene regulatory mechanism and their association with diseases. Recently developed high-throughput sequencing technology named Methylated RNA Immunoprecipitation Sequencing (MeRIP-seq) enables one to proﬁle mRNA epigenetic modiﬁcation transcriptome-wide.

In MeRIP-seq, mRNA is first fragmented into approximately 100-nucleotide-long oligonucleotides, and then immunoprecipitated by an anti-m$^6$A affinity purified antibody. In addition to the immunoprecipitated (IP) samples, libraries are also prepared for input control fragments to measure the corresponding reference mRNA abundance. This process is an RNA-seq experiment. After sequencing, the reads from both the IP and the input samples are aligned to the reference genome. Due to the enrichment from IP process, transcriptomic regions with m$^6$A will have more reads clustered and have peak-like shapes when visualizing the read counts along the genome. Therefore, people often refer the m$^6$A regions as “peaks”, which is a term usually used in ChIP-seq to represent the protein binding sites. Figure 1 shows some example peaks on the Fat2 gene from a dataset to study m$^6$A dynamics during mouse brain development, where m$^6$A in cerebellums from 2-week old mice are profiled with two biological replicates.

Figure 1: Peaks from MeRIP-seq in a mouse brain study

Two major tasks in the analysis of MeRIP-seq data are to identify transcriptome-wide m$^6$A methylation (namely “peak calling”), and differential m6A methylation.

For the first problem, TRESS builds a two-step procedure to identify transcriptome wide m$^6$A regions. In the first step, it quickly scans the whole transcriptome and losely identify candidate regions using an ad hoc algorithm. In the second step, it models read counts from candidate regions using an empirical hierarchical negative binomial model to accounts for all-sources variations. It also imposes a prior on the dispersion of methylation, which induces a shrinkage variance estimate by borrowing information from all candidate regions. Wald test is constructed to detect significant m$^6$A regions from the candidates. For the second problem, TRESS constructs a general linear framework based on hierarchical negative binomial model, to connect methylation level of each region with factors of interest. This characteristic makes TRESS applicable not only for two-group comparisons, but also in studies with complex design.

1.2 Installation

From GitHub:

install.packages("devtools") # if you have not installed "devtools" package
library(devtools)
install_github("https://github.com/ZhenxingGuo0015/TRESS", 
               build_vignettes = TRUE)

From Bioconductor:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# The following initializes usage of Bioc devel
BiocManager::install(version='devel')

BiocManager::install("TRESS")

To view the package vignette in HTML format, run the following lines in R

library(TRESS)
vignette("TRESS")

2 Input data preparation

2.1 Peak calling

TRESS requires paired input control and IP BAM files for each replicate of all samples: “input1.bam & ip1.bam”, “input2.bam & ip2.bam”, …. in order to call peaks from all replicates. The BAM files contain mapped reads sequenced from respective samples and are the output of sequence alignment tools like Bowtie2.

For illustration purpose, we include four example BAM files and one corresponding genome annotation file in our publicly available data package datasetTRESon github, which can be installed with

install_github("https://github.com/ZhenxingGuo0015/datasetTRES")

The BAM files contain sequencing reads (only on chromosome 19) from two input & IP mouse brain cerebellum samples.

In addition to BAM files, TRESS also needs a path to an annotation file in order to obtain transcriptome-wide bins, bin-level read counts and annotation of each peak. The annotation file is actually is a TXDB and is saved in format of *.sqlite, which is easily created using R function makeTxDbFromUCSC() from Bioconductor package GenomicFeatures:

## Directly use "makeTxDbFromUCSC" function to create one
library(GenomicFeatures)
txdb = makeTxDbFromUCSC("mm9", "knownGene")
# saveDb(txdb, file = paste0("YourPATH", "/", "YourGenome.sqlite")

## or load a TxDb annotation package like
# library(TxDb.Mmusculus.UCSC.mm9.knownGene)
# txdb <- TxDb.Mmusculus.UCSC.mm9.knownGene
# saveDb(txdb, file = paste0("YourPATH", "/", "mm9_knownGene.sqlite")

2.2 Differential peak calling

Model fitting: Similar to peak calling, differential peak calling in TRESS also needs BAM files and corresponding genome annotation file saved in *.sqlite format. However, to compare methylation level across different conditions, paired input & IP BAM files for samples of all conditions are required. Also, the input order of BAM files from different conditions should be appropriately listed in case that samples from different conditions are mistakenly treated as one group.

As TRESS is designed for differential analysis under general experimental design, then sample attributes determined by all factors in study should also be provided to construct a design matrix. For this, TRESS requires a dataframe containing, for each factor, the attribute value of all samples (the order of sample should be exactly the same as BAM files taken by TRESS). A particular model determining which factor will be included into design matrix should also be provided.

Hypothesis testing: All aforementioned input requirements are for model fitting in TRESS. For hypothesis testing, TRESS requires a contrast of coefficients. The contrast should be in line with the name and order of all coefficients in the design matrix. It can be a vector for simple linear relationship detection or a matrix for composite relationship detection.

3 Example dataset

TRESS provides three example MeRIP-seq dataset. The raw sequencing reads of all three data were downloaded from GEO, and mapped using Bowtie2 to generate corresponding BAM files. With BAM files, we apply TRESS to obtain transcriptome bins and candidate regions. A small subset of both bin-level and region-level data are stored as examples in TRESS for package illustration purpose.

The first example dataset Basal comes from 7 basal mouse brain samples (GEO accession number GSE113781). The total number of bins and candidate regions for original data are 1,620,977 and 13,017 respectively. Due to space limit, only data of randomly selected 1000 bins and 500 regions (bins and regions are not necessarily overlapped) are saved. In particular, for each bin and candidate region, both the genomic coordinate and read counts (across 7 paired input & ip replicates) are kept.

library(TRESS)
data("Basal")
Bins <- Basal$Bins$Bins
BinCounts <- Basal$Bins$Counts
dim(BinCounts)

## [1] 1000   14

Candidates <- Basal$Candidates$Regions
CandidatesCounts <- Basal$Candidates$Counts
sf <- Basal$Bins$sf

Check the coordinates of each bin and candidate region.

head(Bins, 3)

##    chr   start     end width strand
## 1 chr1 3195951 3196000    50      -
## 2 chr1 3196001 3196050    50      -
## 3 chr1 3196051 3196100    50      -

head(Candidates, 3)

##    chr     start       end strand    summit
## 1 chr9  90099501  90099700      -  90099625
## 2 chr5 145620001 145620200      + 145620075
## 3 chr2 130408151 130408300      - 130408275

Check the read counts in each bin and candidate region.

dim(BinCounts)

## [1] 1000   14

head(BinCounts, 3)

##      basal_Input_rep1 basal_IP_rep1 basal_Input_rep2 basal_IP_rep2
## [1,]                2             7                3             0
## [2,]                4             6                8             0
## [3,]               44             8               26             2
##      basal_Input_rep3 basal_IP_rep3 basal_Input_rep4 basal_IP_rep4
## [1,]                4             2                2             2
## [2,]               10             2                4             2
## [3,]               22             2               26            12
##      basal_Input_rep5 basal_IP_rep5 basal_Input_rep6 basal_IP_rep6
## [1,]                6             0                4             1
## [2,]               10             0                6             2
## [3,]               14             2               26             4
##      basal_Input_rep7 basal_IP_rep7
## [1,]                1             0
## [2,]                6             2
## [3,]               13             6

dim(CandidatesCounts)

## [1] 500  14

head(CandidatesCounts,3)

##   basal_Input_rep1 basal_IP_rep1 basal_Input_rep2 basal_IP_rep2
## 1               95          2206               92          1430
## 2              750         20841              797         12699
## 3               58          1857               78          1320
##   basal_Input_rep3 basal_IP_rep3 basal_Input_rep4 basal_IP_rep4
## 1              150          1810              164          2380
## 2             1013         17832              956         23479
## 3              112          1449              161          2452
##   basal_Input_rep5 basal_IP_rep5 basal_Input_rep6 basal_IP_rep6
## 1              122          1415              135          1559
## 2              963         15379              859         15951
## 3              111          1762              118          1625
##   basal_Input_rep7 basal_IP_rep7
## 1               80           988
## 2              831         12311
## 3              111          1357

Check the library size factor of each sample in this data.

sf

## basal_Input_rep1    basal_IP_rep1 basal_Input_rep2    basal_IP_rep2 
##        0.6807884        1.7213164        0.7472330        1.1091955 
## basal_Input_rep3    basal_IP_rep3 basal_Input_rep4    basal_IP_rep4 
##        0.7403452        1.3782818        1.0150407        2.2242626 
## basal_Input_rep5    basal_IP_rep5 basal_Input_rep6    basal_IP_rep6 
##        0.7245767        1.2234676        0.6533518        1.2120361 
## basal_Input_rep7    basal_IP_rep7 
##        0.6148696        1.1139263

The second example dataset DMR_M3vsWT comes from human Hela cell lines (GEO46705). This dataset contains samples from two conditions: wild type (WT) and METTL3-Knockout (M3-KO). Each sample contains two input & ip replicates. For comparison between original M3-KO and WT samples, TRESS called 10841 candidate DMRs, but data (genomic coordinates and read counts matrix) for only 200 randomly selected candidate DMRs are kept. Library size factor for each sample is also saved. In addition to 200 candidate DMRs, data for 737 bins overlapping with 6 example candidate DMRs are also included for visualization purpose.

data("DMR_M3vsWT")
Candidates <- DMR_M3vsWT$Regions
CandidatesCounts <- DMR_M3vsWT$Counts
sf <- DMR_M3vsWT$sf

### bins overlapping with 6 candidates
bins <- DMR_M3vsWT$Bins

Check the coordidates of candidate DMRs between WT and M3-KO samples.

dim(Candidates)

## [1] 200   5

head(Candidates, 3)

##       chr     start       end strand            summit
## 2291 chr4  56875218  56875817      + 56875342_56875592
## 3337 chr6  74284144  74284293      -          74284268
## 4518 chr9 136433265 136433564      +         136433439

Check the read counts in candidate DMRs between WT and M3-KO samples.

dim(CandidatesCounts)

## [1] 200   8

head(CandidatesCounts, 3)

##      WT_Input_rep1 WT_IP_rep1 WT_Input_rep2 WT_IP_rep2 M3_Input_rep1 M3_IP_rep1
## 2291           130       1081           114        739           427        540
## 3337           694        815           867        449           171       2047
## 4518            82        731           104        404           268        350
##      M3_Input_rep2 M3_IP_rep2
## 2291           529        518
## 3337           229       1855
## 4518           298        384

Check the library size factor of each sample.

sf

## WT_Input_rep1    WT_IP_rep1 WT_Input_rep2    WT_IP_rep2 M3_Input_rep1 
##     0.5662048     1.9513728     0.6839914     1.1201070     0.8798930 
##    M3_IP_rep1 M3_Input_rep2    M3_IP_rep2 
##     1.7702098     0.8732603     1.7191598

Check the overlapps between bins and candidate DMRs.

library(GenomicRanges)
bins.GR <- GRanges(Rle(bins$chr), IRanges(bins$start, bins$end), 
                   Rle(bins$strand))
Candidates.GR <- GRanges(Rle(Candidates$chr), 
                         IRanges(Candidates$start, Candidates$end), 
                   Rle(Candidates$strand))
library(GenomicFeatures)
sum(countOverlaps(Candidates.GR, bins.GR) > 0)

## [1] 6

The third example dataset DMR_SixWeekvsTwoWeek comes from mouse brain samples (GEO144032). It consists of samples from mouse brain cortex and hypothalamus regions at two time points, 2- and 6-week-old. Each sample contains two paired input and ip replicates. we focus on differential methylation between 2- and 6-week-old samples in each brain region. A total of 15,724 candidate DMRs are called by TRESS and only 200 randomly selected candidate DMRs are kept. Library size factor for each sample is also saved. In addition, methylation ratio of 200 candidate DMRs in each sample are also included to provide an intuitive comparison of methylation level in samples of different conditions.

data("DMR_SixWeekvsTwoWeek")
Candidates <- DMR_SixWeekvsTwoWeek$Regions
CandidatesCounts <- DMR_SixWeekvsTwoWeek$Counts
sf <- DMR_SixWeekvsTwoWeek$sf

Check coordidates of candidate DMRs between six and two week samples.

dim(Candidates)

## [1] 200   5

head(Candidates, 3)

##       chr    start      end strand   summit
## 255  chr1 44047402 44047551      + 44047476
## 4   chr16  4649245  4649394      +  4649319
## 3    chr7  5049157  5049306      +  5049231

Check read counts in candidate DMRs between six and two week samples.

dim(CandidatesCounts)

## [1] 200  16

head(CandidatesCounts, 3)

##     2wk_cortex_input_rep1 2wk_cortex_IP_rep1 2wk_cortex_input_rep2
## 255                    35                  4                    44
## 4                      16                 54                     5
## 3                     160                237                   102
##     2wk_cortex_IP_rep2 2wk_hypothalamus_input_rep1 2wk_hypothalamus_IP_rep1
## 255                  5                          63                       16
## 4                   48                          26                       53
## 3                  280                         216                      267
##     2wk_hypothalamus_input_rep2 2wk_hypothalamus_IP_rep2 6wk_cortex_input_rep1
## 255                          24                       16                     7
## 4                            23                      102                    29
## 3                           180                      459                   121
##     6wk_cortex_IP_rep1 6wk_cortex_input_rep2 6wk_cortex_IP_rep2
## 255                 22                     0                 14
## 4                   31                    20                  9
## 3                  129                   124                102
##     6wk_hypothalamus_input_rep1 6wk_hypothalamus_IP_rep1
## 255                          20                        4
## 4                            19                       26
## 3                            99                      135
##     6wk_hypothalamus_input_rep2 6wk_hypothalamus_IP_rep2
## 255                          15                        7
## 4                            12                       35
## 3                           105                      156

Check the library size factor of each sample.

sf

##       2wk_cortex_input_rep1          2wk_cortex_IP_rep1 
##                   1.5475925                   1.4187376 
##       2wk_cortex_input_rep2          2wk_cortex_IP_rep2 
##                   1.0272155                   0.9727845 
## 2wk_hypothalamus_input_rep1    2wk_hypothalamus_IP_rep1 
##                   1.8370234                   1.4062373 
## 2wk_hypothalamus_input_rep2    2wk_hypothalamus_IP_rep2 
##                   1.1989765                   1.8434219 
##       6wk_cortex_input_rep1          6wk_cortex_IP_rep1 
##                   0.7262692                   0.9211943 
##       6wk_cortex_input_rep2          6wk_cortex_IP_rep2 
##                   0.4519580                   0.7454718 
## 6wk_hypothalamus_input_rep1    6wk_hypothalamus_IP_rep1 
##                   0.7324076                   0.8373596 
## 6wk_hypothalamus_input_rep2    6wk_hypothalamus_IP_rep2 
##                   0.5351408                   1.1228511

Check methylation ratio of 200 candidate DMRs across all samples.

MeRatio <- DMR_SixWeekvsTwoWeek$MeRatio
head(MeRatio, 3)

##     2wk_Cortex_rep1 2wk_Cortex_rep2 2wk_Hypothalamus_rep1 2wk_Hypothalamus_rep2
## 255       0.1108468       0.1071387             0.2491189             0.3024580
## 4         0.7863946       0.9102105             0.7269940             0.7425612
## 3         0.6177052       0.7435040             0.6175588             0.6238538
##     6wk_Cortex_rep1 6wk_Cortex_rep2 6wk_Hypothalamus_rep1 6wk_Hypothalamus_rep2
## 255       0.7124642       0.9036763             0.1488874             0.1819433
## 4         0.4573393       0.2143442             0.5448147             0.5816000
## 3         0.4566768       0.3327581             0.5439460             0.4145467

4 Detection of transcriptome wide peaks

Peak calling in TRESS is performed using one wrapper function TRESS_peak(). Here we use example BAM files from datasetTRES as an example.

## Directly take BAM files in "datasetTRES" available on github
library(datasetTRES)
Input.file = c("cb_input_rep1_chr19.bam", "cb_input_rep2_chr19.bam")
IP.file = c("cb_ip_rep1_chr19.bam", "cb_ip_rep2_chr19.bam")
BamDir = file.path(system.file(package = "datasetTRES"), "extdata/")
annoDir = file.path(system.file(package = "datasetTRES"),
                    "extdata/mm9_chr19_knownGene.sqlite")
OutDir = "/directory/to/output"  
TRESS_peak(IP.file = IP.file,
           Input.file = Input.file,
           Path_To_AnnoSqlite = annoDir,
           InputDir = BamDir,
           OutputDir = OutDir, # specify a directory for output
           experiment_name = "examplebyBam", # name your output 
           filetype = "bam")

### example peaks
peaks = read.table(file.path(system.file(package = "TRESS"),
                           "extdata/examplebyBam_peaks.xls"),
                 sep = "\t", header = TRUE)
head(peaks[, -c(5, 14, 15)], 3)

To replace the example BAM files with your BAM files, the codes are:

## or, take BAM files from your path
Input.file = c("input_rep1.bam", "input_rep2.bam")
IP.file = c("ip_rep1.bam", "ip_rep2.bam")
BamDir = "/directory/to/BAMfile"
annoDir = "/path/to/xxx.sqlite"
OutDir = "/directory/to/output"
TRESS_peak(IP.file = IP.file,
           Input.file = Input.file,
           Path_To_AnnoSqlite = annoDir,
           InputDir = BamDir,
           OutputDir = OutDir,
           experiment_name = "xxx",
           filetype = "bam")
peaks = read.table(paste0(OutDir, "/", 
                          "xxx_peaks.xls"), 
                   sep = "\t", header = TRUE)
head(peaks, 3)

As a wrapper function, TRESS_peak() combines multiple subfunctions to detect m6A regions transcriptome-wide. The whole process involves following steps:

$\quad 1$. Divide genome into equal sized bins and obtain read counts in each replicate using DivideBins().

$\quad 2$. Call candidate regions based on bin-level data across all replicates using CallCandidates()

$\quad 3$. Fit Negative binomial models for read counts in candidate regions to detect and rank peaks among all candidates using CallPeaks.multiRep().

If there is only one replicate, functions in step 2 and step 3 will be replaced by one function CallPeaks.oneRep(), while function DivideBins() will still be used to obtain bin-level data. Please see the man pages for detailed usage of DivideBins(). Given bin-level data, peak calling by sub-functions are included in following Section 4.1 and Section 4.2.

4.1 Peak calling with multiple replicates

4.1.1 Obtain candidate regions

With bin-level read counts, binomial test is first conducted for each bin. Then an ad hoc bump-finding algorithm is applied to merge significant bins (and/or bins with large fold change) and form bumps in each replicate. Bumps from all input & IP replicates are unioned together to construct a list of candidate regions. Both binomial tests and bump-finding are done by function CallCandidates() in TRESS.

Here, we use example dataset Basal introduced in Section 3 to run CallCandidates().

## load in first example dataset
data("Basal") 
Candidates = CallCandidates(
    Counts = Basal$Bins$Counts,
    bins = Basal$Bins$Bins
    )

4.1.2 Call peaks from candidates

After obtaining candidate regions, TRESS models read counts in candidate regions using a hierarchical negative-binomial model. Wald tests are then conducted to detect significant regions from candidates, where a region with methylation level significantly higher than the background is considered as sigificant. The background methylation level is estimated based on total read counts from non-candidate regions, which is calculated as the total bin-counts subtracted by counts from candidate regions. This can be obtained using function BgMethylevel() in TRESS. With background methylation level estimated, complete parameter estimation and statistical inference for candidate regions are achieved by function CallPeaks.multiRep() in TRESS, where argument Candidates is a list containing at least two components: genomic coordinates (e.g., “chr”, “start” and “end” etc) of all candidate regions and their read counts across all samples. Here is an example, provided that the background methylation level is 0.5,

data("Basal") ### load candidate regions
Basal$Candidates$sf = Basal$Bins$sf
peaks1 = CallPeaks.multiRep(
  Candidates = Basal$Candidates,
  mu.cutoff = 0.5
  )
head(peaks1, 3)

A word about CallPeaks.multiRep() Different criteria are available in TRESS to filter peaks based on results of Wald tests, including p-values, FDR and log fold change (logFC). One needs to first specify a criterion through argument WhichThreshold and then provide a cutoff for that criterion through one or two arguments named *.cutoff. By default, TRESS adopts both FDR and logFC (throughWhichThreshold = "fdr_lfc") to select peaks. The default cutoffs for FDR and logFC are 0.05 and 0.7 (for fold change of 2) respectively, meaning that peaks with FDR < 0.05 and logFC > 0.7 will be kept. With WhichThreshold = "fdr_lfc", one can change the cutoffs to more stringent values, e.g., FDR < 0.01 and logFC > 1.6 by setting fdr.cutoff = 0.01 and lfc.cutoff = 1.6:

### use different threshold to filter peaks
peaks2 = CallPeaks.multiRep(
  Candidates = Basal$Candidates,
  mu.cutoff = 0.5, 
  fdr.cutoff = 0.01, 
  lfc.cutoff = 1.6  
  )

However, one can also set WhichThreshold = "fdr" to select peaks only with FDR, then by default peaks with FDR < 0.05 will be kept:

### use different threshold to filter peaks
peaks3 = CallPeaks.multiRep(
  Candidates = Basal$Candidates,
  mu.cutoff = 0.5, 
  WhichThreshold = "fdr"
  )

Please read the manual page of function CallPeaks.multiRep() for more details on each argument.

Given the usage of function “CallPeaks.multiRep()”, it can also be adopted to re-rank existing peaks with our developed methods, if their read counts are available. This may perform bad if one didn’t properly estimate library size factor (one argument of Candidates) for each sample. Based on our experience, the estimation of size factor should be based on the bin-level counts across the whole transcriptome, not the region-level counts. For background methylation level, one can use 0.5 but it would be informative if estimating it from the data.

4.2 Peak calling with one replicate

The above two-step approach is for data with multiple replicates. For data with only one replicate, TRESS_peak() calls function CallPeaks.oneRep() for peak calling given bin-level data. In this case, bumps in the only one replicate are taken as final list of peaks. The statistical significance of each peak comes from binomial tests. Here is an example of how to run function CallPeaks.oneRep().

# A toy example
data("Basal")
bincounts = Basal$Bins$Counts[, 1:2]
sf0 = Basal$Bins$sf[1:2]
bins = Basal$Bins$Bins
peaks = CallPeaks.oneRep(Counts = bincounts, 
                         sf = sf0, bins = bins)
head(peaks, 3)

4.3 Peak visualization

With pre-called peaks in hand, one can visualize them using function “ShowOnePeak()” in TRESS. The usage of this function is

ShowOnePeak(onePeak, allBins, binCounts, ext = 500, ylim = c(0,1))

In order to run this function, one needs to have: 1) “onePeak”: a pre-called peak saved as a dataframe, which contains genomic positions for that peak: “chr”, “start”, “end”, “strand”; 2) “allBins”: genomic positions (“chr”, “start”, “end”, “strand”) of bins overlapping at least with the pre-called peak; 3) “binCounts”: the corresponding bin-level read counts in each replicate. This function will plot for each replicate: the methylation level of bins (blue bars) within the target peak (shade region in pink), and the normalized sequencing depth for input samples (curves in grey). We show some example plots here:

peaks = read.table(file.path(system.file(package = "TRESS"),
                             "extdata/examplebyBam_peaks.xls"),
                   sep = "\t", header = TRUE) 
load(file.path(system.file(package = "TRESS"),
               "extdata/examplebyBam.rda"))
allBins = as.data.frame(bins$bins)
colnames(allBins)[1] = "chr"
allBins$strand = binStrand

head(peaks, 1)

##     chr    start      end strand   summit    lg.fc cb_input_rep1_chr19.bam
## 1 chr19 23239316 23240265      + 23240090 1.357281                     529
##   cb_ip_rep1_chr19.bam cb_input_rep2_chr19.bam cb_ip_rep2_chr19.bam       mu
## 1                 6092                     380                 2479 0.863035
##        mu.var    stats     shrkPhi shrkTheta pvals p.adj   rScore
## 1 0.001154263 13.20332 0.002447933  9.974182     0     0 2.850148

ShowOnePeak(onePeak = peaks[1,], allBins = allBins, 
            binCounts = allCounts)

5 Detection of transcriptome wide differential peaks

Unlike peak calling which implements all steps in one function TRESS_peak(), TRESS separates the model fitting, which is the most computationally heavy part, from the hypothesis testing. The model fitting is implemented by TRESS_DMRfit() while the hypothesis testing is implemented by TRESS_DMRtest(). Given an experimental design with multiple factors, the parameter estimation (model fitting) only needs to be performed once, and then the hypothesis testing for DMR calling can be performed for different factors efficiently.

Given all required files, differential peak calling in TRESS can be done through,

InputDir = "/directory/to/BAMfile"
Input.file = c("input1.bam", "input2.bam",..., "inputN.bam")
IP.file = c("ip1.bam", "ip2.bam", ..., "ipN.bam")
OutputDir = "/directory/to/output"
Path_sqlit = "/path/to/xxx.sqlite"
variable = "YourVariable" # a dataframe containing both
# testing factor and potential covariates, 
# e.g., for a two-group comparison with balanced samples 
# variable = data.frame(Trt = rep(c("Ctrl", "Trt"), each = N/2))
model = "YourModel"     # e.g. model = ~1 + Trt
DMR.fit = TRESS_DMRfit(IP.file = IP.file,
                       Input.file = Input.file,
                       Path_To_AnnoSqlite = Path_sqlit,
                       variable = variable,
                       model = model,
                       InputDir = InputDir,
                       OutputDir = OutputDir,
                       experimentName = "xxx"
                       )
CoefName(DMR.fit)# show the name and order of coefficients 
                 # in the design matrix
Contrast = "YourContrast" # e.g., Contrast = c(0, 1)
DMR.test = TRESS_DMRtest(DMR = DMR.fit, contrast = Contrast)

Similar to TRESS_peak(), TRESS_DMRfit() involves multiple procedures for differential analysis:

$\quad 1$. Divide the whole genome into equal sized bins and obtain bin read counts for samples across all conditions using DivideBins()

$\quad 2$. Loosely call candidate DMRs from samples across all conditions using CallCandidates().

$\quad 3$. Model candidate DMRs using Negative binomial distribution and estimates all parameters for each candidate DMR, using CallDMRs.paramEsti().

5.1 Divide genome and obtain candidate DMR regions

TRESS applies the same algorithm used in peak calling to call candidate DMRs, which is a union of candidate regions from group-specific samples. Given $N$ paired input and IP BAM files, candidate DMRs from all samples are obtained using:

InputDir = "/directory/to/BAMfile"
Input.file = c("input1.bam", "input2.bam",..., "inputN.bam")
IP.file = c("ip1.bam", "ip2.bam", ..., "ipN.bam")
OutputDir = "/directory/to/output"
Path_sqlit = "/path/to/xxx.sqlite"
allBins = DivideBins(IP.file = IP.file,
                     Input.file = Input.file,
                     Path_To_AnnoSqlite = Path_sqlit,
                     InputDir = InputDir,
                     OutputDir = OutputDir,
                     experimentName = "example")
Candidates = CallCandidates(Counts = allBins$binCount, 
                            bins = allBins$bins)
Candidates = filterRegions(Candidates)

As listed above, prior to DMR detection, TRESS filters out some candidates who have small marginal coefficient of variation in methylation ratio ( $\frac{\text{normalized } IP}{ \text{normalized } IP \text{ + }\text{normalized } input}$). Similar to modeling transcriptome regions instead of bins, this pre-filtering step helps to reduce the hypothesis testing space and thus improves statistical power after multiple testing correction.

5.2 Call DMRs from candidates

With candidate DMRs, TRESS fits hierarchical negative binomial models for read counts in them, where methylation level of each candidate DMR is connected to factors of interest using a linear frame work. The Negative binomial model fitting here consists of: (1) estimate coefficients in design matrix for each candidate DMR; (2) estimate dispersion of methylation for each candidate DMR;

After model fitting, TRESS conducts Wald test for each candidate DMR. As mentioned earlier, the methylation level of each candidate is linked to factors in a linear frame work. This characteristic allows for testing all coefficients in the design and any linear combination of them through a particular contrast. Therefore, TRESS can be applied in studies only considering a two-group comparison and studies with a more complex design.

5.2.1 Two-group comparison

Here, we use a human data DMR_M3vsWT as an example. As introduced in the second example dataset in Section 3, it consists of read counts in candidate DMRs from samples under two conditions: WT and M3-KO. There are two replicates of paired IP and input data for samples from each condition. Given candidate DMRs, the estimation of parameters in each candidate DMR can be done through:

data("DMR_M3vsWT") # data from TRESS
head(DMR_M3vsWT$Counts, 3)
variable = data.frame(predictor = rep(c("WT", "M3"), c(2, 2)))
model = ~1+predictor
DMR.fit = CallDMRs.paramEsti(counts = DMR_M3vsWT$Counts,
                             sf = DMR_M3vsWT$sf,
                             variable = variable,
                             model = model)

Note, the sample order in above variable$predictor must be exactly the same as that in data matrix DMR_M3vsWT$Counts.

To compare methylation levels between WT and M3-KO samples, do,

CoefName(DMR.fit) ## show variable name of each coefficient
DMR.test = TRESS_DMRtest(DMR = DMR.fit, contrast = c(0, 1))
DMR.test[which(DMR.test$padj < 0.05)[1:3], ]

The option contrast = c(0, 1) is equivalent to test coefficient $\beta^{predictorWT} = 0$.

Individual DMR can be visualized using function ShowOnePeak():

data("DMR_M3vsWT")
snames = paste0(rep(c("WT_", "M3_"), each = 2), 
                rep(c("Replicate1", "Replicate2"), 2))
ShowOnePeak(onePeak = DMR_M3vsWT$Regions[1, ],
            allBins = DMR_M3vsWT$Bins,
            binCounts = DMR_M3vsWT$BinsCounts,
            isDMR = TRUE,
            Sname = snames)

5.2.2 General experiment design

For data from experiments with a general design including multiple factors, hypothesis testing can be performed for one, multiple, or any linear combination of the parameters.

Here, we use an example mouse brain data DMR_SixWeekvsTwoWeek as an example to illustrate how TRESS detects DMRs in a multifactor design. As mentioned in Section 3, it contains candidate read counts from two mouse brain regions (cortex and hypothalamus) at two time points (2-week-old and 6-week-old). Each sample contains two paired input and IP replicates.

To fit TRESS on this data without considering the interaction between time and region, do,

library(TRESS)
data("DMR_SixWeekvsTwoWeek")
variable = data.frame(time = rep(c("2wk", "6wk"), each = 4),
                    region = rep(rep(c("Cortex", "Hypothalamus"), 
                                       each = 2),2)
                    )
model = ~1 + time + region 
DMRfit = CallDMRs.paramEsti(counts = DMR_SixWeekvsTwoWeek$Counts, 
                            sf = DMR_SixWeekvsTwoWeek$sf,
                            variable = variable,
                            model = model)

## [1] "Start to estimate preliminary MLE ..."
## [1] "Update phi with its posterior ..."
## The 1th sampling of data...
## The 2th sampling of data...
## The 3th sampling of data...
## The 4th sampling of data...
## The 5th sampling of data...
## The 6th sampling of data...
## The 7th sampling of data...
## The 8th sampling of data...
## The 9th sampling of data...
## The 10th sampling of data...
## [1] "Calculate variance-covariance of coefficients..."
## [1] "Output all estimate..."

Again, the sample order in variable$time and in variable$region must be exactly the same as that in data matrix DMR_SixWeekvsTwoWeek$Counts.

To test the whole time effect, do:

CoefName(DMRfit)

## [1] "(Intercept)"        "time6wk"            "regionHypothalamus"

DMRtest = TRESS_DMRtest(DMRfit, contrast = c(0, 1, 0))
idx = order(abs(DMRtest$stat), decreasing = TRUE)
DMRtest[idx[1:3], ]

##    baseMean      logOR     lorSE   stat       pvalue         padj
## 1 0.4834676  1.3607333 0.2450920  5.552 2.825333e-08 5.481145e-06
## 2 0.6561851 -0.9591358 0.1850481 -5.183 2.181443e-07 2.116000e-05
## 4 0.6205323 -1.4985708 0.3041453 -4.927 8.343612e-07 4.218062e-05

DMR_SixWeekvsTwoWeek$MeRatio[idx[1:3],]

##   2wk_Cortex_rep1 2wk_Cortex_rep2 2wk_Hypothalamus_rep1 2wk_Hypothalamus_rep2
## 1       0.1958101       0.3814389             0.3317863             0.3795577
## 2       0.7854077       0.7823082             0.6885950             0.7802045
## 4       0.7863946       0.9102105             0.7269940             0.7425612
##   6wk_Cortex_rep1 6wk_Cortex_rep2 6wk_Hypothalamus_rep1 6wk_Hypothalamus_rep2
## 1       0.5260813       0.5728586             0.7698927             0.7103151
## 2       0.5797789       0.5206678             0.5920979             0.5204208
## 4       0.4573393       0.2143442             0.5448147             0.5816000

Here, the reason we specify contrast = c(0, 1, 0) is that the second column of design matrix model.matrix( model, variable) corresponds to time.
This contrast is equivalent to test the coefficient of time equal to 0, i.e., $\beta^{\text{time6wk}} = 0$.

If one considers the interaction between time and region and wants to test region-specific time effect, just include time*region into model and do,

model.int = ~1+ time + region + time*region 
model.matrix( model.int, variable)

##   (Intercept) time6wk regionHypothalamus time6wk:regionHypothalamus
## 1           1       0                  0                          0
## 2           1       0                  0                          0
## 3           1       0                  1                          0
## 4           1       0                  1                          0
## 5           1       1                  0                          0
## 6           1       1                  0                          0
## 7           1       1                  1                          1
## 8           1       1                  1                          1
## attr(,"assign")
## [1] 0 1 2 3
## attr(,"contrasts")
## attr(,"contrasts")$time
## [1] "contr.treatment"
## 
## attr(,"contrasts")$region
## [1] "contr.treatment"

DMRfit.int = CallDMRs.paramEsti(
  counts = DMR_SixWeekvsTwoWeek$Counts, 
  sf = DMR_SixWeekvsTwoWeek$sf, 
  variable = variable,
  model = model.int
  )

## [1] "Start to estimate preliminary MLE ..."
## [1] "Update phi with its posterior ..."
## The 1th sampling of data...
## The 2th sampling of data...
## The 3th sampling of data...
## The 4th sampling of data...
## The 5th sampling of data...
## The 6th sampling of data...
## The 7th sampling of data...
## The 8th sampling of data...
## The 9th sampling of data...
## The 10th sampling of data...
## [1] "Calculate variance-covariance of coefficients..."
## [1] "Output all estimate..."

Then, in cortex, DMRs caused by time effect can be obtained through:

CoefName(DMRfit.int)

## [1] "(Intercept)"                "time6wk"                   
## [3] "regionHypothalamus"         "time6wk:regionHypothalamus"

CTX.6vs2 = TRESS_DMRtest(DMRfit.int, contrast = c(0, 1, 0, 0))
idx = order(abs(CTX.6vs2$stat), decreasing = TRUE)
CTX.6vs2[idx[1:3], ]

##      baseMean     logOR     lorSE   stat       pvalue         padj
## 4   0.6205323 -2.276190 0.4100494 -5.551 2.840188e-08 5.680376e-06
## 255 0.3426412  3.339239 0.6151941  5.428 5.700713e-08 5.700713e-06
## 3   0.5438187 -1.173738 0.2545241 -4.612 3.997721e-06 2.665147e-04

DMR_SixWeekvsTwoWeek$MeRatio[
  idx[1:3], 
  grepl("Cortex",colnames(DMR_SixWeekvsTwoWeek$MeRatio))]

##     2wk_Cortex_rep1 2wk_Cortex_rep2 6wk_Cortex_rep1 6wk_Cortex_rep2
## 4         0.7863946       0.9102105       0.4573393       0.2143442
## 255       0.1108468       0.1071387       0.7124642       0.9036763
## 3         0.6177052       0.7435040       0.4566768       0.3327581

In hypothalamus, DMRs caused by time effect can be obtained through:

HTH.6vs2 = TRESS_DMRtest(DMRfit.int, contrast = c(0, 1, 0, 1))
idx = order(abs(HTH.6vs2$stat), decreasing = TRUE)
HTH.6vs2[idx[1:3], ]

##     baseMean    logOR     lorSE  stat       pvalue         padj
## 64 0.5472207 2.106293 0.4145528 5.081 3.756905e-07 0.0000751381
## 1  0.4834676 1.602397 0.3427155 4.676 2.931112e-06 0.0002931112
## 36 0.5742661 2.005997 0.4481574 4.476 7.601952e-06 0.0004473374

DMR_SixWeekvsTwoWeek$MeRatio[
  idx[1:3], 
  grepl("Hypothalamus",colnames(DMR_SixWeekvsTwoWeek$MeRatio))]

##    2wk_Hypothalamus_rep1 2wk_Hypothalamus_rep2 6wk_Hypothalamus_rep1
## 64             0.3211882             0.5110739             0.8158931
## 1              0.3317863             0.3795577             0.7698927
## 36             0.3121385             0.4428793             0.8194071
##    6wk_Hypothalamus_rep2
## 64             0.9110851
## 1              0.7103151
## 36             0.8325004

Detection of DMRs by interaction of time and region (meaning that the temporal difference in methylation of these DMRs changes spatially) can be done through:

CTXvsHTH.6vs2 = TRESS_DMRtest(DMRfit.int, contrast = c(0, 0, 0, 1))
idx = order(abs(CTXvsHTH.6vs2$stat), decreasing = TRUE)
CTXvsHTH.6vs2[idx[1:3], ]

##      baseMean     logOR     lorSE   stat       pvalue         padj
## 255 0.3426412 -3.908666 0.7945591 -4.919 8.685917e-07 0.0001737183
## 669 0.7239899  1.736373 0.4252211  4.083 4.437035e-05 0.0036600171
## 672 0.5595251 -4.510067 1.1250368 -4.009 6.102350e-05 0.0036600171

DMR_SixWeekvsTwoWeek$MeRatio[
  idx[1:3], c(1,2,5,6,3,4,7,8)]

##     2wk_Cortex_rep1 2wk_Cortex_rep2 6wk_Cortex_rep1 6wk_Cortex_rep2
## 255       0.1108468       0.1071387       0.7124642       0.9036763
## 669       0.7854224       0.6935109       0.6467831       0.6260932
## 672       0.4999804       0.1310775       0.7862882       0.9368236
##     2wk_Hypothalamus_rep1 2wk_Hypothalamus_rep2 6wk_Hypothalamus_rep1
## 255             0.2491189             0.3024580             0.1488874
## 669             0.5920845             0.7076847             0.8763107
## 672             0.6902550             0.6243643             0.1551789
##     6wk_Hypothalamus_rep2
## 255             0.1819433
## 669             0.8640297
## 672             0.6559275

Session info

## R version 4.3.0 RC (2023-04-13 r84269)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] GenomicFeatures_1.52.0 AnnotationDbi_1.62.0   Biobase_2.60.0        
##  [4] GenomicRanges_1.52.0   GenomeInfoDb_1.36.0    IRanges_2.34.0        
##  [7] TRESS_1.6.0            S4Vectors_0.38.0       BiocGenerics_0.46.0   
## [10] BiocStyle_2.28.0      
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.0            dplyr_1.1.2                
##  [3] blob_1.2.4                  filelock_1.0.2             
##  [5] Biostrings_2.68.0           bitops_1.0-7               
##  [7] fastmap_1.1.1               RCurl_1.98-1.12            
##  [9] BiocFileCache_2.8.0         GenomicAlignments_1.36.0   
## [11] XML_3.99-0.14               digest_0.6.31              
## [13] lifecycle_1.0.3             KEGGREST_1.40.0            
## [15] RSQLite_2.3.1               magrittr_2.0.3             
## [17] compiler_4.3.0              rlang_1.1.0                
## [19] sass_0.4.5                  progress_1.2.2             
## [21] tools_4.3.0                 utf8_1.2.3                 
## [23] yaml_2.3.7                  rtracklayer_1.60.0         
## [25] knitr_1.42                  prettyunits_1.1.1          
## [27] bit_4.0.5                   curl_5.0.0                 
## [29] DelayedArray_0.26.0         xml2_1.3.3                 
## [31] BiocParallel_1.34.0         grid_4.3.0                 
## [33] fansi_1.0.4                 biomaRt_2.56.0             
## [35] SummarizedExperiment_1.30.0 cli_3.6.1                  
## [37] rmarkdown_2.21              crayon_1.5.2               
## [39] generics_0.1.3              httr_1.4.5                 
## [41] rjson_0.2.21                DBI_1.1.3                  
## [43] cachem_1.0.7                stringr_1.5.0              
## [45] zlibbioc_1.46.0             BiocManager_1.30.20        
## [47] XVector_0.40.0              restfulr_0.0.15            
## [49] matrixStats_0.63.0          vctrs_0.6.2                
## [51] Matrix_1.5-4                jsonlite_1.8.4             
## [53] bookdown_0.33               hms_1.1.3                  
## [55] bit64_4.0.5                 magick_2.7.4               
## [57] jquerylib_0.1.4             glue_1.6.2                 
## [59] codetools_0.2-19            stringi_1.7.12             
## [61] BiocIO_1.10.0               tibble_3.2.1               
## [63] pillar_1.9.0                rappdirs_0.3.3             
## [65] htmltools_0.5.5             GenomeInfoDbData_1.2.10    
## [67] R6_2.5.1                    dbplyr_2.3.2               
## [69] evaluate_0.20               lattice_0.21-8             
## [71] highr_0.10                  png_0.1-8                  
## [73] Rsamtools_2.16.0            memoise_2.0.1              
## [75] bslib_0.4.2                 Rcpp_1.0.10                
## [77] xfun_0.39                   MatrixGenerics_1.12.0      
## [79] pkgconfig_2.0.3