ChIP-seq Analysis

library(memes)
library(magrittr)
library(ggplot2)
suppressPackageStartupMessages(library(GenomicRanges))

See package website for full vignette

The Bioconductor build system does not have the MEME Suite installed, therefore these vignettes will not contain any R output. To view the full vignette, visit this article page on the memes website at this link

Introduction

In this vignette, we’ll explore using memes to deeply analyze a set of ChIP-seq peaks to identify motifs to explain differences in transcription factor binding, and consequences to chromatin accessibility at these ChIP peaks.

We will use a dataset from (Nystrom, 2020) which investigated the binding of the transcription factor E93 during Drosophila wing development. These data are useful because they contain multiple groupings with interesting motif properties between groups.

The ChIP experiments were performed under two conditions: ectopic expression of E93 during early development, and under wild-type conditions in a late stage of development. Peaks are annotated based on whether they are found in both conditions (“Endogenous” peaks), only in the ectopic expression condition (“Ectopic” peaks), or only found under wild-type conditions (“Orphan” peaks).

A key question from this observation is whether there are motifs which can distinguish these three categories of binding from eachother. To investigate this question, we will first look for enrichment of known transcription factor motifs within each binding category using the AME tool.

To start, ensure memes can detect your local install of the MEME Suite (see vignette("install_guide") for help).

check_meme_install()

Prepare peaks for analysis

A subset of annotated E93 ChIP summits are shipped with the memes package and can be loaded as follows:

The ChIP summits have 3 additional annotation columns: id, peak_binding_description, and e93_sensitive_behavior. id is a unique identifier for each peak, peak_binding_description describes the condition in which E93 is observed to bind to this region (described in detail below), and e93_sensitive_behavior describes the change in chromatin accessibility observed at this site following E93 expression. These annotations will be used throughout this vignette to identify motifs that are associated with differences in E93 binding, and differences in chromatin accessibility at E93 bound sites.

The summit of a ChIP peak represents the single base-pair position with the highest amplitude ChIP signal within a ChIP peak. Typically, motif analysis gives the best results when performed in a small window (100 - 200 bp) around the summit. We can resize our summits to use a 100bp window using plyranges:

Determinants of ectopic and orphan binding

One approach to identify enriched motifs is to search for enrichment of known transcription factor (TF) motifs within sequences of interest. The Fly Factor Survey is a database of known Drosophila TF motifs that we will use throught this vignette. These data are also packaged with memes:

Pre-filtering database for expressed transcription factors

A key consideration when using a database of known transcription factors is whether to include motifs for TFs which are not expressed. On one hand, including all motifs allows maximum discovery of potential regulators, but on the other hand may detect factors which are not relevant under your experimental conditions. Another consideration for many tools is that increasing the number of tested motifs increases the multiple testing penaly, so limiting the motif database to a smaller number of relevant candidates can be a way to improve statistical power. There is no one right answer to this question. For this vignette, I will give an example for how one could go about filtering a database to consider only expressed factors. For other analyses, this question will need to be carefully considered.

First, we load the motif database into R using read_meme. To more easily manipulate the motif metadata, we can convert this to a data.frame using to_df.

To simplify the vignette, we’ve preproccessed RNAseq data from our tissue of interest and shipped them with memes. These are FPKM normalized counts for each transcription factor gene at early and late developmental times.

Choosing a method to decide whether a gene is expressed or not is a non-trivial topic. DO NOT BLINDLY DO WHAT I DO HERE. To keep this vignette on topic, I will use an overly simplistic cutoff of FPKM > 5 to define expressed genes. In this example, I will consider any gene that exceeds the cutoff at either timepoint as expressed.

Finally, I will filter the full motif database to select only those motifs corresponding to expressed gene.

Below is a comparison of the number of motifs in the database before and after filtering.

Using this filtering strategy, we’ve decreased our multiple-testing penalty by ~54%.

As a convenience function, we can set a default motif database to use for each memes function by setting the R option meme_db to a universalmotif object. This can be done as follows:

Alternatively, you can pass a universalmotif object to the database parameter for any memes function and this will override the global option.

Examination of binding categories with AME

Next, to test for enrichment of known motifs between E93 binding categories, we first collect the sequences of each category like so:

The data are returned as a Biostrings List, where each list entry represents the sequences of each E93 binding category.

Finally, we test each set of sequencing using AME with the runAme function. This will run AME on each set of input sequences in the Biostrings List.

This returns a list object where each entry is the AME results for each category. For example, here is what the results for the ectopic peaks look like:

Visualizing AME results

The plot_ame_heatmap() function provides a quick way to visualize AME results. It is built on top of ggplot2, so all ggplot2 functions can be used to further modify the plot.

By default, it uses the -log10(adjusted p-value) as the heat values. See the documentation (?plot_ame_heatmap) for additional notes on customization.

We can visualize the top 10 hits from the ectopic binding results as follows:

To plot results from multiple runs together, they must first be joined into a single data frame. The ame_by_binding object is a list whose names correspond to the E93 binding category. The list can be combined into a data.frame using dplyr::bind_rows. Setting .id = "binding_type" creates a new column binding_type that contains the names from the ame_by_binding list. In this way, the ame_res data.frame contains all AME results for each run, which can be distinguished by the binding_type column.

It is possible to aggregate results from multiple runs into a heatmap by setting the group parameter in plot_ame_heatmap(). This will stratify the y-axis by the entries in the column passed to group. This results in clustering motifs with shared hits across categories along the x-axis for quickly visualizing motifs that may be shared or distinct between multiple groups.

Reducing redundant motif hits

Another key consideration for the above visualization is that in the FlyFactorSurvey database we used, different TFs can have multiple motif entries in the database which are all detected separately by AME. Here, when returning the top 5 hits from each group, you can see, for example, that a motif matching “Aef1” is reported 2 times within the top 5 hits of the ectopic sites. In this situation, it makes sense to summarize the data at the TF level, instead of the motif level. Note: There may be exceptions to this if, for example, a TF has multiple DNA binding sequences it can recognize, in which case having multiple hits may reflect a biological property of your sequences. You will have to handle this on a case-by-case basis for interesting hits and different motif databases. Here, we can see that at least for Aef1, the consensus sequences are very similar.

How to solve this problem will vary with different motif databases (For details on how to pre-process a motif database, see vignette("tidy_motifs")). In the tidy version of the FlyFactorSurvey database, the altnames of the motifs are set to the transcription factor gene symbol. This information is included in the AME results as the motif_alt_id column. In order to reduce the occurrance of redundant hits in the heatmap, we select the hit for each TF with the lowest p-value (i.e. the most significant hit) as follows:

AME Heatmap Visualization

If you have read vignette("core_ame"), the “normalized rank” heatmap was introduced as an alternative visualization method useful when AME produces very large numbers of hits, or when p-values are on very different scales between groups. The ame_compare_heatmap_methods() function can be used to visualize why.

Below is a comparison of the distribution of values when using -log10(adj.pvalue) (A) vs normalized ranks (B). Because there are relatively few hits in the results (~30), and the number of hits between groups varies more than the -log10(p-value) distributions, the “normalize” method will produce a misleading heatmap relative to the -log10(p-value) map.

A third option for adjusting this visualization is to cap the heatmap scale values at a certain value. A good rule of thumb for selecting this cutoff is to view the cumulative distribution plot above in (A), and select a value on the x-axis which captures a majority of the data. Here, we see that -log10(adj.pvalue) < 10 captures >90% of the entopic & ectopic hits, but only ~50% of the orphan hits. By selecting 10 as the heatmap scale cap, we can improve the dynamic range of the signal for values <10, while preserving the information that many orphan sites generally have higher scores than the other categories. The heatmap scale can be capped at 10 by setting scale_max = 10. Below is a comparison of the 3 different visualizations. Notice how the intensity of the orphan vs other categories hits changes.

De-novo motif similarity by binding

Discovery of de-novo motifs within binding categories can be another way to identify meaningful motifs that distinguish between ectopic, entopic, and orphan sites which does not rely on known motif information. The MEME tool, Dreme, discovers short de-novo motifs in input sequences. We begin the analysis by searching for 5 de-novo motifs per binding category and shuffling the input sequences to use as the control set.

We pass the same by_binding list as the input to runDreme which will run Dreme on each set of sequences independently. The results are returned in list format, which we combine into a data.frame using dplyr::bind_rows as above.

Next, we want to compare the de-novo motifs between each binding category to identify motifs which could distinguish the three groups. One way to interrogate this is to examine the correlation score between each motif.

Rename the motifs to indicate the binding category they were discovered in (for visualization purposes).

Create the correlation heatmap:

This heatmap reveals that most motifs discovered in each group are highly similar to the motifs found in other groups.

Test de-novo motif enrichment using AME

The above analysis suggests the motif content of the different binding categories are highly similar in sequence composition. To extend these analyses, we can use AME to test for motif enrichment of the de-novo discovered motifs within each binding category, and determine whether the motifs detected in one category are indeed enriched in another. To do this, we can provide the de-novo motifs as the AME database to test for their enrichment in each sequence category.

runAme() allows using a runDreme() results object as the database input by passing it within a list(). Naming the list() entry produces an informative motif_db name in the results data.frame.

Plotting the heatmap of results reveals that indeed a majority of the de-novo motifs discovered within a single category are detected in all 3 categories, supporting the conclusion that orphan, ectopic, and entopic sites are highly similar in sequence content.

However, there are 2 interesting motifs which distinguish orphan and ectopic sites from entopic sites. To help identify which TFs these motifs might belong to, we can use TomTom to match them to known TF motifs.

First, we select the motif id’s that are not found in entopic sites.

Next, we use the motif id’s from the unique AME results to select those entries in the Dreme results object, and run TomTom on that subset.

Finally, we visualize the TomTom results to identify candidate TFs driving the presence of ectopic and orphan sites.

Motifs in opening vs closing sites

Another category of ChIP peaks in this dataset are ones that overlap regions with dynamic changes to chromatin accessibility in response to E93 expression. These peaks are annotated in the e93_sensitive_behavior column which indicates whether sites have higher accessibility after ectopic E93 expression (“Increasing”), lower accessibility after ectopic E93 expression (“Decreasing”), or if the accessibility does not change (“Static”).

To identify motifs associated with Increasing or Decreasing accessibility in response to ectopic E93 expression, we first grab the DNA sequences of the Increasing, Decreasing, and Static sites by splitting the ChIP peaks on the e93_sensitive_behavior column.

Proper selection of the control set of sequences is critical to identifying biologically relevant motifs. Because we are interested in motifs associated with Increasing or Decreasing behavior, we want to exclude detection of motifs that are simply associated with E93 binding. To do this, we can use the “Static” category of E93 ChIP peaks, because these theoretically represent E93 binding sites without motifs that influence chromatin accessibility. To start, we will search for de-novo motifs within dynamic E93 binding sites using Dreme.

The by_sens list has three entries: Increasing, Decreasing, and Static. When calling runDreme on a sequence list, you can pass the name of a list entry as the control argument to use that set as the background sequences for the remaining sets of sequences. Below we search for de-novo motifs in Increasing and Decreasing peaks using the Static peaks as the background set.

The results are returned as a list with two entries: Increasing, and Decreasing. These can be joined into a single data.frame using dplyr::bind_rows.

Next, we match the de-novo motifs with known TF motifs using TomTom. The TomTom results will be added as new columns to the input data. TomTom data columns begin with best_match_ summarizing the top hit, and the tomtom column is a nested data.frame storing the full results for each input motif.

We can visualize some of the top TomTom hits for each de-novo motif using view_tomtom_hits(). Passing a number to view_tomtom_hits() shows that many matches ranked in descending order. Passing no value will show all hits. It’s a good idea to inspect all hits visually. This can be an extremely revealing step, as it will give you insight into which motifs are generally similar to your input motif. You may also discover proteins with identical motifs, in which case you may need to decide which TF is the better candidate. For brevity, I’ll only show the top 3 hits for each motif.

After manually inspecting the above hits, we may decide that a lower-ranked motif is a better match by eye, or because of domain-specific knowledge that makes a different TF a better candidate. We can swap the best matched motif using force_best_match(). This will update the best_match_ columns and re-rank the tomtom nested data column to set a lower ranked match to the top hit. The updates best_match_ columns will reflect the values corresponding to the new match.

As an example, I’ll set the top hit for m03_AKGG to the lower ranked pho_SANGER_10 since it has weaker consensus sequence in the regions that don’t overlap the de-novo motif.

Replotting the top hits, you can see that “pho_SANGER_10” is now listed as the top hit for “m03_AKGG”.

Finally, we can make publication quality figures using cowplot. This code is a little complicated, but hopefully gives an idea for how you can use R graphics to generate nice motif figures!

A striking result from this is that the de-novo motifs associated with Decreasing sites match the E93 motif itself.

Scanning for motif matches using FIMO

Because we discover a de-novo motif from DREME that matches to E93 in Decreasing sites, this suggests that the E93 motifs in Decreasing sites have different properties relative to E93 motifs in Increasing or Static sites. For example, the E93 motif may have higher similarity to the canonical sequence, may be present in greater number, or may have different positioning in Decreasing sites relative to Increasing or Static sites. To test these questions directly, we can identify individual E93 motif matches within each sequence and investigate their properties. The FIMO tool is used to identify the positions of motif matches in a set of sequences. It will return the coordinates, sequence, and a score (higher values are more similar to the consensus) for each motif found in the input sequences.

In order to scan for the E93 motif within target sequences, we first need a copy of the E93 motif. Although we could import the motifs from our local meme database, it is also possible to use motifs pulled from a MotifDb query as follows. I’ll use that approach here as an example, but if this were a real analysis, I’d take the result from our local database.

Just like runDreme() and runAme(), runFimo() takes a Biostrings::XStringSet as input. Using get_sequence() to generate these for DNA sequences is preferred because it creates sequence inputs that can be parsed into genomic coordinates by Fimo.

The results of runFimo() are returned as a GRanges object containing the positions of each motif discovered in the input sequences. The best way to integrate these data with our input peaks is to use the plyranges suite of tools for performing overlaps and joins between GRanges objects.

Counting the number of motifs per peak

plyranges extends dplyr-like syntax to range objects. Here we add a count for each motif per peak in the n_motifs column. We also add a column has_motif which will be a binary indicator of whether a peak contains any motifs.

First, we want to determine whether E93 sensitive sites are more likely to have E93 motifs in certain response types. Here we can see that sensitive decreasing sites are more likely to have E93 motifs than sensitive increasing or insensitive static sites.

To investigate whether E93-sensitive sites that increase, decrease, or remain static have different numbers of E93 motifs, we plot the fraction of sites with each number of motifs per group. Here we can see that in addition to being more likely to contain an E93 motif, sensitive decreasing sites are more likely to to contain 2 or more matches, where 10% contain at least 2 motifs.

Finally, we want to assess whether the quality of E93 motifs is different between sensitivity categories. To examine this, we need to determine which motifs are found in which peaks. We use plyranges::join_overlap_intersect to return motif entries appended with peak-level metadata, like the peak id each motif is found within.

We use the FIMO score as a proxy for quality, where higher scores are better matches to the motif. Here we examine only the best match (highest score) motif within each peak.

Finally, visualize the FIMO scores for the best E93 motif within each ChIP peak across categories. This reveals that Decreasing sites appear to have higher FIMO scores, suggesting the E93 motifs within Decreasing sites are higher quality E93 motifs than those in Increasing or Static sites.

We can test whether the distribution of FIMO scores between multiple groups is statistically significantly different using an Anderson-Darling test. For a pairwise comparison (in case you only have 2 groups to test), a Kolmogorov-Smirnov test is commonly used instead. The PMCMRplus::adAllPairsTest() performs the Anderson-Darling test and corrects for multiple testing. This reveals that Decreasing FIMO scores are significantly different from Increasing and Static scores (p < 0.05), but Increasing and Static scores are not significantly different from eachother (p > 0.05), supporting the conclusion that E93 motifs in Decreasing sites are higher quality than those in Increasing or Static sites.

We can visualize the differences in E93 motif quality between groups by reconstructing PCM’s from the matched motif sequences. Here, it is clear that although each category contains an E93 motif match, the sequence content of those motifs differs between categories, providing a nice visual representation of the statistical analysis above.

We can also build a more complex representation of these data by making a heatmap of the sequence variability for each matched sequence using plot_sequence_heatmap(). This function will plot each sequence as a row of a heatmap where each column represents the position relative to the motif start coordinate. The cells are colored by the DNA sequence found at that position for each sequence. Using this type of plot it becomes even more clear how variability at position 6 differs between Decreasing and Increasing sites, where Decreasing sites have a very consistent “C” at position 6, while Increasing sites are more likely to have non-C bases at this position.

Centrality of E93 motif

Next we want to visualize whether the E93 motif has altered positioning across the response categories. To do this we add the metadata for each nearest motif to our peak summits. Setting distance = TRUE in plyranges::join_nearest adds a distance column indicating the distance to the nearest joined region.

We can visualize these data as a cumuluative distribution plot. In this example, it does appear that decreasing sites tend to be closer to the peak summit than increasing or static sites.

Conclusion

This is just one example of how memes can integrate multiple data sources to perform deep analysis of genomics data. Here we went from a set of annotated ChIP peaks do multifaceted interrogation of motif content. Each approach has advantages and disadvantages. The most important thing is to critically evaluate the data at every step: visualize de-novo motif hits with view_motifs, compare matches using view_tomtom_hits, and think carefully about which sequences to use as control regions.

References

Nystrom SL, Niederhuber MJ, McKay DJ. Expression of E93 provides an instructive cue to control dynamic enhancer activity and chromatin accessibility during development. Development 2020 Mar 16;147(6).

Session Info

sessionInfo()