--- title: VCF filter rules author: - name: Kévin Rue-Albrecht email: kevinrue67@gmail.com affiliation: - Department of Medicine, Imperial College London, UK - Nuffield Department of Medicine, University of Oxford, UK date: "`r doc_date()`" package: "`r pkg_ver('TVTB')`" abstract: > Currently, `VCF` objects of the `r Biocpkg("VariantAnnotation")` package may be subsetted by indices or names of variant records. The `r Biocpkg("TVTB")` package extends `FilterRules` of the `r Biocpkg("S4Vectors")` package to provide filter rules readily applicable to individual slots of `VCF` objects. These new classes of filter rules provide containers for powerful expressions and functions to facilitate filtering of variants using combinations of filters applicable to FIXED fields, INFO fields, and Ensembl VEP predictions stored in a given INFO field of VCF files. vignette: > %\VignetteIndexEntry{VCF filter rules} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} output: BiocStyle::html_document: toc_float: true bibliography: TVTB.bib --- # Motivation {#Motivation} ## Background `VCF` objects of the `r Biocpkg("VariantAnnotation")` package contain a plethora of information imported from specific fields of source VCF files and stored in dedicated slots (*e.g.* `fixed`, `info`, `geno`), as well as optional Ensembl VEP predictions [@RN1] stored under a given key of their INFO slot. This information may be used to identify and filter variants of interest for further analysis. However, the size of genetic data sets and the variety of filter rules---and their combinatorial explosion---create considerable challenges in terms of workspace memory and entropy (*i.e.* size and number of objects in the workspace, respectively). The `FilterRules` class implemented in the `r Biocpkg("S4Vectors")` package provides a powerful tool to create flexible and lightweight filter rules defined in the form of `expression` and `function` objects that can be evaluated within given environments. The `r Biocpkg("TVTB")` package extends this `FilterRules` class into novel classes of *VCF filter rules*, applicable to information stored in the distinct slots of `VCF` objects (*i.e.* `CollapsedVCF` and `ExpandedVCF` classes), as described below: ```{r filterRulesMotivation, echo=FALSE, results='asis'} motivations <- readRDS( system.file("misc", "motivations_rules.rds", package = "TVTB") ) pander::pandoc.table( motivations, paste( "Motivation for each of the new classes extending `FilterRules`, to define VCF filter rules." ), style = "multiline", split.cells = c(15, 45), split.tables = c(Inf) ) ``` Table: Motivation for each of the new classes extending `FilterRules` to define *VCF filter rules*. Note that `FilterRules` objects themselves are applicable to `VCF` objects, with two important difference from the above specialised classes: * Expressions must explicitely refer to the different `VCF` slots * As a consequence, a single expression can refer to fields from different `VCF` slots, for instance: ```{r FilterRules} fr <- S4Vectors::FilterRules(list( mixed = function(x){ VariantAnnotation::fixed(x)[,"FILTER"] == "PASS" & VariantAnnotation::info(x)[,"MAF"] >= 0.05 } )) fr ``` ## Features {#Features} As they inherit from the `FilterRules` class, these new classes benefit from accessors and methods defined for their parent class, including: * *VCF filter rules* can be toggled individually between an active and an inactive states * *VCF filter rules* can be subsetted, edited, replaced, and deleted To account for the more complex structure of `VCF` objects, some of the new *VCF filter rules* classes implemented in the `r Biocpkg("TVTB")` package require additional information stored in new dedicated slots, associated with the appropriate accessors and setters. For instance: * `VcfVepRules` require the INFO key where predictions of the Ensembl Variant Effect Predictor are stored in a `VCF` object. The `vep` accessor method may be used to access this slot. * `VcfFilterRules`---which may combine any number of filter rules stored in `FixedRules`, `VcfFixedRules`, `VcfInfoRules`, `VcfVepRules`, and other `VcfFilterRules` objects--- mark each filter rule with their type in the combined object. The information is stored in the `type` slot, which may be accessed using the *read-only* accessor method `type`. # Demonstration data {#DemoData} For the purpose of demonstrating the utility and usage of *VCF filter rules*, a set of variants and associated phenotype information was obtained from the [1000 Genomes Project](http://www.1000genomes.org) Phase 3 release. It can be imported as a `CollapsedVCF` object using the following code: ```{r makeCollapsedVCF} library(TVTB) extdata <- system.file("extdata", package = "TVTB") vcfFile <- file.path(extdata, "chr15.phase3_integrated.vcf.gz") tabixVcf <- Rsamtools::TabixFile(file = vcfFile) vcf <- VariantAnnotation::readVcf(file = tabixVcf) ``` VCF filter rules may be applied to `ExpandedVCF` objects equally: ```{r makeExpandedVCF} evcf <- VariantAnnotation::expand(x = vcf, row.names = TRUE) ``` ## CollapsedVCF and ExpandedVCF {#VcfClasses} As described in the documentation of the `r Biocpkg("VariantAnnotation")` package, the key difference between `CollapsedVCF` and `ExpandedVCF` objects ---both extending the `VCF` class---is the expansion of multi-allelic records into bi-allelic records, respectively. In other words (quoting the `r Biocpkg("VariantAnnotation")` documentation): > "`CollapsedVCF` objects contains the ALT data as a > `DNAStringSetList` allowing for multiple alleles per variant. > In contrast, the `ExpandedVCF` stores the ALT data as a `DNAStringSet` > where the ALT column has been expanded to create a flat form of the data > with one row per variant-allele combination." This difference has implications for filter rules using the `"ALT"` field of the `info` slot, as demonstrated in a [later section](#AltField). # Fields available for the definition of filter rules {#VcfFields} First, let us examine which fields (*i.e.* column names) are available in the `VCF` objects to create *VCF filter rules*: ```{r simpleRules} fixedVcf <- colnames(fixed(vcf)) fixedVcf infoVcf <- colnames(info(vcf)) infoVcf csq <- ensemblVEP::parseCSQToGRanges(x = evcf) vepVcf <- colnames(mcols(csq)) vepVcf ``` \bioccomment{ Although multi-allelic records present in `CollapsedVCF` objects are expanded into bi-allelic records in `ExpandedVCF` objects, names fields in the different slots are identical in the corresponding collapsed and expanded objects. } # Usage of VCF filter rules {#FilterUsage} ## Filter rules using a single field {#SingleFieldFilter} The value of a particular field can be used to define expressions that represent simple filter rules based on that value alone. Multiple rules may be stored in any one `FilterRules` objects. Ideally, *VCF filter rules* should be named to facilitate their use, but also as a reminder of the purpose of each particular rule. For instance, in the chunk of code below, two filter rules are defined using fields of the `fixed` slot: * A rule named `"pass"` identifies variants for which the value in the `FILTER` field is `"PASS"` * A rule named `"qual20"` identifies variants where the value in the `QUAL` field is greater than or equal to `20` ```{r simpleFixedRules} fixedRules <- VcfFixedRules(exprs = list( pass = expression(FILTER == "PASS"), qual20 = expression(QUAL >= 20) )) active(fixedRules)["qual20"] <- FALSE summary(evalSeparately(fixedRules, vcf)) ``` In the example above, all variants pass the active `"pass"` filter, while the deactivated rules `"qual20"`) automatically returns `TRUE` for all variants. \bioccomment{ `active(fixedRules["qual20"]) <- FALSE` would be wrong syntax, as it does not change the active state of the `fixedRules` object; instead, it changes the active state of a temporary object made of the selected filter rule. As a consequence, the `fixedRules` object would be left unchanged after this command. } ## Filter rules using multiple fields {#MultipleFieldsFilter} It is also possible for *VCF filter rules* to use multiple fields (of the same `VCF` slot) in a single expression. In the chunk of code below, the *VCF filter rule* identifies variants for which both the `"REF"` and `"ALT"` values (in the INFO slot) are one of the four nucleotides (*i.e.* a simple definition of single nucleotide polymorphisms; SNPs): ```{r addFixedRule} nucleotides <- c("A", "T", "G", "C") SNPrule <- VcfFixedRules(exprs = list( SNP = expression( as.character(REF) %in% nucleotides & as.character(ALT) %in% nucleotides) )) summary(evalSeparately(SNPrule, evcf, enclos = .GlobalEnv)) ``` Some considerations regarding the above filter rule: * considering that the filter rule requires the `nucleotides` character vector, the global environment must be supplied as the enclosing environment to successfully evaluate the expression * `"REF"` and `"ALT"` are stored as `DNAStringSet` in `CollapsedVCF` objects and must be converted to `character` in order to successfully apply the method `%in%`. ## Calculations in filter rules {#CalculationFilter} Expressions that define filter rules may also include calculations. In the chunk of code below, two simple *VCF filter rules* are defined using fields of the `info` slot: * A rule named `"samples"` identifies variants where at least 90% of samples have data (*i.e.* the `NS` value is greater than or equal to `0.9` times the total number of samples) * A rule named `"avgSuperPopAF"` calculates the average of the allele frequencies calculated in each the five super-populations (available in several INFO fields), and subsequently identifies variants with an average value greater than `0.05`. ```{r simpleInfoRules} infoRules <- VcfInfoRules(exprs = list( samples = expression(NS > (0.9 * ncol(evcf))), avgSuperPopAF = expression( (EAS_AF + EUR_AF + AFR_AF + AMR_AF + SAS_AF) / 5 > 0.05 ) )) summary(evalSeparately(infoRules, evcf, enclos = .GlobalEnv)) ``` \bioccomment{ Here, the global environment must be supplied as the filter rule requires the `evcf` object itself to access its number of columns (not only as the environment in which the rules must be evaluated). } ## Functions in filter rules {#FunctionFilter} It may be more convenient to define filters as `function` objects. For instance, the chunk of code below: * first, defines a function that: + expects the `info` slot of a `VCF` object as input + identifies variants where at least two thirds of the super-populations show an allele frequency greater than 5% * next, defines a *VCF filter rule* using the above function ```{r infoFunctionRule} AFcutoff <- 0.05 popCutoff <- 2/3 filterFUN <- function(envir){ # info(vcf) returns a DataFrame; rowSums below requires a data.frame df <- as.data.frame(envir) # Identify fields storing allele frequency in super-populations popFreqCols <- grep("[[:alpha:]]{3}_AF", colnames(df)) # Count how many super-population have an allele freq above the cutoff popCount <- rowSums(df[,popFreqCols] > AFcutoff) # Convert the cutoff ratio to a integer count popCutOff <- popCutoff * length(popFreqCols) # Identifies variants where enough super-population pass the cutoff testRes <- (popCount > popCutOff) # Return a boolean vector, required by the eval method return(testRes) } funFilter <- VcfInfoRules(exprs = list( commonSuperPops = filterFUN )) summary(evalSeparately(funFilter, evcf)) ``` \bioccomment{ Filter rules written as functions may use values in the the global environment, without the need to supply it as the enclosing environment. } Notably, the `filterFUN` function may also be applied separately to the `info` slot of `VCF` objects: ```{r applyInfoFunction} summary(filterFUN(info(evcf))) ``` ## Pattern matching in filter rules {#PatternFilter} The `grepl` function is particularly suited for the purpose of `FilterRules` as they return a `logical` vector: ```{r greplFilter} missenseFilter <- VcfVepRules( exprs = list( exact = expression(Consequence == "missense_variant"), grepl = expression(grepl("missense", Consequence)) ), vep = "CSQ") summary(evalSeparately(missenseFilter, evcf)) ``` In the above chunk of code: * the filter rule named `"exact"` matches only the given value, associated with `27` variants, * the filter rule named `"grepl"` also matches an extra two variants associated with the value `"missense_variant&splice_region_variant"` matched by the given pattern. By deduction, the two rules indicate together that those two variants were not assigned the `"missense_variant"` prediction. # Using ALT data in the `fixed` slot of `VCF` objects {#AltField} As detailed in an earlier section introducing the [demonstration data](#VcfClasses), and more thoroughly in the documentation of the `r Biocpkg("VariantAnnotation")` package, `CollapsedVCF` and `ExpandedVCF` classes differ in the class of data stored in the `"ALT"` field of their respective `fixed` slot. As as result, *VCF filter rules* using data from this field must take into account the `VCF` class in order to handle the data appropriately: ## `ExpandedVCF` objects {#ALTExpandedVCF} A key aspect of `ExpandedVCF` objects is that the `"ALT"` field of their `fixed` slot may store only a single allele per record as a `DNAStringSet` object. For instance, in an earlier section that demonstrated [Filter rules using multiple raw fields](#MultipleFieldsFilter), ALT data of the `fixed` slot in an `ExpandedVCF` object had to be re-typed from `DNAStringSet` to `character` before the `%in%` function could be applied. Nevertheless, *VCF filter rules* may also make use of methods associated with the `DNAStringSet` class. For instance, genetic insertions may be identified using the fields `"REF"` and `"ALT"` fields of the `fixed` slot: ```{r fixedInsertionFilter} fixedInsertionFilter <- VcfFixedRules(exprs = list( isInsertion = expression( Biostrings::width(ALT) > Biostrings::width(REF) ) )) evcf_fixedIns <- subsetByFilter(evcf, fixedInsertionFilter) as.data.frame(fixed(evcf_fixedIns)[,c("REF", "ALT")]) ``` Here, the above `VcfFixedRules` is synonym to a distinct `VcfVepRules` using the Ensembl VEP prediction `"VARIANT_CLASS"`: ```{r vepInsertionFilter} vepInsertionFilter <- VcfVepRules(exprs = list( isInsertion = expression(VARIANT_CLASS == "insertion") )) evcf_vepIns <- subsetByFilter(evcf, vepInsertionFilter) as.data.frame(fixed(evcf_vepIns)[,c("REF", "ALT")]) ``` ## `CollapsedVCF` objects {#ALTCollapsedVCF} In contrast to `ExpandedVCF`, `CollapsedVCF` may contain more than one allele per record in their `"ALT"` field (`fixed` slot), represented by a `DNAStringSetList` object. As a result, *VCF filter rules* using the `"ALT"` field of the `info` slot in `CollapsedVCF` objects may use methods dedicated to `DNAStringSetList` to handle the data. For instance, multi-allelic variants may be identified by the following `VcfFixedRules`: ```{r fixedMultiallelicFilter} multiallelicFilter <- VcfFixedRules(exprs = list( multiallelic = expression(lengths(ALT) > 1) )) summary(eval(multiallelicFilter, vcf)) ``` # Combination of multiple types of *VCF filter rules* {#VcfFilterRules} Any number of `VcfFixedRules`, `VcfInfoRules`, and `VcfVepRules`---or even `VcfFilterRules` themselves---may be combined into a larger object of class `VcfFilterRules`. Notably, the *active* state of each filter rule is transferred to the combined object. Even though the `VcfFilterRules` class acts as a container for multiple types of *VCF filter rules*, the resulting `VcfFilterRules` object also extends the `FilterRules` class, and as a result can be evaluated and used to subset `VCF` objects identically to any of the specialised more specialised classes. During the creation of `VcfFixedRules` objects, each *VCF filter rule* being combined is marked with a *type* value, indicating the VCF slot in which the filter rule must be evaluated. This information is stored in the new `type` slot of `VcfFixedRules` objects. For instance, it is possible to combine two `VcfFixedRules` (containing two and one filter rules, respectively), one `VcfInfoRules`, and one `VcfVepRules` defined earlier in this vignette: ```{r combineRules} vignetteRules <- VcfFilterRules( fixedRules, SNPrule, infoRules, vepInsertionFilter ) vignetteRules active(vignetteRules) type(vignetteRules) summary(evalSeparately(vignetteRules, evcf, enclos = .GlobalEnv)) ``` Clearly[^1], the *VCF filter rules* `SNP` and `isInsertion` are mutually exclusive, which explains the final `0` variants left after filtering. Conveniently, either of these rules may be deactivated before evaluating the remaining active filter rules: [^1]: This statement below would be more evident if the `summary` method was displaying the result of `evalSeparately` in this vignette as it does it in an R session. ```{r combinedRulesDeactivated} active(vignetteRules)["SNP"] <- FALSE summary(evalSeparately(vignetteRules, evcf, enclos = .GlobalEnv)) ``` As a result, the deactivated filter rule (`"SNP"`) now returns `TRUE` for all variants, leaving a final `2` variants[^2] pass the remaining active filter rules: [^2]: Again, this statement would benefit from the result of `evalSeparately` being displayed identically to an R session. * INFO/FILTER equal to `"PASS"` * INFO/NS greater than 90% of the number of samples in the data set * Average of super-population allele frequencies greater than `0.05` * Ensembl VEP prediction `VARIANT_CLASS` equal to `"insertion"` Finally, the following chunk of code demonstrates how `VcfFilterRules` may also be created from the combination of `VcfFilterRules`, either with themselves or with any of the classes that define more specific *VCF filter rules*. Notably, when `VcfFilterRules` objects are combined, the `type` and `active` value of each filter rule is transferred to the combined object: **Combine `VcfFilterRules` with `VcfVepRules`** ```{r combineFilterVep} combinedFilters <- VcfFilterRules( vignetteRules, # VcfFilterRules missenseFilter # VcfVepRules ) type(vignetteRules) type(combinedFilters) active(vignetteRules) active(missenseFilter) active(combinedFilters) ``` **Combine multiple `VcfFilterRules` with `VcfFilterRules` (and more)** To demonstrate this action, another `VcfFilterRules` must first be created. This can be achieve by simply re-typing a `VcfVepRules` defined earlier: ```{r combineFilterFilter} secondVcfFilter <- VcfFilterRules(missenseFilter) secondVcfFilter ``` It is now possible to combine the two `VcfFilterRules`. Let us even combine another `VcfInfoRules` object in the same step: ```{r combineMultipleRules} manyRules <- VcfFilterRules( vignetteRules, # VcfFilterRules secondVcfFilter, # VcfFilterRules funFilter # VcfInfoRules ) manyRules active(manyRules) type(manyRules) summary(evalSeparately(manyRules, evcf, enclos = .GlobalEnv)) ``` Critically, users must be careful to combine rules **all** compatible with the class of `VCF` object in which it will be evaluated (*i.e.* `CollapsedVCF` or `ExpandedVCF`). # Session info {#SessionInfo} Here is the output of `sessionInfo()` on the system on which this document was compiled: ```{r sessionInfo, echo=FALSE} sessionInfo() ``` # References {#References} [R]: http://r-project.org [RStudio]: http://www.rstudio.com/