--- title: "StructuralVariantAnnotation Quick Overview" author: "Ruining Dong, Daniel Cameron" date: "`r Sys.Date()`" output: html_document: toc: yes toc_float: collapsed: yes smooth_scroll: yes vignette: > %\VignetteIndexEntry{Structural Variant Annotation Package} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(#echo = TRUE, collapse = TRUE, comment = "#>") ``` ## Introduction This vignette outlines a workflow of parsing and plotting structural variants from Variant Call Format (VCF) using the `StructuralVariantAnnotation` package. `StructuralVariantAnnotation` contains useful helper functions for reading and interpreting structural variant calls. The package contains functions for parsing VCFs from a number of popular callers as well as functions for dealing with breakpoints involving two separate genomic loci encoded as `GRanges` objects. ## Installation The *StructuralVariationAnnotation* package can be installed from *Bioconductor* as follows: ```{r installation, eval=FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("StructuralVariantAnnotation") ``` ## Breakpoint notation The [VCF standard](https://samtools.github.io/hts-specs/VCFv4.3.pdf) describes two types of SV notations. One is by SV types, i.e. insertions, deletions, inversions, translocations, etc. The other is by breakend notations, often labelled with `SVTYPE=BND`. To describe a SV with breakend notations, each SV has two positions, each captured by one breakend (except for inversions, which have 4 separate records). Each breakend includes a genomic locus, as well as a half interval extending out to the partner breakend. In VCF BND notations, the `ALT` field encodes directional information of the partner breakend. * 1 200 . N N[5:500[ partner breakend immediately after chr1:200, starting from chr5:500 and extending rightwards * 1 200 . N ]5:500]N partner breakend immediately before chr1:200, extending from the left and ending at chr5:500 * 1 200 . N [5:500[N partner breakend immediately before chr1:200, starting from chr5:500 and extending rightwards * 1 200 . N N]5:500] partner breakend immediately after chr1:200, extending from the left and ending at chr5:500 ## Using GRanges for structural variants: a breakend-centric data structure Unlike breakpoint-centric data structures such as the `Pairs` object that `rtracklayer` uses to load BEDPE files, this package uses a breakend-centric notation. Breakends are stored in a GRanges object with strand used to indicate breakpoint orientation. Consistent with how breakends are encoded in VCF `+` indicates that the breakpoint occurs immediately after the given position, with `-` indicating the breakpoint occurs immediately before the given position. Breakpoints are represented using a `partner` field containing the name of the breakend at the other side of the breakend. Both single breakends and breakpoints are supported but many-to-many breakend partner mappings supported by the VCF `MATEID` field are not: each breakend must have 0 (single breakend) or 1 (breakpoint) partner breakends. This notation was chosen as it simplifies many common operations, and annotations are breakend-level annotations. These include annotation associated with genomic positions (e.g. genes, repeats, mappability), as well as breakend-level attributes of a breakpoint such as variant allele fractions (e.g. a structural variant can be homozygous at one breakend, but heterzygous at the other breakend). ## Workflow ### Loading data from VCF VCF data is parsed into a `VCF` object using the `readVCF` function from the Bioconductor package `VariantAnnotation`. Simple filters could be applied to a `VCF` object to remove unwanted calls. More information about `VCF` objects can be found by consulting the vignettes in the VariantAnnotation package with `browseVignettes("VariantAnnotation")`. StructuralVariantAnnotation supports structural variants reported in the following VCF notations: * Non-symbolic allele * Symbolic allele with `SVTYPE` of `DEL`, `INS`, and `DUP`. * Breakpoint notation `SVTYPE=BND` * Single breakend notation In addition to parsing spec-compliant VCFs, additional logic has been added to enable parsing of non-compliant variants for the following callers: * Pindel (`SVTYPE=RPL`) * manta (`INv3`, `INV5` fields) * Delly (`SVTYPE=TRA`, `CHR2`, `CT` fields) * TIGRA (`SVTYPE=CTX`) Breakpoint ambiguity reported using the spec-defined `CIPOS` is, by default, incorporated into the GRanges breakend intervals. ```{r input, warning=FALSE, message=FALSE} suppressPackageStartupMessages(require(StructuralVariantAnnotation)) suppressPackageStartupMessages(require(VariantAnnotation)) vcf.file <- system.file("extdata", "gridss.vcf", package = "StructuralVariantAnnotation") vcf <- VariantAnnotation::readVcf(vcf.file, "hg19") gr <- breakpointRanges(vcf) ``` `partner()` returns the breakpoint `GRanges` object with the order rearranged such that the partner breakend on the other side of each breakpoint corresponds with the local breakend. ```{r} partner(gr) ``` Single breakends are loaded using the `breakendRanges()` function. The `GRanges` object is of the same form as `breakpointRanges()` but as the breakend partner is not specified, the partner is NA. A single GRanges object can contain both breakend and breakpoint variants. ```{r} colo829_vcf <- VariantAnnotation::readVcf(system.file("extdata", "COLO829T.purple.sv.ann.vcf.gz", package = "StructuralVariantAnnotation")) colo829_bpgr <- breakpointRanges(colo829_vcf) colo829_begr <- breakendRanges(colo829_vcf) colo829_gr <- sort(c(colo829_begr, colo829_bpgr)) colo829_gr[seqnames(colo829_gr) == "6"] ``` ### Ensuring breakpoint consistency Functions such as `findBreakpointOverlaps()` require the `GRanges` object to be composed entirely of valid breakpoints. Subsetting a breakpoint `GRanges` object can result in one side of a breakpoint getting filtered with the remaining orphaned record no longer valid as its partner no longer exists. Such record can be filtered ```{r} colo828_chr6_breakpoints <- colo829_gr[seqnames(colo829_gr) == "6"] # A call to findBreakpointOverlaps(colo828_chr6_breakpoints, colo828_chr6_breakpoints) # will fail as there are a) single breakends, and b) breakpoints with missing partners colo828_chr6_breakpoints <- colo828_chr6_breakpoints[colo828_chr6_breakpoints$partner %in% names(colo828_chr6_breakpoints)] # As expected, each call on chr6 only overlaps with itself countBreakpointOverlaps(colo828_chr6_breakpoints, colo828_chr6_breakpoints) ``` Note that if you did want to include inter-chromosomal breakpoints involving chromosome 6, you would need to update the filtering criteria to include records with chr6 on either side. In such cases, the filtering logic can be simplified by the `selfPartnerSingleBreakends` parameter of `partner()`. When `selfPartnerSingleBreakends=TRUE`, the partner of single breakend events is considered to be the single breakend itself. ```{r} colo828_chr6_breakpoints <- colo829_gr[ seqnames(colo829_gr) == "6" | seqnames(partner(colo829_gr, selfPartnerSingleBreakends=TRUE)) == "6"] # this way we keep the chr3<->chr6 breakpoint and don't create any orphans head(colo828_chr6_breakpoints, 1) ``` ### Breakpoint Overlaps `findBreakpointOverlaps()` and `countBreakpointOverlaps()` are functions for finding and counting overlaps between breakpoint objects. All breakends must have their partner breakend included in the GRanges. A valid overlap requires that breakends on boths sides overlap. To demonstrate the `countBreakpointOverlaps()` function, we use a small subset of data from our structural variant caller benchmarking paper to construct precision recall curves for a pair of callers. ```{r} truth_vcf <- readVcf(system.file("extdata", "na12878_chr22_Sudmunt2015.vcf", package = "StructuralVariantAnnotation")) truth_svgr <- breakpointRanges(truth_vcf) truth_svgr <- truth_svgr[seqnames(truth_svgr) == "chr22"] crest_vcf <- readVcf(system.file("extdata", "na12878_chr22_crest.vcf", package = "StructuralVariantAnnotation")) # Some SV callers don't report QUAL so we need to use a proxy VariantAnnotation::fixed(crest_vcf)$QUAL <- info(crest_vcf)$left_softclipped_read_count + info(crest_vcf)$left_softclipped_read_count crest_svgr <- breakpointRanges(crest_vcf) crest_svgr$caller <- "crest" hydra_vcf <- readVcf(system.file("extdata", "na12878_chr22_hydra.vcf", package = "StructuralVariantAnnotation")) hydra_svgr <- breakpointRanges(hydra_vcf) hydra_svgr$caller <- "hydra" svgr <- c(crest_svgr, hydra_svgr) svgr$truth_matches <- countBreakpointOverlaps(svgr, truth_svgr, # read pair based callers make imprecise calls. # A margin around the call position is required when matching with the truth set maxgap=100, # Since we added a maxgap, we also need to restrict the mismatch between the # size of the events. We don't want to match a 100bp deletion with a # 5bp duplication. This will happen if we have a 100bp margin but don't also # require an approximate size match as well sizemargin=0.25, # We also don't want to match a 20bp deletion with a 20bp deletion 80bp away # by restricting the margin based on the size of the event, we can make sure # that simple events actually do overlap restrictMarginToSizeMultiple=0.5, # HYDRA makes duplicate calls and will sometimes report a variant multiple # times with slightly different bounds. countOnlyBest prevents these being # double-counted as multiple true positives. countOnlyBest=TRUE) ``` Once we know which calls match the truth set, we can generate Precision-Recall and ROC curves for each caller using one of the many ROC R packages, or directly with dplyr. ```{r} suppressPackageStartupMessages(require(dplyr)) suppressPackageStartupMessages(require(ggplot2)) ggplot(as.data.frame(svgr) %>% dplyr::select(QUAL, caller, truth_matches) %>% dplyr::group_by(caller, QUAL) %>% dplyr::summarise( calls=dplyr::n(), tp=sum(truth_matches > 0)) %>% dplyr::group_by(caller) %>% dplyr::arrange(dplyr::desc(QUAL)) %>% dplyr::mutate( cum_tp=cumsum(tp), cum_n=cumsum(calls), cum_fp=cum_n - cum_tp, Precision=cum_tp / cum_n, Recall=cum_tp/length(truth_svgr))) + aes(x=Recall, y=Precision, colour=caller) + geom_point() + geom_line() + labs(title="NA12878 chr22 CREST and HYDRA\nSudmunt 2015 truth set") ``` ### Converting between BEDPE, Pairs, and breakpoint GRanges The package supports converting GRanges objects to BEDPE files. The BEDPE format is defined by [`bedtools`](https://bedtools.readthedocs.io/en/latest/content/general-usage.html). This is achieved using `breakpointgr2pairs` and `pairs2breakpointgr` functions to convert to and from the GRanges `Pairs` notation used by `rtracklayer` ```{r} suppressPackageStartupMessages(require(rtracklayer)) # Export to BEDPE rtracklayer::export(breakpointgr2pairs(gr), con="gridss.bedpe") # Import to BEDPE bedpe.gr <- pairs2breakpointgr(rtracklayer::import("gridss.bedpe")) ``` ## Visualising breakpoint pairs via circos plots One way of visualising paired breakpoints is by circos plots. Here we use the package [`circlize`](https://doi.org/10.1093/bioinformatics/btu393) to demonstrate breakpoint visualisation. The `bedpe2circos` function takes BEDPE-formatted dataframes (see `breakpointgr2bedpe()`) and plotting parameters for the `circos.initializeWithIdeogram()` and `circos.genomicLink()` functions from `circlize`. To generate a simple circos plot of paired breakpoints: ```{r} suppressPackageStartupMessages(require(circlize)) colo829_bpgr_with_chr_prefix <- colo829_bpgr seqlevelsStyle(colo829_bpgr_with_chr_prefix) <- "UCSC" pairs <- breakpointgr2pairs(colo829_bpgr_with_chr_prefix) circos.initializeWithIdeogram() circos.genomicLink(as.data.frame(S4Vectors::first(pairs)), as.data.frame(S4Vectors::second(pairs))) circos.clear() ``` Alternatively, the plotting package `ggbio` provides flexible track functions which bind with `ggplot2` objects. It takes `GRanges` objects as input and supports circos plots. To plot structural variant breakpoints in a circos plot using `ggbio`, we need to first prepare the breakpoint GRanges. The function requires a special column, indicating the end of the link using GRanges format, which we can add to `gr` using [`plyranges`](https://bioconductor.org/packages/release/bioc/html/plyranges.html). ```{r add to.gr} suppressPackageStartupMessages(require(ggbio)) gr.circos <- colo829_bpgr[seqnames(colo829_bpgr) %in% seqlevels(biovizBase::hg19sub)] seqlevels(gr.circos) <- seqlevels(biovizBase::hg19sub) mcols(gr.circos)$to.gr <- granges(partner(gr.circos)) ``` We can then plot the breakpoints against reference genomes. ```{r ggbio} p <- ggbio() + circle(gr.circos, geom="link", linked.to="to.gr") + circle(biovizBase::hg19sub, geom='ideo', fill='gray70') + circle(biovizBase::hg19sub, geom='scale', size=2) + circle(biovizBase::hg19sub, geom='text', aes(label=seqnames), vjust=0, size=3) p ``` ## SessionInfo ```{r} sessionInfo() ```