%\VignetteEngine{knitr::knitr} %\VignetteIndexEntry{09 Working with Called Variants -- Exercises} \documentclass{article} <>= stopifnot(BiocInstaller::biocVersion() == "2.13") BiocStyle::latex() library(knitr) opts_chunk$set(cache=TRUE, tidy=FALSE) @ <>= suppressPackageStartupMessages({ library(VariantAnnotation) library(AnnotationHub) }) @ \usepackage{theorem} \newtheorem{Ext}{Exercise} \newenvironment{Exercise}{ \renewcommand{\labelenumi}{\alph{enumi}.}\begin{Ext}% }{\end{Ext}} \newenvironment{Solution}{% \noindent\textbf{Solution:}\renewcommand{\labelenumi}{\alph{enumi}.}% }{\bigskip} \title{Working with Called Variants in \Bioconductor{}} \author{Martin Morgan (\url{mtmorgan@fhcrc.org})} \date{4 February 2014} \begin{document} \maketitle \section*{Before \Bioconductor: calling variants!} The following illustrates a variant calling work flow best practice; GATK\footnote{\url{http://www.broadinstitute.org/gatk/guide/best-practices}}. Alternatives exist, e.g., \R{} / \Bioconductor{} \Biocpkg{VariantTools}. \includegraphics[width=\textwidth]{figures/Broad_Variants_Best_Practices_workflow.png} \section{Variant Call Format (VCF) files} The Variant Call Format (VCF) is a complex file format to store information about variants. It is described in a specification\footnote{\url{https://github.com/samtools/hts-specs}}. \scriptsize \begin{verbatim} ##fileformat=VCFv4.2 ##fileDate=20090805 ##source=myImputationProgramV3.1 ##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta ##contig= ##phasing=partial ##INFO= ##INFO= ##INFO= ##INFO= ##INFO= ##INFO= ##FILTER= ##FILTER= ##FORMAT= ##FORMAT= ##FORMAT= ##FORMAT= #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003 20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,. 20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3 20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4 20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2 20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3 \end{verbatim} \normalsize According to the VCF specifcation, the REF and ALT columns from the example shows (in order): (a) a good simple SNP; (b) a possible SNP that has been filtered out because its quality is below 10; (c) a site at which two alternate alleles are called, with one of them (T) being ancestral (possibly a reference sequencing error); (d) a site that is called monomorphic reference (i.e. with no alternate alleles); and (e) a microsatellite with two alternative alleles, one a deletion of 2 bases (TC), and the other an insertion of one base (T). Evidence on which the variants have been called are in the INFO column. Genotype data are given for three samples. Two of the samples are phased (`\texttt{|}') and the third unphased (`\texttt{\/}`), with per sample genotype quality, depth and haplotype qualities (the latter only for the phased samples) given as well as the genotypes. The microsatellite calls are unphased. Alternatives formats exist, including MAF\footnote{\url{https://wiki.nci.nih.gov/display/TCGA/Mutation+Annotation+Format}} (mutation annotation format). \section{Working with variants} The vignette \href{http://bioconductor.org/packages/release/bioc/vignettes/VariantAnnotation/inst/doc/VariantAnnotation.pdf}{Introduction to \Biocpkg{VariantAnnotation}} provides a great introduction to working with variants in \R. \begin{Exercise} Explore section 2.1 of the vignette, to become comfortable with basic input and exploration of the data. \end{Exercise} \begin{Exercise} Explore section 2.2 of the vignette to become familiar with selective input of the file. \end{Exercise} \begin{Exercise} Explore the \Rfunction{readGeno} and \Rfunction{readInfo} functions for rapidly getting simple representations of specific VCF information. \end{Exercise} \section{Looking for SNPs in regulatory regions} This section walks through a brief exercise illustrating how variant calls from one public data resource (the Human Genome Project) can be combined with regulatory information from a second public resource (the ENCODE) project to arrive at novel insights. It is based on section 5.1 of the \href{http://bioconductor.org/packages/release/bioc/vignettes/VariantAnnotation/inst/doc/filterVcf.pdf}{\Rfunction{filterVcf}: Extract Variants of Interest from a Large VCF File} vignette in the \Biocpkg{VariantAnnotation} package. \begin{Exercise} This exercise does some preparation of a subset of a VCF file. \begin{itemize} \item Read the chr7 subset of data available in the \Biocpkg{VariantAnnotation} package. \item Filter the data to identify simple SNPs -- REF alleles of a single character; ALT alleles with only one allele, and with the allele a single character. \item What other filters might be appropriate? See section 4 of the \Rfunction{filterVcf} vignette for additional ideas. \end{itemize} \end{Exercise} \begin{Solution} Start with data from a a subset of chromosome 7 <>= library(VariantAnnotation) fl <- system.file("extdata", "chr7-sub.vcf.gz", package="VariantAnnotation") vcf <- readVcf(fl, "hg19") @ %% Select just those variants that appear to be simple SNPs, with a single REF and ALT allele and a all ALT alleles a single character. <>= isSimpleRef <- nchar(ref(vcf)) == 1 isSimpleAlt <- ## 'sapply': apply a function to each ALT record ## 'length(elt) == 1: the record has only one allele... ## 'nchar(elt) == 1: ... and the allele is a single character sapply(alt(vcf), function(elt) (length(elt) == 1) && nchar(elt) == 1) keep <- isSimpleRef & isSimpleAlt vcf <- vcf[keep] nrow(vcf) @ \end{Solution} \begin{Exercise} Follow section 5.1 of the \Rfunction{filterVcf} vignette to load CTCF transcription factor binding regions identified in MCF-7 Breast Cancer Cell Lines as \Rclass{GRanges} objects. \end{Exercise} \begin{Exercise} Follow section 5.2 of the \Rfunction{filterVcf} vignette to find SNPs in the CTCF binding region. \end{Exercise} \end{document}