%\VignetteIndexEntry{Introduction to SeqVarTools} %\VignetteDepends{SeqVarTools} \documentclass[11pt]{article} <>= BiocStyle::latex() @ \begin{document} \title{Introduction to SeqVarTools} \author{Stephanie M. Gogarten} \maketitle \tableofcontents \section{Introduction} \Biocpkg{SeqArray} provides an alternative to the Variant Call Format (VCF) for storage of variants called from sequencing data, enabling efficient storage, fast access to subsets of the data, and rapid computation. \Biocpkg{SeqVarTools} provides an interface to the \Biocpkg{SeqArray} storage format with tools for many common tasks in variant analysis. It is highly recommended to read the vignette in \Biocpkg{SeqArray} in addition to this document to understand the data structure and full array of features. \section{Converting a Variant Call Format (VCF) file} To work with \Biocpkg{SeqVarTools}, we must first convert a VCF file into the \Biocpkg{SeqArray} GDS format. All information in the VCF file is preserved in the resulting GDS file. <<>>= library(SeqVarTools) vcffile <- seqExampleFileName("vcf") gdsfile <- "tmp.gds" seqVCF2GDS(vcffile, gdsfile, verbose=FALSE) gds <- seqOpen(gdsfile) gds @ We can look at some basic information in this file, such as the reference and alternate alleles. <<>>= head(refChar(gds)) head(altChar(gds)) @ How many alleles are there for each variant? <<>>= table(nAlleles(gds)) @ Two variants have 3 alleles (1 REF and 2 ALT). We can extract the second alterate allele for these variants by using the argument \Rcode{n=2} to \Rfunction{altChar}. <<>>= multi.allelic <- which(nAlleles(gds) > 2) altChar(gds)[multi.allelic] altChar(gds, n=1)[multi.allelic] altChar(gds, n=2)[multi.allelic] @ These two sites have three alleles, two are each single nucleotides and the third is a dinucleotide, representing an indel. <<>>= table(isSNV(gds)) isSNV(gds)[multi.allelic] @ Chromosome and position can be accessed as vectors or as a \Rclass{GRanges} object. <<>>= head(seqGetData(gds, "chromosome")) head(seqGetData(gds, "position")) granges(gds) @ We can also find the sample and variant IDs. <<>>= head(seqGetData(gds, "sample.id")) head(seqGetData(gds, "variant.id")) @ The variant IDs are sequential integers created by \Rfunction{seqVCF2GDS}. We may wish to rename them to something more useful. Note the ``annotation/'' prefix required to retrive the ``id'' variable. We need to confirm that the new IDs are unique (which is not always the case for the ``annotation/id'' field). <<>>= rsID <- seqGetData(gds, "annotation/id") head(rsID) length(unique(rsID)) == length(rsID) @ Renaming the variant IDs requires modifying the GDS file, so we have to close it first. <<>>= seqClose(gds) setVariantID(gdsfile, rsID) gds <- seqOpen(gdsfile) head(seqGetData(gds, "variant.id")) @ Note that using character strings for variant.id instead of integers may decrease performance for large datasets. \Rfunction{getGenotype} transforms the genotypes from the internal storage format to VCF-like character strings. <<>>= geno <- getGenotype(gds) dim(geno) geno[1:10, 1:5] @ \Rfunction{getGenotypeAlleles} returns the nucleotides instead of integers. <<>>= geno <- getGenotypeAlleles(gds) geno[1:10, 1:5] @ \section{Applying methods to subsets of data} If a dataset is large, we may want to work with subsets of the data at one time. We can use \Rfunction{applyMethod} to select a subset of variants and/or samples. \Rfunction{applyMethod} is essentially a wrapper for \Rfunction{seqSetFilter} that enables us to apply a method or function to a data subset in one line. If it is desired to use the same filter multiple times, it may be more efficient to set the filter once instead of using \Rfunction{applyMethod}. <<>>= samp.id <- seqGetData(gds, "sample.id")[1:10] var.id <- seqGetData(gds, "variant.id")[1:5] applyMethod(gds, getGenotype, variant=var.id, sample=samp.id) @ As an alternative to specifying variant ids, we can use a GRanges object to select a range on chromosome 22. <<>>= library(GenomicRanges) gr <- GRanges(seqnames="22", IRanges(start=1, end=250000000)) geno <- applyMethod(gds, getGenotype, variant=gr) dim(geno) @ \section{Examples} \subsection{Transition/transversion ratio} The transition/transversion ratio (TiTv) is frequently used as a quality metric. We can calculate TiTv over the entire dataset or by sample. <<>>= titv(gds) head(titv(gds, by.sample=TRUE)) @ Alternatively, we can plot TiTv binned by various metrics (allele frequency, missing rate, depth) to assess variant quality. We need the ids of the variants that fall in each bin. <<>>= binVar <- function(var, names, breaks) { names(var) <- names var <- sort(var) mids <- breaks[1:length(breaks)-1] + (breaks[2:length(breaks)] - breaks[1:length(breaks)-1])/2 bins <- cut(var, breaks, labels=mids, right=FALSE) split(names(var), bins) } @ <>= variant.id <- seqGetData(gds, "variant.id") afreq <- alleleFrequency(gds) maf <- pmin(afreq, 1-afreq) maf.bins <- binVar(maf, variant.id, seq(0,0.5,0.02)) nbins <- length(maf.bins) titv.maf <- rep(NA, nbins) for (i in 1:nbins) { capture.output(titv.maf[i] <- applyMethod(gds, titv, variant=maf.bins[[i]])) } plot(as.numeric(names(maf.bins)), titv.maf, xlab="MAF", ylab="TiTv") @ <>= miss <- missingGenotypeRate(gds) miss.bins <- binVar(miss, variant.id, c(seq(0,0.5,0.05),1)) nbins <- length(miss.bins) titv.miss <- rep(NA, nbins) for (i in 1:nbins) { capture.output(titv.miss[i] <- applyMethod(gds, titv, variant=miss.bins[[i]])) } plot(as.numeric(names(miss.bins)), titv.miss, xlab="missing rate", ylab="TiTv") @ <>= depth <- seqApply(gds, "annotation/format/DP", mean, margin="by.variant", as.is="double", na.rm=TRUE) depth.bins <- binVar(depth, variant.id, seq(0,200,20)) nbins <- length(depth.bins) titv.depth <- rep(NA, nbins) for (i in 1:nbins) { capture.output(titv.depth[i] <- applyMethod(gds, titv, variant=depth.bins[[i]])) } plot(as.numeric(names(depth.bins)), titv.depth, xlab="mean depth", ylab="TiTv") @ \subsection{Heterozygosity} We will calculate the ratio of heterozygotes to non-reference homozygotes by sample. First, we filter the data to exclude any variants with missing rate $<0.1$ or heterozygosity$>0.6$\%. <<>>= miss.var <- missingGenotypeRate(gds, margin="by.variant") het.var <- heterozygosity(gds, margin="by.variant") filt <- seqGetData(gds, "variant.id")[miss.var <= 0.1 & het.var <= 0.6] @ We calculate the heterozygosity and homozyogity by sample, using only the variants selected above. <>= seqSetFilter(gds, variant.id=filt) hethom <- hethom(gds) hist(hethom, main="", xlab="Het/Hom Non-Ref") seqSetFilter(gds) @ \subsection{Principal Component Analysis} We can do Principal Component Analysis (PCA) to separate subjects by ancestry. All the samples in the example file are CEU, so we expect to see only one cluster. <>= pc <- pca(gds) names(pc) plot(pc$eigenvect[,1], pc$eigenvect[,2]) @ See also the packages \Biocpkg{SNPRelate} and \Biocpkg{GENESIS} for determining relatedness and population structure. \subsection{Hardy-Weinberg Equilibrium} We can test for deviations from Hardy-Weinberg Equilibrium (HWE), which can reveal variants of low quality. A test with permuted genotypes gives expected values under the null hypothesis of HWE. <>= hw <- hwe(gds) pval <- -log10(sort(hw$p)) hw.perm <- hwe(gds, permute=TRUE) x <- -log10(sort(hw.perm$p)) plot(x, pval, xlab="-log10(expected P)", ylab="-log10(observed P)") abline(0,1,col="red") @ The inbreeding coefficient can also be used as a quality metric. For variants, this is 1 - observed heterozygosity / expected heterozygosity. <>= hist(hw$f) @ We can also calculate the inbreeding coefficient by sample. <<>>= ic <- inbreedCoeff(gds, margin="by.sample") range(ic) @ \subsection{Mendelian errors} Checking for Mendelian errors is another way to assess variant quality. The example data contains a trio (child, mother, and father). <<>>= data(pedigree) pedigree[pedigree$family == 1463,] err <- mendelErr(gds, pedigree, verbose=FALSE) table(err$by.variant) err$by.trio @ The example data do not have any Mendelian errors. \subsection{Association tests} We can run single-variant association tests for continuous or binary traits. We use a \Rclass{SeqVarData} object to combine the genotypes with sample annotation. We use the pedigree data from the previous section and add simulated phenotypes. <>= library(Biobase) sample.id <- seqGetData(gds, "sample.id") pedigree <- pedigree[match(sample.id, pedigree$sample.id),] n <- length(sample.id) pedigree$phenotype <- rnorm(n, mean=10) pedigree$case.status <- rbinom(n, 1, 0.3) sample.data <- AnnotatedDataFrame(pedigree) seqData <- SeqVarData(gds, sample.data) ## continuous phenotype assoc <- regression(seqData, outcome="phenotype", covar="sex", model.type="linear") head(assoc) pval <- -log10(sort(assoc$Wald.Pval)) n <- length(pval) x <- -log10((1:n)/n) plot(x, pval, xlab="-log10(expected P)", ylab="-log10(observed P)") abline(0, 1, col = "red") @ For binary phenotypes, there are two options, "logistic" and "firth". "logistic" uses \Rfunction{glm} and performs a Wald test. "firth" uses \Rfunction{logistf}. We recommend using the Firth test for rare variants \cite{wang2014} <>= assoc <- regression(seqData, outcome="case.status", covar="sex", model.type="firth") head(assoc) pval <- -log10(sort(assoc$PPL.Pval)) n <- length(pval) x <- -log10((1:n)/n) plot(x, pval, xlab="-log10(expected P)", ylab="-log10(observed P)") abline(0, 1, col = "red") @ For aggregate tests, see the package \Biocpkg{GENESIS}. <<>>= seqClose(gds) @ <>= unlink(gdsfile) @ \bibliography{SeqVarTools} \end{document}