--- title: "Integration with R" author: "Xiuwen Zheng (Department of Biostatistics, University of Washington, Seattle)" date: "Aug 31, 2016" output: html_document: theme: spacelab toc: true number_sections: true pdf_document: toc: true toc_depth: 3 bibliography: seqarray_bib.bib vignette: > %\VignetteIndexEntry{R Integration} %\VignetteDepends{gdsfmt} %\VignetteKeywords{whole-genome, sequencing, WGS, SNV} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- . . . The SeqArray package is designed for R programming environment, and enables high-performance computing in the multi-core symmetric multiprocessing and loosely coupled computer cluster framework. The features of SeqArray are extended with other existing R packages for WGS data analyses, and the R codes for demonstration are available in the package vignette [R Integration](http://www.bioconductor.org/packages/release/bioc/vignettes/SeqArray/inst/doc/R_Integration.html). ![](seqarray_workflow.svg) **Figure 1**: SeqArray framework and flowchart. The SeqArray format is built on top of the Genomic Data Structure (GDS) format, and GDS is a generic data container with hierarchical structure for storing multiple array-oriented data sets. A high-level R interface to GDS files is provided in the gdsfmt package with a C++ library, and the SeqArray package offers functionalities specific to sequencing data. At a minimum a SeqArray file contains sample and variant identifiers, position, chromosome, reference and alternate alleles for each variant. Parallel computing environments, like multi-core computer clusters, are enabled with SeqArray. The functionality of SeqArray is extended by SeqVarTools, SNPRelate, GENESIS and other R/Bioconductor packages for WGS analyses. . ```{r echo=FALSE} options(width=110) ``` ```{r} library(SeqArray) # open a SeqArray file in the package (1000 Genomes Phase1, chromosome 22) file <- seqOpen(seqExampleFileName("KG_Phase1")) seqSummary(file) ``` . . . . . . . # SeqArray Functions ## Key R Functions **Table 1**: The key functions in the SeqArray package. | Function | Description | |:-------------|:-------------------------------------------| | seqVCF2GDS | Reformat VCF files. [»](http://zhengxwen.github.io/SeqArray/release/help/seqVCF2GDS.html) | | seqSetFilter | Define a data subset of samples or variants. [»](http://zhengxwen.github.io/SeqArray/release/help/seqSetFilter.html) | | seqGetData | Get data from a SeqArray file with a defined filter. [»](http://zhengxwen.github.io/SeqArray/release/help/seqGetData.html) | | seqApply | Apply a user-defined function over array margins. [»](http://zhengxwen.github.io/SeqArray/release/help/seqApply.html) | | seqParallel | Apply functions in parallel. [»](http://zhengxwen.github.io/SeqArray/release/help/seqParallel.html) | Genotypic data and annotations are stored in an array-oriented manner, providing efficient data access using the R programming language. Table 1 lists five key functions provided in the SeqArray package and many data analyses can be done using just these functions. `seqVCF2GDS()` converts VCF files to SeqArray format. Multiple cores in an SMP architecture within one or more compute nodes in a compute cluster can be used to simultaneously reformat the data. `seqVCF2GDS()` utilizes R's connection interface to read VCF files incrementally, and it is able to import data from http/ftp texts and the standard output of a command-line tool via a pipe. `seqSetFilter()` and `seqGetData()` can be used together to retrieve data for a selected set of samples from a defined genomic region. `GRanges` and `GRangesList` objects defined in the Bioconductor core packages are supported via `seqSetFilter()` [@Gentleman:2004aa; @Lawrence2013]. `seqApply()` applies a user-defined function to array margins of genotypes and annotations. The function that is applied can be defined in R as is typical, or via C/C++ code using the Rcpp package [@eddelbuettel2011rcpp]. `seqParallel()` utilizes the facilities in the parallel package [@Rossini2007; @R2016] to perform calculations on a SeqArray file in parallel. ## Calculating Allele Frequencies We illustrate the SeqArray functions by implementing an example to calculate the frequency of reference allele across all chromosomes. If a genomic region is specified via `seqSetFilter()`, the calculation is performed within the region instead of using all variants. `seqApply()` enables applying a user-defined function to the margin of genotypes, and the R code is shown as follows: ```{r} af <- seqApply(file, "genotype", as.is="double", margin="by.variant", FUN=function(x) { mean(x==0L, na.rm=TRUE) }) head(af) ``` where `file` is a SeqArray file, `as.is` indicates returning a numeric vector, `margin` is specified for applying the function by variant. The variable `x` in the user-defined function is an allele-by-sample integer matrix at a site and `0L` denotes the reference allele where the suffix `L` indicates the number is an integer. The Rcpp package simplifies integration of compiled C++ code with R [@eddelbuettel2011rcpp], and the function can be dynamically defined with inlined C/C++ codes: ```{r} library(Rcpp) cppFunction(" double CalcAlleleFreq(IntegerVector x) { int len=x.size(), n=0, n0=0; for (int i=0; i < len; i++) { int g = x[i]; if (g != NA_INTEGER) { n++; if (g == 0) n0++; } } return double(n0) / n; }") ``` where *IntegerVector* indicates the input variable `x` is an integer vector, `NA_INTEGER` is missing value and the function counts how many zeros and non-missing values for calculating frequency. The name *CalcAlleleFreq* can be passed to `seqApply()` directly: ```{r} af <- seqApply(file, "genotype", as.is="double", margin="by.variant", FUN=CalcAlleleFreq) head(af) ``` The C++ integration is several times faster than the R implementation, suggesting an efficient approach with C/C++ when real-time performance is required. It is able to run the calculation in parallel. The genotypes of a SeqArray file are automatically split into non-overlapping parts according to different variants or samples, and the results from client processes collected internally: ```{r} af <- seqApply(file, "genotype", as.is="double", margin="by.variant", FUN=function(x) { mean(x==0L, na.rm=TRUE) }, parallel=2) head(af) ``` Here `parallel` specifies the number of cores. ## PCA R Implementation Principal Component Analysis (PCA) is a common tool used in exploratory data analysis for high-dimensional data. PCA is often involved with the calculation of covariance matrix, and the following R code implements the calculation proposed in [@Patterson:2006:PLoS-Genet:17194218]. The user-defined function computes the covariance matrix for each variant and adds up to a total matrix `s`. The argument `.progress=TRUE` enables the display of progress information during the calculation. ```R # covariance variable with an initial value s <- 0 seqApply(file, "$dosage", function(x) { p <- 0.5 * mean(x, na.rm=TRUE) # allele frequency g <- (x - 2*p) / sqrt(p*(1-p)) # normalized by allele frequency g[is.na(g)] <- 0 # correct missing values s <<- s + (g %o% g) # update the cov matrix s in the parent environment }, margin="by.variant", .progress=TRUE) # scaled by the number of samples over the trace s <- s * (nrow(s) / sum(diag(s))) # eigen-decomposition eig <- eigen(s) ``` ``` [..................................................] 0%, ETC: --- [==================>...............................] 36%, ETC: 4.1m ... [==================================================] 100%, completed in 14.3m ``` ```{r} # covariance variable with an initial value s <- 0 seqBlockApply(file, "$dosage", function(x) { p <- 0.5 * colMeans(x, na.rm=TRUE) # allele frequencies (a vector) g <- (t(x) - 2*p) / sqrt(p*(1-p)) # normalized by allele frequency g[is.na(g)] <- 0 # correct missing values s <<- s + crossprod(g) # update the cov matrix s in the parent environment }, margin="by.variant", .progress=TRUE) # scaled by the number of samples over the trace s <- s * (nrow(s) / sum(diag(s))) # eigen-decomposition eig <- eigen(s, symmetric=TRUE) ``` `seqParallel()` utilizes the facilities offered by the R parallel package to perform calculations within a cluster or SMP environment, and the genotypes are automatically split into non-overlapping parts. The parallel implementation with R is shown as follows, and the C optimized function is also available in the SNPRelate package. ```R # the datasets are automatically split into four non-overlapping parts genmat <- seqParallel(2, file, FUN = function(f) { s <- 0 # covariance variable with an initial value seqBlockApply(f, "$dosage", function(x) { p <- 0.5 * colMeans(x, na.rm=TRUE) # allele frequencies (a vector) g <- (t(x) - 2*p) / sqrt(p*(1-p)) # normalized by allele frequency g[is.na(g)] <- 0 # correct missing values s <<- s + crossprod(g) # update the cov matrix s in the parent environment }, margin="by.variant") s # output }, .combine = "+", # sum "s" of different processes together split = "by.variant") # scaled by the number of samples over the trace genmat <- genmat * (nrow(genmat) / sum(diag(genmat))) # eigen-decomposition eig <- eigen(genmat, symmetric=TRUE) ``` ```{r fig.width=5, fig.height=5, fig.align='center'} # figure plot(eig$vectors[,1], eig$vectors[,2], xlab="PC 1", ylab="PC 2") ``` More examples can be found: [SeqArray Data Format and Access](./SeqArrayTutorial.html#examples) ## Parallel Implementation The default setting for the analysis functions in the SeqArray package is serial implementation, but users can setup a cluster computing environment manually via `seqParallelSetup()` and distribute the calculations to multiple cores or even more than 100 cluster nodes. ```{r} # use 2 cores for demonstration seqParallelSetup(2) # numbers of distinct alleles per site table(seqNumAllele(file)) # reference allele frequencies summary(seqAlleleFreq(file, ref.allele=0L)) # close the cluster environment seqParallelSetup(FALSE) ``` . . . . . . . # Bioconductor Features ## GRanges and GRangesList In this section, we illustrate how to work with Bioconductor core packages for performing common queries to retrieve data from a SeqArray file. The `GRanges` and `GRangesList` classes manipulate genomic range data and can be used in the function `seqSetFilter()` to define a data subset. For example, the annotation information of each exon, the coding range and transcript ID are stored in the `TxDb.Hsapiens.UCSC.hg19.knownGene` object for the UCSC known gene annotations on hg19. ```R library(TxDb.Hsapiens.UCSC.hg19.knownGene) ``` ```R # get the exons grouped by gene txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene txs <- exonsBy(txdb, "gene") ``` where `exonsBy()` returns a `GRangesList` object for all known genes in the database. ```R seqSetFilter(file, txs) # define an exon filter ``` ``` ## # of selected variants: 1,050 ``` ```R # VCF export with exon variants seqGDS2VCF(file, "exons.vcf.gz") ``` ``` ## Tue Jan 3 15:43:55 2017 ## VCF Export: exons.vcf.gz ## 1,092 samples, 1,050 variants ## INFO Field: ## FORMAT Field: ## [..................................................] 0%, ETC: --- [==================================================] 100%, completed in 0s ## Tue Jan 3 15:43:55 2017 Done. ``` If random-access memory is sufficiently large, users could load all exon variants via `seqGetData(file, "genotype")`; otherwise, data have to be loaded by chunk or a user-defined function is applied over variants by `seqApply()`. ## VariantAnnotation SeqArray can also export data with selected variants and samples as a `VCF` object for use with the VariantAnnotation package [@Obenchain2014]: ```{r message=FALSE} library(VariantAnnotation) ``` ```{r} # select a region [10Mb, 30Mb] on chromosome 22 seqSetFilterChrom(file, 22, from.bp=10000000, to.bp=30000000) vcf <- seqAsVCF(file, chr.prefix="chr") vcf ``` ```R locateVariants(vcf, txdb, CodingVariants()) ``` ``` ## GRanges object with 524 ranges and 9 metadata columns: ## seqnames ranges strand | LOCATION LOCSTART LOCEND QUERYID TXID CDSID ## | ## 1 chr22 [17071862, 17071862] - | coding 1579 1579 128 74436 216505 ## 2 chr22 [17073170, 17073170] - | coding 271 271 129 74436 216505 ## 3 chr22 [17589225, 17589225] + | coding 1116 1116 377 73481 214034 ## 4 chr22 [17601466, 17601466] - | coding 552 552 382 74444 216522 ## 5 chr22 [17629357, 17629357] - | coding 424 424 394 74446 216528 ## ... ... ... ... . ... ... ... ... ... ... ## 520 chr22 [29913278, 29913278] - | coding 1567 1567 7023 74771 217273 ## 521 chr22 [29924156, 29924156] - | coding 977 977 7030 74768 217279 ## 522 chr22 [29924156, 29924156] - | coding 977 977 7030 74769 217279 ## 523 chr22 [29924156, 29924156] - | coding 977 977 7030 74770 217279 ## 524 chr22 [29924156, 29924156] - | coding 977 977 7030 74771 217279 ## GENEID PRECEDEID FOLLOWID ## ## 1 150160 ## 2 150160 ## 3 23765 ## 4 27439 ## 5 27440 ## ... ... ... ... ## 520 8563 ## 521 8563 ## 522 8563 ## 523 8563 ## 524 8563 ## ------- ## seqinfo: 1 sequence from an unspecified genome; no seqlengths ``` ```{r echo=FALSE} unlink("exons.vcf.gz", force=TRUE) ``` . . . . . . . # Integration with SeqVarTools The [SeqVarTools](http://www.bioconductor.org/packages/release/bioc/html/SeqVarTools.html) package is available on Bioconductor, which defines S4 classes and methods for other common operations and analyses on SeqArray datasets. The vignette of SeqVarTools is [http://www.bioconductor.org/packages/release/bioc/vignettes/SeqVarTools/inst/doc/SeqVarTools.pdf](http://www.bioconductor.org/packages/release/bioc/vignettes/SeqVarTools/inst/doc/SeqVarTools.pdf). ## Linear Regression The SeqVarTools package extends SeqArray by providing methods for many tasks common to quality control and analysis of sequence data. Methods include: transition/transversion ratio, heterozygosity and homozygosity rates, singleton counts, Hardy-Weinberg equilibrium, Mendelian error checking, and linear and logistic regression. Additionally, SeqVarTools defines a new class to link the information present in the SeqArray file to additional sample and variant annotation provided by the user, such as sex and phenotype. One could select a subset of samples in a file and run a linear regression on all variants: ```{r message=FALSE} library(Biobase) ``` ```{r eval=FALSE} library(SeqVarTools) ``` ```{r} data(KG_P1_SampData) KG_P1_SampData head(pData(KG_P1_SampData)) # show KG_P1_SampData ``` ```{r eval=FALSE} # link sample data to SeqArray file seqData <- SeqVarData(file, sample.data) # set sample and variant filters female <- sampleData(seqData)$sex == "female" seqSetFilter(seqData, sample.sel=female) # run linear regression res <- regression(seqData, outcome="phenotype", covar="age") head(res) ``` ``` ## variant.id n freq Est SE Wald.Stat Wald.Pval ## 1 1 567 0.6887125 -0.090555715 0.0699074 1.677974724 0.19519378 ## 2 2 567 0.9400353 0.009685877 0.1321824 0.005369459 0.94158602 ## 3 3 567 0.9991182 -0.378945215 1.0238102 0.136997920 0.71128392 ## 4 4 567 1.0000000 NA NA NA NA ## 5 5 567 0.9356261 -0.009732930 0.1281883 0.005764880 0.93947733 ## 6 6 567 0.9982363 -1.379486822 0.7233651 3.636804912 0.05651529 ``` . . . . . . . # Integration with SNPRelate Parallel implementations of relatedness and principal component analysis with SeqArray format are enabled in the package SNPRelate, to detect and adjust for population structure and cryptic relatedness in association studies. The kernel of SNPRelate was optimized with SIMD instructions and multi-thread algorithms, and it was designed for bi-allelic SNP data originally [@Zheng2012]. In order to analyze sequence variant calls, and SNPRelate has been rewritten to take the dosages of reference alleles as an input genotype matrix with distinct integers 0, 1, 2 and NA for SeqArray files. Therefore no format conversion is required for WGS analyses. Principal component analysis is implemented in the SNPRelate function `snpgdsPCA()`, and the exact and new randomized algorithms are both provided [@Patterson:2006:PLoS-Genet:17194218; @Galinsky2016]. The randomized matrix algorithm is designed to reduce the running time for large number of study individuals (e.g., greater than 10,000 samples). Relatedness analyses include PLINK method of moment (MoM), KING robust methods, GCTA genetic relationship matrix (GRM) and individual-perspective beta estimator [@Purcell:2007:Am-J-Hum-Genet:17701901; @Manichaikul2010; @Yang2011; @weir2015snps], and these algorithms are computationally efficient and optimized with SIMD instructions. In addition, fixation index ($F_\text{st}$) has been widely used to measure the genetic difference between populations, and the calculations of moment estimators are available in the SNPRelate package with all variants or a sliding window [@Weir:1984:Evolution; @Weir:2002:Annu-Rev-Genet:12359738; @Weir2005]. ## LD-based Marker Pruning It is suggested to perform marker pruning before running PCA and IBD analyses on WGS variant data, to reduce the influence of linkage disequilibrium and rare variants. ```{r} library(SNPRelate) set.seed(1000) # may try different LD thresholds for sensitivity analysis snpset <- snpgdsLDpruning(file, ld.threshold=0.2, maf=0.01) names(snpset) head(snpset$chr22) # variant.id # get all selected variant id snpset.id <- unlist(snpset) ``` ## Principal Component Analysis ```{r} # Run PCA pca <- snpgdsPCA(file, snp.id=snpset.id, num.thread=2) # variance proportion (%) pc.percent <- pca$varprop*100 head(round(pc.percent, 2)) ``` ```{r fig.width=5, fig.height=5, fig.align='center'} # plot the first 4 eigenvectors with character=20 and size=0.5 plot(pca, eig=1:4, pch=20, cex=0.5) ``` Population information are available: ```{r} pop.code <- factor(seqGetData(file, "sample.annotation/Population")) head(pop.code) popgroup <- list( EastAsia = c("CHB", "JPT", "CHS", "CDX", "KHV", "CHD"), European = c("CEU", "TSI", "GBR", "FIN", "IBS"), African = c("ASW", "ACB", "YRI", "LWK", "GWD", "MSL", "ESN"), SouthAmerica = c("MXL", "PUR", "CLM", "PEL"), India = c("GIH", "PJL", "BEB", "STU", "ITU")) colors <- sapply(levels(pop.code), function(x) { for (i in 1:length(popgroup)) { if (x %in% popgroup[[i]]) return(names(popgroup)[i]) } NA }) colors <- as.factor(colors) legend.text <- sapply(levels(colors), function(x) paste(levels(pop.code)[colors==x], collapse=",")) legend.text ``` ```{r fig.width=5, fig.height=5, fig.align='center'} # make a data.frame tab <- data.frame(sample.id = pca$sample.id, EV1 = pca$eigenvect[,1], # the first eigenvector EV2 = pca$eigenvect[,2], # the second eigenvector Population = pop.code, stringsAsFactors = FALSE) head(tab) # draw plot(pca, pch=20, cex=0.75, main="1KG Phase 1, chromosome 22", col=colors[tab$Population]) legend("topright", legend=legend.text, col=1:length(legend.text), pch=19, cex=0.75) ``` ## Relatedness Analysis For relatedness analysis, Identity-By-Descent (IBD) estimation in [SNPRelate](http://www.bioconductor.org/packages/release/bioc/html/SNPRelate.html) can be done by the method of moments (MoM) [@Purcell:2007:Am-J-Hum-Genet:17701901]. ```{r} # YRI samples sample.id <- seqGetData(file, "sample.id") CEU.id <- sample.id[pop.code == "CEU"] ``` ```{r fig.width=5, fig.height=5, fig.align='center'} # Estimate IBD coefficients ibd <- snpgdsIBDMoM(file, sample.id=CEU.id, snp.id=snpset.id, num.thread=2) # Make a data.frame ibd.coeff <- snpgdsIBDSelection(ibd) head(ibd.coeff) plot(ibd.coeff$k0, ibd.coeff$k1, xlim=c(0,1), ylim=c(0,1), xlab="k0", ylab="k1", main="CEU samples (MoM)") lines(c(0,1), c(1,0), col="red", lty=2) ``` ## Identity-By-State Analysis For $n$ study individuals, `snpgdsIBS()` can be used to create a $n \times n$ matrix of genome-wide average IBS pairwise identities. To perform cluster analysis on the $n \times n$ matrix of genome-wide IBS pairwise distances, and determine the groups by a permutation score: ```{r} set.seed(1000) ibs.hc <- snpgdsHCluster(snpgdsIBS(file, snp.id=snpset.id, num.thread=2)) ``` Here is the population information we have known: ```{r fig.width=10, fig.height=5, fig.align='center'} # Determine groups of individuals by population information rv <- snpgdsCutTree(ibs.hc, samp.group=as.factor(colors[pop.code])) plot(rv$dendrogram, leaflab="none", main="1KG Phase 1, chromosome 22", edgePar=list(col=rgb(0.5,0.5,0.5,0.75), t.col="black")) legend("bottomleft", legend=legend.text, col=1:length(legend.text), pch=19, cex=0.75, ncol=4) ``` ## Fixation Index ($F_\text{st}$) Fixation index (Fst) has been widely used to measure the genetic difference between populations, and the calculations of moment estimators are available in the SNPRelate package with all variants or a sliding window. ```{r fig.width=10, fig.height=5, fig.align='center'} # sliding windows (window size: 500kb) sw <- snpgdsSlidingWindow(file, winsize=500000, shift=100000, FUN="snpgdsFst", as.is="numeric", population=pop.code) plot(sw$chr22.pos/1000, sw$chr22.val, xlab="genome coordinate (kb)", ylab="population-average Fst", main="1KG Phase 1, chromosome 22") abline(h=mean(sw$chr22.val), lty=3, col="red", lwd=2) ``` . . . . . . . # GENESIS The [GENESIS](http://www.bioconductor.org/packages/GENESIS) package offers methodology for estimating and accounting for population and pedigree structure in genetic analyses. The current implementation provides functions to perform PC-AiR and PC-Relate [@Conomos2015; @Conomos2016]. PC-AiR performs PCA on whole-genome genotypes taking into account known or cryptic relatedness in the study sample. PC-Relate uses ancestry representative principal components to estimate measures of recent genetic relatedness. In addition, GENESIS includes support for SeqArray files in mixed model association testing and aggregate tests of rare variants like burden and SKAT tests. . . . . . . . # Resources 1. gdsfmt R package: [http://github.com/zhengxwen/gdsfmt](http://github.com/zhengxwen/gdsfmt), [http://bioconductor.org/packages/gdsfmt](http://bioconductor.org/packages/gdsfmt) 2. SeqArray R package: [http://github.com/zhengxwen/SeqArray](http://github.com/zhengxwen/SeqArray), [http://bioconductor.org/packages/SeqArray](http://bioconductor.org/packages/SeqArray) 3. SeqVarTools R package: [http://bioconductor.org/packages/SeqVarTools](http://bioconductor.org/packages/SeqVarTools) 4. SNPRelate R package: [http://github.com/zhengxwen/SNPRelate](http://github.com/zhengxwen/SNPRelate), [http://bioconductor.org/packages/SNPRelate](http://bioconductor.org/packages/SNPRelate) 5. GENESIS R package: [http://bioconductor.org/packages/GENESIS](http://bioconductor.org/packages/GENESIS) # Session Information ```{r} seqClose(file) ``` ```{r} sessionInfo() ``` # References