---
title: "Scalable Generalized Mixed Models in PheWAS using SAIGEgds"
author: "Xiuwen Zheng (Genomics Research Center, AbbVie, North Chicago)"
date: "Mar, 2020"
output:
    html_document:
        theme: spacelab
        toc: true
        number_sections: true
    pdf_document:
        toc: true
        toc_depth: 3
vignette: >
    %\VignetteIndexEntry{SAIGEgds Tutorial}
    %\VignetteDepends{SAIGEgds}
    %\VignetteKeywords{GWAS, PheWAS, mixed models}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---

.

.

```{r echo=FALSE}
options(width=110)
```

A phenome-wide association study (PheWAS) is known to be a powerful tool in discovery and replication of genetic association studies. The recent development in the UK Biobank resource with deep genomic and phenotyping data has provided unparalleled research opportunities. To reduce the computational complexity and cost of PheWAS, the SAIGE (scalable and accurate implementation of generalized mixed model [1]) method was proposed recently, controlling for case-control imbalance and sample structure in single variant association studies. However, it is still computationally challenging to analyze the associations of thousands of phenotypes with whole-genome variant data, especially for disease diagnoses using the ICD-10 codes of deep phenotyping.

Here we develop a new high-performance statistical package (SAIGEgds) for large-scale PheWAS using mixed models [2]. In this package, the SAIGE method is implemented with optimized C++ codes taking advantage of sparse structure of genotype dosages. SAIGEgds supports efficient genomic data structure (GDS) files [3] including both integer genotypes and numeric imputed dosages. Benchmarks using the UK Biobank White British genotype data (N=430K) with coronary heart disease and simulated cases, show that SAIGEgds is 5 to 6 times faster than the SAIGE R package in the steps of fitting null models and p-value calculations. When used in conjunction with high-performance computing (HPC) clusters and/or cloud resources, SAIGEgds provides an efficient analysis pipeline for biobank-scale PheWAS.

.

# Installation

* Bioconductor repository (available soon):
```R
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("SAIGEgds")
```
The `BiocManager::install()` approach may require that you build from source, i.e. `make` and compilers must be installed on your system -- see the [R FAQ](http://cran.r-project.org/faqs.html) for your operating system; you may also need to install dependencies manually.

.

.

# Examples

```{r message=FALSE}
library(SeqArray)
library(SAIGEgds)

# the genotype file in SeqArray GDS format (1000 Genomes Phase1, chromosome 22)
(geno_fn <- seqExampleFileName("KG_Phase1"))
```

## Preparing SNP data for genetic relationship matrix

```{r}
# open a SeqArray file in the package (1000 Genomes Phase1, chromosome 22)
gds <- seqOpen(geno_fn)
```

The LD pruning is provided by [snpgdsLDpruning()](https://rdrr.io/bioc/SNPRelate/man/snpgdsLDpruning.html) in the pacakge [SNPRelate](http://bioconductor.org/packages/SNPRelate):
```{r}
library(SNPRelate)

set.seed(1000)
snpset <- snpgdsLDpruning(gds)
str(snpset)

snpset.id <- unlist(snpset, use.names=FALSE)  # get the variant IDs of a LD-pruned set
head(snpset.id)
```

Create a genotype file for genetic relationship matrix (GRM) using the LD-pruned SNP set:
```{r}
grm_fn <- "grm_geno.gds"
seqSetFilter(gds, variant.id=snpset.id)

# export to a GDS genotype file without annotation data
seqExport(gds, grm_fn, info.var=character(), fmt.var=character(), samp.var=character())
```

If genotypes are split by chromosomes, `seqMerge()` in the SeqArray package can be used to combine the LD-pruned SNP sets.

```{r}
# close the file
seqClose(gds)
```

.

## Fitting the null model

A simulated phenotype data is used to demonstrate the model fitting:
```{r}
set.seed(1000)
sampid <- seqGetData(grm_fn, "sample.id")  # sample IDs in the genotype file

pheno <- data.frame(
    sample.id = sampid,
    y  = sample(c(0, 1), length(sampid), replace=TRUE, prob=c(0.95, 0.05)),
    x1 = rnorm(length(sampid)),
    x2 = rnorm(length(sampid)),
    stringsAsFactors = FALSE)
head(pheno)

# null model fitting using GRM from grm_fn
grm_fn
glmm <- seqFitNullGLMM_SPA(y ~ x1 + x2, pheno, grm_fn, trait.type="binary", sample.col="sample.id")
```

.

## P-value calculations

```{r}
# genetic variants stored in the file geno_fn
geno_fn
# calculate, using 2 processes
assoc <- seqAssocGLMM_SPA(geno_fn, glmm, mac=10, parallel=2)

head(assoc)

# filtering based on p-value
assoc[assoc$pval < 5e-4, ]
```

The output could be directly saved to an R object file or a GDS file:
```{r}
# save to 'assoc.gds'
seqAssocGLMM_SPA(geno_fn, glmm, mac=10, parallel=2, res.savefn="assoc.gds")
```

Open the output GDS file using the functions in the [gdsfmt](http://www.bioconductor.org/packages/gdsfmt) package:
```{r}
# open the GDS file
(f <- openfn.gds("assoc.gds"))
# get p-values
pval <- read.gdsn(index.gdsn(f, "pval"))
summary(pval)
closefn.gds(f)
```

Load association results using the function `seqSAIGE_LoadPval()` in SAIGEgds:
```{r}
res <- seqSAIGE_LoadPval("assoc.gds")
head(res)
```

.

## QQ plot for p-values

```{r fig.width=4, fig.height=4, fig.align='center'}
n <- nrow(assoc)
# expected p-values
exp.pval <- (1:n) / n
# observed p-values
obs.pval <- sort(assoc$pval)

plot(-log10(exp.pval), -log10(obs.pval), xlab="-log10(expected P)", ylab="-log10(observed P)", cex=2/3)
abline(0, 1, col="red", lty=2)
```

.

.


# Session Information

```{r}
sessionInfo()
```

```{r echo=FALSE}
unlink(c("grm_geno.gds", "assoc.gds"), force=TRUE)
```

.


# References

1. Zhou W, Nielsen JB, Fritsche LG, Dey R, Gabrielsen ME, Wolford BN, LeFaive J, VandeHaar P, Gagliano SA, Gifford A, Bastarache LA, Wei WQ, Denny JC, Lin M, Hveem K, Kang HM, Abecasis GR, Willer CJ, Lee S. Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. *Nat Genet* (2018). Sep;50(9):1335-1341. [DOI: 10.1038/s41588-018-0184-y](https://doi.org/10.1038/s41588-018-0184-y)
2. Zheng X, Davis J.Wade, SAIGEgds -- an efficient statistical tool for large-scale PheWAS with mixed models; (Abstract 1920276). *The Annual Meeting of The American Society of Human Genetics (ASHG)*, Oct 15-19, 2019, Houston, US.
3. Zheng X, Gogarten S, Lawrence M, Stilp A, Conomos M, Weir BS, Laurie C, Levine D. SeqArray -- A storage-efficient high-performance data format for WGS variant calls. *Bioinformatics* (2017). [DOI: 10.1093/bioinformatics/btx145](http://dx.doi.org/10.1093/bioinformatics/btx145).

.


# See also

[SeqArray](http://www.bioconductor.org/packages/SeqArray): Data Management of Large-scale Whole-genome Sequence Variant Calls

[SNPRelate](http://www.bioconductor.org/packages/SNPRelate): Parallel Computing Toolset for Relatedness and Principal Component Analysis of SNP Data

[gds2bgen](https://github.com/zhengxwen/gds2bgen): Format conversion from bgen to gds