---
title: "Tutorial for `swfdr` package"
author: "Jeff Leek, Simina Boca"
date: "`r BiocStyle::doc_date()`"
bibliography: myRefs.bib
output: BiocStyle::pdf_document
vignette: >
  %\VignetteIndexEntry{Tutorial for swfdr package}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

## Package overview

This package allows users to estimate the science-wise false discovery rate from @JagerEtAl2013, using an EM approach due to the presence of rounding and censoring. It also allows users to estimate the proportion of true null hypotheses in the presence of covariates, using a regression framework, as per @BocaEtAl2015.

The package is loaded using:
```{r}
library(swfdr)
```

## Estimating the science-wise false discovery rate

The science-wise false discovery rate (swfdr) is defined in @JagerEtAl2013 as the rate at which published research results are false positives. It is based on p-values reported in biomedical journals and assumes that, among the p-values below 0.05, those corresponding to false positives are distributed as U$(0, 0.05)$, while those corresponding to true positives are distributed as tBeta$(\alpha, \beta; 0.05)$, where $\alpha$ and $\beta$ are unknown and tBeta is a Beta distribution truncated at $0.05$. The estimation of the swfdr is complicated by truncation (e.g. reporting ``p < 0.01``) and rounding (e.g. p-values are often rounded to two significant digits). An EM algorithm is used to estimate $\alpha$ and $\beta$, as well as the swfdr.
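
To make the mixture concrete, the following minimal sketch (not part of the package) simulates p-values below $0.05$ from the two-component mixture described above, assuming a hypothetical $\pi_0 = 0.5$ and illustrative Beta parameters $\alpha = 1$, $\beta = 50$ (the default starting values of `calculateSwfdr` below):
```{r}
## Sketch only: simulate from the swfdr mixture model, assuming
## hypothetical values pi0 = 0.5, alpha = 1, beta = 50
set.seed(1)
m <- 1000
## false positives: uniform on (0, 0.05)
null_p <- runif(m / 2, min = 0, max = 0.05)
## true positives: Beta(1, 50) truncated at 0.05, drawn by inverse-CDF sampling
u <- runif(m / 2, min = 0, max = pbeta(0.05, 1, 50))
alt_p <- qbeta(u, 1, 50)
sim_p <- c(null_p, alt_p)
```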

### Example: Estimate the swfdr based on p-values from biomedical journals

We include a dataset containing 15,653 p-values from articles in 5 biomedical journals (American Journal of Epidemiology, BMJ, JAMA, Lancet, New England Journal of Medicine), over 11 years (2000-2010).
It was obtained by web scraping, using the code at https://github.com/jtleek/swfdr/blob/master/getPvalues.R, and is already loaded in the package.
```{r}
colnames(journals_pVals)
```

A description of the variables in `journals_pVals` can be found on its help page. In particular, `pvalue` gives the p-value, `pvalueTruncated` is a flag for whether it is truncated, `year` is the year of publication, and `journal` the journal.
```{r}
table(journals_pVals$year)
table(journals_pVals$journal)
```

### The `calculateSwfdr` function
    
This function estimates the swfdr. It takes the following inputs:

* `pValues` Numerical vector of p-values
* `truncated` Vector of 0s and 1s with indices corresponding to those in `pValues`; 1 indicates that the p-value is truncated, 0 that it is not truncated
* `rounded` Vector of 0s and 1s with indices corresponding to those in `pValues`; 1 indicates that the p-value is rounded, 0 that it is not rounded
* `pi0` Initial prior probability that a hypothesis is null (default is 0.5)
* `alpha` Initial value of parameter alpha from Beta(alpha, beta) true positive distribution (default is 1)
* `beta` Initial value of parameter beta from Beta(alpha, beta) true positive distribution (default is 50)
* `numEmIterations` The number of EM iterations (default is 100)
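
As a quick sanity check, the simulated p-values from the earlier sketch can be passed directly to `calculateSwfdr`: none of them are truncated or rounded, so both indicator vectors are all 0, and since the simulation used $\pi_0 = 0.5$, the returned `pi0` should be close to $0.5$. This chunk is a sketch and is not evaluated here, to keep the vignette build fast:
```{r, eval=FALSE}
## Sketch only (not evaluated): all simulated p-values are exact
resSim <- calculateSwfdr(pValues = sim_p,
                         truncated = rep(0, length(sim_p)),
                         rounded = rep(0, length(sim_p)),
                         numEmIterations = 100)
resSim$pi0
```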

Given that it runs an EM algorithm, it is somewhat computationally intensive. We show an example of applying it to the p-values from the abstracts of articles published in the American Journal of Epidemiology in 2005. First, we subset `journals_pVals`, keeping only the p-values below $0.05$, as in @JagerEtAl2013:
```{r}
journals_pVals1 <- dplyr::filter(journals_pVals,
                                 year == 2005,
                                 journal == "American Journal of Epidemiology",
                                 pvalue < 0.05)

dim(journals_pVals1)
```

Next, we define vectors corresponding to the truncation status and the rounding status (a non-truncated p-value is considered rounded if it equals its value rounded to two decimal places) and use these vectors, along with the vector of p-values and the number of EM iterations, as inputs to the `calculateSwfdr` function:
```{r}
# truncation indicators (1 = truncated, 0 = exact)
tt <- journals_pVals1$pvalueTruncated
# rounding indicators: a non-truncated p-value is flagged as rounded
# if it equals its value rounded to two decimal places
rr <- rep(0, length(tt))
rr[tt == 0] <- (journals_pVals1$pvalue[tt == 0] ==
                  round(journals_pVals1$pvalue[tt == 0], 2))
pVals <- journals_pVals1$pvalue
resSwfdr <- calculateSwfdr(pValues = pVals,
                           truncated = tt,
                           rounded = rr, numEmIterations = 100)
names(resSwfdr)
```

The following values are returned:

* `pi0` Final value, estimated via EM, of the prior probability that a hypothesis is null, i.e., the estimated swfdr
* `alpha` Final value, estimated via EM, of the parameter alpha of the Beta(alpha, beta) true positive distribution
* `beta` Final value, estimated via EM, of the parameter beta of the Beta(alpha, beta) true positive distribution
* `z` Expected values, estimated via EM, of the indicators that the non-rounded p-values come from the null distribution (values of NA represent the rounded p-values)
* `n0` Expected numbers, estimated via EM, of rounded null p-values between the cutpoints 0.005, 0.015, 0.025, 0.035, 0.045, 0.05
* `n` Numbers of rounded p-values between the cutpoints 0.005, 0.015, 0.025, 0.035, 0.045, 0.05
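
Since `n0` and `n` refer to the same cutpoints, their ratio gives the expected fraction of nulls among the rounded p-values:
```{r}
## expected fraction of null p-values among the rounded ones
sum(resSwfdr$n0) / sum(resSwfdr$n)
```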

### Results from example dataset

For the example dataset we considered, the results are as follows:
```{r}
resSwfdr
```

Thus, the estimated swfdr for papers published in the American Journal of Epidemiology in 2005 is `r round(resSwfdr$pi0,3)`, i.e., `r round(resSwfdr$pi0 * 100,1)`\% of the discoveries - defined as associations with p < 0.05 - are expected to be false discoveries.

## Estimating the proportion of true null hypotheses in the presence of covariates

As in @BocaEtAl2015, we denote by $\pi_0(x)$ the proportion of true null hypotheses as a function of a covariate $x$. This is estimated based on a list of p-values $p_1,\ldots,p_m$ corresponding to a set
of null hypotheses, $H_{01},\ldots,H_{0m}$, and a design matrix $X$.
The design matrix encodes relevant meta-data, which may be valuable for improving estimation of the
prior probability that a hypothesis is true or false.
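
In brief, the approach relies on the fact that a p-value from a true null hypothesis exceeds a threshold $\lambda$ with probability $1 - \lambda$. This motivates the estimator (a sketch of the key identity behind the method of @BocaEtAl2015):
$$\hat{\pi}_0(x_i) = \hat{E}\left[\left.\frac{\mathbf{1}\{p_i > \lambda\}}{1 - \lambda}\right| X_i = x_i\right],$$
where the conditional expectation is estimated by regressing the scaled indicators on the covariates, and the resulting estimates are smoothed over a grid of $\lambda$ values. Without covariates, this reduces to Storey's global estimator $\hat{\pi}_0 = \#\{p_i > \lambda\}/\{m(1-\lambda)\}$.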

### Example: Adjust for sample size and allele frequency in BMI GWAS meta-analysis

We consider an example from the meta-analysis of data from a genome-wide association study (GWAS) for
body mass index (BMI) from @LockeEtAl2015. A subset of this data, corresponding to 50,000 
single nucleotide polymorphisms (SNPs) is already loaded with the package.
```{r}
head(BMI_GIANT_GWAS_sample)
dim(BMI_GIANT_GWAS_sample)
```
A description of the variables in `BMI_GIANT_GWAS_sample` can be found on its help page. In particular, `p` gives the p-values for the association between the SNPs and BMI; `N` gives the total sample size considered in the study of a particular SNP; and `Freq_MAF_Hapmap` gives the frequency of the minor (less frequent) allele for a particular SNP in HapMap. The column `Freq_MAF_Int_Hapmap` provides 3 approximately equal intervals for the HapMap MAFs:
```{r}
table(BMI_GIANT_GWAS_sample$Freq_MAF_Int_Hapmap)
```

### The `lm_pi0` function
    
This function estimates $\pi_0(x)$. It takes the following inputs:

* `pValues` Numerical vector of p-values
* `lambda` Numerical vector of thresholds in $[0,1)$ at which $\pi_0(x)$ is estimated. Default thresholds are $(0.05, 0.10, \ldots, 0.95)$.
* `X` Design matrix (one test per row, one variable per column). Do not include the intercept.
* `type` Type of regression, "logistic" or "linear." Default is logistic.
* `smooth.df` Number of degrees of freedom when estimating $\pi_0(x)$ by smoothing. Default is $3$.
* `threshold` If `TRUE` (default), all estimates are thresholded to lie in $[0, 1]$; if `FALSE`, none of them are.

To apply it to the BMI GWAS dataset, we first create the design matrix, using natural cubic splines with 5 degrees of freedom to model `N` and 3 discrete categories for the MAFs:
```{r}
X <- model.matrix(~ splines::ns(N, 5) + Freq_MAF_Int_Hapmap,
                  data = BMI_GIANT_GWAS_sample)[, -1]
head(X)
```

We then run the `lm_pi0` function:
```{r}
pi0x <- lm_pi0(pValues=BMI_GIANT_GWAS_sample$p, 
               X=X, smooth.df=3)
names(pi0x)
```

The following values are returned:

* `pi0` Numerical vector of smoothed estimates of $\pi_0(x)$. The length is the number of rows in $X$.
* `pi0.lambda` Numerical matrix of estimated $\pi_0(x)$ for each value of `lambda`. The number of rows is the number of tests, the number of columns is the length of `lambda`.
* `lambda` Vector of the values of `lambda` used in calculating `pi0.lambda`.
* `pi0.smooth` Matrix of fitted values from the smoother fit to the $\pi_0(x)$ estimates at each value of lambda (same number of rows and columns as `pi0.lambda`).
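
As a quick check that the outputs have the dimensions described above:
```{r}
length(pi0x$pi0)      ## one smoothed estimate per test (row of X)
dim(pi0x$pi0.lambda)  ## number of tests x number of lambda values
head(pi0x$lambda)
```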

### Results from BMI GWAS meta-analysis example

We first add the estimates of $\pi_0(x)$ for $\lambda=0.8$, $\lambda=0.9$, and the final smoothed value
to the `BMI_GIANT_GWAS_sample` object:
```{r}
BMI_GIANT_GWAS_sample$fitted0.8 <- pi0x$pi0.lambda[,round(pi0x$lambda,2)==0.8]
BMI_GIANT_GWAS_sample$fitted0.9 <- pi0x$pi0.lambda[,round(pi0x$lambda,2)==0.9]
BMI_GIANT_GWAS_sample$fitted.final.smooth <- pi0x$pi0
```
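
A quick numerical comparison of the three sets of estimates:
```{r}
summary(BMI_GIANT_GWAS_sample[, c("fitted0.8", "fitted0.9",
                                  "fitted.final.smooth")])
```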

We next create a long data frame so that we can use the plotting tools in `ggplot2`:
```{r}
ldf <- reshape2::melt(BMI_GIANT_GWAS_sample,
                      id.vars=colnames(BMI_GIANT_GWAS_sample)[-grep("fitted",
                                                                    colnames(BMI_GIANT_GWAS_sample))],
                      value.name = "pi0",variable.name = "lambda")
ldf$lambda <- as.character(ldf$lambda)
ldf$lambda[ldf$lambda=="fitted0.8"] <- "lambda=0.8"
ldf$lambda[ldf$lambda=="fitted0.9"] <- "lambda=0.9"
ldf$lambda[ldf$lambda=="fitted.final.smooth"] <- "final smoothed pi0(x)"

head(ldf)
```

The estimates of $\pi_0(x)$ can thus be plotted against the sample size $N$, stratified by the MAF categories:
```{r, BMI_GWAS_plot}
library(ggplot2)
ggplot(ldf, aes(x = N, y = pi0)) +
  geom_line(aes(col = Freq_MAF_Int_Hapmap, linetype = lambda)) +
  ylab("Estimated proportion of nulls") +
  guides(color = guide_legend(title = "MAF in HapMap CEU population"),
         linetype = guide_legend(title = "Estimate"))
```

## References