---
title: "Notes on Bioconductor's GRASP2 database curation"
output:
  BiocStyle::html_document:
    toc: yes
---

<!--
%\VignetteEngine{knitr::rmarkdown}
%\VignetteIndexEntry{Notes on Bioconductor's GRASP2 database curation}
%\VignettePackage{grasp2db}
-->

```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
```

Package: `grasp2db`<br />
Author: Martin Morgan<br />
Modification date: 2014-12-31<br />
Compilation date: `r Sys.Date()`

# Introduction

This document outlines steps taken to create Bioconductor's version of
the GRASP2 data base.  [GRASP][] (Genome-Wide Repository of
Associations Between SNPs and Phenotypes) v2.0 was released in
September 2014.  The Bioconductor AnnotationHub resource is derived
from the [v 2.0.0.0 release][GRASPzip].

[GRASP]: http://apps.nhlbi.nih.gov/Grasp/Overview.aspx
[GRASPzip]: https://s3.amazonaws.com/NHLBI_public/GRASP/GraspFullDataset2.zip

The primary reference for version 2 is: Eicher JD, Landowski C,
Stackhouse B, Sloan A, Chen W, Jensen N, Lien J-P, Leslie R, Johnson
AD (2014) GRASP v 2.0: an update to the genome-wide repository of
associations between SNPs and phenotypes. Nucl Acids Res, published
online Nov 26, 2014. PMID 25428361.

Other vignettes in the grasp2db package contain details of the GRASP2
data base.

# Processing

The script `system.file(package="grasp2db", "scripts",
"grasp2AnnotationHub.R")` processes GRASP2 to the Bioconductor sqlite
representation. The script downloads the [ZIP][GRASPzip] file,
uncompresses the contents to a single tab-delimited text file,
performs some necessary data cleaning, and stores the data in a
partially normalized sqlite data base. The sqlite data base is
distributed using the Bioconductor `r Biocpkg("AnnotationHub")`
package.

Data cleaning and transformation to sqlite are performed by the
`grasp2db:::.db_create()` function. The major steps include

1. Standardizing column names

2. Standardizing some aspects of data representation

3. Writing output to 3 sqlite tables

## Standardizing column names

Column names are standardized using
`grasp2db:::.db_clean_colnames()`. The following columns are renamed:

Original                  | Standardized
------------------------- | ------------
SNPid(dbSNP134)           | SNPid_dbSNP134
chr(hg19)                 | chr_hg19
pos(hg19)                 | pos_hg19
SNPid(in paper)           | SNPidInPaper
InNHGRIcat(as of 3/31/12) | InNHGRIcat_3_31_12
Initial Sample Description| DiscoverySampleDescription
LS SNP                    | LS_SNP

All other column names were transformed to CamelCase by removing
non-alphabetical characters and capitalizing the subsequent letter,
e.g., `Exclusively Male/Female` becomes `ExclusivelyMaleFemale`.
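
The CamelCase rule can be illustrated with a short sketch. The helper
`to_camel_case()` is hypothetical and is not the code used by
`grasp2db:::.db_clean_colnames()`; it assumes that each run of
non-alphabetical characters is dropped and the letter that follows is
upper-cased.

```{r eval=FALSE}
## Hypothetical helper, for illustration only
to_camel_case <- function(x) {
    ## drop each run of non-letters; upper-case the next letter, if any
    gsub("[^A-Za-z]+([A-Za-z]?)", "\\U\\1", x, perl = TRUE)
}

to_camel_case("Exclusively Male/Female")    # "ExclusivelyMaleFemale"
```

The explicitly renamed columns in the table above do not follow this
rule and would need a separate lookup.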

## Data cleaning

Data are standardized by `grasp2db:::.db_clean_chunk()`, which applies
the changes described below.

NHLBIkey is supposed to be a unique integer-valued identifier, but the
GRASP2fullDataset file contains 47 rows with keys `2.36501E+14` or
`2.29412E+14`. These rows have been removed.
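
A minimal sketch of this filter, using toy data and assuming the keys
are read as character strings so that the malformed values remain
visible (the `Phenotype` values here are placeholders):

```{r eval=FALSE}
## Toy data, for illustration only
df <- data.frame(
    NHLBIkey = c("1001", "2.36501E+14", "1002", "2.29412E+14"),
    Phenotype = c("a", "b", "c", "d"),
    stringsAsFactors = FALSE
)

## Drop rows whose key is one of the two malformed values
bad_keys <- c("2.36501E+14", "2.29412E+14")
df <- df[!df$NHLBIkey %in% bad_keys, , drop = FALSE]
```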

Columns `TotalSamples(discovery+replication)`,
`TotalDiscoverySamples`, and `Total replication samples` were removed
(these values are easily calculated if desired).

A column `NegativeLog10PBin` was created to bin variants by order of
magnitude of significance, calculated as `round(-log10(Pvalue))`.
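
For example, with a few made-up P-values:

```{r eval=FALSE}
## Toy P-values, for illustration only
Pvalue <- c(5e-8, 1e-3, 0.04)
NegativeLog10PBin <- round(-log10(Pvalue))
NegativeLog10PBin    # 7 3 1
```

A larger bin value therefore indicates a smaller P-value.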

The `CreationDate` and `LastCurationDate` columns were standardized so
that the dates `8/17/12` and `8/17/2012` are represented consistently
as `8/17/2012`.

The `HUBfield` dates written as `Jan2014` or `14-Jan` were
standardized to `1/1/2014`.
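
The two date clean-ups might look like the following sketch. Both
helpers are hypothetical, and `expand_year()` assumes that every
two-digit year belongs to the 2000s.

```{r eval=FALSE}
## Hypothetical helpers, for illustration only
expand_year <- function(x) {
    ## "8/17/12" becomes "8/17/2012"; four-digit years are unchanged
    sub("/([0-9]{2})$", "/20\\1", x)
}

fix_hubfield <- function(x) {
    ## map the two irregular spellings onto a single date string
    x[x %in% c("Jan2014", "14-Jan")] <- "1/1/2014"
    x
}

expand_year(c("8/17/12", "8/17/2012"))
fix_hubfield(c("Jan2014", "14-Jan", "1/1/2014"))
```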

The `LocationWithinPaper` entries written without a space, e.g.,
`Table12`, `Figure12`, or `FullData`, were replaced with the spaced
equivalent, e.g., `Table 12`.
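
A sketch of this substitution, assuming the label always appears at
the start of the entry; `add_space()` is a hypothetical helper:

```{r eval=FALSE}
## Hypothetical helper, for illustration only
add_space <- function(x) {
    ## "Table12" becomes "Table 12", "Figure3" becomes "Figure 3"
    x <- sub("^(Table|Figure)([0-9])", "\\1 \\2", x)
    ## "FullData" becomes "Full Data"
    sub("^FullData", "Full Data", x)
}

add_space(c("Table12", "Figure 3", "FullData"))
```

Entries that already contain the space are left unchanged.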

In the `dbSNPvalidation` column, `""`, `"NO"`, and `"YES"` were
replaced with logical `NA`, `FALSE`, and `TRUE`.

The `dbSNPClinStatus` column entries were standardized to lower case.
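
The recoding of these two columns can be sketched on toy vectors; the
`dbSNPClinStatus` values shown are made up for illustration.

```{r eval=FALSE}
## Toy values, for illustration only
dbSNPvalidation <- c("", "NO", "YES", "YES")
## "" becomes NA; otherwise TRUE for "YES" and FALSE for "NO"
dbSNPvalidation <- ifelse(dbSNPvalidation == "", NA, dbSNPvalidation == "YES")

dbSNPClinStatus <- c("Pathogenic", "UNKNOWN", "other")
dbSNPClinStatus <- tolower(dbSNPClinStatus)
```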

### Outstanding issues

The `Phenotype` column (and possibly others) contains strings that
are apparently encoded in CP1250, as well as variants differing only
by character case. In R, on platforms supporting the CP1250 encoding,
the affected vectors can be transformed to a portable and canonical
representation using

```{r eval=FALSE}
P = iconv(Phenotype, "CP1250", "UTF-8")    # re-encode as UTF-8
p = tolower(P)                             # case-insensitive key
Phenotype = P[match(p, p)]                 # first-seen spelling of each key
```

## SQLite representation

Data were partially normalized into 3 tables.

`study` contains information on each publication present in the data
base, using `PMID` as a unique key. See
`grasp2db:::.db_accumulate_study()`.

`count` contains the number of samples each variant was found in,
summarized by sample (`Discovery` or `Replication`) and population
(e.g., `European`, `Hispanic`), using `NHLBIkey` as a unique key. See
`grasp2db:::.db_write_count()`.

`variant` contains information about each variant, and in particular
`NHLBIkey` and `PMID` to relate this table to the `study` and `count`
tables. See `grasp2db:::.db_write_variant()`.

Indexes were created on PMID (variant and study tables) and NHLBIkey
(variant and count tables) fields, and on the Phenotype, dbSNPid,
chromosome and position, and NegativeLog10PBin fields (variant table).
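
The layout can be sketched with a toy in-memory data base built with
the DBI and RSQLite packages. The rows, the non-key columns
(`Journal`, `Sample`, `Population`, `N`), and the index names are
hypothetical; the real tables contain many more columns, and this is
not the code used by the package.

```{r eval=FALSE}
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

## Toy rows, for illustration only
study <- data.frame(PMID = c("111", "222"), Journal = c("J1", "J2"))
variant <- data.frame(NHLBIkey = 1:3, PMID = c("111", "111", "222"),
                      Phenotype = c("height", "height", "asthma"))
count <- data.frame(NHLBIkey = 1:3, Sample = "Discovery",
                    Population = c("European", "European", "Hispanic"),
                    N = c(100L, 250L, 80L))

dbWriteTable(con, "study", study)
dbWriteTable(con, "variant", variant)
dbWriteTable(con, "count", count)

## Indexes on the fields used to relate the tables
dbExecute(con, 'CREATE INDEX study_PMID ON study (PMID)')
dbExecute(con, 'CREATE INDEX variant_PMID ON variant (PMID)')
dbExecute(con, 'CREATE INDEX variant_NHLBIkey ON variant (NHLBIkey)')
dbExecute(con, 'CREATE INDEX count_NHLBIkey ON "count" (NHLBIkey)')

## Relate each variant to its publication through PMID
res <- dbGetQuery(con, "
    SELECT v.NHLBIkey, v.Phenotype, s.Journal
    FROM variant v JOIN study s ON v.PMID = s.PMID
    ORDER BY v.NHLBIkey")

dbDisconnect(con)
```

Keeping publication-level fields in `study` avoids repeating them for
every variant reported by the same paper.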

# Use

The database is available for use in this package as

```{r eval=FALSE}
library(grasp2db)
GRASP2()           # dbplyr representation
```

or more directly as

```{r eval=FALSE}
library(AnnotationHub)
db <- AnnotationHub()[["AH21414"]]
```

In both cases, the (large) data base is downloaded to a local cache
(see documentation in the `r Biocpkg("AnnotationHub")` package); this
can take several minutes the first time the data base is used.