---
title: "GDSArray: Representing GDS files as array-like objects"
author:
- name: Qian Liu
  affiliation: Roswell Park Comprehensive Cancer Center, Buffalo, NY
- name: Hervé Pagès
  affiliation: Fred Hutchinson Cancer Research Center, Seattle, WA
- name: Martin Morgan
  affiliation: Roswell Park Comprehensive Cancer Center, Buffalo, NY
date: "last edit: 3/16/2018"
output:
    BiocStyle::html_document:
        toc: true
        toc_float: true
package: GDSArray
vignette: |
    %\VignetteIndexEntry{GDSArray: Representing GDS files as array-like objects}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---

```{r options, eval=TRUE, echo=FALSE}
options(showHeadLines=3)
options(showTailLines=3)
```

# Introduction

[GDSArray][] is a _Bioconductor_ package that represents GDS files as
objects derived from the [DelayedArray][] package and `DelayedArray`
class. It converts a GDS node in the file to a `DelayedArray`-derived
data structure. The rich common methods and data operations defined on
`GDSArray` makes it more _R_-user-friendly than working with the GDS
file directly. The array data from GDS files are always returned with
the first dimension being `variants/snps` and the second dimension
being `samples`. This feature is consistent with the assay data saved
in `SummarizedExperiment`, and makes the `GDSArray` package
interoperable with other established _Bioconductor_ data
infrastructure.

[GDSArray]: https://bioconductor.org/packages/GDSArray
[DelayedArray]: https://bioconductor.org/packages/DelayedArray

# Package installation

1. Download the package from Bioconductor.

```{r getPackage, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("GDSArray")
```

2. Load the package into R session.
```{r Load, message=FALSE}
library(GDSArray)
```

# GDS format introduction

## Genomic Data Structure (GDS)

The _Bioconductor_ package [gdsfmt][] has provided a high-level R
interface to CoreArray Genomic Data Structure (GDS) data files, which
is designed for large-scale datasets, especially for data which are
much larger than the available random-access memory.

The GDS format has been widely used in genetic/genomic research for
high-throughput genotyping or sequencing data. There are two major
classes that extends the `gds.class`: `SNPGDSFileClass` suited for
genotyping data (e.g., GWAS), and `SeqVarGDSClass` that are designed
specifically for DNA-sequencing data. The file format attribute in
each data class is set as `SNP_ARRAY` and `SEQ_ARRAY`. There are rich
functions written based on these data classes for common data
operation and statistical analysis.

More details about GDS format can be found in the vignettes of the
[gdsfmt][], [SNPRelate][], and [SeqArray][] packages.

[gdsfmt]: https://bioconductor.org/packages/gdsfmt
[SNPRelate]: https://bioconductor.org/packages/SNPRelate
[SeqArray]: https://bioconductor.org/packages/SeqArray

# `GDSArray`, `GDSMatrix`, and `GDSFile`

`GDSArray` represents GDS files as `DelayedArray` instances. It has
methods like `dim`, `dimnames` defined, and it inherits array-like
operations and methods from `DelayedArray`, e.g., the subsetting
method of `[`.

## GDSArray, GDSMatrix, and DelayedArray

The `GDSArray()` constructor takes as arguments the file path and the
GDS node inside the GDS file. The `GDSArray()` constructor always
returns the object with rows being features (genes / variants / snps)
and the columns being "samples". This is consistent with the assay
data inside `SummarizedExperiment`.

```{r, GDSArray}
file <- SeqArray::seqExampleFileName("gds")
GDSArray(file, "genotype/data")
```
A `GDSMatrix` is a 2-dimensional `GDSArray`, and will be returned from
the `GDSArray()` constructor automatically if the input GDS node is
2-dimensional.

```{r, GDSMatrix}
GDSArray(file, "phase/data")
```

## `GDSFile` 

The `GDSFile` is a light-weight class to represent GDS files. It
has the `$` completion method to complete any possible gds nodes. It
could be used as a convenient `GDSArray` constructor if the slot of
`current_path` in `GDSFile` object represents a valid gds node. 
Otherwise, it will return the `GDSFile` object with an updated `current_path`.  

```{r, GDSFile}
gf <- GDSFile(file)
gf$annotation$info
gf$annotation$info$AC
```

Try typing in `gf$ann` and pressing `tab` key for the completion.  

## `GDSArray` methods

### slot accessors. 
- `seed` returns the `GDSArraySeed` of the `GDSArray` object.
```{r, seedAccessor}
ga <- GDSArray(file, "genotype/data")
seed(ga)
```

- `gdsfile` returns the file path of the corresponding GDS file. 
```{r, gdsfileAccessor}
gdsfile(ga)
```

### Available GDS nodes

`gdsnodes()` takes the GDS file path or `GDSFile` object as input, and
returns all nodes that can be converted to `GDSArray` instances. The
returned GDS node names can be used as input for the `GDSArray(name=)`
constructor.

```{r, gdsnodes}
gdsnodes(file)
identical(gdsnodes(file), gdsnodes(gf))
GDSArray(file, name=gdsnodes(file)[2])
```

### `dim()`, `dimnames()`

The `dimnames(GDSArray)` returns an unnamed list, with the length of
each element to be the same as return from `dim(GDSArray)`.

```{r, dims}
ga <- GDSArray(file, "phase/data")
dim(ga)
class(dimnames(ga))
lengths(dimnames(ga))
```

### `[` subsetting

`GDSArray` instances can be subset, following the usual _R_
conventions, with numeric or logical vectors; logical vectors are
recycled to the appropriate length.

```{r, methods}
ga[1:3, 10:15]
ga[c(TRUE, FALSE), ]
```

### some numeric calculation

```{r, numeric}
dp <- GDSArray(file, "annotation/format/DP/data")
dp
log(dp)

dp[rowMeans(dp) < 60, ]
```

## Internals: `GDSArraySeed`

The `GDSArraySeed` class represents the 'seed' for the `GDSArray`
object. It is not exported from the [GDSArray][] package. Seed objects
should contain the GDS file path, and are expected to satisfy the
“seed contract” i.e. to support dim() and dimnames().

```{r, GDSArraySeed}
seed <- GDSArray:::GDSArraySeed(file, "genotype/data")
seed
```

The seed can be used to construct a `GDSArray` instance.

```{r, GDSArray-from-GDSArraySeed}
GDSArray(seed)
```

The `DelayedArray()` constructor with `GDSArraySeed` object as
argument will return the same content as the `GDSArray()` constructor
over the same `GDSArraySeed`.

```{r, da}
class(DelayedArray(seed))
```

# sessionInfo

```{r, sessionInfo}
sessionInfo()
```