--- title: "DelayedDataFrame: an on-disk represention of DataFrame" author: - name: Qian Liu affiliation: Roswell Park Comprehensive Cancer Center, Buffalo, NY - name: Hervé Pagès affiliation: Fred Hutchinson Cancer Research Center, Seattle, WA - name: Martin Morgan affiliation: Roswell Park Comprehensive Cancer Center, Buffalo, NY date: "last edit: 10/15/2021" output: BiocStyle::html_document: toc: true toc_float: true package: DelayedDataFrame vignette: | %\VignetteIndexEntry{DelayedDataFrame} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "##" ) ``` ```{r options, eval=TRUE, echo=FALSE} options(showHeadLines=3) options(showTailLines=3) ``` # Introduction As the genetic/genomic data are having increasingly larger profile, the annotation file are also getting much bigger than expected. the memory space in _R_ has been an obstable for fast and efficient data processing, because most available _R_ or _Bioconductor_ packages are developed based on in-memory data manipulation. With some newly developed data structure as [HDF5][] or [GDS][], and the _R_ interface of [DelayedArray][] to represent on-disk data structures with different back-end in _R_-user-friendly array data structure (e.g., [HDF5Array][],[GDSArray][]), the high-throughput genetic/genomic data are now being able to easily loaded and manipulated within _R_. However, the annotation files for the samples and features inside the high-through data are also getting unexpectedly larger than before. With an ordinary `data.frame` or `DataFrame`, it is still getting more and more challenging for any analysis to be done within _R_. So here we have developed the `DelayedDataFrame`, which has the very similar characteristics as `data.frame` and `DataFrame`. But at the same time, all column data could be optionally saved on-disk (e.g., in [DelayedArray][] structure with any back-end). Common operations like constructing, subsetting, splitting, combining could be done in the same way as `DataFrame`. This feature of `DelayedDataFrame` could enable efficient on-disk reading and processing of the large-scale annotation files, and at the same, signicantly saves memory space with common `DataFrame` metaphor in _R_ and _Bioconductor_. [HDF5]: https://www.hdfgroup.org/solutions/hdf5/ [GDS]: http://corearray.sourceforge.net/ [DelayedArray]: https://bioconductor.org/packages/DelayedArray [GDSArray]: https://bioconductor.org/packages/GDSArray [HDF5Array]: https://bioconductor.org/packages/HDF5Array # Installation Download the package from _Bioconductor_: ```{r getPackage, eval=FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("DelayedDataFrame") ``` The development version is also available to download through github: ```{r getDevel, eval=FALSE} BiocManager::install("Bioconductor/DelayedDataFrame") ``` Load the package into _R_ session before using: ```{r Load, message=FALSE, warning=FALSE} library(DelayedDataFrame) ``` # DelayedDataFrame class ## class extension `DelayedDataFrame` extends the `DataFrame` data structure, with an additional slot called `lazyIndex`, which saves all the mapping indexes for each column of the data inside `DelayedDataFrame`. It is similar to `data.frame` in terms of construction, subsetting, splitting, combining... The `rownames` are having same feature as `DataFrame`. It will not be given automatically, but only by explicitly specify in the constructor function `DelayedDataFrame(, row.names=...)` or using the slot setter function `rownames()<-`. Here we use the [GDSArray][] data as example to show the `DelayedDataFrame` characteristics. [GDSArray][] is a _Bioconductor_ package that represents GDS files as objects derived from the [DelayedArray][] package and `DelayedArray` class. It carries the on-disk data path and represent the GDS nodes in a `DelayedArray`-derived data structure. The `GDSArray()` constructor takes 2 arguments: the file path and the GDS node name inside the GDS file. ```{r, GDSArray} library(GDSArray) file <- SeqArray::seqExampleFileName("gds") gdsnodes(file) varid <- GDSArray(file, "annotation/id") DP <- GDSArray(file, "annotation/info/DP") ``` We use an ordinary character vector and the `GDSArray` objects to construct a `DelayedDataFrame` object. ```{r, construction} ddf <- DelayedDataFrame(varid, DP) ## only accommodate 1D GDSArrays with same length ``` ## slot accessors The slots of `DelayedDataFrame` could be accessed by `lazyIndex()`, `nrow()`, `rownames()` (if not NULL) functions. With a newly constructed `DelayedDataFrame` object, the initial value of `lazyIndex` slot will be NULL for all columns. ```{r, accessors} lazyIndex(ddf) nrow(ddf) rownames(ddf) ``` ## `lazyIndex` slot The `lazyIndex` slot is in `LazyIndex` class, which is defined in the `DelayedDataFrame` package and extends the `SimpleList` class. The `listData` slot saves unique indexes for all the columns, and the `index` slots saves the position of index in `listData` slot for each column in `DelayedDataFrame` object. In the above example, with an initial construction of `DelayedDataFrame` object, the index for each column will all be NULL, and all 3 columns points the NULL values which sits in the first position in `listData` slot of `lazyIndex`. ```{r} lazyIndex(ddf)@listData lazyIndex(ddf)@index ``` Whenever an operation is done (e.g., subsetting), the `listData` slot inside the `DelayedDataFrame` stays the same, but the `lazyIndex` slot will be updated, so that the show method, further statistical calculation will be applied to the subsetting data set. For example, here we subset the `DelayedDataFrame` object `ddf` to keep only the first 5 rows, and see how the `lazyIndex` works. As shown in below, after subsetting, the `listData` slot in `ddf1` stays the same as `ddf`. But the subsetting operation was recorded in the `lazyIndex` slot, and the slots of `lazyIndex`, `nrows` and `rownames` (if not NULL) are all updated. So the subsetting operation is kind of `delayed`. ```{r, lazyIndex} ddf1 <- ddf[1:20,] identical(ddf@listData, ddf1@listData) lazyIndex(ddf1) nrow(ddf1) ``` Only when functions like `DataFrame()`, or `as.list()`, the `lazyIndex` will be realized and `DelayedDataFrame` returned. We will show the realization in the following coercion method section. # DelayedDataFrame methods The common methods on `data.frame` or `DataFrame` are also defined on `DelayedDataFrame` class, so that they behave similarily on `DelayedDataFrame` objects. ## Coercion methods Coercion methods between `DelayedDataFrame` and other data structures are defined. When coercing from `ANY` to `DelayedDataFrame`, the `lazyIndex` slot will be added automatically, with the initial NULL value of indexes for each column. - From vector ```{r} as(letters, "DelayedDataFrame") ``` - From DataFrame ```{r} as(DataFrame(letters), "DelayedDataFrame") ``` - From list ```{r} (a <- as(list(a=1:5, b=6:10), "DelayedDataFrame")) lazyIndex(a) ``` When coerce `DelayedDataFrame` into other data structure, the `lazyIndex` slot will be realized and the new data structure returned. For example, when `DelayedDataFrame` is coerced into a `DataFrame` object, the `listData` slot will be updated according to the `lazyIndex` slot. ```{r} df1 <- as(ddf1, "DataFrame") df1@listData dim(df1) ``` ## Subsetting methods ### subsetting by `[` two-dimensional `[` subsetting on `DelayedDataFrame` objects by integer, character, logical values all work. - integer subscripts. ```{r, singleSB1} ddf[, 1, drop=FALSE] ``` - character subscripts (column names). ```{r, singleSB2} ddf[, "DP", drop=FALSE] ``` - logical subscripts. ```{r, singleSB3} ddf[, c(TRUE,FALSE), drop=FALSE] ``` When subsetting using `[` on an already subsetted `DelayedDataFrame` object, the `lazyIndex`, `nrows` and `rownames`(if not NULL) slot will be updated. ```{r, singleSB4} (a <- ddf1[1:10, 2, drop=FALSE]) lazyIndex(a) nrow(a) ``` ### subsetting by `[[` The `[[` subsetting will take column subscripts for integer or character values, and return corresponding columns in it's original data format. ```{r, doubleSB} ddf[[1]] ddf[["varid"]] identical(ddf[[1]], ddf[["varid"]]) ``` ## `rbind/cbind` When doing `rbind`, the `lazyIndex` of input arguments will be realized and a new `DelayedDataFrame` with NULL lazyIndex will be returned. ```{r, rbind} ddf2 <- ddf[21:40, ] (ddfrb <- rbind(ddf1, ddf2)) lazyIndex(ddfrb) ``` `cbind` of `DelayedDataFrame` objects will keep all existing `lazyIndex` of input arguments and carry into the new `DelayedDataFrame` object. ```{r, cbind, error=FALSE} (ddfcb <- cbind(varid = ddf1[,1, drop=FALSE], DP=ddf1[, 2, drop=FALSE])) lazyIndex(ddfcb) ``` # sessionInfo ```{r, sessioninfo} sessionInfo() ```