---
title: Saving objects to artifacts and back again
author:
- name: Aaron Lun
  email: infinite.monkeys.with.keyboards@gmail.com
package: alabaster.base
date: "Revised: September 29, 2022"
output:
  BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{Saving and loading artifacts}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, echo=FALSE}
library(BiocStyle)
self <- Biocpkg("alabaster.base")
knitr::opts_chunk$set(error=FALSE, warning=FALSE, message=FALSE)
```

# Introduction

The `r self` package (and its family) implements methods to save common Bioconductor objects to file artifacts and load them back into R.
This aims to provide a functional equivalent to RDS-based serialization that is:

- More stable to changes in the class definition. Such changes typically require costly `updateObject()` operations at best, or invalidate existing RDS files at worst.
- More interoperable with other analysis frameworks. All artifacts are saved in standard formats (e.g., CSV, HDF5) that can be easily parsed by applications in other languages.
- More modular, with each object split into multiple artifacts. This enables parts of the object to be loaded into memory according to each application's needs, and allows parts to be updated cheaply on disk without rewriting all files.

# Quick start

To demonstrate, let's mock up a `DataFrame` object from the `r Biocpkg("S4Vectors")` package.

```{r}
library(S4Vectors)
df <- DataFrame(X=1:10, Y=letters[1:10])
df
```

We'll save this `DataFrame` to file inside a "staging directory":

```{r}
staging <- tempfile()
library(alabaster.base)
saveLocalObject(df, staging, "my_favorite_df")
```

And reading it back in:

```{r}
readLocalObject(staging, "my_favorite_df")
```

# Class-specific methods

Each class implements a staging and loading method for use inside the _alabaster_ framework.
The staging method (for the `stageObject()` generic) saves the object to one or more files, given a staging directory and a path inside it:

```{r}
tmp <- tempfile()
dir.create(tmp)

# DataFrame method already defined for the stageObject generic:
meta <- stageObject(df, tmp, path="my_df")
str(meta)

# Writing the metadata to file.
invisible(.writeMetadata(meta, tmp))

list.files(tmp, recursive=TRUE)
```

Conversely, the loading function will, as the name suggests, load the object back into memory, given the staging directory and the file's metadata.
The correct loading function for each class is automatically called by `loadObject()`:

```{r}
remeta <- acquireMetadata(tmp, "my_df/simple.csv.gz")
loadObject(remeta, tmp)
```

`r self` itself supports a small set of classes from the `r Biocpkg("S4Vectors")` package; support for additional classes can be found in other packages like `r Biocpkg("alabaster.ranges")` and `r Biocpkg("alabaster.se")`.
Third-party developers can also add support for their own classes by defining new methods; see the [_Extensions_](extensions.html) vignette for details.
Incidentally, `saveLocalObject()` and `readLocalObject()` are just convenience wrappers around `stageObject()` and `loadObject()`, respectively.

# What's a staging directory?

For some concept of a "project", the staging directory contains all of the artifacts generated for that project.
Multiple objects can be saved into a single staging directory, usually in separate subdirectories to avoid conflicts.
For example:

```{r}
# Creating a nested DF to be a little spicy:
df2 <- DataFrame(Z=factor(1:5), AA=I(DataFrame(B=runif(5), C=rnorm(5))))
meta2 <- stageObject(df2, tmp, path="my_df2")
list.files(tmp, recursive=TRUE)
```

The staging directory serves as the root for the paths recorded in the metadata returned by `stageObject()` (i.e., `meta` and `meta2`).
Thus, we can easily copy the entire directory to a new file system and everything will still be correctly referenced:

```{r}
meta2$path

new.dir <- tempfile()
dir.create(new.dir)
invisible(file.copy(tmp, new.dir, recursive=TRUE))
loadObject(meta2, file.path(new.dir, basename(tmp)))
```

The simplest way to share objects is to just `zip` or `tar` the staging directory for _ad hoc_ distribution, as sketched below.
For more serious applications, `r self` can be used in conjunction with centralized file storage systems; see the [_Applications_](applications.html) vignette for more details.
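As a minimal sketch of the tarball route, using base R's `utils::tar()` and `utils::untar()`: the archive path and unpacking directory below are arbitrary choices for illustration, not part of the `r self` interface.

```{r, eval=FALSE}
# Bundle the staging directory into a gzipped tarball for ad hoc sharing.
archive <- tempfile(fileext=".tar.gz")
old.wd <- setwd(dirname(tmp))
utils::tar(archive, files=basename(tmp), compression="gzip")
setwd(old.wd)

# A recipient can unpack the archive and load objects as before.
unpacked <- tempfile()
utils::untar(archive, exdir=unpacked)
loadObject(meta2, file.path(unpacked, basename(tmp)))
```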
# Saving and validating metadata

Each file artifact is associated with JSON-formatted metadata, stored in a file with an additional `.json` suffix.
This metadata defines the interpretation of the file's contents and points to any other files that are required to represent the corresponding object in an R session.
Users can call the `.writeMetadata()` utility to quickly write the metadata to the appropriate location inside the staging directory.

```{r}
# Writing the metadata to file.
invisible(.writeMetadata(meta, dir=tmp))
invisible(.writeMetadata(meta2, dir=tmp))

# Reading a snippet.
meta.path <- file.path(tmp, paste0(meta$path, ".json"))
cat(head(readLines(meta.path), 20), sep="\n")
```

The JSON metadata for each artifact follows the constraints of the schema specified in its `$schema` property.
The schema determines which fields can be stored and which are mandatory; the loading method uses these fields to reliably restore the object into memory.
`.writeMetadata()` automatically validates the metadata against the schema to ensure that the contents are appropriate:

```{r}
meta.fail <- meta
meta.fail[["data_frame"]][["columns"]] <- NULL
try(.writeMetadata(meta.fail, dir=tmp))
```

The schema also lists the exact R function required to load an object into memory.
This means that developers can create third-party extensions for new data structures by defining new schemas, without requiring any changes to `r self`'s code.
Indeed, `loadObject()` automatically uses the schema to determine how it should load the files back into memory:

```{r}
re.read <- acquireMetadata(tmp, "my_df2/simple.csv.gz")
loadObject(re.read, tmp)
```

# Session information {-}

```{r}
sessionInfo()
```