--- title: "SIAMCAT input files formats" author: - name: "Konrad Zych, Jakob Wirbel, and Georg Zeller" affiliation: "EMBL Heidelberg" email: "georg.zeller@embl.de" date: "Date last modified: 2018-09-24" output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{SIAMCAT.input} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{ASCII} --- # Introduction This vignette illustrates how to read and input your own data to the `SIAMCAT` package. We will cover reading in text files from the disk, formatting them and using them to create an object of `siamcat-class`. The `siamcat-class` is the centerpiece of the package. All of the input data and result are stored inside of it. The structure of the object is described below in the [siamcat-class object](#siamcat-class-object) section. # Loading your data into R ## SIAMCAT input Generally, there are three types of input for `SIAMCAT`: ### Features The features should be a `matrix`, a `data.frame`, or an `otu_table`, organized as follows: `features (in rows) x samples (in columns).` | | Sample_1 | Sample_2 | Sample_3 | Sample_4 | Sample_5 | | --- | ---:| ---:| ---:| ---:| ---:| | **Feature_1** | 0.59 | 0.71 | 0.78 | 0.61 | 0.66 | | **Feature_2** | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 | | **Feature_3** | 0.02 | 0.00 | 0.00 | 0.00 | 0.20 | | **Feature_4** | 0.34 | 0.00 | 0.13 | 0.07 | 0.00 | | **Feature_5** | 0.06 | 0.16 | 0.00 | 0.00 | 0.00 | > Please note that `SIAMCAT` is supposed to work with **relative abundances**. Other types of data (e.g. counts) will also work, but not all functions of the package will result in meaningful outputs. An example of a typical feature file is attached to the `SIAMCAT` package, containing data from a publication investigating the microbiome in colorectal cancer (CRC) patients and controls (the study can be found here: [Zeller et al](http://europepmc.org/abstract/MED/25432777)). The metagenomics data were processed with the [MOCAT](http://mocat.embl.de/) pipeline, returning taxonomic profiles on the species levels (`specI`): ```{r feat_file, message=FALSE} library(SIAMCAT) fn.in.feat <- system.file( "extdata", "feat_crc_zeller_msb_mocat_specI.tsv", package = "SIAMCAT" ) ``` One way to load such data into `R` could be the use of `read.table` _(Beware of the defaults in R! They are not always useful...)_ ```{r read_feat, message=FALSE} feat <- read.table(fn.in.feat, sep='\t', header=TRUE, quote='', stringsAsFactors = FALSE, check.names = FALSE) # look at some features feat[110:114, 1:2] ``` ### Metadata The metadata should be either a matrix or a `data.frame`. `samples (in rows) x metadata (in columns)`: | | Age | Gender | BMI | | --- | ---:| ---:| ---:| | **Sample_1** | 52 | 1 | 20| | **Sample_2** | 37 | 1 | 18 | | **Sample_3** | 66 | 2 | 24 | | **Sample_4** | 54 | 2 | 26 | | **Sample_5** | 65 | 2 | 30 | The `rownames` of the metadata should match the `colnames` of the feature matrix. Again, an example of such a file is attached to the `SIAMCAT` package, taken from the same study: ```{r meta_file, message=FALSE} fn.in.meta <- system.file( "extdata", "num_metadata_crc_zeller_msb_mocat_specI.tsv", package = "SIAMCAT" ) ``` Also here, the `read.table` can be used to load the data into `R`. ```{r read_meta, warning=FALSE} meta <- read.table(fn.in.meta, sep='\t', header=TRUE, quote='', stringsAsFactors = FALSE, check.names = FALSE) head(meta) ``` ### Label Finally, the label can come in different three different flavours: * **Named vector**: A named vector containing information about cases and controls. The names of the vector should match the `rownames` of the metadata and the `colnames` of the feature data. The label can contain either the information about cases and controls either + as integers (e.g. `0` and `1`), + as characters (e.g. `CTR` and `IBD`), or + as factors. * **Metadata column**: You can provide the name of a column in the metadata for the creation of the label. See below for an example. * **Label file**: `SIAMCAT` has a function called `read.label`, which will create a label object from a label file. The file should be organized as follows: + The first line is supposed to read: `#BINARY:1=[label for cases];-1=[label for controls]` + The second row should contain the sample identifiers as tab-separated list (consistent with feature and metadata). + The third row is then supposed to contain the actual class labels (tab-separated): `1` for each case and `-1` for each control. An example file is attached to the package again, if you want to have a look at it. For our example dataset, we can create the label object out of the metadata column called `diagnosis`: ```{r create_label, results="hide", warning=FALSE, eval=FALSE} label <- create.label(meta=meta, label="diagnosis", case = 1, control=0) ``` When we later plot the results, it might be nicer to have names for the different groups stored in the label object (instead of `1` and `0`). We can also supply them to the `create.label` function: ```{r create_label_2, warning=FALSE} label <- create.label(meta=meta, label="diagnosis", case = 1, control=0, p.lab = 'cancer', n.lab = 'healthy') label$info ``` >Note: If you have no label information for your dataset, you can still create a `SIAMCAT` object from your features alone. The `SIAMCAT` object without label information will contain a `TEST` label that can be used for making holdout predictions. Other functions, e.g. model training, will not work on such an object. ## LEfSe format files [LEfSe](https://bitbucket.org/biobakery/biobakery/wiki/lefse) is a tool for identification of associations between micriobial features and up to two metadata. LEfSe uses LDA (linear discriminant analysis). LEfSe input file is a `.tsv` file. The first few rows contain the metadata. The following row contains sample names and the rest of the rows are occupied by features. The first column holds the row names: | label | healthy | healthy | healthy | cancer | cancer | | --- | ---:| ---:| ---:| ---:| ---:| | **age** | 52 | 37 | 66 | 54 | 65 | | **gender** | 1 | 1 | 2 | 2 | 2 | |**Sample_info** | Sample_1 | Sample_2 | Sample_3 | Sample_4 | Sample_5 | | **Feature_1** | 0.59 | 0.71 | 0.78 | 0.61 | 0.66 | | **Feature_2** | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 | | **Feature_3** | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | | **Feature_4** | 0.34 | 0.00 | 0.43 | 0.00 | 0.00 | | **Feature_5** | 0.56 | 0.56 | 0.00 | 0.00 | 0.00 | An example of such a file is attached to the `SIAMCAT` package: ```{r lefse_file, message=FALSE} fn.in.lefse<- system.file( "extdata", "LEfSe_crc_zeller_msb_mocat_specI.tsv", package = "SIAMCAT" ) ``` `SIAMCAT` has a dedicated function to read LEfSe format files. The `read.lefse` function will read in the input file and extract metadata and features: ```{r read_lefse_file, results="hide", warning=FALSE} meta.and.features <- read.lefse(fn.in.lefse, rows.meta = 1:6, row.samples = 7) meta <- meta.and.features$meta feat <- meta.and.features$feat ``` We can then create a label object from one of the columns of the meta object and create a `siamcat` object: ```{r lefse_label, results="hide", warning=FALSE} label <- create.label(meta=meta, label="label", case = "cancer") ``` ## metagenomeSeq format files [metagenomeSeq](http://bioconductor.org/packages/metagenomeSeq/) is an R package to determine differentially abundant features between multiple samples. There are two ways to input data into metagenomeSeq: a) two files, one for metadata and one for features - those can be used in `SIAMCAT` just like described in [SIAMCAT input](#SIAMCAT-input) with `read.table`: ```{r read_metagenome_seq, results="hide", warning=FALSE, eval=FALSE} fn.in.feat <- system.file( "extdata", "CHK_NAME.otus.count.csv", package = "metagenomeSeq" ) feat <- read.table(fn.in.feat, sep='\t', header=TRUE, quote='', row.names = 1, stringsAsFactors = FALSE, check.names = FALSE ) ``` b) `BIOM` format file, that can be used in `SIAMCAT` as described in the [following section](BIOM-format-files) ## BIOM format files The BIOM format files can be added to `SIAMCAT` via `phyloseq`. First the file should be imported using the `phyloseq` function `import_biom`. Then a `phyloseq` object can be imported as a `siamcat` object as descibed in the [next section.](#Creating-a-siamcat-object-of-a-phyloseq-object) ## Creating a siamcat object of a phyloseq object The `siamcat` object extends on the `phyloseq` object. Therefore, creating a `siamcat` object from a `phyloseq` object is really straightforward. This can be done with the `siamcat` constructor function. First, however, we need to create a label object: ```{r create_from_phyloseq, results="hide", warning=FALSE, eval=TRUE} data("GlobalPatterns") ## phyloseq example data label <- create.label(meta=sample_data(GlobalPatterns), label = "SampleType", case = c("Freshwater", "Freshwater (creek)", "Ocean")) # run the constructor function siamcat <- siamcat(phyloseq=GlobalPatterns, label=label) ``` # Creating a siamcat-class object The `siamcat-class` is the centerpiece of the package. All of the is stored inside of the object: ![internal make-up of a siamcat object](./siamcat.png) In the figure above, rectangles depict slots of the object and the class of the object stored in the slot is given in the ovals. There are two obligatory slots -**phyloseq** (containing the metadata as `sample_data` and the original features as `otu_table`) and **label** - marked with thick borders. The `siamcat` object is constructed using the `siamcat()` function. There are two ways to initialize it: * **Features**: You can provide a feature `matrix`, `data.frame`, or `otu_table` to the function (together with label and metadata information): ```{r constructor, eval=FALSE} siamcat <- siamcat(feat=feat, label=label, meta=meta) ``` * **phyloseq**: The alternative is to create a `siamcat` object directly out of a `phyloseq` object: ```{r constructor_phyloseq, eval=FALSE} siamcat <- siamcat(phyloseq=phyloseq, label=label) ``` Please note that you **have to** provide either `feat` or `phyloseq` and that you **cannot** provide both. In order to explain the `siamcat` object better we will show how each of the slots is filled. ## phyloseq, label and orig_feat slots The phyloseq and label slots are obligatory. * The phyloseq slot is an object of class `phyloseq`, which is described in the help file of the `phyloseq` class. Help can be accessed by typing into R console: `help('phyloseq-class')`. + The `otu_table` slot in `phyloseq` -see `help('otu_table-class')`- stores the original feature table. For `SIAMCAT`, this slot can be accessed by `orig_feat`. * The label slot contains a list. This list has a specific set of entries -see `help('label-class')`- that are automatically generated by the `read.label` or `create.label` functions. The `phyloseq`, label and orig_feat are filled when the `siamcat` object is first created with the constructor function. ![construction](./allCr.png) ## All the other slots Other slots are filled during the run of the `SIAMCAT` workflow: ![workflow](./Slots_create.png) ## Accessing and assigning slots Each slot in `siamcat` can be accessed by typing ``` slot_name(siamcat) ``` e.g. for the `eval_data` slot you can types ```{r, eval=FALSE} eval_data(siamcat) ``` There is one notable exception: the phyloseq slot has to be accessed with `physeq(siamcat)` due to technical reasons. Slots will be filled during the `SIAMCAT` workflow by the package's functions. However, if for any reason a slot needs to be assigned outside of the workflow, the following formula can be used: ``` slot_name(siamcat) <- object_to_assign ``` e.g. to assign a `new_label` object to the `label` slot: ```{r, eval=FALSE} label(siamcat) <- new_label ``` _Please note that this may lead to unforeseen consequences..._ ## Slots inside the slots There are two slots that have slots inside of them. First, the `model_list` slot has a `models` slot that contains the actual list of [mlr](https://mlr-org.github.io/mlr-tutorial/devel/html/index.html) models -can be accessed via `models(siamcat)`- and `model.type` which is a character with the name of the method used to train the model: `model_type(siamcat)`. The phyloseq slot has a complex structure. However, unless the phyloseq object is created outside of the `SIAMCAT` workflow, only two slots of phyloseq slot will be occupied: the `otu_table` slot containing the features table and the `sam_data` slot containing metadata information. Both can be accessed by typing either `features(siamcat)` or `meta(siamcat)`. Additional slots inside the phyloseq slots do not have dedicated accessors, but can easily be reached once the phyloseq object is exported from the `siamcat` object: ```{r} phyloseq <- physeq(siamcat) tax_tab <- tax_table(phyloseq) head(tax_tab) ``` If you want to find out more about the phyloseq data structure, head over to the [phyloseq](https://bioconductor.org/packages/release/bioc/html/phyloseq.html) BioConductor page. # Session Info ```{r} sessionInfo() ```