---
title: "SIAMCAT input files formats"
author:
-   name: "Konrad Zych, Jakob Wirbel, and Georg Zeller"
    affiliation: "EMBL Heidelberg"
    email: "georg.zeller@embl.de"
date: "Date last modified: 2018-09-24"
output: BiocStyle::html_document
vignette: >
    %\VignetteIndexEntry{SIAMCAT.input}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{ASCII}
---

# Introduction

This vignette illustrates how to read and input your own data to the
`SIAMCAT` package. We will cover reading in text files from the disk,
formatting them and using them to create an object of `siamcat-class`.

The `siamcat-class` is the centerpiece of the package. All of the input data
and result are stored inside of it. The structure of the object is described
below in the [siamcat-class object](#siamcat-class-object) section.


# Loading your data into R

## SIAMCAT input

Generally, there are three types of input for `SIAMCAT`:

### Features

The features should be a `matrix`, a `data.frame`, or an `otu_table`,
organized as follows:

`features (in rows) x samples (in columns).`

| | Sample_1 | Sample_2  | Sample_3 | Sample_4  | Sample_5 |
| --- | ---:| ---:| ---:| ---:| ---:|
| **Feature_1** | 0.59 | 0.71 | 0.78 | 0.61 | 0.66 |
| **Feature_2** | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 |
| **Feature_3** | 0.02 | 0.00 | 0.00 | 0.00 | 0.20 |
| **Feature_4** | 0.34 | 0.00 | 0.13 | 0.07 | 0.00 |
| **Feature_5** | 0.06 | 0.16 | 0.00 | 0.00 | 0.00 |


> Please note that `SIAMCAT` is supposed to work with **relative abundances**.
Other types of data (e.g. counts) will also work, but not all functions of the
package will result in meaningful outputs.

An example of a typical feature file is attached to the `SIAMCAT` package,
containing data from a publication investigating the microbiome in colorectal
cancer (CRC) patients and controls (the study can be found here:
[Zeller et al](http://europepmc.org/abstract/MED/25432777)). The metagenomics
data were processed with the [MOCAT](http://mocat.embl.de/) pipeline, returning
taxonomic profiles on the species levels (`specI`):

```{r feat_file, message=FALSE}
library(SIAMCAT)
fn.in.feat  <- system.file(
    "extdata",
    "feat_crc_zeller_msb_mocat_specI.tsv",
    package = "SIAMCAT"
)
```

One way to load such data into `R` could be the use of `read.table`

_(Beware of the defaults in R! They are not always useful...)_

```{r read_feat, message=FALSE}
feat <- read.table(fn.in.feat, sep='\t',
    header=TRUE, quote='',
    stringsAsFactors = FALSE, check.names = FALSE)
# look at some features
feat[110:114, 1:2]
```

### Metadata

The metadata should be either a matrix or a `data.frame`.

`samples (in rows) x metadata (in columns)`:

| | Age | Gender | BMI |
| --- | ---:| ---:| ---:|
| **Sample_1** | 52 | 1 | 20|
| **Sample_2** | 37 | 1 | 18 |
| **Sample_3** | 66 | 2 | 24 |
| **Sample_4** | 54 | 2 | 26 |
| **Sample_5** | 65 | 2 | 30 |

The `rownames` of the metadata should match the `colnames` of the feature
matrix.

Again, an example of such a file is attached to the `SIAMCAT` package, taken
from the same study:

```{r meta_file, message=FALSE}
fn.in.meta  <- system.file(
    "extdata",
    "num_metadata_crc_zeller_msb_mocat_specI.tsv",
    package = "SIAMCAT"
)
```
Also here, the `read.table` can be used to load the data into `R`.

```{r read_meta, warning=FALSE}
meta <- read.table(fn.in.meta, sep='\t',
    header=TRUE, quote='',
    stringsAsFactors = FALSE, check.names = FALSE)
head(meta)
```

### Label

Finally, the label can come in different three different flavours:

* **Named vector**: A named vector containing information about cases and
controls. The names of the vector should match the `rownames` of the metadata
and the `colnames` of the feature data.
The label can contain either the information about cases and controls either
    + as integers (e.g. `0` and `1`),
    + as characters (e.g. `CTR` and `IBD`), or
    + as factors.

* **Metadata column**: You can provide the name of a column in the metadata for
the creation of the label. See below for an example.

* **Label file**: `SIAMCAT` has a function called `read.label`, which will
create a label object from a label file. The file should be organized as
follows:
    + The first line is supposed to read:
        `#BINARY:1=[label for cases];-1=[label for controls]`
    + The second row should contain the sample identifiers as tab-separated
        list (consistent with feature and metadata).
    + The third row is then supposed to contain the actual class labels
        (tab-separated): `1` for each case and `-1` for each control.

    An example file is attached to the package again, if you want to have a
    look at it.

For our example dataset, we can create the label object out of the metadata
column called `diagnosis`:

```{r create_label, results="hide", warning=FALSE, eval=FALSE}
label <- create.label(meta=meta, label="diagnosis",
    case = 1, control=0)
```

When we later plot the results, it might be nicer to have names for the
different groups stored in the label object (instead of `1` and `0`). We can
also supply them to the `create.label` function:

```{r create_label_2, warning=FALSE}
label <- create.label(meta=meta, label="diagnosis",
    case = 1, control=0,
    p.lab = 'cancer', n.lab = 'healthy')
label$info
```


>Note:  
If you have no label information for your dataset, you can still create a
`SIAMCAT` object from your features alone. The `SIAMCAT` object without label
information will contain a `TEST` label that can be used for making holdout
predictions. Other functions, e.g. model training, will not work on such an
object.


## LEfSe format files

[LEfSe](https://bitbucket.org/biobakery/biobakery/wiki/lefse) is a tool for
identification of associations between micriobial features and up to two
metadata. LEfSe uses LDA (linear discriminant analysis).

LEfSe input file is a `.tsv` file. The first few rows contain the metadata. The
following row contains sample names and the rest of the rows are occupied by
features. The first column holds the row names:

| label | healthy | healthy  | healthy | cancer  | cancer |
| --- | ---:| ---:| ---:| ---:| ---:|
| **age** | 52 | 37  | 66 | 54  | 65 |
| **gender** | 1 | 1  | 2 | 2  | 2 |
|**Sample_info** | Sample_1 | Sample_2  | Sample_3 | Sample_4  | Sample_5 |
| **Feature_1** | 0.59 | 0.71 | 0.78 | 0.61 | 0.66 |
| **Feature_2** | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 |
| **Feature_3** | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 |
| **Feature_4** | 0.34 | 0.00 | 0.43 | 0.00 | 0.00 |
| **Feature_5** | 0.56 | 0.56 | 0.00 | 0.00 | 0.00 |

An example of such a file is attached to the `SIAMCAT` package:

```{r lefse_file, message=FALSE}
fn.in.lefse<- system.file(
    "extdata",
    "LEfSe_crc_zeller_msb_mocat_specI.tsv",
    package = "SIAMCAT"
)
```

`SIAMCAT` has a dedicated function to read LEfSe format files. The `read.lefse`
function will read in the input file and extract metadata and features:

```{r read_lefse_file, results="hide", warning=FALSE}
meta.and.features <- read.lefse(fn.in.lefse,
    rows.meta = 1:6, row.samples = 7)
meta <- meta.and.features$meta
feat <- meta.and.features$feat
```

We can then create a label object from one of the columns of the meta object and
create a `siamcat` object:

```{r lefse_label, results="hide", warning=FALSE}
label <- create.label(meta=meta, label="label", case = "cancer")
```

## metagenomeSeq format files

[metagenomeSeq](http://bioconductor.org/packages/metagenomeSeq/) is an R
package to determine differentially abundant features between multiple samples.

There are two ways to input data into metagenomeSeq:

a) two files, one for metadata and one for features - those can be used
    in `SIAMCAT` just like described in [SIAMCAT input](#SIAMCAT-input) with
    `read.table`:

```{r read_metagenome_seq, results="hide", warning=FALSE, eval=FALSE}
fn.in.feat  <- system.file(
    "extdata",
    "CHK_NAME.otus.count.csv",
    package = "metagenomeSeq"
)
feat <- read.table(fn.in.feat, sep='\t',
    header=TRUE, quote='', row.names = 1,
    stringsAsFactors = FALSE, check.names = FALSE
)
```

b) `BIOM` format file, that can be used in `SIAMCAT` as described in the
[following section](BIOM-format-files)

## BIOM format files

The BIOM format files can be added to `SIAMCAT` via `phyloseq`. First the file
should be imported using the `phyloseq` function `import_biom`. Then a
`phyloseq` object can be imported as a `siamcat` object as descibed in the
[next section.](#Creating-a-siamcat-object-of-a-phyloseq-object)

## Creating a siamcat object of a phyloseq object

The `siamcat` object extends on the `phyloseq` object. Therefore, creating
a `siamcat` object from a `phyloseq` object is really straightforward. This
can be done with the `siamcat` constructor function. First, however, we need
to create a label object:

```{r create_from_phyloseq, results="hide", warning=FALSE, eval=TRUE}
data("GlobalPatterns") ## phyloseq example data
label <- create.label(meta=sample_data(GlobalPatterns),
    label = "SampleType",
    case = c("Freshwater", "Freshwater (creek)", "Ocean"))
# run the constructor function
siamcat <- siamcat(phyloseq=GlobalPatterns, label=label)
```

# Creating a siamcat-class object

The `siamcat-class` is the centerpiece of the package. All of the is stored
inside of the object:
![internal make-up of a siamcat object](./siamcat.png)

In the figure above, rectangles depict slots of the object and the class of
the object stored in the slot is given in the ovals. There are two
obligatory slots -**phyloseq** (containing the metadata as `sample_data` and
the original features as `otu_table`) and **label** - marked with thick borders.

The `siamcat` object is constructed using the `siamcat()` function. There are
two ways to initialize it:

*  **Features**: You can provide a feature `matrix`, `data.frame`, or
    `otu_table` to the function (together with label and metadata information):
    ```{r constructor, eval=FALSE}
    siamcat <- siamcat(feat=feat, label=label, meta=meta)
    ```

*  **phyloseq**: The alternative is to create a `siamcat` object directly out
    of a `phyloseq` object:
    ```{r constructor_phyloseq, eval=FALSE}
    siamcat <- siamcat(phyloseq=phyloseq, label=label)
    ```

Please note that you **have to** provide either `feat` or `phyloseq` and that
you **cannot** provide both.

In order to explain the `siamcat` object better we will show how each of the
slots is filled.

## phyloseq, label and orig_feat slots

The phyloseq and label slots are obligatory.

* The phyloseq slot is an object of class `phyloseq`, which is described in the
    help file of the `phyloseq` class. Help can be accessed by typing into R
    console: `help('phyloseq-class')`.
    + The `otu_table` slot in `phyloseq` -see `help('otu_table-class')`-
        stores the original feature table. For `SIAMCAT`, this slot can be
        accessed by `orig_feat`.
* The label slot contains a list. This list has a specific set of entries
    -see `help('label-class')`- that are automatically generated by the
    `read.label` or `create.label` functions.

The `phyloseq`, label and orig_feat are filled when the `siamcat` object is
first created with the constructor function.
![construction](./allCr.png)

## All the other slots
Other slots are filled during the run of the `SIAMCAT` workflow:
![workflow](./Slots_create.png)

## Accessing and assigning slots

Each slot in `siamcat` can be accessed by typing
```
slot_name(siamcat)
```
e.g. for the `eval_data` slot you can types
```{r, eval=FALSE}
eval_data(siamcat)
```
There is one notable exception: the phyloseq slot has to be accessed with
`physeq(siamcat)` due to technical reasons.

Slots will be filled during the `SIAMCAT` workflow by the package's functions.
However, if for any reason a slot needs to be assigned outside of the workflow,
the following formula can be used:
```
slot_name(siamcat) <- object_to_assign
```
e.g. to assign a `new_label` object to the  `label` slot:
```{r, eval=FALSE}
label(siamcat) <- new_label
```

_Please note that this may lead to unforeseen consequences..._

## Slots inside the slots

There are two slots that have slots inside of them. First, the `model_list`
slot has a `models` slot that contains the actual list of
[mlr](https://mlr-org.github.io/mlr-tutorial/devel/html/index.html) models
-can be accessed via `models(siamcat)`- and `model.type` which is a character
with the name of the method used to train the model: `model_type(siamcat)`.

The phyloseq slot has a complex structure. However, unless the phyloseq
object is created outside of the `SIAMCAT` workflow, only two slots of phyloseq
slot will be occupied: the `otu_table` slot containing the features table and
the `sam_data` slot containing metadata information. Both can be accessed by
typing either `features(siamcat)` or `meta(siamcat)`.

Additional slots inside the phyloseq slots do not have dedicated accessors,
but can easily be reached once the phyloseq object is exported from the
`siamcat` object:

```{r}
phyloseq <- physeq(siamcat)
tax_tab <- tax_table(phyloseq)
head(tax_tab)
```

If you want to find out more about the phyloseq data structure, head over to
the
[phyloseq](https://bioconductor.org/packages/release/bioc/html/phyloseq.html)
BioConductor page.
# Session Info

```{r}
sessionInfo()
```