---
title: "Collision removal functionality"
author: 
  - name: Giulia Pais
    affiliation: | 
     San Raffaele Telethon Institute for Gene Therapy - SR-Tiget, 
     Via Olgettina 60, 20132 Milano - Italia
    email: giuliapais1@gmail.com, calabria.andrea@hsr.it
output: 
  BiocStyle::html_document:
    self_contained: yes
    toc: true
    toc_float: true
    toc_depth: 2
    code_folding: show
date: "`r doc_date()`"
package: "`r pkg_ver('ISAnalytics')`"
vignette: >
  %\VignetteIndexEntry{collision_removal}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}  
---

```{r GenSetup, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    crop = NULL
    ## Related to
    ## https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html
)
```

```{r vignetteSetup, echo=FALSE, message=FALSE, warning = FALSE}
## Bib setup
library("RefManageR")

## Write bibliography information
bib <- c(
    R = citation(),
    BiocStyle = citation("BiocStyle")[1],
    knitr = citation("knitr")[1],
    RefManageR = citation("RefManageR")[1],
    rmarkdown = citation("rmarkdown")[1],
    sessioninfo = citation("sessioninfo")[1],
    testthat = citation("testthat")[1],
    ISAnalytics = citation("ISAnalytics")[1]
)
```

# Introduction

```{r echo=FALSE}
inst_chunk_path <- system.file("rmd", "install_and_options.Rmd", package = "ISAnalytics")
```

```{r child=inst_chunk_path}

```

```{r}
library(ISAnalytics)
```

## What is a collision and why should you care?

This section gives a brief, simplified explanation of what a "collision" is
and how the function in this package deals with collisions.

We say that an integration (that is, a unique combination of chromosome,
integration locus and strand) is a *collision* if this combination is shared
between different independent samples, where an independent sample is a unique
combination of `ProjectID` and `SubjectID` (subjects usually represent
patients). It is highly improbable to observe the very same integration in two
different subjects, so this phenomenon might indicate some kind of
contamination in the sequencing phase or in the PCR phase. For this reason, we
might want to exclude such contamination from our analysis.
`ISAnalytics` provides a function, `remove_collisions()`, that processes the
imported data to remove or reassign these "problematic" integrations.

The processing is done using the sequence count value, so the corresponding
matrix is needed for this operation.

## The logic behind the function

The `remove_collisions()` function follows several logical steps to decide
whether an integration is a collision and, if it is, whether to re-assign it or
remove it entirely, based on different criteria.

### Identifying the collisions

As we said before, a collision is a triplet made of `chr`, `integration_locus`
and `strand` that is shared between different independent samples, each
identified by a pair made of `ProjectID` and `SubjectID`. The function uses the
information stored in the association file to assess which independent samples
are present and counts the number of independent samples for each integration:
those with a count greater than 1 are considered collisions.

```{r echo=FALSE}
ex_coll <- tibble::tribble(
  ~ chr, ~ integration_locus, ~ strand, ~ seqCount, ~ CompleteAmplificationID,
  ~ SubjectID, ~ ProjectID, 
  "1", 123454,  "+", 653, "SAMPLE1", "SUBJ01", "PJ01",
  "1", 123454, "+", 456, "SAMPLE2", "SUBJ02", "PJ01"
)
knitr::kable(ex_coll, caption = paste("Example of collisions: the same",
                                      "integration (1, 123454, +) is found",
                                      "in 2 different independent samples",
                                      "((SUBJ01, PJ01) & (SUBJ02, PJ01))"))
```
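The counting step described above can be sketched in base R. This is only an illustration on toy data, not the code used by `remove_collisions()`: the data frame `ex`, the helper columns `integration` and `indep_sample`, and the extra third row are all hypothetical.

```r
# Toy data modelled on the example table (hypothetical values)
ex <- data.frame(
  chr = c("1", "1", "2"),
  integration_locus = c(123454, 123454, 5000),
  strand = c("+", "+", "-"),
  seqCount = c(653, 456, 120),
  CompleteAmplificationID = c("SAMPLE1", "SAMPLE2", "SAMPLE1"),
  SubjectID = c("SUBJ01", "SUBJ02", "SUBJ01"),
  ProjectID = c("PJ01", "PJ01", "PJ01"),
  stringsAsFactors = FALSE
)

# One key per integration and one key per independent sample
ex$integration <- paste(ex$chr, ex$integration_locus, ex$strand, sep = "_")
ex$indep_sample <- paste(ex$ProjectID, ex$SubjectID, sep = "_")

# Number of distinct independent samples observed for each integration
n_samples <- tapply(
  ex$indep_sample, ex$integration,
  function(s) length(unique(s))
)

# Integrations seen in more than one independent sample are collisions
collisions <- names(n_samples)[n_samples > 1]
collisions
```

Here only the integration (1, 123454, +) is flagged, because it is the only one found in two different independent samples.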

### Re-assign vs remove

Once the collisions are identified, the function goes through 3 steps in which
it tries to re-assign the integration to a single independent sample.
The criteria are:

1. Compare dates: if an absolute ordering on dates is possible, the
integration is re-assigned to the sample that has the earliest date. If two
samples share the same date it's impossible to decide, so the next criterion is
tested.
2. Compare replicate number: if a sample has the same integration in more than
one replicate, it's more probable that the integration is not an artifact. If
an absolute ordering is possible, the collision is re-assigned to the sample in
which the integration appears in the largest number of replicates.
3. Compare the sequence count value: if the previous criteria weren't
sufficient to make a decision, the sequence count values are summed for each
independent sample and the cumulative values are compared across groups. If a
single group has a cumulative value n times bigger than that of the other
groups, that group is chosen for re-assignment. The factor n is passed to the
function through the `reads_ratio` parameter; the default value is 10.

If none of the criteria is sufficient to make a decision, the integration
is simply removed from the matrix.
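The decision cascade can be summarised with a small base R sketch. It is one possible reading of the three criteria, not the implementation inside `remove_collisions()`: the function `resolve_collision()`, the helper `make_groups()` and the columns `sample`, `date`, `n_replicates` and `seq_count` are hypothetical. It also assumes that a date ordering is usable when exactly one sample holds the earliest date, and that "n times bigger" means greater than or equal to `reads_ratio` times every other group.

```r
# `groups`: one row per independent sample involved in the collision
# (hypothetical structure, for illustration only)
resolve_collision <- function(groups, reads_ratio = 10) {
  # 1. Dates: assign to the sample with the earliest date, if it is unique
  earliest <- groups$date == min(groups$date)
  if (sum(earliest) == 1) {
    return(groups$sample[earliest])
  }
  # 2. Replicates: assign to the sample with the most replicates, if unique
  most_reps <- groups$n_replicates == max(groups$n_replicates)
  if (sum(most_reps) == 1) {
    return(groups$sample[most_reps])
  }
  # 3. Sequence count: the top group must be at least `reads_ratio` times
  #    every other group (assumed interpretation of the ratio)
  top <- which.max(groups$seq_count)
  top_value <- groups$seq_count[top]
  others <- groups$seq_count[-top]
  if (top_value > 0 && all(top_value >= reads_ratio * others)) {
    return(groups$sample[top])
  }
  # No criterion was decisive: the integration would be removed
  NA_character_
}

# Small helper that builds a two-sample collision
make_groups <- function(date, n_replicates, seq_count) {
  data.frame(
    sample = c("A", "B"),
    date = as.Date(date),
    n_replicates = n_replicates,
    seq_count = seq_count,
    stringsAsFactors = FALSE
  )
}

same_day <- c("2020-01-10", "2020-01-10")
by_date <- resolve_collision(
  make_groups(c("2020-03-01", "2020-01-10"), c(1, 1), c(100, 100))
)
by_reps <- resolve_collision(make_groups(same_day, c(3, 1), c(100, 100)))
by_count <- resolve_collision(make_groups(same_day, c(2, 2), c(1000, 50)))
removed <- resolve_collision(make_groups(same_day, c(2, 2), c(100, 50)))
```

In this toy run `by_date` is `"B"` (earliest date), `by_reps` and `by_count` are `"A"`, and `removed` is `NA` because a ratio of 2 is below the default `reads_ratio` of 10.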

# Usage

```{r}
data("integration_matrices", package = "ISAnalytics")
data("association_file", package = "ISAnalytics")
## Multi quantification matrix
no_coll <- remove_collisions(x = integration_matrices,
                             association_file = association_file,
                             report_path = NULL)
## Matrix list
separated <- separate_quant_matrices(integration_matrices)
no_coll_list <- remove_collisions(x = separated,
                             association_file = association_file,
                             report_path = NULL)
## Only sequence count
no_coll_single <- remove_collisions(x = separated$seqCount,
                             association_file = association_file,
                             quant_cols = c(seqCount = "Value"),
                             report_path = NULL)
```

Important notes on the association file:

* Make sure your association file is properly filled out. The function
requires you to specify a date column (by default "SequencingDate"), and this
column must not contain NA values or incorrect values.
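
A quick check for missing dates can be run before calling `remove_collisions()`. The snippet below is a minimal sketch on a toy association file: the data frame `af` and its contents are hypothetical, and only the default column name "SequencingDate" comes from the text above.

```r
# Toy association file with one missing date (hypothetical data)
af <- data.frame(
  SubjectID = c("SUBJ01", "SUBJ02", "SUBJ03"),
  SequencingDate = as.Date(c("2020-01-10", NA, "2020-02-01")),
  stringsAsFactors = FALSE
)

date_col <- "SequencingDate"

# Rows whose date is missing and should be fixed before collision removal
missing_dates <- af[is.na(af[[date_col]]), , drop = FALSE]
nrow(missing_dates)
```

Any row returned in `missing_dates` should be corrected in the association file first.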

The function accepts different inputs, namely:

* A multi-quantification matrix: this is always
the recommended approach
* A named list of matrices where names are quantification types in
`quantification_types()`
* The single sequence count matrix: this is not the recommended approach
since it requires a realignment step for other quantification matrices if 
you have them.

If the option `ISAnalytics.reports` is active, an interactive report in 
HTML format will be produced at the specified path.

# Re-align other matrices

If you passed the standalone sequence count matrix to `remove_collisions()`,
you can realign the other matrices by calling `realign_after_collisions()`,
passing as input the processed sequence count matrix and a named list of the
other matrices to realign.
**NOTE: the names in the list must be quantification types.**

```{r realign}
other_realigned <- realign_after_collisions(
  sc_matrix = no_coll_single,
  other_matrices = list(fragmentEstimate = separated$fragmentEstimate)
)
```

# Reproducibility

`R` session information.

```{r reproduce3, echo=FALSE}
## Session info
library("sessioninfo")
options(width = 120)
session_info()
```

# Bibliography

This vignette was generated using `r Biocpkg("BiocStyle")` `r Citep(bib[["BiocStyle"]])`
with `r CRANpkg("knitr")` `r Citep(bib[["knitr"]])` and `r CRANpkg("rmarkdown")` `r Citep(bib[["rmarkdown"]])` running behind the scenes.

Citations made with `r CRANpkg("RefManageR")` `r Citep(bib[["RefManageR"]])`.

```{r vignetteBiblio, results = "asis", echo = FALSE, warning = FALSE, message = FALSE}
## Print bibliography
PrintBibliography(bib, .opts = list(hyperlink = "to.doc", style = "html"))
```