---
title: "Analyzing Cellular DNA Barcode with CellBarcode"
author: 
- name: Wenjie Sun
- name: Anne-Marie Lyne
package: CellBarcode
output: 
  BiocStyle::html_document:
    toc_float: true
  BiocStyle::pdf_document: default
vignette: >
  %\VignetteIndexEntry{UMI_Barcode}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}  
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  message = FALSE,
  error = FALSE,
  warn = FALSE,
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(data.table)
library(ggplot2)
library(CellBarcode)
```

# Introduction

## About the package

**What's this package used for?**

  Cellular DNA barcoding (genetic lineage tracing) is a powerful tool for
  lineage tracing and clonal tracking studies. This package provides a toolbox
  for DNA barcode analysis, from extraction from fastq files to barcode error
  correction and quantification. 

**What types of barcode can this package handle?**

  The package can handle all kinds of barcodes, as long as the barcodes have a
  pattern which can be matched by a regular expression, and each barcode is
  within a single sequencing read. It can handle barcodes with flexible length,
  and barcodes with UMI (unique molecular identifier).

  This tool can also be used for the pre-processing part of amplicon data
  analysis such as CRISPR gRNA screening, immune repertoire sequencing and meta
  genome data.

**What can the package do?**

  The package provides functions for 1). Sequence quality control and
  filtering, 2). Barcode (and UMI) extraction from sequencing reads, 3). Sample
  and barcode management with metadata, 4). Barcode filtering.


## About function naming

Most of the functions in this packages have names with `bc_` as initiation. We
hope it can facilitate the syntax auto-complement function of IDE (integrated
development toolkit) or IDE-like tools such as RStudio, R-NVIM (in VIM), and
ESS (in Emacs). By typing `bc_` you can have a list of suggested functions,
then you can pick the function you need from the list.

TODO: the function brain-map

## About test data set

The test data set in this package can be accessed by

```{r eval=FALSE}
system.file("extdata", "mef_test_data", package="CellBarcode")
```

The data are from Jos et. al (TODO: citation). There are 7 mouse embryo
fibroblast (MEF) lines with different DNA barcodes. The barcodes are in vivo
inducible VDJ barcodes (TODO: add citation when have). These MEF lines were
mixed in a ratio of 1:2:4:8:16:32:64.

| sequence                          | clone size 2^x |
| ---                               | ---            |
| AAGTCCAGTTCTACTATCGTAGCTACTA      | 1              |
| AAGTCCAGTATCGTTACGCTACTA          | 2              |
| AAGTCCAGTCTACTATCGTTACGACAGCTACTA | 3              |
| AAGTCCAGTTCTACTATCGTTACGAGCTACTA  | 4              |
| AAGTCCATCGTAGCTACTA               | 5              |
| AAGTCCAGTACTGTAGCTACTA            | 6              |
| AAGTCCAGTACTATCGTACTA             | 7              |

Then 5 pools of 196 to 50000 cells were prepared from the MEF lines mixture.
For each pool 2 technical replicates (NGS libraries) were prepared and
sequenced, finally resulting in 10 samples. 

| sample name | cell number | replication |
| ---         | ---         | ---         |
| 195_mixa    | 195         | mixa        |
| 195_mixb    | 195         | mixb        |
| 781_mixa    | 781         | mixa        |
| 781_mixb    | 781         | mixb        |
| 3125_mixa   | 3125        | mixa        |
| 3125_mixb   | 3125        | mixb        |
| 12500_mixa  | 12500       | mixa        |
| 12500_mixb  | 12500       | mixb        |
| 50000_mixa  | 50000       | mixa        |
| 50000_mixb  | 50000       | mixb        |

The original FASTQ files are relatively large, so only 2000 reads for each sample
have been randomly sampled as a test set here.

```{r smaples}
example_data <- system.file("extdata", "mef_test_data", package = "CellBarcode")
fq_files <- dir(example_data, "fastq.gz", full=TRUE)

# prepare metadata for the samples
metadata <- stringr::str_split_fixed(basename(fq_files), "_", 10)[, c(4, 6)]
metadata <- as.data.frame(metadata)
sample_name <- apply(metadata, 1, paste, collapse = "_")
colnames(metadata) = c("cell_number", "replication")
# metadata should has the row names consistent to the sample names
rownames(metadata) = sample_name
metadata
```

# Installation

Install from Bioconductor.

```{r installation Bioconducotr, eval=FALSE}
if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("CellBarcode")
```

Install the development version from Github.

```{r installation devel, eval=FALSE}
# install.packages("remotes")
remotes::install_github("wenjie1991/CellBarcode")
```

# A basic workflow

Here is an example of a basic workflow:

```{r basic_workflow}
# install.packages("stringr")
library(CellBarcode)
library(magrittr)

# The example data is the mix of MEF lines with known barcodes
# 2000 reads for each file have been sampled for this test dataset
# extract UMI barcode with regular expression
bc_obj <- bc_extract(
  fq_files,  # fastq file
  pattern = "([ACGT]{12})CTCGAGGTCATCGAAGTATC([ACGT]+)CCGTAGCAAGCTCGAGAGTAGACCTACT", 
  pattern_type = c("UMI" = 1, "barcode" = 2),
  sample_name = sample_name,
  metadata = metadata
)
bc_obj

# sample subset operation, select technical repeats 'mixa'
bc_sub = bc_subset(bc_obj, sample=replication == "mixa")
bc_sub 

# filter the barcode, UMI barcode amplicon >= 2 & UMI counts >= 2
bc_sub <- bc_cure_umi(bc_sub, depth = 2) %>% bc_cure_depth(depth = 2)

# select barcodes with a white list
bc_2df(bc_sub)
bc_sub[c("AAGTCCAGTACTATCGTACTA", "AAGTCCAGTACTGTAGCTACTA"), ]

# export the barcode counts to data.frame
head(bc_2df(bc_sub))

# export the barcode counts to matrix
head(bc_2matrix(bc_sub))
```


# Sequence quality control

## Evaluation

In a full analysis starting from fastq files, the first step is to check the
seqencing quality and filter as required. The `bc_seq_qc` function is for
checking the sequencing quality. If multiple samples are input the output is a
`BarcodeQcSet` object, otherwise a `BarcodeQC` object will be returned. In
addition, `bc_seq_qc` also can handle the `ShortReadQ`, `DNAStringSet` and
other data types.

```{r quality_control_1}
qc_noFilter <- bc_seq_qc(fq_files)
qc_noFilter
bc_names(qc_noFilter)
class(qc_noFilter)
```

The `bc_plot_seqQc` function can be invoked with a `BarcodeQcSet` as argument,
and the output is a QC summary with two panels. The first shows the ratio of
ATCG bases for each sequencing cycle with one sample per row; this allows the
user to, for example, identify constant or random parts of the sequencing read.
The second figure shows the average sequencing quality index of each cycle
(base). 

For the test set, the first 12 bases are UMI, which are random. This is
followed by the constant region of the barcode (the PCR primer selects reads
with this sequence), and here we observe a specific base for each cycle across
all the samples.


```{r qc_figure_set1}
bc_plot_seqQc(qc_noFilter) 
```

We can also plot one of the `BarcodeQc` in the `BarcodeQcSet` object. In the
output, there are three panels. The top left one shows the reads depth 
distribution, the top right figure shows the "ATCG" base ratio by each
sequencing cycle, and the last one shows the sequencing quality by sequencing
cycle.

```{r qc_figure_single1}
qc_noFilter[1]
class(qc_noFilter[1])
bc_plot_seqQc(qc_noFilter[1]) 
```

## Filtering

`bc_seq_filter` reads in the sequence data and applies filters, then returns a
`ShortReadQ` object which contains the filtered sequences. 

The `bc_seq_filter` function can read fastq files, and it can also handle
sequencing reads in `ShortReadQ`, `DNAStringSet` and `data.frame`.

The currently available filter parameters are: 
- min_average_quality: average base sequencing quality across read.
- min_read_length: minimum number of bases per read.
- N_threshold: maximum number of "N" bases in sequence.
 
```{r filter_low_quality_seq}
# TODO: output the filtering percentage
# TODO: Trimming
fq_filter <- bc_seq_filter(
  fq_files,
  min_average_quality = 30,
  min_read_length = 60,
  sample_name = sample_name)

fq_filter
bc_plot_seqQc(bc_seq_qc(fq_filter))
```

# Parse reads

One of the core applications of this package is parsing the sequences to get
the barcode (and UMI). Our package uses regular expressions to identify
barcodes (and UMI) from sequencing reads. This is how we tell `bc_extract` the
structure of the input sequences.

3 arguments are necessary for `bc_extract`, they are:
- x: the sequence data, it can be in fastq, `ShortReadQ`, `DNAStringSet` or
  `data.frame` format.
- pattern: the sequence pattern regular expression.
- pattern_type: pattern description.

The `pattern` argument is the regular expression, it tells the function where
to find the barcode (or UMI). We capture the barcode (or UMI) by `()` in the
backbone. For the sequence captured by `()`, the `pattern_type` argument tells
which is the UMI or the barcode. In the example

```{r}
pattern <- "([ACGA]{12})CTCGAGGTCATCGAAGTATC([ACGT]+)CCGTAGCAAGCTCGAGAGTAGACCTACT"
pattern_type <- c("UMI" = 1, "barcode" = 2)
```

1. The sequence starts with 12 base pairs of random sequence, which is UMI. It
   is the first barcode captured by `()` in the `pattern` argument, and corresponds to
   `UMI = 1` in the `pattern_type` argument.
2. Then, there is a known constant sequence: `CTCGAGGTCATCGAAGTATC`.
3. Following the constant region, there is flexible length random sequence. This is the
   barcode which is trapped by second `()`, and it is defined by `barcode = 2`
   in the `pattern_type` argument.
4. At the end of the sequence, there is another constant sequence
   `CCGTAGCAAGCTCGAGAGTAGACCTACT`.

In the regular expression, the UMI pattern is retrieved with `[ACGT]{12}`. The
`[ACGT]` means to match "A", "C", "G" or "T", and the `{12}` means match 12
`[ACGT]`. In the barcode pattern `[ACGT]+`, again `[ACGT]` means match "A",
"C", "G" or "T" and the `+` says to match at least one `[ACGT]`.

The `bc_extract` function is used to extract the barcode(s) from the sequences.
It returns a `BarcodeObj` object if the input is a list or a vector of Fastq
files. The `BarcodeObj` created by `bc_extract` is a R S4 class with three
slots: `messyBc`, `metadata` and `cleanBc` (which is NULL in the `bc_extract`
output). They can be accessed by `@` operator or corresponding accesors:
  - `bc_messyBc`: return the `messyBc` slot.
  - `bc_cleanBc`: return the `cleanBc` slot.
  - `bc_meta`: return the `metadata` slot.

`messyBc` is a list, where each element is a `data.table` corresponding to the
successive samples. Each `data.table` has 5 columns:

1. reads_seq: full read sequence before parsing.
2. match_seq: the sequence matched by pattern.
3. umi_seq (optional): UMI sequence, applicable when there is a UMI in `pattern`
   and `pattern_type` argument.
4. barcode_seq: barcode sequence.
5. count: the count of the full read sequence.

**Attention**: In the `data.table`, `barcode_seq` value may be not unique, as
two different full read sequences can contain the same barcode sequence, due to
the UMI or mutations in the constant region.

If the input to `bc_extract` is just a sample, the output is a single
`data.frame` with the 5 columns 1). `reads_seq`, 2). `match_seq`, 3).
`umi_seq`, 4). `barcode_seq` and 5). `count`, as described above.

The sequence in `match_seq` is a contiguous segment of the full read given in
`reads_seq`. The `umi_seq` and `barcode_seq` are contiguous segments of
`match_seq`. Take note that, the `reads_seq` is the unique id for each row. The
`match_seq`, `umi_seq` or `barcode_seq` can be duplicated, due to the potential
variation in the region outside of `match_seq`. Please keep this in mind when
you use data in `$messyBc` to perform the analysis.


## Sequencing without UMI 

In the following example, only a barcode is extracted.

```{r extract_barcode_no_UMI}
pattern <- "CTCGAGGTCATCGAAGTATC([ACGT]+)CCGTAGCAAGCTCGAGAGTAGACCTACT"
bc_obj <- bc_extract(
  fq_filter,
  sample_name = sample_name,
  pattern = pattern,
  pattern_type = c(barcode = 1))

bc_obj
names(bc_messyBc(bc_obj)[[1]])
```

Here the regular expression matches a constant sequence at the beginning and
the end and the barcode in `()` matches at least one of any character.


## Sequencing with UMI

In the following example, both UMI and barcode are extracted. The regular
expression is explained above.

```{r extract_barcode_with_UMI}
pattern <- "([ACGA]{12})CTCGAGGTCATCGAAGTATC([ACGT]+)CCGTAGCAAGCTCGAGAGTAGACCTACT"
bc_obj_umi <- bc_extract(
  fq_filter,
  sample_name = sample_name,
  pattern = pattern,
  maxLDist = 0,
  pattern_type = c(UMI = 1, barcode = 2))

class(bc_obj_umi)
bc_obj_umi
```

## Metadata updated

`bc_extract` added two columns named "row_read_count" and "barcode_read_count"
to the metadata slot of the returned BarcodeObj object.

**row_read_count**: Total raw reads number of each sample.
**barcode_read_count**: The number of reads that contain the barcodes.

You can use the ratio of `barcode_read_count` versus `raw_read_count` to check
the successfulness of the sequencing or correctness of the pattern given to the
`bc_extract`.

```{r, fig.width=5, fig.height=5}
# select two samples from bc_obj_umi
bc_obj_umi_sub <- bc_obj_umi[, c("781_mixa", "781_mixb")]
# get the metadata matrix
(d <- bc_meta(bc_obj_umi_sub))
# use the row name of the metadata, which contains the sample names
d$sample_name <- rownames(d)

d$barcode_read_count / d$raw_read_count
# visualize
ggplot(d) + 
    aes(x=sample_name, y=barcode_read_count / raw_read_count) + 
    geom_bar(stat="identity")
```

# Data management

Besides, we provide operators to handle the barcodes and samples in `BarcodeObj`
object. You can easily select one or several samples by their names,
indices or metadata.


Select slot by accesors:

```{r}
# Access messyBc slot
head(bc_messyBc(bc_obj_umi)[[1]], n=2)
# return a data.frame
head(bc_messyBc(bc_obj_umi, isList=FALSE), n=2)

# Access cleanBc slot
# return a data.frame
head(bc_cleanBc(bc_obj_umi, isList=FALSE), n=2)
```

Select sample by sample names

```{r}
bc_obj_umi_sub <- bc_obj_umi[, c("781_mixa", "781_mixb")]
bc_names(bc_obj_umi_sub)
```

Set metadata

```{r}
bc_meta(bc_obj_umi_sub)$rep <- c("a", "b")
bc_meta(bc_obj_umi_sub)
```

Select sample by metadata

```{r}
bc_subset(bc_obj_umi_sub, sample = rep == "a")
```

# Barcode filtering

Most of the times, it needs PCR and NGS to read out the cellular barcode
sequences. `bc_extract` will output all barcodes found in the sequences. Some
of the identified barcodes may contain PCR or sequencing errors. 

The potential errors derived from PCR and NGS lead to spurious barcodes that
not existed in biological samples. The spurious barcodes are more likely to be
less abundant comparing to corresponding "mother" barcodes they derived from.

As UMI can be used to label a DNA molecular, one UMI labeled barcode molecular
becomes multiple copies by PCR. Thus all the sequences derived from the
template sequence, including original template sequence and mutant ones, are
marked by UMI for having the same UMI. The original template sequence is likely
having more reads comparing to the spurious one derived from PCR or sequencing
mutation, as errors happens with low probability. Also, a barcode sequence is
less likely to be spurious one when it relates to several UMIs. 

We created the `bc_cure_*` functions to perform filtering for removing the
potential spurious barcodes. The `bc_cure_*` functions create or update the
`cleanBc` slot in `BarcodeObj`. The `cleanBc` slot contains 2 columns
  - barcode_seq: barcode sequence.
  - counts: reads count, or UMI count in the case that the cleanBc was created
    by `bc_cure_umi`.

**Important**: The `createBc` slot, the barcode_seq is not duplicated in each sample.

In the `bc_cure_*` function family, there are `bc_cure_depth`, `bc_cure_umi`
and `bc_cure_cluster`.


## Filter UMI-barcode tag

In the case when the UMI is applied, the template sequence is marked by UMI,
and we use "UMI-barcode tag" to denote a combination of a UMI and a barcode.
The UMI-barcode tag with few reads are likely deriving from PCR or sequence
errors. `bc_cure_umi` carries out the filtering based on the UMI-barcode tag
read count from the `messyBc` slot in BarcodeObj object, and returns a updated
BarcodeObj object with a `cleanBc` slot containing the barcodes passing the
filtering.

```{r correct_barcodde_with_UMI}
# Filter the barcodes with UMI-barcode tag >= 1, 
# and treat UMI as absolute unique and do "fish"
bc_obj_umi_sub <- bc_cure_umi(
    bc_obj_umi_sub, depth = 1, 
    isUniqueUMI = TRUE, 
    doFish = TRUE)
bc_obj_umi_sub
```

The available arguments of `bc_cure_umi` are: 

- depth: minimum read count required for a UMI.
- doFish: if true, for barcodes with UMI read depth above the threshold, “fish”
  for identical barcodes with UMI read depth below the threshold. The
  consequence of “doFish” will not increase the number of identified barcodes,
  but the UMI counts will increase due to including the low depth UMI barcodes. 
- isUniqueUMI: one UMI sequence may be linked to several barcodes. Do you
  believe the UMI is absolutely unique? If yes, we treat the UMI as absolutely
  unique makers. Thus the most abundant barcode will be picked for a UMI, and
  less abundant barcodes with the same UMI are obsolete.

## Filter by count

`bc_cure_depth` performs filtering by reads/UMI count. It can filter the raw
barcodes in the `messyBc` and create a `cleanBc` slot , or update the
`cleanBc` when the argument `isUpdate` is TRUE. You should set this argument to
`TRUE`, when you want apply the filtering on the UMI count with the
`bc_cure_umi` output. In this case, `bc_cure_depth` will update the `cleanBc`
slot created by `bc_cure_umi`.

The function has two arguments:

- depth: sequence/UMI count threshold, it can be a numeric number or a numeric
  vector, in the later case, each number corresponds to a sample in the
  BarcodeObj object.
- isUpdate: if true (default) the bc_cure will preferentially perform filtering
  on the `cleanBc` slot and update it, otherwise the `messyBc` will be used as
  input. 
 
```{r correct_barcodde_with_count}
# Apply the barcode sequence depth with depth >= 3
# With isUpdate = FLASE, the data in `messyBc` slot of bc_obj_umi_sub
#   will be used for depth filtering. The UMI information will be discarded, 
#   the identical barcodes in different UMI-barcode tags are merged before
#   performing the sequence depth filtering.
bc_obj_umi_sub <- bc_cure_depth(bc_obj_umi_sub, depth = 3, isUpdate = FALSE)
bc_obj_umi_sub

# Apply the UMI count filter, keep barcode >= 3 UMI
# The `bc_cure_umi` function applies the filtering on the UMI-barcode tags,
#   and create a `cleanBc` slot in the return BarcodeObj object. Then, 
#   the `bc_cure_depth` with `isUpdate` argument TRUE will apply the filtering
#   on the UMI counts in `cleanBc` and updated the `cleanBc`.
bc_obj_umi_sub <- bc_cure_umi(
    bc_obj_umi_sub, depth = 1, 
    isUniqueUMI = TRUE, 
    doFish = TRUE)
bc_obj_umi_sub
bc_obj_umi_sub <- bc_cure_depth(bc_obj_umi_sub, depth = 3, isUpdate = TRUE)
bc_obj_umi_sub
```

## Cluster and merge

The sequences with more reads have more chance to be the original templates. In
contrast,the sequences with few reads are more likely derived from mutations
of the most abundant sequence. Thus, the spurious sequence might be identified
by comparing the most abundant sequence to the least one. If they are similar,
the least abundant sequence will be merged to the most abundant one.
    
`bc_cure_cluster` performs the clustering to merge the barcodes with
insufficient depth (or UMI counts) to the most abundant ones by similarity, it
needs the `cleanBc` slot and will update it.

To control the clustering methods and threshold for merging you need the
following arguments:

- dist_thresh: a single integer or vector of integers with the length of sample
  number, specifying the editing distance threshold of merging two similar
  barcode sequences. If the input is a vector, each value in vector is for
  one sample according to the sample order in BarcodeObj object.
- dist_method: A  character string, specifying the distance algorithm for
  evaluating barcodes similarity. It can be "hamm" for Hamming distance or
  "leven" for Levenshtein distance.
- merge_method: A character string specifying the algorithm used to perform the
  clustering merging of barcodes. Currently only "greedy" is available, in this
  case, the least abundant barcode is preferentially merged to the most
  abundant ones.
- param count_threshold: An integer, read depth threshold to consider a
  barcode as a true barcode, when when a barcode with count higher than this
  threshold it will not be merged into more abundant barcode.
- dist_costs: A list, the cost of the events when calculating distance between
  two barcode sequences, applicable when Levenshtein distance is applied. The
  names of vector have to be “insert”, “delete” and “replace”, specifying the
  weight of insertion, deletion, replacement events respectively. The default
  cost for each event is 1.

```{r correct_barcodde_clustering}
# Do the clustering and merging the least abundant barcodes to the similar
# abundant ones
bc_cure_cluster(bc_obj_umi_sub)
```

# Barcode count distribution

We provides `bc_plot_single`, `bc_plot_mutual` and `bc_plot_pair` functions for
helping exploring the barcode count distribution for single sample or between
two samples.

## Single sample

`bc_plot_single` can be used for exploring barcode count distribution sample
wise. It uses the `cleanBc` slot in the BarcodeObj bc_obj_umi_sub.

```{r, fig.width=5}
bc_plot_single(bc_obj_umi_sub)
```

The `bc_plot_single` function provides arguments for showing the potential
cutoff point and highlighting specific barcodes.

```{r, fig.width=5}
# re-do the filtering using depth threshold 0 to include all barcodes.
bc_obj_umi_sub_neo <- bc_cure_depth(bc_obj_umi_sub, depth=0, isUpdate=FALSE)

# you can use the count_marks argument to display the cutoff points in the figure
# and the highlight argument to highlight specific barcodes.
bc_plot_single(bc_obj_umi_sub_neo, count_marks=10, 
    highlight= c("AAGTCCAGTACTATCGTACTA", "AAGTCCAGTACTGTAGCTACTA"))
```

## Pairwise

`bc_plot_mutual` and `bc_plot_pair` are designed for comparing the barcodes
between two samples. 

The `bc_plot_mutual` generates a scatter plot matrix which contains all the
pairwise sample combination in the provided BarcodeObj object.

```{r, fig.height=7}
# create a new BarcodeObj for following visualization
# use depth as 0 to include all the barcodes.
bc_obj_umi_neo <- bc_cure_depth(bc_obj_umi[, 1:4], depth=0)
# you can set the count_marks to display the cutoff point
# and highlight specific barcodes dots by highlight
bc_plot_mutual(bc_obj_umi_neo, count_marks=c(10, 20, 30, 40), 
    highlight= c("AAGTCCAGTACTATCGTACTA", "AAGTCCAGTACTGTAGCTACTA"))
```

And the `bc_plot_pair` only draws the scatter plot for the given sample pairs. 

```{r, fig.width=5}
# create a new BarcodeObj for following visualization
# use depth as 0 to include all the barcodes.
bc_obj_umi_neo <- bc_cure_depth(bc_obj_umi[, 1:4], depth=0)

# 2d scatters plot with x axis of sample_x and y axis of sample_y
# sample_x, and sample_y can be the sample name or sample index
bc_plot_pair(
    bc_obj_umi_neo, 
    sample_x = c("50000_mixa"),
    sample_y = c("50000_mixb", "12500_mixa", "195_mixb"), 
    count_marks_x = 10,
    count_marks_y = c(10, 20, 30),
    highlight= c("AAGTCCAGTACTATCGTACTA", "AAGTCCAGTACTGTAGCTACTA")
)
```

# Miscellaneous

We provides functions to transform the barcode information in `BarcodeObj` to
more general R data types.

## Sample names

```{r}
bc_names(bc_obj_umi_sub)
```

## Output to data.frame

`bc_2df` function uses the barcode and count info in the `cleanBc` slot,
outputs a data.frame contains: 
  - barcode_seq: barcode sequence
  - sample_name
  - count: reads or UMI count
 
```{r}
bc_2df(bc_obj_umi_sub)
```

Or if you prefer `data.table`

```{r}
bc_2dt(bc_obj_umi_sub)
```

## Output to matrix

`bc_2matrix` uses barcode and count information in `cleanBc` slot to create
reads count or UMI count matrix, with barcodes in rows and samples in columns.

```{r misc}
bc_2matrix(bc_obj_umi_sub)
```

## More

You can use:

- `+`: to combine two BarcodeObj objects.
- `-`: to remove barcodes in a black list.
- `*`: only include barcodes in a white list.

For examples:

```{r}
data(bc_obj)

# Join two samples with different barcodes 
bc_obj["AGAG", "test1"] + bc_obj["AAAG", "test1"]

# Join two samples with shared barcodes
bc_obj_join <- bc_obj["AGAG", "test1"] + bc_obj["AGAG", "test1"]
bc_obj_join

# In this case, the shared barcodes are not merged.
# Applying bc_cure_depth() to merge them.
bc_cure_depth(bc_obj_join)

# Remove barcodes
bc_obj - "AAAG"

# Select barcodes in white list
bc_obj * "AAAG"
```

What's more, by combining several functions, it is possible to accomplish more
complex task. In the following example, a barcode from sample "781_mixa" is
selected , then output the result in `data.frame` format.

```{r}
bc_2df(bc_obj_umi_sub[bc_barcodes(bc_obj_umi_sub)[1], "781_mixa"])
                  ## 1. Use `bc_barcodes` to pull out all the barcodes in two
                  ##    samples, and choose the fist barcode.
       ## 2. Select the barcode got in step 1, and the sample named "781_mixa".
## 3. Convert the BarcodeObj object to a data.frame. 
```

# Session Info

```{r}
sessionInfo()
```