---
title: "scPipe: a flexible data preprocessing pipeline for scATAC-seq data"
author: "Shani Amarasinghe, Phil Yang"
date: "`r Sys.Date()`"
output: 
    BiocStyle::html_document:
        toc_float: true
vignette: >
  %\VignetteIndexEntry{scPipe: a flexible data preprocessing pipeline for scATAC-seq data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Introduction

`scPipe` is a package initially designed to process single-cell RNA-sequencing (scRNA-seq) data generated by different protocols. We have modified it to accommodate pre-processing capability of single-cell ATAC-Seq (Assay for Transposase-Accessible Chromatin using sequencing) data pre-processing. `scPipe` ATAC-Seq module is designed for protocols without UMIs, but can also adapt to any UMI protocols if the need arise.

`scPipe` ATAC-Seq module consists of two major components. The first is data pre-processing with barcodes as raw fastq as input and a feature count matrix as output. Second is the data pre-processing with barcode CSV file as input and a feature count matrix as output.

`10X ATAC` method currently is the most popular method to generate scATAC-Seq data with higher sensitivity and lower cost. The structure of the 10X ATAC library is shown below. 

The output fastq files from a `10X ATAC` experiment is paired-ended and data is contained within both reads.

# Workflow

The pipeline is visually described via the workflow diagram depicted below. The functions will be explained in greater detail.

```{r, echo=FALSE, warning=FALSE, message=FALSE, out.width="100%"}
knitr::include_graphics(system.file("scATAC-seq_workflow.png", package = "scPipe"))
```

# Getting started

It is not mandatory to specify an output folder even though it can be specified. If no `output_folder` is defined a folder named `scPipe-atac-output` will get created in the working directory.

We begin by loading the library.

```{r, message=FALSE}
library(scPipe)
data.folder <- system.file("extdata", package = "scPipe", mustWork = TRUE)

```

Specify the output folder.

```{r}
output_folder <- paste0(tempdir(), "/scPipeATACVignette")
```

# Fastq reformatting

To process the data, we need the `fastq` files (both compressed and uncompressed versions are accepted) and a cell barcode annotation. We have supplied very small sample FASTQ files of chr21.  The barcode annotation could be either in a `.fastq` format or a `.csv` file with at least two columns, where the first column has the cell id and the second column contains the barcode sequence. All these files can be found in the `data` folder of the `scPipe` package:

```{r}
# data fastq files
r1      <- file.path(data.folder, "small_chr21_R1.fastq.gz") 
r2      <- file.path(data.folder, "small_chr21_R3.fastq.gz") 

# barcodes in fastq format
barcode_fastq      <- file.path(data.folder, "small_chr21_R2.fastq.gz") 

# barcodes in .csv format
barcode_1000       <- file.path(data.folder, "chr21_modified_barcode_1000.csv")

```

The pipeline starts with fastq file reformatting. We move the barcode and UMI sequences (if available) to the read name and leave the transcript sequence as is. This outputs a read name that looks like `@[barcode_sequence]_[UMI_sequence]#[readname]`. Usually the scATAC-Seq data is paired-end and a 16bp long barcode is located on both reads. Here the barcode information is located on a separate fastq file and the length of the barcode fastq file matches the length of the reads files. Therefore, you need a minimal example like below to generate the output.

```{r}
sc_atac_trim_barcode (r1            = r1, 
                      r2            = r2, 
                      bc_file       = barcode_fastq, 
                      rmN           = FALSE,
                      rmlow         = FALSE,
                      output_folder = output_folder)
```

This generates two output fastq files that are appended by the prefix `demux_` to notify that the new files are the reformatted (a.k.a. demultiplexed) `.fastq` files. These will get saved in the `scPipe-atac-output` directory if the user has not specified an `output_folder`.

However, if the barcodes are in the form of a `.csv` file, some extra information on 0-indexed barcode start (`id1_st`, `id2_st`) and the barcode length (`id1_len`, `id2_len`) are also required to be entered by the user. The algorithm is flexible to do a "look-around" to identify whether the correct parameters are used for a subset of data (hence saving time) and report back to the console if it believes the barcode position is incorrect and/or should be shifted.

```{r,eval=FALSE}
sc_atac_trim_barcode (r1            = r1, 
                      r2            = r2, 
                      bc_file       = barcode_1000, 
                      id1_st        = -1,  
                      id1_len       = -1,  
                      id2_st        = -1,  
                      id2_len       = -10,  
                      output_folder = output_folder, 
                      rmN           = TRUE)
```

To accommodate combinatorial indexing in some scATAC-seq protocols, the `bc_file` parameter will accept a list of barcode files as well (currently only implemented for the `fastq` approach).

Additionally, `rmN`, `rmlow` and `min_qual` parameters define the quality thresholds for the reads. If there are `Ns` in the barcode or UMI positions those will be discarded by `rmN = TRUE`. `rmlow =TRUE` will remove reads having lower quality than what is defined by `min_qual` (default: 20).

Completion of this function will output three different outputs depending on the findings;
* complete matches: When the barcode is completely matched and identified in the correct position
* partial matches: When the barcode is identified in the location specified but corrected with hamming distance approach
* unmatched: no barcode match is found in the given position even after hamming distance corrections are applied

**NOTE**: we use a zero based index system, so the indexing of the sequence starts at zero.

# Aligning reads to a reference genome

Next, we align reads to the genome. This example uses `Rsubread` but any aligner that support RNA-seq alignment and gives standard BAM output can be used here.

```{r}
demux_r1        <- file.path(output_folder, "demux_completematch_small_chr21_R1.fastq.gz")
demux_r2        <- file.path(output_folder, "demux_completematch_small_chr21_R3.fastq.gz")
reference = file.path(data.folder, "small_chr21.fa")


aligned_bam <- sc_aligning(ref = reference, 
                R1 = demux_r1, 
                R2 = demux_r2, 
                nthreads  = 12,
                output_folder = output_folder)
```

# Demultiplexing the BAM file

Next, the BAM file needs to be modified in a way one or two new columns are generated for the cellular barcode tag and the molecular barcode (i.e. UMI) tag denoted by `CB:Z:` and `OX:Z:`, respectively. 

```{r}

sorted_tagged_bam <- sc_atac_bam_tagging (inbam = aligned_bam, 
                       output_folder =  output_folder, 
                       bam_tags      = list(bc="CB", mb="OX"), 
                       nthreads      =  12)

```

# Remove duplicates

Next, PCR duplicate reads are removed from the BAM file. If `samtools` is installed, then `sc_atac_remove_duplicates` can be used, with the installation path of `samtools` being an optional argument if a particular version is preferred. However, if `samtools` isn't installed, the PCR duplicate removal should be performed externally.

```{r, eval=FALSE}

removed <- sc_atac_remove_duplicates(sorted_tagged_bam,
                          output_folder = output_folder)
if (!isFALSE(removed))
  sorted_tagged_bam <- removed

```


# Gemerating a fragment file

Next, a `fragment file` in a `.bed` format is generated, where each line represents a unique fragment generated by the assay. This file is used to generate useful quality control statistics, as well as for the `filter` and `cellranger` cell calling methods. 

```{r}
sc_atac_create_fragments(inbam = sorted_tagged_bam,
                         output_folder = output_folder)

```


# Peak calling

Similar to many other tools, the the ability to call peaks on a pseudo-bulk level on the demultiplexed reads has also been incorporated. `MACS3` is used underneath to achieve this functionality. If the genome size is not provided then it will be roughly estimated.

```{r, eval=FALSE}
# CHECK IF THE .narrowPeak file is small enough to include in the package,
# otherwise, we need to make this basilisk call work??
sc_atac_peak_calling(inbam = sorted_tagged_bam, 
                     ref = reference,
                     genome_size = NULL,
                     output_folder = output_folder)

```


# Assigning reads to features and feature counting

After the read alignment and BAM file demultiplexing, a feature set can be used to find the overlap between the aligned reads and the features using the `sc_atac_feature_counting` function. 

Accepted format of the `feature_input` should be either a BED format (i.e. format should have three main columns; chromosome, feature start and feature end, extension of the file is not considered) or a `genome.fasta` file. If using a BED format as the `feature_input`, the `feature_type` should be either "peak" or "tss". If using a `.fasta` for the `feature_input`, the `feature_type` needs to be defined as `genome_bin`. This `genome.fasta` file will be used within the function to generate a `genome_bin` that defines the array of features. The size of the bins can be set using the `bin_size` parameter (default: 2000). 

Note: avoid using input BAM files larger than 8GB for best performance

```{r, eval=FALSE}
features          <- file.path(output_folder, "NA_peaks.narrowPeak")

sc_atac_feature_counting (fragment_file = file.path(output_folder, "fragments.bed"),
                          feature_input = features, 
                          bam_tags      = list(bc="CB", mb="OX"), 
                          feature_type  = "peak", 
                          organism      = "hg38",
                          yieldsize     = 1000000,
                          exclude_regions = TRUE,
                          output_folder = output_folder,
                          fix_chr       = "none"
                          )
```

If genome_bin approach is selected, the following can be used:

```{r,eval=FALSE}
reference       <- file.path(data.folder, "small_chr21.fa")
sc_atac_feature_counting (fragment_file = file.path(output_folder, "fragments.bed"),
                          feature_input = reference,
                          bam_tags      = list(bc="CB", mb="OX"),
                          feature_type  = "genome_bin",
                          organism      = "hg38",
                          cell_calling  = FALSE,
                          yieldsize     = 1000000,
                          exclude_regions = TRUE,
                          output_folder = output_folder,
                          fix_chr       = "none"
                          )

```

Call calling is performed on the data prior to finding overlaps. The cell calling methods available are the `emptyDrops` method from [DropletUtils](https://bioconductor.org/packages/release/DropletUtils), `filter` which filters the cells based on various QC cut-offs, and `cellranger` which models the signal and noise as a mixture of two negative binomial distributions, though the `filter` method is recommended as it is most commonly used and the best-suited for scATAC-seq data.

For the `filter` cell-calling method, there are a variety of QC metrics that can be used, including:

* `min_uniq_frags`
* `max_uniq_frags`
* `min_frac_peak`
* `min_frac_tss`
* `min_frac_enhancer`
* `min_frac_promoter`
* `max_frac_mito`


```{r,eval=FALSE}
features          <- file.path(output_folder, "NA_peaks.narrowPeak")
```
```{r, eval=TRUE, echo=FALSE}
features          <- file.path(data.folder, "NA_peaks.narrowPeak")
```
```{r}
sc_atac_feature_counting (fragment_file = file.path(output_folder, "fragments.bed"),
                          feature_input = features, 
                          bam_tags      = list(bc="CB", mb="OX"), 
                          feature_type  = "peak",
                          organism      = "hg38",
                          cell_calling  = "filter",
                          min_uniq_frags = 0,
                          min_frac_peak = 0,
                          min_frac_promoter = 0,
                          yieldsize     = 1000000,
                          exclude_regions = TRUE,
                          output_folder = output_folder,
                          fix_chr       = "none"
                          )
```

Mapping quality (MAPQ) value can be set to filter data further in step (default: 30). Additionally, regions that are considered anomalous, unstructured, or high signal in next-generation sequencing experiments are excluded using an inbuilt `excluded_regions` file (available are for `organism` `hg19`, `hg38`, and `mm10`) or a user provided `excluded_regions_filename` file in BED format.

The discrepancy of `chr` between the alignment file and the feature file/excluded regions file can also be fixed (using `fix_char` parameter) if needed by adding the string `chr` to the beginning of either the features, and/or excluded regions. So the options for `fix_char` are "none", "excluded_regions", "feature", "both".

**NOTE** `genome_bin` approach may be more reliable in detecting sensitive features than using a pseudo-bulk peak calling approach, hence it would make the function slower as well.

The `sc_atac_feature_counting` function generates a matrix format of the feature by cell matrix (features as rows, cell barcodes as columns) in several formats (raw, sparse, binary, jaccard) that can be used downstream to generate a `singleCellExperiment, SCE` object.

```{r}
feature_matrix <- readRDS(file.path(output_folder, "unfiltered_feature_matrix.rds"))
dplyr::glimpse(feature_matrix)

sparseM <- readRDS(file.path(output_folder, "sparse_matrix.rds"))
dplyr::glimpse(sparseM)
```

# Generating the *Single-cell Experiment (SCE)* object

Easiest way to generate a *SCE* object is to run `sc_atac_create_sce` with the `input_folder` parameter. However, if the default dir name was used for previous steps simply running `sc_atac_create_sce()` would produce a `sce` object in the default location (i.e. `scPipe-atac-output`)

```{r}
sce <- sc_atac_create_sce(input_folder = output_folder,
                   organism     = "hg38",
                   feature_type = "peak",
                   pheno_data   = NULL,
                   report       = FALSE)
```

We have now completed the preprocessing steps. The feature count matrix is available as a `.rds` file in `scPipe-atac-output/feature_matrix.rds` and quality control statistics are saved in the `scPipe-atac-output/scPipe_atac_stats` folder. This data is useful for later quality control as well (QC).

If the report parameter is set to `TRUE` then an HTML report with various statistics and plots is generated. A partial screenshot of the report is shown below.

```{r, echo=FALSE, warning=FALSE, message=FALSE, out.width="100%"}
knitr::include_graphics(system.file("report_example.png", package = "scPipe"))
```

# A convenience function for running the whole pipeline

A function `sc_atac_pipeline` has been included to make it easier to run the whole pipeline.

```{r eval = FALSE}
data.folder <- system.file("extdata", package = "scPipe", mustWork = TRUE)
output_folder <- tempdir()
sce <- sc_atac_pipeline(r1 = file.path(data.folder, "small_chr21_R1.fastq.gz"),
                        r2 = file.path(data.folder, "small_chr21_R3.fastq.gz"),
                        barcode_fastq = file.path(data.folder, "small_chr21_R2.fastq.gz"),
                        organism = "hg38",
                        reference = file.path(data.folder, "small_chr21.fa"),
                        feature_type = "peak",
                        remove_duplicates = FALSE,
                        min_uniq_frags = 0, # set to 0 as the sample dataset only has a small number of reads
                        min_frac_peak = 0,
                        min_frac_promoter = 0,
                        output_folder = output_folder)
```


# Downstream analysis

Since the [scater](https://bioconductor.org/packages/release/scater) and [scran](https://bioconductor.org/packages/release/scran) packages both use the [SingleCellExperiment](https://bioconductor.org/packages/release/SingleCellExperiment) class, it will be easy to further process this data using these packages for normalization and visualization. Other packages such as [SC3](https://bioconductor.org/packages/release/SC3) may be useful for clustering and [MAST](https://bioconductor.org/packages/release/MAST) and [edgeR](https://bioconductor.org/packages/release/edgeR) for differential expression analysis.

```{r, echo=FALSE, results='hide'}
# cleanup tempfiles
unlink(output_folder, recursive=TRUE)
```
# Session Information {-}
```{r}
sessionInfo()
```