--- title: "scPipe: a flexible data preprocessing pipeline for single-cell data" author: "Shani Amarasinghe, Phil Yang" date: "`r Sys.Date()`" output: BiocStyle::html_document: toc_float: true vignette: > %\VignetteIndexEntry{scPipe: a flexible data preprocessing pipeline for single-cell data} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # Introduction `scPipe` is a package initially designed to process single-cell RNA-sequencing (scRNA-seq) data generated by different protocols. We have modified it to accommodate pre-processing capability of single-cell ATAC-Seq (Assay for Transposase-Accessible Chromatin using sequencing) data pre-processing. `scPipe` ATAC-Seq module is designed for protocols without UMIs, but can also adapt to any UMI protocols if the need arise. `scPipe` ATAC-Seq module consists of two major components. The first is data pre-processing with barcodes as raw fastq as input and a feature count matrix as output. Second is the data pre-processing with barcode CSV file as input and a feature count matrix as output. `10X ATAC` method currently is the most popular method to generate scATAC-Seq data with higher sensitivity and lower cost. The structure of the 10X ATAC library is shown below. The output fastq files from a `10X ATAC` experiment is paired-ended and data is contained within both reads. # Workflow The pipeline is visually described via the workflow diagram depicted below. The functions will be explained in greater detail. ```{r, echo=FALSE, warning=FALSE, message=FALSE, out.width="100%"} knitr::include_graphics(system.file("scATAC-seq_workflow.png", package = "scPipe")) ``` # Getting started It is not mandatory to specify an output folder even though it can be specified. If no `output_folder` is defined a folder named `scPipe-atac-output` will get created in the working directory. We begin by loading the library. ```{r, message=FALSE} library(scPipe) data.folder <- system.file("extdata", package = "scPipe", mustWork = TRUE) ``` Specify the output folder. ```{r} output_folder <- paste0(tempdir(), "/scPipeATACVignette") ``` # Fastq reformatting To process the data, we need the `fastq` files (both compressed and uncompressed versions are accepted) and a cell barcode annotation. We have supplied very small sample FASTQ files of chr21. The barcode annotation could be either in a `.fastq` format or a `.csv` file with at least two columns, where the first column has the cell id and the second column contains the barcode sequence. All these files can be found in the `data` folder of the `scPipe` package: ```{r} # data fastq files r1 <- file.path(data.folder, "small_chr21_R1.fastq.gz") r2 <- file.path(data.folder, "small_chr21_R3.fastq.gz") # barcodes in fastq format barcode_fastq <- file.path(data.folder, "small_chr21_R2.fastq.gz") # barcodes in .csv format barcode_1000 <- file.path(data.folder, "chr21_modified_barcode_1000.csv") ``` The pipeline starts with fastq file reformatting. We move the barcode and UMI sequences (if available) to the read name and leave the transcript sequence as is. This outputs a read name that looks like `@[barcode_sequence]_[UMI_sequence]#[readname]`. Usually the scATAC-Seq data is paired-end and a 16bp long barcode is located on both reads. Here the barcode information is located on a separate fastq file and the length of the barcode fastq file matches the length of the reads files. Therefore, you need a minimal example like below to generate the output. ```{r} sc_atac_trim_barcode (r1 = r1, r2 = r2, bc_file = barcode_fastq, rmN = FALSE, rmlow = FALSE, output_folder = output_folder) ``` This generates two output fastq files that are appended by the prefix `demux_` to notify that the new files are the reformatted (a.k.a. demultiplexed) `.fastq` files. These will get saved in the `scPipe-atac-output` directory if the user has not specified an `output_folder`. However, if the barcodes are in the form of a `.csv` file, some extra information on 0-indexed barcode start (`id1_st`, `id2_st`) and the barcode length (`id1_len`, `id2_len`) are also required to be entered by the user. The algorithm is flexible to do a "look-around" to identify whether the correct parameters are used for a subset of data (hence saving time) and report back to the console if it believes the barcode position is incorrect and/or should be shifted. ```{r,eval=FALSE} sc_atac_trim_barcode (r1 = r1, r2 = r2, bc_file = barcode_1000, id1_st = -1, id1_len = -1, id2_st = -1, id2_len = -10, output_folder = output_folder, rmN = TRUE) ``` To accommodate combinatorial indexing in some scATAC-seq protocols, the `bc_file` parameter will accept a list of barcode files as well (currently only implemented for the `fastq` approach). Additionally, `rmN`, `rmlow` and `min_qual` parameters define the quality thresholds for the reads. If there are `Ns` in the barcode or UMI positions those will be discarded by `rmN = TRUE`. `rmlow =TRUE` will remove reads having lower quality than what is defined by `min_qual` (default: 20). Completion of this function will output three different outputs depending on the findings; * complete matches: When the barcode is completely matched and identified in the correct position * partial matches: When the barcode is identified in the location specified but corrected with hamming distance approach * unmatched: no barcode match is found in the given position even after hamming distance corrections are applied **NOTE**: we use a zero based index system, so the indexing of the sequence starts at zero. # Aligning reads to a reference genome Next, we align reads to the genome. This example uses `Rsubread` but any aligner that support RNA-seq alignment and gives standard BAM output can be used here. ```{r} demux_r1 <- file.path(output_folder, "demux_completematch_small_chr21_R1.fastq.gz") demux_r2 <- file.path(output_folder, "demux_completematch_small_chr21_R3.fastq.gz") reference = file.path(data.folder, "small_chr21.fa") aligned_bam <- sc_aligning(ref = reference, R1 = demux_r1, R2 = demux_r2, nthreads = 12, output_folder = output_folder) ``` # Demultiplexing the BAM file Next, the BAM file needs to be modified in a way one or two new columns are generated for the cellular barcode tag and the molecular barcode (i.e. UMI) tag denoted by `CB:Z:` and `OX:Z:`, respectively. ```{r} sorted_tagged_bam <- sc_atac_bam_tagging (inbam = aligned_bam, output_folder = output_folder, bam_tags = list(bc="CB", mb="OX"), nthreads = 12) ``` # Remove duplicates Next, PCR duplicate reads are removed from the BAM file. If `samtools` is installed, then `sc_atac_remove_duplicates` can be used, with the installation path of `samtools` being an optional argument if a particular version is preferred. However, if `samtools` isn't installed, the PCR duplicate removal should be performed externally. ```{r, eval=FALSE} removed <- sc_atac_remove_duplicates(sorted_tagged_bam, output_folder = output_folder) if (!isFALSE(removed)) sorted_tagged_bam <- removed ``` # Gemerating a fragment file Next, a `fragment file` in a `.bed` format is generated, where each line represents a unique fragment generated by the assay. This file is used to generate useful quality control statistics, as well as for the `filter` and `cellranger` cell calling methods. ```{r} sc_atac_create_fragments(inbam = sorted_tagged_bam, output_folder = output_folder) ``` # Peak calling Similar to many other tools, the the ability to call peaks on a pseudo-bulk level on the demultiplexed reads has also been incorporated. `MACS3` is used underneath to achieve this functionality. If the genome size is not provided then it will be roughly estimated. ```{r, eval=FALSE} # CHECK IF THE .narrowPeak file is small enough to include in the package, # otherwise, we need to make this basilisk call work?? sc_atac_peak_calling(inbam = sorted_tagged_bam, ref = reference, genome_size = NULL, output_folder = output_folder) ``` # Assigning reads to features and feature counting After the read alignment and BAM file demultiplexing, a feature set can be used to find the overlap between the aligned reads and the features using the `sc_atac_feature_counting` function. Accepted format of the `feature_input` should be either a BED format (i.e. format should have three main columns; chromosome, feature start and feature end, extension of the file is not considered) or a `genome.fasta` file. If using a BED format as the `feature_input`, the `feature_type` should be either "peak" or "tss". If using a `.fasta` for the `feature_input`, the `feature_type` needs to be defined as `genome_bin`. This `genome.fasta` file will be used within the function to generate a `genome_bin` that defines the array of features. The size of the bins can be set using the `bin_size` parameter (default: 2000). Note: avoid using input BAM files larger than 8GB for best performance ```{r, eval=FALSE} features <- file.path(output_folder, "NA_peaks.narrowPeak") sc_atac_feature_counting (fragment_file = file.path(output_folder, "fragments.bed"), feature_input = features, bam_tags = list(bc="CB", mb="OX"), feature_type = "peak", organism = "hg38", yieldsize = 1000000, exclude_regions = TRUE, output_folder = output_folder, fix_chr = "none" ) ``` If genome_bin approach is selected, the following can be used: ```{r,eval=FALSE} reference <- file.path(data.folder, "small_chr21.fa") sc_atac_feature_counting (fragment_file = file.path(output_folder, "fragments.bed"), feature_input = reference, bam_tags = list(bc="CB", mb="OX"), feature_type = "genome_bin", organism = "hg38", cell_calling = FALSE, yieldsize = 1000000, exclude_regions = TRUE, output_folder = output_folder, fix_chr = "none" ) ``` Call calling is performed on the data prior to finding overlaps. The cell calling methods available are the `emptyDrops` method from [DropletUtils](https://bioconductor.org/packages/release/DropletUtils), `filter` which filters the cells based on various QC cut-offs, and `cellranger` which models the signal and noise as a mixture of two negative binomial distributions, though the `filter` method is recommended as it is most commonly used and the best-suited for scATAC-seq data. For the `filter` cell-calling method, there are a variety of QC metrics that can be used, including: * `min_uniq_frags` * `max_uniq_frags` * `min_frac_peak` * `min_frac_tss` * `min_frac_enhancer` * `min_frac_promoter` * `max_frac_mito` ```{r,eval=FALSE} features <- file.path(output_folder, "NA_peaks.narrowPeak") ``` ```{r, eval=TRUE, echo=FALSE} features <- file.path(data.folder, "NA_peaks.narrowPeak") ``` ```{r} sc_atac_feature_counting (fragment_file = file.path(output_folder, "fragments.bed"), feature_input = features, bam_tags = list(bc="CB", mb="OX"), feature_type = "peak", organism = "hg38", cell_calling = "filter", min_uniq_frags = 0, min_frac_peak = 0, min_frac_promoter = 0, yieldsize = 1000000, exclude_regions = TRUE, output_folder = output_folder, fix_chr = "none" ) ``` Mapping quality (MAPQ) value can be set to filter data further in step (default: 30). Additionally, regions that are considered anomalous, unstructured, or high signal in next-generation sequencing experiments are excluded using an inbuilt `excluded_regions` file (available are for `organism` `hg19`, `hg38`, and `mm10`) or a user provided `excluded_regions_filename` file in BED format. The discrepancy of `chr` between the alignment file and the feature file/excluded regions file can also be fixed (using `fix_char` parameter) if needed by adding the string `chr` to the beginning of either the features, and/or excluded regions. So the options for `fix_char` are "none", "excluded_regions", "feature", "both". **NOTE** `genome_bin` approach may be more reliable in detecting sensitive features than using a pseudo-bulk peak calling approach, hence it would make the function slower as well. The `sc_atac_feature_counting` function generates a matrix format of the feature by cell matrix (features as rows, cell barcodes as columns) in several formats (raw, sparse, binary, jaccard) that can be used downstream to generate a `singleCellExperiment, SCE` object. ```{r} feature_matrix <- readRDS(file.path(output_folder, "unfiltered_feature_matrix.rds")) dplyr::glimpse(feature_matrix) sparseM <- readRDS(file.path(output_folder, "sparse_matrix.rds")) dplyr::glimpse(sparseM) ``` # Generating the *Single-cell Experiment (SCE)* object Easiest way to generate a *SCE* object is to run `sc_atac_create_sce` with the `input_folder` parameter. However, if the default dir name was used for previous steps simply running `sc_atac_create_sce()` would produce a `sce` object in the default location (i.e. `scPipe-atac-output`) ```{r} sce <- sc_atac_create_sce(input_folder = output_folder, organism = "hg38", feature_type = "peak", pheno_data = NULL, report = FALSE) ``` We have now completed the preprocessing steps. The feature count matrix is available as a `.rds` file in `scPipe-atac-output/feature_matrix.rds` and quality control statistics are saved in the `scPipe-atac-output/scPipe_atac_stats` folder. This data is useful for later quality control as well (QC). If the report parameter is set to `TRUE` then an HTML report with various statistics and plots is generated. A partial screenshot of the report is shown below. ```{r, echo=FALSE, warning=FALSE, message=FALSE, out.width="100%"} knitr::include_graphics(system.file("report_example.png", package = "scPipe")) ``` # A convenience function for running the whole pipeline A function `sc_atac_pipeline` has been included to make it easier to run the whole pipeline. ```{r eval = FALSE} data.folder <- system.file("extdata", package = "scPipe", mustWork = TRUE) output_folder <- tempdir() sce <- sc_atac_pipeline(r1 = file.path(data.folder, "small_chr21_R1.fastq.gz"), r2 = file.path(data.folder, "small_chr21_R3.fastq.gz"), barcode_fastq = file.path(data.folder, "small_chr21_R2.fastq.gz"), organism = "hg38", reference = file.path(data.folder, "small_chr21.fa"), feature_type = "peak", remove_duplicates = FALSE, min_uniq_frags = 0, # set to 0 as the sample dataset only has a small number of reads min_frac_peak = 0, min_frac_promoter = 0, output_folder = output_folder) ``` # Downstream analysis Since the [scater](https://bioconductor.org/packages/release/scater) and [scran](https://bioconductor.org/packages/release/scran) packages both use the [SingleCellExperiment](https://bioconductor.org/packages/release/SingleCellExperiment) class, it will be easy to further process this data using these packages for normalization and visualization. Other packages such as [SC3](https://bioconductor.org/packages/release/SC3) may be useful for clustering and [MAST](https://bioconductor.org/packages/release/MAST) and [edgeR](https://bioconductor.org/packages/release/edgeR) for differential expression analysis. ```{r, echo=FALSE, results='hide'} # cleanup tempfiles unlink(output_folder, recursive=TRUE) ``` # Session Information {-} ```{r} sessionInfo() ```