---
title: "SCOPE: Single-cell Copy Number Estimation"
author: "Rujin Wang, Danyu Lin, Yuchao Jiang"
date: "`r format(Sys.Date())`"
output:
    html_document:
        highlight: pygments
        toc: true
vignette: >
    %\VignetteIndexEntry{SCOPE: Single-cell Copy Number Estimation}
    %\VignetteEngine{knitr::rmarkdown}
    \usepackage[utf8]{inputenc}
---

# 1. Overview of analysis pipeline
## 1.1 Introduction
SCOPE is a statistical framework designed for calling 
copy number variants (CNVs) from whole-genome single-cell 
DNA sequencing read depths. The distinguishing features 
of SCOPE include: 

1. Utilizes cell-specific Gini coefficients for quality 
controls and for identification of normal/diploid cells. 
In most single-cell cancer genomics studies, diploid cells 
are inevitably picked up from adjacent normal tissues for 
sequencing and can thus serve as normal controls for 
read depth normalization. However, not all 
platforms/experiments allow or adopt flow-sorting 
based techniques before scDNA-seq and thus cell ploidy 
and case-control labeling are not always readily 
available. Gini coefficient is able to index diploid 
cells out of the entire cell populations and serves 
as good proxies to identify cell outliers. 

2. Employs an EM algorithm to model GC content bias, 
which accounts for the different copy number states 
along the genome. SCOPE is based on a Poisson latent 
factor model for cross-sample normalization, 
borrowing information both across regions and 
across samples to estimate the bias terms. 

3. Incorporates multi-sample segmentation procedure 
to identify breakpoints that are shared across cells 
from the same genetic background


## 1.2 Bioinformatic pre-processing

We demonstrate here how the pre-processing bioinformatic
pipeline works. The 
[Picard toolkit](https://broadinstitute.github.io/picard/) 
is open-source and free for all uses. The `split_script.py` 
python script is pre-stored in the package for demultiplexing. 

There are two types of scDNA-seq data sources: public 
data from NCBI Sequence Read Archive and data from 
10X Genomics. For the NCBI SRA data, start with the 
SRA files. Fastq-dump to obtain FASTQ files. Align 
FASTQ sequences to NCBI hg19 reference genome and 
convert to BAM files. For the 10X Genomic datasets, 
process from the original integrated BAM file. 
Error-corrected chromium cellular barcode information 
for each read is stored as CB tag fields. Only 
reads that contain CB tags and are in the list 
of barcode of interest are demultiplexed via 
a Python script. Sort, add read group, and 
dedup on aligned/demultiplexed BAMs. Use 
deduped BAM files as the input. 

```{bash, eval = FALSE}
# public data from NCBI Sequence Read Archive
SRR=SRRXXXXXXX
kim=/pine/scr/r/u/rujin/Kim_Navin_et_al_Cell_2018
fastq_dir=$kim/fastq
align_dir=$kim/align

# Align FASTQ sequences to NCBI hg19 reference genome 
# (Single-end sequenced cells have only 1 FASTQ file; 
# paired-end sequencing would generate two FASTQ files, 
# with suffix "_1" and "_2")
cd $fastq_dir
bwa mem -M -t 16 \
    ucsc.hg19.fasta `ls | grep "$SRR" | tr '\n' ' '` > $align_dir/"$SRR".sam

# Convert .sam to .bam
cd $align_dir
samtools view -bS "$SRR".sam > "$SRR".bam

# Sort
java -Xmx30G -jar /proj/yuchaojlab/bin/picard.jar SortSam \
    INPUT="$SRR".bam OUTPUT="$SRR".sorted.bam \
    SORT_ORDER=coordinate

# Add read group
java -Xmx40G -jar /proj/yuchaojlab/bin/picard.jar AddOrReplaceReadGroups \
    I="$SRR".sorted.bam O="$SRR".sorted.rg.bam RGID="$SRR" \
    RGLB=Chung_Et_Al RGPL=ILLUMINA RGPU=machine RGSM="$SRR"
samtools index "$SRR".sorted.rg.bam

# Dedup
java -Xmx40G -jar /proj/yuchaojlab/bin/picard.jar MarkDuplicates \
    REMOVE_DUPLICATES=true \
    I="$SRR".sorted.rg.bam O="$SRR".sorted.rg.dedup.bam \
    METRICS_FILE="$SRR".sorted.rg.dedup.metrics.txt \
    PROGRAM_RECORD_ID= MarkDuplicates PROGRAM_GROUP_VERSION=null \
    PROGRAM_GROUP_NAME=MarkDuplicates
java -jar /proj/yuchaojlab/bin/picard.jar BuildBamIndex \
    I="$SRR".sorted.rg.dedup.bam

# 10X Genomics
XGenomics=/pine/scr/r/u/rujin/10XGenomics
dataset=breast_tissue_A_2k
output_dir=$XGenomics/$dataset/output
align_dir=$XGenomics/$dataset/align

# Demultiplex
cd $output_dir
samtools view ${dataset}_possorted_bam.bam | python $XGenomics/split_script.py

# Add header to demultiplexed bam files for further processing
cd $XGenomics
samtools view -H $dataset/output/${dataset}_possorted_bam.bam > \
    $dataset/header.txt
barcode=AAAGATGGTGTAAAGT
cat header.txt $align_dir/$barcode/$barcode-1.sam > \
    $align_dir/$barcode/$barcode-1.header.sam

# Convert .sam to .bam
cd $align_dir/$barcode
samtools view -bS "$barcode"-1.header.sam > "$barcode".bam
```


# 2. Pre-computation and Quality Control
## 2.1 Pre-preparation
SCOPE enables reconstruction of user-defined genome-wide 
consecutive bins with fixed interval length prior to downstream 
analysis by specifying arguments `genome` and `resolution` in 
`get_bam_bed()` function. Make sure that all chromosomes are 
named consistently and be concordant with `.bam` files. SCOPE 
processes the entire genome altogether. Use function `get_bam_bed()` 
to finish the pre-preparation step. By default, SCOPE is designed 
for hg19 with fixed bin-length = 500kb. 
```{r, eval=TRUE, message=FALSE}
library(SCOPE)
library(WGSmapp)
library(BSgenome.Hsapiens.UCSC.hg38)
bamfolder <- system.file("extdata", package = "WGSmapp")
bamFile <- list.files(bamfolder, pattern = '*.dedup.bam$')
bamdir <- file.path(bamfolder, bamFile)
sampname_raw <- sapply(strsplit(bamFile, ".", fixed = TRUE), "[", 1)
bambedObj <- get_bam_bed(bamdir = bamdir, sampname = sampname_raw, 
                        hgref = "hg38")
ref_raw <- bambedObj$ref
```

## 2.2 Getting GC content and mappability
Compute GC content and mappability for each bin. 
By default, SCOPE is intended for hg19 reference genome. 
To compute mappability for hg19, we employed the 100-mers 
mappability track from the ENCODE Project
(`wgEncodeCrgMapabilityAlign100mer.bigwig` from [link](
http://rohsdb.cmb.usc.edu/GBshape/cgi-bin/hgFileUi?
db=hg19&g=wgEncodeMapability)) 
and computed weighted average of the mappability 
scores if multiple ENCODE regions overlap with 
the same bin. For SCOPE, the whole-genome mappability 
track on human hg19 assembly is stored as part of the package. 


The whole-genome mappability track on human 
hg38 assembly is also stored in SCOPE package. 
For more details on mappability calculation, 
please refer to [CODEX2 for hg38](
https://github.com/yuchaojiang/CODEX2/blob/master/README.md). 
Load the hg38 reference package, specify argument `hgref = "hg38"` 
in `get_mapp()` and `get_gc()`. By default, `hg19` are used. 

```{r, eval=TRUE, message=FALSE}
mapp <- get_mapp(ref_raw, hgref = "hg38")
head(mapp)
gc <- get_gc(ref_raw, hgref = "hg38")
values(ref_raw) <- cbind(values(ref_raw), DataFrame(gc, mapp))
ref_raw
```

Note that SCOPE can also be adapted to 
the mouse genome (mm10) in a similar way 
(see [CODEX2 for mouse genome](https://github.com/yuchaojiang/CODEX2/blob/master/README.md)). 
Specify argument `hgref = "mm10"` in `get_bam_bed()`, `get_mapp()`, 
`get_gc()` and `get_coverage_scDNA()`. Calculation of GC content 
and mappability needs to be modified from the default (hg19). 
For mm10, there are two workarounds: 1) set all mappability to 1 
to avoid extensive computation; 2) adopt QC procedures based on
annotation results, e.g., filter out bins within "blacklist" regions, 
which generally have low mappability. To be specific, we obtained
and pre-stored mouse genome "blacklist" regions from
[Amemiya et al., Scientific Reports, 2019](https://github.com/Boyle-Lab/Blacklist/tree/master/lists/mm10-blacklist.v2.bed.gz). 
```{r, eval=FALSE}
library(BSgenome.Mmusculus.UCSC.mm10)
mapp <- get_mapp(ref_raw, hgref = "mm10")
gc <- get_gc(ref_raw, hgref = "mm10")
```


For unknown reference assembly without 
pre-calculated mappability track, 
refer to [CODEX2: mappability pre-calculation](
https://github.com/yuchaojiang/CODEX2/blob/master/mouse/mapp.R).

## 2.3 Getting coverage
Obtain either single-end or paired-end sequencing 
read depth matrix. SCOPE, by default, adopts a 
fixed binning method to compute the depth of 
coverage while removing reads that are mapped 
to multiple genomic locations and to "blacklist" 
regions. This is followed by an additional step 
of quality control to remove bins with extreme 
mappability to avoid erroneous detections. 
Specifically, the "blacklist" bins, including 
[segmental duplication regions](http://humanparalogy.
gs.washington.edu/build37/data/GRCh37GenomicSuperDup.tab) 
and [gaps in reference assembly](https://gist.github.com/leipzig/6123703) 
from telomere, centromere, and/or 
heterochromatin regions. 
```{r, eval=TRUE}
# Getting raw read depth
coverageObj <- get_coverage_scDNA(bambedObj, mapqthres = 40, 
                                seq = 'paired-end', hgref = "hg38")
Y_raw <- coverageObj$Y
```

## 2.4 Quality control
`get_samp_QC()` is used to perform QC step on 
single cells, where total number/proportion 
of reads, total number/proportion of mapped 
reads, total number/proportion of mapped 
non-duplicate reads, and number/proportion 
of reads with mapping quality greater than 20 
will be returned. Use `perform_qc()` to further 
remove samples/cells with low proportion of 
mapped reads, bins that have extreme GC content 
(less than 20% and greater than 80%) and low 
mappability (less than 0.9) to reduce artifacts. 
```{r, eval=TRUE}
QCmetric_raw <- get_samp_QC(bambedObj)
qcObj <- perform_qc(Y_raw = Y_raw, 
    sampname_raw = sampname_raw, ref_raw = ref_raw, 
    QCmetric_raw = QCmetric_raw)
Y <- qcObj$Y
sampname <- qcObj$sampname
ref <- qcObj$ref
QCmetric <- qcObj$QCmetric
```



# 3. Running SCOPE
## 3.1 Gini coefficient
One feature of SCOPE is to identify normal/diploid 
cells using Gini index. Gini coefficient is 
calculated for each cell as 2 times the area 
between the Lorenz curve and the diagonal. 
The value of the Gini index varies between 
0 and 1, where 0 is the most uniform and 1 
is the most extreme. Cells with extremely 
high Gini coefficients(greater than 0.5) 
are recommended to be excluded. Set up a 
Gini threshold for identification of 
diploid/normal cells (for example, 
Gini less than 0.12). We demonstrate 
the pre-stored toy dataset as follows. 
```{r, eval=TRUE, message=FALSE}
# get gini coefficient for each cell
Gini <- get_gini(Y_sim)
```

## 3.2 Running SCOPE with negative control samples

Normal cell index is determined either by Gini coefficients 
or prior knowledge. SCOPE utilizes ploidy estimates from a 
first-pass normalization run to ensure fast convergence
and to avoid local optima. Specify `SoS.plot = TRUE` in 
`initialize_ploidy()` to visualize ploidy pre-estimations. 
The normalization procedure includes an expectation-maximization 
algorithm in the Poisson generalized linear model. Note that, 
by setting `ploidyInt` in the normalization function, SCOPE allows 
users to exploit prior-knowledge ploidies as the input and to 
manually tune a few cells that have poor fitting. 
```{r, eval=TRUE, message=TRUE}
# first-pass CODEX2 run with no latent factors
normObj.sim <- normalize_codex2_ns_noK(Y_qc = Y_sim,
                                        gc_qc = ref_sim$gc,
                                        norm_index = which(Gini<=0.12))

# Ploidy initialization
ploidy.sim <- initialize_ploidy(Y = Y_sim, Yhat = normObj.sim$Yhat, ref = ref_sim)

# If using high performance clusters, parallel computing is 
# easy and improves computational efficiency. Simply use 
# normalize_scope_foreach() instead of normalize_scope(). 
# All parameters are identical. 
normObj.scope.sim <- normalize_scope_foreach(Y_qc = Y_sim, gc_qc = ref_sim$gc,
    K = 1, ploidyInt = ploidy.sim,
    norm_index = which(Gini<=0.12), T = 1:5,
    beta0 = normObj.sim$beta.hat, nCores = 2)
# normObj.scope.sim <- normalize_scope(Y_qc = Y_sim, gc_qc = ref_sim$gc,
#     K = 1, ploidyInt = ploidy.sim,
#     norm_index = which(Gini<=0.12), T = 1:5,
#     beta0 = beta.hat.noK.sim)
Yhat.sim <- normObj.scope.sim$Yhat[[which.max(normObj.scope.sim$BIC)]]
fGC.hat.sim <- normObj.scope.sim$fGC.hat[[which.max(normObj.scope.sim$BIC)]]
```

Visualize selection results for each individual cell. 
By default, BIC is used to choose optimal CNV group. 
```{r, eval=FALSE}
plot_EM_fit(Y_qc = Y_sim, gc_qc = ref_sim$gc, norm_index = which(Gini<=0.12), 
            T = 1:5,
            ploidyInt = ploidy.sim, beta0 = normObj.sim$beta.hat,
            filename = "plot_EM_fit_demo.pdf")
```

Upon completion of SCOPE's default normalization and segmentation, 
SCOPE includes the option to cluster cells based on the matrix of 
normalized z-scores, estimated copy numbers, or estimated changepoints. 
Given the inferred subclones, SCOPE can opt to perform a second round of 
group-wise ploidy initialization and normalization
```{r, eval=FALSE}
# Group-wise ploidy initialization
clones <- c("normal", "tumor1", "normal", "tumor1", "tumor1")
ploidy.sim.group <- initialize_ploidy_group(Y = Y_sim, Yhat = Yhat.noK.sim,
                                ref = ref_sim, groups = clones)
ploidy.sim.group

# Group-wise normalization
normObj.scope.sim.group <- normalize_scope_group(Y_qc = Y_sim,
                                    gc_qc = ref_sim$gc,
                                    K = 1, ploidyInt = ploidy.sim.group,
                                    norm_index = which(clones=="normal"),
                                    groups = clones,
                                    T = 1:5,
                                    beta0 = beta.hat.noK.sim)
Yhat.sim.group <- normObj.scope.sim.group$Yhat[[which.max(
                                    normObj.scope.sim.group$BIC)]]
fGC.hat.sim.group <- normObj.scope.sim.group$fGC.hat[[which.max(
                                    normObj.scope.sim.group$BIC)]]
```

## 3.3 Cross-sample segmentation by SCOPE

SCOPE provides the cross-sample segmentation, 
which outputs shared breakpoints 
across cells from the same clone. This step 
processes the entire genome chromosome by 
chromosome. Shared breakpoints and integer copy-number 
profiles will be returned. 
```{r, eval=TRUE, message=FALSE}
chrs <- unique(as.character(seqnames(ref_sim)))
segment_cs <- vector('list',length = length(chrs))
names(segment_cs) <- chrs
for (chri in chrs) {
    message('\n', chri, '\n')
    segment_cs[[chri]] <- segment_CBScs(Y = Y_sim,
                                    Yhat = Yhat.sim,
                                    sampname = colnames(Y_sim),
                                    ref = ref_sim,
                                    chr = chri,
                                    mode = "integer", max.ns = 1)
}
iCN_sim <- do.call(rbind, lapply(segment_cs, function(z){z[["iCN"]]}))
```

## 3.4 Visualization

SCOPE offers heatmap of inferred integer copy-number 
profiles with cells clustered by hierarchical clustering.
```{r, eval=FALSE}
plot_iCN(iCNmat = iCN_sim, ref = ref_sim, Gini = Gini, 
        filename = "plot_iCN_demo")
```


## Session information

```{r}
sessionInfo()
```