--- title: "single-sperm-co-analysis" output: BiocStyle::html_document bibliography: ref.bib vignette: > %\VignetteIndexEntry{single-sperm-co-analysis} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup} suppressPackageStartupMessages({ library(comapr) library(SummarizedExperiment) }) ``` # Introduction In the previous vignette, we demonstrated how to calculate genetic distances using genotyped markers from a group of samples. [Genetic distance calculate from genotype shifts of markers](getStarted.html) In this document, we will focus on building individualized genetic maps from output files of `sscocaller` which is available at [here](https://gitlab.svi.edu.au/biocellgen-public/sscocaller). # Locate file path The `comapr` package includes a list of toy example output files from the `sscocaller` command line tool. The follow code will get the file path that we will use later. ```{r} demo_path <-paste0(system.file("extdata",package = "comapr"),"/") ``` We can see that we have two samples (individual donors) and each of them has haplotype states inferred for chr1 to chr5. ```{r} list.files(demo_path) ``` # File information - *.mtx - sparse matrix with columns corresponding to the list of sperm cell barcodes and rows corresponding to the list of SNP positions in VCF file - {sample}_chr1_altCount.mtx, a sparse mtx with entries representing alternative allele counts - {sample}_chr1_totalCount.mtx, a sparse mtx with entries representing total allele counts - {sample}_chr1_vi.mtx, a sparse mtx with entries representing inferred viterbi state (haplotype state) - {sample}_chr1_snpAnnot.txt, the SNP positions and allele - {sample}_chr1_SegInfo.txt, statistics of viterbi state segments in text file format. It contains consecutive viterbi states for each chromosome with statistics including, starting SNP position, ending SNP position, the number of SNPs supporting the segment, the log likelihood ratio of the viterbi segment and the inferred hidden state. # Diagnostic functions `comapr` provides quality-check functions for examining SNP coverage per chr and per cell, chromosome segregation pattern checks, and summary statics for filtering low confidence crossovers. ## perCellChrQC ```{r} pcQC <- perCellChrQC(sampleName="s1",chroms=c("chr1"), path=paste0(demo_path,"/"), barcodeFile=NULL) ``` # Input parsing Input-parsing functions are implemented in `comapr` to construct `RangedSummarizedExpriment` object that parses files generated from `sscocaller` and filter out low-confidence COs at the same time. For the demo dataset, these filters do not have any effects: ``` minSNP = 0, minlogllRatio = 50, bpDist = 100, maxRawCO=10, minCellSNP = 1 ``` # Construct `RangedSummarizedExpriment` object We first construct `RangedSummarizedExpriment` object from parsing output files from `sscocaller` and filter out low-confidence crossovers. ```{r} s1_rse_state <- readHapState("s1",chroms=c("chr1","chr2"), path=demo_path,barcodeFile=NULL, minSNP = 0, minlogllRatio = 50, bpDist = 100, maxRawCO=10, minCellSNP = 1) s2_rse_state <- readHapState("s2",chroms=c("chr1","chr2"), path=demo_path, barcodeFile=NULL, minSNP = 0, minlogllRatio = 50, bpDist = 100, maxRawCO=10, minCellSNP = 1) ``` ```{r} s1_rse_state ``` The Viterbi states for SNP markers are stored in the "assay" slot: ```{r} assay(s1_rse_state) ``` The `rowRanges` contains the SNP's positions: ```{r} rowRanges(s1_rse_state) ``` ## Formate sample group factor We have read in the Viterbi states for cells from two samples: s1 and s2. We now combine them into one data object for downstream analysis. ## Add sample group factor ```{r} colData(s1_rse_state) colData(s1_rse_state)$sampleGroup <- "s1" colData(s2_rse_state)$sampleGroup <- "s2" ``` ## Combine two groups We now call `combineHapState` function to combine sample s1 and sample s2 into one `RangedSummarizedExperiment` object. ```{r} twoSample <- combineHapState(s1_rse_state, s2_rse_state) ``` Now the `assay` data slot contains the Viterbi states across SNP positions for the combined samples. ```{r} twoSample <- combineHapState(s1_rse_state,s2_rse_state) ``` Now the `twoSample` object contains the cells from both samples with `assay` slot contains the Viterbi states and `rowRanges` contains position annotaitons for the list SNP markers. ```{r} assay(twoSample) ``` # Count crossovers The `countCOs` function can then find out the crossover intervals according to the Viterbi states for each cell and the number of crossovers per cell is then calculated. # Count crossovers for SNP intervals `countCOs` function will find the crossover intervals for each samples and attribute the observed crossovers from each sample to the corresponding intervals. ```{r} cocounts <- countCOs(twoSample) ``` The `rowRanges` from the resulting data object now denotes the crossover interval and the `assay` slot now contains the number of crossovers in each cell. Now `rowRanges` contains the intervals ```{r} rowRanges(cocounts) ``` The colData slot still contains the annotations of each cell. ```{r} colData(cocounts) ``` # Calculate genetic distances The genetic distances can be calculated by using mapping functions such as the Kosambi or Haldane \cite{} and `assay` slot contains the number of crossovers found in each sample across these intervals. ```{r} assay(cocounts) ``` # Calculate genetic distances `calGeneticDist` function will convert the raw crossover frequencies to genetic distances via selected mapping function (ie. kosambi or haldane). ```{r} dist_gr <- calGeneticDist(co_count = cocounts, mapping_fun = "k") dist_gr ``` The genetic distances for each interval are stored in `rowData`. ```{r} rowData(dist_gr) ``` The above genetic distances have been calculated using all samples. We can also specify the group factor so that we can calculate genetic distances for different sample groups: ```{r} ## sampleGroup is a column in the colData slot dist_gr <- calGeneticDist(co_count = cocounts, group_by = "sampleGroup", mapping_fun = "k") ``` Now the group/Sample specific distances are stored in `rowData` ```{r} rowData(dist_gr)$kosambi ``` # Plot whole genome genetic distances We construct a `GRange` object from the `dist_gr` first. ```{r} p_gr <- granges(dist_gr) mcols(p_gr) <- rowData(dist_gr)$kosambi ``` We can plot whole-genome genetic distances ```{r} plotWholeGenome(p_gr) ``` We can also do per chromosome plot ```{r} plotGeneticDist(p_gr,chr = "chr1",cumulative = TRUE) ``` # Group differences `bootstrapDist` function generates the sampling distribution of the difference in total genetic distances for the two groups under comparisons. ```{r} set.seed(100) bootsDiff <- bootstrapDist(co_gr = cocounts,B=10, group_by = "sampleGroup") ``` ```{r} hist(bootsDiff) ``` From the `bootsDiff` data object, we can find a 95% confidence interval to test whether the difference is statistically significant. ```{r} quantile(bootsDiff,c(0.025,0.975),na.rm =TRUE) ``` An alternative re-sampling method, `permuteDist`, can be applied to generate a null distribution for the group difference by reshuffling the group labels across the samples. ```{r} set.seed(100) perms <- permuteDist(cocounts,B=1000,group_by = "sampleGroup") ``` A p-value can be calculated using the `statmod::permp` function [@Phipson2010-xi]. ```{r} statmod::permp(x = sum(perms$permutes>=perms$observed_diff, na.rm = TRUE), nperm = sum(!is.na(perms$permutes)), n1 = perms$nSample[1], n2 = perms$nSample[2]) ``` # Session info ```{r} sessionInfo() ```