--- author: "Jack Fu" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteEngine{knitr::knitr} %\VignetteIndexEntry{Title of your vignette} %\usepackage[UTF-8]{inputenc} --- ## Overview MDTS provides the necessary infrastructure to take raw bam files of targeted sequencing trios to produce de novo deletion calls of high sensitivity and low false positives. Our method benefits tremendously from pooling information from across many trios to determine regions of interest, normalize read-depth, and to filter candidate deletion signals. MDTS is broken down into the following major steps: 1. Calculating the MDTS bins 2. Determinig the read-depth of the bins for each sample 3. Normalizing the read-depth 4. Creating the Minimum Distance for each trio 5. Call de novo deletions and filter ## Raw Data Preprocessed sample data is included with the package `MDTS`, but should you be interested in the raw files, they can be found in the github repo `jmf47/MDTSdata`. It includes: 1. 21 simulated bam files (7 trios) + Sequencing read length of 100bp + 1 true de novo deletion in family F4 2. A bw file that includes the 100mer mappability of chr1 3. A pedigree file + Records the trio kinships + Records the file paths to the raw bam files of each sample ```{r, echo=F, message=F, warning=F} library(MDTS); library(BSgenome.Hsapiens.UCSC.hg19) setwd(system.file("extdata", package="MDTS")) load('pD.RData') pD ``` ## 1. Calculating MDTS Bins One innovation of our method is that the bins to determine sequencing read depth is calculated based on the empirical capture. These bins can be significantly different from the standard practice of using probe design coordinates to create bins. Furthermore, our binning process allows the bins to be smaller in regions of high capture, and vice verse in low capture. This dynamic scaling of the bin size makes efficient use of varying depths of coverage - allowing us to call deletions with finer resolution in areas of higher coverage. The general approach is first to examine the mapped coverage of a subset of the full dataset across the entire genome. Basepairs that are covered passed a certain threshold in at least one sample is selected as a proto-region. Proto-regions are subdvided into non-overlapping bins such that the median number of reads falling into each bin meets a specified level spcified by the *medianCoverage* parameter. The choice of *medianCoverage* is a tradeoff between sensitivity and false positives. A larger *medianCoverage* decreases sensitivity and false positives. For our publication, we used *medianCoverage=160*, and for this example we used *medianCoverage=150*. Information on GC content and mappability of the bins are also calculated at this stage. As a result, MDTS requires a BSgenome object that is consistent with the reference annotation the bam files were aligned to, as well as a mappability bigwig that contains mappability information in windows consistent with sequencing read length. In our example, we require `BSgenome.Hsapiesn.UCSC.hg19` and 100mer mappability. ** In general you will need a bw file that contains the information for all autosomes 1-22 (unlike the bw file only for chr1 included in the example). ** This is the link to some hg19 mappability bw files: http://rohsdb.cmb.usc.edu/GBshape/cgi-bin/hgFileUi?db=hg19&g=wgEncodeMapability This portion of the vignette example uses raw data from `MDTSData`. However the resulting bins are included in the `data` directory of MDTS as a `.RData`. ```{r, echo=FALSE, message=FALSE} library(MDTS) ``` ```{r, eval=F, warning=F} library(MDTS); library(BSgenome.Hsapiens.UCSC.hg19) # Using the raw data from MDTSData devtools::install_github("jmf47/MDTSData") setwd(system.file("data", package="MDTSData")) # Importing the pedigree file that includes information on where to locate the # raw bam files pD <- getMetaData("pD.ped") # Information on the GC content and mappability to estimate GC and mappability # for the MDTS bins genome <- BSgenome.Hsapiens.UCSC.hg19; map_file = "chr1.map.bw" # This command now subsets 5 samples to determine MDTS bins # pD is the metaData matrix from getMetaData() # n is the number of samples to examine to calculate the bins # readLength is the sequencing read length # minimumCoverage is the minimum read depth for a location to be included # in a proto region # medianCoverage is the median number of reads across the n samples in a bin bins <- calcBins(metaData=pD, n=5, readLength=100, minimumCoverage=5, medianCoverage=150, genome=genome, mappabilityFile=map_file) ``` ## 2. Calculating coverage of MDTS bins Given a set of dynamic MDTS bins, we can proceed to calculate the number of reads that fall into these bins for the entirety of our sample. We organize the number of reads as a matrix, where each column is a sample, and each row corresponds to a bin. This portion of the vignette is a continuation of the usage of the raw data from `MDTSdata` above. However the resulting `counts` matrix is also shipped as a `rda` in `MDTS`. ```{r, eval=F} # pD is the phenotype matrix # bins is the previously calculated MDTS bins # rl is the sequencing read length counts = calcCounts(pD, bins, rl=100) ``` The MDTS bins, raw counts, and pedigree files are included with `MDTS` and can be loaded as follows: ```{r, message=FALSE, warning=F} load(system.file("extdata", 'bins.RData', package = "MDTS")) load(system.file("extdata", 'counts.RData', package = "MDTS")) load(system.file("extdata", 'pD.RData', package = "MDTS")) ``` The MDTS bins are ```{r} bins ``` The count matrix where each column is a sample and each row is a bin: ```{r} head(counts) ``` ## 3. Normalizing the read-depth After obtaining the raw read-depth matrix, MDTS calculates a vector M score matrix of the same organization - each row corresponds to a row and each column a sample. The M score is based on a log2 transformation, followed by median polish, and GC and mappability adjust via a loess smoother. The resulting M scores have critical values where (0, -1, <-4) are generally consistent with 2, 1, and 0 copy numbers [barring Copy Number Polymorphisms]. ```{r, warning=F} # counts is the raw read depth of [MDTS bins x samples] # bins is the previously calculated MDTS bins mCounts <- normalizeCounts(counts, bins) ``` ## 4. Creating the Minimum Distance for each trio The second advantage of MDTS is the use of Minimum Distances for candidate de novo deletion identification. The assumption underlying the development of the Minimum Distance is that it is vastly more likely for a deletion to be inherited than de novo if the proband shares a deletion with one of the parents in the trio. Based on that assumption, the Minimum Distance is calculated per family - bin combination. For each bin, the Minimum Distance is smallest (in absolute terms) difference between the proband's M score and both parents'. For example, if the M scores are (-1, -1, 0) for a proband and the 2 parents respectively, a situation consistent with an inherited deletion in the proband from parent 1, the Minimum Distance of the 2 possible pairwise comparisons is 0. On the other hand, if the M scores are (-1, 0, 0) for the proband and the 2 parents, the Minimum Distance stands out at -1, and is consistent with a de novo deletion. To calculate the Minimum Distance of the dataset, MDTS takes as input the M score matrix. The output matrix is organized where each row corresponds still to a bin, but each column now refers to a trio. ```{r, warning=F} # mCounts is the normalized read depth of [MDTS bins x samples] # bins is the previously calculated MDTS bins # pD is the phenotype matrix md <- calcMD(mCounts, pD) ``` ## 5. Call de novo deletions and filter Using the Minimum Distance matrix calculated above, MDTS uses the tried and true Circular Binary Segmentation method to infer deletion states. Inferred candidate de novo deletions are further filtered to remove likely false positive signals that arose out of regions of highly variable read-depth (generally indicative of sequence artifacts or polymorphic events). ```{r, warning=FALSE, message=FALSE, warning=F} # md is the Minimum Distance of [MDTS bins x trio] # bins is the previously calculated MDTS bins # mCounts is the normalized read depth of [MDTS bins x samples] cbs <- segmentMD(md=md, bins=bins) denovo <- denovoDeletions(cbs, mCounts, bins) ``` In our example, the output is a single detected de novo deletion in family F4: ```{r} denovo ``` No signals were picked up from the CNP region or elsewhere. ## 6. SessionInfo ```{r} sessionInfo() ```