\documentclass{article} \usepackage{cite, hyperref} \title{contiBAIT: Improving Genome Assemblies Using Strand-seq Data} \author{Kieran O'Neill, Mark Hills and Mike Gottlieb} %\VignetteIndexEntry{flowBi}ngit \begin{document} \SweaveOpts{concordance=TRUE} \setkeys{Gin}{width=0.75\textwidth} \setkeys{Gin}{height=0.75\textwidth} \maketitle \begin{center} {\tt koneill@bcgsc.ca} \end{center} \textnormal{\normalfont} \tableofcontents \newpage \section{Licensing} Under the Two-Clause BSD License, you are free to use and redistribute this software. \section{Introduction} Strand-seq is a method for determining template strand inheritance in single cells. When strand-seq data are collected for many cells from the same organism, spatially close genomic regions show similar patterns of template strand inheritance. ContiBAIT allows users to leverage this property to carry out three tasks to improve draft genomes. Firstly, in assemblies made up entirely of contigs or scaffolds not yet assigned to chromosomes, these contigs can be clustered into chromosomes. Secondly, in assemblies wherein scaffolds have been assigned to chromosomes, but not yet placed on those chromosomes, those scaffolds can be placed in order relative to each other. Thirdly, for assemblies at the chromosome stage, where scaffolds are ordered and separated by many unbridged sequence gaps, the orientation of these sequence gaps can be found. All three of these tasks can be run in parallel, taking contig-stage assemblies and ordering all fragments first to chromosomes, then within chromosomes while simultaneously determining the relative orientation of each fragment. This vignette will outline some specific functions of contiBAIT, and is comparible to the contiBAIT() master function included in this package that will perform the same sequence of function calls outlined below. \section{Input} ContiBAIT requires input in BAM format. Multiple BAM files are required for analysis, so ContiBAIT specifically calls for users to identify a BAM directory in which to analyse. Sorted BAM files will speed up analysis. <>= # Read in BAM files. Path denotes location of the BAM files. # Returns a vector of file locations library(contiBAIT) bamFileList <- list.files( path=file.path(system.file(package='contiBAIT'), 'extdata'), pattern=".bam$", full.names=TRUE) @ The example data provided by contiBAIT is from a human blood sample and has been aligned to GRCh38/hg38. Since this assembly is already complete, we must first split this genome into chunks to simulate a contig-stage assembly. To do this we need to extract information on the assembly the bam file is aligned to by creating a chromosome table instance, then splitting this table. \subsection{\textit{Creating a chromosome table instance}} The example data provided with the contiBAIT package is derived from GRCh38/hg38 data (only aligned to all autosomes and allosomes; no alternative locations or contigs were included). To subset these data, and for further downstream analysis, a GRanges chromosome table instance can be made, representing the contig name and length. This is generated with makeChrTable, where the resulting object is similar to the header portion of a BAM file. Note a meta column with a name formed of the contig and start and end locations is generated for downstream workflows. <>= # build chr table from BAM file in bamFileList exampleChrTable <- makeChrTable(bamFileList[1]) exampleChrTable @ \subsection{\textit{Splitting a chromosome table instance}} We can also split the above chromosome table instance into 1 Mb fragments. This subdivision isn't just for testing purposes. For chromosome- and contig-stage assemblies with very large fragments, subdividing the data into bins can help identify chimeric fragments and misorientations. Some assemblies have a large degree of misorientations or chimerism in the data, and subdividing them aids in clustering these fragments. For example, if a region is misoriented within a contig, the strand state will change in this region, skewing this contig toward a WC call in every library. However, while fragmenting can improve the overall number of contigs included in analysis and improve clustering, as the fragments get further subdivided, the number of reads used to make strand state calls decreases, and the probability of there being insufficient reads to make an accurate call increases. Note the following divided chromosome table can be used with the filter argument in strandSeqFreqTable to generate a sub-divided table. <>= exampleDividedChr <- makeChrTable(bamFileList[1], splitBy=1000000) exampleDividedChr @ \subsection{\textit{Splitting a chromosome table based on strand state changes}} A change in strand state within a contig can represent a number of things. At it's simplest, it could represent a sister chromatid exchange switching the templates in that particular cell. In cases where the same location is a site of recurrent strand state changes, the more likely explanation is that the fragment is chimeric or has a misorientation within it. contiBAIT allows users to cut contigs at these locations to allow for better clustering. The most likely site of incorrectly oriented or placed fragments is at unbridged or bridged gap regions. A function is included that allows us to look for overlaps between recurrent strand state changes and gap regions. <>= library(rtracklayer) # Download GRCh38/hg38 gap track from UCSC gapFile <- import.bed("http://genome.ucsc.edu/cgi-bin/hgTables?hgsid=465319523_SLOtFPExny48YZFaXBh4sSTzuMcA&boolshad.hgta_printCustomTrackHeaders=0&hgta_ctName=tb_gap&hgta_ctDesc=table+browser+query+on+gap&hgta_ctVis=pack&hgta_ctUrl=&fbQual=whole&fbUpBases=200&fbDownBases=200&hgta_doGetBed=get+BED") # Create fake SCE file with four regions that overlap a gap sceFile <- GRanges(rep('chr4',4), IRanges(c(1410000, 1415000, 1420000, 1425000), c(1430000, 1435000, 1430000, 1435000))) overlappingFragments <- mapGapFromOverlap(sceFile, gapFile, exampleChrTable, overlapNum=4) show(overlappingFragments) @ What is returned is a GRanges chromosome table instance where the gap that is coincident with the recurrent strand state change has split that contig into two smaller fragments. Note the example table now has 25 fragments as chr4 has been split. \subsection{\textit{Creating a strandFreqMatrix instance}} To read in BAM files into ContiBAIT, create a strandFreMatrix instance by calling strandSeqFreqTable(). This program will read each BAM file, calculate the ratio of W and C reads, and return this value along with the total number of reads used to make the call. Note, we will use the GRanges divided chromosome table to simultaneously cut the assembly into 1 Mb fragments. By default duplicate reads are removed, a minimal mapping quality of 0 is used and the function expects to see paired end data. Because of the way BAM files store strand information, it is important to ensure that the pairedEnd parameter is correctly set. A warning will be issued if single-end data is run as if it is paired-end. <>= # Create a strandFreqTable instance strandFrequencyList <- strandSeqFreqTable(bamFileList, filter=exampleDividedChr, qual=10, pairedEnd=FALSE, BAITtables=TRUE) @ This returns a list of two data.frames if the BAITtables argument is set to FALSE, or four if it is set to TRUE. The first data.frame consists of a strand state frequency, calculated by taking the number of Watson (- strand) reads, subtracting the number of Crick (+ strand) reads, and dividing by the total number of reads. These values range from -1 (entirely Watson reads) through to 1 (entirely Crick reads). The second data.frame consists of the absolute number of reads covering the contig. This is used in thresholding the data, and in weighting the accuracy of calls in subsequent orderings. Note that the fewer reads used to make a strand call, the less accurate that call will be. In the absence of background reads, WC regions will follow a binomial distribution. If we assume any contigs with <-0.8 are WW, and >0.8 are CC (this is the default strandTableThreshold parameter used in preprocessStrandTable), then there is a probability of 0.044 that the strand state call is incorrect. As such we exclude calls that are made with fewer than 10 reads. Increasing this number will make calls more accurate, but will reduce the number of contigs included in analysis. Since the contigs are weighted based on read density during clustering, a minimum of 10 reads to support a strand call provides a good balance between accuracy and inclusion. <>= # Returned list consisting of two data.frames strandFrequencyList # Exclude frequencies calculated from # contigs with less than 10 reads exampleStrandFreq <- strandFrequencyList[[1]] exampleReadCounts <- strandFrequencyList[[2]] exampleStrandFreq[which(exampleReadCounts < 10)] <- NA @ Additional information can be found on the help page for strandSeqFreqTable including all parameters.\\ The quality of the libraries, specifically whether the files being analysed appear to show the expected distributions of directional reads, can be assessed with plotWCdistributions. In a diploid organism, there is an expectation that chromosomes will be derived from either two Watson homologues, one Watson and one Crick homologue, or two Crick homologues in a Mendelian 1:2:1 ratio. In Strand-seq data, this will mean about 1/4 of the contigs will only have Watson reads mapping to them, and have a strand state frequency of -1, 1/4 of the contigs will only have Crick reads mapping to them, and have a strand state frequency of +1, and 1/2 of the contigs will have an approximately even mix of Watson and Crick reads (based on a binomial distribution of sampling). plotWCDistribution generates boxplots for different strand state frequencies and models the expected distribution (blue line). The average called WW or CC contigs are shown in green, and should match closely with the expected distribution line. <>= # Assess the quality of the libraries being analysed plotWCdistribution(exampleStrandFreq) @ \section{Creating a strand state matrix} The returned list of strandSeqFreqTable can be converted to a strand state matrix that makes a contig-wide call on the overall strand state based on the frequencies of Watson and Crick reads. The function removes BAM files that either contain too few reads to make accurate strand calls or are not strand-seq libraries (i.e. every contig contains approximately equal numbers of + and - reads). Conversely the function removes contigs that either contain too few reads, or always contain roughly equal numbers of + and - reads. More details on the parameters can be found in the function documentation. The function returns a similar data.frame to strandSeqFreqTable, but with the frequencies converted to strand calls: 1 is a homozygous Watson call (by default, a frequency less than -0.8, but this can be changed with the filterThreshold argument), 2 is a heterozygous call (a frequency between -0.8 and 0.8 by default) and 3 is a homozygous Crick call (by default, a frequency above 0.8). These factors can then be used to cluster similar contigs together.\\ <>= # Convert strand frequencies to strand calls. exampleStrandStateMatrix <- preprocessStrandTable( exampleStrandFreq, lowQualThreshold=0.8) exampleStrandStateMatrix[[1]] @ \section{Clustering contigs into chromosomes} clusterContigs utilizes a custom algorithm to cluster all fragments together that share a similar strand state across multiple cells. For example, if two contigs are adjacent on the same chromosome, then they will inherit the same strand state in every cell that is analyzed. It is important to note however that the relative directionality of any two fragments within an assembly is unknown. Contigs which belong on the same chromosome but are in different orientations will display as complete opposites; every library where one contig is homozygous Watson will have the other contig as homozygous Crick. However, heterozygous contigs (where chromosomes inherited one Watson template and one Crick template), will not be mirrored, with one contig being "WC", while the other will be "CW". As such, by default the function performs clustering between homozygous calls (WW or CC treated the same) and heterozygous calls (WC) to identify contigs that belong together despite their misorientation status with respect to each other. Using the clusterBy='homo' option will perform the clustering just between WW and CC calls (ignoring WC calls) and misoriented fragments from the same chromosome will cluster into different linkage groups. Once clustered, the misorientated fragments are identified in each linkage group using reorientAndMergeLGs. Since chromosome orientation cannot be determined by sequence alone, the largest sub-group is arbitrarily considered "+", while the smaller group is considered "-". The product of this function is a list where the first element is a StrandStateMatrix instance that has been correctly oriented, the second element is a data.frame of contigs and orientations, and the third element is a recomputed LinkageGroupList that merges groups that were previously discordant based on misorientation status. This merger occurs by computing a consensus strand state across libraries within each linkage group and comparing them. Note for the purposes of this example, we are using the argument randomise=FALSE to ensure conformity of the vignette when running sweave. Randomisation is recommended for most applications and is set to TRUE by default. <>= exampleWCMatrix <- exampleStrandStateMatrix[[1]] clusteredContigs <- clusterContigs(exampleWCMatrix, randomise=FALSE) reorientedMatrix <- reorientAndMergeLGs(clusteredContigs, exampleWCMatrix) exampleLGList <- reorientedMatrix[[3]] exampleLGList exampleLGList[[1]] @ The clusterContigs function generates a list of linkage groups consisting of all the clustered contigs. After reorientation and merging, all contigs within the linkage groups are highly similar, while the contigs between linkage groups are highly dissimilar. The similarity between linkage groups can be visualized using plotLGDistances. <>= plotLGDistances(exampleLGList, exampleWCMatrix) @ While the similarity within linkage groups can be visualized using plotLinkageGroup (here, the first linkage group is used for creating this heatmap). Note, side by side comparisons of linkage group members can be performs with multiple lg options (e.g. lg=1:2 will plot the first two linkage groups, lg=c(1,4) will plot the first and forth etc.). <>= plotLGDistances(exampleLGList, exampleWCMatrix, lg=1) @ \section{Ordering contigs within chromosomes} With contigs clustered to chromosomes, we can then order them within chromosomes. Just as meiotic recombination shuffles loci and allows genetic distances between them to be determined, sister chromatid exchanges (SCE) events reshuffle templates, and similarly allow us to infer a linkage distance. We have employed a greedy algorithm to do this, but have an argument allowing a TSP solution as an alternative. Contigs are ordered by similarity across libraries, then by contig name. Contigs that are zero distance apart (ie have no SCE events between them and are therefore unordered) are returned in contig name order. The output is split into sub-linkage groups, so Linkage group 1 will be split into a number of groups depending on the number of SCE events that occur within the chromosome. The output will be an S4 object of type contigOrdering. <>= contigOrder <- orderAllLinkageGroups(exampleLGList, exampleWCMatrix, exampleStrandFreq, exampleReadCounts, whichLG=1, saveOrdered=TRUE) contigOrder[[1]] @ If the assembly is mostly complete and you wish to compare the actual location of the fragments in the assembly you're working with against the output of orderAllLinkageGroups, contiBAIT has the built in plotContigOrder function. This assumes that the contig name from the contigOrdering object is in a format of chr:start-end. <>= plotContigOrder(contigOrder[[1]]) @ Alternatively, all contigs can be ordered simultaneously by omitting the whichLG argument. If saveOrdered is set to TRUE, plots will be generated for every linkage group. By ordering all of the linkage groups we can proceed to create BED files of these data for the new assemblies. <>= contigOrderAll <- orderAllLinkageGroups(exampleLGList, exampleWCMatrix, exampleStrandFreq, exampleReadCounts) contigOrderAll[[1]] @ \section{Checking order using BAIT ideograms} It is possible to visually validate the ordering by creating ideogram plots of the data. The supplied test data comprises of reads from the first four chromosomes. Below is example code that will plot the second library from the output of strandSeqFreqTable. Note the third and forth elements of the strandFrequencyList are only generated if BAITtables is set to TRUE when running strandSeqFreqTable. The plot below shows the location of the reads and highlights an SCE on chromosome 3. Note to only display the first library, we need to subset the strandReadMatrix and retain these as strandReadMatrix objects. <>= # extract elements from strandSeqFreqTable list WatsonFreqList <- strandFrequencyList[[3]] CrickFreqList <- strandFrequencyList[[4]] # subset elements to only analyze one library singleWatsonLibrary <- StrandReadMatrix(WatsonFreqList[,2, drop=FALSE]) singleCrickLibrary <- StrandReadMatrix(CrickFreqList[,2, drop=FALSE]) # Run ideogram plotter ideogramPlot(singleWatsonLibrary, singleCrickLibrary, exampleDividedChr) @ If chromosome builds are not complete (and so each contig has not been assigned a chromosome in the chrTable instance), these ideograms can be plotted using only the represented fragments in the order given to the function from the orderedContig object generated from orderAllLinkageGroups. Here we will use the orderedContig object representing all 4 linkage groups. <>= ideogramPlot(singleWatsonLibrary, singleCrickLibrary, exampleDividedChr, orderFrame=contigOrderAll[[1]]) @ Alternatively, all libraries can be compared side by side for a single chromosome. Because this will print to multiple pages, the showPage option can be used to limit the output to a user=specified page. Here we will use the orderedContig object representing just one linkage group. <>= ideogramPlot(WatsonFreqList, CrickFreqList, exampleDividedChr, orderFrame=contigOrder[[1]], plotBy='chr', showPage=1) @ \section{Writing out to a BED file} This file can be passed to bedtools along with the original (draft) reference genome to create a new FASTA file containing the assembled genome. the writeBed function requires a chrTable of class ChrTable with which to extract the contig names and locations, the orientation information derived from reorientLinkageGroups to populate the strand column, the library weight to populate the score column, and an object of type ContigOrdering to invoke the actual order of fragments. A fileName can be supplied, or the default is used. BED files will be written to the working directory <>= writeBed(exampleDividedChr, reorientedMatrix[[2]], contigOrder[[1]]) @ \newpage \section{Additional plotting functions} Using a chromosome table instance, comparisons can be made between the portion of contigs that are included in the analysis verses those that are excluded based on either poor coverage or non-Strand-seq patterning. The code below generates a box plot of contig sizes that are included in the analysis. Note, since sample data are uniform 1 Mb framgents, the box plot does not deviate from the median. The example bam files contain reads from 76 separate 1 Mb fragments from chromsomes 1, 2, 3, and 4. Since the assembly is >3 Gb in size, only a few percent of the assembly will be included in our analysis. <>= makeBoxPlot(exampleDividedChr, exampleLGList) @ \newpage Furthermore we can determine the proportion of assembly fragments in each linkage group in a barplot. If data are in the format chr:start-end, then each unique chromosome name will have a unique color. If data are not in this format, then each fragment will have a unique color. Here, all fragments from chr1 will be colored differently to fragments from chr2, etc. <>= barplotLinkageGroupCalls(exampleLGList, exampleDividedChr) @ Note that if clustering did not occur correctly, some bars would be a mixture of colors. While the above displays the proportion of fragments from one chromosome that has clustered into each linkage group, but omitting the by='chr' parameter, the plot changes to the proportion of linkage groups within each chromosome. \newpage \section{Flow diagram} Here is the basic functionality of the contiBAIT package. The input BAM file(s) and output BED file are displayed in the grey ovals. All plotting functions are shown in blue hexagons and all analysis functions are in white boxes. <>= library(diagram) par(mar = c(1, 1, 1, 1)) openplotmat() # elpos <- coordinates (c(5, 5, 5, 5, 5, 5, 5, 5)) elpos <- coordinates (rep(4,7)) fromto <- matrix(ncol = 2, byrow = TRUE, data = c( 3, 6, 3, 8, 6, 5, 6, 7, 6,9, 6,10, 8,7, 8,11, 8,20, 10,14, 14,18, 18,19, 10,11, 10,15, 19,15, 19,20, 19,23, 23,27, 9,17, 14,13, 17,22, 22,23, 18,23, 23,24 )) nr <- nrow(fromto) arrpos <- matrix(ncol = 2, nrow = nr) for (i in 1:nr) arrpos[i, ] <- straightarrow (to = elpos[fromto[i, 2], ], from = elpos[fromto[i, 1], ], lwd = 2, arr.pos = 0.6, arr.width=0.15, arr.length = 0.5) textround(elpos[3,], 0.04, lab = "BAMFILE", box.col = "grey70", shadow.col = "grey10", shadow.size = 0.005, cex = 0.8) textrect(elpos[8,], 0.09, 0.04,lab = "make\nChrTable", box.col = "white", shadow.col = "grey10", shadow.size = 0.005, cex = 0.7) textrect(elpos[6,], 0.09, 0.04,lab = "strandSeq\nFreqTable", box.col = "white", shadow.col = "grey10", shadow.size = 0.005, cex = 0.7) textrect(elpos[10,], 0.09, 0.04,lab = "preprocess\nStrandTable", box.col = "white", shadow.col = "grey10", shadow.size = 0.005, cex = 0.7) textrect(elpos[14,], 0.09, 0.04,lab = "cluster\nContigs", box.col = "white", shadow.col = "grey10", shadow.size = 0.005, cex = 0.7) textrect(elpos[18,], 0.09, 0.04,lab = "reorient\nLinkageGroups", box.col = "white", shadow.col = "grey10", shadow.size = 0.005, cex = 0.7) textrect(elpos[19,], 0.09, 0.04,lab = "merge\nLinkageGroup", box.col = "white", shadow.col = "grey10", shadow.size = 0.005, cex = 0.7) textrect(elpos[23,], 0.09, 0.04,lab = "orderAll\nLinkageGroups", box.col = "white", shadow.col = "grey10", shadow.size = 0.005, cex = 0.7) textround(elpos[27,], 0.04, lab = "writeBed", box.col = "grey70", shadow.col = "grey10", shadow.size = 0.005, cex = 0.8) texthexa(elpos[7,], 0.09, 0.04,lab = "ideogram\nPlot", box.col = "lightblue", shadow.col = "grey10", shadow.size = 0.005, cex = 0.7) texthexa(elpos[11,], 0.09, 0.04,lab = "make\nBoxPlot", box.col = "lightblue", shadow.col = "grey10", shadow.size = 0.005, cex = 0.7) texthexa(elpos[5,], 0.09, 0.04,lab = "plotWC\ndistribution", box.col = "lightblue", shadow.col = "grey10", shadow.size = 0.005, cex = 0.7) texthexa(elpos[15,], 0.09, 0.04,lab = "plotLG\ndistances", box.col = "lightblue", shadow.col = "grey10", shadow.size = 0.005, cex = 0.7) texthexa(elpos[20,], 0.09, 0.04,lab = "barplot\nLGCalls", box.col = "lightblue", shadow.col = "grey10", shadow.size = 0.005, cex = 0.7) texthexa(elpos[24,], 0.09, 0.04,lab = "plotContig\nOrder", box.col = "lightblue", shadow.col = "grey10", shadow.size = 0.005, cex = 0.7) @ \setkeys{Gin}{width=\textwidth} \setkeys{Gin}{height=15cm} \begin{center} <>= <> @ \end{center} \end{document}