--- title: "An introduction to Rbwa" date: "`r Sys.Date()`" author: - name: Jean-Philippe Fortin affiliation: Department of Data Science and Statistical Computing, gRED, Genentech email: fortin946@gmail.com output: BiocStyle::html_document: toc_float: true theme: paper number_sections: true vignette: > %\VignetteIndexEntry{An introduction to Rbwa} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} bibliography: refs.bib --- ```{r, echo=FALSE, results="hide"} options("knitr.graphics.auto_pdf"=TRUE, eval=TRUE) ``` # Introduction The `r Biocpkg("Rbwa")` package provides an **R** wrapper around the two popular *BWA* aligners `BWA-backtrack` [@bwa1] and `BWA-MEM` [@bwa2]. As mentioned in the BWA manual (see http://bio-bwa.sourceforge.net/bwa.shtml), BWA-backtrack is designed for short Illumina reads (up to 100bp), while BWA-MEM is more suitable for longer sequences (70bp to 1Mbp) and supports split alignment. # Installation `Rbwa` can be installed from Bioconductor using the following commands in a fresh R session: ```{r, eval=FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("Rbwa") ``` # Overview The two main alignment functions are: - BWA-backtrack: `bwa_aln` - BWA-MEM: `bwa_mem` The package also includes the following convenience functions: - Genome indexing: `bwa_build_index` - Conversion of `bwa_aln` output into SAM output: `bwa_sam` - Generation of secondary alignments: `xa2multi` # Build a reference index with `bwa_build_index` Both `bwa_aln` and `bwa_mem` require first to create a genome index from a FASTA file. This is done only once for a given genome. This can be done using the function `bwa_build_index`. First, we load `Rbwa`: ```{r loading, eval=TRUE} library(Rbwa) ``` In the following example code, we build a BWA index for a small portion of human chr12 that we stored in a FASTA file located within the `Rbwa` package. We store the index files in a temporary directory. ```{r build_index, eval=TRUE} dir <- tempdir() fasta <- system.file(package="Rbwa", "fasta/chr12.fa") index_prefix <- file.path(dir, "chr12") bwa_build_index(fasta, index_prefix=index_prefix) list.files(dir) ``` # Aligment with `bwa_aln` We now align read sequences stored in the toy example FASTQ file `fastq/sequences.fastq`, provided in the `Rbwa` package, to our indexed genome: ```{r bwa_aln, eval=TRUE} fastq <- system.file(package="Rbwa", "fastq/sequences.fastq") bwa_aln(index_prefix=index_prefix, fastq_files=fastq, sai_files=file.path(dir, "output.sai")) ``` Any valid BWA arguments can be passed to the `bwa_aln` function. To see the complete list of valid arguments, please visit the BWA reference manual: http://bio-bwa.sourceforge.net/bwa.shtml. For instance, we can specify the maximal edit distance between the query sequence and the reference genome to be 3 using `n`, as well as the maximal edit distance in the seed sequence `k` to be 3, where we specify that the length of the seed sequence is 13 using the argument `l`: ```{r bwa_aln2, eval=TRUE} bwa_aln(index_prefix=index_prefix, fastq_files=fastq, sai_files=file.path(dir, "output.sai"), n=3, k=3, l=13) ``` ## Creating a SAM file The output of `bwa_aln` is an intermediate `sai` file that should be converted into a `sam` file using the `bwa_sam` function as follows: ```{r bwa_sam, eval=TRUE} bwa_sam(index_prefix=index_prefix, fastq_files=fastq, sai_files=file.path(dir, "output.sai"), sam_file=file.path(dir, "output.sam")) ``` Let's read the first few lines of the SAM file: ```{r reading1, eval=TRUE} strtrim(readLines(file.path(dir, "output.sam")), 65) ``` ## Creating a SAM file with secondary alignments By default, each row of the SAM output corresponds to the best alignment hit for a given input query sequence. Other alignments (secondary alignments, or other loci in case of multiple alignments) are stored in the XA tag. The function `xa2multi` conveniently extracts the alignments from the XA tags and represent them as additional rows in the SAM format. This can be executed as follows: ```{r bwa_sam_multi, eval=TRUE} xa2multi(file.path(dir, "output.sam"), file.path(dir, "output.multi.sam")) strtrim(readLines(file.path(dir, "output.multi.sam")), 65) ``` # Aligment with `bwa_mem` The `bwa_mem` function works similar to the `bwa_aln` function, except that it does not produce intermediate `.sai` files; it outputs a SAM file directly: ```{r bwa_mem, eval=TRUE} fastq <- system.file(package="Rbwa", "fastq/sequences.fastq") bwa_mem(index_prefix=index_prefix, fastq_files=fastq, sam_file=file.path(dir, "output.sam")) ``` ```{r reading2, eval=TRUE} strtrim(readLines(file.path(dir, "output.sam")), 65) ``` # Session info ```{r} sessionInfo() ``` # References