posted by Valerie Obenchain, April 2014
Welcome to the first Bioconductor newsletter! In this space we will provide project updates from the Seattle team and recent contributions from core developers. Topics will include infrastructure reorganizations, new packages and features, conferences, and other areas of interest.
The newsletter will be on a quarterly schedule with the intent of summarizing major developments of the past 3 months. Comments and suggestions are welcome, please send to Valerie at vobencha@fhcrc.org.
GenomicAlignments
packageLast fall Hervé did some housecleaning on the infrastructure
packages. All GAlignments*
classes and related functions were
moved from
GenomicRanges
and
Rsamtools
to the new
GenomicAlignments
package. Classes and functions from GenomicRanges
that built on
functionality in Rsamtools
were also moved. The reorg substantially
cleaned up our dependency tree and will be reflected in the upcoming
release, BioC 2.14. (ASCII art courtesy of Hervé).
IRanges
^ ^
| |
| XVector
| ^
| |
GenomicRanges Biostrings
^ ^
| |
Rsamtools
^
|
BSgenome
^
|
GenomicAlignments
^
|
rtracklayer
^
|
GenomicFeatures
GenomicAlignments
New tools for extracting junctions from genomic alignments have been
added to the GenomicAlignments
package. The junctions()
function,
previously known as gaps()
, extracts the portion of a read
corresponding to the N operation in the CIGAR.
New junction-related operations:
GenomeInfoDb
packageConverting between chromosome names from different institutions presents a challenge when working with ranges and annotations. The new GenomeInfoDb package is a step towards resolving these inconsistencies by providing a mapping interface between styles (e.g., UCSC, NCBI, Ensembl, …) and organisms.
Low-level functions related to seqnames and SeqnamesStyle were
moved from GenomicRanges
and
AnnotationDbi
into GenomeInfoDb
. Some functions were deprecated, some renamed.
See the
NEWS
file for a complete listing of activities.
The package is intended to serve as the base for methods querying or transforming seqlevels. Future work will address the problem of auto-detecting incompatible styles between objects in operations that require matching seqlevels.
Sonali, Marc, Hervé and Martin have all contributed to this work.
Currently 9 organisms are supported:
library(GenomeInfoDb) names(genomeStyles())
## [1] "Arabidopsis_thaliana" "Caenorhabditis_elegans" ## [3] "Cyanidioschyzon_merolae" "Drosophila_melanogaster" ## [5] "Homo_sapiens" "Oryza_sativa" ## [7] "Populus_trichocarpa" "Saccharomyces_cerevisiae" ## [9] "Zea_mays"
There are methods for style discovery,
seqlevelsStyle("chr3")
## [1] "UCSC"
and for converting between styles. Here we map from UCSC to NCBI:
mapSeqlevels(c("chrII", "chrIII", "chrM"), "NCBI")
## chrII chrIII chrM ## "II" "III" "MT"
AnnotationHub
packageThe
AnnotationHub
resource continues to grow steadily. As of March
2014 the hub supports 204 genomes and 98 species from 7 data providers
(Ensembl, NCBI, UCSC, dbSNP, HAEMCODE, Encode, Inparanoid 8). The hub
is a repository of annotation data parsed from original file format
to R objects for convient computing. For example, BED, GTF, and track
files from UCSC are available as GRanges
. VCF files from dbSNP have
been parsed into VCF
objects and FASTA files are available as FaFile
objects.
Marc, Dan and Paul have led this effort.
RefNet
packageA new package to watch for in the Spring release is RefNet. This package offers access to molecular interaction data in computable form and is the first of its kind in Bioconductor. The package offers access to 2 transcription factor networks and Recon 2 (consensus metabolic reconstruction compiled from 5 resources). Queries are made using the PSICQUIC or RefNet.db packages.
These data can be used to curate or study experiment-specific pathways, explore the function and relationships of differentially expressed genes, and offer insight to a variety of explorations at the molecular level.
RefNet
is extensible and we encourage contributions. Networks, known
interactions, infection pathways, disease-host interactions, or pathways
from manuscripts would be valuable additions. Guidance for contributing
data are included in the package.
Credit for RefNet
, RefNet.db
and PSICQUIC
goes to Paul with
contributions from Mentored Projects mentees.
The biocViews terms used to categorize packages have been been refreshed. New terms were added, old terms removed and packages with old or invalid terms were updated. The updated web-interface allows searching by biocViews term or subject. A check box option displays terms not associated with any package. Thanks to Sonali and Dan for the work on this.
BiocCheck
packageBiocCheck is a new package that encapsulates Bioconductor package guidelines and best practices. It was designed to provide guidance to new package authors and is run by the SinglePackageBuilder during the package review process.
BiocCheck
can also be used from the command line
R CMD BiocCheck package
and should be run after R CMD check
successfully completes. A
full-featured vignette
explains the suggestions made by the package.
Dan is the architect behind this one.
Hits
classOne of my favorite objects is the Hits
class. It is a simple,
straightforward class that can be used for complicated indexing and
combining. It stores a set of “hits” between two vector-like objects and
is most familiar as the output of a findOverlaps
operation.
Here we use Hits
to identify and remove the “shared” regions in a
set of ranges. The shared region in the original set is positions 5 to 10.
Range 3 is completely within this shared range.
library(GenomicRanges) gr1 <- GRanges("A", IRanges(c(1, 5, 5), c(10, 15, 10))) gr1
## GRanges with 3 ranges and 0 metadata columns: ## seqnames ranges strand #### [1] A [1, 10] * ## [2] A [5, 15] * ## [3] A [5, 10] * ## --- ## seqlengths: ## A ## NA First we disjoin the ranges and perform `findOverlaps` which will give us a `Hits` object: gr2 <- disjoin(gr1) ov <- findOverlaps(gr1, gr2) ov## Hits of length 5 ## queryLength: 3 ## subjectLength: 3 ## queryHits subjectHits #### 1 1 1 ## 2 1 2 ## 3 2 2 ## 4 2 3 ## 5 3 2 Convenience functions `countQueryHits' and `countSubjectHits' summarize the number of times each record in the `query` or `subject` was hit. The regions that are not shared between the ranges will be counted only once (self-hit).countSubjectHits(ov)## [1] 1 3 1Accessors `queryHits` and `subjectHits` extract the corresponding columns from the `Hits` object:keep_idx <- which(countSubjectHits(ov) == 1L)The extracted `subjectHits` can be used in a matching operation to subset the `Hits` object containing just the "unshared" regions:subjectHits(ov)## [1] 1 2 2 3 2ov <- ov[subjectHits(ov) %in% keep_idx] ov## Hits of length 2 ## queryLength: 3 ## subjectLength: 3 ## queryHits subjectHits #### 1 1 1 ## 2 2 3 Again `subjectHits` are extracted but this time from the subset `Hits` object. We use this index to subset the disjoined original ranges leaving only the "unshared" regions: ans_regions <- gr2[subjectHits(ov)] ans_regions## GRanges with 2 ranges and 0 metadata columns: ## seqnames ranges strand #### [1] A [ 1, 4] * ## [2] A [11, 15] * ## --- ## seqlengths: ## A ## NA To finish this off we make a `GRangesList` of the unshared regions. The list is length 3 corresponding the the 3 ranges in `gr1': ans_eltlens <- countQueryHits(ov) ans_skeleton <- PartitioningByEnd(cumsum(ans_eltlens)) relist(ans_regions, ans_skeleton)## GRangesList of length 3: ## [[1]] ## GRanges with 1 range and 0 metadata columns: ## seqnames ranges strand #### [1] A [1, 4] * ## ## [[2]] ## GRanges with 1 range and 0 metadata columns: ## seqnames ranges strand ## [1] A [11, 15] * ## ## [[3]] ## GRanges with 0 ranges and 0 metadata columns: ## seqnames ranges strand ## ## --- ## seqlengths: ## A ## NA Thanks to Michael and Hervé for the work on the `Hits` class. ## Self-matching Recently I had a question about splitting on groups/factors. I thought the solution Hervé came up with was quite elegant and worth sharing. The problem boils down to converting the 'group1' vector to 'group2': The motivating use case was to split a GRanges based on the groups in `group1` but not the implied order. When we split this GRangesgroup1 <- c(3, 1, 2, 2) group2 <- c(1, 2, 3, 3)x <- GRanges(c(rep("chr2", 2), rep("chr19", 2)), IRanges(1:4, width=1)) x## GRanges with 4 ranges and 0 metadata columns: ## seqnames ranges strand #### [1] chr2 [1, 1] * ## [2] chr2 [2, 2] * ## [3] chr19 [3, 3] * ## [4] chr19 [4, 4] * ## --- ## seqlengths: ## chr19 chr2 ## NA NA by `group1` the object is re-ordered to match the priority in `group1`. split(x, group1)## GRangesList of length 3: ## $1 ## GRanges with 1 range and 0 metadata columns: ## seqnames ranges strand #### [1] chr2 [2, 2] * ## ## $2 ## GRanges with 2 ranges and 0 metadata columns: ## seqnames ranges strand ## [1] chr19 [3, 3] * ## [2] chr19 [4, 4] * ## ## $3 ## GRanges with 1 range and 0 metadata columns: ## seqnames ranges strand ## [1] chr2 [1, 1] * ## ## --- ## seqlengths: ## chr19 chr2 ## NA NA This is expected behavior but wasn't exactly the result I was after. I wanted to respect the grouping in `group1` but not the order. Essentially I wanted to turn `group1` in to `group2` so I would get this result: split(x, group2)## GRangesList of length 3: ## $1 ## GRanges with 1 range and 0 metadata columns: ## seqnames ranges strand #### [1] chr2 [1, 1] * ## ## $2 ## GRanges with 1 range and 0 metadata columns: ## seqnames ranges strand ## [1] chr2 [2, 2] * ## ## $3 ## GRanges with 2 ranges and 0 metadata columns: ## seqnames ranges strand ## [1] chr19 [3, 3] * ## [2] chr19 [4, 4] * ## ## --- ## seqlengths: ## chr19 chr2 ## NA NA Solution: One straight-forward solution is to re-order the GRangesList object once created. However the data were large so I was hoping to solve the ordering problem before creating the list. The second solution was to transform `group1` with a "self-match".grl <- split(x, group1) grl[as.character(unique(group1))]A "self match" on any vector-like object 'x' is an easy and very efficient way to assign unique ids to unique values in 'x'. Also the mapping from unique values in 'x' and ids is guaranteed to be strictly increasing, which is what I needed for my use case.match(group1, group1)## [1] 1 2 3 3Hervé's 2 line solution to my splitting problem:xx <- c("Z", "B", "C", "C", "A", "B") ids <- match(xx, xx) ids## [1] 1 2 3 3 5 2ids[!duplicated(xx)] # always strictly increasing## [1] 1 2 3 5More detail on the `Hits` class can be found on the man page: ?Hits The ["HOWTOs" vignette](http://www.bioconductor.org/packages/devel/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesHOWTOs.pdf) in `GenomicRanges` demonstrates other practical workflows that make use of the extensive `GenomicRanges` infrastructure. library(GenomicRanges) browseVignettes("GenomicRanges") ## Courses / Conferences In January Martin, Sean and Hervé headed south to Ribeirão Preto, Brazil to teach at the [Summer Bioinformatics course](http://bioinformatics.fmrp.usp.br/curso2014/) hosted by the Ribeirão Preto Medical School. The course was organized by Benilton Carvalho and Houtan Noushmehr and focused on the analysis and comprehension of high throughput sequence data. In Seattle we offer several short courses throughout the year. The last was in February given by Martin and Sonali. This was a 2 day introductory course covering the basics of working with large data, class structure, integrated workflows and the analysis of RNAseq differential expresssion data. Upcoming events and past course materials can be found on the [web site](http://www.bioconductor.org/help/course-materials/). ## Looking ahead The spring release is just weeks away. We plan to branch on April 11th and release on the 14th. The full schedule is [posted online](http://www.bioconductor.org/developers/release-schedule/). BioC 2014 is in Boston this year (July 30 - August 1). Watch the [web site](http://www.bioconductor.org/help/course-materials/2014/BioC2014/) for updates. The July newsletter will (probably) include copy number packages on the rise, the `GenomicFileViews` package, and a summary of tally/coverage functions. If there are topics you'd like to see covered let us know!grl2 <- split(x, match(group1, group1)) names(grl2) <- unique(group1)