Bioconductor - Bioconductor Newsletter, April 2014 Issue

Home
Help
Bioconductor Newsletter, April 2014 Issue

posted by Valerie Obenchain, April 2014

Introduction
Infrastructure
Networks / Annotations
- AnnotationHub package
- RefNet package
Web Site-ish
- biocViews terms
- BiocCheck package
Programming bits
- Hits class

Introduction

Welcome to the first Bioconductor newsletter! In this space we will provide project updates from the Seattle team and recent contributions from core developers. Topics will include infrastructure reorganizations, new packages and features, conferences, and other areas of interest.

The newsletter will be on a quarterly schedule with the intent of summarizing major developments of the past 3 months. Comments and suggestions are welcome, please send to Valerie at vobencha@fhcrc.org.

Infrastructure

`GenomicAlignments` package

Last fall Hervé did some housecleaning on the infrastructure packages. All GAlignments* classes and related functions were moved from GenomicRanges and Rsamtools to the new GenomicAlignments package. Classes and functions from GenomicRanges that built on functionality in Rsamtools were also moved. The reorg substantially cleaned up our dependency tree and will be reflected in the upcoming release, BioC 2.14. (ASCII art courtesy of Hervé).

                  IRanges
                  ^     ^
                  |     |
                  |    XVector
                  |     ^
                  |     |
       GenomicRanges   Biostrings
                  ^     ^
                  |     |
                 Rsamtools
                     ^
                     |
                 BSgenome
                     ^
                     |
             GenomicAlignments
                     ^
                     |
                 rtracklayer
                     ^
                     |
               GenomicFeatures

Junction tools in `GenomicAlignments`

New tools for extracting junctions from genomic alignments have been added to the GenomicAlignments package. The junctions() function, previously known as gaps(), extracts the portion of a read corresponding to the N operation in the CIGAR.

New junction-related operations:

summarize the number of reads that overlap each junction
retrieve junction motifs based on dinucleotides at junction boundaries (BSgenome package required)
import junction files generated by TopHat and STAR aligners
determine if the RNA-seq protocol was stranded

`GenomeInfoDb` package

Converting between chromosome names from different institutions presents a challenge when working with ranges and annotations. The new GenomeInfoDb package is a step towards resolving these inconsistencies by providing a mapping interface between styles (e.g., UCSC, NCBI, Ensembl, …) and organisms.

Low-level functions related to seqnames and SeqnamesStyle were moved from GenomicRanges and AnnotationDbi into GenomeInfoDb. Some functions were deprecated, some renamed. See the NEWS file for a complete listing of activities.

The package is intended to serve as the base for methods querying or transforming seqlevels. Future work will address the problem of auto-detecting incompatible styles between objects in operations that require matching seqlevels.

Sonali, Marc, Hervé and Martin have all contributed to this work.

Currently 9 organisms are supported:

library(GenomeInfoDb)
names(genomeStyles())

## [1] "Arabidopsis_thaliana"     "Caenorhabditis_elegans"  
## [3] "Cyanidioschyzon_merolae"  "Drosophila_melanogaster" 
## [5] "Homo_sapiens"             "Oryza_sativa"            
## [7] "Populus_trichocarpa"      "Saccharomyces_cerevisiae"
## [9] "Zea_mays"

There are methods for style discovery,

seqlevelsStyle("chr3")

## [1] "UCSC"

and for converting between styles. Here we map from UCSC to NCBI:

mapSeqlevels(c("chrII", "chrIII", "chrM"), "NCBI")

##  chrII chrIII   chrM 
##   "II"  "III"   "MT"

Networks / Annotations

`AnnotationHub` package

The AnnotationHub resource continues to grow steadily. As of March 2014 the hub supports 204 genomes and 98 species from 7 data providers (Ensembl, NCBI, UCSC, dbSNP, HAEMCODE, Encode, Inparanoid 8). The hub is a repository of annotation data parsed from original file format to R objects for convient computing. For example, BED, GTF, and track files from UCSC are available as GRanges. VCF files from dbSNP have been parsed into VCF objects and FASTA files are available as FaFile objects.

Marc, Dan and Paul have led this effort.

`RefNet` package

A new package to watch for in the Spring release is RefNet. This package offers access to molecular interaction data in computable form and is the first of its kind in Bioconductor. The package offers access to 2 transcription factor networks and Recon 2 (consensus metabolic reconstruction compiled from 5 resources). Queries are made using the PSICQUIC or RefNet.db packages.

These data can be used to curate or study experiment-specific pathways, explore the function and relationships of differentially expressed genes, and offer insight to a variety of explorations at the molecular level.

RefNet is extensible and we encourage contributions. Networks, known interactions, infection pathways, disease-host interactions, or pathways from manuscripts would be valuable additions. Guidance for contributing data are included in the package.

Credit for RefNet, RefNet.db and PSICQUIC goes to Paul with contributions from Mentored Projects mentees.

Web Site-ish

biocViews terms

The biocViews terms used to categorize packages have been been refreshed. New terms were added, old terms removed and packages with old or invalid terms were updated. The updated web-interface allows searching by biocViews term or subject. A check box option displays terms not associated with any package. Thanks to Sonali and Dan for the work on this.

`BiocCheck` package

BiocCheck is a new package that encapsulates Bioconductor package guidelines and best practices. It was designed to provide guidance to new package authors and is run by the SinglePackageBuilder during the package review process.

BiocCheck can also be used from the command line

R CMD BiocCheck package

and should be run after R CMD check successfully completes. A full-featured vignette explains the suggestions made by the package. Dan is the architect behind this one.

Programming bits

`Hits` class

One of my favorite objects is the Hits class. It is a simple, straightforward class that can be used for complicated indexing and combining. It stores a set of “hits” between two vector-like objects and is most familiar as the output of a findOverlaps operation.

Here we use Hits to identify and remove the “shared” regions in a set of ranges. The shared region in the original set is positions 5 to 10. Range 3 is completely within this shared range.

library(GenomicRanges)
gr1 <- GRanges("A", IRanges(c(1, 5, 5), c(10, 15, 10)))
gr1

## GRanges with 3 ranges and 0 metadata columns:
##       seqnames    ranges strand
##             
##   [1]        A   [1, 10]      *
##   [2]        A   [5, 15]      *
##   [3]        A   [5, 10]      *
##   ---
##   seqlengths:
##     A
##    NA




First we disjoin the ranges and perform `findOverlaps` which
will give us a `Hits` object:

gr2 <- disjoin(gr1)
ov <- findOverlaps(gr1, gr2)
ov

## Hits of length 5
## queryLength: 3
## subjectLength: 3
##   queryHits subjectHits 
##        
##  1         1           1 
##  2         1           2 
##  3         2           2 
##  4         2           3 
##  5         3           2




Convenience functions `countQueryHits' and `countSubjectHits'
summarize the number of times each record in the `query` or
`subject` was hit. 

countSubjectHits(ov)

## [1] 1 3 1




The regions that are not shared between the ranges will be
counted only once (self-hit).

keep_idx <- which(countSubjectHits(ov) == 1L)




Accessors `queryHits` and `subjectHits` extract the corresponding
columns from the `Hits` object:

subjectHits(ov)

## [1] 1 2 2 3 2




The extracted `subjectHits` can be used in a matching operation
to subset the `Hits` object containing just the "unshared" regions:

ov <- ov[subjectHits(ov) %in% keep_idx]
ov

## Hits of length 2
## queryLength: 3
## subjectLength: 3
##   queryHits subjectHits 
##        
##  1         1           1 
##  2         2           3




Again `subjectHits` are extracted but this time from the subset
`Hits` object. We use this index to subset the disjoined original
ranges leaving only the "unshared" regions:

ans_regions <- gr2[subjectHits(ov)]
ans_regions

## GRanges with 2 ranges and 0 metadata columns:
##       seqnames    ranges strand
##             
##   [1]        A  [ 1,  4]      *
##   [2]        A  [11, 15]      *
##   ---
##   seqlengths:
##     A
##    NA




To finish this off we make a `GRangesList` of the unshared regions.
The list is length 3 corresponding the the 3 ranges in `gr1':

ans_eltlens <- countQueryHits(ov)
ans_skeleton <- PartitioningByEnd(cumsum(ans_eltlens))
relist(ans_regions, ans_skeleton)

## GRangesList of length 3:
## [[1]] 
## GRanges with 1 range and 0 metadata columns:
##       seqnames    ranges strand
##             
##   [1]        A    [1, 4]      *
## 
## [[2]] 
## GRanges with 1 range and 0 metadata columns:
##       seqnames   ranges strand
##   [1]        A [11, 15]      *
## 
## [[3]] 
## GRanges with 0 ranges and 0 metadata columns:
##      seqnames ranges strand
## 
## ---
## seqlengths:
##   A
##  NA




Thanks to Michael and Hervé for the work on the `Hits` class.


## Self-matching

Recently I had a question about splitting on groups/factors. I
thought the solution Hervé came up with was quite elegant and worth 
sharing.

The problem boils down to converting the 'group1' vector to 'group2':

group1 <- c(3, 1, 2, 2)
group2 <- c(1, 2, 3, 3)




The motivating use case was to split a GRanges based on the 
groups in `group1` but not the implied order.

When we split this GRanges

x <- GRanges(c(rep("chr2", 2), rep("chr19", 2)), IRanges(1:4, width=1))
x

## GRanges with 4 ranges and 0 metadata columns:
##       seqnames    ranges strand
##             
##   [1]     chr2    [1, 1]      *
##   [2]     chr2    [2, 2]      *
##   [3]    chr19    [3, 3]      *
##   [4]    chr19    [4, 4]      *
##   ---
##   seqlengths:
##    chr19  chr2
##       NA    NA




by `group1` the object is re-ordered to match the priority
in `group1`.

split(x, group1)

## GRangesList of length 3:
## $1 
## GRanges with 1 range and 0 metadata columns:
##       seqnames    ranges strand
##             
##   [1]     chr2    [2, 2]      *
## 
## $2 
## GRanges with 2 ranges and 0 metadata columns:
##       seqnames ranges strand
##   [1]    chr19 [3, 3]      *
##   [2]    chr19 [4, 4]      *
## 
## $3 
## GRanges with 1 range and 0 metadata columns:
##       seqnames ranges strand
##   [1]     chr2 [1, 1]      *
## 
## ---
## seqlengths:
##  chr19  chr2
##     NA    NA




This is expected behavior but wasn't exactly the result I was
after. I wanted to respect the grouping in `group1` but not the order.
Essentially I wanted to turn `group1` in to `group2` so
I would get this result:

split(x, group2)

## GRangesList of length 3:
## $1 
## GRanges with 1 range and 0 metadata columns:
##       seqnames    ranges strand
##             
##   [1]     chr2    [1, 1]      *
## 
## $2 
## GRanges with 1 range and 0 metadata columns:
##       seqnames ranges strand
##   [1]     chr2 [2, 2]      *
## 
## $3 
## GRanges with 2 ranges and 0 metadata columns:
##       seqnames ranges strand
##   [1]    chr19 [3, 3]      *
##   [2]    chr19 [4, 4]      *
## 
## ---
## seqlengths:
##  chr19  chr2
##     NA    NA




Solution:

One straight-forward solution is to re-order the GRangesList object 
once created. 

grl <- split(x, group1)
grl[as.character(unique(group1))] 




However the data were large so I was hoping to solve the ordering 
problem before creating the list. The second solution was to
transform `group1` with a "self-match". 

match(group1, group1)

## [1] 1 2 3 3




A "self match" on any vector-like object 'x' is an easy and
very efficient way to assign unique ids to unique values in 'x'.
Also the mapping from unique values in 'x' and ids is guaranteed
to be strictly increasing, which is what I needed for my use
case.

xx <- c("Z", "B", "C", "C", "A", "B")
ids <- match(xx, xx)
ids

## [1] 1 2 3 3 5 2

ids[!duplicated(xx)]  # always strictly increasing

## [1] 1 2 3 5




Hervé's 2 line solution to my splitting problem:

grl2 <- split(x, match(group1, group1))
names(grl2) <- unique(group1)




More detail on the `Hits` class can be found on the man page:

    ?Hits

The ["HOWTOs" vignette](http://www.bioconductor.org/packages/devel/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesHOWTOs.pdf)
in `GenomicRanges` demonstrates other
practical workflows that make use of the extensive `GenomicRanges`
infrastructure.

    library(GenomicRanges)
    browseVignettes("GenomicRanges")
 

## Courses / Conferences

In January Martin, Sean and Hervé headed south to Ribeirão Preto, Brazil
to teach at the 
[Summer Bioinformatics course](http://bioinformatics.fmrp.usp.br/curso2014/) 
hosted by the Ribeirão Preto Medical School. The course was organized by 
Benilton Carvalho and Houtan Noushmehr and focused on the analysis and 
comprehension of high throughput sequence data. 

In Seattle we offer several short courses throughout the year. The last
was in February given by Martin and Sonali. This was a 2 day introductory 
course covering the basics of working with large data, class structure, 
integrated workflows and the analysis of RNAseq differential expresssion data.

Upcoming events and past course materials can be found on the 
[web site](http://www.bioconductor.org/help/course-materials/).


## Looking ahead

The spring release is just weeks away. We plan to branch on April 11th
and release on the 14th. The full schedule is
[posted online](http://www.bioconductor.org/developers/release-schedule/).

BioC 2014 is in Boston this year (July 30 - August 1). Watch the 
[web site](http://www.bioconductor.org/help/course-materials/2014/BioC2014/)
for updates.

The July newsletter will (probably) include copy number packages on the
rise, the `GenomicFileViews` package, and a summary of tally/coverage 
functions. If there are topics you'd like to see covered let us know!

Hosting provided by Fred Hutchinson Cancer Research Center

Bioconductor Newsletter

Contents

Introduction

Infrastructure

GenomicAlignments package

Junction tools in GenomicAlignments

GenomeInfoDb package