Home
Developers
Mentored Projects: GSOC 2014

Google Summer of Code Ideas List for 2014

This page is the main information point for Bioconductors participation in the Google summer of Code project this year (2014).

Overview

Each selected student (mentee) will be paid USD $5500 to work on a Bioconductor project for 3 months during the summer.

Students should look at the list of projects and see if any project interests them. Email the project mentors to express your interest, and describe any prior experience.

Students with ideas for Bioconductor projects not listed below are encouraged to email any of the mentors listed below with project ideas.

Students will submit project applications directly to Google.

Google will award a certain number of student slots to the Bioconductor project.

The Bioconductor administrators and mentors will rank projects in order of importance to the project, and the top projects will be funded.

Any selected students will be expected to register with the bioconductor and bioc-devel mailing lists.

There is a timeline posted at Google explaining how this works. Students are encouraged to look at this and make sure that they can commit to this. There is also a FAQ in case people have other questions that are not addressed here.

Here are our suggested ideas:

ExperimentHub project

Background/Motivation: As very large genomic data sets become more and more common, computational biologists are spending inordinate time transforming data from the format of the original resource to a format amenable to computation in their programming language of choice. The R / Bioconductor community needs programmatic access to cloud-based experimental data resources that can be readily incorporated into their own work flows.

Goal

AnnotationHub and its supporting packages are primed to support such a project. AnnotationHub provides infrastructure to make well-curated resources available to R software clients, but it needs the addition of a GUI interface to allow addition of user-supplied resources, including transformation of data into formats amenable to direct use by R clients.

The task

Work with us to create a GUI interface that does the following:

1) allows the user to add large genomic resources that have been transformed into a GenomicRanges::GRanges object along with their associated metadata to a NOSQL back-end database. 2) provides an intuitive front end using a shiny method that allows the user to upload the object that was passed in to the method up to the DB 3) checks and validates that all the metadata has been filled in appropriately when the shiny GUI is being run and then uploads that to the DB. 4) on the back end, enable a new instance of the AnnotationHubServer that knows how to listen for requests from the GUI and can add the data when appropriate. 5) Once the method and back end are both working for GenomicRanges::GRanges objects, you should also write methods for other popular Bioconductor objects such as: Biobase::ExpressionSet, GenomicRanges::SummarizedExperiment,
GenomicRanges::GrangesList.

Skills required

Familiarity with R S4 methods and with shiny.

Test (to help you get started and also to indicate your competence)

Subscribe to the mailing lists for bioconductor and bioc-devel.
Create a basic S4 method similar to what you can see in the interactiveDisplay package. Focus on the method for data.frames since it is the simplest to understand. Your method will be conceptually similar to this. Except in this case, instead of the focus being on displaying the object, this method should be intended to allow the user to collect metadata from the user about the Title, Species, TaxonomyId and Genome and then send a message back to the user R session that contains this information along with the object itself. The method should take a GRanges object as an argument and then should launch a simple GUI to extract information from the user (or even better-from the object itself), and then return that data to the user.

Mentor

Marc Carlson mcarlson@fhcrc.org

Backup Mentors

Dan Tenenbaum dtenenba@fhcrc.org Martin Morgan mtmorgan@fhcrc.org

Enabling Interactivity in Reproducible Reports

Background / Motivation

Every scientific analysis should result in a reproducible report. The ReportingTools Bioconductor package provides multiple means of report generation, including an imperative API driven by an R script, as well as a declarative interface through knitr. ReportingTools converts common R/Bioconductor data structures like data.frames and ExpressionSets into report elements, such as tables and plots, according to user-definable mappings. It supports multiple backends, with the HTML backend being the most developed. The HTML report elements have some limited interactivity (such as sortable tables). Additional interactivity is enabled through integration with the shiny package.

Our goal is to add some simple interactive HTML report elements, where the interactivity is implemented in the front-end, independent of shiny or other R instance. These would include basic plots of summaries, as well as simple genomic plots, which would display alignments, per-position summaries, and annotations along the genome. Summarized data could be directly embedded in the report, and the client would access large datasets by querying standard biological data sources like DAS, AnnotationHub, ExperimentHub and web-accessible BAM/VCF/BigWig files. This work would occur in a new or at least separate package that could be used directly, or through its integration with ReportingTools. We may be able to leverage existing work like clickme and Rcharts for the low-level plotting. The genomic plotting might rely on and even drive the development of the pViz javascript library.

The goal is for simple, lightweight plots in redistributable reports. There is no intent for this to replace the more sophisticated shiny-based solutions, nor applications like epiviz(R).

Tasks

Subscribe to the mailing lists for bioconductor and bioc-devel.
Install and explore the ReportingTools package from Bioconductor.
Gather specific use cases of ReportingTools, including the common types of embedded plots and the types of questions readers want to ask of them.
Familiarize yourself with web front-end technologies like d3.js, backbone.js, and bower.
Propose a design for the geneneration of some simple summary plots, such as scatterplots, density plots, boxplots, and barcharts, with canonical interactions like tooltips, pan/zoom, and linked selection.
After discussions with mentors, implement a prototype of the system.
Solicit feedback from members of the Bioconductor community.
Given positive feedback, solidify the implementation.
(Stretch goal) Design, prototype and implement a genomic plot that draws alignments and annotations from web-accessible data repositories.

Skills required

Proficiency in Javascript and web technologies like those listed above.
Familiarity with R; having publicly released an R package would be a plus.
Exposure to biological/sequencing use cases is a plus.

Mentors

Michael Lawrence lawrence.michael@gene.com

Backup Mentors

Martin Morgan mtmorgan@fhcrc.org Marc Carlson mcarlson@fhcrc.org

Extending mzR

Introduction

The mzR R/Bioconductor package provides a unified application programming interface to the common open and community-driven file formats and parsers available for mass spectrometry data, namely mzXML, mzML and mzData (see current vignette for details and references). It relies on C and C++ code from other third party open-source projects and the Rcpp package to, notably, provide a direct mapping from R to C++ infrastructure.

Currently, mzR provides two back-ends to read mass spectrometry raw data:

netCDF which reads, as the name implies, netCDF data
RAMP to read mzData and mzXML via the ISB RAMP parser. This back-end can also read mzML through the proteowizard RAMPadapter around the proteowizard infrastructure, but this interface is limited to the lowest common denominator between the mzXML/mzData/mzML formats.

More details about the project can be found on the official package page.

Goal

The goal is to extend current useful, yet limited capabilities of mzR by adding support for the state-of-the-art proteowizard project.

The tasks

We will provide example data in all formats and support on the domain. This project will use the mzR github page as main collaboration and communication hub.

Subscribe to the Bioconductor and Bioc-devel mailing lists.
Give us your github account to be added as a mzR collaborator on the github page.
Install and explore the mzR and Rcpp packages and the proteowizard project. We will provide simple tasks to be coded in C++ using the proteowizard code base to familiarise yourself.
Expand mzR on raw data and meta-data accession using the proteowizard 'msdata' object type. This will allow read and write support for the widely used Proteomics Standards Initiative XML-based raw data formats and their latest definitions.
Implement support for additional open formats, in particular the MSMS identification (mzIdentML) using the 'identdata' object type.
Ensure successful compilation of the updated mzR package on the respective architectures.