%\VignetteIndexEntry{03 Annotation -- Exercises}
%\VignetteEngine{knitr::knitr}

\documentclass{article}

<<>>=
options(max.print=1000)
stopifnot(BiocInstaller::biocVersion() == "2.13")
BiocStyle::latex()
library(knitr)
opts_chunk$set(cache=TRUE, tidy=FALSE)
@

<<>>=
suppressPackageStartupMessages({
    library(org.Hs.eg.db)
    library(TxDb.Hsapiens.UCSC.hg19.knownGene)
    library(BSgenome.Hsapiens.UCSC.hg19)
    library(rtracklayer)
    library(biomaRt)
})
@

\title{Practical: Annotations}
\author{Martin Morgan (\url{mtmorgan@fhcrc.org})}
\date{3 February 2014}

\newcommand{\Hsap}{\emph{H.~sapiens}}
\newcommand{\Dmel}{\emph{D.~melanogaster}}

\usepackage{theorem}
\newtheorem{Ext}{Exercise}
\newenvironment{Exercise}{
  \renewcommand{\labelenumi}{\alph{enumi}.}\begin{Ext}%
}{\end{Ext}}
\newenvironment{Solution}{%
  \noindent\textbf{Solution:}\renewcommand{\labelenumi}{\alph{enumi}.}%
}{\bigskip}

\setlength{\abovecaptionskip}{6pt}
\setlength{\belowcaptionskip}{6pt}

\begin{document}

\maketitle
\tableofcontents

\section{Gene annotation}

\subsection{Data packages}

Organism-level (`org') packages contain mappings between a central
identifier (e.g., Entrez gene ids) and other identifiers (e.g.,
GenBank or UniProt accession numbers, RefSeq ids). The name of an org
package is always of the form \texttt{org.<Sp>.<id>.db}
(e.g. \Biocannopkg{org.Hs.eg.db}) where \texttt{<Sp>} is a 2-letter
abbreviation of the organism (e.g. \texttt{Hs} for \emph{Homo
  sapiens}) and \texttt{<id>} is a lower-case abbreviation describing
the type of central identifier (e.g. \texttt{eg} for Entrez gene
identifiers). The ``How to use the `.db' annotation packages''
vignette in the \Biocpkg{AnnotationDbi} package is a key reference;
org packages are only one type of `.db' annotation package. The `.db'
and most other \Bioconductor{} annotation packages are updated every
6 months.

Annotation packages usually contain an object named after the package
itself.
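
As a minimal sketch of the naming convention and of the last point,
the chunk below splits a package name into its parts and, only if the
package is installed, checks that it provides an object with the same
name. The variable names \Robject{pkg}, \Robject{parts} and
\Robject{db} are illustrative assumptions, not part of any package.

<<>>=
## Illustrative only: 'pkg', 'parts' and 'db' are made-up names.
pkg <- "org.Hs.eg.db"

## Split the name on the literal "." separator: prefix, organism
## abbreviation, central identifier type, suffix.
parts <- strsplit(pkg, ".", fixed=TRUE)[[1]]
names(parts) <- c("prefix", "species", "id", "suffix")
parts[c("species", "id")]

## When the package is installed, it exports an object named after
## the package itself; test whether it extends AnnotationDb.
if (requireNamespace(pkg, quietly=TRUE)) {
    db <- getExportedValue(pkg, pkg)
    methods::is(db, "AnnotationDb")
}
@
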
These objects are collectively called \Rclass{AnnotationDb} objects,
with more specific classes named \Rclass{OrgDb}, \Rclass{ChipDb} or
\Rclass{TranscriptDb}. Methods that can be applied to these objects
include \Rfunction{columns}, \Rfunction{keys}, \Rfunction{keytypes}
and \Rfunction{select}. Common operations for retrieving annotations
are summarized in Table~\ref{tab:select-ops}.

\begin{table}
  \centering
  \caption{Common operations for retrieving and manipulating annotations.}
  \label{tab:select-ops}
  \begin{tabular}{lll}
    Category & Function & Description \\
    \hline\noalign{\smallskip}
    Discover & \Rfunction{columns}
      & List the kinds of columns that can be returned \\
    & \Rfunction{keytypes}
      & List columns that can be used as keys \\
    & \Rfunction{keys}
      & List values that can be expected for a given keytype \\
    & \Rfunction{select}
      & Retrieve annotations matching \Rcode{keys}, \Rcode{keytype}
        and \Rcode{columns} \\
    Manipulate
    & \Rfunction{setdiff}, \Rfunction{union}, \Rfunction{intersect}
      & Operations on sets \\
    & \Rfunction{duplicated}, \Rfunction{unique}
      & Mark or remove duplicates \\
    & \Rfunction{\%in\%}, \Rfunction{match}
      & Find matches \\
    & \Rfunction{any}, \Rfunction{all}
      & Are any \Rcode{TRUE}? Are all? \\
    & \Rfunction{merge}
      & Combine two different \Robject{data.frames} based on shared keys \\
    \Rclass{GRanges*}
    & \Rfunction{transcripts}, \Rfunction{exons}, \Rfunction{cds}
      & Features (transcripts, exons, coding sequence) as \Rclass{GRanges}. \\
    & \Rfunction{transcriptsBy}, \Rfunction{exonsBy},
      & Features grouped by gene, transcript, etc., as \Rclass{GRangesList}. \\
    & \Rfunction{cdsBy} \\
    \hline
  \end{tabular}
\end{table}

\begin{Exercise}
  This exercise illustrates basic use of the `select' interface to
  annotation packages.
  \begin{enumerate}
  \item What is the name of the org package for \emph{Homo sapiens}?
    Load it. Display the \Rclass{OrgDb} object for the
    \Biocannopkg{org.Hs.eg.db} package.
    Use the \Rfunction{keytypes} and \Rfunction{columns} methods to
    discover which sorts of annotations can be queried and extracted.
  \item Here are some ENTREZID values.
<<>>=
egids <- c("3183", "91828", "81537", "4776", "283624",
           "4053", "85446", "10484", "55701", "1112")
@
    %%
    These are the most strongly differentially expressed genes from a
    subset of an RNA-seq differential expression analysis that you
    will encounter later in the course. The biological background is
    provided in \cite{pmid23374342}; see the ArrayExpress entry for
    \href{https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-1147/}{E-MTAB-1147}.
    The data are from chromosome 14 only.

    Use the ENTREZIDs with the \Rfunction{select} method to extract
    the SYMBOL (gene symbol) and GENENAME information for each. To
    what extent do the differentially expressed genes make biological
    sense?
  \end{enumerate}
\end{Exercise}

\begin{Solution}
  The `org' package for humans (\emph{Homo sapiens}) is
  \Biocannopkg{org.Hs.eg.db}. Load the package.
<<>>=
library(org.Hs.eg.db)
@
  Discover the key types and columns in the annotation package.
<