%\VignetteIndexEntry{Using the inSilicoDb package} %\VignetteDepends{RCurl, rjson} \documentclass{article} \usepackage{url} \usepackage{color} \newcommand{\todo}[1]{\textcolor{red}{\textbf{#1}}} \begin{document} \title{Using the inSilicoDb package} \author{Jonatan Taminau$\footnote{\texttt{jtaminau@vub.ac.be}}$} \date{CoMo, Vrije Universiteit Brussel, Pleinlaan 2, 1050 Brussels, Belgium} \maketitle \section{The inSilico database} With more than 500,000 genomic profiles freely available in the public domain, there is a huge amount of information accessible for computational biologists or bioinformaticians. However, accessibility to these data requires complex computational steps. Manual parsing of annotations and keywords, which is in most cases a necessary evil, tends to be time-consuming and is known to be error-prone. Also the wide variety of normalization and preprocessing methods makes the comparability of different existing studies hard, or even impossible. The inSilico initiative (\url{http://insilico.ulb.ac.be}) provides an answer to those problems with its freely available web-based database tool: the inSilico database\footnote{Manuscript in preparation}. Starting with all public available human Affymetrix studies from Gene Expression Omnibus (GEO) \cite{Edgar_2002} it provides those studies in a consistent and well curated form. With a direct connection to GenePattern \cite{Reich_2006} and the ability to export the data to different formats, the inSilico database is an efficient mean to re-analyse public datasets and improve reproducibility in genome-wide research. \\ To further ease the use of this vast amount of genomic data the inSilicoDb R package was developed. This package can be seen as a different front-end to the core inSilico database and, although it provides only limited functionality compared to the web-based tool, it can become very valuable for R programmers or anyone who is interested in large scale analysis using automated scripting. \\ Similar packages to retrieve gene expression data in R exist \cite{Zhu_2008, Sean_2007}, but the added value and strength of this package is tightly connected to the innovative concept of the inSilico database and will therefore circumvent common obstacles like incompatibility and missing or malformed annotations. \section{Getting started using inSilicoDb} As this section will show, accessing data from the inSilico database is surprisingly easy and straightforward. \subsection{Simple access} Suppose one is interested in a number of publicly available gene expression studies which he found while browsing Gene Expression Omnibus (GEO) or the inSilico database. Using only the GSE identifier, a completely annotated and formatted dataset can be downloaded in just seconds, without any need for further manual parsing: You need to login to access datasets and datasetinfo. Use your InSilicoDB login and an md5 hash of your password. For this example we're using a restricted test account. <>= library("inSilicoDb"); InSilicoLogin("rpackage_tester@insilicodb.com", "5c4d0b231e5cba4a0bc54783b385cc9a"); res = getDatasets("GSE4635"); eset = res[[1]]; @ The result of \texttt{getDataset} is a list, containing a Bioconductors ExpressionSet (\texttt{eset}) for every platform that exists for this dataset (in the example there is only one platform). An alternative approach to obtain the same data is to specify the platform. In this case no list but the expression set is directly returned: <>= eset = getDataset("GSE4635", "GPL96"); @ And in case the platform is unknown, the auxiliary function \texttt{getPlatforms} is provided: <>= platforms = getPlatforms("GSE4635"); print(platforms); @ Once an expression set is retrieved, all available Bioconductor packages can be applied for further analysis, as the following code illustrates: \begin{center} <>= #eset = getDataset("GSE4635", "GPL96", features = "gene"); #heatmap(exprs(eset)[1:100,]); library("limma") eset = getDataset("GSE4635", "GPL96", norm="FRMA", features = "gene"); # Find 50 most discriminating genes f = pData(eset)[ ,"Smoker"] design = model.matrix(~f); fit = eBayes(lmFit(eset,design)); t = topTable(fit, coef=2, number=50); selected = is.element(rownames(exprs(eset)), t[ ,"SYMBOL"]) eset = eset[selected, ]; labels = pData(eset)[,"Smoker"]; heatmap(exprs(eset), labCol=labels); @ \end{center} \noindent \textit{(Only the first 100 genes are printed for simplicity)}. \\ In case only the annotation information is needed and there is no need for the numerical data, the \texttt{getAnnotations} function also exists for convenience: <>= annot = getAnnotations("GSE4635", "GPL96"); pData(annot); @ \subsection{More options} By default all numerical data is retrieved the same way the original authors have submitted the data to GEO and can therefore have been processed by a wide variety of preprocessing methods. However, when combining different studies a consistent preprocessing is required and therefore all studies for which there are CEL files available, are also precomputed by applying the FRMA preprocessing method \cite{McCall_2010}. The user can retrieve those studies as fast and easy as the original ones, simply by using the optional \texttt{norm} parameter. <>= eset = getDataset("GSE4635", "GPL96", norm="FRMA"); @ All gene expression matrices contain probes as features, although it is also possible to retrieve the genes instead. This probe to gene mapping is precomputed for every dataset and can be selected using the \texttt{genes} parameter. By default probes are selected, as this is how the data was submitted to GEO. <>= eset = getDataset("GSE4635", "GPL96"); print(nrow(eset)); eset = getDataset("GSE4635", "GPL96", features="gene"); print(nrow(eset)); @ %Besides speed and user-friendliness of this tool as depicted so far, another advantage is the consistent (re)annotation of all data. If we for example look at the annotations obtained for the \texttt{GSE781} series: % %<<>>= %eset = getDataset("GSE781", "GPL96"); %print(colnames(pData(eset))); %@ % %and we compare this information with the original information in GEO (and the one retrieved by the GEOquery package) we can clearly see that the annotation information is already processed and presented in a clear and consistent manner without the additional need for extra parsing by the user. \subsection{Create your own loop...} One of the advantages of retrieving data through R is the possibility to develop a whole automated workflow in just a few lines of code. The following example illustrates the many opportunities researchers can have using this tool. In the example code below, we iterate over a list of series GSE identifiers and try to retrieve every dataset from the database. Once retrieved some basic analysis is performed (printing the number of annotations and missing values). Note that the \texttt{getDataset} function can throw an error (e.g. no internet connection, dataset is not available, etc.) which is best caught in a try-catch loop, as is shown in the example. \SweaveOpts{keep.source=TRUE} <>= lst = list("GSE4635", "GSExxx", "GSE781"); gpl = "GPL96"; for(gse in lst) { catn = function(...) { cat(...,"\n"); } catn("Processing",gse); catn("======================="); eset = tryCatch({getDataset(gse, gpl);}, error = function(x) { print(as.character(x)); NULL; }); if(is.null(eset)) { next; } catn("Number of annotations:"); catn(ncol(pData(eset))); catn("Number of missing values:"); catn(sum(is.na(exprs(eset)))); } @ \section{Conclusion} This package is built in addition to a very powerful web-based database tool for genomic analysis. Despite its simplicity, it captures many of the benefits of this tool and provides the typical R users efficient means of performing large scale genomic analysis using automated scripting. \section{Session Info} <<>>= sessionInfo() @ \bibliographystyle{plain} \bibliography{inSilicoDb} \end{document}