%\VignetteIndexEntry{AnnotationForge: Creating select Interfaces for custom Annotation resources}
%\VignetteDepends{Homo.sapiens,hom.Hs.inp.db}
%\VignetteEngine{knitr::knitr}

\documentclass[11pt]{article}

<<style, eval=TRUE, echo=FALSE, results='asis'>>=
BiocStyle::latex()
@

%% Question, Exercise, Solution
\usepackage{theorem}
\theoremstyle{break}
\newtheorem{Ext}{Exercise}
\newtheorem{Question}{Question}


\newenvironment{Exercise}{
  \renewcommand{\labelenumi}{\alph{enumi}.}\begin{Ext}%
}{\end{Ext}}
\newenvironment{Solution}{%
  \noindent\textbf{Solution:}\renewcommand{\labelenumi}{\alph{enumi}.}%
}{\bigskip}




\title{Creating select Interfaces for custom Annotation resources}
\author{Marc Carlson and Valerie Obenchain}


<<include=FALSE>>=
library(knitr)
opts_chunk$set(tidy=FALSE)
@

\begin{document}

\maketitle



\section{Introduction}
The most common interface for retrieving data in \Bioconductor{} is
now the \Rfunction{select} method.  The interface provides a simple way
of extracting data.  

There are really four methods that work together to provide a
\Rfunction{select} interface.  The first one is \Rfunction{columns}, which
tells you what kinds of values you can retrieve as columns in
the final result.

<<Homo.sapiens>>=
library(RSQLite)
library(Homo.sapiens)
columns(Homo.sapiens)
@ 

The next method is \Rfunction{keytypes}, which tells you the kinds of
things that can be used as keys.

<<Homo.sapiens2>>=
keytypes(Homo.sapiens) 
@ 

The third method is \Rfunction{keys}, which is used to retrieve all the
viable keys of a particular type.

<<Homo.sapiens3>>=
k <- head(keys(Homo.sapiens,keytype="ENTREZID"))
k
@ 

And finally there is \Rfunction{select}, which extracts data by using
values supplied by the other methods.

<<Homo.sapiens4>>=
result <- select(Homo.sapiens, keys=k, 
                 columns=c("TXNAME","TXSTART","TXSTRAND"), 
                 keytype="ENTREZID")
head(result)
@ 


But why would we want to implement these specific methods?  It's a
fair question.  Why write a select interface for our annotation data?
Why not just save a .rda file to the data directory and be done with
it?  There are basically two reasons.  The first is convenience for
end users.  When your end users can access your data using the same
four methods that they use everywhere else, retrieving the data
becomes nearly effortless.  And things that benefit your users benefit
you.

The second reason is that by enabling a consistent interface across
all annotation resources, we make it possible to use these resources
in a programmatic manner.  By implementing a select interface, we are
creating a universal API for the whole project.

Let's look again at the example described above and think about what
is happening.  The \Biocannopkg{Homo.sapiens} package is able to
integrate data from many different resources largely because the
separate resources all implemented a select method.  This allows the
\Biocpkg{OrganismDbi} package to pull together resources from
\Biocannopkg{org.Hs.eg.db}, \Biocannopkg{GO.db} and
\Biocannopkg{TxDb.Hsapiens.UCSC.hg19.knownGene}.  


\begin{figure}[ht]
\centering
\includegraphics[width=.75\textwidth]{Homo_sapiens.pdf}
\caption{Packages and relationships represented by the Homo.sapiens package}
\label{fig:dbtypes}
\end{figure}


If these packages all exposed different interfaces for retrieving the
data, then it would be a lot more challenging to retrieve it, and
writing general code that retrieved the appropriate data would be a
lost cause.  So implementing a set of select methods is a way to
convert your package from a data file into an actual resource.


\section{Creating other kinds of Annotation packages}

A few more automated options already exist for generating specific
kinds of annotation packages. For users who seek to make custom chip
packages, users should see the \emph{SQLForge: An easy way to create a
new annotation package with a standard database schema.} in the
\Biocpkg{AnnotationForge} package.  And, for users who seek to make a
probe package, there is another vignette called \emph{Creating probe
packages} that is also in the \Biocpkg{AnnotationForge} package. And
finally, for custom organism packages users should look at the manual
page for \Rfunction{makeOrgPackageFromNCBI}.  This function will
attempt to make you an simplified organism package from NCBI
resources.  However, this function is not meant as a way to refresh
annotation packages between releases.  It is only meant for people who
are working on less popular model organisms (so that annotations can
be made available in this format).
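For illustration only, a call might look something like the following
sketch (not evaluated here; the argument values are placeholders, so
consult the manual page for the authoritative details):

<<makeOrgSketch, eval=FALSE>>=
## A sketch only: the argument values below are illustrative placeholders.
makeOrgPackageFromNCBI(version="0.1",
                       author="Some One <so@someplace.org>",
                       maintainer="Some One <so@someplace.org>",
                       outputDir=".",
                       tax_id="59729",
                       genus="Taeniopygia",
                       species="guttata")
@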

But what if you had another kind of web resource or database and you
wanted to expose it to the world using something like this new
\Rfunction{select} method interface?  How could you go about this?



\section{Retrieving data from a web resource}

If you choose to expose a web resource, then you will need to learn
some skills for retrieving that data from the web.  The \R{}
programming language has tools that can help you interact with web
resources, pulling down files that are tab-delimited or formatted as
XML, etc.  There are also packages that can help you parse what you
retrieve.  In this section we will describe some of these resources in
the context of the UniProt web service, and give examples to
demonstrate how you can expose resources like this for your own
purposes.

These days many web services are exposed using a representational
state transfer (RESTful) interface.  An example of this is the set of
services offered at UniProt.  Starting with the UniProt base URI, you
simply add details to indicate what it is that you wish to retrieve.

So in the case of UniProt, the base URI for the service we want today is this:

\begin{verbatim}
    http://www.uniprot.org/uniprot/
\end{verbatim}

This URI can be extended to retrieve individual UniProt records by
specifying a query argument like this:

\begin{verbatim}
    http://www.uniprot.org/uniprot/?query=P13368
\end{verbatim}

We can then request multiple records like this:

\begin{verbatim}
    http://www.uniprot.org/uniprot/?query=P13368+or+Q6GZX4
\end{verbatim}

And we can ask that the records be returned to us in tabular form by
adding another argument, like this:

\begin{verbatim}
    http://www.uniprot.org/uniprot/?query=P13368+or+Q6GZX4&format=tab
\end{verbatim}

As you might guess, each RESTful interface is a little different, but
once you have read the documentation for a given RESTful interface,
you can easily start to retrieve the data in \R.  Here is an
example.

<<URI Example>>=
   uri <- 'http://www.uniprot.org/uniprot/?query='
   ids <- c('P13368', 'Q6GZX4')
   idStr <- paste(ids, collapse="+or+")
   format <- '&format=tab'
   fullUri <- paste0(uri,idStr,format)
   read.delim(fullUri)
@ 

\begin{Exercise}
If you use the columns argument you can also specify which columns you
want returned.  So for example, you can choose to only have the
sequence and id columns returned like this:

\begin{verbatim}
    http://www.uniprot.org/uniprot/?query=P13368+or+Q6GZX4&format=tab&columns=id,sequence
\end{verbatim}

Use this detail about the UniProt web service along with what was
learned above to write a function that takes a character vector of
UniProt IDs and another character vector of column names, and then
returns the appropriate values.  Be careful to filter out any extra
records that the service returns.

\end{Exercise}
\begin{Solution}
<<web service code>>=
getUniprotGoodies  <- function(query, columns)
{
    ## query and columns start as character vectors
    qstring <- paste(query, collapse="+or+")
    cstring <- paste(columns, collapse=",")
    uri <- 'http://www.uniprot.org/uniprot/?query='
    fullUri <- paste0(uri,qstring,'&format=tab&columns=',cstring)
    dat <- read.delim(fullUri, stringsAsFactors=FALSE)
    ## now remove things that were not in the specific original query...
    dat <- dat[dat[,1] %in% query,]
    dat
}
@
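For example, assuming the service behaves as described above, the
function could be used like this (not evaluated here):

<<webServiceUsage, eval=FALSE>>=
getUniprotGoodies(c("P13368","Q6GZX4"), c("id","sequence"))
@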
\end{Solution}

\subsection{Parsing XML}

Data for the previous example were downloaded from UniProt in
tab-delimited format.  This is a convenient output to work with,
but unfortunately it is not always available.  XML is still very
common, and it is useful to have some familiarity with parsing
it.  In this section we give a brief overview of using the
\Rpackage{XML} package to navigate XML data.

The \Rpackage{XML} package provides functions to parse XML using
either the tree-based DOM (Document Object Model) or the event-driven
SAX (Simple API for XML).  We will use the DOM approach.  The XML is
first parsed into a tree-structure where the different elements
of the data are nodes. The elements are processed by traversing the
tree and generating a user-level representation of the nodes. XPath
syntax is used to traverse the nodes. A detailed description of
XPath can be found at \url{http://www.w3.org/xml}.

\paragraph{Retrieve the data:}
Data will be retrieved for the same IDs as in the previous example.
Unlike the tab-delimited queries, the XML queries cannot be subset by
column, so the full record will be returned for each ID.  Details of what
is possible with each type of data retrieval are found at
\url{http://www.uniprot.org/faq/28}.

Parse the XML into a tree structure with \Rfunction{xmlTreeParse}.
When \Rcode{useInternalNodes=TRUE} and no \Rcode{handlers}
are specified the return value is a reference to C-level nodes.
This storage mode allows us to traverse the tree of data in C
instead of as R objects.

<<xml_tree>>=
library(XML)
uri <- "http://www.uniprot.org/uniprot/?query=P13368+or+Q6GZX4&format=xml"
fl <- tempfile()
download.file(uri, fl)
xml <- xmlTreeParse(fl, useInternalNodes=TRUE)
@

\paragraph{XML namespace:}
XML pages can have namespaces which facilitate the use of
different XML vocabularies by resolving conflicts arising
from identical tags.  Namespaces are represented by a URI
pointing to an XML schema page.  When a namespace is defined
on a node in an XML document, it must be included in the XPath
expression.

Use the \Rfunction{xmlNamespaceDefinitions} function to check
if the XML has a namespace.

<<xml_namespace>>=
defs <- xmlNamespaceDefinitions(xml, recursive=TRUE)
defs
@

The presence of URIs confirms there is a namespace.
Alternatively, we could have looked at the XML nodes for
declarations of the form
\Rcode{xmlns:myNamespace="http://www.namespace.org"}.
We organize the namespaces so that we can use them directly
in parsing.

<<xml_namespace_struct>>=
ns <- structure(sapply(defs, function(x) x$uri), names=names(defs))
@

\paragraph{Parsing with XPath:}

There are two high-level 'entry' nodes, which represent
the two IDs requested in the original query.
<<xml_namespace2>>=
entry <- getNodeSet(xml, "//ns:entry", "ns")
xmlSize(entry)
@

To get an idea of the data structure we first list the
attributes of the top nodes and extract the names.
<<xml_xmlAttrs>>=
nms <- xpathSApply(xml, "//ns:entry/ns:name", xmlValue, namespaces="ns")
attrs <- xpathApply(xml, "//ns:entry", xmlAttrs, namespaces="ns")
names(attrs) <- nms
attrs
@

Next, inspect the direct children of each node.
<<xml_xmlChildren>>=
fun1 <- function(elt) unique(names(xmlChildren(elt)))
xpathApply(xml, "//ns:entry", fun1, namespaces="ns")
@

Query Q6GZX4 has 2 'feature' nodes and query P13368 has 48.
<<xml_feature_type>>=
Q6GZX4 <- "//ns:entry[ns:accession='Q6GZX4']/ns:feature"
xmlSize(getNodeSet(xml, Q6GZX4, namespaces="ns"))

P13368 <- "//ns:entry[ns:accession='P13368']/ns:feature"
xmlSize(getNodeSet(xml, P13368, namespaces="ns"))
@

List all possible values for the 'type' attribute of the
'feature' nodes.
<<xml_feature_type2>>=
path <- "//ns:feature"
unique(xpathSApply(xml, path, xmlGetAttr, "type", namespaces="ns"))
@

XPath allows the construction of complex queries to pull out
specific subsets of data. Here we extract the features with 
`type=sequence conflict' for query P13368.
<<xml_feature_type_P13368>>=
path <- "//ns:entry[ns:accession='P13368']/ns:feature[@type='sequence conflict']"
data.frame(t(xpathSApply(xml, path, xmlAttrs, namespaces="ns")))
@

Put the sequences in an AAStringSet and add the names.
<<xml_sequence>>=
library(Biostrings)
path <- "//ns:entry/ns:sequence"
seqs <- xpathSApply(xml, path, xmlValue, namespaces="ns")
aa <- AAStringSet(unlist(lapply(seqs, function(elt) gsub("\n", "", elt)),
    use.names=FALSE))
names(aa) <- nms
aa
@

\section{Setting up a package to expose a web service}

In order to expose a web service using \Rfunction{select}, you will
need to create an object that will be loaded when the package is
loaded.  Unlike with a database, the purpose of this object is almost
purely dispatch: we just need \Rfunction{select} and its friends to
know which select method to call.

The first step is to define a class.  Doing so is simple
enough:

<<WebServiceObject>>=
setClass("uniprot", representation(name="character"),
         prototype(name="uniprot"))
@

Once you have a class defined, all you need is to make an instance of
this class.  Making an instance is easy enough:

<<makeInstanceWebServiceObj>>=
    uniprot <- new("uniprot")
@ 

But of course it's a little more complicated than that, because one of
these objects will need to be created whenever our package loads.  This
is accomplished with an \Rfunction{.onLoad} function in the package's
zzz.R file.  The following code will create an object, and then assign
it to the package namespace as the package loads.

<<onLoad2,eval=FALSE>>=
.onLoad <- function(libname, pkgname)
{
    ns <- asNamespace(pkgname)
    uniprot <- new("uniprot")
    assign("uniprot", uniprot, envir=ns)
    namespaceExport(ns, "uniprot")
}
@ 


\section{Creating package accessors for a web service}

At this point you have all that you need to know in order to implement
\Rfunction{keytypes}, \Rfunction{columns}, \Rfunction{keys} and
\Rfunction{select} for your package.  In this section we will explore how
you could implement some of these if you were making a package that
exposed UniProt.

\subsection{Example: creating \Rfunction{keytypes} and \Rfunction{columns} methods}

The \Rfunction{keytypes} and \Rfunction{columns} methods are always the first
ones you should implement.  They are the easiest, and their existence
is required before you can use \Rfunction{keys} or \Rfunction{select}.  In
this simple case we only have one value that can be used as a keytype,
and that is a UNIPROT ID.

<<keytypeUniprot>>=
setMethod("keytypes", "uniprot",function(x){return("UNIPROT")})
uniprot <- new("uniprot")
keytypes(uniprot)
@ 

So what about \Rfunction{columns}?  It's not a whole lot more complicated in
this case, since we are limited to things that we can return from the
web service.  Since this is just an example, let's limit it to the
following fields: "ID", "SEQUENCE" and "ORGANISM".
%%
<<keytypeUniprot2>>=
setMethod("columns", "uniprot", 
          function(x){return(c("ID", "SEQUENCE", "ORGANISM"))})
columns(uniprot)
@
%%
Also, notice how for both \Rfunction{keytypes} and \Rfunction{columns} I am
using all capital letters.  This is a style adopted throughout the
project.

\subsection{Example 2: creating a \Rfunction{select} method}

At this point we have enough to be able to make a select method.


\begin{Exercise}
Using what you have learned above, and the helper function from
earlier, define a select method.  This select method will have a
default \Rcode{keytype} of "UNIPROT".
\end{Exercise}
\begin{Solution}
<<webServiceSelect>>=
.select <- function(x, keys, columns){
    colsTranslate <- c(id='ID', sequence='SEQUENCE', organism='ORGANISM')
    columns <- names(colsTranslate)[colsTranslate %in% columns]
    getUniprotGoodies(query=keys, columns=columns)
}
setMethod("select", "uniprot", 
    function(x, keys, columns, keytype)
{
    .select(x, keys=keys, columns=columns)
})
@ 

<<webServiceSelect2, eval=FALSE>>=
select(uniprot, keys=c("P13368","P20806"), columns=c("ID","ORGANISM"))
@
\end{Solution}

\section{Retrieving data from a database resource}

If your package is retrieving data from a database, then there are some
additional skills you will need in order to interface with the
database from \R.  This section will introduce you to those skills.

\subsection{Getting a connection}

If all you know is the name of the SQLite database, then to get a DB
connection you need to do something like this:
%%
<<classicConn,results='hide'>>=
drv <- SQLite()
library("org.Hs.eg.db")
con_hs <- dbConnect(drv, dbname=system.file("extdata", "org.Hs.eg.sqlite",
                        package = "org.Hs.eg.db"))
con_hs
dbDisconnect(con_hs)
@
%%
But in our case the connection has already been created as part
of the object that was generated when the package was loaded:
%%
<<ourConn>>=
require(hom.Hs.inp.db)
str(hom.Hs.inp.db)
@
%%
So we can do something like this:
%%
<<ourConn2>>=
hom.Hs.inp.db$conn
## or better we can use a helper function to wrap this:
AnnotationDbi::dbconn(hom.Hs.inp.db)
## or we can just call the provided convenience function 
## from when this package loads:
hom.Hs.inp_dbconn()
@

\subsection{Getting data out}

Now we just need to get our data out of the DB.  There are several
useful functions for doing this.  Most of these come from the
\Rpackage{RSQLite} or \Rpackage{DBI} packages.  For the sake of
simplicity, I will only discuss those that are immediately useful for
exploring and extracting data from a database in this vignette.  One
pair of useful methods are the \Rfunction{dbListTables} and
\Rfunction{dbListFields} which are useful for exploring the schema of a
database.

<<dbListTables>>=
con <- AnnotationDbi::dbconn(hom.Hs.inp.db)
head(dbListTables(con))
dbListFields(con, "Mus_musculus")
@ 
%%
For actually executing SQL to retrieve data, you probably want to
use something like \Rfunction{dbGetQuery}.  The only caveat is that this
will actually require you to know a little SQL.
%%
<<dbGetQuery>>=
dbGetQuery(con, "SELECT * FROM metadata")
@ 

\subsection{Some basic SQL}

The good news is that SQL is pretty easy to learn, especially if you
are primarily interested in just retrieving data from an existing
database.  Here is a quick run-down to get you started on writing
simple SELECT statements.  Consider a table that looks like this:
%%
\begin{table}[h]
  \begin{tabular}{cc}
    \multicolumn{2}{l}{Table sna} \\
    foo & bar \\\hline
    1   & baz \\
    2   & boo \\
  \end{tabular}
\end{table}

\noindent This statement:
\begin{verbatim}
    SELECT bar FROM sna;
\end{verbatim}
Tells SQL to get the "bar" field from the "sna" table.  If we wanted
the other field called "foo" in addition to "bar", we could
have written it like this:
\begin{verbatim}
    SELECT foo, bar FROM sna;
\end{verbatim}
Or even this (* is a wildcard character here):
\begin{verbatim}
    SELECT * FROM sna;
\end{verbatim}
Now let's suppose that we wanted to filter the results.  We could also
have said something like this:
\begin{verbatim}
    SELECT * FROM sna WHERE bar='boo';
\end{verbatim}
That query will only retrieve records from sna that match the criteria
for bar.  But there are two other things to notice.  First, notice that
a single = was used for testing equality.  Second, notice that I used
single quotes to demarcate the string.  I could have also used double
quotes, but when working in \R{} this will prove to be less convenient as
the whole SQL statement itself will frequently have to be wrapped as a
string.

What if we wanted to be more general?  Then you can use LIKE, like
this:
\begin{verbatim}
    SELECT * FROM sna WHERE bar LIKE 'boo%';
\end{verbatim}
That query will only return records where bar starts with "boo" (the
\% character is acting as another kind of wildcard in this context).

You will often find that you need to get things from two or more
different tables at once.  Or, you may even find that you need to
combine the results from two different queries.  Sometimes these two
queries may even come from the same table.  In any of these cases, you
want to do a join.  The simplest and most common kind of join is an
inner join.  Let's suppose that we have two tables:
\begin{table}[h]
  \begin{tabular}{ccc@{\hspace{2em}}ccc}
    \multicolumn{2}{c}{Table sna} & & & \multicolumn{2}{c}{Table fu} \\
    foo & bar & & & foo & bo \\\cline{1-2}\cline{5-6}
    1   & baz & & & 1   & hi \\
    2   & boo & & & 2   & ca \\
  \end{tabular}
\end{table}

\noindent And we want to join them where the records match in their
corresponding "foo" columns.  We can do this query to join them:
\begin{verbatim}
    SELECT * FROM sna,fu WHERE sna.foo=fu.foo;
\end{verbatim}
Something else we can do is tidy this up by using aliases like so:
\begin{verbatim}
    SELECT * FROM sna AS s,fu AS f WHERE s.foo=f.foo;
\end{verbatim}
This last trick is not very useful in this particular example, since
the query ended up longer than the one we started with, but it is
great for other cases where queries become really long.
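
To make these statements concrete, here is a minimal sketch that
builds the toy sna and fu tables from this section in an in-memory
SQLite database and runs a few of the queries shown above (the table
and column names are just the toy ones used here):

<<toySQL>>=
tmpCon <- dbConnect(SQLite(), ":memory:")
dbWriteTable(tmpCon, "sna", data.frame(foo=c(1,2), bar=c("baz","boo")))
dbWriteTable(tmpCon, "fu", data.frame(foo=c(1,2), bo=c("hi","ca")))
dbGetQuery(tmpCon, "SELECT bar FROM sna")
dbGetQuery(tmpCon, "SELECT * FROM sna WHERE bar LIKE 'boo%'")
dbGetQuery(tmpCon, "SELECT * FROM sna AS s, fu AS f WHERE s.foo=f.foo")
dbDisconnect(tmpCon)
@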

\subsection{Exploring the SQLite database from \R}

Now that we know both some SQL and also about some of the methods in
\Rpackage{DBI} and \Rpackage{RSQLite} we can begin to explore the
underlying database from \R.  How should we go about this?  Well, the
first thing we always want to know is what tables are present.  We
already know how to learn this:
<<dbListTables2>>=
head(dbListTables(con))
@ 
And we also know that once we have a table we are curious about, we
can then look up its fields using \Rfunction{dbListFields}:
<<dbListFields2>>=
dbListFields(con, "Apis_mellifera")
@ 
And once we know something about which fields are present in a table,
we can compose a SQL query.  Perhaps the most straightforward query is
just to get all the results from a given table.  We know that the SQL
for that should look like:
\begin{verbatim}
    SELECT * FROM Apis_mellifera;
\end{verbatim}
So we can now call a query like that from R by using \Rfunction{dbGetQuery}: 
<<dbGetQuery2>>=
head(dbGetQuery(con, "SELECT * FROM Apis_mellifera"))
@ 

\begin{Exercise}
  Now use what you have learned to explore the
  \Biocannopkg{hom.Hs.inp.db} database.  The formal scientific name for
  one of the mosquitoes that carry the malaria parasite is Anopheles
  gambiae.  Find the table for that organism in the
  \Biocannopkg{hom.Hs.inp.db} database and extract it into R.  How many
  species are present in this table?  Inparanoid uses a five-letter
  designation for each species, composed of the first two letters
  of the genus followed by the first three letters of the species.  Using
  this fact, write a SQL query that will retrieve only records from
  this table that are from humans (Homo sapiens).
\end{Exercise}
\begin{Solution}
<<Anopheles,eval=FALSE>>=
head(dbGetQuery(con, "SELECT * FROM Anopheles_gambiae"))
## Then only retrieve human records
## Query: SELECT * FROM Anopheles_gambiae WHERE species='HOMSA'
head(dbGetQuery(con, "SELECT * FROM Anopheles_gambiae WHERE species='HOMSA'"))
dbDisconnect(con)
@
\end{Solution}

\section{Setting up a package to expose a SQLite database object}

For the sake of simplicity, let's look at an existing example of this
in the \Biocannopkg{hom.Hs.inp.db} package.  If you download the
tarball from the website you can see that it contains a .sqlite
database inside its inst/extdata directory.  There are a couple of
important details about this database, though.  The first is that we
recommend that the database have the same name as the package, but end
with the extension .sqlite.  The second is that we recommend
that the metadata table contain some important fields.  This is the
metadata from the current \Biocannopkg{hom.Hs.inp.db} package.

<<getMetadata, echo=FALSE>>=
library(hom.Hs.inp.db)
hom.Hs.inp_dbInfo()
@

As you can see, there are a number of very useful fields stored in the
metadata table, and if you list the equivalent table for other packages
you will find even more useful information than you find here.  But
the most important fields here are actually the ones called "package"
and "Db type".  Those fields specify, respectively, the name of the
package with the expected class definition, and the name of the class
that this database is expected to be represented by in the R session.
If you fail to include this information in your metadata
table, then \Rfunction{loadDb} will not know what to do with the
database when it is called.  In this case, the class definition has
been stored in the \Biocpkg{AnnotationDbi} package, but it could live
anywhere you need it to.  By specifying these metadata fields, you
enable \Rfunction{loadDb} to find it.
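
For instance, a minimal metadata table carrying just these two fields
could be written like this (a sketch only, not evaluated:
\Rcode{myCon} is a hypothetical connection to your new database, and
the values must match your actual class and package):

<<metadataSketch, eval=FALSE>>=
## 'myCon' and the field values below are illustrative placeholders.
meta <- data.frame(name=c("Db type", "package"),
                   value=c("InparanoidDb", "AnnotationDbi"))
dbWriteTable(myCon, "metadata", meta)
dbGetQuery(myCon, "SELECT * FROM metadata")
@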

Once you have set up the metadata you will need to create a class for
your package that extends the \Rclass{AnnotationDb} class.  In the
case of the hom.Hs.inp.db package, the class is defined to be
\Rclass{InparanoidDb}.  This code is inside of
\Biocpkg{AnnotationDbi}.

<<referenceClass,eval=FALSE>>=
.InparanoidDb <-
    setRefClass("InparanoidDb", contains="AnnotationDb")
@

Finally, the \Rfunction{.onLoad} function for your package will have to
contain code that calls the \Rfunction{loadDb} method.  This is what
it currently looks like in the \Biocannopkg{hom.Hs.inp.db} package.

<<onLoad,eval=FALSE>>=
sPkgname <- sub(".db$","",pkgname)
db <- loadDb(system.file("extdata", paste(sPkgname,
               ".sqlite",sep=""), package=pkgname, lib.loc=libname),
               packageName=pkgname)
dbNewname <- AnnotationDbi:::dbObjectName(pkgname,"InparanoidDb")
ns <- asNamespace(pkgname)
assign(dbNewname, db, envir=ns)
namespaceExport(ns, dbNewname)
@

When the code above is run (at load time), the name of the package
(\Rcode{pkgname}, a parameter passed into \Rfunction{.onLoad}) is used
to derive the name for the object.  That name is then used by
\Rfunction{.onLoad} to create an \Rclass{InparanoidDb} object.  This
object is then assigned to the namespace of the package so that it
will be available when the package is loaded.

\section{Creating package accessors for databases}

At this point, all that remains is to create the means for accessing
the data in the database.  This should prove a lot less difficult than
it may initially sound.  For the new interface, only the four methods
that were described earlier are really required:
\Rfunction{columns}, \Rfunction{keytypes}, \Rfunction{keys} and \Rfunction{select}.

In order to do this you need to know a small amount of SQL and a few
tricks for accessing the database from R.  The point of providing
these four accessors is to give users of these packages a more unified
experience when retrieving data from the database.  But other kinds of
accessors (such as those provided for the \Rclass{TxDb} objects) may
also be warranted.

\subsection{Examples: creating a \Rfunction{columns} and \Rfunction{keytypes} method}

Now let's suppose that we want to define a \Rfunction{columns} method for
our \Rclass{hom.Hs.inp.db} object, and that what we want is for it to
tell us about the actual organisms for which we can extract
identifiers.  How could we do that?

<<columns,eval=FALSE>>=
.cols <- function(x)
{
    con <- AnnotationDbi::dbconn(x)
    list <- dbListTables(con)
    ## drop unwanted tables
    unwanted <- c("map_counts","map_metadata","metadata")
    list <- list[!list %in% unwanted]
    ## Then just to format things in the usual way
    list <- toupper(list)
    dbDisconnect(con)
    list
}

## Then make this into a method
setMethod("columns", "InparanoidDb", .cols(x))
## Then we can call it
columns(hom.Hs.inp.db)
@

Notice again how I formatted the output to all uppercase characters?
This is just done to make the interface look consistent with what has
been done before for the other \Rfunction{select} interfaces.  But doing
this means that we will have to do a tiny bit of extra work when we
implement our other methods.

\begin{Exercise}
  Now use what you have learned to try and define a method for
  \Rfunction{keytypes} on \Rclass{hom.Hs.inp.db}.  The keytypes method
  should return the same results as columns (in this case).  What if you
  needed to translate back to the lowercase table names?  Also write
  a quick helper function to do that.
\end{Exercise}
\begin{Solution}
<<keytypes,eval=FALSE>>=
setMethod("keytypes", "InparanoidDb", .cols(x))
## Then we can call it
keytypes(hom.Hs.inp.db)

## refactor of .cols
.getLCcolnames <- function(x)
{
    con <- AnnotationDbi::dbconn(x)
    list <- dbListTables(con)
    ## drop unwanted tables
    unwanted <- c("map_counts","map_metadata","metadata")
    list <- list[!list %in% unwanted]
    dbDisconnect(con)
    list
}
.cols <- function(x)
{
    list <- .getLCcolnames(x)
    ## Then just to format things in the usual way
    toupper(list)
}
## Test:
columns(hom.Hs.inp.db)

## new helper function:
.getTableNames <- function(x)
{
    LC <- .getLCcolnames(x)
    UC <- .cols(x)
    names(UC) <- LC
    UC
}
.getTableNames(hom.Hs.inp.db)
@
\end{Solution}

\subsection{Example: creating a \Rfunction{keys} method}

\begin{Exercise}
  Now define a method for \Rfunction{keys} on \Rclass{hom.Hs.inp.db}.
  The keys method should return the keys from a given organism based
  on the appropriate keytype.  Since each table has rows that
  correspond to both human and non-human IDs, it will be necessary to
  filter out the human rows from the result.
\end{Exercise}
\begin{Solution}
<<keys,eval=FALSE>>=
.keys <- function(x, keytype)
{
    ## translate keytype back to table name
    tabNames <- .getTableNames(x)
    lckeytype <- names(tabNames[tabNames %in% keytype])
    ## get a connection
    con <- AnnotationDbi::dbconn(x)
    sql <- paste("SELECT inp_id FROM",lckeytype, "WHERE species!='HOMSA'")
    res <- dbGetQuery(con, sql)
    res <- as.vector(t(res))
    dbDisconnect(con)
    res
}

setMethod("keys", "InparanoidDb", .keys(x, keytype))
## Then we can call it
keys(hom.Hs.inp.db, "TRICHOPLAX_ADHAERENS")
@
\end{Solution}


\section{Creating a database resource from available data}
Sometimes you may have a lot of data that you want to organize into a
database.  Or you may have another existing database that you wish to
convert into a SQLite database.  This section will deal with some
simple things you can do to create and import a SQLite database of
your own.


\subsection{Making a new connection}

First, let's close the connection to our other DB:
<<dbDisconnect>>=
   dbDisconnect(con)
@ 

Then let's make a new database.  Notice that we specify the database
name with "dbname".  This allows it to be written to disk instead of
existing only in memory.

<<makeNewDb>>=
drv <- dbDriver("SQLite")
dbname <- file.path(tempdir(), "myNewDb.sqlite")
con <- dbConnect(drv, dbname=dbname)
@ 


\subsection{Importing data}

Imagine that we want to create a database and then put a table in it
called genePheno to store mutated genes and the phenotypes
associated with each.  Plan for genePheno to hold the following gene
IDs and phenotypes (as a toy example):

<<exampleFrame>>=
data <- data.frame(id=c(1,2,9),
                  string=c("Blue",
                           "Red",
                           "Green"),
                  stringsAsFactors=FALSE)
@

Making the table is very simple, and just involves a CREATE TABLE
statement.

\begin{verbatim}
    CREATE TABLE genePheno (id INTEGER, string TEXT);
\end{verbatim}  

The SQL create table statement just indicates what the table is to be
called, as well as the different fields that will be present and the
type of data each field is expected to contain.

<<exercise2>>=
dbGetQuery(con, "CREATE Table genePheno (id INTEGER, string TEXT)")
@ 

But putting the data into the database is a little bit more delicate.
We want to take control over which columns we want to insert from our
\Robject{data.frame}.  Fortunately, the RSQLite package provides these
facilities for us.

<<LabelledPreparedQueries>>=
names(data) <- c("id","string")
sql <- "INSERT INTO genePheno VALUES ($id, $string)"
dbBegin(con)
res <- dbSendStatement(con, sql)
dbBind(res, data)
dbClearResult(res)
dbCommit(con)
@ 

Please notice that we want to use strings instead of factors in our
data.frame.  If you insert the data as factors, you may not be happy
with what ends up in the DB.
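
We can verify what actually ended up in the table:

<<verifyInsert>>=
dbGetQuery(con, "SELECT * FROM genePheno")
@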




\subsection{Attaching other database resources}

In SQLite it is possible to attach another database to your session
and then query across both resources as if they were the same DB.


The SQL we want looks quite simple:

\begin{verbatim}
    ATTACH "TxDb.Hsapiens.UCSC.hg19.knownGene.sqlite" AS db;
\end{verbatim}  

So in R we need to do something similar to this:

<<ATTACH>>=
db <- system.file("extdata", "TxDb.Hsapiens.UCSC.hg19.knownGene.sqlite", 
                  package="TxDb.Hsapiens.UCSC.hg19.knownGene")
dbGetQuery(con, sprintf("ATTACH '%s' AS db",db))
@ 

Here we have attached a DB from one of the packages that this vignette
required you to have installed, but we could have attached any SQLite
database that we provided a path to.


Once we have attached the database, we can join to its tables as if
they were in our own database.  All that is required is a prefix, and
some knowledge about how to do joins in SQL.  In the end the SQL to
take advantage of the attached database looks like this:

\begin{verbatim}
    SELECT * FROM db.gene AS dbg, genePheno AS gp 
    WHERE dbg.gene_id=gp.id;
\end{verbatim}  

  Then in R:
<<ATTACHJoin>>=
  sql <- "SELECT * FROM db.gene AS dbg, 
          genePheno AS gp WHERE dbg.gene_id=gp.id"
  res <- dbGetQuery(con, sql)
  res
@ 


%% TODO: another exercise would be to have the group think of two different ways that they could implement getting the transcript IDs from the attached DB into their DB (attaching/depending vs. just creating another table in the DB).  


The version of R and the packages loaded when generating this
vignette were:

<<SessionInfo, echo=FALSE>>=
sessionInfo()
@


\end{document}