%\VignetteIndexEntry{SQLForge: An easy way to create a new annotation package with a standard database schema.}
%\VignetteDepends{org.Hs.eg.db}
%\VignetteDepends{human.db0}
%\VignetteKeywords{annotation, database}
%\VignettePackage{AnnotationForge}
%\VignetteEngine{knitr::knitr}

\documentclass[11pt]{article}

<<style, eval=TRUE, echo=FALSE, results='asis'>>=
BiocStyle::latex()
@

\title{Creating a New Annotation Package using SQLForge}
\author{Marc Carlson, Hervé Pagès, Nianhua Li}




\begin{document}

\maketitle

\section{Introduction}

The \Rpackage{AnnotationForge} package provides a series of functions
that can be used to build \Rpackage{AnnotationDbi} packages for
supported organisms. This collection of functions is called SQLForge.

In order to use SQLForge you really only need to have one kind of
information and that is a list of paired IDs.  These IDs are to be
stored in a tab delimited file that is formatted in the same way that
they used to be for the older AnnBuilder package.  For those who are
unfamiliar with the AnnBuilder package, this just means that there are
two columns separated by a tab where the column on the left contains
probe or probeset identifiers and the column on the right contains
some sort of widely accepted gene accession.  This file should NOT
contain a header.  SQLForge will then use these IDs along with it's
own support databases to make an \Rpackage{AnnotationDbi} package for
you. Here is how these IDs should look if you were to read them into
R:

<<FileDemo>>=
library(RSQLite)
library(AnnotationForge)
read.table(system.file("extdata", "hcg110_ID",
                       package="AnnotationDbi"),
           sep = "\t", header = FALSE, as.is = TRUE)[1:5,]
@


In the example above, Genbank IDs are demonstrated.  But it is also
possible to use entrez gene IDs, refseq IDs or unigene accessions as
the gene identifiers.  If refseq IDs are used, it is preferable to
strip off the version extensions that can sometimes be added on by
some vendors.  The version extensions are digits that are sometimes
tacked onto the end of a refseq ID and separated from the accession by
a dot. As an example consider "NM\_000193.2" .  The "NM\_000193"
portion would be the actual accession number and the ".2" would be the
version number.  These version numbers are not used by these databases
and their presence in your input can cause less than desirable
results.

Alternatively, if you have an annotation file for an Affymetrix chip,
you can use a parameter called affy that will automatically parse such
a file and produce a similar mapping from that.  It is important to
understand however that despite that rather rich contents of an
Affymetrix annotation file, almost none of these data are used in the
making of an annotation package with SQLForge.  Instead, the relevant
IDs are stripped out, and then passed along to SQLForge as if you had
created a file like is seen above.  The option here to use such a file
is offered purely as a convenience because the platform is so popular.

If you have additional information about your probes in the form of
other kinds of supported gene IDs, you can pass these in as well by
using the otherSrc parameter.  These IDs must be formatted using the
same two column format as described above, and if there are multiple
source files, then you can pass them in as a list of strings that
correspond to the file paths for these files.

Once you have your IDs ready, SQLForge will read them in, and use the
gene IDs to compare to an intermediate database.  The data from this
database is what is used to make the specialized database that is
placed inside of an annotation package.

At the present time, it is possible to make annotation packages for
the most common model organisms.  For each of these organisms another
support package will be maintained and updated biannually which will
include all the basic data gathered for this organism from sources
such as NCBI, GO, KEGG and Flybase etc. These support packages will
each be named after the organism they are intended for and will each
include a large sqlite database with all the supporting information
for that organism. Please note that support databases are not
necessary unless you intend to actually make a new annotation package
for one of the supported organisms. In the case where you want to make
annotation packages, the support databases are only required for the
organism in question. When SQLForge makes a new database, it uses the
information supplied by the support database as the data source to
make the annotation package.  So the relevant support packages needs
to be updated to the latest version in order to guarantee that the
annotation packages you produce will be made with information from the
last biannual update.  These support packages are not meant to be
annotation packages themselves and they come with no schema of their
own. Instead these are merely a way to distribute the data to those
who want to make custom annotation packages.

To check if your organism is supported simply look in the metadata
packages repository on the bioconductor website for a .db0 package.
Only special organism base packages will end with the .db0
extension. If you find a package that is named after the organism you
are interested in, then your organism is supported, and you can use
that database to make custom packages.  To list all the supported organism
.db0 packages directly from R you can use available.db0pkgs().

<<availableDB0s>>=
  available.db0pkgs()  
@



\section{How to use SQLForge}

Once you know the name of the package that you need, you can get the
latest .db0 package for your organism by using BiocManager::install(): 

<<GetIntermedDB, eval=FALSE>>=
library(BiocManager)
install("human.db0")
@

<<checkforhumandb0, echo=FALSE>>=
require("human.db0")
@

Even if you have installed the .db0 package before, it's a good idea
to run this again to be sure that you have the latest organism
package.

Once you have the appropriate .db0 package, you may also need to
install the appropriate org.* package.  Most users probably have this
already, because using any chip packages for this organism will
require it. But if you don't have it yet, you can find a list of these
packagesit on the web site here:

\url{http://www.bioconductor.org/packages/release/BiocViews.html#___OrgDb}

Once you know the name of the package you need, you can install it like before.

<<GetOrg.db, eval=FALSE>>=
install("org.Hs.eg.db")
@

All the supported organisms with a .db0 package, will also have a
matching org.* package.  The org.* packages are named like this:
org.<species abbreviation>.<source abbreviation>.db.  The source
indicates where the data originates from and also which central ID the
data all connects to, while the species abbreviation indicates which
organism the data is for.  The web site spells out the species a bit
more clearly by listing the titles for each package.



Since each organism will have different kinds of data available, the
schemas that will be needed for each organism will also
change. SQLForge provides support functions for each of the model
organisms that will create a sqlite database that complies with a
specified database schema.  The following helper function will list
supported Schemas.

<<list Schemas>>=
available.dbschemas()
@ 

The following shows an example of how to make a chip package:

<<SQLForge, tidy=FALSE, results='hide'>>=
hcg110_IDs = system.file("extdata",
                          "hcg110_ID",
                          package="AnnotationDbi")

tmpout = tempdir()

makeDBPackage("HUMANCHIP_DB",
              affy=FALSE,            
              prefix="hcg110",
              fileName=hcg110_IDs,
              baseMapType="gb",
              outputDir = tmpout,
              version="1.0.0",
              manufacturer = "Affymetrix",
              chipName = "Human Cancer G110 Array",
              manufacturerUrl = "http://www.affymetrix.com")
@ 
<<cleanup2, echo=FALSE, results='hide'>>=
file.remove(file.path(tmpout, "hcg110.sqlite"))
file.rename(file.path(tmpout, "hcg110.db"),file.path(tmpout, "foo.db"))
@

Wrapper functions are provided for making all of the different kinds
of chip based package types that are presently defined.  These are
named after the schemas that they correspond to.  So for example
\Rfunction{makeHUMANCHIP\_DB()} corresponds to the HUMANCHIP\_DB
schema, and is used to produce chip based annotation packages of that
type.

\subsection{Installing your custom package}

To install your package in Unix simply use R CMD INSTALL <packageName>
at the command line.  But if you are on Windows or Mac, you may have
to instead use \Rfunction{install.packages} from within R.  This will
work because this kind of simple annotation package does not contain
any code that has to be compiled.  So you can simply call
\Rfunction{install.packages} and set the repos parameter to NULL and
the type parameter to "source". The final R command will look
something like this:

<<install, eval=FALSE>>=
install.packages("packageNameAndPath", repos=NULL, type="source")
@ 

Of course, you still have to type the path to your source directory
correctly as the 1st argument.  It is recommended that you use the
autocomplete feature in R as you enter it so that you get the path
specified correctly.




\section{For Advanced users: How to add extra data into your packages}

Sometimes you may find that you want to add extra supplementary data
into the database for the package that you just created.  In these
cases, you will have to begin by using the SQL to add more data into
the database.  Before you can do that however, you will have to change
the permissions on the sqlite database.  The database will always be
in the inst/extdata directory of your package source after you run
SQLForge.  Once you can edit your database, you will have to create a
new table, and populate that table with new information using SQL
statements.  One good way to do this would be to use the
\Rpackage{RSQLite} interface that is introduced in portions of the
\Rpackage{AnnotationDbi} vignette.  For a more thorough treatment of
the RSQLite package, please see the vignette for that package at CRAN.
Once you are finished editing the database with SQL, be sure to change
the database file back to being a read only file.

However, adding the content to the database is only the 1st part of
what has to be done.  In order for the data to be exposed to the R
layer as a mapping, you will have to also create and document a
mapping object.  To do this step we have added a simple utility
function to AnnotationDbi that allows you to make a simple Bimap from
a single table.  The following example will make an additional mapping
between the gene names and the gene symbols found in the gene\_info
table for the package \Rpackage{hgu95av2.db}.  For this particular
example, no additional SQL has to be inserted 1st into the database
since it is just adding a mapping onto data that already exists in the
database (but is just not normally exposed as a mapping).

<<createSimpleMapping, tidy=FALSE>>=
library(hgu95av2.db)
hgu95av2NAMESYMBOL <- createSimpleBimap("gene_info",
                                        "gene_name",
                                        "symbol",
                                        hgu95av2.db:::datacache,
                                        "NAMESYMBOL",
                                        "hgu95av2.db")
##What is the mapping we just made?
hgu95av2NAMESYMBOL
##Display the 1st 4 relationships in this new mapping
as.list(hgu95av2NAMESYMBOL)[1:4]
@ 

If instead of creating a mapping on an existing example, you wanted to
add a new mapping to your customized annotation package, you would
need to call this function from zzz.R in your modified annotation
package (and also expose it in the namespace).  You will then want to
be sure that your updated database has replaced the one in the
inst/extdata directory that was originally generated by SQLForge.  And
finally, you will need to also put a man page into your package so
that users will know how to make use of this new mapping.


\section{Session Information}

The version number of R and packages loaded for generating the vignette were:

<<SessionInfo, echo=FALSE>>=
sessionInfo()
@

\end{document}