% -*- mode: noweb; noweb-default-code-mode: R-mode; -*-
%\VignetteIndexEntry{makecdfenv primer}
%\VignetteKeywords{Preprocessing, Affymetrix}
%\VignetteDepends{makecdfenv, affy}
%\VignettePackage{makecdfenv}
%documentclass[12pt, a4paper]{article}
\documentclass[12pt]{article}

\usepackage{amsmath}
\usepackage{hyperref}
\usepackage[authoryear,round]{natbib}

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}

\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}
\author{Wolfgang Huber and Rafael Irizarry}
\begin{document}
\title{Textual description of makecdfenv}

\maketitle
\tableofcontents

\section{Introduction}
The \Rpackage{makecdfenv} package is part of the
Bioconductor\footnote{\url{http://www.bioconductor.org/}} project. It
can be used to create cdf environments from Affymetrix chip
description files to be used with the package \Rpackage{affy}.

For many {\it CDF} files a pre-assembled package is available from
\url{http://www.bioconductor.org/}. The
following notes explain what to do if you want to create your own.

\section{Creating a CDF package}
To install packages, you need certain tools besides R itself.  They
are generally available on most Unix platforms. Under MS-Windows, if
you use the precompiled binary distribution of R from CRAN, you will
need to install some extra tools, like the {\em source package
installation files} and {\em Perl}. If you think this is too
complicated, you can contribute the package to Bioconductor
and we will create a windows binary for you. Alternatively,
you may skip this section and proceed to
section~\ref{sec:make.cdf.env}

Let's say your CDF file is called \Robject{eggplantgenome.cdf} and
is in your working directory. To make a package, simply write
\begin{Sinput}
R> make.cdf.package("eggplantgenome.cdf", species = "Solanum_sp")
\end{Sinput}
This will create a subdirectory \Robject{eggplantgenomecdf} in your
working directory, which contains the package.  Please consult the
help page for \Rfunction{make.cdf.package} to find out about further
options.

Now, open a terminal with an operating system shell, and write
\begin{Sinput}
> R CMD INSTALL eggplantgenomecdf
\end{Sinput}
This will install the package into your R. You are now ready to use the
affy package to process eggplant genechips.

\section{Creating CDF environments}
\label{sec:make.cdf.env}
If you do not choose to use the package creation mechanism, you can
still produce a data structure (in R lingo, it is called an
{\it environment}) that can be used by the affy package.
Let's say your CDF file is called \Robject{EggPlantGenome.cdf} and
is in your working directory. Simply write
\begin{Sinput}
R> EggPlantGenome = make.cdf.env("EggPlantGenome.cdf")
\end{Sinput}
Please consult the help page for \Rfunction{make.cdf.env} to find out
about further options.

\section{Naming of CDF packages and environments}
What should you call your package or environment?

\Rfunction{make.cdf.package} chooses a default name by stripping all
non-letters and non-numbers from the CDF file name, and converting
everything to lower case. In almost all cases you should use this default
name for a package. 

This is important because the \Rpackage{affy} package contains
functionality to automatically detect the correct CDF package for a given
chip. If the package is not found, it will try to download from the 
Bioconductor servers. If the package is not found there, it will error out. 
At that point the only recourse is to manually change the cdfName slot of 
the \Robject{AffyBatch}, or re-create the CDF package with the correct name.

You can obtain the CDF name that is associated with a CEL file through
\begin{Sinput}
R> pname <- cleancdfname(whatcdf("mycelfile.cel"))
\end{Sinput}
Then call the package making function this way:
\begin{Sinput}
R> make.cdf.package("eggplantgenome.cdf", packagename=pname)
\end{Sinput}

Unfortunately, the naming convention for environments is slightly different.
Instead of using the \Rfunction{cleancdfname}, the \Rpackage{affy} package
will look for an environment with the same \it{exact} name as the original
CDF file, minus the suffix (the '.CDF' part). So for instance, if the CDF 
name was EggPlantGenome.CDF, you would want to create an environment this way:
\begin{Sinput}
R> EggPlantGenome <- make.cdf.env("EggPlantGenome.CDF")
\end{Sinput}

There are certain characters that will not work in a variable name
(such as +, -, /, *). If your CDF name contains one of these characters, the 
only recourse is to either create a package (which by default will not contain
these characters), or to manually change the cdfName of your \Robject{AffyBatch}
so the \Rpackage{affy} package will find your environment. Say your CDF name
is an-unacceptable-name.CDF.
\begin{Sinput}
R> mycdfenv <- make.cdf.env("an-unacceptable-name.CDF")
R> dat <- ReadAffy()
R> dat@cdfName <- "mycdfenv"
\end{Sinput}

In instances of \Robject{AffyBatch}, the \Robject{cdfName} slot gives the name
of the appropriate CDF file for the arrays that are represented in the
\Robject{intensity} slot.
The functions \Rfunction{read.celfile},
\Rfunction{read.affybatch}, and \Rfunction{ReadAffy} extract the CDF
filename from the CEL files that they read. The function
\Rfunction{cleancdfname} converts Affymetrix' CDF name to the name
that is used in Bioconductor. Here are two examples:
<<echo=F,results=hide>>=
library(affy)
@
<<>>=
cat("HG_U95Av2 is",cleancdfname("HG_U95Av2"),"\n")
cat("HG-133A is",cleancdfname("HG-133A"),"\n")
@
The method \Rfunction{getCdfInfo} takes as an argument \Robject{AffyBatch}
and returns the appropriate environment.  If \Robject{x} is an
\Robject{AffyBatch}, this function will look for an environment with name
\Robject{x@cdfName}.

\section{Location-ProbeSet Mapping}
\label{sec:probesloc}
On Affymetrix GeneChip arrays, several probes are used to represent
genes in the form of probe sets. From a {\it CEL} file we get for each
physical location, or cel, (uniquely identified by its $x$ and $y$
coordinates) an intensity.  The {\it CEL} file also contains the name
of the {\it CDF} file needed for the location-probe-set mapping. The
{\it CDF} files store the name of the probe set related to each
location on the array.  We store this mapping information in {\it R}
environments, such as the ones produced by \Rfunction{make.cdf.env} or
contained in the packages made by \Rfunction{make.cdf.package}.

In \Rpackage{affy}, the $x$ and $y$ coordinates are internally stored as
one number $i$. The mapping between $(x,y)$ and $i$ is provided by the
functions \Rfunction{i2xy(i)} and \Rfunction{xy2i} that are contained in each
CDF package. They are very simple:
\begin{eqnarray*}
  i &=& y  s_x + x + 1\\
  x &=& (i-1)\,\%\%\,  s_x\\
  y &=& (i-1)\,\%/\%\, s_x,
\end{eqnarray*}
where $s_x$ is the side length of the chip (measured in number of
probes) in $x$-direction.


\end{document}