% -*- mode: noweb; noweb-default-code-mode: R-mode; -*- %\VignetteIndexEntry{makecdfenv primer} %\VignetteKeywords{Preprocessing, Affymetrix} %\VignetteDepends{makecdfenv, affy} %\VignettePackage{makecdfenv} %documentclass[12pt, a4paper]{article} \documentclass[12pt]{article} \usepackage{amsmath} \usepackage{hyperref} \usepackage[authoryear,round]{natbib} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle} \author{Wolfgang Huber and Rafael Irizarry} \begin{document} \title{Textual description of makecdfenv} \maketitle \tableofcontents \section{Introduction} The \Rpackage{makecdfenv} package is part of the Bioconductor\footnote{\url{http://www.bioconductor.org/}} project. It can be used to create cdf environments from Affymetrix chip description files to be used with the package \Rpackage{affy}. For many {\it CDF} files a pre-assembled package is available from \url{http://www.bioconductor.org/}. The following notes explain what to do if you want to create your own. \section{Creating a CDF package} To install packages, you need certain tools besides R itself. They are generally available on most Unix platforms. Under MS-Windows, if you use the precompiled binary distribution of R from CRAN, you will need to install some extra tools, like the {\em source package installation files} and {\em Perl}. If you think this is too complicated, you can contribute the package to Bioconductor and we will create a windows binary for you. Alternatively, you may skip this section and proceed to section~\ref{sec:make.cdf.env} Let's say your CDF file is called \Robject{eggplantgenome.cdf} and is in your working directory. To make a package, simply write \begin{Sinput} R> make.cdf.package("eggplantgenome.cdf", species = "Solanum_sp") \end{Sinput} This will create a subdirectory \Robject{eggplantgenomecdf} in your working directory, which contains the package. Please consult the help page for \Rfunction{make.cdf.package} to find out about further options. Now, open a terminal with an operating system shell, and write \begin{Sinput} > R CMD INSTALL eggplantgenomecdf \end{Sinput} This will install the package into your R. You are now ready to use the affy package to process eggplant genechips. \section{Creating CDF environments} \label{sec:make.cdf.env} If you do not choose to use the package creation mechanism, you can still produce a data structure (in R lingo, it is called an {\it environment}) that can be used by the affy package. Let's say your CDF file is called \Robject{EggPlantGenome.cdf} and is in your working directory. Simply write \begin{Sinput} R> EggPlantGenome = make.cdf.env("EggPlantGenome.cdf") \end{Sinput} Please consult the help page for \Rfunction{make.cdf.env} to find out about further options. \section{Naming of CDF packages and environments} What should you call your package or environment? \Rfunction{make.cdf.package} chooses a default name by stripping all non-letters and non-numbers from the CDF file name, and converting everything to lower case. In almost all cases you should use this default name for a package. This is important because the \Rpackage{affy} package contains functionality to automatically detect the correct CDF package for a given chip. If the package is not found, it will try to download from the Bioconductor servers. If the package is not found there, it will error out. At that point the only recourse is to manually change the cdfName slot of the \Robject{AffyBatch}, or re-create the CDF package with the correct name. You can obtain the CDF name that is associated with a CEL file through \begin{Sinput} R> pname <- cleancdfname(whatcdf("mycelfile.cel")) \end{Sinput} Then call the package making function this way: \begin{Sinput} R> make.cdf.package("eggplantgenome.cdf", packagename=pname) \end{Sinput} Unfortunately, the naming convention for environments is slightly different. Instead of using the \Rfunction{cleancdfname}, the \Rpackage{affy} package will look for an environment with the same \it{exact} name as the original CDF file, minus the suffix (the '.CDF' part). So for instance, if the CDF name was EggPlantGenome.CDF, you would want to create an environment this way: \begin{Sinput} R> EggPlantGenome <- make.cdf.env("EggPlantGenome.CDF") \end{Sinput} There are certain characters that will not work in a variable name (such as +, -, /, *). If your CDF name contains one of these characters, the only recourse is to either create a package (which by default will not contain these characters), or to manually change the cdfName of your \Robject{AffyBatch} so the \Rpackage{affy} package will find your environment. Say your CDF name is an-unacceptable-name.CDF. \begin{Sinput} R> mycdfenv <- make.cdf.env("an-unacceptable-name.CDF") R> dat <- ReadAffy() R> dat@cdfName <- "mycdfenv" \end{Sinput} In instances of \Robject{AffyBatch}, the \Robject{cdfName} slot gives the name of the appropriate CDF file for the arrays that are represented in the \Robject{intensity} slot. The functions \Rfunction{read.celfile}, \Rfunction{read.affybatch}, and \Rfunction{ReadAffy} extract the CDF filename from the CEL files that they read. The function \Rfunction{cleancdfname} converts Affymetrix' CDF name to the name that is used in Bioconductor. Here are two examples: <>= library(affy) @ <<>>= cat("HG_U95Av2 is",cleancdfname("HG_U95Av2"),"\n") cat("HG-133A is",cleancdfname("HG-133A"),"\n") @ The method \Rfunction{getCdfInfo} takes as an argument \Robject{AffyBatch} and returns the appropriate environment. If \Robject{x} is an \Robject{AffyBatch}, this function will look for an environment with name \Robject{x@cdfName}. \section{Location-ProbeSet Mapping} \label{sec:probesloc} On Affymetrix GeneChip arrays, several probes are used to represent genes in the form of probe sets. From a {\it CEL} file we get for each physical location, or cel, (uniquely identified by its $x$ and $y$ coordinates) an intensity. The {\it CEL} file also contains the name of the {\it CDF} file needed for the location-probe-set mapping. The {\it CDF} files store the name of the probe set related to each location on the array. We store this mapping information in {\it R} environments, such as the ones produced by \Rfunction{make.cdf.env} or contained in the packages made by \Rfunction{make.cdf.package}. In \Rpackage{affy}, the $x$ and $y$ coordinates are internally stored as one number $i$. The mapping between $(x,y)$ and $i$ is provided by the functions \Rfunction{i2xy(i)} and \Rfunction{xy2i} that are contained in each CDF package. They are very simple: \begin{eqnarray*} i &=& y s_x + x + 1\\ x &=& (i-1)\,\%\%\, s_x\\ y &=& (i-1)\,\%/\%\, s_x, \end{eqnarray*} where $s_x$ is the side length of the chip (measured in number of probes) in $x$-direction. \end{document}