% -*- mode: noweb; noweb-default-code-mode: R-mode; -*- %\VignetteIndexEntry{altcdfenvs} %\VignetteKeywords{Preprocessing, Affymetrix} %\VignetteDepends{altcdfenvs} %\VignettePackage{altcdfenvs} \documentclass[12pt]{article} %\usepackage{amsmath} %\usepackage{hyperref} \usepackage[authoryear,round]{natbib} \textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \author{Laurent Gautier} \title{Alternative CDF environments} \begin{document} \maketitle \tableofcontents \section{Introduction} On short oligonuleotide arrays, several probes are designed to match a target transcript, and probes matching the same target transcript can be grouped in a probe set. Between the time the probes for a given short oligonucleotide chip were designed, and the time an analysis is made, the knowledge of expected transcripts for a given organism might have changed. Unless one includes the latest development in transcripts into an analysis, the analysis could suffer from what we like to call a {\it Dorian Gray}\footnote{From the novel `The Picture of Dorian Gray' by Oscar Wilde.} effect. The chip itself does not change, which means that the probes and their respective sequences remain the same, while the knowledge of the transcripts, and eventually their sequence, might evolve, and in time the immobility of the probe and probe sets give an uglier picture of the biological phenomena to study. Being able to easily modify or replace the grouping of probes in probe sets gives the opportunity to minimize this effect. The package is directly usable with {\it Affymetrix} {\it GeneChip} short oligonucleotide arrays, and can be adapted or extended to other platforms. The bibliographic reference associated with the package is given by the command: \begin{Scode} citation(package="altcdfenvs") \end{Scode} \begin{quote} Alternative mapping of probes to genes for Affymetrix chips Laurent Gautier, Morten Mooller, Lennart Friis-Hansen, Steen Knudsen BMC Bioinformatics 2004, 5:111 \end{quote} If you use it, consider citing it, and if you cite it consider citing as well other packages it depends on. To start we will first load the package: <<>>= library(altcdfenvs) @ \section{The class \Rclass{CdfEnvAffy}} Each instance of this class contains a way to group probes in probe sets. Different instances, describing different ways to group probes in probe sets, can co-exist for a given chip type. When experimenting, it is highly recommended to use the functions \Rfunction{validCdfEnvAffy} and \Rfunction{validAffyBatch} to make sure that a given instance is a valid one. \section{Reading sequence information in FASTA connections} The package contains simple functions to read {\bf R} connections in the FASTA format. Typically, collections of sequences are stored in FASTA files, which can be significantly large, one can wish to read and process sequences one after the other. This can be done by opening the file in `r' mode: <<>>= fasta.filename <- system.file("exampleData", "sample.fasta", package="altcdfenvs") con <- file(fasta.filename, open="r") @ Reading the sequences one after another, and printing information about them in turn goes like: <<>>= fasta.seq <- read.FASTA.entry(con) while(! is.null(fasta.seq$header)) { print(fasta.seq) fasta.seq <- read.FASTA.entry(con) } close(con) @ One can foresee that the matching of a set of reference sequences against all the probes can be parallelized easily: the reference sequences can simply be distributed across different processors/machines. When working with all the reference sequences in a single large FASTA file, the option \Robject{skip} can let one implement a poor man's parallel sequence matching very easily. \section{Creating an alternative mapping from sequences in a FASTA file} \subsection{Select the constituting elements} \begin{itemize} \item Chip type: For this tutorial we decide to work with the Affymetrix chip HG-U133A. \item Target sequences: The set of target sequences we use for this tutorial is in the exemplar FASTA file: <<>>= ## first, count the number of FASTA entries in our file con <- file(fasta.filename, open="r") n <- countskip.FASTA.entries(con) close(con) ## read all the entries con <- file(fasta.filename, open="r") my.entries <- read.n.FASTA.entries.split(con, n) close(con) @ \end{itemize} \subsection*{matching the probes} The package \Rpackage{Biostrings} and the probe data package for HG-U133A are required to perform the matching. The first step is to load them: <<>>= library(hgu133aprobe) @ The matching is done simply (one can refer to the documentation for the package \Rpackage{Biostrings} for further details): <<>>= targets <- my.entries$sequences names(targets) <- sub(">.+\\|(Hs\\#|NM_)([^[:blank:]\\|]+).+", "\\1\\2", my.entries$headers) m <- matchAffyProbes(hgu133aprobe, targets, "HG-U133A") @ \subsection{analyzing the matches} When the position of the match between probes and target sequences does not matter, the association can be represented as a bipartite graph. The method \Rfunction{toHypergraph} will transform an instance of \Rclass{AffyProbesMatch} into an \Rclass{Hypergraph}. <<>>= hg <- toHypergraph(m) @ Currently, there are not many functions implemented around hypergraphs, so we convert it to a more common graph. <<>>= gn <- toGraphNEL(hg) @ Since this is now a regular graph, all of probes and targets are regular nodes on that graph. Node name-based rules can be applied to identify whether a node is a target sequence or a probe. <<>>= targetNodes <- new.env(hash=TRUE, parent=emptyenv()) for (i in seq(along=targets)) { targetNodes[[names(targets)[i]]] <- i } @ Since the graph is relatively small, we can plot it, and see that one probe is common to both probe sets: <>= library(Rgraphviz) tShapes <- rep("ellipse", length=length(targets)) names(tShapes) <- names(targets) tColors <- rep("ivory", length=length(targets)) names(tColors) <- names(targets) nAttrs <- list(shape = tShapes, fillcolor = tColors) gAttrs <- list(node = list(shape = "rectangle", fixedsize = FALSE)) plot(gn, "neato", nodeAttrs = nAttrs, attrs = gAttrs) @ Whenever a large number oftarget sequences are involved, counting the degrees will be more efficient than plotting. The package contains a function to create a \Rclass{CdfEnv} from the matches: <>= alt.cdf <- buildCdfEnv.biostrings(m, nrow.chip = 712, ncol.chip = 712) @ Note that the size for chip must be specified. This is currently a problem with cdfenvs as they are created by the package \Rpackage{makecdfenv}. The class \Rclass{CdfEnv} suggests a way to solve this (hopefully this will be integrated in \Rpackage{makecdfenv} in the near future). When this happens, the section below will be replaced by something more intuitive. But in the meanwhile, here is the current way to use our shiny brand new class \Rclass{CdfEnv}: \begin{Scode} ## say we have an AffyBatch of HG-U133A chips called 'abatch' ## summary checks to avoid silly mistakes validAffyBatch(abatch, alt.cdf) ## it is ok, so we proceed... ## get the environment out of it class alt.cdfenv <- alt.cdf@envir abatch@cdfName <- "alt.cdfenv" \end{Scode} From now on, the object \Robject{abatch} will use our `alternative mapping' rather than the one provided by the manufacturer of the chip: \begin{Scode} print(abatch) \end{Scode} %\section*{Creating an alternative environment to store only perfect matches} \section{Always up-to-date} Even if alternative mapping is not used upstream of the analysis, it can still be interesting to verify probesets highlighted during data analysis. The \Rpackage{biomaRt} package makes withdrawing up-to-date sequences very easy, and those sequences can be matched against the probes. First, we create a \emph{mart}: \begin{Scode} library(biomaRt) mart <- useMart("ensembl",dataset="hsapiens_gene_ensembl") \end{Scode} (refer to the documentation for the \Rpackage{biomaRt} for further information). \subsection{Casual checking of genes} In this example, we assume that for one reason or an other a researcher would like to know more about the probes matching the SLAMF genes. <>= geneSymbols <- c("SLAMF1", "SLAMF3", "SLAMF6", "SLAMF7", "SLAMF8", "SLAMF9") @ The vector \Robject{geneSymbols} defined can easily be replaced by your favorite genes; the example below should still work. We then write a convenience function \Robject{getSeq} to extract the sequences. This function appenda a \verb+-+ to the HUGO symbol (as there might be several sequences matching). <>= getSeq <- function(name) { seq <- getSequence(id=name, type="hgnc_symbol", seqType="cdna", mart = mart) targets <- seq$cdna if (is.null(targets)) return(character(0)) names(targets) <- paste(seq$hgnc_symbol, 1:nrow(seq), sep="-") return(targets) } @ % load saved data (instead of connecting to the mart) <>= load(system.file("exampleData", "slamf_targets.RData", package="altcdfenvs")) @ The function let us obtain the target sequences very easily: \begin{Scode} targets <- unlist(lapply(geneSymbols, getSeq)) \end{Scode} The targets are matched as seen previously: <<>>= m <- matchAffyProbes(hgu133aprobe, targets, "HG-U133A") @ A colorful graph can be made in order to visualize how matching probes are distributed: <>= hg <- toHypergraph(m) gn <- toGraphNEL(hg) library(RColorBrewer) col <- brewer.pal(length(geneSymbols)+1, "Set1") tColors <- rep(col[length(col)], length=numNodes(gn)) names(tColors) <- nodes(gn) for (col_i in 1:(length(col)-1)) { node_i <- grep(paste("^", geneSymbols[col_i], "-", sep=""), names(tColors)) tColors[node_i] <- col[col_i] } nAttrs <- list(fillcolor = tColors) plot(gn, "twopi", nodeAttrs=nAttrs) @ \begin{itemize} \item Watch for \emph{SLAMF6} and \emph{SLAMF7} \item The second sequence in SLAMF8 can potentially has specific probes (the rest of the probes are matching both SLAMF8 sequences) \end{itemize} Comparison with the official mapping can be made (not so simply, a future version should address this) <<>>= library("hgu133a.db") affyTab <- toTable(hgu133aSYMBOL) slamf_i <- grep("^SLAMF", affyTab$symbol) pset_id <- affyTab$probe_id[slamf_i] library("hgu133acdf") countProbes <- lapply(pset_id, function(x) nrow(hgu133acdf[[x]])) names(countProbes) <- affyTab$symbol[slamf_i] countProbes @ The results do not appear in complete agreement with the matching just performed. \end{document}