%\VignetteIndexEntry{qPCR analysis in R}
%\VignetteDepends{HTqPCR}
%\VignetteKeywords{qpcr, preprocessing, normalization}
%\VignettePackage{HTqPCR} % name of package
%%%% HEAD SECTION: START EDITING BELOW %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\documentclass[11pt, a4paper, fleqn]{article}
\usepackage{geometry}\usepackage{color}
\definecolor{darkblue}{rgb}{0.0,0.0,0.75}
\usepackage[%
baseurl={http://www.bioconductor.org},%
pdftitle={Introduction to HTqPCR},%
pdfauthor={Heidi Dvinge},%
pdfsubject={HTqPCR Vignette},%
pdfkeywords={Bioconductor},%
pagebackref,bookmarks,colorlinks,linkcolor=darkblue,citecolor=darkblue,%
filecolor=darkblue,urlcolor=darkblue,pagecolor=darkblue,%
raiselinks,plainpages,pdftex]{hyperref}
\usepackage{verbatim} % for multi-line comments
\usepackage{fancyvrb}
\usepackage{amsmath,a4,t1enc, graphicx}
\usepackage{natbib}
\bibpunct{(}{)}{;}{a}{,}{,}
\parindent0mm
%\parskip2ex plus0.5ex minus0.3ex
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\Rcode}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\textit{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}
\newcommand{\phead}[1]{{\flushleft \sf \small \textbf{#1} \quad}}
\newcommand{\myincfig}[3]{%
  \begin{figure}[h!tb]
    \begin{center}
      \includegraphics[width=#2]{#1}
      \caption{\label{#1}\textit{#3}}
    \end{center}
  \end{figure}
}
\addtolength{\textwidth}{2cm}
\addtolength{\oddsidemargin}{-1cm}
\addtolength{\evensidemargin}{-1cm}
\addtolength{\textheight}{2cm}
\addtolength{\topmargin}{-1cm}
\addtolength{\skip\footins}{1cm}
%%%%%%% START EDITING HERE %%%%%%%
\begin{document}
\DefineVerbatimEnvironment{Sinput}{Verbatim} {xleftmargin=1.5em}
\DefineVerbatimEnvironment{Soutput}{Verbatim}{xleftmargin=1.5em}
\DefineVerbatimEnvironment{Scode}{Verbatim}{xleftmargin=1.5em}
\fvset{listparameters={\setlength{\topsep}{0pt}}}
\renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}}
\SweaveOpts{eps=false, keep.source=FALSE} % produce no 'eps' figures
<>=
options(width=65)
set.seed(123)
@
\title{HTqPCR - high-throughput qPCR analysis in R and Bioconductor}
\author{Heidi Dvinge}
%\date{}
\maketitle

\section{Introduction}
The package \Rpackage{HTqPCR} is designed for the analysis of cycle threshold (Ct) values from quantitative real-time PCR data. The main areas of functionality comprise data import, quality assessment, normalisation, data visualisation, and testing for statistical significance in Ct values between different features (genes, miRNAs).

The example data used throughout this vignette is from TaqMan Low Density Arrays (TLDA), a proprietary format of Applied Biosystems, Inc. However, most functions can be applied to any kind of qPCR data, regardless of the nature of the experimental samples and whether genes or miRNAs were measured. Section~\ref{SEC:formats} gives some examples of how to handle other types of input data, including output formats from other qPCR assay vendors (e.g. Roche Applied Science and Bio-Rad) and data from non-well based microfluidic systems (e.g. BioMark from Fluidigm Corporation).

<>=
library("HTqPCR")
@

The package employs functions from other packages of the Bioconductor project \citep{ref:bioc}. Dependencies include \Rpackage{Biobase}, \Rpackage{RColorBrewer}, \Rpackage{limma}, \Rpackage{statmod}, \Rpackage{affy} and \Rpackage{gplots}.
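If \Rpackage{HTqPCR} is not already installed, it and its dependencies can be obtained from Bioconductor. A minimal sketch is shown below (not evaluated here); it assumes a reasonably recent R/Bioconductor setup where installation is handled via the \Rpackage{BiocManager} package.

<<install, eval=FALSE>>=
## Not run: install HTqPCR and its dependencies from Bioconductor
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("HTqPCR")
@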
\subsection*{Examples from the vignette}
This vignette was developed in Sweave, so the embedded R code was compiled when the PDF was generated, and its output produced the results and plots that appear throughout the document. The following commands will extract all of the code from this file:

<>=
all.R.commands <- system.file("doc", "HTqPCR.Rnw", package = "HTqPCR")
Stangle(all.R.commands)
@

This will create a file called HTqPCR.R in your current working directory, and this file can then either be sourced directly, or the commands run individually.

\subsection*{General workflow}
The main functions and their use are outlined in Figure~\ref{fig:workflow}. Note that the QC plotting functions can be used both before and after normalisation, in order to examine the quality of the data or look for particular trends.

\begin{figure}
\begin{center}
\includegraphics{workflow.png}
\end{center}
\caption{Workflow in \Rpackage{HTqPCR} analysis of qPCR data. Centre column: The main procedural steps in a typical qPCR analysis; Left: examples of visualisation functions; Right: data analysis functions.}
\label{fig:workflow}
\end{figure}

For the full set of available functions type:

<>=
ls("package:HTqPCR")
@

\subsection*{Getting help}
Please send questions about \Rpackage{HTqPCR} to the Bioconductor mailing list.\\
See \url{http://www.bioconductor.org/docs/mailList.html} \\
%Their archive of questions and responses may prove helpful, too.

%%%%%%%%%%%%%%%% DESCRIPTION OF THE DATA FORMAT %%%%%%%%%%%%%%%%%

\section{\Rclass{qPCRset} objects}
The data is stored in an object of class \Rclass{qPCRset}, which inherits from the \Rclass{eSet} class from package \Rpackage{Biobase}. \Rclass{eSet} was originally designed for handling microarray data, but can deal with any kind of data where the same property (e.g.~qPCR of genes) is measured across a range of samples. Two \Robject{qPCRset} test objects are included in the package: one containing raw data, and the other containing processed values. An example is shown in Figure~\ref{fig:objectfunctions}, along with some of the functions that can typically be used for manipulating \Rclass{qPCRset} objects.

<>=
data(qPCRraw)
data(qPCRpros)
class(qPCRraw)
@

The format is the same for raw and normalized data, and depending on how much information is available about the input data, the object can contain the following information:

\begin{description}
\item[\Robject{featureNames}] Object of class \Rclass{character} giving the names of the features (genes, miRNAs) used in the experiment. This is a column in the \Rclass{featureData} of the \Rclass{qPCRset} object (see below).
\item[\Robject{sampleNames}] Object of class \Rclass{character} containing the sample names of the individual experiments.
\item[\Robject{exprs}] Object of class \Rclass{matrix} containing the Ct values.
\item[\Robject{flag}] Object of class \Rclass{data.frame} containing the flag for each Ct value, as supplied by the input files. These are typically set during the calculation of Ct values, and indicate whether the results are flagged as e.g.~``Passed'' or ``Flagged''.
\item[\Robject{featureType}] Object of class \Rclass{character} representing the different types of features on the card, such as endogenous controls and target genes. This is a column in the \Rclass{featureData} of the \Rclass{qPCRset} object.
\item[\Robject{featurePos}] Object of class \Rclass{character} representing the location (``well'') of a gene when TLDA cards, or some other format with a defined spatial layout of features, are used. Like \Robject{featureType} and \Robject{featureNames}, \Robject{featurePos} is found within the \Rclass{featureData}.
\item[\Robject{featureClass}] Object of class \Rclass{factor} with some meta-data about the genes, for example whether they are transcription factors, kinases, markers for different types of cancer or similar. This is typically set by the user, and will be located within the \Rclass{featureData}.
\item[\Robject{featureCategory}] Object of class \Rclass{data.frame} representing the quality of the measurement for each Ct value, such as ``OK'', ``Undetermined'' or ``Unreliable''. These can be set using the function \Rfunction{setCategory} depending on a number of parameters, such as how the Ct values are flagged, upper and lower limits of Ct values and variations between technical and biological replicates of the same feature.
\item[\Robject{history}] Object of class \Rclass{data.frame} indicating what operations have been performed on the \Rclass{qPCRset} object, and what the parameters were. Automatically set when any of the functions on the upper right hand side of Figure~\ref{fig:workflow} are called (\Rfunction{readCtData, setCategory, filterCategory, normalizeCtData, filterCtData, changeCtLayout, rbind, cbind}).
\end{description}

Generally, information can be handled in the \Rclass{qPCRset} using the same kind of functions as for \Rclass{ExpressionSet}, such as \Rfunction{exprs}, \Rfunction{featureNames} and \Rfunction{featureCategory} for extracting the data, and \Rfunction{exprs<-}, \Rfunction{featureNames<-} and \Rfunction{featureCategory<-} for replacing or modifying values. The use of \Rfunction{exprs} might not be intuitive to users who are not used to dealing with microarray data, and hence \Rclass{ExpressionSet}. The functions \Rfunction{getCt} and \Rfunction{setCt<-}, which perform the same operations as \Rfunction{exprs} and \Rfunction{exprs<-}, are therefore also included. For the sake of consistency, \Rfunction{exprs} will be used throughout this vignette for accessing the Ct values, but it can be replaced by \Rfunction{getCt} in all examples.

The overall structure of \Robject{qPCRset} is inherited from \Robject{eSet}, as shown in the example below. This is a flexible format, which allows the user to add additional information about, for example, the experimental protocol. Information about the samples is contained within the \Robject{phenoData} slot, and details can be accessed or modified using \Rfunction{pData}. Likewise, details for the individual features (mRNAs, miRNAs) are available in the \Robject{featureData} slot, and can be accessed or modified using \Rfunction{fData}. See e.g.~\Robject{AnnotatedDataFrame} for details.

<>=
slotNames(qPCRraw)
phenoData(qPCRraw)
pData(qPCRraw)
pData(qPCRraw) <- data.frame(Genotype=rep(c("A", "B"), each=3), Replicate=rep(1:3, 2))
pData(qPCRraw)
featureData(qPCRraw)
head(fData(qPCRraw))
@

\Robject{qPCRset} objects can also be combined or reformatted in various ways (see section~\ref{SEC:objectmanipulation}).
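As a brief illustration of the Ct-specific accessors mentioned above, the following sketch (not evaluated here) shows how \Rfunction{getCt} and \Rfunction{setCt<-} mirror \Rfunction{exprs} and \Rfunction{exprs<-}; the shift of all Ct values by one cycle is purely hypothetical.

<<getset, eval=FALSE>>=
## Not run: getCt() returns the same matrix of Ct values as exprs()
identical(getCt(qPCRraw), exprs(qPCRraw))
## Hypothetical modification: shift all Ct values by one cycle on a copy
tmp <- qPCRraw
setCt(tmp) <- getCt(qPCRraw) + 1
head(getCt(tmp))
@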
\begin{figure}
\begin{center}
\includegraphics{objects.png}
\end{center}
\caption{An example of a \Rclass{qPCRset} object, and some of the functions that can be used to display and/or alter different aspects of the object, i.e.~the accessor and replacement functions.}
\label{fig:objectfunctions}
\end{figure}

%%%%%%%%%%%%%%%% DATA INPUT %%%%%%%%%%%%%%%%%

\newpage

\section{Reading in the raw data}
\label{SEC:datainput}

%\subsection{General data format}
The standard input consists of tab-delimited text files containing the Ct values for a range of genes. Additional information, such as type of gene (e.g.~target, endogenous control) or groupings of genes into separate classes (e.g.~markers, kinases) can also be read in, or supplied later. The package comes with example input files (from Applied Biosystems' TLDA cards), along with a text file listing sample file names and biological conditions.

<>=
path <- system.file("exData", package="HTqPCR")
head(read.delim(file.path(path, "files.txt")))
@

The data consist of 192 features represented twice on the array and labelled ``Gene1'', ``Gene2'', etc. There are three different conditions, ``Control'', ``Starve'' and ``LongStarve'', each having 2 replicates. The input data consists of tab-delimited text files (one per sample); however, the format is likely to vary depending on the specific platform on which the data were obtained (e.g., TLDA cards, 96-well plates, or some other format). The only requirement is that columns containing the Ct values and feature names are present.

<>=
files <- read.delim(file.path(path, "files.txt"))
raw <- readCtData(files=files$File, path=path)
@

The \Rclass{qPCRset} object looks like:

<>=
show(raw)
@

NB: This section only deals with data presented in this general data format. For notes regarding other types of input data, see section~\ref{SEC:formats}. That section also briefly deals with other types of qPCR results besides Ct data, notably the Cp values reported by the LightCycler System from Roche.

%%%%%%%%%%%%%%%% BASIC PLOTTING %%%%%%%%%%%%%%%%%

\newpage

\section{Data visualisation}

\subsection{Overview of Ct values across all groups}
To get a general overview of the data the (average) Ct values for a set of features across all samples or different condition groups can be displayed. In principle, all features in a sample might be chosen, but to make it less cluttered Figure~\ref{fig:overview} displays only the first 10 features. The top plot was made using just the Ct values, and shows the 95\% confidence interval across replicates within and between samples. The bottom plot represents the same values but relative to a chosen calibrator sample, here the ``Control''. Confidence intervals can also be added to the relative plot, in which case these will be calculated for all values compared to the average of the calibrator sample per gene.

%<>=
<>=
g <- featureNames(raw)[1:10]
plotCtOverview(raw, genes=g, xlim=c(0,50), groups=files$Treatment, conf.int=TRUE, ylim=c(0,55))
@
<>=
plotCtOverview(raw, genes=g, xlim=c(0,50), groups=files$Treatment, calibrator="Control")
@

\begin{figure}
\begin{center}
<>=
par(mfrow=c(2,1))
<>
<>
@
\end{center}
\caption{Overview of Ct values for the raw data.}
\label{fig:overview}
\end{figure}

\subsection{Spatial layout}
When the features are organised in a particular spatial pattern, such as the 96- or 384-well plates, it is possible to plot the Ct values or other characteristics of the features using this layout.
Figure~\ref{fig:one} shows an example of the Ct values, as well as the location of different classes of features (using random examples here), across all the wells of a TLDA microfluidic card.

<>=
plotCtCard(raw, col.range=c(10,35), well.size=2.6)
@
<>=
featureClass(raw) <- factor(c("Marker", "TF", "Kinase")[sample(c(1,1,2,2,1,3), 384, replace=TRUE)])
plotCtCard(raw, plot="class", well.size=2.6)
@

\begin{figure}
\begin{center}
<>=
<>
@
<>=
<>
@
\end{center}
\caption{Ct values for the first sample (top), and the location of different feature classes (bottom). Ct values are visualised using colour intensity, and grey circles are features that were marked ``undetermined'' in the input file.}
\label{fig:one}
\end{figure}

\subsection{Comparison of duplicated features within samples}
When a sample contains duplicate measurements for some or all features, the Ct values of these duplicates can be plotted against each other to assess the concordance between duplicates. In Figure~\ref{fig:replicates} the duplicates in sample 2 are plotted against each other, and those where the Ct values differ more than 20\% from the average of a given feature are marked.

<>=
plotCtReps(qPCRraw, card=2, percent=20)
@

\begin{figure}
\begin{center}
<>=
<>
@
\end{center}
\caption{Concordance between duplicated Ct values in sample 2, marking features differing $>$20\% from their mean.}
\label{fig:replicates}
\end{figure}

Differences will often arise because one of the duplicates is marked as ``Undetermined'', thus contributing an artificially high Ct value, but other causes of discrepancy exist as well.

\subsection{Variation within and across samples}
In some cases more than two replicates are present, either within each qPCR card (feature replicates) or across cards (replicated samples). Assessing the variation within replicates can indicate whether some samples or individual features are less reliable, or if perhaps an entire qPCR card shows high variation between replicate features and needs to be discarded. \Rfunction{plotCtVariation} generates a boxplot with all the variation values, either across genes or within each sample. That way the general distribution of variation or standard deviation values can be compared quickly (Figure~\ref{fig:variation}). In this example, the variation across samples doesn't differ much.

%For illustration, we use a data set where one of the samples is highly variable.
<>=
raw.mix <- raw
#exprs(raw.mix)[,6] <- sample(exprs(raw[,6]))
plotCtVariation(raw.mix, variation="sd", log=TRUE, main="SD of replicated features", col="lightgrey")
@

If it looks like there's an unacceptable (or interesting) difference in the variation, this can be further investigated using the parameter \Rfunction{type="detail"}. This will generate multiple sub-plots, containing a single scatterplot of variation versus mean for each gene or sample (Figure~\ref{fig:variation}). That way individual outliers can be identified, or whole samples can be removed, after examining the resulting variation in more detail.

<>=
raw.variation <- plotCtVariation(raw.mix, type="detail", add.featurenames=TRUE, pch=" ", cex=1.2)
@
<>=
names(raw.variation)
head(raw.variation[["Var"]][,1:4])
head(raw.variation[["Mean"]][,1:4])
apply(raw.variation[["Var"]][,3:7], 2, summary)
colSums(raw.variation[["Var"]][,3:7]>20)
@

In the example in this section many features from sample 6 have intra-replicate variation above an arbitrary threshold selected based on Figure~\ref{fig:variation}, and the mean and median values are much higher than for the remaining samples.
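If a particular sample turns out to have consistently high intra-replicate variation, it can simply be dropped before further analysis using the standard subsetting notation; a minimal sketch (not evaluated, using sample 6 purely as a hypothetical example):

<<dropsample, eval=FALSE>>=
## Not run: hypothetical removal of a highly variable sample (here column 6)
raw.clean <- raw.mix[, -6]
show(raw.clean)
@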
Sample 1 is excluded from the detail plot due to the page width.

\begin{figure}
\begin{center}
<>=
<>
@
<>=
<>
@
\end{center}
\caption{(Top) summary of standard deviation between replicated features within each of the six samples. (Bottom) Variation versus mean for replicated features.}
\label{fig:variation}
\end{figure}

Variation across Ct values is discussed further in the following section regarding filtering.

%%%%%%%%%% DESCRIPTION OF FEATURE CATEGORIES %%%%%%%%%%%

\newpage

\section{Feature categories and filtering}
Each Ct value in \Rpackage{HTqPCR} has an associated feature category. This is an important component to indicate the reliability of the qPCR data. Aside from the ``OK'' indicator, there are two other categories: ``Undetermined'' is used to flag Ct values above a user-selected threshold, and ``Unreliable'' indicates Ct values that are either so low that the user deems them problematic, or that arise from deviation between individual Ct values across replicates. By default, only Ct values labelled as ``undetermined'' in the input data files are placed into the ``Undetermined'' category, and the rest are classified as ``OK''. However, either before or after normalisation these categories can be altered depending on various criteria.

\begin{description}
\item[Range of Ct values] Some Ct values might be too high or low to be considered a reliable measure of gene expression in the sample, and should therefore not be marked as ``OK''.
\item[Flags] Depending on the qPCR input the values might have associated flags, such as ``Passed'' or ``Failed'', which are used for assigning categories.
\item[Biological and technical replicates] If features are present multiple times within each sample, or if samples are repeated in the form of technical or biological replicates, then these values can be compared. Ct values lying outside a user-selected confidence interval (90\% by default) will be marked as ``Unreliable''.
\end{description}

%<>=
%raw.cat <- setCategory(raw, groups=files$Treatment, quantile=0.8)
%@

A summary plot for the sample categories is depicted in Figure~\ref{fig:categories}. The result can be stratified by \Rfunction{featureType} or \Rfunction{featureClass}, for example to determine whether one class of features performed better or worse than others.

<>=
raw.cat <- raw
plotCtCategory(raw.cat)
@
<>=
plotCtCategory(raw.cat, stratify="class")
@

\begin{figure}
\begin{center}
<>=
par(mfrow=c(2,1))
<>
<>
@
\end{center}
\caption{Summary of the categories, either for each sample individually or stratified by feature class.}
\label{fig:categories}
\end{figure}

The results can also be shown per feature rather than averaged across samples (Figure~\ref{fig:categories2}).

<>=
plotCtCategory(raw.cat, by.feature=TRUE, cexRow=0.1)
@

\begin{figure}
\begin{center}
<>=
<>
@
\end{center}
\caption{Summary of the categories, clustered across features.}
\label{fig:categories2}
\end{figure}

If one doesn't want to include unreliable or undetermined data in part of the analysis, these Ct values can be set to NA using \Rfunction{filterCategory}. However, the presence of NAs could make the tests for differential expression less robust. When testing for differential expression the result will come with an associated category (``OK'' or ``Unreliable'') that can instead be used to assess the quality of the results. For the final results both ``Undetermined'' and ``Unreliable'' are pooled together as being ``Unreliable''.
However, the label for each feature can either be set according to whether half or more of the samples are unreliable, or whether only a single non-``OK'' category is present, depending on the level of stringency the user wishes to enforce.

%%%%%%%%%%%%%%%% DATA NORMALISATION %%%%%%%%%%%%%%%%%

\section{Normalisation}
Five different normalisation methods are currently implemented in \Rpackage{HTqPCR}. Three of these (\Rfunction{scale.rankinvariant}, \Rfunction{deltaCt} and \Rfunction{geometric.mean}) will scale each individual sample by a given value, whereas the remaining two will change the distribution of Ct values.

\begin{description}
\item[quantile] Will make the distribution of Ct values more or less identical across samples.
\item[norm.rankinvariant] Computes all rank-invariant sets of features between pairwise comparisons of each sample against a reference, such as a pseudo-mean. The rank-invariant features are used as a reference for generating a smoothing curve, which is then applied to the entire sample.
\item[scale.rankinvariant] Also computes the pairwise rank-invariant features, but then takes only the features found in a certain number of samples, and uses the average Ct value of those as a scaling factor for correcting all Ct values.
\item[deltaCt] Calculates the standard deltaCt values, i.e.~subtracts the mean of the chosen controls from all other values in the feature set.
\item[geometric.mean] Calculates the average Ct value for each sample, and scales all Ct values according to the ratio of these mean Ct values across samples. There are some indications that this is beneficial for e.g.~miRNA studies~\citep{Mestdagh:2009}.
\end{description}

For the rank-invariant normalisation and geometric mean methods, Ct values above a given threshold can be excluded from the calculation of a scaling factor or normalisation curve. This is useful so that a high proportion of ``Undetermined'' Ct values (assigned a value of 40 by default) in a given sample doesn't bias the normalisation of the remaining features.

In the example dataset, Gene1 and Gene60 correspond to 18S RNA and GAPDH, and are used as endogenous controls. Normalisation methods can be run as follows:

<>=
q.norm <- normalizeCtData(raw.cat, norm="quantile")
sr.norm <- normalizeCtData(raw.cat, norm="scale.rank")
nr.norm <- normalizeCtData(raw.cat, norm="norm.rank")
d.norm <- normalizeCtData(raw.cat, norm="deltaCt", deltaCt.genes=c("Gene1", "Gene60"))
g.norm <- normalizeCtData(raw.cat, norm="geometric.mean")
@

Comparing the raw and normalised values gives an idea of how much correction has been performed (Figure~\ref{fig:two}), as shown below for the \Robject{q.norm} object. Note that the scale on the y-axis varies.
<>=
plot(exprs(raw), exprs(q.norm), pch=20, main="Quantile normalisation", col=rep(brewer.pal(6, "Spectral"), each=384))
@

\begin{figure}
\begin{center}
<>=
col <- rep(brewer.pal(6, "Spectral"), each=384)
col2 <- brewer.pal(5, "Dark2")
par(mfrow=c(3,2), mar=c(2,2,2,2))
# All methods individually
plot(exprs(raw), exprs(q.norm), pch=20, main="Quantile normalisation", col=col)
plot(exprs(raw), exprs(sr.norm), pch=20, main="Rank invariant scaling", col=col)
plot(exprs(raw), exprs(nr.norm), pch=20, main="Rank invariant normalisation", col=col)
plot(exprs(raw), exprs(d.norm), pch=20, main="deltaCt normalisation", col=col)
plot(exprs(raw), exprs(g.norm), pch=20, main="Geometric mean normalisation", col=col)
# Just a single sample, across methods
plot(exprs(raw)[,3], exprs(q.norm)[,3], pch=20, col=col2[1], main="Comparison of methods for sample 3", ylim=c(-10,40))
points(exprs(raw)[,3], exprs(sr.norm)[,3], pch=20, col=col2[2])
points(exprs(raw)[,3], exprs(nr.norm)[,3], pch=20, col=col2[3])
points(exprs(raw)[,3], exprs(d.norm)[,3], pch=20, col=col2[4])
points(exprs(raw)[,3], exprs(g.norm)[,3], pch=20, col=col2[5])
legend(8, 40, legend=c("Quantile", "Rank.invariant scaling", "Rank.invariant normalization", "deltaCt", "Geometric.mean"), col=col2, lwd=2, bty="n")
@
\end{center}
\caption{Normalized versus raw data, using a separate colour for each sample. The raw data is plotted along the x-axis and the normalised along y. The last plot is a comparison between normalization methods for the third sample, still with the raw Ct values along the x-axis.}
\label{fig:two}
\end{figure}

%%%%%%%%%%%%%%%% FILTERING %%%%%%%%%%%%%%%%%

\newpage

\section{Filtering and subsetting the data}
At any point during the analysis it's possible to filter out individual features or groups of features that are either deemed to be of low quality, or not of interest for a particular aspect of the analysis. This can be done using any of the feature characteristics that are included in the \Rfunction{featureNames}, \Rfunction{featureType}, \Rfunction{featureClass} and/or \Rfunction{featureCategory} slots of the data object. Likewise, the \Rfunction{qPCRset} object can be turned into smaller subsets, for example if only a particular class of features is to be used, or some samples should be excluded. Simple subsetting can be done using the standard \Rfunction{[,]} notation of R, for both rows (genes) and columns (samples).

%<>=
%nr.norm[1:10,]
%nr.norm[,c(1,3,5)]
%@

%Filtering is done by specifying the components to remove, either by just using a single criteria, or by combining multiple filters:
%<>=
%qFilt <- filterCtData(nr.norm, remove.type="Endogenous Control")
%qFilt <- filterCtData(nr.norm, remove.name=c("Gene1", "Gene20", "Gene30"))
%qFilt <- filterCtData(nr.norm, remove.class="Kinase")
%qFilt <- filterCtData(nr.norm, remove.type=c("Endogenous Control"), remove.name=c("Gene1", "Gene20", "Gene30"))
%@

%The data can also be adjusted according to feature categories. With \Rfunction{filterCategory} mentioned previously it's possible to replace certain Ct values with NA, but one might want to completely exclude features where a certain number of the Ct values are for example unreliable.
%
%<>=
%qFilt <- filterCtData(nr.norm, remove.category="Undetermined")
%qFilt <- filterCtData(nr.norm, remove.category="Undetermined", n.category=5)
%@

Another typical filtering step would be to remove features showing little or no variation across samples, prior to testing for statistical significance of genes between samples.
Features with relatively constant Ct levels are less likely to be differentially expressed, so including them in the downstream analysis would cause some loss of power when adjusting the p-values for multiple testing across the feature-by-feature hypothesis tests. Variation across samples can be assessed using for example the interquartile range (IQR) values for each feature.
%
%<>=
%iqr.values <- apply(exprs(nr.norm), 1, IQR)
%hist(iqr.values, n=20, main="", xlab="IQR across samples")
%abline(v=1.5, col=2)
%@
%
%All features with IQR below a certain threshold can then be filtered out.
%
%<>=
%qFilt <- filterCtData(nr.norm, remove.IQR=1.5)
%@
%
%\begin{figure}
%\begin{center}
%<>=
%<>
%@
%\end{center}
%\caption{Histogram of the IQR values for all features across the samples, including the cut-off used in the filtering example.}
%\label{fig:IQR}
%\end{figure}
%
%Note that filtering prior to normalisation can affect the outcome of the normalisation procedure. In some cases this might be desirable, for example if a particular feature class is heavily biasing the results, so it's preferable to split the \Robject{qPCRset} object into smaller data sets. However, in other cases it might for example make it difficult to identify a sufficient number of rank invariant features for the \Rfunction{norm.rankinvariant} and \Rfunction{scale.rankinvariant} methods.

Whether to perform filtering, and if so during which step of the analysis, depends on the genes and biological samples being analysed, as well as the quality of the data. It's therefore advisable to perform a detailed quality assessment and data comparison, as mentioned in the next section.

%%%%%%%%%%%%%%%% QC %%%%%%%%%%%%%%%%%

\newpage

\section{Quality assessment}

\subsection{Correlation between samples}
The overall correlation between different samples can be displayed visually, such as shown for the raw data in Figure~\ref{fig:cor}. Per default, 1 minus the correlation is plotted.

<>=
plotCtCor(raw, main="Ct correlation")
@

\begin{figure}
\begin{center}
<>=
<>
@
\end{center}
\caption{Correlation between raw Ct values.}
\label{fig:cor}
\end{figure}

\subsection{Distribution of Ct values}
It may be of interest to examine the general distribution of data both before and after normalisation. A simple summary of the data can be obtained using \Rfunction{summary} as shown below.

<>=
summary(raw)
@

However, figures are often more informative. To that end, the range of Ct values can be illustrated using histograms or with the density distribution, as shown in Figure~\ref{fig:density.box}.

<>=
plotCtDensity(sr.norm)
@
<>=
plotCtHistogram(sr.norm)
@
%<>=
%plotCtBoxes(r.norm, stratify=NULL)
%@

\begin{figure}
\begin{center}
<>=
par(mfrow=c(1,2), mar=c(3,3,2,1))
<>
<>
@
\end{center}
\caption{Distribution of Ct values for the individual samples, either using the density of all arrays (left) or a histogram of a single sample (right), after scale rank-invariant normalisation.}
\label{fig:density.box}
\end{figure}

Plotting the densities of the different normalisation methods lends insight into how they differ (Figure~\ref{fig:all.density}).
\begin{figure}
\begin{center}
<>=
par(mfrow=c(3,2), mar=c(2,2,2,1))
plotCtDensity(qPCRraw, main="Raw Ct values")
plotCtDensity(q.norm, main="quantile")
plotCtDensity(sr.norm, main="scale.rankinvariant")
plotCtDensity(nr.norm, main="norm.rankinvariant")
plotCtDensity(d.norm, main="deltaCt")
plotCtDensity(g.norm, main="geometric.mean")
@
\end{center}
\caption{Densities of Ct values for all samples before and after each of the normalisation methods. The peak at the high end originates from features with ``Undetermined'' Ct values, which are assigned the Ct value 40 by default.}
\label{fig:all.density}
\end{figure}

Ct values can also be displayed in boxplots, either with one box per sample or stratified by different attributes of the features, such as \Rfunction{featureClass} or \Rfunction{featureType} (Fig.~\ref{fig:strat.box}).

<>=
plotCtBoxes(sr.norm, stratify="class")
@

\begin{figure}
\begin{center}
<>=
<>
@
\end{center}
\caption{Boxplot of Ct values across all samples, stratified by feature classes.}
\label{fig:strat.box}
\end{figure}

\subsection{Comparison of Ct values for two samples}
It is often of interest to directly compare Ct values between two samples. In Figure~\ref{fig:scatter}, two examples are shown for the rank-invariant normalised data: one for different biological samples, and one for replicates.

<>=
plotCtScatter(sr.norm, cards=c(1,2), col="type", diag=TRUE)
@
<>=
plotCtScatter(sr.norm, cards=c(1,4), col="class", diag=TRUE)
@

\begin{figure}
\begin{center}
<>=
par(mfrow=c(1,2), mar=c(3,3,2,1))
<>
<>
@
\end{center}
\caption{Scatter plot of Ct values in different samples, with points marked either by featureType (left) or featureClass (right) and the diagonal through $x=y$ marked with a grey line.}
\label{fig:scatter}
\end{figure}

\subsection{Scatter across all samples}
It is also possible to generate a scatterplot of Ct values between more than the two samples shown above. In Figure~\ref{fig:scatter.all} all pairwise comparisons are shown, along with their correlation when all Ct values $>$35 are removed.

<>=
plotCtPairs(sr.norm, col="type", diag=TRUE)
@

\begin{figure}
\begin{center}
<>=
<>
@
\end{center}
\caption{Scatterplot for all pairwise comparisons between samples, with spots marked depending on \Rfunction{featureType}, i.e.~whether they represent endogenous controls or targets.}
\label{fig:scatter.all}
\end{figure}

\subsection{Ct heatmaps}
Heatmaps provide a convenient way to visualise clustering of features and samples at the same time, and show the levels of Ct values (Figure~\ref{fig:heatmap}). The heatmaps can be based on either Pearson correlation coefficients or Euclidean distance clustering. Euclidean-based heatmaps will focus on the magnitude of Ct values, whereas Pearson clusters the samples based on similarities between the Ct profiles.

<>=
plotCtHeatmap(raw, gene.names="", dist="euclidean")
@

\begin{figure}
\begin{center}
<>=
<>
@
\end{center}
\caption{Heatmap for all samples and genes, based on the Euclidean distance between Ct values.}
\label{fig:heatmap}
\end{figure}

\subsection{Coefficients of variation}
The coefficients of variation (CV) can be calculated for each feature across all samples. Stratifying the CV values by \Rfunction{featureType} or \Rfunction{featureClass} can help to determine whether one class of features is more variable than another (Figure~\ref{fig:CV}).
For the example data, feature classes have been assigned randomly, and the CVs are therefore similar, whereas for the feature types there's a clear difference between controls and targets.

<>=
plotCVBoxes(qPCRraw, stratify="class")
plotCVBoxes(qPCRraw, stratify="type")
@

\begin{figure}
\begin{center}
<>=
par(mfrow=c(1,2), mar=c(2,2,2,1))
plotCVBoxes(qPCRraw, stratify="class")
plotCVBoxes(qPCRraw, stratify="type")
@
\end{center}
\caption{Coefficients of variation for each feature across all samples.}
\label{fig:CV}
\end{figure}

%%%%%%%%%%%%%%%% CLUSTERING %%%%%%%%%%%%%%%%%

\section{Clustering}
At the moment there are two default methods present in \Rpackage{HTqPCR} for clustering: hierarchical clustering and principal components analysis (PCA).

\subsection{Hierarchical clustering}
Both features and samples can be subjected to hierarchical clustering using either Euclidean or Pearson correlation distances, to display similarities and differences within groups of data. Individual subclusters can be selected, either using pre-defined criteria such as number of clusters, or interactively by the user. The content of each cluster is then saved to a list, to allow these features to be extracted from the full data set if desired. An example of a clustering of samples is shown in Figure~\ref{fig:cluster1}. In Figure~\ref{fig:cluster2} these data are clustered by features, and the main subclusters are marked.

<>=
clusterCt(sr.norm, type="samples")
@

\begin{figure}
\begin{center}
<>=
<>
@
\end{center}
\caption{Hierarchical clustering of samples.}
\label{fig:cluster1}
\end{figure}

<>=
cluster.list <- clusterCt(sr.norm, type="genes", n.cluster=6, cex=0.5)
@
%<>=
%c6 <- cluster.list[[6]]
%print(c6)
%show(sr.norm[c6,])
%@

\begin{figure}
\begin{center}
<>=
<>
@
\end{center}
\caption{Hierarchical clustering of features, with subclusters marked.}
\label{fig:cluster2}
\end{figure}

\subsection{Principal components analysis}
PCA is performed across the selected features and samples (observations and variables), and can be visualized either in a biplot, or showing just the clustering of the samples (Figure~\ref{fig:PCA}).

<>=
plotCtPCA(qPCRraw)
plotCtPCA(qPCRraw, features=FALSE)
@

\begin{figure}
\begin{center}
<>=
par(mfrow=c(1,2), mar=c(2,2,2,1))
plotCtPCA(qPCRraw)
plotCtPCA(qPCRraw, features=FALSE)
@
\end{center}
\caption{Left: a biplot including all features, with samples represented by vectors. Right: the same plot, including only the samples.}
\label{fig:PCA}
\end{figure}

%%%%%%%%%%%%%%%% DE TESTING %%%%%%%%%%%%%%%%%

\newpage

\section{Differential expression}
At this stage multiple filtering steps might have been performed on the data set(s). To remind yourself about those, you can use the \Rfunction{getCtHistory} function on the \Rclass{qPCRset} object.

%<>=
%getCtHistory(sr.norm)
%getCtHistory(qFilt)
%@

<>=
getCtHistory(sr.norm)
@

In general there are three approaches in \Rpackage{HTqPCR} for testing the significance of differences in Ct values between samples.

\begin{description}
\item[t-test] Performing a standard t-test between two sample groups. This function will incorporate information about replicates to calculate t and p-values. This is a fairly simple approach, typically used when comparing a single treatment and control sample, and multiple pair-wise tests can be carried out one-by-one by the user.
\item[Mann-Whitney] This is a non-parametric test, also known as a two sample Wilcoxon test.
Similar to the t-test, multiple pair-wise tests will have to be carried out one by one if more than two types of samples are present. This is a rank-based test that doesn't make any assumptions about the population distribution of Ct values.
\item[\Rpackage{limma}] Using a wrapper for functions from the package \Rpackage{limma}~\citep{ref:limma} to calculate more sophisticated t and p-values for any number of groups present across the experiment. This is more flexible in terms of what types of comparisons can be made, but the users need to familiarise themselves with the \Rpackage{limma} conventions for specifying what contrasts are of interest.
\end{description}

Examples of how to use each of these are given in the next sections. In all cases the output is similar: a data frame containing the test statistics for each feature, along with fold change and information about whether the Ct values are ``OK'' or ``Unreliable''. This result can be written to a file using standard functions such as \Rfunction{write.table}.

\subsection{Two sample types - t-test}
This section shows how to compare two samples, e.g.~the control and long starvation samples from the example data. A subset of the \Robject{qPCRset} data is created to encompass only these samples, and a t-test is then performed. See '?ttestCtData' for examples.

%<>=
%qDE.ttest <- ttestCtData(sr.norm[,1:4], groups=files$Treatment[1:4], calibrator="Control")
%head(qDE.ttest)
%@

\subsection{Two sample types - Mann-Whitney}
When only two samples are of interest, these can also be compared using a Mann-Whitney test. As in the section above, the function is demonstrated using the control and long starvation samples from the example data. A subset of the \Robject{qPCRset} data is created to encompass only these samples, and a Mann-Whitney test is performed. See '?mannwhitneyCtData' for examples.

%<>=
%qDE.mwtest <- mannwhitneyCtData(sr.norm[,1:4], groups=files$Treatment[1:4], calibrator="Control")
%head(qDE.mwtest)
%@

\subsection{Multiple sample types - limma}
Methods taken from \Rpackage{limma} can be used to compare either two or multiple sample types. In this example all three types of treatment are compared, as well as the control against both starvation samples combined. The data is sorted by feature names, to make it easier to handle replicated features. See '?limmaCtData' for examples.

%<>=
%# Preparing experiment design
%design <- model.matrix(~0+files$Treatment)
%colnames(design) <- c("Control", "LongStarve", "Starve")
%print(design)
%contrasts <- makeContrasts(LongStarve-Control, LongStarve-Starve, Starve-Control, (Starve+LongStarve)/2-Control, levels=design)
%colnames(contrasts) <- c("LS-C", "LS-S", "S-C", "bothS-C")
%print(contrasts)
%# Reorder data to get the genes in consecutive rows
%sr.norm2 <- sr.norm[order(featureNames(sr.norm)),]
%qDE.limma <- limmaCtData(sr.norm2, design=design, contrasts=contrasts, ndups=2, spacing=1)
%@

The result is a list with one component per comparison. Each component is similar to the result from using \Rfunction{ttestCtData}.

%<>=
%class(qDE.limma)
%names(qDE.limma)
%head(qDE.limma[["LS-C"]])
%@

Furthermore, there is a ``Summary'' component at the end, where each feature is denoted with -1, 0 or 1 to indicate down-regulation, no change, or up-regulation, respectively, in each of the comparisons.
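As a sketch of how this summary might be used (not evaluated here, since the \Robject{qDE.limma} object is only constructed in the commented example above), the features called differentially expressed in a given comparison can be extracted as follows; the contrast name ``LS-C'' is taken from that commented code.

<<deSummarySketch, eval=FALSE>>=
## Not run: inspect the calls in the "Summary" component of the limmaCtData result
de.calls <- qDE.limma[["Summary"]]
head(de.calls)
## Features called up- or down-regulated (1 or -1) in the "LS-C" comparison
rownames(de.calls)[de.calls[, "LS-C"] != 0]
@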
%<>= %qDE.limma[["Summary"]][21:30,] %@ %%%%%%%%%%%%%%%% SHOWING DE RESULTS %%%%%%%%%%%%%%%%% \newpage \section{Displaying the results} The results can be visualised using the generic \Rfunction{plotCtOverview} shown in Figure~\ref{fig:overview}. However, \Rpackage{HTqPCR} also contains more specialised functions, for example to include information about whether differences are significant or not. \subsection{Fold changes: Relative quantification} The relative Ct levels between two groups can be plotted with the function \Rfunction{plotCtRQ}. Below are two examples: one the result of \Rfunction{ttestCtData} where the top 15 genes are selected %(figure~\ref{fig:rel.quant1}), and another from the first comparison in \Rfunction{limmaCtData} where all genes below a certain p-value are depicted. % (Figure~\ref{fig:rel.quant2}). %<>= %plotCtRQ(qDE.ttest, genes=1:15) %@ % %<>= %plotCtRQ(qDE.limma, p.val=0.085, transform="log10", col="#9E0142") %@ % %The hatching on the bars indicates whether the target and calibrator Ct samples were unreliable, but this also depend on whether the parameter \Rfunction{stringent=TRUE} or \Rfunction{stringent=FALSE} when testing for differential expression. See the help functions for details (\Rfunction{?ttestCtData} and \Rfunction{?limmaCtData}). % %\begin{figure} %\begin{center} %<>= %<> %@ %\end{center} %\caption{Relative quantification, using the top 15 features from ttestCtData.} %\label{fig:rel.quant1} %\end{figure} % %\begin{figure} %\begin{center} %<>= %<> %@ %\end{center} %\caption{Relative quantification, using all features with p-value$<$0.085 from limmaCtData.} %\label{fig:rel.quant2} %\end{figure} \subsection{Fold changes: Detailed visualisation} In some cases it will be beneficial to more closely examine individual Ct data points from the fold changes, partly to look at the data dispersion, and partly to determine which of these values are in the ``OK'' versus ``Unreliable''/``Undetermined'' category. The function \Rfunction{plotCtSignificance} will take the result of \Rfunction{ttestCtData} or \Rfunction{limmaCtData}, along with the input data to these functions, and display a combined barplot showing the individual data points and marking those comparisons with a significant p-value. %<>= %plotCtSignificance(qDE.limma, q=sr.norm, %groups=files$Treatment, target="LongStarve", %calibrator="Control", genes=featureNames(sr.norm)[11:20], un.col="#3288BD", jitter=0.2) %@ % %\begin{figure} %\begin{center} %<>= %<> %@ %\end{center} %\caption{Ten genes from the data set, with the average Ct in two groups plotted along with the individual values. Points marked in blue are ``Unreliable'' or ``Undetermined'' whereas grey spots are ``OK''.} %\label{fig:Ct.significance} %\end{figure} \subsection{Heatmap across comparisons} When multiple conditions are compared with \Rfunction{limmaCtData}, the fold changes from all comparisons can be compared to see if features cluster together in groups. % (Figure~\ref{fig:heatmapsig}). 
%<>=
%heatmapSig(qDE.limma, dist="euclidean")
%@
%
%\begin{figure}
%\begin{center}
%<>=
%<>
%@
%\end{center}
%\caption{Fold changes across all comparisons, clustered based on Euclidean distance.}
%\label{fig:heatmapsig}
%\end{figure}

%%%%%%%%%%%%%%%% "REFORMATTING" qPCRsets %%%%%%%%%%%%%%%%%

\newpage

\section{Manipulating \Robject{qPCRset} objects}
\label{SEC:objectmanipulation}

Depending on the design of the qPCR card and how the data is to be analysed, it will sometimes be necessary to manipulate the \Robject{qPCRset} objects in different ways, such as by combining several objects or altering the layout of features $\times$ samples.

\subsection{Multiple samples present on each plate}
The results from each qPCR run of a given card are typically presented together, such as in a file with 384 lines, one per feature, for 384-well plates. However, some cards may contain multiple samples, such as commercial cards that are designed to be loaded with two separate samples and then include 192 individual features.

Per default, \Rfunction{readCtData} reads each card into a \Rclass{qPCRset} object as consisting of a single sample, and hence one column in the Ct data matrix. When this is not the case, the data can subsequently be split into the correct features $\times$ samples (rows $\times$ columns) dimensions using the function \Rfunction{changeCtLayout}. The parameter \Rfunction{sample.order} is a vector that, for each feature in the \Rclass{qPCRset}, indicates which sample it actually belongs to, and reformats all the information in the \Rclass{qPCRset} (\Rfunction{exprs}, \Rfunction{featureCategory}, \Rfunction{flag} etc.) accordingly. The actual biological samples are likely to differ on each card, so \Rfunction{sample.order} merely indicates the \textit{location} of different samples among the features present in the input data.

<>=
# Example with 2 or 4 samples per 384 well card.
sample2.order <- rep(c("subSampleA", "subSampleB"), each=192)
sample4.order <- rep(c("subA", "subB", "subC", "subD"), each=96)
# Splitting the data into all individual samples
qPCRnew2 <- changeCtLayout(sr.norm, sample.order=sample2.order)
show(qPCRnew2)
qPCRnew4 <- changeCtLayout(sr.norm, sample.order=sample4.order)
show(qPCRnew4)
@

As with the other functions that manipulate the Ct data, the operation is stored in the \Rfunction{history} slot of the \Rclass{qPCRset} for future reference.

<>=
getCtHistory(qPCRnew4)
@

\subsection{Combining multiple \Robject{qPCRset} objects}
In some cases it might be desirable to merge multiple \Robject{qPCRset} objects that have been read into R or processed individually. The \Rpackage{HTqPCR} package contains two functions for combining multiple \Robject{qPCRset} objects into one, by either adding columns (samples) or rows (features). This can be done either when identical samples have been analysed across multiple different cards (such as a 384 well plate), or when more samples have been run on cards with the same layout.

\begin{description}
\item[\Rfunction{cbind}] combines data assuming that all experiments have been carried out on identical cards, i.e.~that \Robject{featureNames, featureType, featurePos} and \Robject{featureClass} are identical across all the qPCRset objects. The number of features in each object must be identical, but the number of samples can vary.
\item[\Rfunction{rbind}] combines data assuming that the same samples have been analysed using different qPCR cards. The number of samples in each object must be identical, but the number of features can vary.
\end{description}

Both these functions should be used with some care; consider e.g.~whether to normalize before or after joining the samples, and what method to use. In the examples here objects with different normalisation are combined, although in a real study the \Robject{qPCRset} objects would typically contain different data.

%<>=
%q.comb <- cbind(q.norm[,1:3], sr.norm[,4], nr.norm[,c(1,5,6)])
%q.comb
%q.comb2 <- rbind(q.norm, sr.norm[1:4,], nr.norm)
%q.comb2
%@

As with other functions where the \Robject{qPCRset} object is being manipulated, the information is stored in the \Robject{history} slot.

%<>=
%getCtHistory(q.comb)
%@

%%%%%%%%%%%%%%%% INFO ABOUT DATA FORMATS %%%%%%%%%%%%%%%%%

\newpage

\section{How to handle different input data}
\label{SEC:formats}

General information about how to read qPCR data into a \Rclass{qPCRset} object is presented in section~\ref{SEC:datainput}. Below, some more specific cases are illustrated. Functions specifically designed for such qPCR platforms are still at a test stage in \Rpackage{HTqPCR}, and will be expanded (/corrected/modified) depending on demand.

\subsection{Sequence Detection Systems format}
The qPCR data might come from Sequence Detection Systems (SDS) software. This is supplied with most instruments from Applied Biosystems, but the software can also be used for assays from other vendors, such as the miRCURY LNA Universal RT microRNA PCR system from Exiqon. For SDS output, each file has a header containing some generic information about the initial Ct detection. This header varies in length depending on how many files were analysed simultaneously, and an example is shown below.

<>=
path <- system.file("exData", package="HTqPCR")
cat(paste(readLines(file.path(path, "SDS_sample.txt"), n=19), "\n"))
@

Only the first 7 columns are shown, since the file shown here contains $>$30 columns (of which many are empty). All columns for the first 20 lines can be seen in an R terminal with the command:

<>=
readLines(file.path(path, "SDS_sample.txt"), n=20)
@

For these files the parameter \Rcode{format="SDS"} can be set in \Rfunction{readCtData}. The first 100 lines of each file will be scanned, and all lines preceding the actual data will be skipped (in this case 17), even when the length of the header varies between files.

<>=
path <- system.file("exData", package = "HTqPCR")
raw <- readCtData(files="SDS_sample.txt", path=path, format="SDS")
show(raw)
@

\subsection{LightCycler format}
Some qPCR systems, such as the LightCycler from Roche, don't provide the results as Ct values, but instead as crossing points (Cp). Ct values are measured at the exponential phase of amplification by drawing a line parallel to the x-axis of the real-time fluorescence intensity curve (fit point method), whereas Cp (second derivative method) calculates the fractional cycle where the second derivative of the real-time fluorescence intensity curve reaches the maximum value~\citep{Luu-The:2005}. As long as all samples and features are quantified using the same method, it shouldn't matter whether Ct or Cp values are being used to test for significant differences between samples. The analysis of e.g.~LightCycler Cp data can therefore proceed as outlined in this vignette. One thing to bear in mind, though, is that for Ct values a value of 40 generally means NA, and values above 35 are considered unreliable, whereas these numbers will be different for Cp.
When using the filtering methods to set results as being ``OK'', ``Unreliable'' and ``Undetermined'', the parameters in e.g.~\Rfunction{setCategory(..., Ct.max = 35, Ct.min = 10, ...)} and \Rfunction{readCtData(..., na.value, ...)} will need to be adjusted accordingly.

\Rpackage{HTqPCR} contains example data from the LightCycler 480 Real-Time PCR System; however, not all wells were used during that particular experiment.

<>=
path <- system.file("exData", package = "HTqPCR")
raw <- readCtData(files="LightCycler_sample.txt", path=path, format="LightCycler")
show(raw)
@

\subsection{CFX format}
The CFX Connect Real-Time PCR Detection System from Bio-Rad Laboratories is another qPCR system based on microtitre plates, as are the CFX96 and CFX384 Touch. The output values from the software are ``Cq'', the quantification cycle values. This is the cycle number where the fluorescence increases above a given threshold, i.e.~equivalent to Ct values. The example file included in \Rpackage{HTqPCR} contains a number of empty wells. Per default, these are excluded from the output file, and hence \Rcode{n.features} has to be set accordingly.

<>=
path <- system.file("exData", package = "HTqPCR")
raw <- readCtData(files="CFX_sample.txt", path=path, format="CFX", n.features=330)
show(raw)
@

The file is expected to contain comma-separated values, rather than tab-separated, although the file ending can be either .txt or .csv. In some (mainly older) versions of CFX files, ``,'' is the character used for decimal points rather than ``.'', so the Cq values can be for example ``25,3'' instead of ``25.3''. If this is the case, the parameter \Rcode{dec=","} needs to be added to \Rcode{readCtData}.

\subsection{BioMark format}
In addition to multi-well microtitre plates, other assay formats have also emerged for performing high-throughput qPCR analysis. These include for example the 48.48 and 96.96 BioMark HD System from Fluidigm Corporation. For the 48.48 assay, 48 individual samples are loaded onto a plate along with qPCR primers for 48 genes. Using microfluidic channels, all possible sample $\times$ primer reactions are assayed in a combinatorial manner using 2,304 individual reaction chambers, to generate e.g.~48 real-time curves for each of 48 samples. Results are then reported in a single file with 2304 rows of data. This file completely determines the order in which the samples are being read in, i.e.~from row 1 onwards, regardless of how the samples are usually loaded onto each specific assay type.

\Rpackage{HTqPCR} includes a comma-separated file containing example data from a BioMark 48.48 array. The data can be read in two ways. Setting \Rcode{n.features=2304} will read in all the information and create a \Robject{qPCRset} object with dimensions 2304x1. Setting \Rcode{n.data=48} and \Rcode{n.features=48} will however automatically convert this into a 48x48 \Robject{qPCRset}. The latter is typically what will be required for doing statistical tests of differences between samples. However, in some cases when results are combined across multiple arrays it may be advantageous to keep these arrays as separate columns in the \Robject{qPCRset} object initially, in case it's necessary to perform an array-based normalisation.
<>=
exPath <- system.file("exData", package="HTqPCR")
raw1 <- readCtData(files="BioMark_sample.csv", path=exPath, format="BioMark", n.features=48, n.data=48)
dim(raw1)
raw2 <- readCtData(files="BioMark_sample.csv", path=exPath, format="BioMark", n.features=48*48, n.data=1)
dim(raw2)
@

If array-specific effects are likely to be present, a useful normalisation strategy might be to combine the data into a \Robject{qPCRset} object containing one column for each array, and 48x48 or 96x96 samples. That way overall differences between the arrays can be removed, and afterwards the data can be split with \Rfunction{changeCtLayout} to generate one column per individual sample rather than one column per array.

The column ``Call'' in the sample file contains information about the result of the qPCR reaction. Per default, a call of ``Pass'' is translated into ``OK'' in the \Rcode{featureCategory}, and ``Fail'' into ``Undetermined''.

The assay readouts are Ct values, which can be analysed using \Rpackage{HTqPCR} in a similar fashion to Ct values from other technologies, both regarding the statistical analysis and data visualisation. The function \Rfunction{plotCtArray} can be used for displaying the layout of the qPCR results instead of \Rfunction{plotCtCard} (Figure~\ref{fig:fluidigm}).

<>=
plotCtArray(raw1)
@

\begin{figure}
\begin{center}
<>=
<>
@
\end{center}
\caption{Ct values for a test Fluidigm 48.48 Array. 48 samples are loaded into rows, and 48 PCR primers into columns, resulting in 2,304 combinatorial qPCR reactions. Four individual genes were added in 12 replicates each into the 48 columns to assess technical variability. Grey corresponds to ``NA''.}
\label{fig:fluidigm}
\end{figure}

\subsection{OpenArray format}
Like the BioMark output from Fluidigm, files from the OpenArray Real-Time PCR System from Applied Biosystems contain multiple samples per plate, currently up to 48. As mentioned in the section on BioMark, this can be used either in a \Robject{qPCRset} object with one column per plate, or with one column per sample. A comma-separated example file is included in \Rpackage{HTqPCR}, where a plate with 5076 qPCR reactions contains 846 features measured across 6 separate samples. This file completely determines the order in which the samples are being read in, i.e.~from row 1 onwards, regardless of how the samples are usually loaded onto each specific assay type.

<>=
exPath <- system.file("exData", package="HTqPCR")
raw1 <- readCtData(files="OpenArray_sample.csv", path=exPath, format="OpenArray", n.features=846, n.data=6)
dim(raw1)
raw2 <- readCtData(files="OpenArray_sample.csv", path=exPath, format="OpenArray", n.features=846*6, n.data=1)
dim(raw2)
@

The column ``ThroughHole.Outlier'' in the sample file indicates the quality of the qPCR measurement. Per default, if the Ct value is an outlier it is translated into having \Rcode{featureCategory} ``Unreliable'', otherwise it's ``OK''.

\subsection{Additional devices}
More assays for high-throughput real-time qPCR systems are constantly emerging. In case of systems based on standard microtitre plates, these can still be imported by \Rpackage{HTqPCR}, even if they are not specifically included in \Rcode{readCtData}. The results need to be converted into tab-separated text, and read using \Rcode{format="plain"}. Some assays are more esoteric, or contain multiple samples on each plate, which need to be re-formatted into a matrix in \Rcode{R}. Below, a BioMark file from Fluidigm is used as an example of how to do this.
This file can be read automatically by setting \Rcode{format="BioMark"}, but here it is used to illustrate how the same data can be imported manually and transformed into a \Robject{qPCRset} object in one of two ways.

First, the \Robject{qPCRset} can be constructed indirectly, by reading the data into a data frame, creating a 48x48 matrix manually, and generating a new \Robject{qPCRset}. The file was inspected first, and turned out to have an 11-line header plus a column header line, which have to be skipped when reading the data into \Rpackage{HTqPCR}.

%<>=
%# Get example data
%exPath <- system.file("exData", package="HTqPCR")
%exFiles <- "BioMark_sample.csv"
%# Reading data into a data frame
%temp <- read.delim(file.path(exPath, exFiles), skip=11, sep=",", colClasses="character")
%n <- 48
%# Turn into matrix
%mat <- matrix(as.numeric(temp$Value), ncol=n, nrow=n, byrow=FALSE)
%mat[mat>40] <- NA
%# Create qPCRset
%raw <- new("qPCRset", exprs=mat, featureCategory=as.data.frame(array("OK", c(n,n))))
%sampleNames(raw) <- paste("S", 1:n, sep="")
%featureNames(raw) <- paste("A", 1:n, sep="")
%@

Alternatively, the data can be read directly into a \Robject{qPCRset} in a 2304x1 format using \Rfunction{readCtData}, and then re-formatted into 48x48 using \Rfunction{changeCtLayout}. This will automatically read in additional information such as feature names, positions and categories (Failed/Passed) if available.

%<>=
%# Create qPCRset object
%temp <- readCtData(exFiles, path=exPath, n.features=48*48, column.info=list(flag=9, feature=5, type=6, Ct=7, position=1), skip=12, sep=",")
%# Re-format from 1x2304 samples in input file into 48x48 as on array
%raw <- changeCtLayout(temp, sample.order=rep(1:48, each=48))
%@

Information from multiple microfluidic devices can then subsequently be combined using \Rfunction{cbind} and \Rfunction{rbind}.

%%%%%%%%%%%%%%%% FINISHING %%%%%%%%%%%%%%%%%

\newpage

\section{Concluding remarks}
Use the function \Rfunction{news} to see what bug fixes and updates have been incorporated into \Rpackage{HTqPCR}. For example:

<>=
news(Version>1.7, package="HTqPCR")
@

This vignette was generated using:

<>=
toLatex(sessionInfo())
@

%%%%%%%%%%%%%%%% BIBLIOGRAPHY %%%%%%%%%%%%%%%%%

\newpage

\bibliographystyle{abbrvnat}
\bibliography{HTqPCR-Bibliography}

\end{document}