\documentclass[a4paper]{article}

%\VignetteIndexEntry{iBBiG User Manual}
%\VignettePackage{biclust}
%\VignetteDepends{ade4}
%\VignetteDepends{xtable}
%\VignetteDepends{stats4}

\usepackage{Sweave}
\usepackage{amsmath}
\usepackage{times}
\usepackage{hyperref}
\usepackage[numbers]{natbib}

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{\textit{#1}}
\newcommand{\Rpackage}[1]{\textit{#1}}

\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\title{Introduction to iBBiG}
\author{Aedin Culhane, Daniel Gusenleitner}

\begin{document}
\maketitle

<<>>=
oldopt <- options(digits=3)
options(width=60)
on.exit( {options(oldopt)} )
library(iBBiG)
@

\section{iBBiG}

Iterative Binary Bi-clustering of Gene sets (iBBiG) is a bi-clustering algorithm optimized for the discovery of overlapping biclusters in sparse binary matrices (Gusenleitner \emph{et al.}, in review). We have optimized this method for the discovery of modules in matrices of discretized \emph{p}-values from gene set enrichment analysis (GSA) of hundreds of datasets. However, it can be applied to any binary (1,0) matrix, for example discretized \emph{p}-values from other sources.

We apply iBBiG to meta-GSA to enable integrated analysis over hundreds of gene expression datasets. By integrating data at the level of GSA results, we avoid the need to match probes or genes across multiple datasets, which makes large-scale data integration a tractable problem.

iBBiG scales well with the dimensions of meta-datasets and is tolerant of the noise characteristic of genomic data. When applied to simulated data, it outperformed traditional clustering approaches (hierarchical clustering, k-means) and other biclustering methods (bimax, fabia, coalesce).
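
As a minimal sketch of what ``discretized \emph{p}-values'' means here, the chunk below thresholds a small matrix of \emph{p}-values at $0.05$ to obtain a binary (1,0) matrix. The \emph{p}-values are random numbers generated purely for illustration, and the names \texttt{pvals} and \texttt{pBin} are hypothetical; only base R is used, and the chunk is not evaluated so that it does not affect the rest of this vignette.

<<eval=FALSE>>=
set.seed(42)  # arbitrary seed, for reproducibility only

# Hypothetical p-values: 5 gene sets (rows) x 4 pairwise tests (columns)
pvals <- matrix(runif(20), nrow = 5,
                dimnames = list(paste0("geneSet", 1:5),
                                paste0("test", 1:4)))

# 1 = association (p < 0.05), 0 = no association
pBin <- (pvals < 0.05) * 1
pBin
@

A numeric 0/1 matrix of this form, with gene sets in rows and tests in columns, is the kind of input the rest of this manual works with.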
\section{Application to simulated dataset}

To demonstrate iBBiG, we will use a simulated binary dataset of 400 rows $\times$ 400 columns (as described by Gusenleitner \emph{et al.}), in which a $1$ indicates a positive association (or $p < 0.05$) between a gene set (row) and the results of a pairwise test between clinical covariates (column), and a $0$ represents a lack of association. To simulate the random noise characteristically observed in genomic data, 10\% random background noise (value of $1$) was introduced into the matrix.

The matrix was seeded with seven artificial modules or bi-clusters (M1--M7; Figure 1) by assigning associations (value of $1$) to their column and row pairs. To replicate the expected properties of real data, the seeded modules partially overlap in columns, in rows, and in both rows and columns simultaneously. The gene sets of M1 overlap with those of most other modules, with the exception of M3. M2 shares pairwise tests with modules M4--M7. The artificial modules also vary widely in size and aspect ratio. They include ``wide'' modules, driven by a large number of pairwise tests and only a few gene sets, and ``tall'' modules such as M1, which consists of 25 pairwise tests and a large number of gene sets ($n=250$). This latter type of module might represent a complex, well-characterized biological process such as proliferation.

In a real dataset the signal strength will vary both between and within modules. Variance between modules was simulated by imposing random noise ($1 \rightarrow 0$ replacement) with a different strength on each module (Figure 1). Within a module, we expect to see a few strong signals (gene sets associated with all pairwise tests) and many weaker signals. Therefore, within each module, a noise gradient was also applied so that the first gene sets had the greatest number of associations (Figure 1). This overlaid noise gradient ranged from 10 to 60\% and varied between modules (Table 1).
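
The construction described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not the code used by the package: it seeds a single module shaped like M1, the position of that module in the top-left corner and the linear form of the noise gradient are assumptions, and all variable names are hypothetical. The chunk is not evaluated, so it does not alter the random number stream used later in this vignette.

<<eval=FALSE>>=
set.seed(1)  # arbitrary seed, for reproducibility only
nRow <- 400
nCol <- 400

# 10% random background noise: each cell is 1 with probability 0.1
sim <- matrix(rbinom(nRow * nCol, size = 1, prob = 0.1),
              nrow = nRow, ncol = nCol)

# Seed one "tall" module shaped like M1:
# 250 gene sets (rows) x 25 pairwise tests (columns).
# Placing it in the top-left corner is an assumption.
modRows <- 1:250
modCols <- 1:25
sim[modRows, modCols] <- 1

# Noise gradient (assumed linear): the probability of a 1 -> 0
# replacement rises from 10% for the first gene set to 60% for the last.
dropProb <- seq(0.1, 0.6, length.out = length(modRows))
drop <- matrix(runif(length(modRows) * length(modCols)),
               nrow = length(modRows)) < dropProb  # recycled row by row
sim[modRows, modCols][drop] <- 0

# Fraction of associations kept in the first and last gene set
rowMeans(sim[modRows, modCols])[c(1, 250)]
@

Because the replacement probability grows with the row index, the first gene sets of the module keep most of their associations while the last ones keep fewer, which is the within-module gradient described above.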

To create the simulated data described in Gusenleitner \emph{et al.}, use the function \Rfunction{makeArtificial}, which creates an object of class \Rclass{iBBiG}, an extension of \Rclass{Biclust}.

<<>>=
library(iBBiG)
binMat <- makeArtificial()
binMat
plot(binMat)
@

The class \Rclass{Biclust} contains the number of clusters and two logical matrices which indicate whether a row or column is present in each cluster.

<<>>=
str(binMat)
Number(binMat)
RowxNumber(binMat)[1:2,]
NumberxCol(binMat)[,1:2]
@

\texttt{RowxNumber} is a logical matrix with as many rows as \texttt{binMat} and as many columns as there are detected clusters. \texttt{NumberxCol} is the reverse: it has one row per cluster and as many columns as \texttt{binMat}.

To run iBBiG on this artificial binary matrix, simply call the function \Rfunction{iBBiG}. The functions \Rfunction{plot} and \Rfunction{statClust} provide a visual representation and a statistical summary of the results of the cluster analysis.

<<>>=
res <- iBBiG(binMat@Seeddata, nModules=10)
plot(res)
@

If you wish to compare two \Rclass{iBBiG} or \Rclass{Biclust} results, for example a prediction and a gold standard (GS), the function \Rfunction{JIdist} will calculate the Jaccard Index distance between two \Rclass{Biclust} or \Rclass{iBBiG} result objects. By default, it calculates the distances between the columns of the clusters. Setting \texttt{margin="row"} or \texttt{margin="both"} causes the function to calculate the JI distance between the rows, or an average over rows and columns, instead.

By default, \Rfunction{JIdist} returns a \Rclass{data.frame} with two columns. The column \texttt{n} indicates which cluster was the best match (maximum JI) to each cluster of the second iBBiG object (GS). The column \texttt{JI} contains the Jaccard Index distance between the columns of these two clusters. If \texttt{best=FALSE}, the function returns the full distance matrix instead of the best match.
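
For reference, the Jaccard index between two sets $A$ and $B$ (here, for example, the columns assigned to a predicted module and the columns of a gold standard module) is conventionally defined as
\begin{equation*}
J(A,B) = \frac{|A \cap B|}{|A \cup B|},
\end{equation*}
which equals $1$ for identical sets and $0$ for disjoint sets. As a small worked example, if a predicted module covers columns $A = \{1,2,3,4\}$ and the gold standard module covers columns $B = \{3,4,5,6\}$, then $|A \cap B| = 2$ and $|A \cup B| = 6$, so $J(A,B) = 2/6 \approx 0.33$. This is the standard textbook definition. That \Rfunction{JIdist} reports exactly this quantity is an assumption made here, based on the description of the best match as the cluster with maximum JI.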

<<>>=
JIdist(res, binMat)
JIdist(res, binMat, margin="col", best=FALSE)
JIdist(res, binMat, margin="col")
JIdist(res, binMat, margin="row")
JIdist(res, binMat, margin="both")
@

To view the code of the function \Rfunction{JIdist}:

<<>>=
showMethods(JIdist)
getMethod(iBBiG:::JIdist, signature(clustObj = "iBBiG", GS = "iBBiG"))
getMethod("JIdist", signature(clustObj="iBBiG", GS="iBBiG"))
@

To extract performance statistics between two \Rclass{iBBiG} results, use \Rfunction{analyzeClust}, which takes a single \Rclass{iBBiG} result object or a list of such objects and compares them to a gold standard (another \Rclass{iBBiG} or \Rclass{Biclust} object). Again, results can be based on the best match by row, by column, or by both.

<<>>=
analyzeClust(res, binMat)
analyzeClust(res, binMat, margin="col")
@

Again, to view the code of the function:

<<>>=
showMethods(analyzeClust)
getMethod("analyzeClust", signature(clustObj="iBBiG", GS="iBBiG"))
@

The structure of \Rclass{iBBiG} differs from \Rclass{Biclust} in that it contains additional slots. \texttt{Clusterscores} holds the score of each module, \texttt{RowScorexNumber} holds the score of each row in each cluster, and \texttt{Seeddata} is a copy of the input binary matrix.

<<>>=
str(binMat)
RowScorexNumber(res)[1:2,]
Clusterscores(res)
Seeddata(res)[1:2,1:2]
@

There are also slots for \texttt{info} and \texttt{Parameters}, which can contain additional user-entered information about the analysis.

We can subset or reorder the results like so:

<<>>=
res[1:3]
res[c(4,2,1)]
res[1, drop=FALSE]
@

\section{Using biclust functions}

An object of class \Rclass{iBBiG} extends the class \Rclass{Biclust} and can therefore use the methods available to a \Rclass{Biclust} object.
For example, there are several plot functions in the \Rpackage{biclust} package:

<<>>=
class(res)
par(mfrow=c(2,1))
drawHeatmap2(res@Seeddata, res, number=4)
biclustmember(res, res@Seeddata)
biclustbarchart(res@Seeddata, Bicres=res)
plotclust(res, res@Seeddata)
@

Statistical measures of biclustering performance, including the Chia and Karuturi function, coherence measures and F statistics, are available within the \Rpackage{biclust} R package.

There are also functions to preprocess data, for example to binarize or discretize it. Given gene expression data, we can binarize or discretize the data matrix as follows, and the result can be used as input to iBBiG:

<<>>=
data(BicatYeast)
BicatYeast[1:5,1:5]
binarize(BicatYeast[1:5,1:5], threshold=0.2)
discretize(BicatYeast[1:5,1:5])
@

The sub-matrix of each cluster can be extracted from the original matrix using the function \Rfunction{bicluster}:

<<>>=
Modules <- bicluster(res@Seeddata, res, 1:3)
str(Modules)
Modules[[1]][1:3,1:4]
@

To write the results to a file, use the following:

<<>>=
writeBiclusterResults("Modules.txt", res,
    bicName="Output from iBBiG with default parameters",
    geneNames=rownames(res@Seeddata),
    arrayNames=colnames(res@Seeddata))
@

%------------------------------------------------------------
\section{Session Info}
%------------------------------------------------------------

<<>>=
toLatex(sessionInfo())
@

\appendix
\bibliographystyle{plainnat}
\begin{thebibliography}{0}
\end{thebibliography}

\end{document}