%\VignetteIndexEntry{KEGGgraph: Application Examples} %\VignetteDepends{graph, XML, Rgraphviz, RBGL, org.Hs.eg.db, KEGG.db,hgu133plus2.db} %\VignettePackage{KEGGgraph} \documentclass[11pt]{article} \usepackage{times} \usepackage{hyperref} \usepackage{geometry} \usepackage{longtable} \usepackage[pdftex]{graphicx} \SweaveOpts{keep.source=TRUE,eps=FALSE,pdf=TRUE,prefix=TRUE} % R part \newcommand{\R}[1]{{\textsf{#1}}} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Metas}[1]{{\texttt{#1}}} \begin{document} \setkeys{Gin}{width=0.8\textwidth} \title{KEGGgraph: Application Examples} \author{Jitao David Zhang} \maketitle \begin{abstract} In this vignette, we demonstrate the application of \Rpackage{KEGGgraph} as flexible module in analysis pipelines targeting heterogenous biological questions. For basic use of the \Rpackage{KEGGgraph} package, please refer to the vignette \textit{KEGGgraph: a graph approach to KEGG PATHWAY in R and Bioconductor}. \end{abstract} \section{Introduction} In many cases, \Rpackage{KEGGgraph} is used as stand-alone software package to parse KEGG pathway \cite{Kanehisa08} into graph models and consequently to analyze them. However, \Rpackage{KEGGgraph} can also be combined with other tools (within or out of the scope of \R{R} and \R{Bioconductor} \cite{Gentleman04, Carey05}), to build flexible analysis pipelines. In this vignette we demonstrate the cooperation between \Rpackage{KEGGgraph} and other tools with several examples. \section{Parse created or edited pathway} KGML-ED \cite{Klukas07} is a tool designed for the dynamic visualization, interactive navigation and editing of KEGG pathway diagrams. The program is able to modify or even create \textit{de novo} new pathways and export them into KGML (KEGG XML) files. Since \Rpackage{KEGGgraph} captures the graph information of KEGG PATHWAY by parsing KGML files, it is also capable to parse the KGML files exported from KGML-ED program. The joint use of the two programs allows user edit KGML files from KEGG (for example, by adding/removing new nodes/edges), or create new pathways from scratch, and then analyse them. Here we demontrate the parse of a toy pathway (network motif) created by KGML-ED, which illustrates a feed-forward loop (FFL) \cite{Mangan2003, Alon2007}. <>= library(KEGGgraph) toyKGML <- system.file("extdata/kgml-ed-toy.xml", package="KEGGgraph") toyGraph <- parseKGML2Graph(toyKGML, genesOnly=FALSE) toyGraph nodes(toyGraph) @ We visualize the graph with \Rpackage{Rgraphviz} in Figure \ref{fig:toyNetworkVis}. \begin{figure}[!htbp] \centering <>= library(Rgraphviz) nodeInfo <- getKEGGnodeData(toyGraph) nodeType <- sapply(nodeInfo, getType) makeNodeRenderAttrs <- function(g, label=nodes(g), shape="ellipse", fill="#e0e0e0",...) { rv <- list(label=label, shape=shape, fill=fill, ...) nA <- nodeRenderInfo(g) for(i in seq(along=rv)) { if (length(rv[[i]]) == 1) { rv[[i]] <- rep(rv[[i]], numNodes(g)) } else { if (length(rv[[i]]) != numNodes(g)) stop("Attribute vector must have as many elements as 'g' has nodes.") } names(rv[[i]]) <- nodes(g) nA[[ names(rv)[[i]] ]] <- rv[[i]] } nodeRenderInfo(g) <- nA return(g) } toyDrawn <- plotKEGGgraph(toyGraph) toyDrawnRefine <- makeNodeRenderAttrs(toyDrawn, fill=ifelse(nodeType=="gene", "lightblue", "orange"), shape=ifelse(nodeType=="gene", "ellipse","rectangle")) @ <>= renderGraph(toyDrawnRefine) @ \caption{Visualization of the toy network (incoherent type-1 feed-forward loop, IL1-FFL) created by \Rpackage{KGML-ED} and parsed by \Rpackage{KEGGgraph}. The rectangle nodes represent compounds while ellipse ones represent gene/gene products. Solid and dashed red arrows represent activation and expression, whereas blue represent inhibition and repression. Note that self-inhibition of GalS is presented by a node and an edge pointing to its self.}\label{fig:toyNetworkVis} \end{figure} Similarly, users could create new or modify existing pathways by modifying KGML files with tools including KGML-ED, and these pathways can be consequently parsed, visualized and analyzed consequently by \Rpackage{KEGGgraph} and other tools for graphs implemented in \R{Bioconductor}. In this sense \Rpackage{KEGGgraph} is not only able to study existing biological pathways constructed by KEGG, but also a tool for modelling pathways and biological networks in \R{R}. \section{Microarray analysis} KEGG PATHWAY contains pathway maps for nearly 1000 organisms and has been intensively expanded with pathways including signaling transduction, cellular processes and human diseases in recent years. As the graph-theory based approach to this valuable knowledge-base, \Rpackage{KEGGgraph} can be built into analysis pipelines where the pathway information could be useful, especially with the graph attributes. To demonstrate this we show one example of microarray analysis, cooperating with another package \R{SPIA} \cite{Tarca2009}, a package implementing the Signaling Pathway Impact Analysis (SPIA) algorithm. The microarray data is decribed in the vignette of \Rpackage{SPIA}, here it is only briefly summarized: Affemetrix GeneChip (GEO accession: GSE4107), containing 10 normal samples and 12 colorectal cancer samples, is normalized with \Rpackage{affy} and \Rpackage{limma} package. The result of \Rfunction{topTable} in \Rpackage{limma} is used as the input data. <>= if(require(SPIA)) { data(colorectalcancer,package="SPIA") } else { data(colorectalcancerSPIA, package="KEGGgraph") } library(SPIA) data(colorectalcancer) head(top) @ We only consider the probes annotated with EntrezGeneID and differentially expressed with FDR $p$-value less than 0.05. <>= library(hgu133plus2.db) x <- hgu133plus2ENTREZID top$ENTREZ <- unlist(as.list(x[top$ID])) top <- top[!is.na(top$ENTREZ),] top <- top[!duplicated(top$ENTREZ),] tg1 <- top[top$adj.P.Val < 0.05,] DE_Colorectal <- tg1$logFC names(DE_Colorectal) <- as.vector(tg1$ENTREZ) ALL_Colorectal <- top$ENTREZ @ The \emph{SPIA} algorithm takes as input two vectors: log2 fold changes of differentially expressed genes (\Robject{DE\_Colorectal}) and those of all EntrezGeneID annotated genes (\Robject{ALL\_Colorectal}), and returns a table of pathways ranked from the most to the least significant. The results show that Colorectal cancer pathway (ID:05210) is significantly activated (ranked the 5th of the result table). To visualize the results, we visualize the pathway with differentially expressed genes marked with color. For the first time we could use the \Rfunction{retrieveKGML} function to retrieve the KGML remotely from the KEGG FTP site. <>= tmp <- "hsa05210.xml" retrieveKGML(pathwayid="05210", organism="hsa", destfile=tmp) @ We have attached this file in the \Rpackage{KEGGgraph}, we parse it into graph and observe that around $30\%$ percent of the genes in the pathway are differentially expressed. The pathway map (figure) can be found at \url{http://www.genome.jp/dbget-bin/show_pathway?hsa05210}. <>= colFile <- system.file("extdata/hsa05210.xml", package="KEGGgraph") g <- parseKGML2Graph(colFile) deKID <- translateGeneID2KEGGID(names(DE_Colorectal)) allKID <- translateGeneID2KEGGID(ALL_Colorectal) isDiffExp <- nodes(g) %in% deKID sprintf("%2.2f%% genes differentially-expressed", mean(isDiffExp)*100) @ We visualize the pathway with pseudo-colors representing the log2 fold change of the differentially expressed genes in figure \ref{fig:spiaVis}. \begin{figure}[!hb] \centering <>= library(RColorBrewer) library(org.Hs.eg.db) library(RBGL) library(grid) ar <- 20 cols <- rev(colorRampPalette(brewer.pal(6, "RdBu"))(ar)) logfcs <- DE_Colorectal[match(nodes(g), deKID)] names(logfcs) <- nodes(g) logfcs[is.na(logfcs)] <- 0 incol <- round((logfcs+2)*5); incol[incol>ar] <- ar undetected <- !nodes(g) %in% allKID logcol <- cols[incol]; logcol[logfcs==0] <- "darkgrey"; logcol[undetected] <- "yellow" names(logcol) <- names(logfcs) nA <- makeNodeAttrs(g, fillcolor=logcol, label="", width=10, height=1.2) par(mar=c(3,5,0,5), mgp=c(0,0,0)) layout(mat=matrix(c(rep(1,8),2), ncol=1, byrow=TRUE)) plot(g, "dot", nodeAttrs=nA) image(as.matrix(seq(1,ar)), col=cols, yaxt="n", xaxt="n") mtext("down-regulation", side=1, at=0, line=1) mtext("up-regulation", side=1, at=1, line=1) @ \caption{Overview of colorectal cancer pathway with pseudo-colored nodes representing expression change. Red nodes represented up-regulated mRNAs in cancer samples and blue ones are down-regulated. Grey nodes are detected not to be differentially expressed, yellow ones are not reported (possibly due to under detection threshold). The gradual colors indicate the log2 fold change of the expression - the darker the colors, the larger fold change. One distinct feature is the existence of many non-connected nodes, which could has been omitted when the graph information is discarded and only the membership of the gene is considered.}\label{fig:spiaVis} \end{figure} <>= gDeg <- degree(g) gIsSingle <- gDeg[[1]] + gDeg[[2]] == 0 options(digits=3) gGeneID <- translateKEGGID2GeneID(nodes(g)) gSymbol <- sapply(gGeneID, function(x) mget(x, org.Hs.egSYMBOL, ifnotfound=NA)[[1]]) isUp <- logfcs > 0 isDown <- logfcs < 0 singleUp <- isUp & gIsSingle singleDown <- isDown & gIsSingle @ In the figure \ref{fig:spiaVis}, we observe that: \begin{itemize} \item Not all the genes are connected with each other, \Sexpr{mean(gIsSingle)*100}\% (\Sexpr{sum(gIsSingle)}/\Sexpr{length(gIsSingle)}) nodes in the pathway have a degree of 0 -- they are not conencted to any other node. There are several reasons for this. One of them is the genes with genetic alterations are indicated in the pathway map, but not interweaved into the pathway (cf. \url{http://www.genome.jp/dbget-bin/show_pathway?hsa05210}). Another important factor is that in some pathways, especially human disease pathways, KEGG records genes involved in the disease but does not provide any edge in that pathway, although one can find their interaction partners in other pathways (e.g., Grb2 has no edges in colorectal cancer pathway, however in other maps including MAPK signaling pathway, both up- and down-stream interactors are found). One solution to the later problem is to \emph{merge} (\emph{union}) the related pathways together, to this end \Rpackage{KEGGgraph} provides the function \Rfunction{mergeKEGGgraphs}. \item From the visualization in \ref{fig:spiaVis}, it is difficult to recognize any patterns. To examine the pathway in more details, it is necessary to use \emph{Divide and Conquer} strategy, namely to \emph{subset} the graph into subgraphs. To this end \Rpackage{KEGGgraph} provides the function \Rfunction{subKEGGgraph}, which maintains the KEGG information while subsetting the graph. \end{itemize} Out of \Sexpr{sum(gIsSingle)} nonconnected genes, \Sexpr{sum(singleUp)} is up-regulated and \Sexpr{sum(singleDown)} downregulated. Notably, out of 10 \emph{Frizzled} homologues (Wnt receptors), four are up-regulated (\emph{FZD2, FZD4, FZD7 and FZD10}), and only one is down-regulated(\emph{FZD5}), in accordance with the former report \cite{Vincan2007}. Next we concentrate on upregulated genes, the result given by \Rpackage{SPIA} package indicates that colorectal pathway is activated. \begin{figure}[!htb] \centering <>= ups <- nodes(g)[logfcs > 0] upNs <- unique(unlist(neighborhood(g, ups, return.self=TRUE))) upSub <- subKEGGgraph(upNs, g) upNeighbor <- nodes(upSub)[sapply(neighborhood(upSub, nodes(upSub)), length)>0] upNeighbor <- setdiff(upNeighbor, nodes(g)[undetected]) upSub <- subKEGGgraph(upNeighbor, upSub) upSubGID <- translateKEGGID2GeneID(nodes(upSub)) upSymbol <- gSymbol[upSubGID] upnA <- makeNodeAttrs(upSub, fillcolor=logcol[nodes(upSub)], label=upSymbol, fixedsize=TRUE, width=10, height=10, font=20) plot(upSub, "dot", nodeAttrs=upnA) @ \caption{Up-regulated genes (seven, in red) and their neighborhood genes in colorectal cancer pathway, the color follows the legend in figure \ref{fig:spiaVis}. Un-reported genes are hidden from the figure.}\label{fig:spiaSub} \end{figure} We can use degree centrality (the sum of in- and out-degree of each node) as a measure of the relative importance of the node compared to others. In this subgraph, 8 nodes have an degree equal or larger than three (in decreasing order: AKT3, TCF7L1, CCND1, FOS, MAPK3, MAPK1, JUN, MAPK10), and 5 of them are upreguated to different extent. This reinforces the conclusion of \Rpackage{SPIA} that the pathway is activated - it seems that the relative important nodes are up-regulated. Especially we note the upregulation of Jun and Fos, two transcription factors with many targeting genes, although their up-stream interactors (MAPK3 and MAPK1) are either not significantly differentially expressed or slightly down-regulated (MAPK1, logFC=-0.381). Further analysis could done, for example, by merging the colorectal pathway with linked pathways (Wnt signaling, apoptosis, etc) and investigate the graph characteristics of differnetially expressed genes and their links. \begin{thebibliography}{} \bibitem[Gentleman {\it et~al}., 2004]{Gentleman04} Gentleman {\it et~al}. (2004). Bioconductor: open software development for computational biology and bioinformatics, {\it Genome Biology}, {\bf 5}, R80. \bibitem[Carey {\it et~al}., 2005]{Carey05} Carey {\it et~al}. (2005). Network structures and algorithms in Bioconductor, {\it Bioinformatics}, {\bf 21}, 135-136. \bibitem[Kanehisa {\it et~al}., 2008]{Kanehisa08} Kanehisa {\it et~al}. (2008). KEGG for linking genomes to life and the environment, {\it Nucleic Acids Research, Database issue}, {\bf 36}, 480-484. \bibitem[Klukas and Schreiber, 2007]{Klukas07} Klukas and Schreiber. (2007). Dynamic exploration and editing of KEGG pathway diagrams, {\it Bioinformatics}, {\bf 23}, 344-350. \bibitem[Aittokallio and Schwikowski, 2006]{Aittokallio06} Aittokallio and Schwikowski. (2006). Graph-based methods for analysing networks in cell biology, {\it Briefings in Bioinformatics}, {\bf 7}, 243-255. \bibitem[Mangan and Alon, 2003]{Mangan2003} Mangan and Alon. (2003) Structure and function of the feed-forward loop network motif, {\it Proc. Natl. Acad. Sci. U.S.A}, {\bf 100}, 11980-11985. \bibitem[Alon, 2007]{Alon2007} Alon. (2007). An introduction to system biology: design principles of biological circuits, {\it Chapman and Hall}, 2007, 64. \bibitem[Tarca {\it et~al}., 2009]{Tarca2009} Tarca {\it et~al}. (2009). A signaling pathway impact analysis for microarray experiments. {\it Bioinformatics}, {\bf 25}, 75-82. \bibitem[Vincan {\it et~al}., 2007]{Vincan2007} Vincan {\it et~al}. (2007) Frizzled-7 dictates three-dimensional organization of colorectal cancer cell carcinoids, {\it Oncogene}, {\bf 26}, 2340-2352. \end{thebibliography} \end{document}