%\VignetteIndexEntry{Transcription Factor Association Rule Miner} %\VignetteDepends{} %\VignetteKeywords{} %\VignettePackage{TFARM} %\VignetteEngine{knitr::knitr} \documentclass{article} \usepackage[algoruled]{algorithm2e} \usepackage{color} \usepackage{amsfonts} \usepackage{graphicx} %\tracingstats=0 \definecolor{bronze}{rgb}{0.93, 0.53, 0.18} \definecolor{darkblue}{rgb}{0, 0.3, 0.6} \SetKw{template}{Template:} \SetKw{assign}{Assignment:} \SetKw{norm}{Normalization:} <>= BiocStyle::latex() @ \newcommand{\bam}{\texttt{BAM}} \title{\Biocpkg{TFARM}: Transcription Factor Associatio Rule Miner} \author{Liuba Nausicaa Martino \email{liuban.martino@gmail.com} \\Alice Parodi \\Gaia Ceddia \\Piercesare Secchi \\Stefano Campaner \\Marco Masseroli } %\date{Modified: February 24, 2016. Compiled: \today} \date{\today} \begin{document} %\SweaveOpts{concordance=TRUE} <>= library(knitr) opts_chunk$set( concordance=FALSE ) @ <>= library(knitr) opts_chunk$set( background = "#C0C0C0" ) @ \maketitle \tableofcontents <>= options(width=60) @ <>= library(GenomicRanges) @ \bigskip <>= library(TFARM) @ \section{Introduction} Looking for association rules between transcription factors in genomic regions of interest can be useful to find direct or indirect interactions among regulatory factors of DNA transcription. However, the results provided by the most recent algorithms for the search of association rules \cite{borgelt2002induction} \cite{agrawal1993mining} alone are often not intelligible enough, since they only provide a list of association rules. A novel method is proposed for subsequent mining of these results to evaluate the contribution of the items in each association rule. The \Biocpkg{TFARM} package allows us to identify and extract the most relevant association rules with a given target transcription factor and compute the \textit{Importance Index} of a transcription factor (or a combination of some of them) in the extracted rules. Such an index is useful to associate a numerical value to the contribution of one or more transcription factors to the co-regulation with a given target transcription factor. \section{Dataset} Association rules are extracted from a GRanges object in which metadata columns identify transcription factors and genomic coordinates are represented in the left-hand-side of the GRanges; thus, each row is a different genomic region. The element (i,j) (with j > 4) of the metadata section is equal to 0 if a binding site of transcription factor j is absent in region i, or to 1 (or any other value) if it is present. This dataset, called \textit{matrix of presences}, should not have rows with only 0 values since we consider regions with no transcription factors as uninteresting regions. The first three columns of the GRanges contain the chromosome name, the genomic coordinates (i.e., \textit{left} and \textit{right} coordinate are the leftmost and rightmost bases of the DNA region), and the strand (encoded as "+", "-", or "*" if unknown), of each region respectively. The GRanges is obtained from the analysis of ENCODE ChIP-seq data: it concerns the localization of transcription factors binding sites and histone modifications in DNA, as well as RefSeq data (https://www.ncbi.nlm.nih.gov/refseq/); specifically, here we focus on promotorial regions, but further analyses are possible on any region of interest. Such data have been processed and extracted with GMQL (GenoMetric Query Language \cite{masseroli2015genometric}, http://www.bioinformatics.deib.polimi.it/GMQL/) queries. In this example, the dataset we consider is the matrix of presences of 25 transcription factors' binding sites of the MCF-7 human breast adenocarcinoma cell line (i.e., all the transcription factors evaluated in ENCODE for this cell line), in the 2,944 promotorial regions of chromosome 1: \bigskip <<>>= # Load and visualize the dataset: data("MCF7_chr1") length(MCF7_chr1) MCF7_chr1 @ \section{Extraction of the most relevant associations} We define a relevant association for the prediction of transcription factor TFt in the considered genomic regions as an association rule of the type: \begin{center} \{TF1=1,TF2=1,TF3=1\} $\rightarrow$ \{TFt=1\} \end{center} which means that the presence of the transcription factors TF1, TF2, and TF3 implies the presence of transcription factor TFt. Every association rule is characterized by a set of three measures: support, confidence and lift: \begin{itemize} \item \textit{support}: \begin{equation} supp(X \rightarrow Y) = \frac {supp(X \cup Y)}{N} \label{supp_rule} \end{equation} where N is the number of transactions, X $\cup$ Y is a set of items and Supp(X $\cup$ Y) is the support of the itemset \{X,Y\}, defined as \begin{equation} supp(X) = \frac {|\{t_i \in N ; X \subseteq t_i\}|} {N} \label{supp} \end{equation} that is the proportion of transactions $t_i$ in the dataset which contains the itemset X. The support of an association rule measures the frequency of a rule in the dataset and varies in the interval [0,1]. \item \textit{confidence}: \begin{equation} conf(X \rightarrow Y) = \frac {supp(X \cup Y)} {supp(X)} \label{conf} \end{equation} It gives an estimate of the conditioned probability P(Y|X), that is the probability to find the right-hand-side (RHS) of the rule (i.e., the itemset Y) in a set of transactions, given that these transactions also contain the left-hand-side (LHS) of the rule (i.e., the itemset X). Therefore, it measures the realiability of the inference made by the rule X $\rightarrow Y$. The higher is the confidence of the rule, the higher is the probability to find the itemset Y in a transaction containing the itemset X. It varies in the interval [0,1]. \item \textit{lift}: \begin{equation} lift(X \rightarrow Y) = \frac {supp(X \cup Y)}{supp(X) supp(Y)} \label{lift} \end{equation}It measures the strength of the rule, and varies in the interval [0,$\infty$]. \end{itemize} \bigskip To extract a set of relevant associations the user has to specify: \begin{itemize} \item[1.] the presence/absence of the target transcription factor to be predicted, TFt; \item[2.] the minimal support threshold of the rules to be extracted; \item[3.] the minimal confidence threshold of the rules to be extracted. \end{itemize} Points 2. and 3. strongly depend on the dimensions of the dataset (i.e., number of rows - regions - and number of columns - transcription factors), the presence of the target transcription factor in the considered regions, the number of relevant associations that the user wants to find. Usually, the confidence threshold is set higher than 0.5, since it measures the posterior probability to have TFt given the presence of the pattern in the left-hand-side of the rule (e.g., \{TF1=1,TF2=1,TF3=1\}). \medskip The function \texttt{rulesGen} in the \Biocpkg{TFARM} package extracts the association rules by calling the \texttt{apriori} function of the \Rpackage{arules} package \cite{arules1} \cite{arules2} \cite{arules3}. It takes in input: \begin{itemize} \item the GRanges object in which the matrix of presences is represented; \item the target transcription factor; \item the minimum support threshold of the rules to be extracted; \item the minimum confidence threshold of the rules to be extracted; \item the logical parameter \textit{type} that sets the type of left-hand-side of the rules to be extracted (i.e., containing only present transcription factors, or containing present and/or absent transcription factors). \end{itemize} The result of the \texttt{rulesGen} function is a data.frame containing: \begin{itemize} \item in the first column the left-hand-side of each extracted rule; \item in the second column the right-hand-side of each extracted rule (that is the presence/absence of the given target transcription factor); \item in the third column the support of each extracted rule; \item in the fourth column the confidence of each extracted rule; \item in the fifth column the lift of each extracted rule. \end{itemize} See \Rpackage{arulesViz} package for visualization tools of association rules. \bigskip <<>>= # Coming back to the example on the transcription factors of cell line # MCF-7, in the promotorial regions of chromosome 1. # Suppose that the user wants to find the most relevant association # rules for the prediction of the presence of TEAD4. This means extracting # all the association rules with right-hand-side equal to {TEAD4=1} # setting the parameter type = TRUE; the minimun support and minimum # confidence thresholds are set, as an example, to 0.005 and 0.62, # respectively: r_TEAD4 <- rulesGen(MCF7_chr1, "TEAD4=1", 0.005, 0.62, TRUE) dim(r_TEAD4) head(r_TEAD4) @ \bigskip Once the set of the most relevant association rules (i.e., with support and confidence higher than the thresholds specified as parameters) is extracted, the user can look for \textit{candidate co-regulator transcription factors} with the target transcription factor (in the example TEAD4), which are the transcription factors present in the LHS of the extracted rules. This is provided by the function \texttt{presAbs} of the \Biocpkg{TFARM} package. The function \texttt{presAbs} takes in input: \begin{itemize} \item a string vector containing the names of all transcription factors present in the matrix of presences; \item the set of the most relevant association rules previously extracted with \texttt{rulesGen}; \item a logical parameter, \textit{type}, which refers to the type of rules extracted with the \texttt{rulesGen} function. If \textit{type = TRUE}, the LHS of the rules contains only items of the type TF=1, otherwise, if \textit{type = FALSE}, the LHS of the rules can contain both items TF=1 and TF=0. \end{itemize} The \texttt{presAbs} function has two outputs: \begin{itemize} \item \textit{pres}, which is a string vector containing all the items present in the LHSs of the considered set of rules; \item \textit{abs}, which is a string vector containing all the items absent in the LHSs of the considered set of rules. \end{itemize} \bigskip << >>= # Transcription factors present in at least one of the regions: c <- names(mcols(MCF7_chr1)) c lc <- length(c) names(presAbs(c, r_TEAD4, TRUE)) # Transcription factors present in at least one of the association rules: p_TFs <- presAbs(c, r_TEAD4, TRUE)$pres p_TFs # Transcription factors absent in all the association rules: a <- presAbs(c[1:lc], r_TEAD4, TRUE)$abs a @ \bigskip All transcription factors in p are said to be \textit{candidate co-regulators} of the TFt in the most relevant associations extracted with \texttt{rulesGen}. \section{Importance Index of a transcription factor} The extraction of candidate transcription factors for the interaction with a given target transcription factor (TFt) can be useful to provide a global vision of the possible associations of TFt. However, since the number of association rules and candidate co-regulators can be very high, this list does not provide an intelligible result, giving the lack of measures of how much each transcription factor contributes to the existence of a certain complex of transcription factors. Let us consider for example the rule \begin{center} \{TF1=1,TF2=1,TF3=1\} $\rightarrow$ \{TFt=1\} \end{center}Just looking at it, the user could not tell if the presences of TF1, TF2 and TF3 equally contribute to the prediction of the presence of TFt. A solution to this problem can be given by removing, alternatively, TF1, TF2, and TF3 from the rule and evaluate: \begin{itemize} \item[1)] if the rule keeps on existing and being relevant \item[2)] how the three quality measures of support, confidence, and lift of the rule change. \end{itemize} If a rule is not found as relevant after removing a transcription factor from its LHS, then the presence of that transcription factor in the pattern \{TF1=1,TF2=1,TF3=1\} is fundamental for the existence of the association rule \{TF1=1,TF2=1,TF3=1\} $\rightarrow$ \{TFt=1\}. Otherwise, if the rule keeps on existing as relevant, and its quality measures are similar to the ones of the rule initially considered, then the presence of that transcription factor in the pattern \{TF1=1,TF2=1,TF3=1\} is not fundamental for the existence of the association rule \{TF1=1,TF2=1,TF3=1\} $\rightarrow$ \{TFt=1\}. Let us fix an item I (i.e., a candidate co-regulator for TFt) and extract the subset of the most relevant associations containing I, named \{R$^I$\} (with J number of rules in \{R$^I$\}, J=$|$\{R$^I$\}$|$). Each element of \{R$^I$$_j$\}$_{j=1:J}$ is described by a set of quality measures of support, confidence and lift: \{$s^I_j$, $c^I_j$, $l^I_j$\}$_{j=1:J}$. \begin{table}[htbp]\footnotesize \begin{tabular}{|c|c|c|c|} \hline rule & support & confidence & lift \\ \hline $R^I_1$ & $s^I_1$ & $c^I_1$ & $l^I_1$ \\ ... & ... & ... & ... \\ $R^I_J$ & $s^I_J$ & $c^I_J$ & $l^I_J$ \\ \hline \end{tabular} \caption{\small Rules containing item I, and corrispondent measures of support, confidence and lift.} \label{m_rules_1} \end{table} Let then be $\{R^{I-}_j\}_{j=1:J}$ the set of rules obtained substituting the presence of item I with its absence in each element of $\{R^{I}_j\}_{j=1:J}$. For example, if I is TF1 and $R^I_j$ is the rule \{TF1=1,TF2=1,TF3=1\} $\rightarrow$ \{TFt=1\}, with measures $\{s^I_j, c^I_j, l^I_j\}$, then $R^{I-}_j$ will be the rule $\{TF1=0,TF2=1,TF3=1\} \rightarrow \{TFt=1\}$ with measures $\{s^{I-}_j, c^{I-}_j, l^{I-}_j\}$. Thus, $R^{I}_j$ and $R^{-I}_j$ consider the same number of association rules for each item. \begin{table}[htbp]\footnotesize \begin{tabular}{|c|c|c|c|} \hline rule & support & confidence & lift \\ \hline $R^{I-}_1$ & $s^{I-}_1$ & $c^{I-}_1$ & $l^{I-}_1$ \\ ... & ... & ... & ... \\ $R^{I-}_J$ & $s^{I-}_J$ & $c^{I-}_J$ & $l^{I-}_J$ \\ \hline \end{tabular} \caption{\small Rules originally containing item I obtained by removing I, and corrispondent support, confidence and lift measures.} \label{m_rules_2} \end{table} To analyze the importance of a transcription factor, for example TF1, we can compare the two distributions \{$s^I_j$, $c^I_j$\}$_{j=1:J}$ and \{$s^{I^-}_j$, $c^{I^-}_j$\}$_{j=1:J}$ for each j in \{1,...,J\}. Since support and confidence vary in [0,1], while the lift is directly proportional to the confidence measure, we can define an index of the importance of item I in the rule R$^I$$_j$ for j in \{1,...,J\} as: \begin{center} \begin{equation} \label{imp_rule} imp(I)_j = {\Delta s}_j + {\Delta c}_j \end{equation} \end{center} with: \hspace{0.5cm} \begin{math} {\Delta s}_j = {s^I}_j - {s^{I^-}}_j\hspace{0.7cm} {\Delta c}_j = {c^I}_j - {c^{I^-}}_j\hspace{0.7cm} \end{math} The importance of I in its set of rules \{R$^I$\} is obtained evaluating the mean of all its importances imp(I)$_j$ in the set of rules: \begin{equation} \centering \label{imp_formula} imp(I) = \frac{\sum_{j=1}^{J} imp(I)_j}{J} \end{equation} Then, evaluating the index imp(I) for each item in the relevant association rules extracted can be useful to rank the transcription factors by their importance in the association with the target transcription factor, TFt. The presence of the transcription factors with highest mean Importance Index is assumed to be fundamental for the existence of some regulatory complexes (i.e., association rules assumed to be relevant); the transcription factors with lower mean Importance Index, instead, do not significantly influence the pattern of transcription factors associated with the target transcription factor. The definition of the Importance Index can be extended to couples of items, triplets, and so on. This can be easily done by substituting the item I with a set of items (for example I=\{TF1=1,TF2=1\}), and applying the rest of the procedure in a completely analogous way. Thus, we identify as $R^I$ the set of rules containing both TF1 and TF2 and $R^{I-}$ as the set of correspondent rules without the two transcription factors. This kind of approach allows the identification of interactions between transcription factors that would be unrevealed just looking at a list of association rules. The \texttt{rulesTF} function in \Biocpkg{TFARM} package provides the subset of input rules containing a given transcription factor TFi. It takes in input: \begin{itemize} \item a set of rules \item the transcription factor TFi that the user wants to find in the LHSs of a subset of the considered rules \item a logical parameter, \textit{verbose}: if \textit{verbose = TRUE}, a console message is returned if the searched subset of rules is empty. \end{itemize} The output of the function is a data.frame containing the subset of rules whose LHSs contain TFi, and the corresponding quality measures. Using the introduced notation, the output of the \texttt{rulesTF} function is \{R$^I$$_j$\}$_{j=1:J}$ with the quality measures \{s$^I$$_j$, c$^I$$_j$, l$^I$$_j$\}$_{j=1:J}$. The data.frame has J rows and five columns: the first column contains the LHS of the selected rules, the second one contains the RHS of the rules and the last three columns contain s$^I$$_j$, c$^I$$_j$, l$^I$$_j$ (that is a data.frame like the one in Table \ref{m_rules_1}). \bigskip <<>>= # To find the subset of rules containing the transcription factor FOSL2: r_FOSL2 <- rulesTF(TFi = 'FOSL2=1', rules = r_TEAD4, verbose = TRUE) head(r_FOSL2) dim(r_FOSL2)[1] @ \bigskip <<>>= # If none of the rules in input to rulesTF contains the given item TFi, # and verbose = TRUE, a warnig message is reported to the user: r_CTCF <- rulesTF(TFi = 'CTCF=1', rules = r_TEAD4, verbose = TRUE) @ \bigskip If the user wants to evaluate the importance of item I in a set of rules $R^I$, the user needs to substitute the presence of I in all the left-hand-side patterns of $R^I$ with its absence: this is done using the function \texttt{rulesTF0} in \Biocpkg{TFARM} package. This function takes in input: \begin{itemize} \item the transcription factor TFi to be removed \item a set of rules containing TFi \item the total set of rules \item the GRanges object containing the matrix of presences \item the target transcription factor. \end{itemize} It returns a data.frame with the rules obtained substituting the presence of TFi with its absence and the correspondent measures. Using the introduced notation, the output of the \texttt{rulesTF0} function is \{R$^{I-}$$_j$\}$_{j=1:J}$ with the quality measures \{s$^{I-}$$_j$, c$^{I-}$$_j$, l$^{I-}$$_j$\}$_{j=1:J}$. The data.frame has J rows and five columns: the first colum contains the LHS of the rules in $R^I$ without TFi, the second one contains the RHS of the rules and the last three columns contain s$^{I-}$$_j$, c$^{I-}$$_j$, l$^{I-}$$_j$ (that is a data.frame like the one in Table \ref{m_rules_2}). \bigskip <>= # For example to evaluate FOSL2 importance in the set of rules r_FOSL2: r_noFOSL2 <- rulesTF0('FOSL2=1', r_FOSL2, r_TEAD4, MCF7_chr1, "TEAD4=1") @ <<>>= row.names(r_FOSL2) <- match(r_FOSL2$lhs, r_TEAD4$lhs) row.names(r_noFOSL2) <- match(r_FOSL2$lhs, r_TEAD4$lhs) head(r_noFOSL2) @ \bigskip Now that the two sets of rules \{R$^I$$_j$\}$_{j=1:J}$ and \{R$^{I-}$$_j$\}$_{j=1:J}$, and the two sets of measures \{s$^I$$_j$, c$^I$$_j$, l$^I$$_j$\}$_{j=1:J}$ and \{s$^{I-}$$_j$, c$^{I-}$$_j$, l$^{I-}$$_j$\}$_{j=1:J}$ are obtained, the user can compute the Importance Index distribution for the chosen transcription factor TFi. This can be done with the function \texttt{IComp} in the \Biocpkg{TFARM} package which takes in input: \begin{itemize} \item the transcription factor TFi \item the subset of rules rules\_TF containing TFi (provided by the function \texttt{rulesTF}) with their quality measures of support, confidence and lift \item the subset of rules rules\_noTF obtained from rules\_TF removing TFi (provided by the function \texttt{rulesTF0}) \item a logical parameter (figures) to graphically rapresent \{s$^I$$_j$, c$^I$$_j$, l$^I$$_j$\}$_{j=1:J}$ and \{s$^{I-}$$_j$, c$^{I-}$$_j$, l$^{I-}$$_j$\}$_{j=1:J}$; set \textit{figures = TRUE} to get it as an output. \end{itemize} The function has five outputs: \begin{itemize} \item imp, which is the set of importance index values of TFi in the given set of rules (rules\_TF), one value for each rule. \item delta, which is the matrix of variations of standardized support, confidence, and lift, obtained removing TFi from rules\_TF. \item rwi, which is a data.frame that contains rules from \texttt{rulesTF} associated with each candidate co-regulator transcription factor. \item rwo, which is a data.frame with rules in rwi obtained removing each transcription factor TFi. \item the plots of \{s$^I$$_j$, c$^I$$_j$, l$^I$$_j$\}$_{j=1:J}$ and \{s$^{I-}$$_j$, c$^{I-}$$_j$, l$^{I-}$$_j$\}$_{j=1:J}$ obtained if the user sets \textit{figures = TRUE}. \end{itemize} \bigskip <>= # Perform the IComp function to compute the Importance Index distribution: imp_FOSL2 <- IComp('FOSL2=1', r_FOSL2, r_noFOSL2, figures=TRUE) names(imp_FOSL2) imp_FOSL2$imp head(imp_FOSL2$delta) head(imp_FOSL2$rwi) head(imp_FOSL2$rwo) @ \bigskip \incfig{figure/IComp-1}{\textwidth}{Support and Confidence for the extracted rules before and after the removal of item $I$.}{Left panel: Support distribution $\{s^I_j\}_{j=1:J}$, black thick line, and $\{s^{I-}_j\}_{j=1:J}$,red dotted line. Right panel: Confidence distribution $\{c^I_j\}_{j=1:J}$, black thick line, and $\{c^{I-}_j\}_{j=1:J}$, red dotted line.} The most useful application of the function \texttt{IComp} is the ranking of candidate co-regulators through their importances. As previously seen, the candidate co-regulators are returned by the function \texttt{presAbs}. The evaluation of the mean Importance of each co-regulator can be computed cycling the three functions \texttt{rulesTF}, \texttt{rulesTF0} and \texttt{IComp} over a string vector with all transcription factors present in the set of relevant association rules extracted, as returned by the function presAbs. \bigskip <>= # For the considered example the user could run: DELTA_mean_supp <- vector("list", length(p_TFs)) DELTA_mean_conf <- vector("list", length(p_TFs)) all <- lapply(p_TFs, function(pi) { A <- rulesTF(pi, r_TEAD4, FALSE) B <- rulesTF0(pi, A, r_TEAD4, MCF7_chr1, "TEAD4=1") IComp(pi, A, B, figures=FALSE) }) for (i in 1:length(p_TFs)) { IMP_Z[[i]] <- all[[i]]$imp # Extract the delta variations of support and confidence: DELTA_mean_supp[[i]] <- apply(all[[i]]$delta[1], 2, mean) DELTA_mean_conf[[i]] <- apply(all[[i]]$delta[2], 2, mean) } IMP <- data.frame( TF = p_TFs, imp = sapply(IMP_Z, mean), sd = sapply(IMP_Z, sd), delta_support = as.numeric(DELTA_mean_supp), delta_confidence = as.numeric(DELTA_mean_conf), nrules = sapply(IMP_Z, length), stringsAsFactors=FALSE ) library(plyr) @ <<>>= # Sort by imp column of IMP IMP.ord <- arrange(IMP, desc(imp)) IMP.ord @ \bigskip In this way we get, besides the mean Importance Index of each candidate co-regulator of TFt (TFt = TEAD4 in the example), the standard deviation of the distribution of the Importance Index of each candidate co-regulator of TFt, and the number of rules in which each candidate co-regulator of TFt is present. \medskip The function \texttt{IComp} can be easily generalized for the computation of the mean Importance Index of combinations of transcription factors (see the example used for the \texttt{heatI} function in the following section). \subsection{Validation of the Importance Index formula} We defined the Importance Index of an item in an association rule as the linear combination of the variations of the support and confidence of the rule obtained substituting the presence of the item in the left-hand-side of the association rule, with its absence (as in Formula \ref{imp_rule}). In this way, we assume that each of the two variations equally contributes to the evaluation of the contribution of the item to the prediction of the presence of another item in the right-hand-side of the considered association rule. Nevertheless, one of the two quality measures might be more or less sensitive than the other to the removal of the item from the rule, leading to a greater or smaller variation of one or more of the values of support and confidence. \begin{table}[htbp]\footnotesize \begin{tabular}{c|c|c|c} TF & $\Delta s_s$ & $\Delta c_c$\\ \hline $TF_1$ & $\Delta s_{s,1}$ & $\Delta c_{c,1}$\\ ... & ... & ... & ...\\ $TF_1$ & $\Delta s_{s,n_1}$ & $\Delta c_{c,n_1}$\\ & & & \\ ... & ... & ... & ...\\ & & & \\ $TF_M$ & $\Delta s_{s,K-n_M+1}$ & $\Delta c_{c,K-n_M+1}$\\ ... & ... & ... & ...\\ $TF_M$ & $\Delta s_{s,K}$ & $\Delta c_{c,K}$\\ \end{tabular} \caption{\small Matrix with the variations of the support and confidence, obtained removing each transcription factor from the subset of rules in which it is present. M is the total number of transcription factors, K is the total number of rules and $n_i$ is the number of rules for transcription factor TFi.} \label{D} \end{table} Thanks to the Principal Components Analysis \cite{johnson2002applied} \cite{bro2014principal}, computed by the function \texttt{IPCA} in the \Biocpkg{TFARM} package, we can evaluate if it is possible to find a subspace of $\mathbb{R}^2$ in which the most variability of the dataset containing the variations of the measures (Table \ref{D}) is captured. This can be easily done by extracting the delta variations of support and confidence, using the function \texttt{IComp}, simply getting its \textit{delta} output, as well as a matrix containing the candidate co-regulators found, and the number of rules in which each of them appears. A principal component is a combination of the original variables after a linear transformation; the set of principal components defines a new reference system. The new coordinates of data represented in the reference system defined by principal components are called \textit{scores}, and the coefficients of the linear combination that define each principal component are called \textit{loadings} (so, loadings give a measure of the contribution of every observation to each principal component). The \texttt{IPCA} function takes in input: \begin{itemize} \item the list of variations of distributions of support and confidence measures, obtained from the \texttt{IComp} function \item a matrix with the mean importance of all candidate co-regulators and the number of rules in which each of them appears. \end{itemize} It returns: \begin{itemize} \item a summary, containing: the standard deviation on each principal component, the proportion of variance explained by each principal component, and the cumulative proportion of variance described by each principal component; \item the scores of each principal component \item the loadings of each principal component \item a plot with the variability and the cumulate percentage of variance explained by each principal component \item a plot with the loadings of the principal components \end{itemize} \bigskip <>= # Select the candidate co-regulators and the number of rules # associated with them, then perform the Principal Component Analysis: colnames(IMP) TF_Imp <- data.frame(IMP$TF, IMP$imp, IMP$nrules) i.pc <- IPCA(DELTA, TF_Imp) names(i.pc) i.pc$summary head(i.pc$loadings) head(i.pc$scores) @ %% \incfig{figure/IPCA-1}{\textwidth}{Principal Component Analysis of Importance Index}{Variances of each of the two principal components (on the top left), the cumulate proportion of variance explained by each principal component (on the top right), and loadings of the two principal components.} \bigskip Looking at the value of the variance associated with the first principal component in Figure \ref{figure/IPCA-1}, this value explains 89.13\% of the variability of the DELTA dataset. Moreover, from the plot of the loadings in Figure \ref{figure/IPCA-1}, it is easy to note that the first principal component is a linear combination of the variations of support and confidence, that equally contribute to the combination. So, it is reasonable to define the Importance Index as in Formula \ref{imp_rule}. \section{Visualization tools} The function \texttt{distribViz} in the \Biocpkg{TFARM} package provides a boxplot visualization of the Importance Index distributions of a set of transcription factors (or of combinations of transcription factors). \bigskip <>= # Considering for example the candidate co-regulators # found in the set of rules r_TEAD4: distribViz(IMP_Z, p_TFs) @ \incfig{figure/distribViz-1}{\textwidth}{Importance Index distribution.}{Importance Index distributions of candidate co-regulators of TEAD4 in the set of the 30 most relevant associations for the prediction of TEAD4 in promotorial regions of chromosome 1 in MCF7 cell line.} The shape of boxplots changes as follows: \begin{itemize} \item The higher the number of rules containing the candidate co-regulator I, the larger the boxplot for I is \item The higher the variability of the Importance Index of I, the longer the boxplot for I is \item The higher the median of the Importance Index distribution of I, the higher the boxplot for I is aligned with respect to the y-axis. \end{itemize} Moreover, named $q_1$ and $q_3$ the first and third quartiles of the Importance Index distribution for a given item I, all the rules where I has importance \begin{math}x \leq q_1 - 1.5*(q_3 - q_1)\end{math} or \begin{math} x \geq q_1 + 1.5*(q_3 - q_1)\end{math} are considered outlier rules, and represented with a circle outside the boxplot. For example, in the boxplots in Figure \ref{figure/distribViz-1} it is easy to notice that: \begin{itemize} \item[1.] SIN3AK20, HDAC2, GATA3, and GABPA have the highest median Importance Index, and they are present in a high number of relevant association rules \item[2.] HA.E2F1 and TCF12 have intermediate median Importance Index and the lowest variability Importance Index distribution \item[3.] ELF1 and NR2F2 are present in a high number of relevant association rules, but they have low median Importance Index and high variability of the Importance Index distribution. \end{itemize} It can also be noticed that for the transcription factors GABPA, MYC, MAX, NR2F2, FOSL2, and ZNF217 there are some outlier rules, that are rules in which the Importance Index of the candidate co-regulators is a lot higher or lower than the rest of the distribution. These outliers can be extracted as reported in the following text: \bigskip <<>>= # Select the index of the list of importances IMP_Z # containing importance distributions of transcription factor ZNF217 ZNF217_index <- which(p_TFs == 'ZNF217=1') # Select outlier rules where ZNF217 has importance greater than 0 o <- IMP_Z[[ZNF217_index]] > 0 rule_o <- all[[ZNF217_index]]$rwi[o,] # Display the one rule for example the sixth rule_o[6,] # So, ZNF217 is very relevant in the pattern of transcription factors # {FOSL2=1,GABPA=1,MYC=1,MAX=1,ZNF217=1} # for the prediction of the presence of TEAD4. # To extract support, confidence and lift of the corresponding rule # without ZNF217: all <- all[[ZNF217_index]]$rwo[o,] all[6,] # Since the measure of the rule obtained removing ZNF217 is equal to zero, # the rule {FOSL2=1,GABPA=1,MYC=1,MAX=1,ZNF217=0} -> {TEAD4=1}, # obtained removing ZNF217, is found in the relevant rules for the prediction # of the presence of TEAD4. @ \bigskip The function \texttt{heatI} is another useful visualization tool of the package \Biocpkg{TFARM}; it takes in input: \begin{itemize} \item a string vector with names of transcription factors \item a vector of mean importances of pairs of transcription factors in the previous input. \end{itemize} It returns a heatmap visualization of the mean importances of transcription factors in the considered string vector. Evaluating importances of combinations of transcription factors, the number of Importance Index distribution grows combinatorially. This makes it more difficult to see which are the most critical combinations (even sorting them by their mean importances). For pairs of transcription factors, the function \texttt{heatI} gives an heatmap visualization of a square matrix whose elements are as follows (Table \ref{matrimp}): called M the number of candidate co-regulators, the element (i,j) of such matrix is the mean Importance Index of a couple of transcription factors ($TF_i$, $TF_j$). This matrix is symmetric with respect to the main diagonal. \begin{table}[htbp]\footnotesize \begin{tabular}{|c|c|c|c|c|c|} & $TF_1$ & $TF_2$ & ... & $TF_{M-1}$ & $TF_M$\\ \hline $TF_1$ & imp($TF_1$)& imp(\{$TF_1$,$TF_2$\}) & ... & imp(\{$TF_1$,$TF_{M-1}$\}) & imp(\{$TF_1$,$TF_M$\}) \\ $TF_2$ & imp(\{$TF_2$,$TF_1$\}) & imp($TF_2$)&... &imp(\{$TF_2$,$TF_{M-1}$\}) & imp(\{$TF_2$,$TF_M$\}) \\ ... & & & & & \\ $TF_{M-1}$ & imp(\{$TF_{M-1}$,$TF_1$\}) & imp(\{$TF_{M-1}$,$TF_2$\}) &... & imp($TF_{M-1}$)& imp(\{$TF_{M-1}$,$TF_M$\}) \\ $TF_M$ & imp(\{$TF_M$,$TF_1$\}) & imp(\{$TF_M$,$TF_2$\}) & ... & imp(\{$TF_M$,$TF_{M-1}$\}) & imp($TF_M$) \\ \end{tabular} \caption{\small Mean importance matrix of couples of transcription factors} \label{matrimp} \end{table} To get this matrix, we need to build all possible combinations of pair of candidate co-regulators. It can be easily computed with the function \texttt{combn} in the package \Rpackage{combinat}. This function takes as input a vector (which is a string vector of transcription factors) and the number of required elements in the combinations. Function combn(p, 2) generates all pair combinations of p elements. The elements of each combination are then combined in the form \textit{TF1,TF2}. \bigskip <<>>= # Construct couples as a vector in which all possible combinations of # transcription factors (present in at least one association rules) # are included: couples_0 <- combn(p_TFs, 2) couples <- paste(couples_0[1,], couples_0[2,], sep=',') head(couples) @ <>= # The evaluation of the mean Importance Index of each pair is # then computed similarly as previously done for single transcription factors: # Compute rulesTF, rulesTF0 and IComp for each pair, avoiding pairs not # found in the r_TEAD4 set of rules IMP_c <- lapply(couples, function(ci) { A_c <- rulesTF(ci, r_TEAD4, FALSE) if (all(!is.na(A_c[[1]][1]))){ B_c <- rulesTF0(ci, A_c, r_TEAD4, MCF7_chr1, "TEAD4=1") IComp(ci, A_c, B_c, figures=FALSE)$imp } }) # Delete all NULL elements and compute the mean Importance Index of each pair I_c <- matrix(0, length(couples), 2) I_c <- data.frame(I_c) I_c[,1] <- paste(couples) null.indexes <- vapply(IMP_c, is.null, numeric(1)) IMP_c <- IMP_c[!null.indexes] I_c <- I_c[!null.indexes,] I_c[,2] <- vapply(IMP_c, mean, numeric(1)) colnames(I_c) <- colnames(IMP[,1:2]) @ <<>>= # Select rows with mean Importance Index different from NaN, then order I_c: I_c <- I_c[!is.na(I_c[,2]),] I_c_ord <- arrange(I_c, desc(imp)) head(I_c_ord) @ <>= # Construction of a vector in which mean Importance Index values of pairs # of transcription factors are represented. # These transcription factors are taken from the output of presAbs as # present in at least one association rules. # The function rbind is used to combine IMP columns and I_c_ord columns and # then the function arrange orders the data frame by the imp column. I_c_2 <- arrange(rbind(IMP[,1:2], I_c_ord), desc(imp)) p_TFs <- sub("=1", "", p_TFs) I_c_2$TF <-sub("=1", "",I_c_2$TF) i.heat <- heatI(p_TFs, I_c_2) @ \bigskip To build the heatmap, the user must also consider the single transcription factor mean importances (since the heatmap diagonal elements are the mean importances of single transcription factors). \bigskip \incfig{figure/heatmap-1}{0.55\textwidth}{Heatmap.}{Mean importance of couples of candidate co-regulator transcription factors in the set of the 30 most relevant rules for the prediction of the presence of TEAD4 in promotorial regions of chromosome 1 in cell line MCF-7. The mean importances of single transcription factors are represented in the main diagonal as in Table \ref{matrimp}.} The obtained heatmap is represented in Figure \ref{figure/heatmap-1}. The color scale indicates that the lowest mean importances are represented in dark red, whereas the highest ones are represented in light white. This representation is useful to notice that, for example: \begin{itemize} \item ZNF127 has high mean Importance Index alone and in couple with all other candidate co-regulator transcription factors; \item TCF12 has low mean Importance Index alone and in couple with all other candidate co-regulator transcription factors, except with GABPA, ZNF127, and NR2F2. \end{itemize} \bibliography{bibliography} \end{document}