% \VignetteIndexEntry{iASeq Vignette} % \VignetteKeywords{SNP, RNAseq, ChIPseq, Bioinformatics} % \VignettePackage{iASeq} \documentclass[a4paper]{article} \title{iASeq Vignette} \author{Yingying Wei,Hongkai Ji} \begin{document} \maketitle \section{Introduction} In diploid organisms, certain genes can be expressed, methylated or regulated in an allele-specific manner. These allele-specific events (AS) are of high interest for phenotypic diversity and disease susceptibility. Next generation sequencing technologies provide opportunities to study AS globally. However, little is known about the mechanism of AS. For instance, the patterns of allele-specific binding (ASB) across different Transcription Factors (TFs) and histone modifications (HMs) are unclear. Moreover, the limited number of reads on heterozygotic SNPs results in low-signal-to-noise ratio when calling AS. Here, we propose a Bayes hierarchical model to study ASB by jointly analyzing multiple ChIP-seq studies. The model is able to learn the patterns of ASB across studies and make substantial improvement in calling ASB. In principle, the model can also be applied to call AS for multiple RNA-seq and MeDIP-seq studies. \section{Data preparation} In order to adopt the {\em iASeq\/} model, one needs to call the function {\em iASeqmotif\/}. The first requirement, \texttt{exprs}, is the matrix containing the read counts data for heterozygotic SNPs that needs to be analyzed. Each row of the matrix corresponds to a heterozygotic SNP and each column of the matrix corresponds to the reads count for either the reference allele or non-reference allele in a replicate of a study. \parskip=\baselineskip \noindent The second arguement, \texttt{studyid}, identifies the group label of each column. All columns in the same study have the same \texttt{studyid}. Here we use data {\em sampleASE\_exprs\/} as an illustration. {\em sampleASE\_exprs\/} are combined from five studies for 5504 heterozygotic SNPs, each study having two replicates. <<>>= library(iASeq) data(sampleASE) colnames(sampleASE_exprs) sampleASE_studyid @ \noindent The third arguement, \texttt{repid}, represents the sample label for each column of \texttt{exprs} matrix. The two columns within the same replicate, one for reference allele and the other for non-reference allele, have the same \texttt{repid}. In other words, \texttt{repid} discriminates the different replicates within the same study. In {\em sampleASE\/}, \texttt{BroadH3k27ac\_A, BroadH3k27ac\_B, BroadH3k27acR\_A,BroadH3k27acR\_B} for example are two samples from the same study BroadH3k27ac: <<>>= sampleASE_repid @ \noindent The fourth arguement, \texttt{refid}, indicates the reference allele label for each column of \texttt{exprs} matrix. Please code 0 for reference allele columns and 1 for non-reference allele columns to make the interpretation of \texttt{over bound(or expressed in case of expression)} to be skewing to the reference allele. Otherwise, just interpret the other way round. <<>>= sampleASE_refid @ \section{Model fitting} Once we have specified \texttt{studyid}, \texttt{repid} and \texttt{refid}, we are able to fit the {\em iASeq \/} model. We can fit the data with varying motif numbers and use information criterion BIC to select the best model. Here for {\em sampleASE\/}, we fit 5 models with total non-null motif patterns number varying from 1 to 5. Here is only a toy example, to get reseanable results, please run enough iterations for the EM algorithm. <<>>= motif.fitted<-iASeqmotif(sampleASE_exprs,sampleASE_studyid,sampleASE_repid, sampleASE_refid,K=1:5,iter.max=5,tol=1e-3) @ After fitting the {\em iASeq\/} model, we can check the BIC values obtained by all cluster numbers: <>= motif.fitted$bic plotBIC(motif.fitted) @ To picture the motif patterns learned by the algorithm, we can use function \texttt{plotMotif}. Each row in all four graphs corresponds to the same one motif pattern. We call the left two graphs {\em pattern graphs\/} and the right two bar charts {\em frequency graphs\/}. In the pattern graphs, each row indicates a motif pattern and each column represents a study. The grey scale of the cell $(k,d)$ demonstrates the probability of skewness to the reference allele or skewness to the non-reference allele in study $d$ for pattern $k$, and the values are stored in \texttt{motif.fitted\$bestmotif\$motif.qup} and \texttt{motif.fitted\$bestmotif\$motif.qdown}. Each row of the frequency graph corresponds to the motif pattern in the same row of the left pattern graphs. The length of the bar in the first frequency graphs estimates the number of SNPs of the given pattern in the dataset according to motif frequency, which is equal to \texttt{motif.fitted\$bestmotif\$motif.prior}, multiplying the number of total SNPs. The length of the bar in the second bar chart shows the number of SNPs called for the given pattern according to \texttt{cutoff} of posterior probability. \begin{center} <>= plotMotif(motif.fitted$bestmotif,cutoff=0.9) @ \end{center} The posterior probability for each SNP to be allele-specific event is stored in \texttt{bestmotif\$p.post}. <<>>= head(motif.fitted$bestmotif$p.post) @ \begin{thebibliography}{99} \bibitem{riASeq} Yingying Wei, Xia Li, Qianfei Wang, Hongkai Ji (2012) iASeq:integrating multiple ChIP-seq datasets for detecting allele-specific binding. \newblock \emph{In preparation}. \end{thebibliography} \end{document}