%\VignetteIndexEntry{An R package for prediction of nucleosome positioning} %\VignetteKeywords{Nucleosome} \documentclass[a4paper]{article} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\textit{#1}}} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \usepackage{ccaption} \usepackage{natbib} \setlength{\textwidth}{6.2in} \setlength{\textheight}{8.5in} \setlength{\parskip}{1.5ex plus0.5ex minus 0.5ex} \setlength{\oddsidemargin}{0.1cm} \setlength{\evensidemargin}{0.1cm} \setlength{\headheight}{0.3cm} \setlength{\arraycolsep}{0.1cm} \renewcommand{\baselinestretch}{1} \begin{document} \title{An introduction to the nuCpos package} \author{Hiroaki Kato\thanks{hkato@med.shimane-u.ac.jp}, \ Takeshi Urano\\ Department of Biochemistry, Shimane University School of Medicine} \maketitle \section{About nuCpos} \Rpackage{nuCpos}, a derivative of \Rpackage{NuPoP}, is an R package for predicting \textbf{\textit{nuc}}leosome \textbf{\textit{pos}}itions. In \Rpackage{nuCpos}, a duration hidden Markov model is trained with a \textbf{\textit{C}}hemical map of nucleosomes either from budding yeast \textit{Saccharomyces cerevisiae} (\cite{stat:Brogaard2012}), fission yeast \textit{Schizosaccharomyces pombe} (\cite{stat:Moyle-Heyrman2013}), or embryonic stem cells of house mouse \textit{Mus musculus} (\cite{stat:Voong2016}). \Rpackage{nuCpos} outputs the Viterbi (most probable) path of nucleosome-linker states, predicted nucleosome occupancy scores and histone binding affinity (HBA) scores as \Rpackage{NuPoP} does. \Rpackage{nuCpos} can also calculate local and whole nucleosomal HBA scores for a given 147-bp sequence. Furthermore, effect of genetic alterations on nucleosome occupancy can be predicted with this package. The parental package \Rpackage{NuPoP}, licensed under GPL-2, was developed by Ji-Ping Wang and Liqun Xi. Please refer to \cite{stat:XiWang2010} and \cite{stat:WangWidom2008} for technical details of \Rpackage{NuPoP}. Note that \Rpackage{NuPoP} uses an MNase-seq-based map of budding yeast nucleosomes to train a duration hidden Markov model. \section{nuCpos functions} \Rpackage{nuCpos} has four functions: \verb@predNuCpos@, \verb@HBA@, \verb@localHBA@ and \verb@mutNuCpos@. The \verb@predNuCpos@ function can serve a chemical counterpart of the \verb@predNuPoP@ function of \Rpackage{NuPoP}: it predicts the nucleosome positioning and nucleosome occupancy. The functions \verb@HBA@ and \verb@localHBA@ receive a sequence of 147-bp DNA and calculate whole nucleosomal and local HBA scores. The \verb@mutNuCpos@ function receives a wild-type DNA sequence and information on a genetic alteration to predict the effect of the mutation on nucleosome positioning. \Rpackage{nuCpos} requires the \Rpackage{Biostrings} package, especially when DNA sequences are given as DNAString objects to the functions \verb@HBA@, \verb@localHBA@ and \verb@mutNuCpos@. These functions can also receive DNA sequences as simple character string objects without loading the \Rpackage{Biostrings} package. Note: \Rpackage{nuCpos} requires the \Rpackage{NuPoP} package to perform some example runs. Load the \Rpackage{nuCpos} package as follows: <<>>= library(nuCpos) @ \section{Performing predictions with predNuCpos} The \verb@predNuCpos@ function acts like the \verb@predNuPoP@ function of \Rpackage{NuPoP}. When the $\Rfunarg{ActLikePredNuPoP}$ argument is set as TRUE, \verb@predNuCpos@ reads a DNA sequence file in FASTA format and invokes a Fortran subroutine to perform predictions. The prediction results will be saved in the working directory. \textit{TRP1ARS1x1.fasta}, the DNA sequence of \textit{TRP1ARS1} circular minichromosome (1,465 bp) (\cite{stat:Fuse2017}), in \Robject{extdata} can be used for an example run. Call the \verb@predNuCpos@ function as follows: <<>>= predNuCpos(file = system.file("extdata", "TRP1ARS1x1.fasta", package = "nuCpos"), species = "sc", smoothHBA = FALSE, ActLikePredNuPoP = TRUE) @ The argument $\Rfunarg{file}$ is the path to the fasta file. The argument $\Rfunarg{species}$ can be specified as follows: mm = \textit{M. musculus}; sc = \textit{S. cerevisiae}; sp = \textit{S. pombe}. Re-scaling of the nucleosome and linker models for the prediction of other species' nucleosomes are not supported. \Rpackage{nuCpos} uses 4th order Markov chain models for the prediction. The name of the output file will be like \textit{TRP1ARS1x1.fasta\_Prediction4.txt}. As in the output file produced by the parental \Rpackage{NuPoP} package, it will contain five columns: \begin{enumerate} \item{\Robject{Position}: position in the input DNA sequence.} \item{\Robject{P-start}: probability that a nucleosome starts at.} \item{\Robject{Occup}: nucleosome occupancy score.} \item{\Robject{N/L}: Viterbi path (1 and 0 for the nucleosome and linker states, respectively).} \item{\Robject{Affinity}: histone binding affinity score.} \end{enumerate} To import the output into R, the \verb@readNuPoP@ function of \Rpackage{NuPoP} can be used: <<>>= library(NuPoP) results.TRP1ARS1x1 <- readNuPoP("TRP1ARS1x1.fasta_Prediction4.txt", startPos = 1, endPos = 1465) results.TRP1ARS1x1[1:5,] @ The arguments $\Rfunarg{startPos}$ and $\Rfunarg{endPos}$ are used to import a part of the prediction results. In this example, the prediction results for the whole tested sequence is imported. First and last 73-bp regions do not have HBA scores (\Robject{Affinity}) as they cannot be calculated. The HBA scores start from the 74th position: <<>>= results.TRP1ARS1x1[72:76,] @ For visualization of the prediction results, the \verb@plotNuPoP@ function of \Rpackage{NuPoP} can be used. This function draws two plots in the graphical window. The top one shows predicted nucleosome occupancy. In the bottom one, probability of a nucleosome to start at the given position (blue vertical lines) and the Viterbi path (red lines) are shown as well as the nucleosome occupancy (gray). <>= plotNuPoP(results.TRP1ARS1x1) @ For prediction of nucleosome positioning in short circular DNA, one can use a triplicated sequence for prediction and read only the central copy for the evaluation. By triplicating the DNA, inaccurate prediction near the DNA ends, which are joined to each other in the circular form, can be avoided. <<>>= predNuCpos(file = system.file("extdata", "TRP1ARS1x3.fasta", package = "nuCpos"), species = "sc", smoothHBA = FALSE, ActLikePredNuPoP = TRUE) results.TRP1ARS1 <- readNuPoP("TRP1ARS1x3.fasta_Prediction4.txt", startPos = 1466, endPos = 2930) @ Here, \Robject{TRP1ARS1x3.fasta} in \Robject{extdata} is a triplicated sequence (4,395 bp) of the \textit{TRP1ARS1} minichromosome (1,465 bp). The central part (from the coordinate 1,466 to 2,930) of the prediction results is read by \verb@readNuPoP@. They are apparently different from the previous results near the terminal regions. <>= par(mfrow = c(2, 1)) plot(x = 1:1465, y = results.TRP1ARS1x1[,3], type = "n", ylim = c(-0.05, 1), xlab = "Position from the EcoRI site #1", ylab = "probability/occupancy") title("Not triplicated") polygon(c(1, 1:1465, 1465), c(0, results.TRP1ARS1x1[,3], 0), col = 8) points(x = 1:1465, y = results.TRP1ARS1x1[,4], type = "l", col = 2) points(x = 1:1465, y = results.TRP1ARS1x1[, 2], type = "h", col = 4) plot(x = 1:1465, y = results.TRP1ARS1[,3], type = "n", ylim = c(-0.05, 1), xlab = "Position from the EcoRI site #1", ylab = "probability/occupancy") title("Triplicated: central copy") polygon(c(1, 1:1465, 1465), c(0, results.TRP1ARS1[,3], 0), col = 8) points(x = 1:1465, y = results.TRP1ARS1[,4], type = "l", col = 2) points(x = 1:1465, y = results.TRP1ARS1[, 2], type = "h", col = 4) @ The main difference between the \verb@predNuCpos@ function of \Rpackage{nuCpos} and the \verb@predNuPoP@ function of \Rpackage{NuPoP} is the nucleosome maps for constructing statistical models. As \Rpackage{NuPoP} is based on an MNase-based-map, it can be affected by MNase's enzymatic preference to cut AT-rich sequence. For comparison, use the \verb@predNuPoP@ function of \Rpackage{NuPoP} and plot the results alongside those obtained with \Rpackage{nuCpos}. Please ignore the line starting with \textit{Prediction output} and running past the right margin because it cannot be suppressed. <>= predNuPoP(file = system.file("extdata", "TRP1ARS1x3.fasta", package = "nuCpos"), species = 7, model = 4) results.NuPoP <- readNuPoP("TRP1ARS1x3.fasta_Prediction4.txt", startPos = 1466, endPos = 2930) par(mfrow = c(2, 1)) plot(x = 1:1465, y = results.TRP1ARS1[,3], type = "n", ylim = c(-0.05, 1), xlab = "Position from the EcoRI site #1", ylab = "probability/occupancy") title("NuCpos: Eight nucleosomes on the Viterbi path") polygon(c(1, 1:1465, 1465), c(0, results.TRP1ARS1[,3], 0), col = 8) points(x = 1:1465, y = results.TRP1ARS1[,4], type = "l", col = 2) points(x = 1:1465, y = results.TRP1ARS1[, 2], type = "h", col = 4) text(x = 800, y = 0.5, labels = "*", cex = 2) plot(x = 1:1465, y = results.NuPoP[,3], type = "n", ylim = c(-0.05, 1), xlab = "Position from the EcoRI site #1", ylab = "probability/occupancy") title("NuPoP: Seven nucleosomes on the Viterbi path") polygon(c(1, 1:1465, 1465), c(0, results.NuPoP[,3], 0), col = 8) points(x = 1:1465, y = results.NuPoP[,4], type = "l", col = 2) points(x = 1:1465, y = results.NuPoP[, 2], type = "h", col = 4) @ As shown above, eight and seven nucleosomes are predicted along the \textit{TRP1ARS1} minichromosome by \Rpackage{nuCpos} and \Rpackage{NuPoP}, respectively. Most of the nucleosomes predicted by \Rpackage{nuCpos} are in similar locations \textit{in vivo} (\cite{stat:Fuse2017}). The predicted nucleosomes \#1-4 appear to correspond to the \textit{in vivo} nucleosomes IV-VII, which locates in the \textit{TRP1} gene. Whereas the predicted nucleosomes \#6-8 appear to correspond to the \textit{in vivo} nucleosomes I-III, which locates between the DNase hypersensitive regions (HSR) A and B. The predicted nucleosome \#5 (labeled with *) is positioned in the \textit{in vivo} nucleosome free region HSR-A, which contains the DNA replication origin \textit{ARS1}, where origin recognition complex may inhibit nucleosome formation \textit{in vivo}. By specifying the argument $\Rfunarg{smoothHBA}$ as TRUE, HBA scores can be smoothed in a 55-bp window as being done by the \verb@predNuPoP@ function of \Rpackage{NuPoP}. <>= predNuCpos(file = system.file("extdata", "TRP1ARS1x3.fasta", package = "nuCpos"), species = "sc", smoothHBA = TRUE, ActLikePredNuPoP = TRUE) results.TRP1ARS1.smooth <- readNuPoP("TRP1ARS1x3.fasta_Prediction4.txt", startPos = 1466, endPos = 2930) par(mfrow = c(2, 1)) plot(x = 1:1465, y = results.TRP1ARS1[,3], type = "n", ylim = c(-0.05, 1), xlab = "Position from the EcoRI site #1", ylab = "probability/occupancy") title("Occupancy(grey)/probability(blue)/Viterbi(red)") polygon(c(1, 1:1465, 1465), c(0, results.TRP1ARS1[,3], 0), col = 8) points(x = 1:1465, y = results.TRP1ARS1[,4], type = "l", col = 2) points(x = 1:1465, y = results.TRP1ARS1[, 2], type = "h", col = 4) plot(x = 1:1465, y = results.TRP1ARS1[,5], type = "n", xlab = "Position from the EcoRI site #1", ylab = "HBA", main = "HBA(red)/smoothed HBA(blue)") points(x = 1:1465, y = results.TRP1ARS1[,5], type = "l", col = 2) points(x = 1:1465, y = results.TRP1ARS1.smooth[,5], type = "l", col = 4) @ As shown as a red line in the bottom one of the above plots, non-smoothed HBA scores in eukaryotic sequences exhibit about 10-bp periodicity. The dyads of predicted nucleosomes usually locate at the coordinates with high HBA scores. HBA scores in the output of \verb@predNuCpos@ can be standardized as being done by the \verb@predNuPoP@ function of \Rpackage{NuPoP} by specifying the argument $\Rfunarg{std}$ of \verb@predNuCpos@ as TRUE. The default setting for $\Rfunarg{std}$ is FALSE. When the argument $\Rfunarg{ActLikePredNuPoP}$ is set as FALSE, which is the default setting, \verb@predNuCpos@ receives a character string or DNAString object as $\Rfunarg{inseq}$. In this case, prediction results will be returned to the R environment, and no file will be generated in the working directory. The input sequence ($\Rfunarg{inseq}$) must not contain characters other than A/C/G/T. The results will contain five columns: \begin{enumerate} \item{\Robject{pos}: position in the input DNA sequence.} \item{\Robject{pstart}: probability that a nucleosome starts at.} \item{\Robject{nucoccup}: nucleosome occupancy score.} \item{\Robject{viterbi}: Viterbi path (1 and 0 for the nucleosome and linker states, respectively).} \item{\Robject{affinity}: histone binding affinity score.} \end{enumerate} <<>>= TRP1ARS1 <- paste(scan(file = system.file("extdata", "TRP1ARS1x1.fasta", package = "nuCpos"), what = character(), skip = 1), sep = "", collapse = "") results.TRP1ARS1.internal <- predNuCpos(inseq = TRP1ARS1, species = "sc", smoothHBA = FALSE, ActLikePredNuPoP = FALSE) results.TRP1ARS1.internal[72:76,] @ \section{Histone binding affinity score calculation with HBA} HBA score can be calculated for a given 147-bp sequence with the \verb@HBA@ function. In the examples bellow, a character string object \Robject{inseq} and a DNAString object \Robject{INSEQ} with the same 147-bp DNA sequences are given to \verb@HBA@. Note: the \Rpackage{Biostrings} package is required for the latter case. <<>>= load(system.file("extdata", "inseq.RData", package = "nuCpos")) HBA(inseq = inseq, species = "sc") for(i in 1:3) cat(substr(inseq, start = (i-1)*60+1, stop = (i-1)*60+60), "\n") load(system.file("extdata", "INSEQ_DNAString.RData", package = "nuCpos")) INSEQ HBA(inseq = INSEQ, species = "sc") @ The argument $\Rfunarg{inseq}$ is the character string object to be given. Alternatively, a DNAString object can be used here. The length of DNA must be 147 bp. The argument $\Rfunarg{species}$ can be specified as follows: mm = \textit{M. musculus}; sc = \textit{S. cerevisiae}; sp = \textit{S. pombe}. \section{Local histone binding affinity score calculation with localHBA} Local HBA scores are defined as HBA scores for 13 overlapping subnucleosomal segments named A to M. They can be calculated for a given 147-bp sequence with the \verb@localHBA@ function. Like \verb@HBA@, this function can receive either a character string object or a DNAString object. The segment G corresponds to the central 21 bp region, in which the dyad axis passes through the 11th base position. This means that the local HBA score for the G segment implies the relationship between DNA and histone proteins at around superhelical locations -0.5 and +0.5. The neighboring F segment, which is 20 bp in length, is for SHLs -1.5 and -0.5. The result of example run shown below suggests that subsequence of \Robject{inseq} around SHL -3.5 and -2.5 is suitable for nucleosome formation. <>= localHBA(inseq = inseq, species = "sc") barplot(localHBA(inseq = inseq, species = "sc"), names.arg = LETTERS[1:13], xlab = "Nucleosomal subsegments", ylab = "local HBA", main = "Local HBA scores for inseq") @ \section{Prediction of nucleosome positioning in wild-type and mutant sequences with mutNuCpos} The function \verb@mutNuCpos@ is designed to examine the effect of genetic alterations on nucleosome positioning. In the example run below, the \textit{TALS} circular minichromosome (1,811 bp) is used as a wild-type sequence. \cite{stat:Ichikawa2014} showed that insertion of telomere repeats into the \textit{TALS} minichromosome inhibits formation of positioned nucleosome \textit{in vivo}. Here, a 178-bp telomere repeat sequence is inserted at the base position 1,464 of \textit{TALS}. <>= TALS <- paste(scan(file = system.file("extdata", "TALS.fasta", package = "nuCpos"), what = character(), skip = 1), sep = "", collapse = "") for(i in 1:23) cat(substr(TALS, start = (i-1)*80+1, stop = (i-1)*80+80), "\n") TTAGGGx29 <- paste(scan(file = system.file("extdata", "TTAGGGx29.fasta", package = "nuCpos"), what = character(), skip = 1), sep = "", collapse = "") for(i in 1:3) cat(substr(TTAGGGx29, start = (i-1)*80+1, stop = (i-1)*80+80), "\n") mut.results <- mutNuCpos(wtseq = TALS, site = 1464, ins = TTAGGGx29, species = "sc", smoothHBA = TRUE, plot.window = 601, ylim.HBA = c(-15, 0), show.occup.window = TRUE, annotation = data.frame(name = "alpha2", color = "purple", left = 1534, right = 1559), full = TRUE) @ The function \verb@mutNuCpos@ can receive either character string objects or DNAString objects as the arguments $\Rfunarg{wtseq}$ and $\Rfunarg{ins}$. The position of insertion is specified by the argument $\Rfunarg{site}$. The argument $\Rfunarg{species}$ is set as in other functions such as \verb@predNuCpos@. When $\Rfunarg{show.occup.window}$ is TRUE, the regions used for calculation of \textit{Difference in occupancy}, which is stored in the results (or printed in the console when \Robject{full = TRUE}), are shaded in the nucleosome occupancy plots. \verb@mutNuCpos@ does not save any data in the working directory. To receive calculation results from the function, use the assignment operator and set the $\Rfunarg{full}$ argument as TRUE to do so. One can obtain the prediction results for a wild-type sequence (or a mutant sequence by giving it as $\Rfunarg{wtseq}$) by setting the arguments $\Rfunarg{ins}$ and $\Rfunarg{del}$ as defaults. The argument $\Rfunarg{annotation}$ is useful for indicating positions of relevant elements. In this example, the alpha-2 operator is shown as purple horizontal lines. The red horizontal lines indicate insertions. This function receives a wild-type DNA sequence longer than 1,000 bp. The sequence can only contain A, C, G and T. The received sequence is mutagenized and pentaplicated before performing predictions to avoid terminal effects. The coordinates (\Robject{pos}) for the results are with respect to the central copy of the pentaplicated sequence. <<>>= mut.results[(((1811+76)*2)-3):(((1811+76)*2)+3),] @ \section{Acknowledgements} We would like to thank Drs. Shimizu, Fuse and Ichikawa for sharing DNA sequences and \textit{in vivo} data, and giving fruitful comments. \bibliographystyle{apalike} %\bibliography{nuCposBib.bib} \begin{thebibliography}{} \bibitem[Wang et~al., 2008]{stat:WangWidom2008} Wang JP, Fondufe-Mittendorf Y, Xi L, Tsai GF, Segal E and Widom J (2008). \newblock Preferentially quantized linker {DNA} lengths in \textit{Saccharomyces cerevisiae}. \newblock {\em PLoS Computational Biology}, 4(9):e1000175. \bibitem[Xi et~al., 2010]{stat:XiWang2010} Xi L, Fondufe-Mittendorf Y, Xia L, Flatow J, Widom J and Wang JP (2010). \newblock Predicting nucleosome positioning using a duration hidden markov model. \newblock {\em BMC Bioinformatics}, 11:346. \bibitem[Brogaard et~al., 2012]{stat:Brogaard2012} Brogaard K, Xi L, and Widom J (2012). \newblock A map of nucleosome positions in yeast at base-pair resolution. \newblock {\em Nature}, 486(7404):496-501. \bibitem[Moyle-Heyrman et~al., 2012]{stat:Moyle-Heyrman2013} Moyle-Heyrman G, Zaichuk T, Xi L, Zhang Q, Uhlenbeck OC, Holmgren R, Widom J and Wang JP (2013). \newblock Chemical map of \textit{Schizosaccharomyces pombe} reveals species-specific features in nucleosome positioning. \newblock {\em Proc. Natl. Acad. Sci. U. S. A.}, 110(50):20158-63. \bibitem[Ichikawa et~al., 2014]{stat:Ichikawa2014} Ichikawa Y, Morohoshi K, Nishimura Y, Kurumizaka H and Shimizu M (2014). \newblock Telomeric repeats act as nucleosome-disfavouring sequences in vivo. \newblock {\em Nucleic Acids Res.}, 42(3):1541-1552. \bibitem[Voong et~al., 2016]{stat:Voong2016} Voong LN, Xi L, Sebeson AC, Xiong B, Wang JP and Wang X (2016). \newblock Insights into Nucleosome Organization in Mouse Embryonic Stem Cells through Chemical Mapping. \newblock {\em Cell}, 167(6):1555-1570. \bibitem[Fuse et~al., 2017]{stat:Fuse2017} Fuse T, Katsumata K, Morohoshi K, Mukai Y, Ichikawa Y, Kurumizaka H, Yanagida A, Urano T, Kato H, and Shimizu M (2017). \newblock Parallel mapping with site-directed hydroxyl radicals and micrococcal nuclease reveals structural features of positioned nucleosomes in vivo. \newblock {\em Plos One}, 12(10):e0186974. \end{thebibliography} \end{document}