--- title: "MACPET" author: - name: "Ioannis Vardaxis" email: iova89@hotmail.com date: "`r doc_date()`" package: "`r BiocStyle::pkg_ver('MACPET')` Rversion >=`r getRversion()`" abstract: | This vignette gives an introduction to the MACPET package which can be used for the analysis of paired-end DNA data like ChIA-PET. Throughout the vignette an introduction of MACPET classes, methods and functions will be given. references: - id: macpet title: MACPET Model-based Analysis for ChIA-PET author: - family: Vardaxis given: Ioannis - family: Drabl\o s given: Finn - family: Rye given: Morten - family: Lindqvist given: Bo Henry container-title: To be published volume: URL: DOI: issue: publisher: page: type: article-journal issued: year: month: output: BiocStyle::pdf_document vignette: > %\VignetteIndexEntry{MACPET} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r style,eval=TRUE,echo=FALSE,results='hide'} BiocStyle::latex ``` \section{Introduction} The \Biocpkg{MACPET} package can be used for general analysis of paired-end (PET) data like ChIA-PET. \Biocpkg{MACPET} currently implements the following five stages: * Stage 0 (Linker filtering): Identifies linkers A and B in the fastq files and classifies the reads as usable (A/A,B/B), chimeric (A/B,B/A) and ambiguous (non/A, non/B, A/non, B/non). * Stage 1 (Mapping to the reference genome): Maps the usable reads separately into the reference genome using \Biocpkg{Rbowtie} package, and keeps only uniquely mapped reads with zero mismatch per read. It then maps the unmapped reads to the reference genome with at most one mismatch and keeps the uniquely mapped reads. Uniquely mapped reads with zero or one mismatch are then merged and paired, their duplicates are marked and a paired-end bam file is created which is used in State 2. * Stage 2 (PET classification): Classifies the PETs as self-ligated (short genomic distance, same chromosome), intra-chromosomal (long genomic distance, same chromosome) by finding the self-ligated cut-off using the elbow method, and inter-chromosomal (different chromosomes). Furthermore, it removes identically mapped PETs for reducing noise created by amplification procedures. Moreover, it can remove black-listed regions based on the genome of the data. Note that loading the data into R might take a while depending on the size of the data. * Stage 3 (Peak-calling): Uses the self-ligated PETs found in Stage 2 and segments the genome into non-overlapping regions. It then uses both reads of each PET and applies 2D mixture models for identifying two-dimensional clusters which represent candidate binding sites using the skewed generalized students-t distributions (SGT). Finally, it uses a local Poisson model for finding significant binding sites. * Stage 4 (Interaction-calling): This stage uses the intra- and inter-chromosomal PETs found in State 2, as well as the significant Peaks found in Stage 3 for calling for significant interactions. NOTE: currently inter-chromosomal PETs are not supported. \Biocpkg{MACPET} identifies binding site locations more accurately than other algorithms which use only one end (like MACS) [@macpet]. The output from Stage 3 in \Biocpkg{MACPET} can be used for interaction analysis using either MANGO or MICC, or the user can run State 4 in \Biocpkg{MACPET} for interaction analysis. Note that in the case of using the output from \Biocpkg{MACPET} in MANGO or MICC for interaction analysis, the user should use the self-ligated cut-off found by \Biocpkg{MACPET}, and not the one found in MANGO or MICC. Both of those algorithms allow the user to specify the self-ligated cut-off. MACPET is mainly written in C++, and it supports the \Biocpkg{BiocParallel} package. Before starting with examples of how to use \Biocpkg{MACPET}, create a test folder to save all the output files of the examples presented in this vignette: ```{r,eval=TRUE,echo=TRUE} #Create a temporary test folder, or anywhere you want: SA_AnalysisDir=file.path(tempdir(),"MACPETtest") dir.create(SA_AnalysisDir)#where you will save the results. ``` Load the package: ```{r} library(MACPET) ``` \section{\Biocpkg{MACPET} Classes} \label{sec:classes} \Biocpkg{MACPET} provides five different classes which all inherit from the \Rclass{GInteractions} class in the \Biocpkg{InteractionSet} package. Therefore, every method associated with the \Rclass{GInteractions} class is also applicable to the \Biocpkg{MACPET} classes. Every \Biocpkg{MACPET} class contains information of the PETs associated with the corresponding class, their start/end coordinates on the genome as well as which chromosome they belong to. This section provides an overview of the \Biocpkg{MACPET} classes, while methods associated with each class are presented in latter sections. The classes provided by \Biocpkg{MACPET} are the following: * \textcolor{LimeGreen}{\Rclass{PSelf}} class contains information about the self-ligated PETs in the data. This class is created using either the \Rfunction{MACPETUlt} function at stage 2 or the \Rfunction{ConvertToPSelf} function. * \textcolor{LimeGreen}{\Rclass{PSFit}} class is an update of the \textcolor{LimeGreen}{\Rclass{PSelf}} class, which contains information about which binding site each PET belongs to, as well as significant peaks found by the peak-calling algorithm. This class is created using the \Rfunction{MACPETUlt} function at stage 3. * \textcolor{LimeGreen}{\Rclass{PInter}} class contains information about Inter-chromosomal PETs in the data. This class is created using the \Rfunction{MACPETUlt} function at stage 2. * \textcolor{LimeGreen}{\Rclass{PIntra}} class contains information about Intra-chromosomal PETs in the data. This class is created using the \Rfunction{MACPETUlt} function at stage 2. * \textcolor{LimeGreen}{\Rclass{GenomeMap}} class contains information about the interactions in the genome. This class is created using the \Rfunction{MACPETUlt} function at stage 4. Then the user can use the \Rfunction{GetSignInteractions} function for subseting the significant interactions from the object and return a \Rclass{GInteractions} class object. \subsection{\textcolor{LimeGreen}{\Rclass{PSelf}} Class} The \textcolor{LimeGreen}{\Rclass{PSelf}} class contains pair-end tag information of self-ligated PETs which is used for binding site analysis. ```{r} load(system.file("extdata", "MACPET_pselfData.rda", package = "MACPET")) class(MACPET_pselfData) #example name MACPET_pselfData #print method ``` Extra information of this class is stored as list in the metadata entries with the following elements: * Self_info: a two-column data.frame with information about the chromosomes in the data (chrom) and the total PET counts of each chromosome (PET.counts). * SLmean: which is the mean size of the self-ligated PETs. * MaxSize: The maximum self-ligated PET size in the data. * MinSize: The minimum self-ligated PET size in the data. ```{r} metadata(MACPET_pselfData) ``` One can also access information about chromosome lengths etc. ```{r} seqinfo(MACPET_pselfData) ``` \subsection{\textcolor{LimeGreen}{\Rclass{PSFit}} Class} The \textcolor{LimeGreen}{\Rclass{PSFit}} class adds information to the \textcolor{LimeGreen}{\Rclass{PSelf}} class about the peak each PET belongs to, as well as the total number of peaks in each chromosome in the data, p-values and FDR for each peak. ```{r} load(system.file("extdata", "MACPET_psfitData.rda", package = "MACPET")) class(MACPET_psfitData) #example name MACPET_psfitData #print method ``` This class updates the Self_info data frame of the \textcolor{LimeGreen}{\Rclass{PSelf}} class with two extra columns: the total regions each chromosome is segmented into (Region.counts) and the total candidate peaks of each chromosome (Peak.counts). Moreover, this class contains a metadata entry which is a matrix containing region and peak IDs for each PET in the data (Classification.Info). Finally, it also contains a metadata entry with information about each peak found (Peaks.Info). Peaks.Info is a data.frame with the following entries: * Chrom: The name of the chromosome * Pets: Total PETs in the peak. * Peak.Summit: Summit of the peak. * Up.Summit: Summit of the left-stream PETs. * Down.Summit: Summit of the right-stream PETs. * CIQ.Up.start: Start of 95 Quantile confidence interval for the left-stream PETs. * CIQ.Up.end: End of 95 Quantile confidence interval for the left-stream PETs. * CIQ.Up.size: Size of 95 Quantile confidence interval for the left-stream PETs. * CIQ.Down.start: Start of 95 Quantile confidence interval for the right-stream PETs. * CIQ.Down.end: End of 95 Quantile confidence interval for the right-stream PETs. * CIQ.Down.size: Size of 95 Quantile confidence interval for the right-stream PETs. * CIQ.Peak.size: Size of the Peak based on the interval (CIQ.Up.start,CIQ.Down.end). * sdx: The standard deviation of the upstream PETs. * lambdax: The skewness of the upstream PETs. * sdy: The standard deviation of the downstream PETs. * lambday: The skewness of the downstream PETs. * lambdaUp: The expected number of PETs in the left-stream Peak region by random chance. * FoldEnrichUp: Fold enrichment for the left-stream Peak region. * p.valueUp: p-value for the left-stream Peak region. * lambdaDown: The expected number of PETs in the right-stream Peak region by random chance. * FoldEnrichDown: Fold enrichment for the right-stream Peak region. * p.valueDown: p-value for the right-stream Peak region. * p.value: p-value for the Peak (p.valueUp*p.valueDown). * FDRUp: FDR correction for the left-stream Peak region. * FDRDown: FDR correction for the right-stream Peak region. * FDR: FDR correction for the Peak. ```{r} head(metadata(MACPET_psfitData)$Peaks.Info) ``` One can also access information about chromosome lengths etc, using \Rfunction{seqinfo(MACPET\_psfitData)}. \subsection{\textcolor{LimeGreen}{\Rclass{PInter}} Class} The \textcolor{LimeGreen}{\Rclass{PInter}} class contains pair-end tag information of Inter-chromosomal PETs: ```{r} load(system.file("extdata", "MACPET_pinterData.rda", package = "MACPET")) class(MACPET_pinterData) #example name MACPET_pinterData #print method ``` One can also access information about chromosome lengths etc, using \Rfunction{seqinfo(MACPET\_pinterData)}. It also contains a two-element metadata list with the following elements: * InteractionCounts: a table with the total number of Inter-chromosomal PETs between chromosomes. Where the rows represent the "from" anchor and the columns the "to" anchor. ```{r,eval=TRUE} metadata(MACPET_pinterData) ``` \subsection{\textcolor{LimeGreen}{\Rclass{PIntra}} Class} The \textcolor{LimeGreen}{\Rclass{PIntra}} class contains pair-end tag information of Intra-chromosomal PETs. ```{r} load(system.file("extdata", "MACPET_pintraData.rda", package = "MACPET")) class(MACPET_pintraData)#example name MACPET_pintraData#print method ``` One can also access information about chromosome lengths etc, using \Rfunction{seqinfo(MACPET\_pintraData)}. It also contains a two-element metadata list with the following elements: * InteractionCounts: a data.frame with the total number of Intra-chromosomal PETs for each chromosome (Counts). ```{r} metadata(MACPET_pintraData) ``` \subsection{\textcolor{LimeGreen}{\Rclass{GenomeMap}} Class} The \textcolor{LimeGreen}{\Rclass{GenomeMap}} class contains all potential interactions between pairs of peaks, as well as the peaks' anchors. ```{r} load(system.file("extdata", "MACPET_GenomeMapData.rda", package = "MACPET")) class(MACPET_GenomeMapData) #example name MACPET_GenomeMapData #print method ``` Extra information of this class is stored as list in the metadata entries with the following elements: * pvalue: The p-value of the interaction. * FDR: The FDR of the interaction. * Order: The order the interaction was entered into the model. * TotalInterPETs: The total interaction PETs between every two interacting peaks. Each row in the metadata entry corresponds to the same row in the main object. ```{r} metadata(MACPET_GenomeMapData) ``` \section{\Biocpkg{MACPET} Methods} This section describes methods associated with the classes in the \Biocpkg{MACPET} package. \subsection{summary-method} All \Biocpkg{MACPET} classes are associated with a summary method which sums up the information stored in each class: \subsubsection{\textcolor{LimeGreen}{\Rclass{PSelf}} Class} \Rfunction{summary} for \textcolor{LimeGreen}{\Rclass{PSelf}} class prints information about the total number of self-ligated PETs for each chromosome, as well as the total number of self-ligated PETs in the data, their min/max length and genome information of the data: ```{r} class(MACPET_pselfData) summary(MACPET_pselfData) ``` \subsubsection{\textcolor{LimeGreen}{\Rclass{PSFit}} Class} \Rfunction{summary} for \textcolor{LimeGreen}{\Rclass{PSFit}} class adds information to the \Rfunction{summary} of \textcolor{LimeGreen}{\Rclass{PSelf}} class. The new information is the total regions found and analysed for each chromosome and the total number of candidate binding sites found on each chromosome: ```{r} class(MACPET_psfitData) summary(MACPET_psfitData) ``` \subsubsection{\textcolor{LimeGreen}{\Rclass{PIntra}} Class} \Rfunction{summary} for \textcolor{LimeGreen}{\Rclass{PIntra}} class prints information about the total number of intra-ligated PETs for each chromosome, as well as information about the genome. The user can choose to plot a heat-map for the total number of intra-ligated PETs on each chromosome: ```{r} class(MACPET_pintraData) requireNamespace("ggplot2") requireNamespace("reshape2") summary(MACPET_pintraData,heatmap=TRUE) ``` \subsubsection{\textcolor{LimeGreen}{\Rclass{PInter}} Class} \Rfunction{summary} for \textcolor{LimeGreen}{\Rclass{PInter}} class prints information about the total number of inter-ligated PETs for each chromosome, as well as information about the genome. The user can choose to plot a heat-map for the total number of inter-ligated PETs connecting the chromosomes: ```{r} class(MACPET_pinterData) requireNamespace("ggplot2") requireNamespace("reshape2") summary(MACPET_pinterData,heatmap=TRUE) ``` \subsubsection{\textcolor{LimeGreen}{\Rclass{GenomeMap}} Class} \Rfunction{summary} for \textcolor{LimeGreen}{\Rclass{GenomeMap}} class prints information about the total number of interactions in the data. The user can provide a threshold for the FDR cut-off of the significant interactions to make the summary from. Alternatively if threshold=NULL all the interactions will be used for the summary. ```{r} class(MACPET_GenomeMapData) summary(MACPET_GenomeMapData) ``` \subsection{plot-method} All \Biocpkg{MACPET} classes are associated with a plot method which can be used to visualize counts, PETs in a region, as well as binding sites. Here we give some examples for the usage of the plot methods, however more arguments can be provided to the plot methods, see \Rpackage{MACPET::plot}. \subsubsection{\textcolor{LimeGreen}{\Rclass{PSelf}} Class} \Rfunction{plot} for \textcolor{LimeGreen}{\Rclass{PSelf}} Class will create a bar-plot showing the total number of self-ligated PETs on each chromosome. The x-axis are the chromosomes and the y-axis are the corresponding frequencies. ```{r} requireNamespace("ggplot2") class(MACPET_pselfData) # PET counts plot plot(MACPET_pselfData) ``` \subsubsection{\textcolor{LimeGreen}{\Rclass{PSFit}} Class} \Rfunction{plot} for \textcolor{LimeGreen}{\Rclass{PSFit}} Class will create a bar-plot (if kind="PeakCounts") showing the total number of candidate binding sites found on each chromosome. The x-axis are the chromosomes and the y-axis are the corresponding frequencies. ```{r} class(MACPET_psfitData) #binding site couts: plot(MACPET_psfitData,kind="PeakCounts") ``` Other kind of plots are also supported for this class. For example if kind="PeakPETs", then a visual representation of a region will be plotted (RegIndex chooses which region to plot with 1 meaning the one with the highest total of PETs in it). The x-axis are the genomic coordinates of the region and the y-axis if the sizes of the PETs. Each segment represents a PET from its start to its end coordinate. Different colors of colors represent which binding site each PET belongs to, with red (PeakID=0) representing the noise cluster. Vertical lines represent the exact binding location of the binding site. ```{r} # region example with binding sites: plot(MACPET_psfitData,kind="PeakPETs",RegIndex=1) ``` \subsubsection{\textcolor{LimeGreen}{\Rclass{PIntra}} Class} \Rfunction{plot} for \textcolor{LimeGreen}{\Rclass{PIntra}} Class will create a bar-plot showing the total number of intra-ligated PETs on each chromosome. The x-axis are the chromosomes and the y-axis are the corresponding frequencies. ```{r} class(MACPET_pintraData) #plot counts: plot(MACPET_pintraData) ``` \subsubsection{\textcolor{LimeGreen}{\Rclass{PInter}} Class} \Rfunction{plot} for \textcolor{LimeGreen}{\Rclass{PInter}} Class. Each node represents a chromosome where the size of the node is proportional to the total number of Inter-chromosomal PETs leaving from this chromosome. Edges connect interacting chromosomes where the thickness of each edge is proportional to the total number of Inter-chromosomal PETs connecting the two chromosomes. ```{r} class(MACPET_pinterData) requireNamespace("igraph") #network plot: plot(MACPET_pinterData) ``` \subsubsection{\textcolor{LimeGreen}{\Rclass{GenomeMap}} Class} \Rfunction{plot} for \textcolor{LimeGreen}{\Rclass{GenomeMap}} Class. Different kind of plot can be created using the Type parameter. The user can also specify a threshold for the significant interactions to make the plots from. In the following example, each node represents a chromosome and the edges show which chromosomes have significant interactions between them. ```{r} class(MACPET_GenomeMapData) requireNamespace("igraph") #network plot: plot(MACPET_GenomeMapData,Type='network-circle') ``` \subsection{exportPeaks methods} \textcolor{LimeGreen}{\Rclass{PSFit}} class has a method which exports the binding site information stored in \Rfunction{metadata(object)[['Peaks.Info']]} into csv files in a given directory if one wishes to have the binding sites in an excel file. The user can also specify a threshold for the FDR. If no threshold is specified all the binding sites found by the algorithm are exported. ```{r,eval=TRUE,echo=TRUE} class(MACPET_psfitData)#PSFit class exportPeaks(object=MACPET_psfitData,file.out="Peaks",threshold=1e-5,savedir=SA_AnalysisDir) ``` \subsection{PeaksToGRanges methods} \textcolor{LimeGreen}{\Rclass{PSFit}} class has also a method which converts the binding sites found by the peak-calling algorithm into a \Rclass{GRanges} object with start and end coordinates the binding site's confidence interval (CIQ.Up.start,CIQ.Down.end). It furthermore contains information about the total number of PETs in the peak (TotPETs), the p-value of the peak (p.value) and its FDR (FDR). The user can also specify an FDR threshold for returning significant peaks. If threshold=NULL, all the found peaks are returned. ```{r,eval=TRUE,echo=TRUE} class(MACPET_psfitData)#PSFit class object=PeaksToGRanges(object=MACPET_psfitData,threshold=1e-5) object ``` \subsection{TagsToGInteractions methods} \textcolor{LimeGreen}{\Rclass{PSFit}} class has also a method which returns only PETs belonging to peaks (removing noisy or insignificant PETs) as a \Rclass{GInteractions} object. This might be useful if one wishes to visualize the tags belonging to PETs of binding sites on the genome-browser. The user can also specify an FDR threshold for returning significant peaks. If threshold=NULL, all the found peaks are returned. ```{r,eval=TRUE,echo=TRUE} class(MACPET_psfitData)#PSFit class TagsToGInteractions(object=MACPET_psfitData,threshold=1e-5) ``` \subsection{PeaksToNarrowPeak methods} \textcolor{LimeGreen}{\Rclass{PSFit}} class has a method which converts peaks of an object of \textcolor{LimeGreen}{\Rclass{PSFit}} class to narrowPeak object. The object is saved in a user specified directory and can be used in the MANGO or MICC algorithms for interaction analysis. Alternatively, the user can use stage 4 in \Rfunction{MACPETUlt} for running the interaction analysis stage. ```{r,eval=TRUE,echo=TRUE} class(MACPET_psfitData)#PSFit class PeaksToNarrowPeak(object=MACPET_psfitData,threshold=1e-5, file.out="MACPET_peaks.narrowPeak",savedir=SA_AnalysisDir) ``` \subsection{ConvertToPSelf methods} This method if for the \textcolor{LimeGreen}{\Rclass{GInteractions}} class. It converts a \Rclass{GInteractions} object to \textcolor{LimeGreen}{\Rclass{PSelf}} object. This method could be used in case the user already has the self-ligated PETs separated from the rest of the data and wishes to run a binding site analysis on those only using stage 3 in \Rfunction{MACPETUlt}. The output object will be saved in the user-specified directory. ```{r,eval=TRUE,echo=TRUE} #--remove information and convert to GInteractions: object=MACPET_pselfData #--remove information and convert to GInteractions: S4Vectors::metadata(object)=list(NULL) class(object)='GInteractions' #----input parameters S2_BlackList=TRUE SA_prefix="MACPET" S2_AnalysisDir=SA_AnalysisDir ConvertToPSelf(object=object, S2_BlackList=S2_BlackList, SA_prefix=SA_prefix, S2_AnalysisDir=S2_AnalysisDir) #load object: rm(MACPET_pselfData)#old object load(file.path(S2_AnalysisDir,"MACPET_pselfData")) class(MACPET_pselfData) ``` \subsection{GetSignInteractions methods} \textcolor{LimeGreen}{\Rclass{GenomeMap}} class has a method which subsets the significant interactions given a user-specified FDR threshold and returns either a \textcolor{LimeGreen}{\Rclass{GInteractions}} class object for the interactions (each row corresponds to one interaction), or it saves the significant interactions into an excel file in a user specified directory. Metadata columns are also provided which given further information about each interaction. ```{r,eval=TRUE,echo=TRUE} class(MACPET_GenomeMapData)#GenomeMap class GetSignInteractions(object=MACPET_GenomeMapData, threshold = NULL, ReturnedAs='GInteractions') ``` \subsection{GetShortestPath methods} \textcolor{LimeGreen}{\Rclass{GenomeMap}} class has a method which finds the length of the shortest path between two user-specified peaks. Currently it only finds the shortest paths between intra-chromosomal peaks. Therefore, the peaks have to be on the same chromosome. The resulting value is a two-element list with the first element named LinearPathLength for the linear length of the path between summits of the two peaks, and the second element named ThreeDPathLength for the 3D length of the shortest path between summits of the two peaks. ```{r,eval=TRUE,echo=TRUE} class(MACPET_GenomeMapData)#GenomeMap class GetShortestPath(object=MACPET_GenomeMapData, threshold = NULL, ChrFrom="chr1", ChrTo="chr1", SummitFrom=10000, SummitTo=1000000) ``` \section{\Biocpkg{MACPET} Supplementary functions} \subsection{AnalysisStatistics function} \Rfunction{AnalysisStatistics} function can be used for all the classes of the \Biocpkg{MACPET} package for printing and/or saving statistics of the classes in csv file in a given working directory. Input for Self-ligated PETs of one of the classes (\textcolor{LimeGreen}{\Rclass{PSelf}}, \textcolor{LimeGreen}{\Rclass{PSFit}}) is mandatory, while input for the Intra- and Inter-chromosomal PETs is not. If the input for the Self-ligated PETs is of \textcolor{LimeGreen}{\Rclass{PSFit}} class, a threshold can be given for the FDR cut-off. Here is an example: ```{r,echo=TRUE,eval=TRUE} AnalysisStatistics(x.self=MACPET_psfitData, x.intra=MACPET_pintraData, x.inter=MACPET_pinterData, file.out='AnalysisStats', savedir=SA_AnalysisDir, threshold=1e-5) ``` \subsection{ConvertToPE\_BAM function} \Rfunction{ConvertToPE\_BAM} in case the user has two separate BAM files from read 1 and 2 of the paired data, and needs to pair them in one paired-end BAM file for further analysis in stage 2-3 on the \Rfunction{MACPETUlt} function. The output paired-end BAM file will be saved in the user-specified directory. Here is an example: ```{r,echo=TRUE,eval=TRUE} requireNamespace('ggplot2') #Create a temporary forder, or anywhere you want: S1_AnalysisDir=SA_AnalysisDir #directories of the BAM files: BAM_file_1=system.file('extdata', 'SampleChIAPETDataRead_1.bam', package = 'MACPET') BAM_file_2=system.file('extdata', 'SampleChIAPETDataRead_2.bam', package = 'MACPET') SA_prefix="MACPET" #convert to paired-end BAM: ConvertToPE_BAM(S1_AnalysisDir=S1_AnalysisDir, SA_prefix=SA_prefix, S1_BAMStream=2000000,S1_image=TRUE, S1_genome="hg19",BAM_file_1=BAM_file_1, BAM_file_2=BAM_file_2) #test if the resulted BAM is paired-end: PairedBAM=file.path(S1_AnalysisDir,paste(SA_prefix,"_Paired_end.bam",sep="")) Rsamtools::testPairedEndBam(file = PairedBAM, index = PairedBAM) bamfile = Rsamtools::BamFile(file = PairedBAM,asMates = TRUE) GenomicAlignments::readGAlignmentPairs(file = bamfile,use.names = FALSE, with.which_label = FALSE, strandMode = 1) ``` \section{Peak Calling Workflow} The main function which the user can use for running a paired-end data analysis is called \Rfunction{MACPETUlt}. It consists of the five stages described in the introduction section. The user may run the whole pipeline/analysis at once using Stages=c(0:4) or step by step using a single stage at a time. The function supports the \Biocpkg{BiocParallel} package. For the following example we run stages 2 and 4 of the \Rfunction{MACPETUlt} only. The reason is that for running state 1, the bowtie index is needed which is too big for downloading it here. ```{r,echo=TRUE,eval=TRUE} #give directory of the BAM file: S2_PairedEndBAMpath=system.file('extdata', 'SampleChIAPETData.bam', package = 'MACPET') #give prefix name: SA_prefix="MACPET" #parallel backhead can be created using the BiocParallel package #parallel backhead can be created using the BiocParallel package #requireNamespace('BiocParallel') #snow <- BiocParallel::SnowParam(workers = 4, type = 'SOCK', progressbar=FALSE) #BiocParallel::register(snow, default=TRUE) # packages for plotting: requireNamespace('ggplot2') #-run for the whole binding site analysis: MACPETUlt(SA_AnalysisDir=SA_AnalysisDir, SA_stages=c(2:4), SA_prefix=SA_prefix, S2_PairedEndBAMpath=S2_PairedEndBAMpath, S2_image=TRUE, S2_BlackList=TRUE, S3_image=TRUE, S4_image=TRUE, S4_FDR_peak=1)# the data is small so use all the peaks found. #load results: SelfObject=paste(SA_prefix,"_pselfData",sep="") load(file.path(SA_AnalysisDir,"S2_results",SelfObject)) SelfObject=get(SelfObject) class(SelfObject) # see methods for this class IntraObject=paste(SA_prefix,"_pintraData",sep="") load(file.path(SA_AnalysisDir,"S2_results",IntraObject)) IntraObject=get(IntraObject) class(IntraObject) # see methods for this class InterObject=paste(SA_prefix,"_pinterData",sep="") load(file.path(SA_AnalysisDir,"S2_results",InterObject)) InterObject=get(InterObject) class(InterObject) # see methods for this class SelfFitObject=paste(SA_prefix,"_psfitData",sep="") load(file.path(SA_AnalysisDir,"S3_results",SelfFitObject)) SelfFitObject=get(SelfFitObject) class(SelfFitObject) # see methods for this class GenomeMapObject=paste(SA_prefix,"_GenomeMapData",sep="") load(file.path(SA_AnalysisDir,"S4_results",GenomeMapObject)) GenomeMapObject=get(GenomeMapObject) class(GenomeMapObject) # see methods for this class #-----delete test directory: unlink(SA_AnalysisDir,recursive=TRUE) ``` \Rfunction{MACPETUlt} saves its outputs in SA\_AnalysisDir in the folders S0\_results, S1\_results, S2\_results, S3\_results and S4\_results based on the stages run. The output of \Rfunction{MACPETUlt} in those folders is the following: Stage 0: (output saved in a folder named S0\_results in SA\_AnalysisDir) * SA\_prefix\_usable\_1.fastq.gz: fastq.gz files with the usable 5-end tags. To be used in Stage 1. * SA\_prefix\_usable\_2.fastq.gz: fastq.gz files with the usable 3-end tags. To be used in Stage 1. * SA\_prefix\_chimeric\_1.fastq.gz: fastq.gz files with the chimeric 5-end tags. * SA\_prefix\_chimeric\_2.fastq.gz: fastq.gz files with the chimeric 3-end tags. * SA\_prefix\_ambiguous\_1.fastq.gz: fastq.gz files with the ambiguous 5-end tags. * SA\_prefix\_ambiguous\_2.fastq.gz: fastq.gz files with the ambiguous 3-end tags. * SA\_prefix\_stage\_0\_image.jpg: Pie chart image with the split of two fastq files used as input (if S0\_image==TRUE) Stage 1: (output saved in a folder named S1\_results in SA\_AnalysisDir) * SA\_prefix\_usable\_1.sam: sam file with the mapped 5-end reads (if S1\_rmSam==FALSE). * SA\_prefix\_usable\_2.sam: sam file with the mapped 3-end reads (if S1\_rmSam==FALSE). * SA\_prefix\_Paired\_end.bam: paired-end bam file with the mapped PETs. To be used in Stage 2 * SA\_prefix\_Paired\_end.bam.bai: .bai paired-end bam file with the mapped PETs. To be used in Stage 2 * SA\_prefix\_stage\_1\_p1\_image.jpg: Pie-chart for the mapped/unmapped reads from SA\_prefix\_usable\_1.sam and SA\_prefix\_usable\_2.sam (if S1\_image==TRUE). * SA\_prefix\_stage\_1\_p2\_image.jpg: Pie-chart for the paired/unpaired reads of SA\_prefix\_Paired\_end.bam (if S1\_image==TRUE). Stage 2: (output saved in a folder named S2\_results in SA\_AnalysisDir) * SA\_prefix\_pselfData: An object of \textcolor{LimeGreen}{\Rclass{PSelf}} class, containing the self-ligated PETs. To be used in Stage 3. * SA\_prefix\_pintraData: An object of \textcolor{LimeGreen}{\Rclass{PIntra}} class, containing the intra-chromosomal PETs. * SA\_prefix\_pinterData: An object of \textcolor{LimeGreen}{\Rclass{PInter}} class, containing the inter-chromosomal PETs. * SA\_prefix\_stage\_2\_p1\_image.jpg: Pie-chart reliable/duplicated/black-listed PETs of SA\_prefix\_Paired\_end.bam (if S2\_image==TRUE). * SA\_prefix\_stage\_2\_p2\_image.jpg: Histogram with the self-ligated/intra-chromosomal cut-off of SA\_prefix\_Paired\_end.bam (if S2\_image==TRUE). * SA\_prefix\_stage\_2\_p3\_image.jpg: Pie-chart for the self-ligated/intra-chromosomal/inter-chromosomal PETs of SA\_prefix\_Paired\_end.bam (if S2\_image==TRUE). Stage 3: (output saved in a folder named S3\_results in SA\_AnalysisDir) * SA\_prefix\_psfitData: An object of \textcolor{LimeGreen}{\Rclass{PSFit}} class. This object contains peaks found by the peak-calling algorithm along with their p-values and FDR. * SA\_prefix\_stage\_3\_p1\_image.jpg: Sizes of the upstream vs downstream peaks of each binding site given the binding site's FDR (if S3\_image==TRUE). * SA\_prefix\_stage\_3\_p2\_image.jpg: FDR of the binding sites. The horizontal red line is at FDR=0.05 (if S3\_image==TRUE). * SA\_prefix\_stage\_3\_p3\_image.jpg: Comparison of binding site sizes given their FDR (if S3\_image==TRUE). * SA\_prefix\_stage\_3\_p4\_image.jpg: FDR for the upstream/donwstream peaks of the binding sites given the binding sites FDR (if S3\_image==TRUE). Stage 4: (output saved in a folder named S4\_results in SA\_AnalysisDir) * SA\_prefix\_GenomeMapData: An object of \textcolor{LimeGreen}{\Rclass{GenomeMap}} class. This object contains all the interactions found in the data. * SA\_prefix\_stage\_4\_p1\_image.jpg: Pie charts for the total number of peaks used in the interaction analysis as well as the total number of interaction PETs used (if S4\_image==TRUE). Stages 0:4: * All the above outputs in separate folders. Furthermore a log-file named SA\_prefix\_analysis.log with the progress of the analysis is also saved in the SA\_AnalysisDir.