--- title: "BioTIP- an R package for characterization of Biological Tipping-Points" author: "Zhezhen Wang, Biniam Feleke, Qier An, Antonio Feliciano and Xinan H Yang" date: '`r format(Sys.Date(), "%m/%d/%Y" )`' output: html_document: fig_caption: yes keep_md: yes toc: yes pdf_document: default word_document: default vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{BioTIP- an R package for characterization of Biological Tipping-Point} %\VignetteEncoding{UTF-8} abstract: "Heterogeneous cells within a stable state are controlled by gene regulatory networks. Cell-state state transitions are not just gradually and linearly changing phenomena, but rather can be described as nonlinear critical transitions or tipping points (Moris 2016). This abrupt transition between two stable states has been well described using a tipping-point model (Scheffer 2012). When a complex system arrives at a tipping point, there exists fundamental instability in the system. Although this brings a potential risks of unwanted collapse, it also presents an opportunity for positive change through reversal of trajectory. The idea of tipping-point prediction has been previously pursued and implimented, however existing methods each have their limitations. The 'early-warning' package in R (http://www.early-warning-signals.org/) is designed for longitudinal data analysis and is applicable to ecosystems, ecology, space, and other fields. However, we found it hard to apply to high-throughput 'omics' data. Another published method for biological tipping-point characterization is Dynamic Network Biomarker (DNB) (Chen 2012). This method relies on indexing gene-oscillation per state, and thus is not always feasible for cross-state comparisons and indication of tipping points. Another published method, Index of critical transition (Ic-score), predicts tipping point is limited by its reliance on pre-defined features (Mojtahedi 2016). The purpose of this R package BioTIP is to characterize Biological Tipping-Points from a high-throughput data matrix where rows represent molecular features such as transcripts and columns represent samples or cells in a data matrix. BioTIP implicates one new method and the two aforementioned published methods(DNB and Ic-score) to allow for flexible analysis of not only longitudinal, but also cross-sectional datasets. Besides overcoming the limitations of previous methods, BioTIP provides functions to optimize feature pre-selection and evaluate empirical significance. Additionally, BioTIP defines transcript biotypes according the GENCODE annotation, allowing for study of noncoding transcription (Wang 2018). These improvements allow for an application to cross-sectional datasets. The BioTIP scheme acts as a hybrid model to join the advantages of both DNB and Ic-scoring, and providing enhanced features for high-throughput 'omics' data analysis." --- #### #### ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r, echo=FALSE, fig.cap="Fig 1. BioTIP workflow with five key analytic steps. RTF: relative transcript fluctuation; PCC: Pearson correlation coefficient; MCI: Module-Criticality Index; Ic: index of critical transition.", fig.align='center', out.width = '60%'} knitr::include_graphics("Fig1.pdf") ``` #### [Standard workflow](#Standard workflow) * [An Identification of Critical Tipping Point](#An Identification of Critical Tipping Point) * ##### [Data Preprocessing](#Data Preprocessing) * ##### [Pre-selection Transcript](#Pre-selection Transcript) * ##### [Network Partition](#Network Partition) * ##### [Identifying Dynamic Network Biomodule](#Identifying Dynamic Network Biomodule) * ##### [Finding Tipping Point](#Finding Tipping Point) * [Transcript Annotation and Biotype](#Transcript Annotation and Biotype) * ##### [Quick Start](#Quick Start) * ##### [Genomic Data Source](#Genomic Data Source) * ##### [Extracting Summary Data](#Extracting Summary Data) * ##### [Loading Data](#Loading Data) * ##### [Prepare GRanges Object](#Prepare GRanges Object) * [Acknowledgements](#Acknowledgements) * [SessionInfo](#SessionInfo) * [References](#References) ###################################################################### ### Standard workflow: An Identification of Critical Tipping Point ### ###################################################################### __Data Preprocessing__ An existing dataset, GSE6136, is used to demonstrate how our functions are applied. Samples were collected from transgenic mouse lymphomas and divided into five groups based on clinical presentation, pathology and flow cytometry (Lenburg 2007), thus belonging to cross-sectional profiles. Noticing these five group represent a control stage similar to non-transgenic B cells and four major periods of B-cell lymphomagenesis, Dr. Chen and coauthors used the DNB method to identify the pre-disease state exists around the normal activated period (P2), i.e., the system transitions to the disease state around the transitional lymphoma period (Figure S12 in publication (Chen 2012)). Start by installing the package 'BioTIP' and other dependent packages such as stringr, psych, and igraph if necessary. Below are some examples. ```{r} # load package library(BioTIP) ``` Once all the required packages are installed, load using `“read.table()”` function as follows. Note to change the read.table(“PATH”) when running your datasets. To check the dimension of your data set use “dim(dataset)” function. Here we show a use of dim function “dim(GSE6136)”. Notice that after editing the column and row, the dimension of our dataset changes from (22690,27) to (22690, 26) because we removed the downloaded first row after assigning it to be column-name of the final numeric data matrix. ```{r} data(GSE6136_matrix) dim(GSE6136_matrix) #[1] 22690rows and 27 columns row.names(GSE6136_matrix) = GSE6136_matrix$ID_REF GSE6136 = GSE6136_matrix[,-1] dim(GSE6136) #[1] 22690 rows and 26 columns ``` The summary of GSE6136_matrix is GSE6136_cli shown below. These two kind of data files can be downloaded from GSE database ```{r} #requires library(stringr) library(BioTIP) data(GSE6136_cli) #dim(GSE6136_cli) #check dimension cli = t(GSE6136_cli) library(stringr) colnames(cli) = str_split_fixed(cli[1,],'_',2)[,2] cli = cli[-1,] cli = data.frame(cli) cli[,"cell-type:ch1"] = str_split_fixed(cli$characteristics_ch1.1,": ",2)[,2] cli[,"Ig clonality:ch1"] = str_split_fixed(cli$characteristics_ch1.3,": ",2)[,2] colnames(cli)[colnames(cli) == "cell-type:ch1"] = "group" cli$Row.names = cli[,1] head(cli[,1:3]) ``` We normalized the expression of genes using log2() scale. This normalization will ensure a more accurate comparison of the variance between the expression groups (clusters). ```{r} dat <- GSE6136 df <- log2(dat+1) head(df) ``` [Go to Top](#) __Pre-selection Transcript__ Once normalized, we can now classify different stages. The tipping point is within the "activated" state in this case. Here we see the number of samples that are classified into states "activated", "lymphoma_aggressive", "lymphoma_marginal", "lymphoma_transitional" and "resting". For instance, states "activated" and "resting" contain three and four samples, respectively. All the contents of the data set "test" can be viewed using `View(test)`. Each clinical state's content can be viewed using `head(test["stage_name"])`. For instance, head(test["activated"]) shows contents of the activated state. ```{r, warning=FALSE} tmp <- names(table(cli$group)) samplesL <- split(cli[,1],f = cli$group) #head(samplesL) test <- sd_selection(df, samplesL,0.01) head(test[["activated"]]) ``` __Network Partition__ A graphical represetation of genes of interest can be achieved using the functions shown below. The `getNetwork` function will obtain an igraph object based on a pearson correlation of `test`. This `igraphL` object is then run using the `getCluster_methods` function classify nodes. ```{r,echo=TRUE, warning=FALSE} library(BioTIP) #library(igraph) library(cluster) igraphL <- getNetwork(test, fdr = 1) cluster <- getCluster_methods(igraphL) ``` ```{r,echo=TRUE, warning=FALSE} names(cluster) head(cluster[[1]]) ``` __Identifying Dynamic Network Biomodule__ Here, ‘module’ refers to a cluster of network nodes (e.g. transcripts) highly linked (e.g. by correlation). “Biomodule” refers to the module resenting a highest score called “Module-Criticality Index (MCI)” per state. The following step shows a graph of classified clustered samples for five different stages. MCI score is calculated for each module using the `getMCI` function. The `getMaxMCImember` function will obtain a list of modules with highest MCI at each stage. Use `"head(maxCIms)"` to view the MCI scores calculated. Use `plotMaxMCI` function to view the line plot of highest MCI score at each stage. ```{r,echo=TRUE, warning=FALSE} membersL_noweight <- getMCI(cluster,test,adjust.size = FALSE) plotBar_MCI(membersL_noweight,ylim = c(0,6)) ``` ```{r,echo=TRUE, warning=FALSE} maxMCIms <- getMaxMCImember(membersL_noweight[[1]],membersL_noweight[[2]],min =10) names(maxMCIms) names(maxMCIms[[1]]) names(maxMCIms[[2]]) ``` ```{r,echo=TRUE, warning=FALSE} head(maxMCIms[['idx']]) head(maxMCIms[['members']][['lymphoma_aggressive']]) ``` To get the selected statistics of biomodules (the module that has the highest MCI score) of each state, please run the following ```{r} biomodules = getMaxStats(membersL_noweight[['members']],maxMCIms[[1]]) maxMCI = getMaxStats(membersL_noweight[['MCI']],maxMCIms[[1]]) head(maxMCI) ``` ```{r} maxSD = getMaxStats(membersL_noweight[['sd']],maxMCIms[[1]]) head(maxSD) ``` To get the biomodule with the highest MCI score among all states, as we call it CTS (Critical Transition Signals), please run the following. ```{r} CTS = getCTS(maxMCI, maxMCIms[[2]]) ``` Run the following to visualize the trendence of every state represented by the cluster with the highest MCI scores. ```{r,echo=TRUE, warning=FALSE} par(mar = c(10,5,0,2)) plotMaxMCI(maxMCIms,membersL_noweight[[2]],states = names(samplesL),las = 2) ``` We then perform simulation for MCI scores based on identified signature size (length(CTS) ) using the `simulationMCI` function.Use `plot_MCI_simulation` function to visualize the result. This step usually takes 20-30mins, so here to save the time, we picked a small number 3 as the length of the CTS. ```{r,echo=TRUE, warning=FALSE} simuMCI <- simulationMCI(3,samplesL,df) plot_MCI_Simulation(maxMCI,simuMCI) ``` [Go to Top](#) __Finding Tipping Point__ The next step is to calculate an Index of Critical transition (Ic score) of the dataset. First, use the getIc function to calculate the Ic score based on the biomodule previously identified. We use the plotIc function to draw a line plot of the Ic score. ```{r} IC <- getIc(df,samplesL,CTS) par(mar = c(10,5,0,2)) plotIc(IC,las = 2) ``` Then use the two functions to evaluate two types of empirical significance, respectively. The first function simulation_Ic calculates random Ic-scores by shuffling features (transcripts). Showing in the plot is Ic-score of the identification (red) against its corresponding size-controlled random scores (grey). ```{r,warning=FALSE} simuIC <- simulation_Ic(length(CTS),samplesL,df) par(mar = c(10,5,0,2)) plot_Ic_Simulation(IC,simuIC,las = 2) ``` Another function plot_simulation_sample calculates random Ic-scores by shuffling samples and visualizes the results. Showing in the plot is observed Ic-score (red vertical line) comparing to the density of random scores (grey), at the tipping point identified. ```{r} simulated_Ic = plot_simulation_sample(df,length(samplesL[['lymphoma_aggressive']]),IC[['lymphoma_aggressive']],CTS) ``` ################################################### ###Transcript Annotation and Biotype### ################################################### __Quick Start__ The R function getBiotype is used to group transcripts of interest into 11 biotypes based on GENCODE annotation (Fig 2a). When a query transcript overlaps at least half with a GENCODE transcript on the same strand, this query transcript will inherit the biotype of the GENCODE transcript. In the previous study conducted, five out of the 11 biotypes showed high protein-coding potential while the others did not (Fig 2b) [4]. We thus concluded these seven other biotypes, including protein-coding antisense RNAs, to be lncRNAs. The remaining coding biotypes in this study included canonic protein coding (CPC), ‘PC_mixed’, and ‘PC_intron’. First start by loading the required libraries: “GenomeInfoDb,” “BioTIP,” “GenomicRanges,” “IRanges” and “BioTIP”. Next load the datasets: “gencode”, “ILEF”, “intron” and “cod”. Using these datasets, excute BioTIP functions getBiotypes and getReadthrough as follows. These steps assume you installed the “BioTIP” package. If you did not install the package, use the `install.packages("BioTIP")` to install in R. ```{r, echo=FALSE, fig.align='center', out.width = '65%'} knitr::include_graphics("Fig2.jpg") ``` Fig 2. A getBiotypes workflow and protein-coding potential in real data analysis [4]. (a) Workflow of an in-house R function (getBiotypes) to query transcripts of interests and classify into biotypes. (b) Pie-chart of eleven types of transcripts assembled from polyadenylated RNA(TARGET). (c) Empirical cumulative distribution plot comparing the transcripts across all 11 biotypes. The protein-coding potential was estimated with the Coding Potential Assessment Tool (CPAT). Line color codes biotypes. The more a line towards the right-bottom corner, the lower protein-coding potential it has. ```{r} library(BioTIP) data(gencode) head(gencode) ``` These illustrations above assumes you have installed "BioTIP" package. If you did not install the package already, use the `install.packages("BioTIP")` to install in R. \ __Genomic Data Source__ High quality human genome sequence data can be obtained from various sources. To demonstrate this package, we obtained a comprehensive gene annotation of human GRCh37 from [GENCODE](https://www.gencodegenes.org/human). For our illustrations, human GRCh37 data will be used. A standard file structure, similar to general transfer format (gtf) format, is required for this package. This gtf file organizes genomic data in rows and columns (fields). Each row contains information about specific samples. The columns are tab separated headers of the data frame. There are eight fixed columns with specific headers. An example of gtf format is shown below. For details of the gtf file format visit this [link](https://useast.ensembl.org/info/website/upload/gff.html#tracklines target="_blank"). ![Chromosome 21 of human GRCh37 gtf](chr21.JPG) The table above contains a chr21 dataset which was extracted from a full genome dataset. An extraction method for filtering chr21 from `gencode` file is described below. [Go to Top](#) \ __Extracting Summary Data__ Before any further analysis, we need to summarize the content of the raw gtf data. There are two ways to get genome biotypes: a) "transcript_type" b) "gene_type". Due to our interst in coding and noncoding regions, the `transcript_type` method was used to extract the regions of interest using python script shown below. __Note__ that the `"PATH_FILE"` refers to the path where the downloded gtf file is located. For instance, if the gtf file is located on your `desktop`, replace the `"PATH_FILE"` Cc `"Users/user/Desktop/gtf"`. **Python codes:** ```{r "python code", eval = FALSE} gtf = ("Your/PATH/TO/THE/FILE") outF = open("gtf_summary_transbiotype.txt","w") def getquote(str,f,target): targetLen = len(target)+2 strInd = str.find(target) st = strInd + len(target)+2 ed = st + str[st:].find("";") #print(st,ed) f.write("\t"+str[st:ed]) if strInd!= -1 else f.write("\t"+"NA.") with open(gtf, "r") as f: for line in f: if line[0] != "#": chromosome = line.split("\t")[0] st = line.split("\t")[3] ed = line.split("\t")[4] strand = line.split("\t")[6] type = line.split("\t")[2] outF.write(chromosome+"\t"+st+"\t"+ed+"\t"+strand+"\t"+type) c = "transcript_id" g = "gene_name" t = "transcript_type" getquote(line,outF,c) getquote(line,outF,g) getquote(line,outF,t) outF.write("\n") outF.close() ``` *** [Go to Top](#) \ __Loading Data__ In order to load your data from a local drive, use the following format. __Note__ that the `"PATH_FILE"` refers to the location of the summary data from the above section. For more details on how to load datasets click [here](https://support.rstudio.com/hc/en-us/articles/218611977-Importing-Data-with-RStudio). ##### loading data from local drive > data <- read.delim("PATH_FILE", comment.char = "#") Internal BioTIP package data is included in the data folder. The data can be loaded into R working console using `data()`function. Here we show an example of how to load a dataset `gencode` from the data directory. A quick view of the data can be achieved using `head(gencode)`. ```{r} library(BioTIP) library(GenomicRanges) data(gencode) head(gencode) ``` [Go to Top](#) \ __Prepare GRanges Object__ Here we show an extraction of "gencode" dataset using R commands. Note to replace `PATH_FILE` with file direcotry path. `gtf` refers to the full genome file. A subset function was used to filter chr21 dataset as follows. `chr21 <- subset(gencode, seqnames == "chr21")` #"genecode" = whole genome gtf > gtf = read.table("PATH_FILE") > gtf = subset(gtf, biotype == "transcript") > colnames(gtf) = c("chr","start","end","strand","biotype") > gr = GRanges(gtf) [Go to Top](#) ##### Processing Query *** ```{r} query <- GRanges(c("chr1:2-10:+","chr1:6-10:-"), Row.names = c("trans1","trans2"), score = c(1,2)) head(query) ``` ##### Classifying Biotypes *** ```{r} library(BioTIP) gr <- GRanges(c("chr1:1-5:+","chr1:2-3:+"),biotype = c("lincRNA","CPC")) head(gr) ``` ##### Extracting intron coordinates *** # Intron coordinates intron <- GRanges("chr1:6-8:+") ```{r} intron <- GRanges("chr1:6-8:+") head(intron) ``` ##### Filtering coding transcripts *** # Filtering non-coding regions using products from example 1, 2 and 3 ```{r} intron_trncp <- getBiotypes(query, gr, intron) intron_trncp ``` # Filtering Intron and Exons Here we show how to obtain protein coding and non-coding from our datasets. The coding transcripts are an expressed section of the genome that is responsible for protein formation. Meanwhile the non-coding transcripts are vital in the formation regulatory elements such promoters, enhancers and silencers. ```{r} library(BioTIP) data("intron") data("ILEF") data("gencode") gencode_gr = GRanges(gencode) ILEF_gr = GRanges(ILEF) cod_gr = GRanges(cod) intron_gr = GRanges(intron) non_coding <- getBiotypes(ILEF_gr, gencode_gr, intron_gr) dim(non_coding) head(non_coding[,1:3]) ``` ```{r} coding <- getBiotypes(ILEF_gr, gencode_gr) dim(coding) head(coding[,1:3]) ``` ##### Finding overlapping transcripts *** # Samples with overlapping coding regions. ```{r} library(BioTIP) data(ILEF) data(cod) ILEF_gr = GRanges(ILEF) cod_gr = GRanges(cod) rdthrough <- getReadthrough(ILEF_gr, cod_gr) head(rdthrough) ``` \ __Acknowledgements__ The development of this package would not be possible without continous help and feedback from individuals and institutions including: The Bioconductor Core Team, Dr. Xianan H Yang, Dr. Tzintzuni Garcia, and National Institutes of Health R21LM012619. \ ```{r SessionInfo} sessionInfo() ``` \ __References__ * Scheffer M, Carpenter SR, Lenton TM, Bascompte J, Brock W, Dakos V, et al. Anticipating critical transitions. Science. 2012;338(6105):344-8. doi: 10.1126/science.1225244. PubMed PMID: 23087241. * Chen L, Liu R, Liu ZP, Li M, Aihara K. Detecting early-warning signals for sudden deterioration of complex diseases by dynamical network biomarkers. Sci Rep. 2012;2:342. Epub 2012/03/31. doi: 10.1038/srep00342. PubMed PMID: 22461973; PubMed Central PMCID: PMC3314989. * Lenburg, M. E., A. Sinha, D. V. Faller and G. V. Denis (2007). "Tumor-specific and proliferation-specific gene expression typifies murine transgenic B cell lymphomagenesis." J Biol Chem 282(7): 4803-4811.PMC2819333 * Moris, N., C. Pina and A. M. Arias (2016). “Transition states and cell fate decisions in epigenetic landscapes.” Nat Rev Genet 17(11): 693-703. PMID: 27616569. * Mojtahedi M, Skupin A, Zhou J, Castano IG, Leong-Quong RY, Chang H, et al. Cell Fate Decision as High-Dimensional Critical State Transition. PLoS Biol. 2016;14(12):e2000640. doi: 10.1371/journal.pbio.2000640. PubMed PMID: 28027308; PubMed Central PMCID: PMCPMC5189937. * Wang, Z. Z., J. M. Cunningham and X. H. Yang (2018). “CisPi: a transcriptomic score for disclosing cis-acting disease-associated lincRNAs.” Bioinformatics34(17): 664-670"