%\VignetteIndexEntry{ASpediaFI: Functional Interaction Analysis of AS Events} %\VignettePackage{ASpediaFI} %\VignetteEngine{knitr::knitr} \documentclass{article} \usepackage{graphicx} \usepackage{microtype} \usepackage[T1]{fontenc} \usepackage[latin1]{inputenc} \usepackage{lmodern} \usepackage{geometry} \usepackage{authblk} \usepackage{float} \usepackage{parskip} \usepackage{caption} \geometry{verbose,tmargin=2.5cm,bmargin=2.5cm,lmargin=2.5cm,rmargin=2.5cm} \captionsetup{justification=raggedright,singlelinecheck=false} \usepackage[table]{xcolor} \bibliographystyle{unsrt} \begin{document} <>= library(knitr) opts_chunk$set(concordance = TRUE, comment = NA) options(scipen = 1, digits = 2, warn = -1, width = 82) @ \title{\texttt{ASpediaFI}: Functional Interaction Analysis of AS Events} \author[1]{Doyeong Yu} \author[1]{Kyubin Lee} \author[1]{Daejin Hyung} \author[1]{Soo Young Cho} \author[1]{Charny Park} \affil[1]{Bioinformatics Branch, Research Institute, National Cancer Center, Gyeonggi-do, Republic of Korea} \maketitle \tableofcontents \pagebreak %------------------------------------------------------------ \section{Introduction} %------------------------------------------------------------ Alternative splicing (AS) is a key contributor to transcriptome and phenotypic diversity. There are hundreds of splicing factors regulating AS events which have a significant impact on diverse biological functions. However, it is a challenge to identify functional AS events related to spliceosome and to explore interplaying genes corresponding to specific pathway. We developed an R package \texttt{ASpediaFI} for a systematic and integrative analysis of alternative splicing events and their functional interactions. \begin{center} \includegraphics{Figure1.png} \captionof{figure}{(a) Analytic workflow of \texttt{ASpediaFI}. (b) Illustration of the two-stage random walk with restart} \vspace{2em} \end{center} Figure 1A shows the analytic workflow of \texttt{ASpediaFI}. RNA-Seq BAM files and reference datasets including a GTF file, a gene-gene interaction network, and pathway gene sets are required for this workflow. The workflow begins with obtaining AS event annotations and quantifications. AS event and gene expression quantifications, a gene-gene interaction network, and pathway gene sets are then used to construct a heterogeneous network which contains multiple types of nodes and edges. The initial heterogeneous network (shown in Figure 1B) consists of gene nodes and two types of feature nodes: AS event and pathway. The next step is to run DRaWR (Discriminative Random Walk with Restarts) [1] on the heterogeneous network to rank AS events and pathways for their relevance to a gene set of interest which is called 'query'. The DRaWR algorithm is demonstrated in Figure 1B. Given the heterogeneous network and a query gene set (colored in yellow), two stages of random walk with restart (RWR) are performed. In the first stage, RWR is run twice on the initial network, one with the query gene set and another with all genes in the network as the restart set. Feature nodes are ranked by the difference between the converged probability distributions in two times of RWR. All feature nodes except top \textit{k} ranked nodes are removed to reconstruct a query-specific subnetwork. In the second stage, RWR is run on the query-specific subnetwork to obtain final rankings of genes and features. \texttt{ASpediaFI} provides the user with the final subnetwork and ranked lists of genes and features for further analysis including gene set enrichment analysis and visualization. %------------------------------------------------------------ \section{Installation} %------------------------------------------------------------ To install \texttt{ASpediaFI}, enter the following commands: <>= if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("ASpediaFI") @ %------------------------------------------------------------ \section{Package contents and overview} %------------------------------------------------------------ %------------------------------------------------------------ \subsection{Overview of \texttt{ASpediaFI}} %------------------------------------------------------------ \texttt{ASpediaFI} provides the following functionalities: \begin{itemize} \item AS event detection and annotation \item AS event quantification \item Functional interaction analysis of AS events \item Visualization of AS events and pathways \end{itemize} This package uses a S4 class \texttt{ASpediaFI} as a wrapper of its functionalities (methods) and a container of inputs and outputs (slots). <>= #Load the ASpediaFI package library(ASpediaFI) names(getSlots("ASpediaFI")) @ The \texttt{ASpediaFI} class contains the following slots: \begin{itemize} \item \texttt{samples}: a data frame containing information about samples. The first three columns should be names, BAM file paths, and conditions. \item \texttt{events}: a list of AS events extracted from a GTF file. \item \texttt{psi}: a \texttt{SummarizedExperiment} object containing AS event quantification \item \texttt{gtf}: a \texttt{GRanges} object containing genomic features extracted from a GTF file. \item \texttt{network}: an \texttt{igraph} object containing a query-specific subnetwork as a result of DRaWR. \item \texttt{gene.table}, \texttt{as.table}, \texttt{pathway.table}: data frames containing gene nodes, AS event nodes, and pathway nodes. \end{itemize} \texttt{ASpediaFI} performs the analysis by stepwise manner using the following methods: \begin{itemize} \item \texttt{annotateASevents}: detects AS events from a GTF file and save it in the \texttt{events} field. Also extract features from a GTF file and save in the \texttt{gtf} field. \item \texttt{quantifyPSI}: quantifies AS events using BAM files specified in the \texttt{samples} field. \item \texttt{analyzeFI}: constructs a heterogeneous network of genes, AS events, and pathways and performs DRaWR. \item \texttt{visualize}: visualizes AS event or pathway nodes. \item \texttt{exportNetwork}: exports a subnetwork related to the given pathway to GML format which can be directly used in Cytoscape. \end{itemize} %------------------------------------------------------------ \subsection{Case study: SF3B1 mutation in myelodysplastic syndrome} %------------------------------------------------------------ We provide a case study dataset of myelodysplastic syndrome (MDS) patients, GEO accession GSE114922 [2]. This dataset contains 82 MDS patient samples, 28 of which harbored SF3B1 mutations. We prepared gene expression and PSI profiles for the case study. The package provides the profiles of 40 MDS patients, due to file size limitations. In MDS, hotspot mutation of splicing factor SF3B1 is known to induce distinct subtype, and the study investigated biological functions regulated by the splicing factor mutation. In the following sections, we will walk through the \texttt{ASpediaFI} workflow shown in Figure 1 to identify AS events associated with SF3B1 mutation and explore their functional interactions. %------------------------------------------------------------ \section{Workflow} %------------------------------------------------------------ %------------------------------------------------------------ \subsection{Input data preparation} %------------------------------------------------------------ Before starting the workflow, \texttt{ASpediaFI} requires the following data files to be prepared: \begin{itemize} \item a GTF file for gene model \item an \texttt{igraph} object containing a gene-gene interaction network \item a GMT file containing pathways gene setse \item RNA-Seq BAM files \end{itemize} To begin, we create the \texttt{ASpediaFI} object using the constructor function which requires sample names, paths to RNA-Seq BAM files, and sample conditions. We then obtain AS event annotations from a GRCh38 GTF file using the \texttt{annotateASevents} method. Due to file size limitations, we extract AS event annotations from a subset of GRCh38 GTF file provided in the \texttt{extdata} directory of the package. <>= #Create ASpediaFI object bamWT <- system.file("extdata/GSM3167290.subset.bam", package = "ASpediaFI") GSE114922.ASpediaFI <- ASpediaFI(sample.names = "GSM3167290", bam.files = bamWT, conditions = "WT") #Detect and annotate AS events from a subset of the hg38 GTF file gtf <- system.file("extdata/GRCh38.subset.gtf", package = "ASpediaFI") GSE114922.ASpediaFI <- annotateASevents(GSE114922.ASpediaFI, gtf.file = gtf, num.cores = 1) sapply(events(GSE114922.ASpediaFI), length) head(events(GSE114922.ASpediaFI)$SE) @ The \texttt{annotateASevents} method identifies five types of AS events: \begin{itemize} \item A5SS (alternative 5' splice site) \item A3SS (alternative 3' splice site) \item SE (skipped exon) \item MXE (mutually exclusive exon) \item RI (retained intron) \end{itemize} A list of AS event annotations contains Ensembl ID, chromosome, strand, genomic coordinates of exons, and AS event ID. AS event ID is written in the format of [gene symbol]:[event type]:[chromosome]:[genomic coordinates of exon boundaries] , as defined by ASpedia (https://combio.snu.ac.kr/aspedia/help.html) [3]. Note that the \texttt{events} function is available for accessing to AS event annotations. The \texttt{annotateASevents} method also extracts genomic features from a GTF file and save in a \texttt{gtf} slot as a \texttt{GRanges} object for the visualization of AS events. Next, we quantify AS events from BAM files using the \texttt{quantifyPSI} method. The \texttt{quantifyPSI} method computes PSI (Percentage Spliced In), the fraction of mRNAs containing the alternatively spliced exon [4]. The user needs to specify a type of RNA-Seq reads (single or paired), read length, insert size, and a minimum number of reads mapped to a given exon. At this point, we compute PSI values from a subset of two BAM files for demonstration. The \texttt{quantifyPSI} method saves PSI values in the \texttt{psi} slot as a \texttt{SummarizedExperiment} object with sample information. Similarly, PSI values can be accessed using the \texttt{psi} function. Note that row names of PSI values are AS event IDs. <>= #Compute PSI values of AS events GSE114922.ASpediaFI <- quantifyPSI(GSE114922.ASpediaFI, read.type = "paired", read.length = 100, insert.size = 300, min.reads = 3, num.cores = 1) tail(assays(psi(GSE114922.ASpediaFI))[[1]]) @ \pagebreak Since we need PSI and gene expression profiles for all samples to construct a heterogeneous network, we load example datasets in the package. We update the \texttt{psi} and \texttt{samples} slots with PSI values and sample information stored in the example dataset. <>= #Load PSI and gene expression data data("GSE114922.fpkm") data("GSE114922.psi") #Update the "samples" and "psi" fields psi(GSE114922.ASpediaFI) <- GSE114922.psi samples(GSE114922.ASpediaFI) <- as.data.frame(colData(GSE114922.psi)) head(samples(GSE114922.ASpediaFI)) @ %------------------------------------------------------------ \subsection{Functional interaction analysis of AS events} %------------------------------------------------------------ The \texttt{analyzeFI} method performs data preprocessing, network construction and DRaWR (Discriminative Random Walk with Restart). As the DRaWR algorithm requires a query gene set as input, we first detect genes differentially expressed in SF3B1-mutated samples using the \texttt{limma} package (other DEG analysis tools such as edgeR and DESeq2 can also be used). Assuming DEGs to represent a functional gene set, we use them as a query to identify AS events and pathways closely related to SF3B1 mutation. If a query is given as a character vector, all genes in the query have equal weights. The user can attribute distinct weights by providing a data frame containing the weights in the second column as a query. <>= #Choose query genes based on differential expression library(limma) design <- cbind(WT = 1, MvsW = samples(GSE114922.ASpediaFI)$condition == "MUT") fit <- lmFit(log2(GSE114922.fpkm + 1), design = design) fit <- eBayes(fit, trend = TRUE) tt <- topTable(fit, number = Inf, coef = "MvsW") query <- rownames(tt[tt$logFC > 1 &tt$P.Value < 0.1,]) head(query) @ The \texttt{analyzeFI} method allows the user to change options for data preprocessing, network construction, and DRaWR. \texttt{restart} and \texttt{num.feats} define a restart probability and the number of features to be retained in the final subnetwork. The restart probability is the probability of jumping back to the restart set (query). If the restart probability is small, the walk tends to move around the neighbors of query nodes [5]. \texttt{num.folds} specifies the number of folds in cross-validation for DRaWR. \texttt{low.expr}, \texttt{low.var}, \texttt{prop.na}, and \texttt{prop.extreme} are options for filtering AS events. \texttt{cor.threshold} defines a threshold of Spearman's correlation for connecting AS event nodes and gene nodes in a heterogeneous network. Please see help(analyzeFI) for details. <>= #Perform functional interaction analysis of AS events GSE114922.ASpediaFI <- analyzeFI(GSE114922.ASpediaFI, query = query, expr = GSE114922.fpkm, restart = 0.9, num.folds = 5, num.feats = 200, low.expr = 1, low.var = 0, prop.na = 0.05, prop.extreme = 1, cor.threshold = 0.3) @ Figure 2 shows an ROC plot from the cross-validation produced by the \texttt{analyze} method. 10\% of a query gene set is held out as a test set, and the remaining gene set is used as query. Using the stationary probabilities of gene nodes after the first stage and second stage RWR, ROC curves for two stages are computed. \pagebreak %------------------------------------------------------------ \subsection{Reporting} %------------------------------------------------------------ The \texttt{analyzeFI} method saves top-ranked AS events and pathways in the \texttt{as.table} and \texttt{pathway.table} fields, respectively. \texttt{as.table} contains the information about AS events including AS event ID, gene symbol, AS event type, final ranking, stationary probability, and PSI values of samples in each condition. <>= #Table of AS nodes in the final subnetwork as.table(GSE114922.ASpediaFI)[1:5, 2:5] #Table of GS nodes in the final subnetwork pathway.table(GSE114922.ASpediaFI)[1:5, 1:11] @ \texttt{gs.table} includes the following information about pathway nodes: \begin{itemize} \item \texttt{Pathway}: name of pathway \item \texttt{Rank}: final ranking \item \texttt{Probability}: stationary probability \item \texttt{Pvalue}: GSEA P-value \item \texttt{Adj.Pvalue}: GSEA adjusted P-value \item \texttt{EnrichmentScore}: enrichment score \item \texttt{NormalizedEnrichmentScore}: enrichment score normalized to the average of random samples \item \texttt{Size}: the number of total genes in the pathway gene set \item \texttt{Count}: the number of genes in the pathway gene set that are also present in the network \item \texttt{AvgRank}: the average final ranking of genes in the pathway gene set \item \texttt{NumEvents}: the number of AS events in the final subnetwork connected to genes in the pathway gene set \item \texttt{Genes}: genes in the pathway gene set that are also present in the network \end{itemize} The results from GSEA are included in \texttt{pathway.table} if the \texttt{samples} field contains information about sample conditions (e.g. SF3B1 mutation). The \texttt{analyzeFI} method saves the final query-specific subnetwork in the \texttt{network} field as an \texttt{igraph} object. The user can explore the interactions between AS events and genes as follows: <>= #Extract AS-gene interactions from the final subnetwork library(igraph) edges <- as_data_frame(network(GSE114922.ASpediaFI)) AS.gene.interactions <- edges[edges$type == "AS", c("from", "to")] head(AS.gene.interactions) @ \texttt{ASpediaFI} also allows the user to export the entire subnetwork or a subnetwork related to specific pathway using the \texttt{exportNetwork} method. Given a pathway node, the \texttt{exportNetwork} method extracts a pathway-specific network and exports it to GML format which can be directly used in Cytoscape. If a pathway node is not given, the entire final subnetwork is exported. <>= #Export a pathway-specific subnetwork to GML format exportNetwork(GSE114922.ASpediaFI, node = "HALLMARK_HEME_METABOLISM", file = "heme_metabolism.gml") @ %------------------------------------------------------------ \subsection{Visualization} %------------------------------------------------------------ The \texttt{visualize} method enables visualization of AS events or pathways. If the user provides an AS event nodes as input, it produces a plot describing the AS event and a boxplot of PSI values. Note that the \texttt{gtf} field must contain a \texttt{GRanges} object with genomic features extracted from the GTF file. The genomic region around the AS event can be zoomed by setting \texttt{zoom} to \texttt{TRUE}. Figure 3 illustrates the mutually exclusive exons of HMBS, which has been shown to be associated with SF3B1 mutation in MDS. <>= #Check if any event on the HMBS gene is included in the final subnetwork as.nodes <- as.table(GSE114922.ASpediaFI)$EventID HMBS.event <- as.nodes[grep("HMBS", as.nodes)] #Visualize event visualize(GSE114922.ASpediaFI, node = HMBS.event, zoom = FALSE) @ If a pathway node is given, the \texttt{visualize} method shows a subnetwork consisting of highly ranked gene nodes and AS event nodes connected to the given pathway. The user can change the number of gene and AS event nodes to be shown in the subnetwork by setting \texttt{n}. Figure 4 demonstrates the subnetwork related to the hallmark pathway of heme metabolism which has also been shown to be associated with SF3B1 mutation in MDS. <>= #Visualize nework pertaining to specific pathway visualize(GSE114922.ASpediaFI, node = "HALLMARK_HEME_METABOLISM", n = 10) @ %------------------------------------------------------------ \section{Session Info} %------------------------------------------------------------ <>= sessionInfo() @ \renewcommand{\refname}{\section{References}} \begin{thebibliography}{1} \bibitem{drawr} Blatti, C. et al. (2016). \newblock Characterizing gene sets using discriminative random walks with restart on heterogeneous biological networks. \newblock {\em Bioinformatics\/}, {\bf 32}, 2167--2175. \bibitem{gse114922} Pellagatti, A. et al. (2018). \newblock Impact of spliceosome mutations on RNA splicing in myelodysplasia: dysregulated genes/pathways and clinical associations. \newblock {\em Blood\/}, {\bf 132}, 1225--1240. \bibitem{aspedia} Hyung, D. et al. (2018). \newblock ASpedia: a comprehensive encyclopedia of human alternative splicing. \newblock {\em Nucleic acids research\/}, {\bf 46}, D58--D63. \bibitem{psi} Katz, Y. et al. (2018). \newblock Analysis and design of RNA sequencing experiments for identifying isoform regulation. \newblock {\em Nature methods\/}, {\bf 7}, 1009--1015. \bibitem{rwr} Jin, W. et al. (2018). \newblock Supervised and extended restart in random walks for ranking and link prediction in networks. \newblock {\em PloS one\/}, {\bf 14}, e0213857. \end{thebibliography} \end{document}