--- title: 'KinSwingR: Predicting kinase activity from phosphoproteomics data' author: "Ashley J. Waardenberg" date: 'Last modified: 2019-04-25. Compiled: `r Sys.Date()`' output: html_document: highlight: tango number_sections: yes toc: yes toc_depth: 3 toc_float: collapsed: no smooth_scroll: no pdf_document: toc: yes toc_depth: '3' word_document: toc: yes toc_depth: '3' references: - DOI: 10.1371/journal.pbio.3000170 URL: 'https://doi.org/10.1371/journal.pbio.3000170' author: - family: Engholm-Keller given: K* - family: Waardenberg given: AJ* - family: Müller given: JA - family: Wark given: JR - family: Fernando given: RN - family: Arthur given: JA - family: Robinson given: PJ - family: Dietrich given: D - family: Schoch given: S - family: Graham given: ME id: Kasper2019 issued: month: 3 year: 2019 publisher: PloS Biology title: The temporal profile of activity-dependent presynaptic phospho-signalling reveals long lasting patterns of post-stimulus regulation type: article-journal vignette: > %\VignetteIndexEntry{KinSwingR} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # Introduction to KinSwing KinSwingR aims to predict kinase activity from phoshoproteomics data. It implements the alogorithm described in: @Kasper2019 (in greater detail below). KinSwingR predicts kinase activity by integrating kinase-substrate predictions and the fold change and signficance of change for peptide sequences obtained from phospho-proteomics studies. The score is based on the network connectivity of kinase-substrate networks and is weighted for the number of substrates as well as the size of the local network. P-values are provided to assess the significance of the KinSwing scores, which are determined through random permuations of the overall kinase-substrate network. KinSwingR is implemented as 3 core functions: + **```buildPWM()```** builds position weight matrices (PWMs) from known kinase-substrate sequences + **```scoreSequences()```** score PWMs build using ```buildPWM()``` against input phosphoproteome data + **```swing()```** integrates PWM scores, direction of phosphopeptide change and significance of phosphopeptide change into a "swing" score. The KinSwing score is a metric of kinase activity, ranging from positive to negative, and p-values are provided to determine significance. Additional functions are also provided: + **```cleanAnnotation()```** function to tidy annotations and extract peptide sequences. + **```viewPWM()```** function to view PWM models Detailed information for each of these functions can be accessed using the ```?``` command before the function of interest. E.g. ```?buildPWM``` # KinSwingR example workflow We will now consider an example dataset to predict kinase activity. Kinase-substrate sequences and phosphoproteomics data are provided as example data in the KinSwingR package. Begin by loading the KinSwingR library and the two data libraries included in the package. ```{r echo=TRUE} library(KinSwingR) data(example_phosphoproteome) data(phosphositeplus_human) # View the datasets: head(example_phosphoproteome) head(phosphositeplus_human) ## sample 100 data points for demonstration sample_data <- head(example_phosphoproteome, 1000) # Random sample for demosntration purposes set.seed(1234) sample_pwm <- phosphositeplus_human[sample(nrow(phosphositeplus_human), 1000),] # for visualising a motif, sample only CAMK2A CAMK2A_example <- phosphositeplus_human[phosphositeplus_human[,1]== "CAMK2A",] ``` ## Extracting peptides for analysis Where the centered peptide sequences (on the phosphosite of interest) are not provided in the format required for ```scoreSequences()``` (see the argument "input_data", in ?scoreSequences), these can be required to be extracted from another column of annotated data. NB. "input_data" table format must contain columns for "annotation", "peptide", "fold-change" and "p-values". In the example dataset provided, ```example_phosphoproteome```, peptides have not been extracted into a stand-a-lone peptide column. ```cleanAnnotation()``` is provided as a function to extract peptides from annotation columns and place into the peptide column. In the example dataset, ```example_phosphoproteome```, the peptide sequence is the 4th component of the annotation, which corresponds to using the argument ```seq_number = 4``` below, and is seperated by ```|```, which corresponds to the argument ```annotation_delimiter = "|"```. In this case, the annotated data also contains multi-mapped and multi-site information. For example the following annotation ```A1L1I3|Numbl|263;270|PAQPGHVSPTPATTS;SPTPATTSPGEKGEA``` contains two peptides ```PAQPGHVSPTPATTS``` and ```SPTPATTSPGEKGEA``` that map to different sites from the same reference gene ```Numbl```, where the peptides are seperated by ```;```. The annotated data also includes multi-protein mapped (where a peptide could map to more than one protein - not shown) and contains ```X``` instead of ```_``` to indicate sequences that were outside of the length of the coding sequences. KinSwingR requires that these sequences outside of the coding region are marked with ```_``` as deafult and therefore ```replace_search = "X"``` and ```replace_with = "_"``` can be used as arguments in ```cleanAnnotation()``` to replace these. This allows for full flexibility of the input data here, depending of the software used to generate determine the peptide sequences. NB: characters other than ```_``` can be used, but these need to be declared when calling buildPWM and scoreSequences functions later (see their help files). Calling ```cleanAnnotation()``` will produce a new table with the unique combinations of peptide sequences extracted from the annotation column into the peptide column: ```{r echo=TRUE} annotated_data <- cleanAnnotation(input_data = sample_data, annotation_delimiter = "|", multi_protein_delimiter = ":", multi_site_delimiter = ";", seq_number = 4, replace = TRUE, replace_search = "X", replace_with = "_") head(annotated_data) ``` ## Build Position Weight Matrices (PWMs) The first step to inferring kinase activity, is to build Position Weight Matrices (PWMs) for kinases. This can be done using ```buildPWM()``` for any table containing centered substrate peptide sequences for a list of kinases. The example data ```data(phosphositeplus_human)``` indicates the required format for building PWM models. Below, for demosntration, we use a subset that was sampled above ```sample_pwm``` To generate the PWMs: ```{r echo=TRUE} pwms <- buildPWM(sample_pwm) ``` This will build the PWM models, accessible as ```PWM$pwm``` and list the number of substrate sequences used to build each PWM, accesible as ```PWM$kinase```. To view the list of kinases and the number of sequences used: ```{r echo=TRUE} head(pwms$kinase) ``` ### Visualising motifs Motif amino acids are coloured according to their properties. ```color_scheme``` parameter allows options are "lesk" or "shapely" (default). Y-axis is the information content, measured as bits. ```{r echo=TRUE} CAMK2A_pwm <- buildPWM(CAMK2A_example) CAMK2A <- viewPWM(CAMK2A_pwm, which_pwm="CAMK2A", view_pwm = TRUE, color_scheme = "shapely") ``` ## Score PWM matches against peptide sequences Next, we will use the PWM models generated, ```pwms```, to identify matches in the ```annotated_data``` table that was cleaned using ```cleanAnnotation()``` above. ``scoreSequences``` supports multi-core processing - see the example below for setting the number of workers to 4. ```scoreSequences``` draws a random background by default of size ```n = 1000```. It is recommended to use ```set.seed()``` prior to calling ```scoreSequences``` if you wish to reproduce your results. To access the help file, which explains all the arguments, type ```?scoreSequences``` into the console. ```{r echo=TRUE} # As an example of control over multi-core processing # load BiocParallel library library(BiocParallel) # finally set/register the number of cores to use register(SnowParam(workers = 4)) # set seed for reproducible results set.seed(1234) scores <- scoreSequences(input_data = annotated_data, pwm_in = pwms, n = 100) ``` The outputs of ```scores``` are transparent and accessible. These are however primarily intermediate tables for obtaining swing scores. ```scores``` is a simple list object that contains peptide scores ```(scores$peptide_scores)```, p-values for the peptide scores ```(scores$peptide_p)``` and the background peptides used to score significance ```(scores$background)``` for reproducibility (i.e. the background can saved and reused for reproducibility). In summary, ```scoreSequences()``` scores each input sequence for a match against all PWMs provided using ```buildPWM()`` and generates p-values for scores. This is effectively one large network of kinase-substrate edges of dimensions kinase, ***k***, by substrate, ***s***. ## Predict kinase activity using swing() Having built a kinase-substrate network, ```swing()``` then integrates the kinase-substrate predictions, directionality and significance of phosphopeptide fold change to assess local connectivity (or swing) of kinase-substrate networks. The final score is a normalised score of predicted kinase activity that is weighted for the number of substrates used in the PWM model and number of peptides in the local kinase-substrate network. By default, this will permute the network 1000 times (here we use 10 for example purposes). It is recommended to use ```set.seed()``` prior to calling ```swing``` if you wish to reproduce your results. ``swing``` supports multi-core processing - see the example below for setting the number of workers to 4. ```{r echo=TRUE} # after loading BiocParallel library, set/register the number of cores to use register(SnowParam(workers = 4)) # set seed for reproducible results set.seed(1234) swing_out <- swing(input_data = annotated_data, pwm_in = pwms, pwm_scores = scores, permutations = 10) # This will produce two tables, one is a network for use with e.g. Cytoscape and the other is the scores. To access the scores: head(swing_out$scores) ``` The outputs of this table indicate the following: + ```kinase```: The kinase + ```pos```: Number of ***positively*** regulated kinase substrates + ```neg```: Number of ***negatively*** regulated kinase substrates + ```all```: Total number of regulated kinase substrates + ```pk```: Proportion of ***positively*** regulated kinase substrates + ```nk```: Proportion of ***negatively*** regulated kinase substrates + ```swing_raw```: Raw - weighted score + ```n```: Number of subtrate sequence in ```kinase``` PWM + ```swing```: Normalised (Z-score transformed) - weighted score + ```p_greater```: probability of observing a swing score greater than + ```p_less```: probability of observing a swing score less than Note that the ``pos``, ``neg`` and ``all`` include a pseudo-count, that is set in ``?swing``, note ``pseudo_count``. *** See @Kasper2019 for methods description *** # KinSwingR algorithm *** For a full description of the KinSwing algorithm, see @Kasper2019 *** **In brief:** ```buildPWM()``` generates Position Weight Matrices (PWMs) for kinases based on known substrate sequence (Equation 1), where each kinase, $K$, is considered as the log-likelihood ratio of the average frequency of amino acid, $a$, at each position, $p$, divided by background frequencies, $B$ ($C$ is a pseudo count to avoid log zero): ***Equation 1:*** $PWM_{(a,p)}=log((1/n∑^n_{i=1}K_i)+C)/B_a+C)$ ```scoreSequences()``` scores each kinase, $K$, match to a substrate $S$, given as $S_{score}=∑^n_{(i=1)}f(a,p)$ , which corresponds to the sum of the corresponding amino acid, $a$, of peptide sequence length, $i$, from position, $p$, of $PWM_{(a,p)}$ and $f(a,p)=PWM_{ap}∈PWM_{(a,p)}$. The probability of observing $S_score$ for kinase, $K$, is determined as conditional on a randomly sampled reference distribution of size $N$ sequences $P(S_{score}|R,N)$, where $R$ sequences are determined to have a test statistic less than or equal to $S_{score}$: ***Equation 2:*** $R= ∑^N_{n=1}I((S_{score})n* ≥ (S_{score})i)$ ```swing()``` integrates phosphosite data and kinase-substrate scores from ```scoreSequences()``` into a network for scoring kinase activity based on local connectivity, $swing_k$, (Equation 3). $swing_k$ is the weighted product of the proportion of positive, $Pos_k$, and negative, $Neg_k$, network edges, determined as the product of a logic function (described here: @Kasper2019) given a local network of size, $C_k$, with $n$ substrates for kinase, $K$: ***Equation 3:*** $swing_k=log_2((Pos_k+c)/(Neg_k+c))*log_2(C_k)*log_2(S_n)$ $swing_k$, is transformed into a z-score, $Z(swing_k)$, where, $μ$, is the mean and, $σ$, the standard deviation of swing scores, thus allowing for comparison of predicted kinase activity across multiple timepoints and/or conditions. KinSwingR addresses the question of “how likely is it is observe the predicted activity of kinase, $K$, by random chance?” by computing $swing_k$ given $N$ permutations of kinase node labels, $K$, to substrates, $S$, of the total network, $M_{ks}$. Thus, the probability of observing $swing_k$ is conditional on this permuted reference distribution, of size, $N$ (Equation 2). This is computed for each tail of the distribution, that is, positive and negative $swing_k$ scores. # References