---
title: "ABAEnrichment: Gene Expression Enrichment in Human Brain Regions"
author: "Steffi Grote"
date: "April 2, 2019"
output:
    BiocStyle::html_document:
        toc: true
vignette: >
    %\VignetteIndexEntry{Introduction to ABAEnrichment}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---

```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
```

```{r global_options, include=FALSE}
knitr::opts_chunk$set(fig.width=10, fig.height=7, warning=FALSE, message=FALSE)
options(width=110)
set.seed(123)
```

# Overview

`r Biocpkg('ABAEnrichment')` is designed to test user-defined genes for expression enrichment in different human brain regions.
The package integrates the expression of the input gene set and the structural information of the brain using an ontology, both provided by the Allen Brain Atlas project [1-4].
The statistical analysis is performed by the core function `aba_enrich`, which interfaces with the ontology enrichment software FUNC [5].
Additional functions provided in this package are `get_expression`, `plot_expression`, `get_name`, `get_id`, `get_sampled_substructures`, `get_superstructures` and `get_annotated_genes`, which support the exploration and visualization of the expression data.

## Expression data

The package incorporates three different brain expression datasets:

1. microarray data from six adult individuals
2. RNA-seq data from 42 individuals of five different developmental stages (prenatal, infant, child, adolescent, adult)
3. developmental effect scores measuring the age effect on expression per gene

All three datasets are filtered for protein-coding genes and gene expression is averaged across donors.
Although the third dataset contains a derived score rather than expression data, for simplicity this documentation refers only to 'expression'.
For details on the datasets see the `r Biocexptpkg('ABAData')` vignette.

## Annotation of genes to brain regions

Using the ontology that describes the hierarchical organization of the brain, each brain region is annotated with all genes that are expressed in the brain region itself or in any of its substructures.
The boundary between 'expressed' and 'not expressed' is defined by different expression quantiles (e.g. using a quantile of 0.4, the lowest 40% of gene expression in the brain are considered 'not expressed' and the upper 60% are considered 'expressed').
These cutoffs are set with the parameter `cutoff_quantiles` and an analysis is run for every cutoff separately.
The default cutoffs are 10% to 90% in steps of 10%.

## Enrichment analysis

The enrichment analysis is performed using either the hypergeometric test, the Wilcoxon rank-sum test, the binomial test or the 2x2 contingency table test implemented in the ontology enrichment software FUNC [5].

The **hypergeometric test** evaluates the enrichment of annotated (expressed) candidate genes compared to annotated background genes for each brain region (see [Schematic 1](#hyper_scheme) below).
The background genes can be defined explicitly like the candidate genes or, by default, consist of all protein-coding genes from the dataset that are not contained in the set of candidate genes.

In contrast to this binary distinction between candidate and background genes, the **Wilcoxon rank-sum test** uses user-defined scores that are assigned to the input genes.
It then tests every brain region for an enrichment of genes with high scores in the set of expressed input genes.

When genes are associated with two counts (*A* and *B*), e.g. amino-acid changes since a common ancestor in two species, a **binomial test** can be used to identify brain regions with an enrichment of expressed genes with a high fraction of *A* compared to the fraction of *A* in the brain in general.

When genes are associated with four counts (*A*-*D*), e.g.
non-synonymous or synonymous variants that are fixed between or variable within species, like for a McDonald-Kreitman test [6], the **2x2 contingency table test** can be used.
It can identify brain regions which have a high ratio of *A/B* compared to *C/D* in their expressed genes.

![input data and test selection](Input_data_test_selection.png 'overview tests')

To account for multiple testing, FUNC computes the family-wise error rate (FWER) using randomsets.
The randomsets are generated by permuting the gene-associated variables (e.g. candidate and background genes or the scores assigned to genes for the hypergeometric and Wilcoxon rank-sum test, respectively, see [Schematic 1](#hyper_scheme) below).
This is also the default behavior in *ABAEnrichment*.

For the hypergeometric test, *ABAEnrichment* additionally provides the option to make the chance of a background gene being selected as a random candidate gene proportional to the length of the background gene (option `gene_len`).
Furthermore, instead of defining genes explicitly, whole genomic regions can be provided as input.
*ABAEnrichment* then tests brain regions for enrichment of expressed genes located in the candidate regions, compared to expressed genes located in the background regions.
The randomsets then also consist of randomly chosen candidate regions inside the background regions, either as a whole block in one background region (default), or on the same chromosome, where a block is allowed to overlap multiple background regions on that chromosome (option `circ_chrom`, see [Schematic 2](#block_scheme) below).

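
The hypergeometric test and the FWER computation described above can be illustrated with a small base R sketch.
All data and names (`toy_candidate`, `toy_anno`, `enrich_p`) are made up for illustration, and this is not the FUNC implementation: FUNC draws random sets, whereas this toy example enumerates every possible candidate set so that the result is deterministic.

```r
## toy sketch with hypothetical data, not the FUNC implementation
## 1 = candidate gene, 0 = background gene (10 genes in total)
toy_candidate <- c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0)
## toy annotation: TRUE if a gene counts as 'expressed' in a brain region
toy_anno <- cbind(
    region_A = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE),
    region_B = c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE),
    region_C = c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE))

## p-value for over-representation of candidates among the expressed genes,
## computed for every region (column) of the annotation matrix
enrich_p <- function(is_cand, anno) {
    apply(anno, 2, function(expressed) {
        phyper(sum(is_cand == 1 & expressed) - 1,  # expressed candidates - 1
               sum(is_cand == 1),                  # all candidate genes
               sum(is_cand == 0),                  # all background genes
               sum(expressed),                     # all expressed genes
               lower.tail = FALSE)
    })
}
toy_p <- enrich_p(toy_candidate, toy_anno)

## 'randomsets': every possible way to label 3 of the 10 genes as candidates;
## for each set keep the minimum p-value across all regions
all_sets <- combn(length(toy_candidate), sum(toy_candidate))
min_p <- apply(all_sets, 2, function(idx) {
    perm <- numeric(length(toy_candidate))
    perm[idx] <- 1
    min(enrich_p(perm, toy_anno))
})

## FWER: fraction of sets with a minimum p-value at least as small as the observed one
toy_fwer <- sapply(toy_p, function(p) mean(min_p <= p))
toy_fwer
```

Comparing each observed p-value to the minimum p-value per set, rather than to the p-value of the same region, is what makes the result a family-wise error rate across all tested regions.
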
## Functions included in ABAEnrichment

function | description
----------- | ------------------------------------------------------------------------
[aba_enrich](#aba_enrich) | core function for performing enrichment analyses given a candidate gene set
[get_expression](#get_expression) | returns expression data for a given set of genes and brain regions
[plot_expression](#plot_expression) | plots a heatmap given a matrix of expression data
[get_name](#onto) | returns the full name of a brain region given a structure ID
[get_sampled_substructures](#onto) | returns the substructures of a given brain region that have expression data available
[get_superstructures](#onto) | returns the superstructures of a given brain region
[get_id](#getid) | returns the structure ID given the name of a brain region
[get_annotated_genes](#anno) | returns genes annotated to enriched or user-defined brain regions

# Examples

## Test for gene expression enrichment using the hypergeometric test

For a random set of 13 candidate genes, two analyses to identify human brain regions with enriched expression of the candidate genes are performed: one using data from adult donors (from *Allen Human Brain Atlas* [3]) and one using data from five developmental stages (from *BrainSpan Atlas of the Developing Human Brain* [4]).
The hypergeometric test evaluates the over-representation of a set of expressed candidate genes in brain regions, compared to a set of expressed background genes (see [Schematic 1](#hyper_scheme) below).
The input for the hypergeometric test is a dataframe with two columns: (1) a column with gene identifiers (*Entrez-ID*, *Ensembl-ID* or *gene-symbol*) and (2) a binary column with `1` for a candidate gene and `0` for a background gene.
In this example no background genes are defined, so all remaining protein-coding genes of the dataset are used as default background.

```{r}
## load ABAEnrichment package
library(ABAEnrichment)
## create input data.frame with candidate genes
gene_ids = c('NCAPG', 'APOL4', 'NGFR', 'NXPH4', 'C21orf59', 'CACNG2', 'AGTR1', 'ANO1',
    'BTBD3', 'MTUS1', 'CALB1', 'GYG1', 'PAX2')
input_hyper = data.frame(gene_ids, is_candidate=1)
head(input_hyper)
```

The core function `aba_enrich` performs the enrichment analysis.
It takes the `genes` dataframe as input, together with a `dataset` argument which is set to `adult` (default) or `5_stages` for the analyses of the adult and the developing human brain, respectively.
An example with the developmental effect score (`dev_effect`) can be found [below](#dev_score).

```{r, eval=FALSE}
## run enrichment analyses with default parameters
## for the adult and developing human brain
res_adult = aba_enrich(input_hyper, dataset='adult')
res_devel = aba_enrich(input_hyper, dataset='5_stages')
```

In the following examples two additional parameters are set to lower computation time: `cutoff_quantiles=c(0.5,0.7,0.9)` to use the 50%, 70% and 90% expression quantiles across all genes as the boundary between 'expressed' and 'not expressed' genes, and `n_randsets=100` to use 100 random permutations to calculate the FWER.
`cutoff_quantiles` and `n_randsets` have default values `seq(0.1,0.9,0.1)` and `1000`, respectively.

```{r,results='hide'}
## run enrichment analysis with fewer cutoffs and randomsets
## to save computation time
res_devel = aba_enrich(input_hyper, dataset='5_stages',
    cutoff_quantiles=c(0.5,0.7,0.9), n_randsets=100)
```

The function `aba_enrich` returns a list, the first element of which contains the results of the statistical analysis for each brain region and age category (analyses are performed independently for each developmental stage):

```{r}
## extract first element from the output list, which contains the statistics
fwers_devel = res_devel[[1]]
## see results for the brain regions with highest enrichment
## for children (3-11 yrs, age_category 3)
fwers_3 = fwers_devel[fwers_devel$age_category==3, ]
head(fwers_3)
```

The rows in the output data frame are ordered by `n_significant`, `min_FWER`, `mean_FWER`, `age_category` and `structure_id`.
For example, `min_FWER` denotes the minimum FWER for enrichment of expressed candidate genes in that brain region across all expression cutoffs, and `n_significant` reports the number of cutoffs at which the FWER was below 0.05.
The column `FWERs` lists the individual FWERs for each cutoff.
The column `equivalent_structures` lists brain regions with identical expression data, which occur because expression was not measured independently in all regions.
Nodes (brain regions) in the ontology inherit data from their children (substructures), and in the case of only one child node with expression data, the parent node inherits the child's data, leading to identical enrichment statistics.

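
The relation between the per-cutoff FWERs and the summary columns described above can be sketched in base R.
The values are invented, and the sketch assumes that the FWERs of one brain region are stored as a single semicolon-separated string; this storage format and the sort direction used below are assumptions for illustration and should be checked against real `aba_enrich` output.

```r
## hypothetical per-cutoff FWERs for three brain regions (made-up values),
## assumed to be stored as one semicolon-separated string per region
toy_fwers <- c('0.50;0.70;0.90', '0.01;0.04;0.20', '0.30;0.03;0.60')
fwer_list <- lapply(strsplit(toy_fwers, ';', fixed=TRUE), as.numeric)

## summary statistics as described in the text
toy_summary <- data.frame(
    FWERs = toy_fwers,
    n_significant = sapply(fwer_list, function(x) sum(x < 0.05)),
    min_FWER = sapply(fwer_list, min),
    mean_FWER = sapply(fwer_list, mean),
    stringsAsFactors = FALSE)

## assumed sort direction: most significant cutoffs first, then lowest FWERs
toy_summary <- toy_summary[order(-toy_summary$n_significant,
    toy_summary$min_FWER, toy_summary$mean_FWER), ]
toy_summary
```
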
In addition to the statistics, the list that is returned from `aba_enrich` also contains the input genes for which expression data are available, and for each age category the gene expression values that correspond to the requested `cutoff_quantiles`:

```{r}
res_devel[2:3]
```

For example, in the enrichment analysis of age category 2 (infant) with an expression cutoff of 0.7 (70%), genes are considered 'expressed' in a particular brain region when their expression value in that region is at least 7.017616.

### Correct for gene length

The default behavior of `aba_enrich` is to permute candidate and background genes randomly to compute the FWER.
With the option `gene_len=TRUE`, random selection of background genes as candidate genes is dependent on the gene length, i.e. a gene twice as long as another gene is also twice as likely to be selected as a candidate gene in a randomset.
This is useful when the procedure that led to the identification of the candidate gene set is also more likely to discover longer genes.
Gene-coordinates were obtained from http://grch37.ensembl.org/biomart/martview/ (GRCh37.p13).
The option `ref_genome='grch38'` uses gene-coordinates from the GRCh38 genome (GRCh38.p10) obtained from http://ensembl.org/biomart/martview/.
Alternatively, [custom gene-coordinates](#cu_coord) can be provided in a dataframe.

```{r,eval=FALSE}
## run enrichment analysis, with randomsets dependent on gene length
res_len = aba_enrich(input_hyper, gene_len=TRUE)
## run the same analysis using gene-coordinates
## from GRCh38 instead of the default GRCh37
res_len_grch38 = aba_enrich(input_hyper, gene_len=TRUE, ref_genome='grch38')
```

### Test for gene expression enrichment for genomic regions

Instead of defining candidate and background genes explicitly in the `genes` input dataframe, it is also possible to define entire chromosomal regions as candidate and background regions.
The expression enrichment is then tested for all protein-coding genes located in, or overlapping the candidate regions on the plus or the minus strand.
The gene-coordinates used to identify those genes were obtained from http://grch37.ensembl.org/biomart/martview/ (*GRCh37.p13*).
The option `ref_genome='grch38'` uses gene-coordinates from the *GRCh38.p10* genome version obtained from http://ensembl.org/biomart/martview/.
Alternatively also [custom gene-coordinates](#cu_coord) can be provided in a dataframe.

In comparison to defining candidate and background genes explicitly, this option has the advantage that the FWER accounts for spatial clustering of genes.
For the random permutations used to compute the FWER, blocks as long as candidate regions are chosen from the merged candidate and background regions and genes contained in these blocks are considered candidate genes ([Schematic 2](#block_scheme)).

To define chromosomal regions in the input dataframe, the first column has to be of the form `chr:start-stop`, where `start` always has to be smaller than `stop`.
Note that this option requires the input of background regions.
If multiple candidate regions are provided, in the randomsets they are placed randomly (but without overlap) into the merged candidate and background regions.
The output of `aba_enrich` is identical to the one that is produced for single genes.

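
The `chr:start-stop` format described above can be checked before running an analysis with a small base R helper.
The function `parse_regions` is hypothetical and not part of *ABAEnrichment*; it is only a sketch of how the format and the `start < stop` requirement could be validated.

```r
## hypothetical helper (not part of ABAEnrichment): split 'chr:start-stop'
## strings into their parts and check that start is smaller than stop
parse_regions <- function(regions) {
    ok <- grepl('^[^:]+:[0-9]+-[0-9]+$', regions)
    if (!all(ok)) {
        stop('not of the form chr:start-stop: ', paste(regions[!ok], collapse=', '))
    }
    chr <- sub(':.*$', '', regions)
    pos <- strsplit(sub('^[^:]+:', '', regions), '-', fixed=TRUE)
    start <- as.numeric(sapply(pos, `[`, 1))
    end <- as.numeric(sapply(pos, `[`, 2))
    if (any(start >= end)) {
        stop('start has to be smaller than stop: ',
            paste(regions[start >= end], collapse=', '))
    }
    data.frame(chr, start, end, length=end - start, stringsAsFactors=FALSE)
}

## example regions in the required format
toy_regions <- c('8:82000000-83000000', '7:1300000-56800000', '9:0-39200000')
toy_parsed <- parse_regions(toy_regions)
toy_parsed
```

The `length` column is only added for convenience; it shows the block size that would be moved around inside the background regions when the randomsets are created.
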
The second element of the output list contains the candidate and background genes located in the user-defined regions:

```{r}
## create input vector with a candidate region on chromosome 8
## and background regions on chromosome 7, 8 and 9
regions = c('8:82000000-83000000', '7:1300000-56800000', '7:74900000-148700000',
    '8:7400000-44300000', '8:47600000-146300000', '9:0-39200000',
    '9:69700000-140200000')
is_candidate = c(1, rep(0,6))
input_region = data.frame(regions, is_candidate)
```

```{r,results='hide'}
## run enrichment analysis for the adult human brain
res_region = aba_enrich(input_region, dataset='adult',
    cutoff_quantiles=c(0.5,0.7,0.9), n_randsets=100)
```

```{r}
## look at the results from the enrichment analysis
fwers_region = res_region[[1]]
head(fwers_region)
## see which genes are located in the candidate region
input_genes = res_region[[2]]
candidate_genes = input_genes[input_genes[,2]==1,]
candidate_genes
```

An alternative method to choose random blocks from the background regions can be used with the option `circ_chrom=TRUE`.
Every candidate region is then compared to background regions on the same chromosome ([Schematic 2](#block_scheme)).
And in contrast to the default `circ_chrom=FALSE`, randomly chosen blocks do not have to be located inside a single background region, but are allowed to overlap multiple background regions.
This means that a randomly chosen block can start at the end of the last background region and continue at the beginning of the first background region on a given chromosome.

### Custom gene-coordinates

Gene-coordinates are used when the FWER is corrected for gene length (`gene_len=TRUE`) or for spatial clustering of genes (genomic regions as input).
Instead of using the integrated gene-coordinates, one can also provide custom gene-coordinates directly as a dataframe with four columns: gene, chromosome, start, end (parameter `gene_coords`).

```{r,echo=FALSE}
gene = c('NCAPG','APOL4','NGFR','NXPH4','C21orf59','CACNG2')
chr = c('chr4', 'chr22', 'chr17', 'chr12', 'chr21', 'chr22')
start = c(17812436, 36585176, 47572655, 57610578, 33954510, 36956916)
end = c(17846487, 36600879, 47592382, 57620232, 33984918, 37098690)
custom_coords = data.frame(gene, chr, start, end, stringsAsFactors=FALSE)
```

```{r}
## example for a dataframe with custom gene-coordinates
head(custom_coords)
```

```{r,eval=FALSE}
## use correction for gene-length based on custom gene-coordinates
res_len_cc = aba_enrich(input_hyper, gene_len=TRUE, gene_coords=custom_coords)
```

Note that `gene_len=TRUE` can therefore be used to correct the FWER for any user-defined gene 'weight', since the correction for gene length just weights each gene with its length (`end - start`).
A gene with a higher weight has a bigger chance of becoming a candidate gene in the randomsets.

## Test for gene expression enrichment using the Wilcoxon rank-sum test

When the genes are not divided into candidate and background genes, but are ranked by scores, a Wilcoxon rank-sum test can be performed to find brain regions with a high proportion of genes with high scores in the set of expressed genes.
The second column of the `genes` input dataframe then contains the scores assigned to the genes.
The output is identical to the one produced with the hypergeometric test.

```{r}
## assign random scores to the genes used above
scores = sample(1:50, length(gene_ids))
input_wilcox = data.frame(gene_ids, scores)
head(input_wilcox)
```

```{r, results='hide'}
## test for enrichment of expressed genes with high scores in the adult brain
## using the Wilcoxon rank-sum test
res_wilcox = aba_enrich(input_wilcox, test='wilcoxon',
    cutoff_quantiles=c(0.5,0.7,0.9), n_randsets=100)
```

```{r}
head(res_wilcox[[1]])
```

## Test for gene expression enrichment using the binomial test

When genes are associated with two counts *A* and *B*, e.g.
amino-acid changes since a common ancestor in two species, a binomial test can be used to identify brain regions with an enrichment of expressed genes with a high fraction *A/(A+B)* compared to the fraction of *A* in the brain in general (the root node).
To perform the binomial test the input dataframe needs a column with the gene symbols and two additional columns with the corresponding counts:

```{r}
## create a toy example dataset with two counts per gene
high_A_genes = c('RFFL', 'NTS', 'LIPE', 'GALNT6', 'GSN', 'BTBD16', 'CERS2')
low_A_genes = c('GDA', 'ENC1', 'EGR4', 'VIPR1', 'DOC2A', 'OASL', 'FRY', 'NAV3')
A_counts = c(sample(20:30, length(high_A_genes)),
    sample(5:15, length(low_A_genes)))
B_counts = c(sample(5:15, length(high_A_genes)),
    sample(20:30, length(low_A_genes)))
input_binom = data.frame(gene_ids=c(high_A_genes, low_A_genes), A_counts, B_counts)
head(input_binom)
```

In this example also the `silent` option is used, which suppresses all output that would be written to the screen (except for warnings and errors):

```{r}
## run binomial test
res_binom = aba_enrich(input_binom, cutoff_quantiles=c(0.2,0.9),
    test='binomial', n_randsets=100, silent=TRUE)
head(res_binom[[1]])
```

## Test for gene expression enrichment using the 2x2 contingency table test

When genes are associated with four counts (*A*-*D*), e.g. non-synonymous or synonymous variants that are fixed between or variable within species, like for a McDonald-Kreitman test [6], the 2x2 contingency table test can be used.
It can identify brain regions which have a high ratio of *A/B* compared to *C/D*, which in this example would correspond to a high ratio of *non-synonymous substitutions / synonymous substitutions* compared to *non-synonymous variable / synonymous variable*:

```{r}
## create a toy example with four counts per gene
high_substi_genes = c('RFFL', 'NTS', 'LIPE', 'GALNT6', 'GSN', 'BTBD16', 'CERS2')
low_substi_genes = c('ENC1', 'EGR4', 'NPTX1', 'DOC2A', 'OASL', 'FRY', 'NAV3')
subs_non_syn = c(sample(5:15, length(high_substi_genes), replace=TRUE),
    sample(0:5, length(low_substi_genes), replace=TRUE))
subs_syn = sample(5:15, length(c(high_substi_genes, low_substi_genes)), replace=TRUE)
vari_non_syn = c(sample(0:5, length(high_substi_genes), replace=TRUE),
    sample(0:10, length(low_substi_genes), replace=TRUE))
vari_syn = sample(5:15, length(c(high_substi_genes, low_substi_genes)), replace=TRUE)
input_conti = data.frame(gene_ids=c(high_substi_genes, low_substi_genes),
    subs_non_syn, subs_syn, vari_non_syn, vari_syn)
head(input_conti)
## the corresponding contingency table for the first gene would be:
## (unlist converts the one-row data frame to a plain numeric vector)
matrix(unlist(input_conti[1, 2:5]), ncol=2,
    dimnames=list(c('non-synonymous', 'synonymous'), c('substitution','variable')))
```

```{r, results='hide'}
res_conti = aba_enrich(input_conti, test='contingency',
    cutoff_quantiles=c(0.7,0.8,0.9), n_randsets=100)
```

The output is analogous to that of the other tests:

```{r}
head(res_conti[[1]])
```

Depending on the counts for each brain region, a Chi-square or Fisher's exact test is performed.
Note that this is the only test that is not dependent on the distribution of the gene-associated variables in the root node.

## Explore expression data

### get_expression

The function `get_expression` returns gene- and brain-region-specific expression data averaged across donors.
As in all functions of the *ABAEnrichment* package, `gene_ids` can be *Entrez-ID*, *Ensembl-ID* or *gene-symbol*.

```{r}
## get expression data for the top 5 regions
## and all input genes
## of the last aba_enrich analysis (res_conti)
top_regions = res_conti[[1]][1:5, 'structure_id']
gene_ids = res_conti[[2]][,1]
expr = get_expression(structure_ids=top_regions, gene_ids=gene_ids, dataset='adult')
expr[,1:6]
```

For the `5_stages` dataset the output of `get_expression` is a list with a data frame for each developmental stage:

```{r}
get_expression(structure_ids=c('Allen:10657','Allen:10208'),
    gene_ids=c('RFFL', 'NTS', 'LIPE'), dataset='5_stages')
```

Note that the brain regions passed to `get_expression` do not have to match the brain regions returned in the output.
This is due to the fact that not all brain regions were measured independently.
In case a brain region was not measured directly, all available expression data from its substructures are returned.
The function [get_sampled_substructures](#onto) can be used to identify substructures with expression data.

### plot_expression

The function `plot_expression` enables the visualization of expression data.
It needs a matrix of expression data as input, like the one returned by [get_expression](#get_expression):

```{r}
## plot the expression data from above
plot_expression(expr, main="microarray expression data for adult brain")
```

For `dataset='5_stages'`, `get_expression` returns a list of matrices; single elements of this list can be passed to `plot_expression`.
In the following example `res_devel`, the enrichment analysis from above for 5 developmental stages, is used, and we want to visualize the expression of the candidate genes for the 5 brain regions with the lowest FWER for age category 2.
We first extract the top 5 brain regions for age category 2 and the candidate genes used in that analysis.
Then we obtain the expression of those genes in the 5 brain regions using `get_expression`, subset to age category 2 and pass the data to `plot_expression`:

```{r}
## plot expression data for the top 5 regions
## of age-category 2 and all input genes
## of the developmental stage aba_enrich analysis above (res_devel)
## extract brain regions
devel_stats = res_devel[[1]]
devel_stats_2 = devel_stats[devel_stats$age_category==2,]
top_regions_2 = devel_stats_2[1:5,'structure_id']
## extract genes
genes = res_devel[[2]][,1]
## get expression for all 5 age categories
expr_all = get_expression(top_regions_2, genes, "5_stages")
## subset to age-category 2
expr_2 = expr_all[["age_category_2"]]
## plot heatmap
plot_expression(expr_2, main="RPKM from RNA-seq for age_category_2")
```

Note that there are more than 5 brain regions in the plot, because for regions that were not sampled directly, the expression of all their sampled substructures is plotted.

Optionally, gene-associated variables (`gene_vars`) can be provided to create a colored side bar.
This is useful to visualize the scores of genes used in an `aba_enrich` analysis, and `gene_vars` should be `aba_enrich()[[2]]`, i.e. a data frame with all valid genes from that enrichment analysis together with their input scores.
For the *hypergeometric test* the colored side bar indicates candidate genes (red) and background genes (black) and for the *Wilcoxon rank-sum test* it indicates the scores that were used for the enrichment analysis.
For the *binomial test* the side bar shows *A/(A+B)* and for the *2x2 contingency table test* *((A+1)/(B+1)) / ((C+1)/(D+1))*.
The +1 is added to prevent division by 0; the value is just a visual indication of the proportion of the ratios and not the real odds ratio from the 2x2 contingency table test.

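
The two side bar formulas above can be reproduced with a few lines of base R.
The counts and gene names below are made up for illustration; the sketch only evaluates the formulas from the text and does not call `plot_expression`.

```r
## hypothetical counts A-D for three genes (made-up values)
toy_counts <- data.frame(gene=c('g1', 'g2', 'g3'),
    A=c(10, 0, 4), B=c(4, 9, 4), C=c(1, 3, 0), D=c(10, 7, 0))

## side bar value for the binomial test: A/(A+B)
binom_bar <- with(toy_counts, A / (A + B))

## side bar value for the 2x2 contingency table test, with +1 added to every count
conti_bar <- with(toy_counts, ((A + 1) / (B + 1)) / ((C + 1) / (D + 1)))

## without the +1, gene g3 (C = D = 0) would give an undefined ratio
raw_ratio <- with(toy_counts, (A / B) / (C / D))

data.frame(toy_counts, binom_bar, conti_bar, raw_ratio)
```

For gene `g3` the uncorrected ratio is `NaN` because *C/D* is 0/0, whereas the +1 version yields a finite value that can be mapped to a color.
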
In the following example we will plot the expression data for the top 5 brain regions from the 2x2 contingency table test above (`res_conti`), by first extracting the brain regions and genes, then getting the expression data, and finally plotting it together with the gene-associated variables from the 2x2 contingency table test as sidebar (*((A+1)/(B+1)) / ((C+1)/(D+1))*):

```{r}
## plot expression data for the top 5 regions
## and all input genes
## from the 2x2-contingency table analysis (res_conti)
## extract brain regions
top_regions = res_conti[[1]][1:5, 'structure_id']
## extract genes
gene_ids = res_conti[[2]][,1]
## get expression
expr = get_expression(structure_ids=top_regions, gene_ids=gene_ids, dataset='adult')
## plot expression with sidebar
plot_expression(expr, gene_vars=res_conti[[2]], main="heatmap with sidebar")
```

The `plot_expression` function is intended to give a quick overview of the expression pattern, optionally together with gene-associated variables.
For more flexible heatmaps, e.g. `heatmap.2` from the package `gplots` can be used (`heatmap.2` is also used inside `plot_expression`):

```{r}
## use heatmap.2 to allow adjusting other parameters like color
gplots::heatmap.2(expr, scale="none", col=gplots::greenred(100),
    main="custom heatmap", trace="none", keysize=1.4,
    key.xlab="expression", dendrogram="row", margins=c(6, 8))
```

### get_name, get_sampled_substructures and get_superstructures

As illustrated in the previous example, some brain regions like *frontal neocortex* (Allen:10161) do not have gene expression data available in the dataset, but some of their substructures do.
Plotting or requesting expression data for such a brain region automatically obtains the data for all its sampled substructures.
*ABAEnrichment* offers some functions to explore the ontology graph which describes the hierarchical organization of the brain regions used in the enrichment analyses.
The function `get_sampled_substructures` returns the IDs of all the substructures for which expression data are available, and `get_superstructures` returns all superstructures in the order 'general to special'.
The function `get_name` is useful to see the name of a brain region given a structure ID:

```{r}
## get IDs of the substructures of the frontal neocortex (Allen:10161)
## for which expression data are available
subs = get_sampled_substructures('Allen:10161')
subs
## get the full name of those substructures
get_name(subs)
## get the superstructures of the frontal neocortex (from general to special)
supers = get_superstructures('Allen:10161')
supers
## get the full names of those superstructures
get_name(supers)
```

Note that the ontologies and the IDs for brain regions differ between the adult and the developing brain.
However, the ontology functions `get_name`, `get_sampled_substructures` and `get_superstructures` work with valid brain region IDs from both ontologies.

### get_id

The function `get_id` searches the ontologies of the adult and developing brain for the structure ID that belongs to a given brain region name:

```{r}
## get structure IDs of brain regions that contain 'accumbens' in their name
get_id('accumbens')
## get structure IDs of brain regions that contain 'telencephalon' in their name
get_id('telencephalon')
```

Note that the output of `get_id` is restricted to brain regions that are used in *ABAEnrichment*.
The function can also be used to get a full list of covered brain regions together with their IDs:

```{r}
## get all brain regions included in ABAEnrichment together with their IDs
all_regions = get_id('')
head(all_regions)
```

### get_annotated_genes

Often it is useful to see which genes are annotated to a brain region, i.e. count as 'expressed', given an expression cutoff.
While this can be accomplished using the expression cutoffs from `aba_enrich(...)[[3]]` and the expression values from `get_expression`, *ABAEnrichment* now also offers the convenient function `get_annotated_genes`.
This function takes the output from `aba_enrich` and a FWER-threshold (`fwer_threshold`, default=0.05) as input and returns all brain-region/expression-cutoff combinations with a FWER below the FWER-threshold together with the corresponding annotated genes, the FWER and the score (1 for candidate and 0 for background genes or the scores from the Wilcoxon rank-sum test).
Note that a brain region is annotated with all genes that are expressed in the brain region itself or in any of the sampled substructures (see [Schematic 1](#hyper_scheme) below).

```{r, results='hide'}
## run an enrichment analysis with 7 candidate and 7 background genes
## for the developing brain
gene_ids = c('FOXJ1', 'NTS', 'LTBP1', 'STON2', 'KCNJ6', 'AGT', 'ANO3',
    'TTR', 'ELAVL4', 'BEAN1', 'TOX', 'EPN3', 'PAX2', 'KLHL1')
is_candidate = rep(c(1,0), each=7)
input_hyper = data.frame(gene_ids, is_candidate)
res = aba_enrich(input_hyper, dataset='5_stages',
    cutoff_quantiles=c(0.5,0.7,0.9), n_randsets=100)
```

```{r}
head(res[[1]])
## see which candidate genes are annotated to brain regions with a FWER < 0.1
anno = get_annotated_genes(res, fwer_threshold=0.1)
anno
```

In addition to passing the result of an enrichment analysis together with a FWER-threshold to `get_annotated_genes`, it is also possible to define brain regions, expression cutoffs and (optionally) genes directly.
The function then returns all genes that have expression above the cutoffs in the respective brain regions.
If `genes` are defined, the output is reduced to this set of genes; if not, all protein-coding genes with available expression data are analyzed.

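
The annotation rule itself (expression above a quantile cutoff in a region or in any of its sampled substructures) can be sketched with toy data in base R.
The expression values, the region and gene names and the helper `annotate` are hypothetical, and computing the cutoff as a quantile over all pooled values is a simplifying assumption, not necessarily how *ABAEnrichment* derives its cutoffs.

```r
## toy expression matrix (made-up values): rows = sampled regions, columns = genes
toy_expr <- rbind(
    sub_1 = c(geneA=1, geneB=8, geneC=3),
    sub_2 = c(geneA=2, geneB=9, geneC=7),
    other = c(geneA=6, geneB=5, geneC=4))

## toy ontology: 'parent' was not sampled itself but has two sampled substructures
toy_children <- list(parent=c('sub_1', 'sub_2'), other='other')

## expression cutoff at the 50% quantile of all pooled values (assumption)
cutoff <- quantile(toy_expr, 0.5, names=FALSE)

## a gene is annotated to a region if its expression is at least the cutoff
## in the region itself or in any of its sampled substructures
annotate <- function(children) {
    expressed <- toy_expr[children, , drop=FALSE] >= cutoff
    colnames(toy_expr)[colSums(expressed) > 0]
}
toy_annotation <- lapply(toy_children, annotate)
toy_annotation
```

Here `parent` inherits the annotation of `geneB` and `geneC` from its substructures, although no expression value exists for `parent` itself.
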
```{r}
## find out which of the above genes have expression
## above the 70% and 90% expression-cutoff
## in the basal ganglia of the adult human brain (Allen:4276)
anno2 = get_annotated_genes(structure_ids='Allen:4276', dataset='adult',
    cutoff_quantiles=c(0.7,0.9), genes=gene_ids)
anno2
```

## Developmental effect score

In the previous examples genes were annotated to brain regions based on their expression.
Besides the two gene expression datasets `adult` and `5_stages`, the dataset `dev_effect` can be used, which provides scores for an age effect for genes based on their expression change during development.
Using this dataset, the same analyses as above are performed, except that a gene is annotated to a brain region when its developmental effect score in that region is above the cutoff defined by `cutoff_quantiles`.
To test brain regions for enrichment of candidate genes in the set of all genes with a developmental effect score above cutoff, the `dataset` parameter has to be set to `dev_effect`.
The output of the developmental effect enrichment analysis has the same structure as that of the expression enrichment analysis:

```{r}
## use previously defined input genes dataframe
head(input_hyper)
```

```{r,results='hide'}
## run enrichment analysis with developmental effect score
res_dev_effect = aba_enrich(input_hyper, dataset='dev_effect',
    cutoff_quantiles=c(0.5,0.7,0.9), n_randsets=100)
```

```{r}
## see the 5 brain regions with the lowest FWERs
top_regions = res_dev_effect[[1]][1:5, ]
top_regions
```

As for the expression datasets, the developmental effect scores can be retrieved with the function `get_expression` and plotted with `plot_expression`:

```{r, eval=FALSE}
## plot developmental effect score of the 5 brain regions with the lowest FWERs
dev_scores = get_expression(structure_ids=top_regions[ ,'structure_id'],
    gene_ids=res_dev_effect[[2]][,1], dataset='dev_effect')
plot_expression(dev_scores, main="developmental effect score")
```

# Schematics

## Schematic 1: Hypergeometric test and FWER calculation

![FWER calculation](./Skizze_Fig1.png 'hypergeometric test and FWER')

The FWERs for the other three tests are computed
analogously: first, for every brain region a statistical test is performed to get an enrichment p-value, then the scores or counts that are assigned to the genes in the input data are permuted and p-values are recomputed, and finally the original p-values are compared to the minimum p-values from the randomsets.

## Schematic 2: `circ_chrom` option for genomic regions input

![options for genomic regions input](./Skizze_Fig2.png 'options for genomic regions input')

To use genomic regions as input, the first column of the `genes` input dataframe has to be of the form `chr:start-stop`.
The option `circ_chrom` defines how candidate regions are randomly moved inside the background regions for computing the FWER.
When `circ_chrom=FALSE` (default), candidate regions can be moved to any background region on any chromosome, but are not allowed to overlap multiple background regions.
When `circ_chrom=TRUE`, candidate regions are only moved on the same chromosome and are allowed to overlap multiple background regions.
The chromosome is 'circularized' which means that a randomly placed candidate region may start at the end of the chromosome and continue at the beginning.

# Session Info

```{r}
sessionInfo()
```

# References

[1] Hawrylycz, M.J. et al. (2012) An anatomically comprehensive atlas of the adult human brain transcriptome, Nature 489: 391-399. [https://doi.org/10.1038/nature11405]

[2] Miller, J.A. et al. (2014) Transcriptional landscape of the prenatal human brain, Nature 508: 199-206. [https://doi.org/10.1038/nature13185]

[3] Allen Institute for Brain Science. Allen Human Brain Atlas (Internet). Available from: [http://human.brain-map.org/]

[4] Allen Institute for Brain Science. BrainSpan Atlas of the Developing Human Brain (Internet). Available from: [http://brainspan.org/]

[5] Pruefer, K. et al. (2007) FUNC: A package for detecting significant associations between gene sets and ontological annotations, BMC Bioinformatics 8: 41.
[https://doi.org/10.1186/1471-2105-8-41]

[6] McDonald, J.H. and Kreitman, M. (1991) Adaptive protein evolution at the Adh locus in Drosophila, Nature 351: 652-654. [https://doi.org/10.1038/351652a0]