--- title: 'GIGSEAdata: Gene set collection used for the Genotype Imputed Gene Set Enrichment Analysis (GIGSEA)' author: | | Shijia Zhu | Department of Genetics and Genomic Sciences and Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA date: "June 30, 2018" vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{GIGSEAdata} %\VignetteEncoding{UTF-8} output: pdf_document: toc: yes toc_depth: '3' html_document: highlight: tango theme: united toc: yes toc_depth: 3 --- ## Abstract GIGSEAdata is the gene set collection used for GIGSEA (Genotype Imputed Gene Set Enrichment Analysis), which is a novel SNP enrichment method that uses GWAS-and-eQTL-imputed differential gene expression to interrogate gene set enrichment for the trait-associated SNPs. The gene sets are saved as matrices. Such matrices are largely sparse, so, in order to save space, we used the functions provided by the R package "Matrix" to build the sparse matrices and saved into the GIGSEAdata package. ## 1. Description of gene sets GIGSEA is built on the weighted linear regression model, so it permits both discrete-valued and continuous-valued gene sets. In the GIGSEA package, we already included four categories of gene sets: "MSigDB.KEGG.Pathway", "MSigDB.miRNA", "MSigDB.TF", and "TargetScan.miRNA". Here, we added two more categories in the GIGSEAdata package: 1) **discrete-valued gene sets**: - `org.Hs.eg.GO`: Gene sets that contain genes annotated by the same Gene Ontology (GO) term. For each GO term, we not only incorporate its own gene sets, but also incorporate the gene sets belonging to its offsprings. See the database "org.Hs.eg.GO.db" and "GO.db" in R. 2) **continuous-valued gene sets**: - `Fantom5.TF`: The human transcript promoter locations were obtained from Fantom5. Based on the promoter locations, the tool MotEvo was used to predict the human transcriptional factor (TF) target sites. The dataset contains 500 Positional Weight Matrices (PWM) and 21964 genes. For each PWM, there is a list of associated human TFs, ordered by percent identity of TFs known to bind sites of the PWM. The list of associations was checked manually. The entire set of PWMs and mapping to associated TFs is available from the SwissRegulon website . ## 2 Load data of gene sets: We first take as an example of the gene set "org.Hs.eg.GO"", where the row represents the gene, and the column represents the GO term. Each entry takes discrete values of 0 or 1, where 1 represents the gene (row) belongs to the GO term (column), and otherwise, not. ```{r} library(GIGSEAdata) data(org.Hs.eg.GO) class(org.Hs.eg.GO) names(org.Hs.eg.GO) dim(org.Hs.eg.GO$net) head(colnames(org.Hs.eg.GO$net)) head(rownames(org.Hs.eg.GO$net)) head(org.Hs.eg.GO$annot) head(org.Hs.eg.GO$net[,1:30]) ```