--- title: "Building a classifier with sparse genetic data - case/control classification in autism from rare CNVs" author: "Shraddha Pai" package: netDx date: "`r Sys.Date()`" output: BiocStyle::html_document: toc_float: true vignette: > %\VignetteIndexEntry{03. Build classifier from sparse genetic data} %\VignetteEngine{knitr::knitr} %\VignetteEncoding{UTF-8} --- # TL;DR ```{r,eval=FALSE} suppressMessages(require(netDx)) suppressMessages(require(GenomicRanges)) # read patient CNVs phenoFile <- paste(path.package("netDx"), "extdata", "AGP1_CNV.txt", sep=getFileSep()) pheno <- read.delim(phenoFile,sep="\t",header=TRUE,as.is=TRUE) # sample metadata table must have ID and STATUS columns colnames(pheno)[1] <- "ID" # create GRanges object. # Must have ID and LOCUS_NAMES in metadata cnv_GR <- GRanges(pheno$seqnames, IRanges(pheno$start,pheno$end), ID=pheno$ID,LOCUS_NAMES=pheno$Gene_symbols) pheno <- pheno[!duplicated(pheno$ID),] pathFile <- fetchPathwayDefinitions( "February",2018,verbose=TRUE) pathwayList <- readPathways(pathFile) # get gene coordinates, use hg18 # cache for faster local access require(BiocFileCache) geneURL <- paste("http://download.baderlab.org/netDx/", "supporting_data/refGene.hg18.bed",sep="") cache <- rappdirs::user_cache_dir(appname = "netDx") bfc <- BiocFileCache::BiocFileCache(cache,ask=FALSE) geneFile <- bfcrpath(bfc,geneURL) genes <- read.delim(geneFile,sep="\t",header=FALSE,as.is=TRUE) genes <- genes[which(genes[,4]!=""),] gene_GR <- GRanges(genes[,1], IRanges(genes[,2],genes[,3]), name=genes[,4] ) # create GRangesList of pathway ranges path_GRList <- mapNamedRangesToSets(gene_GR,pathwayList) outDir <- paste(tempdir(),randAlphanumString(), "ASD",sep=getFileSep()) ## absolute path if (file.exists(outDir)) unlink(outDir,recursive=TRUE); dir.create(outDir) message("Getting java version for debugging") java_ver <- suppressWarnings( system2("java", args="--version",stdout=TRUE,stderr=NULL) ) print(java_ver) message("***") predictClass <- "case" out <- buildPredictor_sparseGenetic( pheno, cnv_GR, predictClass, path_GRList, outDir=outDir, ## absolute path numSplits=3L, featScoreMax=3L, enrichLabels=TRUE,numPermsEnrich=3L, numCores=2L) # plot ROC curve. Note that the denominator only includes # patients with events in networks that are label-enriched dat <- out$performance_denEnrichedNets plot(0,0,type="n",xlim=c(0,100),ylim=c(0,100), las=1, xlab="False Positive Rate (%)", ylab="True Positive Rate (%)", bty='n',cex.axis=1.5,cex.lab=1.3, main="ROC curve - Patients in label-enriched pathways") points(dat$other_pct,dat$pred_pct, col="red",type="o",pch=16,cex=1.3,lwd=2) # calculate AUROC and AUPR tmp <- data.frame( score=dat$score, tp=dat$pred_ol,fp=dat$other_ol, # tn: "-" that were correctly not called tn=dat$other_tot - dat$other_ol, # fn: "+" that were not called fn=dat$pred_tot - dat$pred_ol) stats <- netDx::perfCalc(tmp) tmp <- stats$stats message(sprintf("PRAUC = %1.2f\n", stats$prauc)) message(sprintf("ROCAUC = %1.2f\n", stats$auc)) # examine pathway-level scores; these are # cumulative across the splits - here, each of three # splits has a max feature score of three, so # a feature can score a max of (3 + 3 + 3) = 9. print(head(out$cumulativeFeatScores)) ``` # Introduction netDx natively handles missing data, making it suitable to build predictors with sparse genetic data such as somatic DNA mutations, frequently seen in cancer, and from DNA Copy Number Variations (CNV). This example demonstrates how to use netDx to build a predictor from sparse genetic data. Here we build a case/control classifier for Autism Spectrum Disorder (ASD) diagnosis, starting from rare CNVs. The data is from [Pinto et al. (2014) AJHG 94:677)(https://pubmed.ncbi.nlm.nih.gov/24768552-convergence-of-genes-and-cellular-pathways-dysregulated-in-autism-spectrum-disorders/) . # Design and Adapting the Algorithm for Sparse Event Data In this design, we group CNVs by pathways. The logic behind the grouping is prior evidence showing that genetic events in diseases tend to converge on cellular processes of relevance to the pathophysiology of the disease. For example, see the Pinto et al. paper referenced in the previous section. ## Label enrichment In this design, similarity is defined as a binary function, a strategy that has advantages and drawbacks. In plain terms, ***if two patients share a mutation in a pathway, their similarity for that pathway is 1.0 ; otherwise it is zero.*** This binary definition, while conceptually intuitive, increases the false positive rate in the `netDx` feature selection step. That is, networks with even a single case sample will get a high feature score, regardless of whether that network is enriched for case samples. To counter this problem, we introduce a ***label-enrichment*** step in the feature selection. A bias measure is first computed for each network, such that a network with only cases has +1; one with only controls has a score of -1; and one with an equal number of both has a score of zero. Label-enrichment compares the bias in each real network, to the bias in that network in label-permuted data. It then assigns an empirical p-value for the proportion of times a label-permuted network has a bias as high as the real network. Only networks with a p-value below a user-assigned threshold pass label-enrichment, and feature selection is limited to these networks. In `netDx`, label-enrichment is enabled by setting `enrichLabels=TRUE` in the call to `buildPredictor_sparseGenetic()`. ## Cumulative feature scoring The other difference between this design and those with non-sparse data, is the method of feature scoring. The user specifies a parameter which indicates the number of times to split the data and run feature selection. The algorithm then runs feature selection `numSplits` times, each time leaving 1/`numSplits` of the samples out. In each split, features are scored between 0 and `featScoreMax`, using the same approach as is used for continuous-valued input. Feature scores are then added across the splits so that a feature can score as high as `numSplits*featScoreMax`. ## Evaluating model performance For a given cutoff for features, a patient is called a "case" if they have a genetic event in pathways that pass feature selection at that cutoff; otherwise, at that cutoff, they are labelled a "control". These calls are used to generate the false positive and true positive rates across the various cutoffs, which ultimately generates an ROC curve. # Setup ```{r,eval=TRUE} suppressMessages(require(netDx)) suppressMessages(require(GenomicRanges)) ``` # Data CNV coordinates are read in, and converted into a `GRanges` object. As always, the sample metadata table, here the `pheno` object, must have `ID` and `STATUS` columns. ```{r,eval=TRUE} outDir <- paste(tempdir(),randAlphanumString(), "ASD",sep=getFileSep()) ## must be absolute path if (file.exists(outDir)) unlink(outDir,recursive=TRUE); dir.create(outDir) cat("* Setting up sample metadata\n") phenoFile <- paste(path.package("netDx"), "extdata", "AGP1_CNV.txt", sep=getFileSep()) pheno <- read.delim(phenoFile,sep="\t",header=TRUE,as.is=TRUE) colnames(pheno)[1] <- "ID" head(pheno) cnv_GR <- GRanges(pheno$seqnames,IRanges(pheno$start,pheno$end), ID=pheno$ID,LOCUS_NAMES=pheno$Gene_symbols) pheno <- pheno[!duplicated(pheno$ID),] ``` # Group CNVs by pathways The `fetchPathwayDefinitions()` function downloads pathway definitions from `baderlab.org` but users may provide custom `.gmt` files as well. In the example below, gene coordinates for the hg18 genome build are automatically fetched from a remote location, and converted to a `GRanges` object. The function `mapNamedRangesToSets()` is used to group this `GRanges` object into pathway-level sets. ```{r,eval=TRUE} pathFile <- fetchPathwayDefinitions("February",2018,verbose=TRUE) pathwayList <- readPathways(pathFile) # get gene coordinates, use hg18 # cache for faster local access require(BiocFileCache) geneURL <- paste("http://download.baderlab.org/netDx/", "supporting_data/refGene.hg18.bed",sep="") cache <- rappdirs::user_cache_dir(appname = "netDx") bfc <- BiocFileCache::BiocFileCache(cache,ask=FALSE) geneFile <- bfcrpath(bfc,geneURL) genes <- read.delim(geneFile,sep="\t",header=FALSE,as.is=TRUE) genes <- genes[which(genes[,4]!=""),] gene_GR <- GRanges(genes[,1],IRanges(genes[,2],genes[,3]), name=genes[,4]) ``` Group gene extents into pathway-based sets, which effectively creates grouping rules for netDx. The function `mapNamedRangesToSets()` does this grouping, generating a `GRangesList` object. ```{r,eval=TRUE} path_GRList <- mapNamedRangesToSets(gene_GR,pathwayList) ``` # Run predictor Once the phenotype matrix and grouping rules are set up, the predictor is called using `buildPredictor_sparseGenetic()`. Note that unlike with non-sparse data, the user does not provide a custom similarity function in this application; currently, the only option available is the binary similarity defined above. As discussed above, setting `enrichLabels=TRUE` to enable label-enrichment is highly recommended to reduce false positive rate. ```{r,eval=TRUE} predictClass <- "case" out <- buildPredictor_sparseGenetic(pheno, cnv_GR, predictClass, path_GRList, outDir=outDir, ## absolute path numSplits=3L, featScoreMax=3L, enrichLabels=TRUE,numPermsEnrich=3L, numCores=2L) ``` # Plot results Feature selection identifies pathways that are consistently enriched for the label of interest; here, "case" status. From the diagnostic point of view, a patient with a genetic event in a selected feature - here, a CNV in a feature-selected pathway - is labelled a "case". "True positives" are therefore cases with CNVs in feature-selected pathways, while "false positives" are controls with CNVs in feature-selected pathways. These definitions are used to compute the ROC curve below. ```{r,eval=TRUE} dat <- out$performance_denEnrichedNets plot(0,0,type="n",xlim=c(0,100),ylim=c(0,100), las=1, xlab="False Positive Rate (%)", ylab="True Positive Rate (%)", bty='n',cex.axis=1.5,cex.lab=1.3, main="ROC curve - Patients in label-enriched pathways") points(dat$other_pct,dat$pred_pct, col="red",type="o",pch=16,cex=1.3,lwd=2) ``` We can also compute the AUROC and AUPR from scratch. ```{r,eval=TRUE} tmp <- data.frame( score=dat$score, tp=dat$pred_ol,fp=dat$other_ol, # tn: "-" that were correctly not called tn=dat$other_tot - dat$other_ol, # fn: "+" that were not called fn=dat$pred_tot - dat$pred_ol) stats <- netDx::perfCalc(tmp) tmp <- stats$stats message(sprintf("PRAUC = %1.2f\n", stats$prauc)) message(sprintf("ROCAUC = %1.2f\n", stats$auc)) ``` Pathway scores are also added across the splits, for a total of 9 across the 3 splits (3 + 3 + 3). ```{r,eval=TRUE} # now get pathway score print(head(out$cumulativeFeatScores)) ```