CSAMA 2017: Clustering, classification, and regression with genomic examples

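library(genefu); library(survival)
data(nkis)
# named vector mapping probe identifiers to NCBI gene symbols
map = as.character(annot.nkis$NCBI.gene.symbol)
names(map) = as.character(annot.nkis$probe)
# copy the expression matrix and relabel its columns with gene symbols
ndata.nkis = data.nkis
colnames(ndata.nkis) = map[colnames(data.nkis)]
# check: a few expression columns next to a few clinical columns
cbind(ndata.nkis[1:4,1:4], demo.nkis[1:4,5:8])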

##           ESR1 TBC1D9  GATA3   CA12 grade node size age
## NKI_123  0.195 -0.114  0.202  0.158     3    0  2.0  48
## NKI_327  0.034  0.033  0.158  0.103     2    1  2.0  49
## NKI_291 -0.417  0.140  0.006 -0.266     2    1  1.2  39
## NKI_370  0.429  0.352 -0.050  0.236     1    1  1.8  51

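# overall survival: follow-up time and event indicator
nkSurv = Surv(demo.nkis$t.os, demo.nkis$e.os)
# keep the Oncotype DX signature genes that are present in the NKI data
odata = ndata.nkis[, intersect(as.character(sig.oncotypedx$symbol), colnames(ndata.nkis))]
fullnk = cbind(demo.nkis, odata)
# Cox proportional hazards model on ER status and age
coxph(nkSurv~er+age, data=fullnk)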

## Call:
## coxph(formula = nkSurv ~ er + age, data = fullnk)
## 
##        coef exp(coef) se(coef)     z      p
## er  -1.0018    0.3672   0.3425 -2.92 0.0034
## age -0.0328    0.9677   0.0271 -1.21 0.2268
## 
## Likelihood ratio test=10.1  on 2 df, p=0.00657
## n= 129, number of events= 36 
##    (21 observations deleted due to missingness)

rfullnk = fullnk[,-c(1,2,3,9,10,11,12,13,14,17,18,19)]
library(rpart); r1 = rpart(nkSurv~.,data=rfullnk)
r1

## n=129 (21 observations deleted due to missingness)
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 129 146.652400 1.00000000  
##    2) BIRC5< -0.0365 85  62.712830 0.47436610  
##      4) BIRC5< -0.3975 32   1.801804 0.09909801 *
##      5) BIRC5>=-0.3975 53  52.568420 0.70984040  
##       10) BAG1< -0.219 14   1.660224 0.16988820 *
##       11) BAG1>=-0.219 39  44.603630 0.96814410  
##         22) GSTM1< 0.1565 30  22.464060 0.58792190  
##           44) MKI67>=-0.0655 19   8.070774 0.23294560 *
##           45) MKI67< -0.0655 11   7.582306 1.38868000 *
##         23) GSTM1>=0.1565 9  12.691410 2.77622500 *
##    3) BIRC5>=-0.0365 44  58.962600 2.35960200  
##      6) PGR>=-0.1625 17  16.872130 1.05016300 *
##      7) PGR< -0.1625 27  34.118410 3.40043200  
##       14) GSTM1< -0.1235 7   5.180967 1.32643500 *
##       15) GSTM1>=-0.1235 20  23.712420 4.39730500 *

The CRAN package partykit enhances the tree support in rpart and provides many additional models.

library(partykit)

## 
## Attaching package: 'partykit'

## The following object is masked from 'package:IRanges':
## 
##     width

## The following object is masked from 'package:S4Vectors':
## 
##     width

## The following object is masked from 'package:BiocGenerics':
## 
##     width

p1p = as.party(prune(r1, cp=.05))
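
A pruned survival tree is easiest to read with a Kaplan-Meier curve for each leaf. The following is a minimal sketch of that idea, not the code used in the talk. It assumes the `veteran` data from the survival package as a stand-in for the NKI data, so that it runs without genefu, and it uses only survival and rpart. The names `fit`, `pfit`, `vet`, and `km` are illustrative.

```r
library(survival)
library(rpart)

# Stand-in data (assumption): 'veteran' ships with the survival package.
vet = veteran

# Fit a survival tree on all covariates; rpart uses its
# exponential-scaling method when the response is a Surv object.
fit = rpart(Surv(time, status) ~ ., data = vet)

# Prune with the same complexity parameter as in the text above.
pfit = prune(fit, cp = 0.05)

# pfit$where holds, for each observation, the row of pfit$frame
# that corresponds to the leaf the observation falls into.
vet$leaf = factor(pfit$where)
table(vet$leaf)

# One Kaplan-Meier curve per leaf of the pruned tree.
km = survfit(Surv(time, status) ~ leaf, data = vet)
cols = seq_along(levels(vet$leaf))
plot(km, col = cols, xlab = "Days", ylab = "Survival probability")
legend("topright", legend = paste("leaf", levels(vet$leaf)),
       col = cols, lty = 1)
```

With partykit attached, `plot(p1p)` on the object created above should give a comparable display, with the tree drawn on top and a Kaplan-Meier curve in each terminal node.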

##           pnr      Abd.B        lama      Mkp3       fz2
## 1 0.014123479 0.05531271 0.014584370 0.2086337 0.3759253
## 2 0.009015973 0.01234864 0.014212999 0.3222693 0.5585198
## 3 0.023047258 0.01486692 0.013431432 0.3599486 0.5329454
## 4 0.013179102 0.03184486 0.005370888 0.2365888 0.2585371
## 5 0.008820991 0.06811459 0.016528382 0.1136623 0.1034636

Road map

Use case 1: transcript profiles to distinguish tissue source

Species and organ of origin: microarrays and orthologues (McCall et al., NAR 2012)

Species, organ of origin, and batch: RNA-seq and orthologues (Lin et al., PNAS 2014)

Conflict

Use case 2: Oncotype DX gene signature for breast cancer survival

Setup for NKI breast cancer expression/clinical data

Label expression columns with appropriate symbols; test

Create a survival tree using all available clinical and expression data

Visualize the pruned tree along with K-M curves for leaves

Use case 3: Cell fate signatures from the fruitfly blastocyst

Data setup

Spatial gene-specific patterns

Can we transform spatial patterns for 701 genes to cohere with this fate map?

An assignment of "principal patterns"

Comments

Remainder of talk

On the user interface

Exploring clusters with tissue-of-origin data

Some definitions: general distance

Examples:

Euclidean distance

Manhattan distance

New concept of distance for categorical vectors:

What is the ward.D2 agglomeration method?

What is the Jaccard similarity coefficient?

Summary

On classification methods with genomic data

BiocViews: StatisticalMethod

Conceptual basis for methods covered in the talk

Linear discriminant analysis

Other approaches, issues

A demonstration with tissue-of-origin expression data follows

Remarks