---
title: "Mapping TCGA tumor codes to NCIT"
author: "Vincent J. Carey, stvjc at channing.harvard.edu"
date: "`r format(Sys.time(), '%B %d, %Y')`"
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{"Mapping TCGA tumor codes to NCIT"}
  %\VignetteEncoding{UTF-8}
output:
  BiocStyle::html_document:
    highlight: pygments
    number_sections: yes
    theme: united
    toc: yes
---

# Introduction

The TCGA tumor types cover a collection of anatomical compartments.
Organizing tumor types into groups of related compartments may
be fruitful.  We will use the oncotree OBO representation
from an [NCI thesaurus](https://github.com/NCI-Thesaurus/thesaurus-obo-edition/wiki/Downloads)
OBO distribution in the Bioc 3.9 version of ontoProc.

# A table

This table was constructed by hand on Oct 10 2019 using materials
in ontoProc package.
```{r lkt}
suppressPackageStartupMessages({
library(DT)
library(ontoProc)
library(magrittr)
library(dplyr)
library(BiocOncoTK)
otree = getOncotreeOnto()
})
data("map_tcga_ncit")
datatable(map_tcga_ncit)
```

# Formal annotation of anatomic site

## Expeditious mapping

We will drop the CNTL class, and use only the
first NCIT mapping when two seem to match.
```{r lkanno}
controlindex = which(map_tcga_ncit[,1]=="CNTL")
tcgacodes = map_tcga_ncit[-controlindex,1]
ncitsites = map_tcga_ncit[-controlindex,3]
ssi = strsplit(ncitsites, "\\|")
sites = sapply(ssi, "[", 1)
simpmap = data.frame(code=tcgacodes, oncotr_site=otree$name[sites], ncit=sites,
  stringsAsFactors=FALSE)
simpmap[sample(seq_len(nrow(simpmap)),5),]
```

We now have a 1-1 mapping from TCGA code to NCIT site.
These sites can be grouped according to organ system,
using the knowledge that NCIT:C3263 is the 'neoplasm by site'
(which really should be 'system')
category.

```{r findsys}
poss_sys = otree$children["NCIT:C3263"][[1]] # all possible systems
allanc = otree$ancestors[simpmap$ncit]
specific = sapply(allanc, function(x) intersect(x, poss_sys)[1]) # ignore multiplicities
sys = unlist(otree$name[specific])
datatable(systab <- cbind(simpmap, sys=sys))
```

Neither thymoma nor mesothelioma have NCIT organ system mappings per se.

## Aggregation

We now have 12 categories for 33 tumor types.  A code
pattern for finding the TCGA codes for a given system is:
```{r lkca}
systab %>% filter(grepl("Repro", sys))
```