---
title: "An Introduction to the genoset Package"
author: "Peter M. Haverty"
date: "`r format(Sys.Date())`"
output:
  BiocStyle::html_document:
    toc: true
    toc_depth: 2
vignette: >
  %\VignetteIndexEntry{genoset}
  %\VignetteDepends{}
  %\VignetteKeywords{genoset}
  %\VignetteEngine{knitr::rmarkdown}
---

```{r style, echo = FALSE, results='hide'}
BiocStyle::markdown()
```

# Introduction
The `r Biocpkg("genoset")` package offers an extension of the familiar
Bioconductor `RangedSummarizedExperiment` object for genome assays:
the `GenoSet` class. The `GenoSet` class provides additional API
features (e.g. indexing by genome position). The genoset package also
provides a number of convenient functions for working with data
associated with with genome locations.


## Creating Objects

In typical Bioconductor style, `GenoSet` objects, and derivatives, can be created using the functions
with the same name.  Let's create some fake data to experiment
with. Don't worry too much about how the fake data gets made. Notice
how assays  can be matrices or `RleDataFrames`. They can also
be `BigMatrix` or `BigMatrixFactor` objects (from `r Biocpkg("bigmemoryExtras")`).

```{r objectcreation}
library(genoset)

sample.names = LETTERS[11:13]
probe.names = paste("p", 1:1000, sep="")
num.samples = length(sample.names)
num.probes  = length(probe.names)

locs = GRanges(
  ranges= IRanges(
      start=c(seq(from=125e6,by=3e4,length=400), seq(from=1,length=400,by=3.25e4),
              seq(from=30e6,length=200,by=3e4)),width=1,names=probe.names),
  seqnames=factor(c(rep("chr8",400), rep("chr12",400),rep("chr17",200)),levels=c("chr8","chr12","chr17")))

fake.cn = matrix(c(
  c(rnorm(200,.4,0.05),rnorm(200,.2,0.05),rnorm(200,0.23,0.05),rnorm(200,.15,0.05),rnorm(200,2.,0.05)), 
  c(rnorm(200,0,0.05), rnorm(200,3,0.05),rnorm(200,14,0.05),rnorm(200,.1,0.05),rnorm(200,-0.05,0.05)),
  c(rnorm(200,.1,0.05),rnorm(200,1,0.05),rnorm(200,-.5,0.05),rnorm(200,3,0.05),rnorm(200,3,0.05))
  ),
  nrow=num.probes,ncol=num.samples,dimnames=list(probe.names,sample.names))
fake.pData=data.frame(matrix(LETTERS[1:15],nrow=3,ncol=5,dimnames=list(sample.names,letters[1:5])))

gs = GenoSet( rowRanges=locs, assays=list(cn=fake.cn), colData=fake.pData )
gs

rle.ds = GenoSet( rowRanges=locs,
                 assays=list(cn = fake.cn,
                             cn.segments=RleDataFrame(
                                 K=Rle(c(rep(1.5,300),rep(2.3,700))),L=Rle( c(rep(3.2,700),rep(2.1,300)) ),
                                 M=Rle(rep(1.1,1000)),row.names=rownames(fake.cn))),
                             colData=fake.pData
  )

``` 

Let's have look at what's inside these objects.

```{r objectassaydata}
names(assays(rle.ds))
head( rle.ds[,,"cn"] )
head( rle.ds[,,"cn.segments"] )
``` 

## Accessing Genome Information
Now lets look at some special functions for accessing genome
information from a genoset object. These functions are all defined for
`GenoSet` and `GenomicRanges` objects. We can access per-feature information
as well as summaries of chromosome boundaries in base-pair or
row-index units.

There are a number of functions for getting
portions of the locData data. `chr` and `pos` return the
chromosome and position information for each feature. `genoPos` is like `pos`, but it
returns the base positions counting from the first base in the genome, with the
chromosomes in order by number and then alphabetically for the letter
chromosomes. `chrInfo` returns the genoPos of the first and last feature on each chromosome in
addition to the offset of the first feature from the start of the genome. `chrInfo` results are
used for drawing chromosome boundaries on genome-scale plots. `pos`
and `genoPos` are defined as the floor of the average of each
features start and end positions.

```{r accessgenomeinfo}
head( rowRanges(gs) )
chrNames(gs)
chrOrder(c("chr12","chr12","chrX","chr8","chr7","chrY"))
chrInfo(gs)
chrIndices(gs)

head(chr(gs))
head(start(gs))
head(end(gs))
head(pos(gs))
head(genoPos(gs))
``` 

## Genome Order
`GenoSet` and `GenomicRanges` objects can be set to, and checked for, genome
order. Weak genome order requires that features be ordered within each
chromosome. Strong genome order requires a certain order of chromosomes
as well. Features must be ordered so that features from the same
chromosome are in contiguous blocks.

Certain methods on `GenoSet` objects expect the rows to be in genome
order. Users are free to rearrange rows within
chromosome as they please.

The proper order of chromosomes is desirable for full-genome plots and is specified by the
`chrOrder` function. The object creation method `Genoset` creates objects in strict genome order. 

```{r genomeorder}
chrOrder(chrNames(gs))
gs = toGenomeOrder(gs, strict=TRUE)
isGenomeOrder(gs, strict=TRUE)
```

## Using the Subset Features

GenoSet objects can be subset using array notation. The ``features''
index can be a set of ranges or the usual logical, numeric or character indices. `chrIndices` with a chromosome argument is
a convenient way to get the indices needed to subset by chromosome.

Subset by chromosome   
```{r subsetbychr}
chr12.ds = gs[ chrIndices(gs,"chr12"), ]
dim(chr12.ds)
chrIndices(chr12.ds)
``` 

Subset by a collection of gene locations
```{r subsetbygene}
gene.gr = GRanges(ranges=IRanges(start=c(35e6,127e6),end=c(35.5e6,129e6),                       
                       names=c("HER2","CMYC")), seqnames=c("chr17","chr8"))
gene.ds = gs[ gene.gr, ]
dim(gene.ds)
chrIndices(gene.ds)
``` 

GenoSet objects can also be subset by a group of samples and/or
features, just like an ExpressionSet, or a matrix for that matter.

```{r subsetsamples}
dim(gs[1:4,1:2])
``` 

`eSet`-derived classes tend to have special functions to get and set
specific assayDataElement members (the big data matrices). For
example, `ExpressionSet` has the `exprs` function. It is common to put
other optional matrices in assayData too (genotypes, quality
scores, etc.). These can be get and set with the `assayDataElement`
function, but typing that out can get old.  `GenoSet` and derived
classes use the ``k'' argument to the matrix subsetting bracket to
subset from a specific assayDataElement. In addition to saving some
typing, you can directly use a set of ranges to subset the
assayDataElement.

```{r subsetassaydata}
all( gs[ 1:4,1:2,"cn"] == assay(gs,"cn")[1:4,1:2] )
``` 

# Processing Data

## Correction of Copy Number for local GC content
Copy number data generally shows a GC content effect that appears as
slow ``waves'' along the genome (Diskin et al., NAR, 2008). The
function `gcCorrect` can be used to remove this effect resulting in
much clearer data and more accurate segmentation. GC content is best
measured as the gc content in windows around each feature, about 2Mb
in size.

```{r GC, eval=FALSE}
library(BSgenome.Hsapiens.UCSC.hg19)
gc = rnorm(nrow(gs))
gs[,,"cn"] = gcCorrect(gs[,,"cn"],gc)
``` 

## Segmentation
Segmentation is the process of identifying blocks of the genome in
each sample that have the same copy number value. It is a smoothing
method that attempts to replicate the biological reality where chunks
of chromosome have been deleted or amplified.

Genoset contains a convenience function for segmenting data for each sample/chr using the
`r Biocpkg("DNAcopy")` package (the CBS algorithm).  Genoset adds features to split jobs among
processor cores. When the library `r Biocpkg("parallel")` is loaded, the
argument n.cores can control the number of processor cores
utilized.

Additionally, `GenoSet` stores segment values so that they can be
accessed quickly at both the feature and segment level. We use a
`RleDataFrame` object where each column is a
Run-Length-Encoded `Rle` object. This dramatically reduces the
amount of memory required to store the segments. Note how the segmented values
become just another member of the assayData slot.

Try running CBS directly
```{r cbsdirect}
library(DNAcopy)
cbs.cna = CNA(gs[,,"cn"], chr(gs), pos(gs) )
cbs.smoothed.CNA = smooth.CNA( cbs.cna )
cbs.segs = segment( cbs.cna )
``` 

Or use the convenience function `runCBS`
```{r runCBS}
gs[,,"cn.segs"] = runCBS(gs[,,"cn"],rowRanges(gs))
``` 

Try it with `r Biocpkg("parallel")`
```{r cbsmulticore,eval=FALSE}
library(parallel)
gs[,,"cn.segs"] = runCBS(gs[, , "cn"],rowRanges(gs), n.cores=3)
gs[,,"cn.segs"][1:5,1:3]
``` 

Other segmenting methods can also be used of course. 

This function makes use of the parallel package to run things in
parallel, so plan ahead when picking ``n.cores''. Memory usage can be
a bit hard to predict.

## Segments as tables or runs
Having segmented the data for each sample, you may want to explore different 
representations of the segments. `r Biocpkg("Genoset")` describes data in genome
segments two ways: 1) as a table of segments, and 2) a
Run-Length-Encoded vector. Tables of segments are useful for printing, overlap queries, 
database storage, or for summarizing changes in a sample.  Rle 
representations can be used like regular vectors, plotted as segments (see `genoPlot`), and 
stored efficiently. A collection of `Rle` objects, one for each sample, are often stored as 
one `RleDataFrame` in a `GenoSet`.  `r Biocpkg("genoset")` provides functions to quickly flip back and forth 
between table and Rle representations. You can use these functions on single samples, or 
the whole collection of samples.

```{r segments}
head( gs[,,"cn.segs"] )

segs = segTable( gs[,2,"cn.segs"], rowRanges(gs) )
list.of.segs = segTable( gs[,,"cn.segs"], rowRanges(gs) )
rbind.list.of.segs = segTable( gs[,,"cn.segs"], rowRanges(gs), stack=TRUE )
two.kinds.of.segs = segPairTable( gs[,2,"cn.segs"], gs[,3,"cn.segs"], rowRanges(gs) )

rle = segs2Rle( segs, rowRanges(gs) )
rle.df = segs2RleDataFrame( list.of.segs, rowRanges(gs) )

bounds = matrix( c(1,3,4,6,7,10),ncol=2,byrow=TRUE)
cn = c(1,3,2)
rle = bounds2Rle( bounds, cn, 10 )
```

`segPairTable` summarizes two Rle objects into segments that have one unique 
value for each Rle. This is useful for cases where you want genome regions with one 
copy number state, and one LOH state, for example.

`bounds2Rle` is convenient if you already know the genome feature indices corresponding 
to the bounds of each segment.

Currently we use `data.frames` for tables of segments. In the near future these will have colnames 
that will make it easy to coerce these to `GRanges`. Coercion to `GRanges`
takes a while, so we don't do that by default.

## Gene Level Summaries
Analyses usually start with SNP or probeset level data. Often it is desirable to get
summaries of assayData matrices over an arbitrary set of ranges, like
exons, genes or cytobands. The function `rangeSampleMeans` serves this
purpose. Given a `GenomicRanges` of arbitrary genome ranges and a
`GenoSet` object, `rangeSampleMeans` will return a matrix of values
with a row for each range. 

`rangeSampleMeans` uses `boundingIndicesByChr` to select the features 
bounding each range. The bounding features are the features with locations equal to or within the start
and end of the range. If no feature exactly matches an end of the range, the nearest features outside 
the range will be used. This bounding ensures that the full extent of the
range is accounted for, and more importantly, at least two features
are included for each gene, even if the range falls between two
features.

`rangeColMeans` is used to do a fast average of each of a
set of such bounding indices for each sample. These functions are
optimized for speed.  For example, with 2.5M features and ~750
samples, it takes 0.12 seconds to find the features bounded by all
Entrez Genes (one RefSeq each). Calculating the mean value for each
gene and sample takes ~9 seconds for a matrix of array data and ~30
seconds for a RleDataFrame of compressed Rle objects.

Generally, you will want to summarize segmented data and will be working with a 
`RleDataFrame`, like that returned by `runCBS`.

As an example, let's say you want to get the copynumber of your two
favorite genes from the subsetting example:

Get the gene-level summary:
<<genelevel, eval=TRUE>>=
boundingIndicesByChr( gene.gr, rowRanges(gs) )
rangeSampleMeans(gene.gr, gs, "cn.segs")
``` 

# Plots
Genoset has some handy functions for plotting data along the genome.
Segmented data ``knows'' it should be plotted as lines, rather than
points. One often wants to plot just one chromosome, so a convenience
argument for chromosome subsetting is provided. Like `plot`, `genoPlot` plots x against y. 
'x' can be some form of location data, like a `GenoSet` or `GenomicRanges`. 'y' is some 
form of data along those coordinates, like a numeric vector or `Rle`. `genoPlot` marks 
chromosome boundaries and labels positions in ``bp'', ``kb'', ``Mb'', 
or ``Gb'' units as appropriate.

```{r plotgenome, echo=TRUE}
genoPlot(gs, gs[ , 1, "cn"])
genoPlot(gs, gs[ , 1, "cn.segs"], add=TRUE, col="red")
``` 

Segmented copy number across the genome for 1st sample}

```{r plotchr, echo=TRUE}
genoPlot(gs,gs[,1,"cn"],chr="chr12")
genoPlot(gs,gs[,1,"cn.segs"],chr="chr12",add=TRUE, col="red")
``` 
Segmented copy number across chromosome 12 for 1st sample

Plot data without a `GenoSet` object using numeric or `Rle` data:

```{r plotchrsimple, eval=FALSE}
chr12.ds = gs[chr(gs) == "chr12",]
genoPlot(pos(chr12.ds),chr12.ds[,1,"cn"],locs=rowRanges(chr12.ds))  # Numeric data and location
genoPlot(pos(chr12.ds),chr12.ds[,1,"cn.segs"],add=TRUE, col="red") # Rle data and numeric position
``` 

## Processing B-Allele Frequency Data
B-Allele Frequency (BAF) data can be converted into the ``Modified BAF'' or mBAF metric, introduced by Staaf, et
al., 2008. mBAF folds the values around the 0.5 axis and makes the HOM positions NA.  The
preferred way to identify HOMs is to use genotype calls from a matched
normal (AA, AC, AG, etc.), but NA'ing greater than a certain value works OK.  A hom.cutoff of 0.90 is
suggested for Affymetrix arrays and 0.95 for Illumina arrays, following Staaf, et al.

Return data as a matrix:

```{r mbafcutoff}
fake.baf  = sample(c(0,0.5,1), length(probe.names), replace=TRUE) + rnorm(length(probe.names),0,0.01)
fake.baf[ fake.baf > 1 ] = fake.baf[ fake.baf > 1 ] - 1
fake.baf[ fake.baf < 0 ] = fake.baf[ fake.baf < 0 ] + 1
hets = fake.baf < 0.75 & fake.baf > 0.25
fake.baf[ 101:200 ][ hets[101:200] ] = fake.baf[ 101:200 ][ hets[101:200] ] + rep(c(-0.2,0.2),50)[hets[101:200]]
fake.baf = matrix(fake.baf,nrow=num.probes,ncol=num.samples,dimnames=list(probe.names,sample.names))

baf.ds = GenoSet( rowRanges=locs, assays=list(lrr=fake.cn, baf=fake.baf), colData=fake.pData )
baf.ds[, , "mbaf"] = baf2mbaf(baf.ds[, , "baf"], hom.cutoff = 0.90)
``` 

... or use compress it to an RleDataFrame. This uses ~1/3 the space
on our random test data.

```{r mbaftorle}
mbaf.data = RleDataFrame( sapply(colnames( baf.ds),
  function(x) { Rle( baf.ds[,x, "mbaf"] ) },
  USE.NAMES=TRUE, simplify=FALSE ) )
as.numeric(object.size( baf.ds[, ,"mbaf"]))  / as.numeric( object.size(mbaf.data))
``` 

Using the HOM SNP calls from the matched normal works much better. A
matrix of genotypes can be used to set the HOM SNPs to NA. A list of
sample names matches the columns of the genotypes to the columns of
your baf matrix. The names of the list should match column names in
your baf matrix and the values of the list should match the column
names in your genotype matrix.  If this method is used and some
columns in your baf matrix do not have an entry in this list, then
those baf columns are cleaned of HOMs using the hom.cutoff, as above.

Both mBAF and LRR can and should be segmented. Consider storing mBAF
as an RleDataFrame as only the ~1/1000 HET positions are being used
and all those NA HOM positions will compress nicely.

```{r segment}
baf.ds[,,"baf.segs"] = runCBS( baf.ds[, ,"mbaf"], rowRanges(baf.ds) )
baf.ds[,,"lrr.segs"] = runCBS( baf.ds[, , "lrr"], rowRanges(baf.ds) )
``` 

## Plots

```{r plotlrr}
genoPlot(baf.ds,baf.ds[,1,"lrr"],chr="chr12",main="LRR of chr12")
genoPlot(baf.ds,baf.ds[,1,"lrr.segs"],chr="chr12",add=TRUE,col="red")
```

Segmented copy number across the genome for 1st sample`

```{r plotbaf, echo=TRUE}
par(mfrow=c(2,1))
genoPlot(baf.ds,baf.ds[,1,"baf"],chr="chr12", main="BAF of chr12")
genoPlot(baf.ds,baf.ds[,1,"mbaf"],chr="chr12", main="mBAF of chr12")
genoPlot(baf.ds,baf.ds[,1,"baf.segs"],chr="chr12", add=TRUE,col="red")
``` 

# Cross-sample summaries
You can quickly calculate summaries across samples to identify regions
with frequent alterations.  A bit more care is necessary to work one
sample at a time if your data "matrix" is an `RleDataFrame`.

```{r crosssample}
gain.list = lapply(colnames(baf.ds),
  function(sample.name) {
    as.logical( baf.ds[, sample.name, "lrr.segs"] > 0.3 )
})
gain.mat = do.call(cbind, gain.list)
gain.freq = rowMeans(gain.mat,na.rm=TRUE)
``` 

GISTIC (by Behroukhim and Getz of the Broad Institute) is the standard 
method for assessing significance of such summaries. You'll find 
`segTable` convenient for getting your data formatted for input. I 
find it convenient to load GISTIC output as a `GRanges` for 
intersection with gene locations.

# Big Data and bigmemoryExtras
Genome-scale data can be huge and keeping everything in memory can get
you into trouble quickly, especially if you like using `r Biocpkg("parallel")`'s `mclapply`.

It is often convenient to use `BigMatrix` objects from the `r Biocpkg("bigmemoryExtras")`
package as assayDataElements, rather than base matrices.  `BigMatrix` is
based on the `r Biocpkg("bigmemory")` package, which provides a matrix API to
memory-mapped files of numeric data. This allows for data matrices
larger than R's maximum size with just the tiniest footprint in
RAM. The `r Biocpkg("bigmemoryExtras")` vignette has more details about using `eSet`-derived
classes and `BigMatrix` objects.