---
title: "Summarization and quantitative trait analysis of CNV ranges"
author:
  - name: Vinicius Henrique da Silva
    affiliation:
    - Animal Breeding and Genomics, Wageningen University & Research, Wageningen, The Netherlands
    - Animal Breeding and Genetics, Swedish University of Agricultural Sciences, Uppsala, Sweden
    - Netherlands Institute of Ecology (NIOO-KNAW), Wageningen, The Netherlands
    email: vinicius.dasilva@wur.nl
  - name: Ludwig Geistlinger
    affiliation: School of Public Health, City University of New York
    email: ludwig.geistlinger@sph.cuny.edu
package: CNVRanger
abstract: >
    The _CNVRanger_ package implements a comprehensive tool suite for the analysis 
    of copy number variation (CNV). This includes functionality for summarizing 
    individual CNV calls across a population,
    assessing overlap with functional genomic regions, and association analysis 
    with gene expression and quantitative phenotypes.
output:
  BiocStyle::html_document:
    toc: true
    toc_depth: 2
vignette: >
  % \VignetteIndexEntry{Summarization and quantitative trait analysis of CNV ranges}
  % \VignetteEngine{knitr::rmarkdown}
---

```{r setup, echo=FALSE}
suppressPackageStartupMessages({ 
    library(CNVRanger)
    library(AnnotationHub)
    library(regioneR)
    library(BSgenome.Btaurus.UCSC.bosTau6.masked)
    library(SummarizedExperiment)
    library(curatedTCGAData)
    library(TCGAutils)
    library(Gviz)
})
```

# Introduction

Copy number variation (CNV) is a frequently observed deviation from the diploid 
state due to duplication or deletion of genomic regions.
CNVs can be experimentally detected based on comparative genomic hybridization, 
and computationally inferred from SNP-arrays or next-generation sequencing data.
These technologies for CNV detection have in common that they report, for each
sample under study, genomic regions that are duplicated or deleted with respect 
to a reference. Such regions are denoted as _CNV calls_ in the following and will be considered 
the starting point for analysis with the `r Biocpkg("CNVRanger")` package.

The following figure provides an overview of the analysis capabilities of `CNVRanger`. 

```{r oview, echo=FALSE, fig.wide=TRUE}
vign.dir <- system.file("vignettes", package = "CNVRanger")
knitr::include_graphics(file.path(vign.dir, "CNVRanger.png"))
```

**(A)** The `r Biocpkg("CNVRanger")` package
imports CNV calls from a simple file format into 
`R`, and stores them in dedicated Bioconductor data structures, and 
**(B)** implements three frequently used approaches for summarizing CNV calls 
across a population: 
_(i)_ the [CNVRuler](http://www.ircgp.com/CNVRuler) procedure
that trims region margins based on regional density
[Kim et al., 2012](https://doi.org/10.1093/bioinformatics/bts239), 
_(ii)_ the reciprocal overlap procedure that requires sufficient mutual overlap 
between calls 
[Conrad et al., 2010](https://doi.org/10.1038/nature08516), and 
_(iii)_ the [GISTIC](http://www.broadinstitute.org/cancer/cga/gistic) procedure 
that identifies recurrent CNV regions
[Beroukhim et al., 2007](https://doi.org/10.1073/pnas.0710052104).
**(C)** `CNVRanger` builds on `r Biocpkg("regioneR")` 
for overlap analysis of CNVs with functional genomic regions, 
**(D)** implements RNA-seq expression Quantitative Trait Loci (eQTL) analysis 
for CNVs by interfacing with `r Biocpkg("edgeR")`, and 
**(E)** interfaces with
[PLINK](http://zzz.bwh.harvard.edu/plink) for traditional
genome-wide association studies (GWAS) between CNVs and quantitative phenotypes.

The key parts of the functionality implemented in `r Biocpkg("CNVRanger")` were 
developed, described, and applied in several previous studies:

- Genome-wide detection of CNVs and their association with meat tenderness in 
    Nelore cattle
    [da Silva et al., 2016](https://doi.org/10.1371/journal.pone.0157711)

- Widespread modulation of gene expression by copy number variation in skeletal 
    muscle
    [Geistlinger et al., 2018](https://doi.org/10.1038/s41598-018-19782-4)

- CNVs are associated with genomic architecture in a songbird 
    [da Silva et al., 2018](https://doi.org/10.1186/s12864-018-4577-1)

# Applicability and Scope
 
As described in the above publications, `r Biocpkg("CNVRanger")` has been 
developed and extensively tested for SNP-based CNV calls as obtained with
[PennCNV](https://doi.org/10.1101/gr.6861907).
We also tested `CNVRanger` for sequencing-based CNV calls as obtained with
[CNVnator](https://doi.org/10.1101/gr.114876.110) (a read-depth approach) or
[LUMPY](https://doi.org/10.1186/gb-2014-15-6-r84) (an approach that combines 
evidence from split-reads and discordant read-pairs).

In general, `r Biocpkg("CNVRanger")` can be applied to CNV calls associated with
integer copy number states, where we assume the states to be encoded as:

- `0`: homozygous deletion (2-copy loss)
- `1`: heterozygous deletion (1-copy loss)
- `2`: normal diploid state
- `3`: 1-copy gain
- `4`: amplification (>= 2-copy gain)  

Note that for CNV calling software that uses a different encoding or that 
does not provide integer copy number states, it is often possible to (at least
approximately) transform the output to a format that is compatible with the 
input format of `r Biocpkg("CNVRanger")`. See Section 4.1
[Input data format](#input-data-format) 
for details.

`r Biocpkg("CNVRanger")` is designed to work with CNV calls from one tool at a 
time. See [EnsembleCNV](https://doi.org/10.1093/nar/gkz068) and
[FusorSV](https://doi.org/10.1186/s13059-018-1404-6)
for combining CNV calls from multiple SNP-based callers or multiple 
sequencing-based callers, respectively.

`r Biocpkg("CNVRanger")` assumes CNV calls provided as input to be already
filtered by quality, using the software that was used for CNV calling, or specific
tools for that purpose. 
`r Biocpkg("CNVRanger")` provides downstream summarization and association analysis 
for CNV calls, it does not implement functions for CNV calling or quality control.
`r Biocpkg("CNVRanger")` is applicable for diploid species only.

# Key functions

Analysis step           | Function                    
----------------------- | ---------------------------------------------
(A) Input               | `GenomicRanges::makeGRangesListFromDataFrame`
(B) Summarization       | `populationRanges`                            
(C) Overlap analysis    | `regioneR::overlapPermTest`
(D) Expression analysis | `cnvEQTL`                              
(E) Phenotype analysis  | `cnvGWAS`                                    

Note: we use the `package::function` notation for functions from other packages.
For functions from this package and base R functions, we use the function name
without preceding package name.

# Reading and accessing CNV data

```{r input, echo=FALSE, fig.wide=TRUE}
vign.dir <- system.file("vignettes", package = "CNVRanger")
knitr::include_graphics(file.path(vign.dir, "Input.png"))
```

The `r Biocpkg("CNVRanger")` package uses Bioconductor core data structures 
implemented in the `r Biocpkg("GenomicRanges")` and `r Biocpkg("RaggedExperiment")` 
packages to efficiently represent, access, and manipulate CNV data.

We start by loading the package.

```{r lib}
library(CNVRanger)
```

## Input data format

`r Biocpkg("CNVRanger")` reads CNV calls from a simple file format, providing at 
least chromosome, start position, end position, sample ID, and integer copy number 
for each call.

For demonstration, we consider CNV calls as obtained with 
[PennCNV](http://penncnv.openbioinformatics.org) from SNP-chip data in a 
Brazilian cattle breed 
([da Silva et al., 2016](https://doi.org/10.1371/journal.pone.0157711)).

Here, we use a data subset and only consider CNV calls on chromosome 1 and 2,
for which there are roughly 3000 CNV calls as obtained for 711 samples. 
We use `read.csv` to read comma-separated values, but we could use a different
function if the data were provided with a different delimiter 
(for example, `read.delim` for tab-separated values). 

```{r readCalls}
data.dir <- system.file("extdata", package="CNVRanger")
call.file <- file.path(data.dir, "Silva16_PONE_CNV_calls.csv")
calls <- read.csv(call.file, as.is=TRUE)
nrow(calls)
head(calls)
```

```{r nrSamples}
length(unique(calls[,"NE_id"]))
```

We observe that this example dataset satisfies the basic five-column input format 
required by `r Biocpkg("CNVRanger")`. 

The last column contains the integer copy 
number state for each call, encoded as

- `0`: homozygous deletion (2-copy loss)
- `1`: heterozygous deletion (1-copy loss)
- `2`: normal diploid state
- `3`: 1-copy gain
- `4`: amplification (>= 2-copy gain)  

For CNV detection software that uses a different encoding, it is necessary to 
convert to the above encoding.
For example, GISTIC uses 

- `-2`: homozygous deletion (2-copy loss)
- `-1`: heterozygous deletion (1-copy loss)
- `0`: normal diploid state
- `1`: 1-copy gain
- `2`: amplification (>= 2-copy gain)  

which can be converted by simply adding 2.

In Section 7.2
[Application to TCGA data](#application-to-tcga-data-stored-in-a-multiassayexperiment) 
we also describe how to transform segmented log2 copy ratios to integer copy number states.

The basic five-column input format can be augmented with additional columns,
providing additional characteristics and metadata for each CNV call.
In Section 8 
[CNV-phenotype association analysis](#cnv-phenotype-association-analysis), we 
demonstrate how to make use of such an extended input format. 


## Representation as a `GRangesList`

Once read into an R `data.frame`, we group the calls by sample ID and convert
them to a `GRangesList`.
Each element of the list corresponds to a sample, and contains the genomic 
coordinates of the CNV calls for this sample (along with the copy number state 
in the `state` metadata column).  

```{r cnvCalls}
grl <- GenomicRanges::makeGRangesListFromDataFrame(calls, 
    split.field="NE_id", keep.extra.columns=TRUE)
grl
```

The advantage of representing the CNV calls as a `GRangesList` is that it allows
to leverage the comprehensive set of operations on genomic regions implemented 
in the `r Biocpkg("GenomicRanges")` packages - for instance, sorting of the calls 
according to their genomic coordinates.

```{r sortCalls}
grl <- GenomicRanges::sort(grl)
grl
```

## Representation as a `RaggedExperiment`

An alternative matrix-like representation of the CNV calls can be obtained with 
the `r Biocpkg("RaggedExperiment")` data class. 
It resembles in many aspects the `r Biocpkg("SummarizedExperiment")`
data class for storing gene expression data as e.g. obtained with RNA-seq. 

```{r RaggedExperiment}
ra <- RaggedExperiment::RaggedExperiment(grl)
ra
```

As apparent from the `dim` slot of the object, it stores the CNV calls in the 
rows and the samples in the columns. 
Note that the CN state is now represented as an assay matrix which can be 
easily accessed and subsetted.   

```{r RaggedExperiment-assay}
RaggedExperiment::assay(ra[1:5,1:5])
```

As with `r Biocpkg("SummarizedExperiment")` objects, additional information for 
the samples are annotated in the `colData` slot.
For example, we annotate the steer weight and its feed conversion ratio (FCR) 
using simulated data.
Feed conversion ratio is the ratio of dry matter intake to live-weight gain. 
A typical range of feed conversion ratios is 4.5-7.5 with a lower number being 
more desirable as it would indicate that a steer required less feed per pound of
gain.

```{r RaggedExperiment-colData}
weight <- rnorm(ncol(ra), mean=1100, sd=100)
fcr <- rnorm(ncol(ra), mean=6, sd=1.5)
RaggedExperiment::colData(ra)$weight <- round(weight, digits=2)
RaggedExperiment::colData(ra)$fcr <- round(fcr, digits=2)
RaggedExperiment::colData(ra)
```

# Summarizing individual CNV calls across a population

```{r summarization, echo=FALSE, out.width = "40%", out.extra='style="float:right; padding:10px"'}
vign.dir <- system.file("vignettes", package = "CNVRanger")
knitr::include_graphics(file.path(vign.dir, "Summarization.png"))
```
In CNV analysis, it is often of interest to summarize individual calls across
the population, (i.e. to define CNV regions), for subsequent association
analysis with expression and phenotype data.
In the simplest case, this just merges overlapping individual calls into 
summarized regions.
However, this typically inflates CNV region size, and more appropriate approaches
have been developed for this purpose.

As mentioned in the Introduction, we emphasize the need for quality control of 
CNV calls and appropriate accounting for sources of technical bias before 
applying these summarization functions (or in general downstream analysis with
CNVRanger).

For instance, protocols for read-depth CNV calling typically exclude calls 
overlapping defined repetitive and low-complexity regions including the UCSC 
list of segmental duplications 
[Trost et al., 2018](https://doi.org/10.1016/j.ajhg.2017.12.007),
[Zhou et al., 2018](https://doi.org/10.1136/jmedgenet-2018-105272).
We also note that 
[CNVnator](https://doi.org/10.1101/gr.114876.110), a very popular read-depth 
CNV caller, implements the $q0$-filter to explicitely flag and, if desired, 
exclude calls that are likely to stem from such regions.  
If systematically over-represented in the input CNV calls, summarization procedures
such as GISTIC will identify these regions as recurrent independent of whether 
there are biological or technical reasons for that.

## Trimming low-density areas

Here, we use the approach from [CNVRuler](http://www.ircgp.com/CNVRuler) 
to summarize CNV calls to CNV regions (see 
[Figure 1](https://academic.oup.com/view-large/figure/83392426/bts239f1.jpeg)
in 
[Kim et al., 2012](https://doi.org/10.1093/bioinformatics/bts239)
for an illustration of the approach). 
This trims low-density areas as defined by the `density` argument, 
which is set here to <10\% of the number of calls within a summarized region.

```{r cnvrs}
cnvrs <- populationRanges(grl, density=0.1)
cnvrs
```

Note that CNV frequency (number of samples overlapping each region) and CNV type
(gain, loss, or both) have also been annotated in the columns `freq` and `type`,
respectively.

## Reciprocal overlap

We also provide an implementation of the _Reciprocal Overlap (RO)_ procedure that 
requires calls to sufficiently overlap with one another as e.g. used by 
[Conrad et al., 2010](https://doi.org/10.1038/nature08516).
This merges calls with an RO above a threshold as given by the `ro.thresh` argument.
For example, an RO of 0.51 between two genomic regions _A_ and _B_ requires that 
_B_ overlaps at least 51\% of _A_, *and* that _A_ also overlaps at least 51\% of _B_.

```{r cnvrsRO}
ro.cnvrs <- populationRanges(grl[1:100], mode="RO", ro.thresh=0.51)
ro.cnvrs
```

## Identifying recurrent regions

In particular in cancer, it is important to distinguish driver from 
passenger mutations, i.e. to distinguish meaningful events from random 
background aberrations. 
The [GISTIC](http://www.broadinstitute.org/cancer/cga/gistic) method identifies 
those regions of the genome that are aberrant more often than would be expected 
by chance, with greater weight given to high amplitude events 
(high-level copy-number gains or homozygous deletions) that are less likely to 
represent random aberrations 
([Beroukhim et al., 2007](https://doi.org/10.1073/pnas.0710052104)).

By setting `est.recur=TRUE`, we deploy a `GISTIC`-like significance estimation 

```{r gistic}
cnvrs <- populationRanges(grl, density=0.1, est.recur=TRUE)
cnvrs
```

and filter for recurrent CNVs that exceed a significance threshold of 0.05. 

```{r recurr}
subset(cnvrs, pvalue < 0.05)
```

We can illustrate the landscape of recurrent CNV regions using the function 
`plotRecurrentRegions`.
We therefore provide the summarized CNV regions, a valid 
[UCSC](https://genome.ucsc.edu/cgi-bin/hgGateway) genome assembly, and a 
chromosome of interest.   

```{r recurrViz}
plotRecurrentRegions(cnvrs, genome="bosTau6", chr="chr1")
```

The function plots (from top to bottom):
(i) an ideogram of the chromosome (note that staining bands are not available for `bosTau6`), 
(ii) a genome axis indicating the chromosomal position, (iii) a bar plot showing
for each CNV region the number of samples with a CNV call in that region, and 
(iv) an annotation track that indicates whether this is a _recurrent_ region 
according to a significance threshold (an argument to the function, default: 0.05).

# Overlap analysis of CNVs with functional genomic regions

```{r overlap, echo=FALSE, out.width = "40%", out.extra='style="float:right; padding:10px"'}
vign.dir <- system.file("vignettes", package = "CNVRanger")
knitr::include_graphics(file.path(vign.dir, "Overlap.png"))
```
Once individual CNV calls have been summarized across the population, it is 
typically of interest whether the resulting CNV regions overlap with functional 
genomic regions such as genes, promoters, or enhancers. 
As a certain amount of overlap can be expected just by chance, an assessment of 
statistical significance is needed to decide whether the observed overlap is 
greater (enrichment) or less (depletion) than expected by chance. 

The `r Biocpkg("regioneR")` package implements a general framework for testing 
overlaps of genomic regions based on permutation sampling.
This allows to repeatedly sample random regions from the genome, matching size 
and chromosomal distribution of the region set under study (here: the CNV regions).
By recomputing the overlap with the functional features in each permutation, 
statistical significance of the observed overlap can be assessed.

We demonstrate in the following how this strategy can be used to assess the 
overlap between the detected CNV regions and protein-coding regions in the 
cattle genome. 
We expect to find a depletion as protein-coding regions are highly conserved and
rarely subject to long-range structural variation such as CNV. 
Hence, is the overlap between CNVs and protein-coding genes less than expected by
chance?

To obtain the location of protein-coding genes, we query available 
_Bos taurus_ annotation from Ensembl 

```{r getBtGenes}
library(AnnotationHub)
ah <- AnnotationHub::AnnotationHub()
ahDb <- AnnotationHub::query(ah, pattern = c("Bos taurus", "EnsDb"))
ahDb
```

and retrieve gene coordinates in the UMD3.1 assembly (Ensembl 92).

```{r getBtGenes2}
ahEdb <- ahDb[["AH60948"]]
bt.genes <- ensembldb::genes(ahEdb)
GenomeInfoDb::seqlevelsStyle(bt.genes) <- "UCSC"
bt.genes
```

To speed up the example, we restrict analysis to chromosomes 1 and 2. 

```{r formatBtGenes}
sel.genes <- subset(bt.genes, seqnames %in% paste0("chr", 1:2))
sel.genes <- subset(sel.genes, gene_biotype == "protein_coding")
sel.cnvrs <- subset(cnvrs, seqnames %in% paste0("chr", 1:2))
```

Now, we are applying an overlap permutation test with 100 permutations
(`ntimes=100`), while maintaining chromosomal distribution of the CNV
region set (`per.chromosome=TRUE`). 
Furthermore, we use the option `count.once=TRUE` to count an overlapping CNV 
region only once, even if it overlaps with 2 or more genes.
We also allow random regions to be sampled from the entire genome (`mask=NA`),
although in certain scenarios masking certain regions such as telomeres and 
centromeres is advisable.
Also note that we use 100 permutations for demonstration only.
To draw robust conclusions a minimum of 1000 permutations should be carried out.

```{r ovlpTest, warning=FALSE}
library(regioneR)
library(BSgenome.Btaurus.UCSC.bosTau6.masked)
res <- regioneR::overlapPermTest(A=sel.cnvrs, B=sel.genes, ntimes=100, 
    genome="bosTau6", mask=NA, per.chromosome=TRUE, count.once=TRUE)
res
```

```{r permDist}
summary(res[[1]]$permuted)
```

The resulting permutation *p*-value indicates a significant depletion. Out of
the `r length(sel.cnvrs)` CNV regions, `r res[[1]]$observed` overlap with at 
least one gene. 
In contrast, when repeatedly drawing random regions matching the CNV regions in 
size and chromosomal distribution, the mean number of overlapping
regions across permutations was `r round(mean(res[[1]]$permuted), digits=1)` 
$\pm$ `r round(sd(res[[1]]$permuted), digits=1)`.

This finding is consistent with our observations across the whole genome
([da Silva et al., 2016](https://doi.org/10.1371/journal.pone.0157711)) 
and findings from the 1000 Genomes Poject 
([Sudmant et al., 2015](https://www.nature.com/articles/nature15394)).

```{r vizPermTest}
plot(res)
```

Note: the function `regioneR::permTest` allows to incorporate user-defined
functions for randomizing regions and evaluating additional measures of overlap
such as total genomic size in bp.

# CNV-expression association analysis
```{r expression, echo=FALSE, out.width = "40%", out.extra='style="float:right; padding:10px"'}
vign.dir <- system.file("vignettes", package = "CNVRanger")
knitr::include_graphics(file.path(vign.dir, "Expression.png"))
```
Studies of expression quantitative trait loci (eQTLs) aim at the discovery of 
genetic variants that explain variation in gene expression levels 
([Nica and Dermitzakis, 2013](https://www.ncbi.nlm.nih.gov/pubmed/23650636)).
Mainly applied in the context of SNPs, the concept also naturally extends to the
analysis of CNVs. 

The `r Biocpkg("CNVRanger")` package implements association testing between CNV 
regions and RNA-seq read counts using `r Biocpkg("edgeR")`, which applies 
generalized linear models based on the negative-binomial distribution 
while incorporating normalization factors for different library sizes.

In the case of only one CN state deviating from 2n for a CNV region under 
investigation, this reduces to the classical 2-group
comparison.  For more than two states (e.g. 0n, 1n, 2n), edgeR’s
ANOVA-like test is applied to test all deviating groups for
significant expression differences relative to 2n.

## Application to individual CNV and RNA-seq assays

We demonstrate the functionality by loading RNA-seq read count data from 
skeletal muscle samples for 183 Nelore cattle steers, which we analyzed for 
CNV-expression effects as previously described 
([Geistlinger et al., 2018](https://doi.org/10.1038/s41598-018-19782-4)).

```{r rseqdata}
rseq.file <- file.path(data.dir, "counts_cleaned.txt")
rcounts <- read.delim(rseq.file, row.names=1, stringsAsFactors=FALSE)
rcounts <- as.matrix(rcounts)
dim(rcounts)
rcounts[1:5, 1:5]
```

For demonstration, we restrict analysis to the 939 genes on chromosome 1 and 2, 
and store the RNA-seq expression data in a `r Biocpkg("SummarizedExperiment")`.

```{r rse}
library(SummarizedExperiment)
rranges <- GenomicRanges::granges(sel.genes)[rownames(rcounts)]
rse <- SummarizedExperiment(assays=list(rcounts=rcounts), rowRanges=rranges)
rse
```

Assuming distinct modes of action, effects observed in the CNV-expression analysis 
are typically divided into (i) local effects (*cis*), where expression changes 
coincide with CNVs in the respective genes, and (ii) distal effects (*trans*), where
CNVs supposedly affect trans-acting regulators such as transcription factors.

However, due to power considerations and to avoid detection of spurious effects,
stringent filtering of (i) not sufficiently expressed genes, and (ii) CNV regions
with insufficient sample size in groups deviating from 2n, should be carried out 
when testing for distal effects. 
Local effects have a clear spatial indication and the number of genes locating 
in or close to a CNV region of interest is typically small; testing for 
differential expression between CN states is thus generally better powered for 
local effects and less stringent filter criteria can be applied. 

In the following, we carry out CNV-expression association analysis by providing 
the CNV regions to test (`cnvrs`), the individual CNV calls (`grl`) to determine 
per-sample CN state in each CNV region, the RNA-seq read counts (`rse`),
and the size of the genomic window around each CNV region (`window`).
The `window` argument thereby determines which genes are considered for testing 
for each CNV region and is set here to 1 Mbp. 

Further, use the `filter.by.expr` and `min.samples` arguments to exclude from 
the analysis (i) genes with very low read count across samples, and (ii) CNV 
regions with fewer than `min.samples` samples in a group deviating from 2n.

```{r cnvEQTL}
res <- cnvEQTL(cnvrs, grl, rse, window="1Mbp", verbose=TRUE)
res
```

The resulting `GRangesList` contains an entry for each CNV region tested, storing
the genes tested in the genomic window around the CNV 
region, and (i) log2 fold change with respect to the 2n group, 
(ii) edgeR's DE _p_-value, and 
(iii) the (per default) Benjamini-Hochberg adjusted _p_-value. 

## Application to TCGA data stored in a `MultiAssayExperiment`

In the previous section, we individually prepared the CNV and RNA-seq data for 
CNV-expression association analysis.
In the following, we demonstrate how to perform an integrated preparation of the
two assays when stored in a `r Biocpkg("MultiAssayExperiment")`.
We therefore consider glioblastoma 
[GBM](https://cancergenome.nih.gov/cancersselected/glioblastomamultiforme) 
data from 
The Cancer Genome Atlas [TCGA](https://cancergenome.nih.gov), 
which can conveniently be accessed with the `r Biocpkg("curatedTCGAData")` package.

```{r tcgaSetup, message=FALSE}
library(curatedTCGAData)
gbm <- curatedTCGAData::curatedTCGAData("GBM", 
        assays=c("GISTIC_Peaks", "CNVSNP", "RNASeq2GeneNorm"), 
        dry.run=FALSE)
gbm
```

The returned `MultiAssayExperiment` contains three assays:

- the SNP-based CNV calls stored in a `RaggedExperiment` (`GBM_CNVSNP`),
- the recurrent CNV regions summarized across the population using the
    [GISTIC](http://www.broadinstitute.org/cancer/cga/gistic) method 
    (`GBM_GISTIC_Peaks`), and
- the normalized RNA-seq gene expression values in a `SummarizedExperiment` 
    (`GBM_RNASeq2GeneNorm`).

To annotate the genomic coordinates of the genes measured in the RNA-seq assay, 
we use the function `symbolsToRanges` from the `r Biocpkg("TCGAutils")` package.
For demonstration, we restrict the analysis to chromosome 1 and 2. 

```{r tcgaGeneAnno, message=FALSE}
library(TCGAutils)
gbm <- TCGAutils::symbolsToRanges(gbm, unmapped=FALSE)
for(i in 1:3) 
{
    rr <- rowRanges(gbm[[i]])
    GenomeInfoDb::genome(rr) <- "hg19"
    GenomeInfoDb::seqlevelsStyle(rr) <- "UCSC"
    ind <- as.character(seqnames(rr)) %in% c("chr1", "chr2")
    rowRanges(gbm[[i]]) <- rr
    gbm[[i]] <- gbm[[i]][ind,]
}
gbm
```

We now restrict the analysis to intersecting patients of the three assays 
using `MultiAssayExperiment`'s `intersectColumns` function, and select 
_Primary Solid Tumor_ samples using the `splitAssays` function from the 
`r Biocpkg("TCGAutils")` package.

```{r gbmIntersect}
gbm <- MultiAssayExperiment::intersectColumns(gbm)
TCGAutils::sampleTables(gbm)
data(sampleTypes, package="TCGAutils")
sampleTypes
gbm <- TCGAutils::splitAssays(gbm, sampleCodes="01")
gbm
```

The SNP-based CNV calls from TCGA are provided as segmented log2 copy number ratios. 

```{r segmean}
ind <- grep("CNVSNP", names(gbm))
head( mcols(gbm[[ind]]) )
summary( mcols(gbm[[ind]])$Segment_Mean )
```

It is thus necessary to convert them to integer copy number states for further 
analysis with `r Biocpkg("CNVRanger")`.

In a diploid genome, a single-copy gain in a perfectly pure, homogeneous sample 
has a copy ratio of 3/2. On log2 scale, this is log2(3/2) = 0.585, and a 
single-copy loss is log2(1/2) = -1.0.

We can roughly convert a log ratio `lr` to an integer copy number by

```{r lr2int, eval=FALSE}
round( (2^lr) * 2)
```

Note that this is not the ideal way to calculate absolute integer copy numbers.
Especially in cancer, differences in tumor purity, tumor ploidy, and subclonality
can substantially interfere with the assumption of a pure homogeneous sample.
See [ABSOLUTE](https://doi.org/10.1038/nbt.2203) and the `r Biocpkg("PureCN")` 
package for accurately taking such tumor characteristics into account.

However, without additional information we transform the log ratios into
integer copy number states using the rough approximation outlined above.

```{r transformToStates}
smean <- mcols(gbm[[ind]])$Segment_Mean
state <- round(2^smean * 2)
state[state > 4] <- 4
mcols(gbm[[ind]])$state <- state
gbm[[ind]] <- gbm[[ind]][state != 2,]
mcols(gbm[[ind]]) <- mcols(gbm[[ind]])[,3:1]
table(mcols(gbm[[ind]])$state)
```

The data is now ready for CNV-expression association analysis, where we find 
only four CNV regions with sufficient sample size for testing using the default
value of 10 for the `minSamples` argument.

```{r gbmEQTL}
res <- cnvEQTL(cnvrs="01_GBM_GISTIC_Peaks-20160128", 
    calls="01_GBM_CNVSNP-20160128", 
    rcounts="01_GBM_RNASeq2GeneNorm-20160128_ranged", 
    data=gbm, window="1Mbp", verbose=TRUE)
res
```

We can illustrate differential expression of genes in the neighborhood of a CNV
region of interest using the function `plotEQTL`.
 
```{r plotEQTL}
res[2]
(r <- GRanges(names(res)[2]))
plotEQTL(cnvr=r, genes=res[[2]], genome="hg19", cn="CN1")
```

The plot shows consistent decreased expression (negative log2 fold change) 
of genes in the neighborhood of the CNV region, when comparing samples with a one 
copy loss (1$n$) in that region to the 2$n$ reference group.

Note that a significant expression change is not only observed for genes locating
within the CNV region (dosage effect, here: PARK7), but also genes locating in 
close proximity of the CNV region (neighborhood effect, here: CAMTA1 and RERE).
This is consistent with previous observations in mouse 
[Cahan et al., 2009](https://doi.org/10.1038/ng.350) and our observations in cattle 
[Geistlinger et al., 2018](https://doi.org/10.1038/s41598-018-19782-4).

# CNV-phenotype association analysis

```{r phenotype, echo=FALSE, out.width = "40%", out.extra='style="float:right; padding:10px"'}
vign.dir <- system.file("vignettes", package = "CNVRanger")
knitr::include_graphics(file.path(vign.dir, "Phenotype.png"))
```
Specifically developed for CNV calls inferred from SNP-chip data, 
`r Biocpkg("CNVRanger")` allows to carry out a probe-level genome-wide association 
study (GWAS) with quantitative phenotypes.
CNV calls from other sources such as sequencing data are also supported by using 
the start and end position of each call as the corresponding probes.

As previously described [da Silva et al., 2016](https://doi.org/10.1371/journal.pone.0157711),
we construct CNV segments from probes representing common CN polymorphisms
(allele frequency >1\%), and carry out a GWAS as implemented in 
[PLINK](http://zzz.bwh.harvard.edu/plink/gvar.shtml)
using a standard linear regression of phenotype on allele dosage.

For CNV segments composed of multiple probes, the segment _p_-value is chosen 
from the probe _p_-values, using either the probe with minimum _p_-value or the 
probe with maximum CNV frequency.

For demonstration, we use CNV data of a wild population of songbirds 
([da Silva et al., 2018](https://doi.org/10.1186/s12864-018-4577-1)).

```{r readCallsGWAS}
cnv.loc <- file.path(data.dir, "CNVOut.txt") 
cnv.calls <- read.delim(cnv.loc, as.is=TRUE) 
head(cnv.calls)
```

Here, we use the extensibility of the basic five-column input format described
in [Section 4.1](#input-data-format). 
In addition to the required five columns (providing chromosome, start position, 
end position, sample ID, and integer copy number state), we provided three 
optional columns storing the number of probes supporting the call, and the 
Affymetrix ID of the first and last probe contained in the call.

As these columns are optional, it is not ultimately necessary to provide them 
in order to run a CNV GWAS. 
However, we recommend to provide this information when available as it 
allows for a more fine-grained probe-by-probe GWAS.

As described in [Section 4.2](#representation-as-a-grangeslist), we store the CNV calls 
in a `GRangesList` for further analysis.

```{r calls2grl}
cnv.calls <- GenomicRanges::makeGRangesListFromDataFrame(cnv.calls, 
    split.field="sample.id", keep.extra.columns=TRUE)
sort(cnv.calls)
```

In the following, we use genomic estimated breeding values (GEBVs) for breeding 
time (BT) as the quantitative phenotype, and accordingly analyze for each CNV 
region whether change in copy number is associated with change in the genetic 
potential for breeding time. 

## Setting up a CNV-GWAS

For compatibility with [PLINK](http://zzz.bwh.harvard.edu/plink/gvar.shtml)'s 
[fam file format](http://www.cog-genomics.org/plink/1.9/formats#fam),
we read phenotype information from a tab-delimited file, containing exactly four 
columns: sample ID, family ID, sex, and the quantitative phenotype (here: breeding time, BT). 

```{r phenoData}
phen.loc <- file.path(data.dir, "Pheno.txt")
phen.df <- read.delim(phen.loc, as.is=TRUE)
head(phen.df)
```

Although family ID and sex are listed as columns, this info is not considered in 
the current implementation and can be set to `NA`. 

As described in [Section 4.3](#representation-as-a-raggedexperiment), we 
combine the CNV calls with the phenotype information in a `RaggedExperiment` for
coordinated representation and analysis.


```{r importPhen}
re <- RaggedExperiment::RaggedExperiment(cnv.calls, colData=phen.df)
re
```

If probe information is available and has been annotated to the CNV calls, as we
did above, the probe IDs and corresponding genomic positions should be provided 
in a separate file.  
For compatibility with [PLINK](http://zzz.bwh.harvard.edu/plink/gvar.shtml)'s 
[map file format](http://www.cog-genomics.org/plink/1.9/formats#map), this is 
expected to be a tab-delimited file containing exactly three columns: probe ID,
chromosome, and the position in bp.

```{r map}
map.loc <- file.path(data.dir, "MapPenn.txt")
map.df <- read.delim(map.loc, as.is=TRUE)
head(map.df)
```

Given a `RaggedExperiment` storing CNV calls together with phenotype information, 
and optionally a map file for probe-level coordinates, the `setupCnvGWAS` function 
sets up all files needed for the GWAS analysis.

The information required for analysis is stored in the resulting `phen.info` list:

```{r importPhenRagged}
phen.info <- setupCnvGWAS("example", cnv.out.loc=re, map.loc=map.loc)
phen.info
```

The last item of the list displays the working directory:

```{r Wdir}
all.paths <- phen.info$all.paths
all.paths
```

For the GWAS, chromosome names are assumed to be `integer` (i.e. `1, 2, 3, ...`).
Non-integer chromosome names can be encoded by providing a `data.frame` that 
describes the mapping from `character` names to corresponding integers. 

For the example data, chromosomes _1A_, _4A_, _25LG1_, _25LG2_, and _LGE22_ 
are correspondingly encoded via 

```{r CNVGWASNames}
# Define chr correspondence to numeric
chr.code.name <- data.frame(   
                    V1 = c(16, 25, 29:31), 
                    V2 = c("1A", "4A", "25LG1", "25LG2", "LGE22"))
```

## Running a CNV-GWAS

We can then run the actual CNV-GWAS, here without correction for multiple testing
which is done *for demonstration only*. 
In real analyses, multiple testing correction is recommended to avoid inflation 
of false positive findings.  

```{r CNVGWA}
segs.pvalue.gr <- cnvGWAS(phen.info, chr.code.name=chr.code.name, method.m.test="none")
segs.pvalue.gr
```

The CNV-GWAS uses the concept of CNV segments to define more fine-grained CNV loci
within CNV regions.

```{r, echo=FALSE, fig.cap="Definition of CNV segments. The figure shows construction of a CNV segment from 4 individual CNV calls in a CNV region based on pairwise copy number concordance between adjacent probes.", out.width = '100%'}
knitr::include_graphics(file.path(vign.dir, "CNVsegments.png"))
```

This procedure was originally proposed in our previous work in Nelore cattle 
([da Silva et al., 2016](https://doi.org/10.1371/journal.pone.0157711)) and 
defines CNV segments based on CNV genotype similarity of subsequent SNP probes. 

The default is `min.sim=0.95`, which will continuously add probe positions to a 
given CNV segment until the pairwise genotype similarity drops below 95%. 
An example of detailed up-down CNV genotype concordance that is used for the 
construction of CNV segments is given in S12 Table in 
[da Silva et al., 2016](https://doi.org/10.1371/journal.pone.0157711). 

As `PLINK` returns a _p_-value for each probe, only one of the _p_-values of the
probes contained in a CNV segment is chosen as the segment _p_-value. 

This is similar to a common approach used in differential expression (DE) analysis of 
microarray gene expression data, where typically the most significant DE probe
is chosen in case of multiple probes mapping to the same gene. 

Here, the representative probe for the CNV segment can be chosen to be the probe 
with lowest _p_-value (`assign.probe="min.pvalue"`, default) or the one with highest 
CNV frequency (`assign.probe="high.freq"`). 

Multiple testing correction based on the number of CNV segments tested is carried
out using the FDR approach (default). 
Results can then be displayed as for regular GWAS via a Manhattan plot 
(which can optionally be exported to a pdf file). 

```{r manh}
## Define the chromosome order in the plot
order.chrs <- c(1:24, "25LG1", "25LG2", 27:28, "LGE22", "1A", "4A")

## Chromosome sizes
chr.size.file <- file.path(data.dir, "Parus_major_chr_sizes.txt")
chr.sizes <- scan(chr.size.file)
chr.size.order <- data.frame(chr=order.chrs, sizes=chr.sizes, stringsAsFactors=FALSE)

## Plot a pdf file with a manhatthan within the 'Results' workfolder
plotManhattan(all.paths, segs.pvalue.gr, chr.size.order, plot.pdf=FALSE)
```

## Producing a GDS file in advance

The genomic data structure (GDS) file format supports efficient memory management
for genotype analysis.
To make use of this efficient data representation, CNV genotypes analyzed with 
the `cnvGWAS` function 
are stored in a `CNV.gds` file, which is automatically produced and placed in 
the `Inputs` folder (i.e. `all.paths[1]`). 

Therefore, running a GWAS implies that any GDS file produced by previous analysis 
will be overwritten. Use `produce.gds=FALSE` to avoid overwriting in the GWAS run. 

For convenience, a GDS file can be produced before the GWAS analysis with the 
`generateGDS` function.
This additionally returns a `GRanges` object containing the genomic 
position, name and, frequency of each probe used to construct the CNV segments
for the GWAS analysis. 

Note that `probes.cnv.gr` object contains the integer chromosome names 
(as the GDS file on disk). 
Only the `segs.pvalue.gr`, which stores the GWAS results, has the character 
chromosome names.  

```{r prodGDS}
## Create a GDS file in disk and export the SNP probe positions
probes.cnv.gr <- generateGDS(phen.info, chr.code.name=chr.code.name)
probes.cnv.gr

## Run GWAS with existent GDS file
segs.pvalue.gr <- cnvGWAS(phen.info, chr.code.name=chr.code.name, produce.gds=FALSE)
```

## Using relative signal intensities 

SNP-based CNV callers such as [PennCNV](http://penncnv.openbioinformatics.org) 
and [Birdsuit](https://www.broadinstitute.org/birdsuite/birdsuite)
infer CNVs from SNP-chip intensities (log R ratios, LRRs) and allele frequencies 
(B allelel frequencies, BAFs).
As an auxiliary analysis, it can be interesting to carry out the GWAS based on the 
LRR relative signal intensities itself 
([da Silva et al., 2018](https://doi.org/10.1186/s12864-018-4577-1)). 

To perform the GWAS using LRR values, import the LRR/BAF values and set `run.lrr=TRUE` 
in the `cnvGWAS` function:

```{r importLRR}
# List files to import LRR/BAF 
files <- list.files(data.dir, pattern = "\\.cnv.txt.adjusted$")
samples <- sub(".cnv.txt.adjusted$", "", files)
samples <- sub("^GT","", samples)
sample.files <- data.frame(file.names=files, sample.names=samples)
 
# All missing samples will have LRR = '0' and BAF = '0.5' in all SNPs listed in the GDS file
importLrrBaf(all.paths, data.dir, sample.files, verbose=FALSE)

# Read the GDS to check if the LRR/BAF nodes were added
cnv.gds <- file.path(all.paths[1], "CNV.gds")
(genofile <- SNPRelate::snpgdsOpen(cnv.gds, allow.fork=TRUE, readonly=FALSE))
gdsfmt::closefn.gds(genofile)

# Run the CNV-GWAS with existent GDS
segs.pvalue.gr <- cnvGWAS(phen.info, chr.code.name=chr.code.name, produce.gds=FALSE, run.lrr=TRUE)
```

# Session info
```{r sessionInfo}
sessionInfo()
```