---
title: "PrecisionTrialDrawer - To analyze and design NGS-based gene panels"
author: 
- name: Giorgio Melloni
  affiliation: Brigham and Women's Hospital, Boston, MA.
- name: Alessandro Guida
  affiliation: European Institute of Oncology, Milan, Italy
- name: Luca Mazzarella
  affiliation: European Institute of Oncology, Milan, Italy
abstract: >
  In recent years, two new forms of clinical trial designs emerged 
  in clinical cancer genomics: umbrella and basket trials. 
  Designs of such trials can be extremely difficult, starting 
  from the creation of a valid screening panel, to the optimization 
  of sequencing space and to the calculation of a correct sample 
  size for the trial. PTD provides a series of tools that 
  can help bioinformaticians, clinicians and biostatisticians to 
  design, analyze and finalize a custom gene panel for cancer genomics.
vignette: >
  %\VignetteIndexEntry{Bioconductor style for HTML documents}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
output: 
  BiocStyle::html_document:
    self_contained: false
    toc_float: false
---

```{r macro, echo=FALSE}
library(BiocStyle)
```

```{r setup, include=FALSE}
# I can measure the timing of all the chunks
knitr::knit_hooks$set(timeit = local({
  now = NULL
  function(before, options) {
    if (before) {
      now <<- Sys.time()
    } else {
      res = difftime(Sys.time(), now)
      now <<- NULL
      # use options$label if you want the chunk label as well
      paste('Time for this code chunk:', as.character(res))
    }
  }})
)
```

# Introduction

In the era of personalized medicine and personalized 
health there is a tremendous effort 
to bring NGS based technology in the clinical routine. 
Nevertheless, the information coming 
from research studies is in most cases useless for the 
clinical practice and in addition too 
expensive. To bridge the gap between cancer research and clinical genomics, 
we propose `r Rpackage("PrecisionTrialDrawer")`, 
a suite of simple tools to explore the ability 
of less expensive cancer gene panels to be used in clinical trials. 
With this package, you can use information coming from different sources 
like mutations, expression, copynumber and fusion data 
and apply them to your cancer panel 
to see for example how many patients are covered 
by a specific mutations or better, 
how many patients would be covered by a drug 
that acts on multiple targets. 
Given a panel and tumor types to analyze, `r Rpackage("PrecisionTrialDrawer")`
can be a terrific tool to create a pilot 
project for a basket or umbrella design 
and also to design the library to be ready for sequencing.

# Data Preparation

The only input data the user has to provide is the panel itself. 
We create an easy to read standard format, suitable for every kind 
of alteration supported by `r Rpackage("PrecisionTrialDrawer")`. 
For example, while in the research practice 
genomic regions are a standard way to 
represent mutations (e.g. VCF style, 1:10000:A,C) 
this is not the most used form of 
representing druggable or pathogenic variants in clinical routine. 
The use of amino acid notation (*BRAF* V600E is 
sensitive to *Vemurafenib*) or dbsnp rs numbers 
(for example in [OMIM](http://www.omim.org/) database) 
is way more common but these are formats incompatible 
with NGS genomic format. With our tool, every kind 
of notation is accepted so that you can easily integrate 
any database at your disposal without any particular effort.

## CancerPanel object

First of all, let's see what is the format of a cancer panel:

```{r chunk1, echo=TRUE, eval=TRUE , message=FALSE, warning=FALSE , timeit=NULL}
library(PrecisionTrialDrawer)
library(knitr)
data(panelexample)
knitr::kable(panelexample)
```

A cancer panel is a data.frame of character columns and every row represents an
alteration and its properties.  We put together a panel using known specific
drug targets (like Vemurafenib for *BRAF* V600E/K) or pathway targeting drugs
(like kinases inhibitor Vandetanib). We also add known driver/prognostic genes
with no associated drug  (like *KRAS* or *TP53*). You have to bear in mind that
the panel is both what you target and what you analyze.  If you ask for a *BRAF*
V600E alteration, you will sequence 3 bases but only V600E will have the
characteristic  of being druggable and will be considered in simulation
analysis. In other words, by putting V600E you occupies 3 bases  in your panel
but in simulations only a change from valine to glutamic acid will be considered
druggable.

 - **drug**: drug names or compound or empty lines if this is not a druggable
alteration;  - **gene_symbol**: HGNC official gene symbols where the alteration;
- **alteration**: can be one of the following 'SNV', 'CNA', 'expression',
'fusion':  - **exact\_alteration**: a character vector that identifies the exact
alteration depending on alteration value. For example, alteration
'CNA' could correspond to exact_alteration 'amplification' or 'deletion'.  -
**mutation\_specification**: a character vector that must be empty if alteration
is not 'SNV' and identifies              the exact mutation in the format
specified in *exact_alteration*  - **group**: this is a free field that is
useful to identify groups of alteration. In the example is used to divide
druggable from non-druggable driver variants. Another use could be to identify
different panels and compare them.

We now present a list of possible alteration formats that must be respected.
Possible values for a cancer panel in columns alteration, **exact\_alteration**
and mutation_specification are:

|alteration |exact.alteration    |mutation_specification |
|:----------|:-------------------|:----------------------| |CNA
|amplification       |                       | |CNA        |deletion
|                       | |expression |up                  |
| |expression |down                |                       | |fusion     |
|                       | |SNV        |                    |
| |SNV        |mutation_type       |missense               | |SNV
|mutation_type       |truncating             | |SNV        |amino_acid_position
|300-300                | |SNV        |amino_acid_position |300-350
| |SNV        |amino_acid_variant  |V600E                  | |SNV
|genomic_position    |13:20000-40000         | |SNV        |genomic_position
|13:20000-20000         | |SNV        |genomic_variant     |13:20000:A,C
| |SNV        |dbSNP_rs            |rs1234567              |


Copynumber and expression are basically self explained. Mutations can be
expressed in a plethora of ways  but remember to keep the 1-base convention and
in case of single position (amino acid or genomic) to  replicate start and end
(e.g. 300-300). If you don't want to specify a particular mutation,  you can
either leave *exact_alteration* and *mutation_specification*  empty or specify a
subtype of alteration, like only missense  or only truncating variants. In the
example panel, *BRCA1* is sensitive to a PARP inhibitor  only if it is either
deleted or truncated,  because generally missense mutations are not considered
pathogenic.  Fusions have no particular format, but are expressed in the
gene\textunderscore symbol column.  If you put a single gene, `r
Rpackage("PrecisionTrialDrawer")` will interpret it as  'every fusion involving
that gene' (see *RET*),  while if you put a specific fusion (see *EML4\_\_ALK*),
it will look only for that specific fusion.  The format for specific fusions
must be always *gene1\_\_gene2*.

Now that we have our panel, we need to build an object of class CancerPanel:

```{r chunk2, echo=TRUE , eval=TRUE, message=TRUE, warning=FALSE, timeit=NULL}
mypanel <- newCancerPanel(panelexample)
```

```{r chunk2_b, echo=FALSE, eval=TRUE, message=TRUE, warning=FALSE, timeit=NULL}
# Keep a copy of this object in this status
mypanel2 <- mypanel
```


The class constructor performs two main tasks. 
The first one is to check for inconsistency in panel data.frame. 
The second is to calculate a genomic length for every row of the panel. 
CNA and expression alteration will have a length equal to 
the CDS or CDS plus UTR of the requested gene. 
Fusions follow the same rule, but when a 
specific fusion is requested, the length 
is equal to the mean of two genes involved. 
Finally, mutations will take a length equal to the exact alteration requested 
plus a *padding_length* that can be decided by the user. 
Note that this calculation does not take into account 
overlapping regions as this is just a rough estimation 
for simulations in which every row 
counts for itself. For a precise calculation, 
refer to the last section **PanelDesigner**.

Gene length calculations are based on biomaRt and 
ensembl database which is located in the UK. 
For your convenience, we add the possibility of 
changing the host and go to a faster mirror 
located closer to your location.

```{r chunk2_h, echo=TRUE, eval=FALSE, message=FALSE, warning=FALSE,timeit=NULL}
# Connect to US east coast server
# The default is www.ensembl.org
mypanel <- newCancerPanel(panelexample , myhost="uswest.ensembl.org")
```

A convenient list of mirror servers can be found here: 
[ensembl mirrors](http://www.ensembl.org/info/about/mirrors.html). 
The same option is also available for *panelDesigner* method.

## Rules parameter

There are specific situations where not every tumor in the panel 
can be associated with a drug. Such cases include 
known drug resistance or simply drugs that are 
approved only for specific tumor types.
We add a parameter called **rules** that define a set 
of modifiers of associations between drug and tumor types.
For example we can create a data.frame like the following:

```{r chunk_rules, timeit=NULL}
rules <- data.frame(
    drug=c("Erlotinib" , "Erlotinib", "Erlotinib","Erlotinib","Olaparib")
    , gene_symbol=c("EGFR" , "KRAS", "" , "", "")
    , alteration=c("SNV" , "SNV", "" , "", "")
    , exact_alteration=c("amino_acid_variant" , "", "" , "", "")
    , mutation_specification=c("T790M" , "" , "" , "", "")
    , group=c("Driver" , "Driver", "Driver" , "Driver", "Driver")
    , tumor_type=c("luad" , "luad" , "luad" , "coadread","brca")
    , in_out=c("exclude" , "exclude" , "include" , "include" , "include")
    , stringsAsFactors = FALSE
)
knitr::kable(rules)
```

It can be added directly when the object is 
constructed or later on during **subsetAlteration**.

```{r chunk2_e, echo=TRUE, eval=FALSE, message=FALSE, warning=FALSE,timeit=NULL}
mypanel_exclude <- newCancerPanel(panelexample 
                                  , myhost="uswest.ensembl.org" 
                                  , rules = rules)
```

We developed two type of possible rules described below.

### Drug resistance rules

```{r chunk_rules2, timeit=NULL}
knitr::kable(rules[1:2 , ])
```

The first two rows implement a drug resistance rule. 
Both alterations (*EGFR* T790M and *KRAS* mutations) can cause 
resistance to *EGFR* inhibitors and all the patients harboring 
these alterations cannot be associated with that treatment.

The data.frame columns are exactly the same as the panel 
but in this case the **drug** is a drug that cannot be assigned 
to all the patients with the specified alteration.
The **group** column is instead what the new couple sample-drug 
should be considered. In our case, *Actionable* becomes *Driver*.

An extra column specifying the **tumor_type** 
in which the rule is applied is added.
If the rule is valid for any tumor type, simply put "" (empty string).

**in_out** define the type of rule. In this 
case is an exclusion (in_out=exclude) 
rule so any patient harboring that mutation 
cannot be associated with Erlotinib. 
Also inclusion only rules can be implemented using in_out=include.

### Drug association rules

```{r chunk_rules3, timeit=NULL}
knitr::kable(rules[3:5 , ])
```

Not every drug can be associated with all tumor types. 
Certain treatments (e.g. FDA approved) are only available for 
specific tumor types. We implemented a 
second set of rules in the rows 3,4 and 5.

The interpretation is: *drugA* can be used only in 
(*include*) or not used in (*exclude*) the specific *tumor_type* and 
it will be associated with group *Driver*.

Multiple *include* on the same drug are cumulative 
and every tumor type require a new row 
(Erlotinib can be used only in coadread and luad).

These rules are universal and do not depend on the data 
or the panel. In the example we are going to develop, 
coadread is not included but the rule can stay.

## getAlterations

Now that we have our panel, we want to test it against tumor samples. 
We will choose two different cancer types among the available ones: 

```{r chunk3, echo=TRUE , eval=FALSE, message=TRUE, warning=FALSE, timeit=NULL}
# Check available tumor types
knitr::kable( head(showTumorType()) , row.names=FALSE)
# Fetch data from breast cancer and lung adenocarcinoma samples
mypanel <- getAlterations(mypanel , tumor_type=c("brca" , "luad"))
# See the updated slot dataFull for mutations
str( cpData(mypanel)$mutations , vec.len = 1)
```

```{r chunk3_b, echo=FALSE , eval=TRUE, message=TRUE, warning=FALSE,timeit=NULL}
# Check available tumor types
knitr::kable( head(showTumorType()) , row.names=FALSE)
# Fetch data from breast cancer and lung adenocarcinoma samples
data(cpObj)
mypanel <- cpObj
# See the updated slot dataFull for mutations
str( cpData(mypanel)$mutations , vec.len = 1)
rm(cpObj)
```

This method performs 2 tasks. Download data from 
cBioportal and MD Anderson Fusion Portal 
for every alteration type for the requested genes and 
retrieve the total number of samples for each available data type.

In case you are interested into requesting data for every tumor type, 
it is possible to assign "all_tumors" to the parameter 
tumor_type ```getAlterations(mypanel , tumor_type="all_tumors")```

### Specify the cancer studies

Additionally, instead of specifying the tumor type, it is 
also possible to display the individual study 
associated with each tumor type and select them 
one by one. This for example allows 
the user more control over what dataset will be used to run the simulation. 

```{r sct,echo=TRUE,eval=FALSE,message=FALSE,warning=FALSE,timeit=NULL}
# Check available tumor types
knitr::kable( showCancerStudy(c("brca","luad")) , row.names = FALSE )
# Fetch data from breast cancer and lung adenocarcinoma samples
mypanel_alternative <- getAlterations(mypanel 
                        , tumor_type=c("luad_tcga_pan_can_atlas_2018"
                                       , "brca_tcga_pan_can_atlas_2018"))
```

At this point, the data are downloaded by gene, without 
looking for the exact alteration requested.
If you want, you can always put your own data, as long 
as you respect the format for each alteration type. 

```{r chunk4,echo=TRUE,eval=TRUE,message=FALSE,warning=FALSE,timeit=NULL}
# Extract the data from the panel as a toy example.
repos <- lapply(cpData(mypanel) , '[' , c('data' , 'Samples'))
# How custom data should look like
str(repos , vec.len=1)
```

```{r chunk4_bis,echo=TRUE,eval=FALSE,message=FALSE,warning=FALSE,timeit=NULL}
# Use custom data in a new CancerPanel object
mypanel_toy <- newCancerPanel(panelexample)
mypanel_toy <- getAlterations(mypanel_toy , repos=repos)
```

```{r chunk4_b,echo=FALSE,eval=TRUE,message=FALSE,warning=FALSE,timeit=NULL}
# Use custom data in a new CancerPanel object
mypanel_toy <- mypanel2
mypanel_toy <- getAlterations(mypanel_toy , repos=repos)
```


In this toy example, we extract the data from our 
CancerPanel object and transfer them to a new object. 
This operation can be useful when you want to test 
subsets of your original panel using a freeze of 
your data. Remember that sample names are taken from 
the element 'Sample' of the list for each alteration type 
because our tool needs to know the actual reference 
samples for which a genomic screening were made, 
even if they have no alterations. 
If 'Sample' is not present, the sample names are retrieved 
directly from the data using column *case_id*.
Adding more data than necessary is not a problem at this point, because every 
statistics and calculation are made after the call to **subsetAlterations**.

### Add or filter custom data

In case you want to add extra data from a different source, 
and you dispose of the data 
already according to the @dataFull slot format, 
you can append this dataset to an existing panel. 
This allows to increase the sample size for those 
rare tumor_types that are not present in cBioPortal.

```{r append,echo=TRUE,eval=FALSE,message=FALSE,warning=FALSE,timeit=NULL}
# Add two mutations from two different samples for breast cancer
newmutations <- data.frame(
    entrez_gene_id=c(7157 , 7157)
    ,gene_symbol=c("TP53" , "TP53")
    ,case_id=c("new_samp1" , "new_samp2")
    ,mutation_type=c("Nonsense_Mutation" , "Nonsense_Mutation")
    ,amino_acid_change=c("R306*" , "Y126*")
    ,genetic_profile_id=c("mynewbreaststudy" , "mynewbreaststudy")
    ,tumor_type=c("brca" , "brca")
    ,amino_position=c(306 , 126)
    ,genomic_position=c("17:7577022:G,A" , "17:7578552:G,C")
    ,stringsAsFactors=FALSE
    )
newsamples <- c("new_samp1" , "new_samp2")
# A dataFull slot style list should look like this
toBeAdded <- list(fusions=list(data=NULL , Samples=NULL)
                 , mutations=list(data=newmutations 
                                  , Samples=list(brca=newsamples))
                 , copynumber=list(data=NULL , Samples=NULL)
                 , expression=list(data=NULL , Samples=NULL)
                 )
new_panel <- appendRepo(mypanel_toy , toBeAdded)
```

Furthermore, if you want to filter certain mutations 
or fusions from your dataset, 
you can use the methods `filterMutations` and 
`filterFusions`. The first method, can be used 
with specific mutations or using a bed file.

```{r filter,echo=TRUE,eval=FALSE,message=FALSE,warning=FALSE,timeit=NULL}
# Create a bedfile
bed <- data.frame(chr = paste0("chr" , c(7 , 17)) 
                  , start = c(140534632 , 41244326) 
                  , end = c(140534732 , 41244426) 
                  , stringsAsFactors=FALSE)
knitr::kable(bed , row.names=FALSE)
# Apply the filter
# You can decide to exclude or keep only the mutations in the bed file
new_panel <- filterMutations(mypanel_toy , bed = bed , mode = "exclude")
# Filtering can be also tumor-wise
new_panel <- filterMutations(mypanel_toy , bed = bed 
                             , mode = "exclude" , tumor_type="brca")
```


```{r chunk5,echo=FALSE,eval=TRUE,message=FALSE,warning=FALSE,timeit=NULL}
rm(mypanel_toy)
# rm(new_panel)
```

## subsetAlterations

After getting alterations, we have to subset 
alterations for the exact specifications of our panel.

```{r chunk6,echo=TRUE,eval=FALSE,message=FALSE,warning=FALSE,timeit=NULL}
# Subset the alterations using panel information
mypanel <- subsetAlterations(mypanel)
# See the updated slot dataSubset
str( cpDataSubset(mypanel) , vec.len = 1)
```

As we can see, the format of the new slot is now the 
same for every alteration type and the 
information contained in the panel was merged. This format is necessary to 
perform our simulations since we can easily bind together these datasets.

If the user provides an *exclude* data.frame (either 
at the object construction or here), all the patients 
excluded from certain treatment opportunities will be 
stored in the excluded slot. See Negation Rules section above.

The *druggability* parameter can be used in this context too.

For example we can use the data already downloaded and 
use them in a panel with an exclude and druggability parameter:

```{r ,echo=TRUE,eval=TRUE,message=FALSE,warning=FALSE,timeit=NULL}
mypanel_exclude <- mypanel
mypanel_exclude <- subsetAlterations(mypanel_exclude 
                                     , rules=rules)
```


# Plots and Stats

`r Rpackage("PrecisionTrialDrawer")` has three different 
plot capabilities/statistics. Every plot can be turned off by using 
noPlot=TRUE option and statistics are reported instead. 
The first plot is the **coveragePlot**, 
which returns the absolute number of samples covered by 
at least one or more alterations. 
A more compact but less informative version of the same 
plot is the **coverageStackPlot**, which returns a breakdown 
of variable (e.g. drug) by its subcomponents (e.g. genes).
The second one is called **saturationPlot**, and 
expresses the cumulative number of samples 
covered or the mean number of alteration per sample 
by adding one piece of the panel at a time. 
In particular, this plot is able to estimate the trade-off 
between the number of covered samples 
and the genomic space occupied by the panel (as a sort of proxy of cost). 
The third plot is called **coocMutexPlot** 
and is able to explore the relationship between pairs of features 
as mutual exclusive or cooccurent. For example, in a typical 
umbrella design, we seek for drugs or genes that are possibly 
mutual exclusive between each other, in order to maximize 
the number of covered patients and avoid confusing multiple targeting.

All plots accept a **CancerPanel** object and a grouping 
variable to see the plot stratified 
by gene or tumor type. Furthermore, you can decide the kind 
of alteration to plot, between mutations, copynumber, fusions,  
expression or any combination of these.

## coveragePlot

Let's draw the simplest of the coverage plot. We want to see 
how many samples are covered 
by at least one of the alteration of our panel, divided by tumor type: 


```{r plot1, echo=TRUE, eval=TRUE , timeit=NULL}
coveragePlot(mypanel 
, alterationType=c("mutations" , "expression" , "copynumber" , "fusions")
, grouping="tumor_type")
```

If we do not want the plot to be displayed but just the statistics, 
we can add the parameter noPlot set to TRUE.

```{r chunk7 , echo=TRUE , eval=TRUE , timeit=NULL}
# Same command, but we ask for no plot, just the statistics
covStats <- coveragePlot(mypanel
             , alterationType=c("mutations" , "expression" 
                                , "copynumber" , "fusions")
             , grouping="tumor_type" 
             , noPlot=TRUE)
# Proportion of covered samples by at least one alteration
atLeastOne <- covStats$plottedTable[ , 1]/covStats$Samples
```
  
The results appears to be extremely good, because we reached 
`r paste0( round(unname(atLeastOne[1]) , 2)*100 , "\\%")` 
of breast samples and `r paste0( round(unname(atLeastOne[2]) , 2)*100 , "\\%")` 
for lung adenocarcinoma samples.

How many of these samples are actually actionable by at 
least one alteration? Since we use 'group' 
in our panel to divide druggable from non-druggable alterations, 
it is pretty easy to answer this question.


```{r plot2 , echo=TRUE , eval=TRUE , timeit=NULL}
# We add a new grouping variable, 'group'
coveragePlot(mypanel 
            , alterationType=c("mutations" , "expression" 
                               , "copynumber" , "fusions")
            , grouping=c("tumor_type" , "group")
            )
```


```{r chunk8 , echo=FALSE , eval=TRUE , timeit=NULL}
# Same command as above, but we ask for no plot, just the statistics
covStats2 <- coveragePlot(mypanel
             , alterationType=c("mutations" , "expression" 
                                , "copynumber" , "fusions")
             , grouping=c("tumor_type" , "group")
             , noPlot=TRUE)
# Proportion of covered samples by at least one actionable alteration
atLeastOne2 <- covStats2$plottedTable[ , 1]/covStats2$Samples
```

Only the `r paste0( round(unname(atLeastOne2[1]) , 2)*100 , "%")` 
of breast samples are actionable and 
`r paste0( round(unname(atLeastOne2[3]) , 2)*100 , "%")` 
of lung adenocarcinomas.

## coverageStackPlot

Some of the drugs, like Vandetanib, are multitarget that acts 
on several kinases. So now we are interested in knowing 
what are the most altered genes for each drug. We could add 
one more grouping variable to the **coveragePlot** 
but we will end up with a very hard to read plot (one plot per gene). 
To overcome this difficulty, 
we implemented a more compact version of the coverage plot 
that only returns the number of samples 
with at least one alteration stratified for a 'grouping' variable.


```{r plotStack,echo=TRUE,eval=TRUE,timeit=NULL}
# the stacked coverage plot requires:
  # 'var' parameter (that selects the bars)
  # 'grouping' parameter (that selects the stratification variable, optional)
  # 'tumor_type' parameter (to select one or more tumor type, optional)
par(mfrow=c(2,1))
coverageStackPlot(mypanel
             , alterationType=c("mutations" , "expression" 
                                , "copynumber" , "fusions")
             , var="drug"
             , grouping="gene_symbol"
             , tumor_type="brca"
             )
coverageStackPlot(mypanel
             , alterationType=c("mutations" , "expression" 
                                , "copynumber" , "fusions")
             , var="drug"
             , grouping="gene_symbol"
             , tumor_type="luad"
             )
```
  
The stacked version of coverage plot reports a breakdown of 
each drug by gene and then the total samples covered by 
each drug. When the total is lower than the breakdown, it 
means that some of the samples are altered in more than one gene 
targeted by the same drug.

```{r statsStack , echo=FALSE , eval=TRUE, timeit=NULL}
stackStats <- lapply(c("brca" , "luad") , function(x) {
               coverageStackPlot(mypanel
                , alterationType=c("mutations" , "expression" 
                                   , "copynumber" , "fusions")
                , var="drug"
                , grouping="gene_symbol"
                , tumor_type=x
                ,noPlot=TRUE)
              })
freqGene <- lapply(stackStats , function(x) {
  rowSums(x$plottedTable)/x$Samples['all_tumors']})
freqDrug <- lapply(stackStats , function(x) {
  x$plottedTable['Total' , ]/x$Samples['all_tumors']})
names(freqGene) <- c("brca" , "luad")
names(freqDrug) <- c("brca" , "luad")
olaparib <- sapply(freqDrug , '[' , 'Total Olaparib')
```

From this plot we can show for example that olaparib covers, 
in both tumors the 10\% of the samples 
(`r paste0( round(unname(olaparib[1]) , 2)*100 , "%")` for breast and 
`r paste0( round(unname(olaparib[2]) , 2)*100 , "%")` for lung). 
The contribute of its target *BRCA1* and *BRCA2* 
is different. While in breast cancer 
the number of altered samples is splitted between the two genes, 
in lung the contribution of *BRCA2* is 
predominant ( *BRCA1* covers the 
`r paste0( round(unname(freqGene$luad['BRCA1']) , 2)*100 , "%")` while 
*BRCA1* the `r paste0( round(unname(freqGene$luad['BRCA2']) , 2)*100 , "%")` ).

Furthermore, a multitarget drug like Vandetanib could be a 
valuable therapeutic option, especially in lung adenocarcinoma. 
It covers the 
`r paste0( round(unname(freqDrug$luad['Total Vandetanib']) , 2)*100 , "\\%")` 
of the samples with a large contribution of *KDR* 
( `r paste0( round(unname(freqGene$luad['KDR']) , 2)*100 , "%")`) 
and *ABL2* (`r paste0( round(unname(freqGene$luad['ABL2']) , 2)*100 , "%")`) 
that is not present in breast cancer.

If the plot result is not clear, we can create 
an interactive version of the same graph.

```{r plotStackHTML1,echo=TRUE,eval=TRUE,timeit=NULL}
# Add parameter html=TRUE
myHTMLPlot <- coverageStackPlot(mypanel
             , alterationType=c("mutations" , "expression" 
                                , "copynumber" , "fusions")
             , var="drug"
             , grouping="gene_symbol"
             , html=TRUE)
```

```{r plotStackHTML2 , echo=TRUE , eval=FALSE, timeit=NULL}
# The plot is not displayed immediately to allow the user 
# to embed it in markdown document or web page
# Now run plot
plot(myHTMLPlot)
```

```{r plotStackHTMLshow,echo=FALSE,eval=TRUE,results='asis',timeit=NULL}
# Print is the actual command that works inside a Rmd
# Plot is the one to use in case you are on a R console
print(myHTMLPlot , tag="chart")
```

## Rules parameter effect

Implementing negation rules can significantly alter the number 
of subject that can be potentially treated by a drug. 
In this example, we will see the effect of the **rules** 
parameter on Erlotinib drug in the lung dataset.

```{r plotStack_exclude,echo=TRUE,eval=TRUE,timeit=NULL}
# First plot with exclude activated, second without
par(mfrow=c(2,1))
coverageStackPlot(mypanel_exclude
             , alterationType=c("mutations" , "expression" 
                                , "copynumber" , "fusions")
             , var="drug"
             , grouping=NA
             , tumor_type="luad"
             )
coverageStackPlot(mypanel
             , alterationType=c("mutations" , "expression" 
                                , "copynumber" , "fusions")
             , var="drug"
             , grouping=NA
             , tumor_type="luad"
             )
```

As you can see, ~ 1% of the patients can no longer be potentially 
treated with Erlotinib because of specific resistant mutations. 
Furthermore, Olaparib association was completely 
eliminated by the druggability parameter.

## saturationPlot

Our panel is performing very well on the absolute number of covered samples. 
What about its construction? 
With the next analysis, we will try to answer to questions 
related to each element of our panel and try to optimize it. 
In particular we want to ask:

* Is there a particular class of genes or drugs or 
alteration that is predominant over the others?
* What is the cumulative number of samples covered 
by growing the panel one piece at a time?
* Is our panel optimize in terms of length?
* Are there alterations/genes that are never seen in our samples?
* Do all the drugs have at least one altered target?

To answer these questions, let's draw a saturation plot.

```{r plot3 , echo=TRUE , eval=TRUE, timeit=NULL}
# Saturation plot by adding one gene at a time divided by tumor type
saturationPlot(mypanel 
            , alterationType=c("mutations" , "expression" 
                               , "copynumber" , "fusions")
            , grouping="tumor_type"
            , adding="gene_symbol"
            )
```


The plot is quite simple to interpret. On the X axis we have genomic space, 
on the Y axis the mean number of alteration per sample.
The plot is build by adding one gene at a time (**adding** parameter), 
using all alteration types (**alterationType** parameter) and drawing 
one curve per tumor type (**grouping** parameter). 
The genes are added by their frequency of alteration, starting from the 
most altered. As can be seen from the bottom right annotation, 
some of the genes are never altered in these tumor types. 
To find out who they are, we can run this simple lines of code:

```{r chunk9 , echo=TRUE , eval=TRUE, timeit=NULL}
# As always, we can ask for the statistics with the option noPlot
satStats <- saturationPlot(mypanel
             , alterationType=c("mutations" , "expression" 
                                , "copynumber" , "fusions")
             , grouping="tumor_type"
             , adding="gene_symbol"
             , noPlot=TRUE
             )
# Retrieve all the genes of the panel
panelGenes <- unique( cpArguments(mypanel)$panel$gene_symbol )
# Look for the ones that are not present in the plot with a simple setdiff
missingGenes <- sapply( c("brca" , "luad") , function(x) {
                  setdiff( panelGenes 
                        , satStats[ satStats$grouping==x , "gene_symbol2" ] )
                  })
# Print out
missingGenes
```

As we can see, *HRAS* is never altered in the positions requested in the panel, 
both in breast and lung cancer samples.
We can think about remove it from our panel and save some genomic space.
This plot can also be seen from the drug point of view, 
by changing the variable *adding*:


```{r plot4 , echo=TRUE , eval=TRUE, timeit=NULL}
# Saturation plot by adding one drug at a time divided by tumor type
saturationPlot(mypanel 
            , alterationType=c("mutations" , "expression" 
                               , "copynumber" , "fusions")
            , grouping="tumor_type"
            , adding="drug"
            , y_measure="absolute"
            )
```


In this case, we plot the absolute number of cumulative 
covered patient (like in the coverage plot) 
by using the parameter **y_measure** set to "absolute".
We can notice that by using just Idelasib (target of *PIK3CA*) 
and Trastuzumab (target of *ERBB2*), 
we can cover almost the 50% of breast cancer patients. 
Moreover, two drugs have no available target 
and could be excluded from a potential trial. 
On the other hand, lung samples are predominantly 
mutated in *KRAS* and *TP53* and since these two genes 
are not targeted by any drug, the first "drug" category is indeed *no_drug*. 
In addition, we can appreciate a very high coverage 
for LUAD samples, but the saturation 
is reached at the point of Idelalisib/Olaparib and 
all other drugs increase the coverage of 1-2%. 
Olaparib itself, occupies almost 25 Kbases alone with 
its targets *BRCA1/2*, which are very long genes.

## coocMutexPlot

The information coming from the **saturationPlot** is useful 
but not complete. In fact, the order of the genes/drugs 
can be misleading about the true coverage of a single feature. 
For example, let's plot the coverage by drug and tumor type:


```{r plot5 , echo=TRUE , eval=TRUE , timeit=NULL}
# One tumor at a time, we draw a stack plot of drug without stratification
par(mfrow=c(2,1))
coverageStackPlot(mypanel , var="drug" 
        , grouping=NA , tumor_type="brca" , collapseByGene=TRUE)
coverageStackPlot(mypanel , var="drug" 
        , grouping=NA , tumor_type="luad" , collapseByGene=TRUE)
```

```{r parToNormal, echo=FALSE , eval=TRUE, timeit=NULL}
par(mfrow=c(1,1))
```

In this case we add the parameter **collapseByGene**. 
This option is useful to prevent the package from counting 
two different types of alteration on the same gene and 
same sample to be counted twice. For example, if *ERBB2* is 
both amplified and overexpressed (two things that 
are extremely correlated) in the same sample, 
it will count for 1 alteration only. This parameter 
has an effect on the bars from 2 to 10, 
but obviously not on the 'at least one' alteration bar.

From this plot we can see for example that Olaparib (a target of *BRCA1/2*) 
accounts for almost the 11% of breast cancer samples, 
while in the saturationPlot its cumulative 
contribute appeared to be between 5-6%. 
This is where the **coocMutexPlot** comes handy.


```{r plot6 , echo=TRUE , eval=TRUE , timeit=NULL}
# coocMutexPlot of drug divided by tumor type
coocMutexPlot(mypanel 
     , alterationType=c("mutations" , "expression" , "copynumber" , "fusions")
     , grouping="tumor_type"
     , var="drug")
```


This plot represents significant mutual exclusivity and cooccurence between 
couples of drugs. In particular, it tests whether 
all the alterations targeted by a drug tends to appear in different patients 
or together in the same set of patients. 
For example, we can appreciate that 
Olaparib is mutual exclusive with Idelalisib, 
so the drugs are active on different set of patients. 
This is particular useful to know for balancing a basket trial design. 
The more two drugs are mutual exclusive, the larger is the set of 
druggable patients.

The very same plot, using genes instead of drugs, shows a mutual exclusivity 
on both *BRCA1* and *BRCA2* against *PIK3CA*.


```{r plot7 , echo=TRUE , eval=TRUE , timeit=NULL}
# coocMutexPlot of genes divided by tumor type
# To avoid plotting gene pairs with no significance, we set parameter plotrandom
coocMutexPlot(mypanel 
             , alterationType=c("mutations" , "expression" 
                                , "copynumber" , "fusions")
             , grouping="tumor_type"
             , var="gene_symbol"
             , plotrandom=FALSE
             )
```

Cooccurence and mutual exclusivity can also be seen from the 
prospective of similarity, rather than pair-wise mutex. For this purpose, 
we developed an alternative style for the plot that exploits binary 
distance between occurrences of alterations between genes or drugs.

```{r plotDendro,echo=TRUE,eval=TRUE, timeit=NULL}
# coocMutexPlot of drugs divided by tumor type
# Style set to dendro
coocMutexPlot(mypanel 
             , alterationType=c("mutations" , "expression" 
                                , "copynumber" , "fusions")
             , grouping="tumor_type"
             , var="drug"
             , style="dendro"
             )
```

Again, if you want just the statistics and no plot, 
set the parameter **noPlot** to TRUE.

```{r chunk10,echo=TRUE,eval=TRUE,timeit=NULL}
# coocMutexPlot of genes divided by tumor type
coocMutexStats <- coocMutexPlot(mypanel 
             , alterationType=c("mutations" , "expression" 
                                , "copynumber" , "fusions")
             , grouping="tumor_type"
             , var="drug"
             , noPlot=TRUE
             )
knitr::kable(head(coocMutexStats) , digits = 3)
```

```{r chunkInvisible , echo=FALSE , eval=TRUE , timeit=NULL}
idx <- coocMutexStats$sp1_name=="Idelalisib" & 
  coocMutexStats$sp2_name=="Olaparib" & coocMutexStats$grouping=="brca"
idx2 <- coocMutexStats$sp2_name=="Idelalisib" & 
  coocMutexStats$sp1_name=="Olaparib" & coocMutexStats$grouping=="brca"
olapstats <- coocMutexStats[idx | idx2 , , drop=TRUE]
```

Idelalisib and Olaparib are mutual exclusive with a p-value of 
`r prettyNum(olapstats$pVal.MutEx , digits=3 , format="E")`.

## Create your own plots

*PrecisionTrialDrawer* is designed to allow the user to perform 
a lot of tasks for designing a genomic based trial. 
Nevertheless, not all the possible plots are available inside the package. 
We implemented a function to extract the relevant information 
for any kind of simulation, called `dataExtractor`.

```{r extract,echo=TRUE,eval=TRUE,timeit=NULL}
#Extract mutations from BRCA tumor type
myData <- dataExtractor(mypanel , alterationType = "mutations" 
                        , tumor_type = "brca")
# Check the format
str(myData)
```

A data extraction is therefore a list composed by data, 
Samples divided by tumor type and a character vector 
of tumors that are not present under 
the requested parameters. For example, we could ask 
how many samples are targeted by 0, 1, 2, 3 drugs. 
This function is not directly implemented in our 
package and so we can build it up here.

```{r customplot,echo=TRUE,eval=TRUE,timeit=NULL}
#Extract drug samples couples
dataDrug <- unique(myData$data[ , c("case_id" , "drug")])
# Retrieve all screened samples
samps <- myData$Samples$all_tumors
# Calculate number of drugs per patient (we consider also non altered samples)
drugsPerPatient <- table( factor(dataDrug$case_id , levels = samps) )
# Now in terms of frequency
drugsPerPatientFreq <- table(drugsPerPatient)/length(samps)
# Ready to plot
barplot(drugsPerPatientFreq ,  ylim = c(0,1) 
        , main="Patients with 0, 1, 2... molecular targets" 
        , col="darkcyan")
```


# Panel Optimization

While copynumber, expression and fusions require reading the whole gene, 
mutations can be targeted with high sensitivity. 
For particularly large genes, according 
to the tumor type under examination, it is often better to 
target specific regions rather than sequencing the whole gene. 
To design such regions, a "ratiometric" 
approach could be the best solution. It is better to evaluate 
the trade-off between genomic space occupied 
by the panel and number of mutations captured.

A shiny-based interactive application was created to allow 
the users to design their custom regions. For a better 
explanation of all the features of **panelOptimizer**, 
we suggest you to follow the manual page `?panelOptimizer` 
and reading carefully the instruction inside the mini app itself.

```{r panOpt , echo=TRUE , eval=FALSE, timeit=NULL}
panOpt <- panelOptimizer(mypanel)
```


# Sample Size and Power Calculation

In a classical epidemiological study design, one of the 
most important task is power and sample size estimation. 
In `r Rpackage("PrecisionTrialDrawer")` is currently possible 
to evaluate two of the most common oncological trial designs: 
time-to-event (survival) and proportion equality. The methods 
**survPowerSampleSize** and **propPowerSampleSize** 
are currently implemented inside the package. In this vignette, 
we will evaluate the first of the two methods. The latter is very similar, 
but requires a different set of initial parameter (*pCase* and *pControl*). 
A single-sample time-to-event calculator is also 
available as **survPowerSampleSize1Arm**. It is similar 
to **survPowerSampleSize** but instead of Hazard Ratio it requires *MED1* 
and *MED0*, median survival for the case group and historical median survival.


## Estimating full panel sample size and power

First, we calculate the sample size required to obtain 
4 levels of power (from 0.6 to 0.9) for 4 different 
postulated hazard ratios (from 0.625 to 0.3). Type I error 
was set to 5% ('alpha') and baseline event rate per unit of time was set to 0.9 
('ber', which correspond to 10% 1-year progression-free survival). 
Average follow-up time is 3 years ('fu'). Allocation is equal 
between treatment and control groups ('case.fraction').

```{r chunkSurv1 , echo=TRUE , eval=TRUE , timeit=NULL}
survPowerSampleSize(mypanel
  , HR = c(0.625 , 0.5 ,  0.4 , 0.3)
  , power = seq(0.6 , 0.9 , 0.1)
  , alpha = 0.05
  , ber = 0.9
  , fu = 3
  , case.fraction = 0.5
  )
```

Our function estimates the real number of samples required to 
start the study under the hypothesis that only a fraction of them 
will have at least one alteration included in the panel. 
The number of samples that will actually enter in the study can be retrieved 
using the option **noPlot**, in the following way.

```{r chunkSurv2 , echo=TRUE , eval=TRUE , timeit=NULL}
# Add option noPlot to retrieve results as dataframe
sampSize <- survPowerSampleSize(mypanel
  , HR = c(0.625 , 0.5 ,  0.4 , 0.3)
  , power = seq(0.6 , 0.9 , 0.1)
  , alpha = 0.05
  , ber = 0.9
  , fu = 3
  , case.fraction = 0.5
  , noPlot=TRUE
  )
knitr::kable(sampSize[ sampSize$HazardRatio==0.5 , ] , row.names = FALSE)
```

Under the column EligibleSampleSize we can see the actual 
number of patients in treatment and control 
groups combined under the postulated hypothesis (
`r sampSize[sampSize$HazardRatio==0.5&sampSize$Power==0.8,"EligibleSampleSize"]`
).

Our sample size estimation function can also be used to 
estimate power from a postulated screening sample size. We first 
bring the sample size from screening to eligibility 
(multiplying by the frequency of alterations) and then perform the calculation. 
As a toy example, we use the same screening sample sizes 
calculated before and show the calculated power.

```{r chunkSurv3 , echo=TRUE , eval=TRUE , timeit=NULL}
# Calculate power from sample size at screening
survPowerSampleSize(mypanel
  , HR = 0.5
  , sample.size = sampSize[ sampSize$HazardRatio==0.5 , "ScreeningSampleSize"]
  , alpha = 0.05
  , ber = 0.9
  , fu = 3
  , case.fraction = 0.5
  )
```

```{r chunkSurv3invisible , echo=FALSE , eval=TRUE , timeit=NULL}
# Calculate power from sample size at screening
survSampInv <- survPowerSampleSize(mypanel
  , HR = 0.5
  , sample.size = sampSize[ sampSize$HazardRatio==0.5 , "ScreeningSampleSize"]
  , alpha = 0.05
  , ber = 0.9
  , fu = 3
  , case.fraction = 0.5
  , noPlot=TRUE
  )
```

So with 
`r with(sampSize,sampSize[HazardRatio==0.5 & Power==0.8,"ScreeningSampleSize"])` 
patients we obtain an estimated power 
of `r survSampInv[ survSampInv$ScreeningSampleSize==120 , "Power"]`.

## Estimating arm-wise sample size and power

In a typical basket or umbrella design, we can also be 
interested in what is the power or sample size 
of the design of each single drug or tumor type instead of the whole panel. 
By adding the parameter **var**, this task is easy to achieve.

```{r chunkSurv4 , echo=TRUE , eval=TRUE , timeit=NULL}
# Calcualte sample size by tumor type
survPowerSampleSize(mypanel
  , var = "tumor_type"
  , HR = c(0.625 , 0.5 ,  0.4 , 0.3)
  , power = seq(0.6 , 0.9 , 0.1)
  , alpha = 0.05
  , ber = 0.9
  , fu = 3
  , case.fraction = 0.5
  )
```

Similarly, we can also ask a design divided by drug. 
Given the frequency of alterations targetable by each drug, we 
can estimate how many patients should be screened to 
obtain the minimum number of samples to reach a certain level of power. 
In this case, we don't ask for all the 
drug type, but only *Olaparib* and *Idelalisib*.

```{r chunkSurv5 , echo=TRUE , eval=TRUE , timeit=NULL}
# Calcualte sample size by drug
# Visualize only "Olaparib" and "Idelalisib"
survPowerSampleSize(mypanel
  , var = "drug"
  , stratum = c("Olaparib" , "Idelalisib")
  , HR = c(0.625 , 0.5 ,  0.4 , 0.3)
  , power = seq(0.6 , 0.9 , 0.1)
  , alpha = 0.05
  , ber = 0.9
  , fu = 3
  , case.fraction = 0.5
  )
```

It is noteworthy that the calculations performed arm-wise are 
based on the assumption that each 
arm is completely independent from the others. 
In other terms, patients with two targetable alterations enter 
in the frequency calculations twice as they could 
be treated with both drugs. In a typical 
umbrella design (many drugs, one tumor type) the patient is generally 
not treated with multiple drugs and can enter in one arm only.

## Priority Trial

The two sample size estimations above are useful in two particular occasions:

1. The design of a genomic trial VS standard of care (full panel sample size)
2. The design of single-drug VS standard of care (arm-wise sample size)

In case of a multi-drug design, the calculation of the number of 
patients to screen can be heavily biased. 
The first estimates the power of the whole study correctly, 
but each drug can be underpowered. 
The second estimates the power of each single arm correctly, 
but overestimates the total number of patients to screen. This is because 
the patients that are not eligible to enter in a specific arm 
cannot be assigned to a different drug and are simply discarded. 
To give an example, 
it is like performing 3 different clinical trials in 3 different 
hospitals with no possibility that a patient discarded from 
the first trial can be eligible in the other 2.

In a real umbrella trial scenario, the sample size should 
be calculated according to 3 assumptions:

1. Each arm should reach a minimum level of power
2. The total screening sample size should be minimized
3. In case of multiple alterations and multiple drug opportunities, 
there could be a priority order for each drug

**survPowerSampleSize** has a specific parameter to fulfill 
these prerequisites called *priority.trial*, a character vector listings 
drugs or group levels to build an umbrella trial based on a priority order. 
Sample size calculation becomes a multistep process reproduced 
by a cascade algorithm of screening and assignment.

The algorithm is composed by steps that implement a series of rules 
to screen and allocate patients to each arm of the trial. By default, 
the screening starts from the drug whose linked alteration are the 
rarest in the object up to the most common. This default rule guarantees 
the minimal sample size to screen. Alternatively, the user can decide 
the order of "importance" of each drug by changing the default 
`priority.trial.order="optimal"` to `"as.is"`. In this case, the order 
of importance is dictated by the order of the *priority.trial* vector.

1. Calculate a fixed Eligible Sample Size (ESS) that is common to all arms. 
This is the minimal threshold of patients that should enter in each arm.
2. Start screening with the first drug, reaching the 
Screening Sample Size (SSS) necessary to reach ESS
3. From the samples not eligible for the first drug, test the 
second drug/group and collects as many samples as possible up to ESS
4. Continue using the not eligible samples up to the end of all drugs.
    1. If it is the last drug to screen, allocate all the possible samples, 
    even over ESS. They are not eligible for any other drug.
    2. If it is not the last drug to screen, but all the other drugs 
    already reached ESS, allocate as many samples as possible (like point 4.1).
5. If all the drugs/groups have reached ESS, stop screening of new samples, 
    otherwise start a new screening with the first 
    drug/group that has not reached ESS yet.
6. Repeat from point 3 up to the point 
    that all drugs have reached ESS threshold.

In order to calculate the fraction of expected patients with an 
alteration that make them eligible for a drug, 
we need to calculate a matrix as shown below. 
Under a example design with 4 drugs (A, B, C, D), 
the fraction (probability) of obtaining an eligible patient is calculated as:


$$\mathbf{P_{A,B,C,D}} = \left[\begin{array}
{rrrr}
P(A) & P(B|\bar{A}) & P(C|\bar{A}\cup\bar{B}) & P(D|\bar{A}\cup\bar{B}\cup\bar{C}) \\
P(A|\bar{B}) & P(B) & P(C|\bar{A}\cup\bar{B}) & P(D|\bar{A}\cup\bar{B}\cup\bar{C}) \\
P(A|\bar{C}) & P(B|\bar{A}\cup\bar{C}) & P(C) & P(D|\bar{A}\cup\bar{B}\cup\bar{C}) \\
P(A|\bar{D}) & P(B|\bar{A}\cup\bar{D}) & P(C|\bar{A}\cup\bar{B}\cup\bar{C}) & P(D)
\end{array}\right]
$$

The dependence scheme derives from the following order 
of priority of the drugs taken into consideration 
($A \rightarrow B \rightarrow C \rightarrow D$). The main diagonal 
represents the screening phase, that starts with new 
samples and so is not influenced by any other drugs. Moving to 2, 
the second drug, the probability is calculated as 
$P(2|\bar{1})$ that in the table above is $P(B|\bar{A})$. Moving to 3 means 
taking dependencies from both 1 and 2, so 
$P(2|\bar{1}\cup\bar{2})$ ( $P(C|\bar{A}\cup\bar{B})$ ). 

$$\mathbf{Order} = \left[\begin{array}
{rrrr}
1 & 2 & 3 & 4 \\
2 & 1 & 3 & 4 \\
2 & 3 & 1 & 4 \\
2 & 3 & 4 & 1
\end{array}\right]
$$

The main diagonal (new screening line) uses the 
original frequencies of each drug, estimated as the 
number of eligible patients over the total number 
of samples in our CancerPanel object. 
Every advancement on the columns must take in consideration 
a depletion of all the samples 
that were already allocated. For example, $P(B|\bar{A})$, 
( first row , second column), calculates the number of 
eligible patients for B in a dataset epurated from 
all the samples eligible for A.
If the drugs cover different pathways, the dependent 
frequencies are very similar to the original ones. 
Conversely, if you imagine two drugs with very 
similar molecular targets, once the first has removed 
all its eligible patients, very few patients 
can be allocated to the second treatment 
that will drop dramatically in frequency.

Now, let's see an example. We take into consideration 
a design with four drugs, and we ask for the optimal order 
of priority of the drugs (basically starting from the rarest).

```{r prior1 , echo=TRUE , eval=TRUE , timeit=NULL}
drugs4 <- c("Idelalisib" , "Olaparib" , "Trastuzumab" , "Vandetanib")
prior4drugs <- survPowerSampleSize(mypanel
  , var = "drug"
  , priority.trial = drugs4 
  , priority.trial.order="optimal"
  , HR = 0.5
  , power = 0.8
  , noPlot = TRUE
  )
prior4drugs
```

The output is completely different from the regular 
one of this function. It is a list composed by 5 elements: 
Summary, Screening.scheme, Allocation.scheme, 
Probability.scheme, Base.probabilities.

- **Summary**: A matrix listings the number of samples 
screened, eligible (allocable) and not eligible (lost)
- **Screening.scheme**: A square matrix that reports the 
number of patients available for screening at each new screening
                        (diagonal) and, by row, all the not 
                        eligible that are tested for the next drugs
- **Allocation.scheme**: Same as above, but now listings the 
number of patients eligible at each new step
- **Probability.scheme**: An square matrix reproducing the 
mathematical matrix above. Our algorithm is parsimonious, in the sense that 
some of those dependent probabilities are not calculated 
if the required sample size has been already reached.
- **Base.probabilities**: A named numerical vector listings the initial 
probability to find a patient eligible for each drug under 
no dependence assumptions. These values are used on the diagonal 
of Screening.scheme when a new screening is started.
                          
A message is telling us that each drug must reach a number of 
eligible patients equal to 66, under a default case/control fraction 
of 0.5. We are designing a trial where 
each drug needs 33 treated and 33 controls. 
Reading from the Summary, our algorithm rules that 
`r prior4drugs$Summary["Screened" , "Total"]` patients must 
be screened to obtain `r prior4drugs$Summary["Eligible" , "Total"]` 
eligible subjects and discarding 
`r prior4drugs$Summary["Not.Eligible" , "Total"]` 
patients that cannot be inserted in any arm of the trial.

Let's now compare the result to the ones seen in the two sections above.

```{r prior2 , echo=TRUE , eval=TRUE , timeit=NULL}
# We use stratum to create a design with only four drugs
fullDesign <- survPowerSampleSize(mypanel
  , var = "drug"
  , stratum = drugs4
  , HR = 0.5
  , power = 0.8
  , noPlot = TRUE)
knitr::kable(fullDesign)
```

As we can see, the Eligible Sample Size is the same 
as above, equal to `r fullDesign[1 , "EligibleSampleSize"]`. 
The Full Design calculates a number of patients to 
screen equal to `r fullDesign[fullDesign$Var=="Full Design" , "ScreeningSampleSize"]`, way 
lower than `r prior4drugs$Summary["Screened" , "Total"]` 
calculated by our priority trial. 
This result guarantees at least 80% of power on the whole trial, 
but it doesn't assure sufficient power for all the arms of the study. 
Conversely, the sum of all the patients to screen 
assuming 4 independent trials is way higher, because a patient 
discarded from an arm of the study cannot be eligible for a different arm.

```{r prior3 , echo=TRUE , eval=TRUE , timeit=NULL}
sum(fullDesign[fullDesign$Var %in% drugs4 , "ScreeningSampleSize"])
```

If we want to calculate the post-hoc power for the whole study, 
using a sample size equal to the one 
calculated with priority trial, we can do:

```{r prior4 , echo=TRUE , eval=TRUE , timeit=NULL}
# Extract the number of samples calculated with priority.trial
sizePrior4Drugs <- prior4drugs[[1]]$Summary["Screened" , "Total"]
# Calculate post-hoc power, given sample.size
postHocPower <- survPowerSampleSize(mypanel
  , var = "drug"
  , stratum = drugs4
  , HR = 0.5
  , sample.size = sizePrior4Drugs
  , noPlot = TRUE)
postHocPower[ postHocPower$Var=="Full Design" , "Power"]
```

The power is basically 100%. In conclusion, priority.trial 
guarantees an high level of power for the whole trial 
and at least the same level of power for each drug, 
minimizing the samples to screen.


# Basket Trial Design: tumor.freqs and tumor.weights parameters

The problem of using public data is that the information 
available from each tumor type depends on how many samples per 
tumor type were sequenced. In our example, we can see that 
mutation data are not the same between tumor types 
with `r length(cpData(mypanel)$mutations$Samples$brca)` 
sequenced samples for breast cancer and 
`r length(cpData(mypanel)$mutations$Samples$luad)` 
for lung adenocarcinoma. Therefore, when calculating compound 
frequencies across tumor types, breast cancer weights more than lung cancer. 

This behaviour was purposely set by default given the 
nature of the original idea of a panel-wise analysis 
that is the base of **Umbrella Designs** 
(one tumor type, multiple targets). In other words, 
`r Rpackage("PrecisionTrialDrawer")` is best suited 
for tumor-wise analysis where homogeneity 
of samples is automatically granted.

Nevertheless, **Basket Designs** (one target, multiple tumor types) 
can also be explored using this package by adding 
two parameters common to many of our functions.

1. **tumor.freqs**: is a named vector, containing frequencies 
that are used to calculated a weighted mean by tumor types. 
e.g. if *PIK3CA* is mutated in the 30% of BRCA samples and 20% of LUAD samples, 
if you set tumor.freqs=c(brca=0.5 , luad=0.5) you are saying that you expect 
the same number of breast and lung samples in your study, 
therefore the weighted frequency now become 0.3x0.5 + 0.2x0.5 = 0.25;
2. **tumor.weights**: is a named vector, containing the number of samples that 
will be randomly sampled from the original cohort contained in the *CancerPanel*
object. e.g. by setting tumor.weights=c(brca=100 , luad=100), 
we randomly select 100 samples per tumor type and run the function.

|Function            |tumor.freqs |tumor.weights |
|:-------------------|:-----------|:-------------|
|coveragePlot        |present     |present       |
|coverageStackPlot   |present     |present       |
|cpFreq              |present     |present       |
|saturationPlot      |            |present       |
|survPowerSampleSize |present     |present       |
|coocMutExPlot       |            |present       |

Table: *tumor.freqs* and *tumor.weights* parameters

Let's now see some examples that make use of these two parameters.

## tumor.freqs: calculate weighted frequencies

As mentioned in the introduction, alteration frequencies are by default 
calculated on all available samples inside the object. 
If we want to establish the frequency of samples that can be targeted by 
each drug, we can run a simple *coverageStackPlot* like this:

```{r basket1 , echo=TRUE , eval=TRUE , timeit=NULL}
coverageStackPlot(mypanel , var="drug" 
        , grouping=NA , collapseByGene=TRUE)
```

```{r basket1Hidden , echo=FALSE , eval=TRUE, timeit=NULL}
#we use this to calculate inline stuff
covstackdrug <- coverageStackPlot(mypanel , var="drug" , grouping=NA 
                                  , collapseByGene=TRUE , noPlot=TRUE)
```

This plot derives from a calculation made with 
`r covstackdrug$Samples['brca']` breast samples and 
`r covstackdrug$Samples['luad']` lung samples, that are the ones with all 
available data from copynumber, mutations, fusions and expression. 
As you can see, this is not a balanced sample 
because breast tumors are the large majority.
What if we had the same number of samples for each tumor type?

```{r basket2 , echo=TRUE , eval=TRUE , timeit=NULL}
coverageStackPlot(mypanel , var="drug" 
        , grouping=NA , collapseByGene=TRUE 
        , tumor.freqs=c(brca=0.5 , luad=0.5))
```

```{r basket2Hidden , echo=FALSE , eval=TRUE, timeit=NULL}
covstackdrug2 <- coverageStackPlot(mypanel , var="drug" 
        , grouping=NA , collapseByGene=TRUE 
        , tumor.freqs=c(brca=0.5 , luad=0.5) , noPlot=TRUE)
```

While some of the drugs do not change dramatically, 
some other drop down or increase significantly 
(check red numbers on top of the bars). For example, *Idelalisib* 
drops down in frequency by a 10% 
(before 
`r unname(covstackdrug$plottedTable[1,"Idelalisib"]/covstackdrug$Samples['all_tumors'])`, 
now `r covstackdrug2$plottedTable[1 , "Idelalisib"]`). 
The reason is that is linked to *PIK3CA*, 
that is a gene that is rarely mutated in lung cancer, while it represents 
around the 30% breast cancer case. When the design is balanced, 
the frequency is dragged down towards the low LUAD frequency.
Note that the plot is also slightly different from the previous one: 
now Y-axis reports the relative frequency 
(the same as the red text on the top of the bars), 
instead of the absolute number of samples that is usually reported. 
The reason is that tumor.freqs simply weights the tumor-wise frequencies 
by a 0.5 - 0.5 factor per tumor type and the 
original number of samples is therefore lost.

If we want to check our theory about *PIK3CA*, 
we can simply use *tumor.freqs* in *cpFreq* function.

```{r basket3 , echo=TRUE , eval=TRUE , timeit=NULL}
# Run cpFreq with and withouth tumor.freqs
flatfreq <- cpFreq(mypanel , alterationType = "mutations")
weightedfreq <- cpFreq(mypanel , alterationType = "mutations" 
                       , tumor.freqs=c(brca=0.5 , luad=0.5) )
```

```{r basket3show1 , echo=TRUE , eval=TRUE , timeit=NULL}
# Frequency with no weights
knitr::kable(flatfreq[ flatfreq$gene_symbol=="PIK3CA" , ] , row.names=FALSE)
```

```{r basket3show2 , echo=TRUE , eval=TRUE , timeit=NULL}
# Frequency with balanced weights
knitr::kable(weightedfreq[ weightedfreq$gene_symbol=="PIK3CA" , ] 
             , row.names=FALSE)
```

## tumor.weights: calculate frequency with fixed sample size

In this section, we analyze how a basket trial simulation can be carried on. 
Imagine a clinical trial on *Idelalisib*, 
a drug that acts on *PIK3CA* mutations. 
The study is designed so that each tumor type arm is closed when we reach 
50 breast cancer cases and 40 lung cancer samples. Now we ask three questions:

1. What is the average mutation frequency we expect 
from this exact sample composition?
2. What is the distribution and variability of this frequency?
3. What is the maximum and minimum frequency 
we can expect, or in other words, 
the worst and best case scenario in terms of treatment opportunity?

```{r basket4 , echo=TRUE , eval=TRUE , timeit=NULL}
samplefreqs <- cpFreq(mypanel , alterationType = "mutations" 
                      , tumor.weights=c(brca=50 , luad=40))
knitr::kable(samplefreqs[ samplefreqs$gene_symbol=="PIK3CA" , ] 
             , row.names=FALSE)
```

*tumor.weights* allows a random extraction of 50 and 40 samples from each tumor 
type and calculate the frequency. 
This is just 1 extraction, so the result could be 
completely biased. We need to run a resampling 
simulation to answer our questions.

```{r basket5 , echo=TRUE , eval=TRUE , timeit=NULL}
# Run cpFreq 20 times
samplefreqsboot <- replicate( 20 
                          , cpFreq(mypanel , alterationType = "mutations" 
                            , tumor.weights=c(brca=50 , luad=40)) 
                          , simplify=FALSE)
# Extract PIK3CA frequency in the 20 runs
pik3caboot <- sapply(samplefreqsboot , function(x) {
  x[x$gene_symbol=="PIK3CA" , "freq"]})
```

Now to answer the first two questions, let's 
draw the distribution of our resampling:

```{r basket6 , echo=TRUE , eval=TRUE , timeit=NULL}
# calculate mean and sd of PIK3CA frequency distribution
avgpik3ca <- mean(pik3caboot)
sdpik3ca <- sd(pik3caboot)
# Plot the distribution
title <- paste("PIK3CA frequency distribution with 50 BRCA and 40 LUAD samples"
              , paste("Mean:" , round(avgpik3ca , 3))
              , paste("SD:" , round(sdpik3ca , 3))
              , sep="\n")
# draw a simple histogram. the red line shows the mean value
hist(pik3caboot , col="cadetblue" , breaks=30 
     , main=title , xlab="Frequencies of resampling")
abline(v = avgpik3ca , col="red" , lwd=3 , xpd=FALSE)
```

So, with our expected sample composition, 
we have an average of `r round(avgpik3ca , 3)` 
samples mutated on *PIK3CA* with a standard 
deviation of  `r round(sdpik3ca , 3)`. 
Also a confidence interval for our measure can be calculated, 
using percentile bootstrap confidence interval

$$(\theta_{\alpha/2}^{*};\theta_{1-\alpha/2}^{*})$$

To summarize what we should expect from this study, 
including answering to question 3:

```{r basket7 , echo=TRUE , eval=TRUE , timeit=NULL}
# Write a simple function to calculate confidence interval of a proportion
CI <- function(x){
  left <- quantile(x , 0.025)
  right <- quantile(x , 0.975)
  return(c(left , right))
}
# Create a small data.frame to summarize everything
pik3caSummary <- data.frame(Gene = "PIK3CA"
          , AverageMutationRate = round(avgpik3ca , 3)
          , SDMutationRate = round(sdpik3ca , 3)
          , MaxMutationRate = round(max(pik3caboot)[1] , 3)
          , MinMutationRate = round(min(pik3caboot)[1] , 3)
          , CI = paste( round( CI(pik3caboot) , 3) , collapse=" - "))
knitr::kable(pik3caSummary , row.names=FALSE)
```

What we known now is how *PIK3CA* would behave if we randomly select 90 
individuals divided in 50 breast cancer and 40 lung cancer cases.
 
How is this affecting sample size estimation? 
As mentioned in the previous section, 
sample size at screening is influenced by the 
number of affected samples we expect to find.
If you need to treat with *Idelalisib* a total 
of 100 samples, you would need at least 
300 patients sequenced to obtain 100 *PIK3CA* 
mutated samples (considering a mutation rate of 30%). 
This means that while sample size at treatment 
is fixed and depends only from hazard 
ratio, power and other parameters chosen beforehand, 
sample size at screening depends 
on how lucky/unlucky we are in finding enough 
patients with the alteration we are targeting. 

Also *survPowerSampleSize* has a tumor.weights parameter that can be resampled 
as we did for cpFreq. Let's see how. 
To simplify things, we hypothesize an hazard ratio of 2 
and we want to reach a power of 80%.

```{r basket8 , echo=TRUE , eval=TRUE , timeit=NULL}
# We want to know how many samples we will need under
  # HR = 2
  # power = 0.8
survboot <- replicate( 10 
                    , survPowerSampleSize(mypanel 
                            , var="gene_symbol"
                            , alterationType = "mutations"
                            , HR=2
                            , power=0.8
                            , tumor.weights=c(brca=50 , luad=40)
                            , noPlot=TRUE)
                    , simplify=FALSE)
# Extract PIK3CA results
survbootpik3ca <- sapply(survboot , function(x) {
  x[ x$Var=="PIK3CA" , "ScreeningSampleSize"]} )
```

Again, we create a little table that would 
tell us the average scenario, the best, 
the worst etc. Calculation of confidence 
interval follows the same formula as above

```{r basket9 , echo=TRUE , eval=TRUE , timeit=NULL}
# Create a small data.frame to summarize 
# everything (we round up to integer values)
survSummary <- data.frame(Gene = "PIK3CA"
            , Mean.Screen.Samp.Size = round(mean(survbootpik3ca))
            , SD.Screen.Samp.Size = round(sd(survbootpik3ca))
            , Max.Screen.Samp.Size = round(max(survbootpik3ca)[1])
            , Min.Screen.Samp.Size = round(min(survbootpik3ca)[1])
            , CI = paste( round( CI(survbootpik3ca)) , collapse=" - "))
knitr::kable(survSummary , row.names=FALSE)
```

What we learnt is that in the average scenario, we would need 
`r survSummary[1 , "Mean.Screen.Samp.Size"]` 
samples to start our trial, with a minimum 
of `r survSummary[1 , "Min.Screen.Samp.Size"]` and, if we are very unlucky, 
a maximum of `r survSummary[1 , "Max.Screen.Samp.Size"]` patients at screening.


# Panel Design

Once we are satisfied with our panel (and we suppose we are), 
the next step is to submit it to our preferred NGS company. 
The most accepted format to submit a panel is the 
[bed format](http://www.ensembl.org/info/website/upload/bed.html). 
As mentioned in the introduction, it is sometimes 
difficult to reconcile all the possible formats of alteration in a genomic 
position and that is what **panelDesigner** is designed for.

**panelDesigner** takes our **CancerPanel** object and 4 possible arguments.

1. **alterationType**: decide to create the panel using only 
'mutations','copynumber' ,'fusions', 'expression' regions or any 
combination of these. Default is to perform design on the whole panel;
2. **padding_length**: every region is elongated by this value in bp. Default 0. 
                      This operation is generally performed by the sequencing 
                      company itself in order to better characterize certain regions;
3. **merge_window**: regions with a distance below this 
threshold in bp are merged to reduce the final number of amplicons 
at expense of sequencing more genomic space;
4. **utr**: genes that demanded for the full length can be sequenced 
on their CDS or CDS + utr (full exons). Default FALSE
5. **canonicalTranscript**: decide weather to take only the canonical 
transcript or collapse the genomic ranges of any transcript.


```{r chunk11 , echo=TRUE , eval=FALSE , message=FALSE, timeit=NULL}
# Design our panel in full with no padding 
# length and merge window equal to 100 bp
mydesign <- panelDesigner(mypanel 
                        , padding_length=0 
                        , merge_window=100)
```

```{r chunk11_bis , echo=FALSE , eval=TRUE , message=FALSE, timeit=NULL}
# Design our panel in full with no padding 
# length and merge window equal to 100 bp
data(cpDesign)
mydesign <- cpDesign
str(mydesign , vec.len=1)
```


The design is a list composed by four elements:

1. **GeneIntervals**: all CDS and CDS + UTR for all the genes in the panel
2. **TargetIntervals**: all requested target regions 
(specific single mutations) divided and collapsed by gene symbol
3. **FullGenes**: gene symbols of the genes considered for their full sequence
4. **BedStylePanel**: the entire panel in bed format, 
merged by chromosome, start and end.

So if you want to submit your panel right away, just write your bed file:

```{r chunk12 , echo=TRUE , eval=TRUE , message=FALSE, timeit=NULL}
# Retrieve the panel in bed format
bedPanel <- mydesign$BedStylePanel
head(bedPanel)
```

To know the total length of my panel, 
just sum up all the regions in the bed file:

```{r chunk12_bis , echo=TRUE , eval=TRUE , message=FALSE, timeit=NULL}
# Calculate genomic space in kilo bases
sum( bedPanel$end - bedPanel$start )/100
```

How **panelDesigner** decide the regions, 
according to the alteration specification?

1. fusions: full sequence of all the genes involved in the fusions;
2. copynumber: full sequence of all the genes requested for copynumber;
3. expression: full sequence of all the genes requested for expression;
4. mutations: if every mutations are requested, 
full sequence, otherwise, specific genomic targets

As mentioned before, fusion, copynumber and expression data are probably 
collected through different technologies (maybe no NGS at all). 
Considering a specific design for every alteration type is therefore desirable.

```{r chunk13 , echo=TRUE , eval=FALSE , message=FALSE, timeit=NULL}
# Design the panel for mutations
myMutationDesign <- panelDesigner(mypanel
                        , alterationType="mutations"
                        , padding_length=0 
                        , merge_window=100)
```

By setting parameter **alterationType**, we require a design only for mutations.

# Session Information

```{r info,echo=TRUE,eval=TRUE, timeit=NULL}
sessionInfo()
```