---
title: "Simulating Quasispecies Composition"
author: "Guerrero-Murillo, M and Gregori, J"
date: "`r format(Sys.Date(), '%m/%d/%Y')`"
bibliography: simulation.bib
output:
    BiocStyle::html_document:
        number_sections: yes
        toc_float: true
vignette: >
    %\VignetteIndexEntry{QSutils-Simulation}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---

```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
```

# Introduction 

A viral quasispecies is understood as a collection of closely related viral 
genomes produced by viruses with low replication fidelity. RNA viruses show a 
high replication error rate due to a lack of proofreading mechanisms. It is 
estimated that for viruses with typically high replicative loads, every 
possible point mutation and many double mutations are generated with each 
viral replication cycle, and these may be present within the population at 
any time. Given this inherent dynamic, we may be interested in comparing the 
viral diversity indices between sequential samples from a single patient or 
between samples from groups of patients. These comparisons can provide 
information on the patient’s clinical progression or the appropriateness
of a given treatment.

QSUtils is a package intended for use with quasispecies amplicon data obtained
by NGS, but it could also be useful for analyzing 16S/18S ribosomal-based
metagenomics or tumor genetic diversity by amplicons.

In this tutorial, we illustrate how the functions provided in the package can
be used to simulate quasispecies data. This implies simulation of closely 
related genomes, (eventually with segregating subpopulations at higher 
genetic distances) and their abundances.

In particular, we can differentiate between acute and chronic infection
profiles by the quasispecies composition. An acute infection is expected
to show a prominent genome that is highly abundant, together with a set
of low-abundance genomes. On the other hand, besides the implicit dynamics,
a chronic infection is expected to show a number of relatively abundant genomes
together with a myriad of derived genomes at low and very low abundances.

In viral terms, the fitness of a genome is a measure of its replicative 
performance. High-fitness haplotypes show high abundances after a transient
state, during which they overcome other genomes in the quasispecies. During 
infection of a human host, the quasispecies typically shows variations in the 
fitness of each genome caused by changes in the bioenvironment. Because of this
dynamic, we may observe the profile of a typical acute infection also in 
chronic patients, at least at the level of magnification provided by current
NGS technology.

A few functions in the package were designed to simulate quasispecies
composition, with the aim of studying the statistical properties of the 
diversity indices [@Gregori2016] [@Gregori2014]. 

# Install package

```{r install,message=FALSE}
BiocManager::install("QSutils")
library(QSutils)
```

# Abundance

Two different types of information define the quasispecies composition: the
genomes present and their current frequency (abundance) in the viral 
population. The package provides three ways to simulate abundance 
with various decreasing profiles.

## Powers of a fraction

$fn.ab$ setting the argument $fn$ to $pf$ computes consecutive fractions, 
given the frequency of the most abundant haplotype, according to: 

$$h ~r^{(i-1)}, ~~ r<1, ~~ i=1..n$$
Using $table$ on the $fn.ab$ result, we obtain the number of haplotypes for
each frequency:

```{r fn1}
table(fn.ab(25,fn="pf"))
```

```{r plotfn1,fig.cap="Profile of abundances simulated with `fn.ab` with fn=pf"}
par(mfrow=c(1,2))
plot(fn.ab(25,fn="pf"),type="h")
plot(fn.ab(25,r=0.7,fn="pf"),type="h")
```

The default value for $r$ is 0.5. Higher $r$ values moderate the decrease of
abundances. By default, the frequency of the most abundant haplotype, $h$, is
10000. 

## Power of consecutive fractions

$fn.ab$ with the argument $fn="pcf"$ computes the power of consecutive 
fractions, according to:

$$h ~ \frac{1}{i^{r}}, ~~ r>0, ~~ i=1..n$$
Default values are 0.5 for $r$ and 10,000 for $h$. The higher the $r$, the more 
pronounced the decrease in frequencies.

```{r fn2}
table(fn.ab(25,r=3,fn="pcf"))
```

```{r fn2.2}
table(fn.ab(25,r=2,fn="pcf"))
```

```{r plot-fn2,fig.cap="Abundances simulated with `fn.ab` with fn=pcf"}
par(mfrow=c(1,2))
plot(fn.ab(25,r=3,fn="pcf"),type="h")
plot(fn.ab(25,r=2,fn="pcf"),type="h")
```

## Decreasing fractional powers

$fn.ab$ with $fn="dfp"$ computes decreasing roots of the maximum frequency, $h$,
according to:

$$h^{1/i},  ~~ i=1..n$$
```{r fn3}
table(fn.ab(25,fn="dfp"))
```


```{r plot-fn3,fig.cap="Abundances simulated with `fn.ab` with fn=dpf",out.width = '70%'}
par(mfrow=c(1,2))
plot(fn.ab(25,fn="dfp"),type="h")
```

The figure and the previous table both show that this function is the one that 
generates the greatest distance between the dominant haplotype and the others.

To compare the profiles of the three functions, this figure plots the outputs 
of the functions with default parameters.

```{rplot-fn3.2,fig.cap="Comparison of the data simulation functions"}
par(mfrow=c(1,3))
plot(fn.ab(25,fn="pf"),type="h",main="fn.ab - pf")
plot(fn.ab(25,r=3,fn="pcf"),type="h",main="fn.ab - pcf")
plot(fn.ab(25,fn="dfp"),type="h",main="fn.ab - dpf")
```

A linear combination of results of the three functions provides greater
flexibility.


```{r linearcomb}
ab <- 0.25*fn.ab(25,fn="pf")+0.75*fn.ab(25,r=3,fn="pcf")
table(ab)
```
```{r,fig.cap="Abundances simulated with a linear combination of the functions"}
plot(ab,type="h",main="Linear combination of results")
```

```{r linearcomb2}
ab <- 0.7*fn.ab(25,fn="pf")+0.3*fn.ab(25,fn="dfp")
table(ab)
```

```{r,fig.cap="Abundances simulated with a linear combination of the functions"}
plot(ab,type="h",main="Linear combination of results")
```


## Geometric sequence

Appropriate for the typical load of rare haplotypes observed in our experiments
with the HCV quasispecies before the abundance filter. That is, the large 
number of very low fitness or defective haplotypes are best simulated by a 
geometric sequence with low values for the parameter. The geometric sequence 
is expressed as:

$$p ~(1-p)^{k-1},  ~~ k=1..n,  ~~ 0 < p < 1$$

and is implemented in the function $geom.series$, taking two arguments: $n$, 
the number of frequencies to compute and $p$, the parameter of the geometric 
function.

This function is useful to simulate a broad spectrum of frequency profiles, 
from quasispecies with very prevalent haplotypes to the above-mentioned long 
queues of very low abundances, as illustrated in the next figure.

```{r geomseq}
par(mfrow=c(1,2))
ab1 <- 1e5 * geom.series(100,0.8)
plot(ab1,type="h",main="Geometric series with p=0.8",cex.main=1)
ab2 <- 1e5 * geom.series(100,0.001)
plot(ab2,type="h",main="Geometric series with p=0.001",ylim=c(0,max(ab2)),
    cex.main=1)
```

Linear combinations of geometric sequences with parameters of differing
magnitudes help to obtain typical quasispecies profiles:

```{r plot-geomseq}
ab1 <- 1e5 * (geom.series(100,0.8)+geom.series(100,0.05))
plot(ab1,type="h",main="Combination of geometric series")
```

The function $fn.ab$ with fn argument set to $pf$, $pcf$, and $dfp$ is flexible
enough to obtain typical quasispecies profiles after filtering out all 
haplotypes below an abundance threshold, considered the technical noise level. 
This function, combined with geometric series having low to very low parameter 
values, provide profiles close to those observed empirically.

# Random genomes and variant haplotypes

In addition to frequencies, we need to simulate the quasispecies genomes. The
first task for this purpose is to generate the dominant haplotype by 
$GetRandomSeq$. The only parameter for this function is the genome length.
The output is a fully random sequence of nucleotides, returned as a character 
string.

```{r randomseq}
set.seed(23)
m1 <- GetRandomSeq(50)
m1
```

Variant genomes of this haplotype can be generated by $GenerateVars$. This 
function takes four parameters, $seq$ the dominant haplotype, $nhpl$ the number
of variants to generate, $max.muts$ the maximum number of mutations in a 
genome, and $p.muts$ the probability for each number of mutations from 1 to 
$max.muts$. It returns a vector of character strings with the variant genomes.

```{r generatevars}
v1 <- GenerateVars(m1,20,2,c(10,1))
DottedAlignment(c(m1,v1))
```

## Generate a quasispecies of acute infection

With these functions we can simulate a quasispecies with a profile of acute 
infection; that is, characterized by a dominant haplotype which is fairly 
abundant, together with a number of haplotypes at low abundances.

```{r haploaccute}
set.seed(23)
n.genomes <- 25
m1 <- GetRandomSeq(50)
v1 <- GenerateVars(m1,n.genomes-1,2,c(10,1))
w1 <- fn.ab(n.genomes,r=3,fn="pcf")
data.frame(Hpl=DottedAlignment(c(m1,v1)),Freq=w1)
```

The quasispecies composition can be visualized using a bar plot depicting the
haplotype frequencies, with haplotypes sorted by increasing number of mutations
with respect to the dominant haplotype, and within the number of mutations,
by decreasing order of abundance:

```{r plot accute,fig.cap="Simulated abundances in an acute infection"}
qs <- DNAStringSet(c(m1,v1))
lst <- SortByMutations(qs,w1) 
qs <- lst$bseqs

cnm <- cumsum(table(lst$nm))+1 
nm.pos <- as.vector(cnm)[-length(cnm)]
names(nm.pos) <- names(cnm[-1])

bp <- barplot(lst$nr,col="lavender")
axis(1,at=bp[nm.pos],labels=names(nm.pos),cex.axis=0.7)
```

## Generate a quasispecies of chronic infection

In contrast to acute infection, chronic infection develops more slowly; hence,
a larger number of mutations are generated with regard to the dominant 
haplotype. Furthermore, the mutated haplotypes may be more abundant in chronic
than in acute infections. In this case, we would use $GenerateVars$ with a 
higher value of $max.muts$ and higher probabilities for mutants at any level.


```{r haplo-chronic}
set.seed(23)
n.genomes <- 40
m1 <- GetRandomSeq(50)
v1 <- GenerateVars(m1,n.genomes-1,6,c(10,3,1,0.5,2,0.5)) 
w1 <- fn.ab(n.genomes,r=1.5,fn="pcf")
data.frame(Hpl=DottedAlignment(c(m1,v1)),Freq=w1)
```

Again, we can visualize the quasispecies composition using a bar plot.

```{r plot-chronic,fig.cap="Simulated abundances in a chronic infection"}
qs <- DNAStringSet(c(m1,v1)) 
lst <- SortByMutations(qs,w1)
qs <- lst$bseqs
cnm <- cumsum(table(lst$nm))+1
nm.pos <- as.vector(cnm)[-length(cnm)]
names(nm.pos) <- names(cnm[-1])
bp <- barplot(lst$nr,col="lavender")
axis(1,at=bp[nm.pos],labels=names(nm.pos),cex.axis=0.7)
```

## Diverging populations

Along the quasispecies dynamics we may see emergence of a segregating
subpopulation with improved fitness due to the combination of mutations,
rather than a single one. In this instance, the $Diverge$ function helps by 
producing variants with a common pattern of mutations.


```{r diverge}
set.seed(23)
m1 <- GetRandomSeq(50)
p2 <- Diverge(3:5,m1)
DottedAlignment(c(m1,p2))
```

Variants of these sequences can be produced in the usual way by $GenerateVars$.

```{r example}
v1 <- GenerateVars(m1,20,3,c(10,4,0.2))
wv1 <- fn.ab(length(v1),h=1000,r=1.5,fn="pcf")
wp2 <- c(600,1000,400)
v2 <- GenerateVars(p2[2],20,3,c(10,1,0.1))
wv2 <- fn.ab(length(v2),r=2,h=wp2[2]*3,fn="pcf")

qs <-DNAStringSet(c(m1,v1,p2,v2))
w <- round(c(10000,wv1,wp2,wv2))

lst <- SortByMutations(qs,w)
qs <- lst$bseqs
data.frame(Hpl=DottedAlignment(qs),nr=lst$nr)
```

The genome Hpl_4_0001 gives rise to the segregating population.

```{r plot-example,fig.cap="Simulated abundances with diverging populations"}
cnm <- cumsum(table(lst$nm))+1
nm.pos <- as.vector(cnm)[-length(cnm)]
names(nm.pos) <- names(cnm[-1])
bp <- barplot(lst$nr,col=c("lavender","pink")[c(rep(1,22),rep(2,20))])
axis(1,at=bp[nm.pos],labels=names(nm.pos),cex.axis=0.7)
```

***

```{r,, echo = FALSE}
sessionInfo()
```


# References