<!--
%\VignetteEngine{knitr::knitr}
%\VignetteIndexEntry{Sample output files for tximport}
-->

# Sample output files for tximport

This data package provides a set of output files from running a number
of various transcript abundance quantifiers on 6 samples from the
[GEUVADIS Project](http://www.geuvadis.org/web/geuvadis). The
files are contained in the `inst/extdata` directory.

A citation for the GEUVADIS Project is:

> Lappalainen, et al., "Transcriptome and genome sequencing uncovers
> functional variation in humans", Nature 501, 506-511 (26 September
> 2013) [doi:10.1038/nature12531](http://dx.doi.org/10.1038/nature12531).

The purpose of this vignette is to detail which versions of software
were run, and exactly what calls were made.

# Sample information and quantification files

A small file, `samples.txt` is included in the `inst/extdata` directory:

```{r}
dir <- system.file("extdata", package="tximportData")
samples <- read.table(file.path(dir,"samples.txt"), header=TRUE)
samples
```

Further details can be found in a more extended table:

```{r}
samples.ext <- read.delim(file.path(dir,"samples_extended.txt"), header=TRUE)
colnames(samples.ext)
```

The quantification outputs themselves can be found in sub-directories:

```{r}
list.files(dir)
list.files(file.path(dir,"cufflinks"))
list.files(file.path(dir,"rsem","ERR188021"))
list.files(file.path(dir,"kallisto","ERR188021"))
list.files(file.path(dir,"salmon","ERR188021"))
list.files(file.path(dir,"sailfish","ERR188021"))
```

# Genome and gene annotation file

* For Cufflinks and Sailfish, the Illumina iGenomes was used as the index, see details below. 
* For RSEM, Salmon and kallisto (without inference replicates), 
  the Gencode v27 CHR transcripts were used (`gencode.v27.transcripts.fa`).
* For the `salmon_gibbs` and `kallisto_boot` directories, 
  the Ensembl v87 cDNA transcripts were used (`Homo_sapiens.GRCh38.cdna.all.fa`).
* For the `salmon_dm` directory, the Ensembl Drosophila melanogaster
  v92 transcripts were used (either just cDNA or combining cDNA with
  non-coding transcripts).

Illumina iGenomes: The human genome and annotations were downloaded from
[Illumina iGenomes](https://support.illumina.com/sequencing/sequencing_software/igenome.html)
for the UCSC hg19 version. The human genome FASTA file used was in the
`Sequence/WholeGenomeFasta` directory and the gene annotation GTF file used
was the `genes.gtf` file in the `Annotation/Genes` directory. This GTF
file contains RefSeq transcript IDs and UCSC gene names. The
`Annotation` directory contained a `README.txt` file with the text:

> The contents of the annotation directories were downloaded from UCSC
> on: June 02, 2014.

The `genes.gtf` file was filtered to include only chromosomes
1-22, X, Y, and M.

# Cufflinks

Tophat2 version 2.0.11 was run with the call:

```
tophat -p 20 -o tophat_out/$f genome fastq/$f\_1.fastq.gz fastq/$f\_2.fastq.gz;
```

Cufflinks version 2.2.1 was run with the call:

```
cuffquant -p 40 -b $GENO -o cufflinks/$f genes.gtf tophat_out/$f/accepted_hits.bam;
```

Cuffnorm was run with the call:

```
cuffnorm genes.gtf -o cufflinks/ \
cufflinks/ERR188297/abundances.cxb \
cufflinks/ERR188088/abundances.cxb \
cufflinks/ERR188329/abundances.cxb \
cufflinks/ERR188288/abundances.cxb \
cufflinks/ERR188021/abundances.cxb \
cufflinks/ERR188356/abundances.cxb 
```

# RSEM

RSEM version 1.2.31 was run with the call:

```
rsem-calculate-expression --num-threads 6 --bowtie2 --paired-end <(zcat fastq/$f\_1.fastq.gz) <(zcat fastq/$f\_2.fastq.gz) index rsem/$f/$f
```

# kallisto

kallisto version 0.43.1 was run with the call:

```
kallisto quant --bias -i index -t 6 -o kallisto/$f fastq/$f\_1.fastq.gz fastq/$f\_2.fastq.gz
```

For the files in `kallisto_boot` directory, kallisto version 0.43.0
was run, quantifying against the Ensembl transcripts (v87) in
`Homo_sapiens.GRCh38.cdna.all.fa`, using the call:

```
kallisto quant -i index -t 6 -b 5 -o kallisto_0.43.0/$f fastq/$f\_1.fastq.gz fastq/$f\_2.fastq.gz
```

# Salmon

Salmon version 0.8.2 was run with the call:

```
salmon quant -p 6 --gcBias -i index -l IU -1 fastq/$f\_1.fastq.gz -2 fastq/$f\_2.fastq.gz -o salmon/$f
```

For the files in the `salmon_gibbs` directory, Salmon version 0.8.1
was run, quantifying against the Ensembl transcripts (v87) in 
`Homo_sapiens.GRCh38.cdna.all.fa`, using the call:

```
salmon quant -p 6 --numGibbsSamples 5 -i index -l IU -1 fastq/$f\_1.fastq.gz -2 fastq/$f\_2.fastq.gz -o salmon_gibbs/$f
```

For the files in the `salmon_dm` directory (Drosophila melanogaster),
Salmon version 0.10.2 was run (once with only cDNA, once combining
cDNA with non-coding transcripts):

```
salmon quant -l A --gcBias --seqBias --posBias -i Drosophila_melanogaster.BDGP6.cdna.v92_salmon_0.10.2 -o SRR1197474 -1 SRR1197474_1.fastq.gz -2 SRR1197474_2.fastq.gz
```

# Sailfish

Sailfish version 0.9.0 was run with the call:

```
sailfish quant -p 10 --biasCorrect -i sailfish_0.9.0/index -l IU -1 <(zcat fastq/$f\_1.fastq.gz) -2 <(zcat fastq/$f\_2.fastq.gz) -o sailfish_0.9.0/$f
```

# Session info

```{r}
sessionInfo()
```