---
title: "automatic RNA-Seq present/absent gene expression calls generation"
author: "Julien Wollbrett, Sara Fonseca Costa, Marc Robinson-Rechavi, Frederic Bastian"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{automatic RNA-Seq present/absent gene expression calls generation}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  markdown: 
    wrap: sentence
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
`BgeeCall` is a collection of functions that uses [Bgee](https://bgee.org/) expertise to create RNA-Seq gene expression present/absent calls.
The `BgeeCall` package allows to:
-   Generate calls of presence/absence of expression at the gene level. You can generate these calls for any RNA-Seq samples as long as reference intergenic sequences have been generated by the Bgee team or by the Bgee community.
-   Download reference intergenic sequences for species available in Bgee.
-   Paralellize calls generation on a cluster with slurm queuing system.
-   Collect stats across libraries, provided from different calls approaches.
-   Merging calls from different libraries.
If you find a bug or have any issues with `BgeeCall` please write a bug report in our [GitHub issues manager](https://github.com/BgeeDB/BgeeCall/issues).
## How present/absent calls are generated
In Bgee RNA-Seq calls are generated using a threshold specific to each RNA-Seq library, calculated using reads mapped to reference intergenic regions.
This is unlike the more usual use of an arbitrary threshold below which a gene is not considered as expressed (e.g log2(TPM) = 1).
### Bgee database
Bgee is a database to retrieve and compare gene expression patterns in multiple animal species and produced from multiple data types (RNA-Seq, Affymetrix, in situ hybridization, and EST data).
It notably integrates RNA-Seq libraries for 29 species.
### Reference intergenic regions
Reference intergenic regions are defined in the [Bgee RNA-Seq pipeline](https://github.com/BgeeDB/bgee_pipeline/tree/master/pipeline/RNA_Seq).
Candidate intergenic regions are defined using gene annotation data.
For each species, over all available libraries, reads are mapped to these intergenic regions with [kallisto](https://github.com/pachterlab/kallisto), as well as to genes.
This "intergenic expression" is deconvoluted to distinguish reference intergenic from non annotated genes, which have higher expression.
Reference intergenic regions are then defined as intergenic regions with low expression level over all RNA-Seq libraries, relative to genes.
This step allows not to consider regions wrongly considered as intergenic because of potential gene annotation quality problem as intergenic.
For more information please refer to the [Bgee RNA-Seq pipeline](https://github.com/BgeeDB/bgee_pipeline/tree/master/pipeline/RNA_Seq).
### Threshold of present/absent
BgeeCall pipeline allows to download reference intergenic regions resulting from the expertise of the Bgee team.
Moreover BgeeCall allows to use these reference intergenic regions to automatically generate calls for your own RNA-Seq libraries as long as the species is integrated to [Bgee](https://bgee.org/).
By default BgeeCall calculate a pValue to define calls.
By default genes are consider present if the pValue is lower or equal to 0.05.
More information on this pValue and potential other approaches to generate the calls are available [here](#cutoff_type)
## Installation
In R:
```{r, eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("BgeeCall")
```
## How to use the BgeeCall package
BgeeCall is highly tunable.
Do not hesitate to have a look at the reference manual to have a precise descripton of all slots of the 4 main S4 classes (AbundanceMetadata, KallistoMetadata, BgeeMetadata and UserMetadata) or of all available functions.
BgeeCall needs kallisto to run.
If you do not have kallisto installed you will find more information how to install it [here](#install-kallisto)
### Load the package
```{r, message = FALSE, warning = FALSE}
library(BgeeCall)
```
### Quick start
With the BgeeCall package it is easy to generate present/absent gene expression calls.
Generation of the kallisto transcriptome index can require a lot of time.
As the time needed for this step depend on the size of the transcriptome we choose, as an example, the smallest transcriptome file among all species available on Bgee (C. elegans).
To generate these calls you will need :
-   a transcriptome
-   gene annotations
-   your RNA-Seq reads in fastq files
For this vignette we created a toy fastq file example based on the SRX099901 RNA-Seq library using the [ShortRead](http://bioconductor.org/packages/release/bioc/html/ShortRead.html) R package
```{r, eval=FALSE}
library("ShortRead")
# keep 48.000 reads
sampler <- FastqSampler(file.path("absolute_path","/SRX099901/SRR350955.fastq.gz"), 48000)
set.seed(1); SRR350955 <- yield(sampler)
writeFastq(object = SRR350955, 
           file =file.path( "absolute_path","SRX099901_subset", "SRR350955_subset.fastq.gz"), 
           mode = "w", full = FALSE, compress = TRUE)
```
In this example we used the Bioconductor AnnotationHub to load transcriptome and gene annotations but you can load them from wherever you want.
```{r, message = TRUE, warning = TRUE}
ah <- AnnotationHub::AnnotationHub()
ah_resources <- AnnotationHub::query(ah, c("Ensembl", "Caenorhabditis elegans", "84"))
annotation_object <- ah_resources[["AH50789"]]
# remove MtDNA not tag as C. elegans genome
annotation_object <- GenomeInfoDb::dropSeqlevels(annotation_object, "MtDNA", "coarse")
transcriptome_object <- rtracklayer::import.2bit(ah_resources[["AH50453"]])
```
Once you have access to transcriptome, gene annotations and your RNA-Seq library, an object of class `UserMetadata` has to be created.
```{r, message = FALSE, warning = FALSE}
# create an object of class UserMetadata and specify the species ID
user_BgeeCall <- new("UserMetadata", species_id = "6239")
# import annotation and transcriptome in the user_BgeeCall object
# it is possible to import them using an S4 object (GRanges, DNAStringSet) or a file (gtf, fasta)
user_BgeeCall <- setAnnotationFromObject(user_BgeeCall, annotation_object, "WBcel235_84")
user_BgeeCall <- setTranscriptomeFromObject(user_BgeeCall, transcriptome_object, "WBcel235")
# provide path to the directory of your RNA-Seq library
user_BgeeCall <- setRNASeqLibPath(user_BgeeCall, 
                                  system.file("extdata", 
                                              "SRX099901_subset", 
                                              package = "BgeeCall"))
```
Note that in BgeeCall it is possible to specify the annotation source.
The default source is ensembl, but gencode files can also be used, by specifying that in the attribute gtf_source in the class UserMetadata.
And that's it...
You can run the generation of your present/absent gene expression calls
```{r, eval = FALSE}
# if kallisto is already installed in your computer
calls_output <- generate_calls_workflow(userMetadata = user_BgeeCall)
# if kallisto is not yet installed in your computer
kallisto <- new("KallistoMetadata", download_kallisto = TRUE)
calls_output <- generate_calls_workflow(abundanceMetadata = kallisto, userMetadata = user_BgeeCall)
```
```{r, echo=FALSE}
user_BgeeCall <- setWorkingPath(user_BgeeCall, system.file("extdata", package = "BgeeCall"))
user_BgeeCall<- setSimpleArborescence(user_BgeeCall, TRUE)
calls_output <- generate_presence_absence(myUserMetadata = user_BgeeCall)
```
Each analyze generates 5 files and return path to each one of them.
-   calls_tsv_path : path to main tsv file with TPM, count, length, biotype, type, and presence/absence of expression summarized at gene level (or at the [transcript level](#transcript_level) if it was requested)
```{r, message = FALSE, warning = FALSE}
head(read.table(calls_output$calls_tsv_path, header = TRUE), n = 5)
```
-   cutoff_info_file_path : path to tsv summary of the analyze containing the proportion of gene, protein coding and intergenic defined as expressed. It also contains the library ID and the present/absent TPM threshold
```{r, message = FALSE, warning = FALSE}
read.table(calls_output$cutoff_info_file_path)
```
-   abundance_tsv : path to tsv kallisto quant output file
```{r, message = FALSE, warning = FALSE}
head(read.table(calls_output$abundance_tsv, header = TRUE), n = 5)
```
-   TPM_distribution_path : path to plot in pdf reprensting density distribution of TPM values for all sequences, protein coding sequences, and intergenic sequences. The grey line corresponds to TPM threshold used to generate present/absent calls.
```{r, eval = FALSE}
openPDF(calls_output$TPM_distribution_path)
```
-   S4_slots_summary : path to tsv file containing a summary of values used for the most important slots of the three S4 classes (UserMetadata, KallistoMetadata, and BgeeMetadata).
```{r, message = FALSE, warning = FALSE}
read.table(calls_output$S4_slots_summary, header = TRUE, sep = "\t")
```
Note that in the pValue approach (default method to generate present/absent gene expression calls) the calls file as well as the cutoff_info_file and the TPM_distribution files not contain extension name of the approach used.
On the other hand, all the other approaches contain the approach name in the file names.
### Generate present/absent calls for more than one RNA-Seq library
You will potentialy be also interested to generate present/absent calls on different RNA-Seq libraries, potentially on different species, or using The main function `generate_presence_absence()` allows to generate present/absent calls from a UserMetadata object but also from a data frame or a tsv file depending on the arguments of the function you use.
Please choose one of the three following arguments : - userMetadata : Allows to generate present/absent calls for one RNA-Seq library using one object of the class UserMetadata.\
- userDataFrame : Provide a dataframe where each row correspond to one present/absent call generation.
It allows to generate present/absent calls on different libraries, species, transcriptome, genome annotations, etc. - userFile : Similar to userDataFrame except that the information are stored in a tsv file.
A template of this file called `userMetadataTemplate.tsv` is available at the root of the package.
Columns of the dataframe or the tsv file are :
-   species_id : The NCBI ID of the species.
-   run_ids : The runs of the RNA-Seq library you want to use for the generation of the calls. Allows to generate expression calls for a subset of the runs of one RNA-Seq library as described in [Generate calls for a subset of RNA-Seq runs](#run_ids). If not interested by this option, leave the column empty.
-   reads_size : The size of the reads of your RNA-Seq library.
-   rnaseq_lib_path : Path to the directory containing all fastq files generated for this library. This directory can only contains single-end runs or paired-end runs.
-   transcriptome_path : path to the transcriptome file.
-   annotation_path : path to the genome annotation file. Works with GTF of GFF3 files.
-   working_path : path to the working directory where results will be stored. Using the same working directory for different RNA-Seq libraries of the same species will allow to reuse previously generated data like the custom transcriptome index (generated from both transcriptome and reference intergenic sequences). By default the working path is defined by the `getwd()` function and correspond to the working directory of your R session. If not interested by this option, leave the column empty.
-   output_directory : Both species results and RNA-Seq libraries results are by default stored at the same place using the value of the `working_path` column. However, this column allows you to define a different output_directory for RNA-Seq results. For instance it allows you to save calls information directly in the RNA-Seq directory. If not interested by this option, leave the column empty.
Once the file has been fill in expression calls can be generated with :
```{r, eval=FALSE}
calls_output <- generate_calls_workflow(userFile = "path_to_your_file.tsv")
```
### Parallized generation of present/absent calls on a cluster
BgeeCall already implement everything you need to generate calls on a cluster if it uses slurm as queuing system.
The same TSV file as described in the previous section will be used as input.
In addition to tuning options available when running BgeeCall on your computer, It is possible to modify how slurm jobs are submitted.
More information available in section [Modify slurm options](#slurm_options) In order to optimize parallelization calls will be generated in 2 steps.\
\\itemize {
\item
Generate data at species level (e.g trancriptome with intergenic sequences, kallisto indexes)
\item
Generate expression calls for each RNA-Seq libraries
```{r, eval = FALSE}
# generate kallisto indexes
generate_slurm_indexes(userFile = "path_to_your_file.tsv")
# generate expression calls
generate_slurm_calls(userFile = "path_to_your_file.tsv")
```
### Reference intergenic sequences
#### Releases of reference intergenic sequences
Different releases of reference intergenic sequences are available.
It is possible to list all these releases :
```{r}
list_intergenic_release()
```
It is then possible to choose one specific release to create a `BgeeMetadata` object.
Always use the setter method `setIntergenicRelease()` when changing the release of an already existing BgeeMetadata object.
```{r}
# create BgeeMetadata object and define one reference intergenic release
bgee <- new("BgeeMetadata", intergenic_release = "0.1")
# change the reference intergenic release of your BgeeMetadata object
bgee <- setIntergenicRelease(bgee, "0.2")
```
By default the reference intergenic release used when a `BgeeMetadata` object is created is the last stable one created by the Bgee team.
#### Core reference intergenic from Bgee
`Core reference intergenic` releases are created by the Bgee team when a lot of new RNA-Seq libraries have been manually curated for already existing species and/or for new species.
These releases are the only ones with a release number (e.g "0.1").
Each of these releases contains reference intergenic sequences for a list of species.
Bgee reference intergenic sequences have been generated using Bgee team expertise.
The RNA-Seq libraries were manually curated as healthy and wild type.
Quality Control have been done along all steps of generation of these sequences.
Reference intergenic sequences have been selected from all potential intergenic regions (see [Bgee pipeline](https://github.com/BgeeDB/bgee_pipeline/tree/master/pipeline/RNA_Seq)).
BgeeCall allows to generate gene expression call from Bgee reference intergenic sequences for any RNA-Seq libraries as long as these sequences have been generated by the Bgee team.
A tsv file containing all species available for current release of reference intergenic is available [here](https://bgee.org/ftp/intergenic/current/species_info.tsv).
This file also contains a column describing the number of RNA-Seq libraries used to generated the reference intergenic sequences of each species.
It is also possible to list in R all species for which Bgee reference intergenic sequences have been created :
```{r}
list_bgee_ref_intergenic_species(myBgeeMetadata = bgee)
```
#### Community reference intergenic
If you want to use BgeeCall on a species for which Bgee does not provide reference intergenic sequences you have the possibility to create them by yourself and share them with the Bgee community by following all steps of [this tutorial](https://github.com/BgeeDB/reference_intergenic_standalone#protocol-to-generate-reference-intergenic-sequences).
Do not forget that the number of RNA-Seq libraries is a key point to the generation of precise reference intergenic sequences.
It is possible to list in R all species for which reference intergenic sequences have been created by the community using the following code
```{r}
list_community_ref_intergenic_species()
```
If reference intergenic sequences of the species you are interested in are available only from the community release it is then possible to use this release to generate your present/absent calls
```{r, eval=FALSE}
# create a BgeeMetadata object using the community release
bgee <- new("BgeeMetadata", intergenic_release = "community")
calls_output <- generate_calls_workflow(bgeeMetadata = bgee, userMetadata = user_BgeeCall)
```
#### Your own reference intergenic
If you generated your own reference intergenic sequences follwowing [this tuorial](https://github.com/BgeeDB/reference_intergenic_standalone#protocol-to-generate-reference-intergenic-sequences) but did not share them for the moment (do not forget to do it...), it is also possible to use BgeeCall with a file containing the sequences.
In this case you need to select the custom release and provide the path to the file containing reference intergenic sequences :
```{r, eval=FALSE}
bgee <- new("BgeeMetadata", intergenic_release = "custom")
user_BgeeCall@custom_intergenic_path = "path/to/custom/ref_intergenic.fa.gz"
calls_output <- generate_calls_workflow(bgeeMetadata = bgee, userMetadata = user_BgeeCall)
```
### Generate present/absent calls at transcript level (beta version)
kallisto generates TPMs at the transcript level.
In the Bgee pipeline we summarize this expression at the gene level to calculate our present/absent calls.
In `BgeeCall` it is now possible to generate present/absent calls at the transcript level.
Be careful when using this feature as it has not been tested for the moment.
To generate such calls you only have to create one object of the class `KallistoMetadata` and edit the value of one attribute
```{r, eval=FALSE}
kallisto <- new("KallistoMetadata", txOut = TRUE)
calls_output <- generate_calls_workflow(myAbundanceMetadata = kallisto, userMetadata = user_BgeeCall)
```
### Tune how to use kallisto
#### Download or reuse your own kallisto {#install-kallisto}
By default `BgeeCall` suppose that kallisto is installed.
If kallisto is not installed on your computer you can either :
-   let BgeeCall automatically download the version 0.45 of kallisto. BgeeCall will use it to quantify abundance of transcripts. It will only be used by this package and will have no impact on your potential already existing version of kallisto.
```{r, eval=FALSE}
kallisto <- new("KallistoMetadata", download_kallisto = TRUE)
calls_output <- generate_calls_workflow(abundanceMetadata = kallisto, userMetadata = user_BgeeCall)
```
-   You can download and install kallisto by following the instructions on the official website : 
#### Edit kallisto quant attributes
By default kallisto is run with the same parameters that we use in the RNA-Seq Bgee pipeline:
-   single end : "-t 1 --single -l 180 -s 30 --bias"
-   paired end : "-t 1 --bias"
It is possible to modify them and use your favourite kallisto parameters
```{r, eval=FALSE}
kallisto <- new("KallistoMetadata", single_end_parameters = "-t 3 --single -l 150 -s 30", pair_end_parameters = "-t 2 -b --seed 36")
calls_output <- generate_calls_workflow(myAbundanceMetadata = kallisto, userMetadata = user_BgeeCall)
```
#### Choose between two kmer size
By default 2 indexes with 2 different kmer sizes can be used by `BgeeCall` The default kmer size of kallisto (31) is used for libraries with reads length equal or larger than 50 bp.
A kmer size of 15 is used for libraries with reads length smaller than 50 bp.
We decided not to allow to tune kmers size because the generation of the index is time consuming and index generation takes even more time with small kmers size (\< 15bp).
However it is possible to modify the threshold of read length allowing to choose between default and small kmer size.
```{r, eval=FALSE}
# libraries with reads smaller than 70bp will use the index with kmer size = 15
kallisto <- new("KallistoMetadata", read_size_kmer_threshold = 70)
calls_output <- generate_calls_workflow(myAbundanceMetadata = kallisto, userMetadata = user_BgeeCall)
```
Note that for libraries with read length unspecified the default kmer size(31) will be used.
### Generate calls for a subset of RNA-Seq runs
By default gene expression calls are generated using all runs of the RNA-Seq library.
It is possible to select only a subset of these runs.
```{r, eval=FALSE}
# RNA-Seq run SRR350955_subsetof from the RNA-Seq library will be used to generate the calls
user_BgeeCall <- setRunIds(user_BgeeCall, c("SRR350955_subset"))
calls_output <- run_from_object(myUserMetadata = user_BgeeCall)
```
When run IDs are selected, the name output directory combine the library ID and all selected run IDs.
In our example the expression calls will be stored in the directory `SRX099901_SRR350955_subset`.
### Modify present/absent threshold {#cutoff_type}
#### Default pValue approach
By default BgeeCall generate calls using a pValue approach.
In order to generate the pValue per gene id we compute the numerical measure, Z-score, that describes the value relationship to the mean.
The Z-score measure is done in terms of standards deviations from the referent mean, in this case using a set of reference intergenic regions, as specified in the formula below:
$$ zScore = \frac {log2(tpmValue) - mean(log2(IntergenicTpmValues))}{sd(log2(IntergenicTpmValues))} $$ From the Z-score value, for each gene id, we compute a distribution of pValues using the following formula:
$$ pValue = pnorm(zscore, lower.tail = FALSE) $$ Genes are consider present if the pValue is lower or equal to 0.05.
By default all genes with an abundance higher than 0 (i.e having at least one read mapped to a transcript) and with a pValue lower or equal to 0.05 are considered as present.
Other genes are called absent.
It is possible to modify the pValue cutoff.
Be careful when editing this value as it has a big impact on your present absent.
```{r}
kallisto <- new("KallistoMetadata", cutoff = 0.1)
```
#### Intergenic threshold approach
Expression calls can also be generated using a threshold of intergenic sequences called present.
This approach was used to generate Bgee expression calls until release 14 with the formula :
$$ \frac {proportion\ of\ reference\ intergenic\ present}{proportion\ of\ protein\ coding\ present} = 0.05 $$ It is possible to change the cutoff value of this ratio.
```{r}
# use intergenic approach with default cutoff ratio 0.05
kallisto <- new("KallistoMetadata", cutoff_type = "intergenic")
# use intergenic approach with cutoff ratio 0.01
kallisto <- new("KallistoMetadata", cutoff_type = "intergenic", cutoff = 0.01)
```
#### qValue threshold approach
The expression calls can also be generated using the qValue approach.
In this approach we perform a linear interpolation for each density region, genic and reference intergenic, followed by the numerical integration.
Then for each unique abundance value (TPM) we integrate and scale across the region.
After that, the correspondent qValue for each gene id is computed following the formula:
$$ qValue = \frac {intergenic}{intergenic + genic} $$ \### Collect statistics
In BgeeCall the user are able to collect all the statistics provided by the different approaches to call expressed genes at individual library.
To provide this informative table, all cutoff_info_file\_ from each individual library and from each individual approach are used to retrieve the correspondent information.
```{r, eval=FALSE}
summaryStats <- get_summary_stats(userFile = "path/to/file.tsv", outDir = "path/to/output_directory")
```
### Run BgeeCall in quiet mode
By default BgeeCall write output messages for all parts of the workflow.
It is possible not to write any messages by changing the value of the slot verbose in the UserMetadata object.
By default this value is set to true but it is possible to change it to false with this line :
```{r}
user_BgeeCall@verbose <- FALSE
```
### Do not rerun parts of the pipeline
Generation of present/absent expression calls is done in several steps.
It is possible to force overwritting already existing intermediary files : - overwrite_index : slot of the object KallistoMetadata.
The value has to be a logical.
If FALSE (default), the index generation step is skiped if one index already exists.
If TRUE, the kallisto index will be generated even if one index already exists.
- overwrite_quant : slot of the object KallistoMetadata.
The value has to be a logical.
If FALSE (default), the kallisto quantification step is skiped if a quantification file already exists.
If TRUE, the kallisto quantification step will be run even if a quantification file already exists.
- overwrite_calls : slot of the object KallistoMetadata.
The value has to be a logical.
If FALSE, the generation of present/absent calls is skiped if an index already exists.
If TRUE (default), the generation of present/absent calls will be run even if calls were already generated.
### Ignore transcript version
It can happen that calls are generated using transcriptome or annotations containing transcript version (number after a dot in transcript IDs e.g ENSMUST00000082908.2) and annotations or transcriptome without transcript version.
This is problematic when using tximport to transform abundance at transcript level to abundance at gene level and result in an error.
```         
Error in .local(object, ...) : 
  None of the transcripts in the quantification files are present
  in the first column of tx2gene. Check to see that you are using
  the same annotation for both.
Example IDs (file): [, ...]
Example IDs (tx2gene): [ENSMUST00000193812, ENSMUST00000082908, ENSMUST00000192857, ...]
  This can sometimes (not always) be fixed using 'ignoreTxVersion' or 'ignoreAfterBar'.
Calls: do.call ... tximport -> summarizeToGene -> summarizeToGene -> .local
Execution halted
```
To solve this error tximport implemented an option called ignoreTxVersion that remove the transcript version from both transcriptome and annotations.
It is possible to use this option by modifying the value of the slot ignoreTxVersion (default FALSE) of the S4 class KallistoMetadata.
```{r}
kallisto <- new("KallistoMetadata", ignoreTxVersion = TRUE)
```
### Generate calls with a simple arborescence of directories
By default the arborescence of directories created by `BgeeCall` is as simple as possible.
the results will be created using the path `working_path/intergenic_release/all_results/libraryId`.
Generating present/absent gene expression calls for the same RNA-Seq library using different transcriptome or annotation versions using this arborescence will overwrite previous results.
The `UserMetadata` class has an attribute `simple_arborescence` that is `TRUE` by default.
If `FALSE`, a complexe arborescence of directories containing the name of the annotation and transcriptome files will be created.
This complex arborescence will then allow to generate present/absent calls for the same library using different version of transcriptome or annotaiton.
```{r, eval=FALSE}
user_BgeeCall <- setSimpleArborescence(userObject = user_BgeeCall, simpleArborescence = FALSE)
calls_output <- run_from_object(myUserMetadata = user_BgeeCall)
```
### Change directory where calls are saved
By default directories used to save present/absent calls are subdirectories of `UserMetadata@working_path`.
However it is possible to select the directory where you want the calls to be generated.
```{r}
user_BgeeCall <- setOutputDir(user_BgeeCall, "path/to/calls/for/this/library/")
```
This output directory will only contains results generated at the RNA-Seq library level.
All data generated at species level are still stored using the `UserMetadata@working_path`.
They can then still be reused to generate calls from other libraries of the same species.
### Modify slurm options
Two functions are available to run BgeeCall on a slurm queuing system.
Parameters described below are available for both of them.
#### Number of jobs
The full idea of using a cluster is to parallelize your jobs.
By default 10 jobs are run at the same time.
It is possible to modify this number with the parameter `nodes`.
```{r, eval=FALSE}
# run 50 jobs in parallel
generate_slurm_indexes(userFile = "path/to/file.tsv", nodes = 50)
```
#### Do not submit the jobs
In order to be able to check files automatically generated to run the jobs it is possible to generate these files without submiting your jobs.
More information on created files are available on the vignette of the ()[rslurm package].
```{r, eval=FALSE}
# create temporary files but do not submit the jobs
generate_slurm_indexes(userFile = "path/to/file.tsv", submit = FALSE)
```
#### Modify slurm options
A bash scirpt is automatically created to run the jobs.
This script contains default slurm options (array, cpus-per-task, job-name, output).
All other slurm options recognized by the sbatch command can be updated b creating a named list where name correspond to long name of options (e.g do not use 'p' but 'partition').
```{r, eval=FALSE}
# add slurm options to the sbatch script
slurm_options_index <- list(account = "account", time = "2:00:00", partition = "partition", mem = "30G")
generate_slurm_indexes(userFile = "path/to/file.tsv", slurm_options = slurm_options_index)
```
#### Add modules to your environment
In some cluster programs are not loaded by default.
The modules parameter allows to load them by adding one line in the sbatch script.
This option has been implemented to add modules but could potentially be used to add any custom line of code in the sbatch script.
```{r, eval=FALSE}
# load R 3.6.1 and kallisto in a cluster environment where software has to loaded manually
modules <- c("module add R/3.6.1;", "module add kallisto;")
generate_slurm_indexes(userFile = "path/to/file.tsv", modules = modules)
```
#### Modify BgeeCall objects
By default except for columns present in the tsv file all other slots of the 3 BgeeCall classes will use default values.
In order to tune these parameters it is possible to create the objects and pass them to the slurm functions.
\strong{ Important :} When generating these objects it is mandatory to keep the same name as in the example below.
```{r, eval=FALSE}
# create BgeeCall objects and use them to generate indexes
kallistoMetadata <- new("KallistoMetadata", download_kallisto=TRUE)
userMetadata <- new("UserMetadata", working_path = "/path/to/working/dir")
bgeeMetadata <- new("BgeeMetadata", intergenic_release = "0.1")
generate_slurm_indexes(userFile = "path/to/file.tsv", kallistoMetadata = kallistoMetadata, bgeeMetadata = bgeeMetadata, userMetadata = userMetadata)
```
## Merging multiple libraries
In the BgeeCall package we have the possibility to perform calls of expression by merging/combine multiple libraries that belongs to a particular condition.
In order to use this functionality, the calls at individual library need to be done by using the p-value or q-value approach, since this approaches provide for each $gene_i$ a quantitative metric.
If the calls of expression genes, at individual library, was done using the pValue approach, then the merging will be done by using the Benjamini & Hochberg method.
For this method a set of pValues are collected, for each $gene_i$ across $n$ libraries, and then the referent p.adjusted values are computed.
To call expressed genes in the merging, we verify if one of the p.adjusted values is lower then the cutoff (threshold value provided by the user), if yes, the $gene_i$ is classified as present for the condition, otherwise is absent.
```{r, eval=FALSE}
mergingLibraries <- merging_libraries(userFile = "path/to/userFile.tsv", approach = "BH", condition = "species_id", cutoff = 0.05, outDir = "path/to/output_directory/")
```
On the other side, if the approach used to call expressed genes, at individual library, was the qValue, then the merging will be done by using the inverse fdr correction, this means by applying the correspondent formula:
$$ FDR_{inverse} = (1-((1-q)^{(\frac {1}{n})})) $$ where $q$ is the desired cut-off (threshold value provided by the user) and $n$ is the number of libraries.
```{r, eval=FALSE}
# merging libraries
mergingLibraries <- merging_libraries(userFile = "path/to/userFile.tsv", approach = "fdr_inverse", condition = "species_id", cutoff = 0.05, outDir = "path/to/output_directory/")
```
To call expressed genes in the merging condition, we verify if one of the libraries have a q-Value lower then the FDR_inverse, if yes, the $gene_i$ is classified as present, otherwise is classified as absent.
### Arguments and user file to perform the merging
In the merging_libraries() function, different parameters need to be passed to the referent arguments: userFile, approach, condition, cutoff or outDir.
In the userFile argument a path to a text file containing all information about the target libraries to be merged.
This file should contain a fundamental column called: specied_id in order to run the merging_libraries() function.
This column should not be empty and also an important request is that you should have at least 2 libraries to be processed, since this is a merging process.
If you want to merge/combine libraries for a more detailed condition you need to add the correspondent extra columns to your user file.
Note that the argument condition recognize the following parameters: specied_id, anatEntity, devStage, sex and strain.
This means that your table columns also should have the same ID.
The order of the columns are not relevant when the user file is structured, as well as, the order of the parameters passed to the condition argument.
Is always recommended to use "-" instead of empty rows when some information is missing, as represented in the **BgeeCall/inst/userMetadataTemplate_merging.tsv**
The argument approach allow to specify the method that will be used to make the calls of expression in the merging libraries, and as described before, the condition argument will allow to specify the particular condition of interest, the condition argument allow to target the correspondent libraries where the merging will occur.
More restricted cut-off can be applied to call expressed genes in the merging libraries for a target condition, for this the cutoff argument can be modified, otherwise by default is 0.05.
```{r, eval=FALSE}
# merging libraries where p-values were calculated
mergingLibraries_pValue <- merging_libraries(userFile = "path/to/userFile.tsv", approach = "BH", condition = c("species_id", "anatEntity", "strain"), cutoff = 0.005, outDir = "path/to/output_directory/")
# merging libraries where q-values were calculated
mergingLibraries_qValue <- merging_libraries(userFile = "path/to/userFile.tsv", approach = "fdr_inverse", condition = c("species_id", "devStage" ,"anatEntity", "strain"), cutoff = 0.001, outDir = "path/to/output_directory/")
```
#Session Info
```{r sessioninfo}
sessionInfo()
```
```{r cleanup_after, echo=FALSE, message=FALSE, warning=FALSE}
unlink(BgeeCall:::get_kallisto_dir_path(kallisto, user_BgeeCall), recursive = TRUE)
unlink(file.path(getWorkingPath(user_BgeeCall), paste0(getIntergenicPrefix(bgee), "*"), user_BgeeCall@species_id, "intergenic_ids.txt"))
unlink(file.path(getWorkingPath(user_BgeeCall), paste0(getIntergenicPrefix(bgee), "*"), user_BgeeCall@species_id, "bgee_intergenic.fa.gz"))
unlink(file.path(getWorkingPath(user_BgeeCall), paste0(getIntergenicPrefix(bgee), "*"), "all_results", "SRX099901_subset", "gene_*"), recursive = TRUE)
unlink(file.path(getWorkingPath(user_BgeeCall), paste0(getIntergenicPrefix(bgee), "*"), "all_results", "SRX099901_subset", "S4_slots_summary.tsv"), recursive = TRUE)
```