---
title: "How to use the BSgenomeForge package to forge a BSgenome data package"
author:
- name: "Atuhurira Kirabo Kakopo"
- name: "Hervé Pagès"
date: "Modified: 20 June 2023; Compiled: `r format(Sys.time(), '%d %B, %Y')`"
package: BSgenomeForge
vignette: >
  %\VignetteIndexEntry{How to use the BSgenomeForge package to forge a BSgenome data package}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
output:
  BiocStyle::html_document:
  number_sections: true
  toc: true
  toc_depth: 4
---

# Introduction

BSgenome data packages are one of the many types of annotation packages
available in [*Bioconductor*](https://bioconductor.org). They contain the
genomic sequences, which comprise chromosome sequences and other DNA
sequences of a particular genome assembly for a given organism. For example
`r Biocpkg("BSgenome.Hsapiens.UCSC.hg19")` is a BSgenome data package that
contains the genomic sequences of the hg19 genome from
[UCSC](https://genome.ucsc.edu/cgi-bin/hgGateway).
Users can easily and efficiently access the sequences, or portions of the
sequences, stored in these packages, via a common API implemented in the
`r Biocpkg("BSgenome")` software package.

Bioconductor currently provides more than [100 BSgenome data packages](https://bioconductor.org/packages/release/BiocViews.html#___BSgenome),
for more than 30 organisms. Most of them contain the genomic sequences of
_UCSC genomes_ (i.e. genomes supported by the UCSC Genome Browser) or
_NCBI assemblies_. The packages are used in various Bioconductor workflows, as
well as in man page examples and vignettes of other Bioconductor packages,
typically in conjunction with tools available in the
`r Biocpkg("BSgenomeForge")` and `r Biocpkg("Biostrings")` software packages.
New BSgenome data packages get added on a regular basis, based on user demand.

The current process of forging a BSgenome data package requires one to download
the sequence data files, either as a single twoBit file (e.g. musFur1.2bit) or
in a collection of compressed FASTA files. The user then needs to prepare the
BSgenome data package seed file, and then call `forgeBSgenomeDataPkg` function
on the seed file to forge the package. This is quite a convoluted task that can
be an inconvenience for the average user.

The BSgenomeForge package simplifies the whole process, by allowing the user to
use one function to create a BSgenome data package. It is comprised of two main
functions: `forgeBSgenomeDataPkgfromNCBI` and `forgeBSgenomeDataPkgfromUCSC`
which allow the user to forge BSgenome data packages with sequence data files
sourced from NCBI or UCSC respectively.

# Installation

```{r, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("BSgenomeForge")
```

# Basic usage

## Using `forgeBSgenomeDataPkgFromNCBI()`

```{r, message=FALSE}
library(BSgenomeForge)
```

**Example 1:** Information about assembly ASM972954v1 can be found at
https://www.ncbi.nlm.nih.gov/assembly/GCF_009729545.1/, including the assembly
accession, *GCA_009729545.1* and organism name, *Acidianus infernus*. Assembly
ASM972954v1 does not contain any circular sequences to be specified: 

```{r}
forgeBSgenomeDataPkgFromNCBI(assembly_accession="GCA_009729545.1",
                             pkg_maintainer="Jane Doe <janedoe@gmail.com>",
                             organism="Acidianus infernus")
```

**Example 2:** Information about assembly ASM836960v1 can be found at
https://www.ncbi.nlm.nih.gov/assembly/GCA_008369605.1/, including the assembly
accession, *GCA_008369605.1* and organism name, *Vibrio cholerae*.
Assembly ASM836960v1 contains three circular sequence, "1", "2" and "unnamed".
See CP043554.1, CP043556.1, and CP043555.1 in the NCBI Nucleotide database at
https://www.ncbi.nlm.nih.gov/nuccore/. They must be specified as shown in the
example below:

```{r}
forgeBSgenomeDataPkgFromNCBI(assembly_accession="GCA_008369605.1",
                             pkg_maintainer="Jane Doe <janedoe@gmail.com>",
                             organism="Vibrio cholerae",
                             circ_seqs = c("1", "2", "unnamed"))
```

Check `?forgeBSgenomeDataPkgFromNCBI` for more information.

## Using `forgeBSgenomeDataPkgFromUCSC()`

**Example 3:** Information about genome wuhCor1 can be found at
<https://genome.ucsc.edu/cgi-bin/hgGateway>. This belongs to the organism
*Severe acute respiratory syndrome coronavirus 2*. Genome wuhCor1 does not
contain any circular sequences to be specified:

```{r}
forgeBSgenomeDataPkgFromUCSC(
    genome="wuhCor1",
    organism="Severe acute respiratory syndrome coronavirus 2",
    pkg_maintainer="Jane Doe <janedoe@gmail.com>"
)
```

Check `?forgeBSgenomeDataPkgFromUCSC` for more information.

# Final steps

`forgeBSgenomeDataPkgfromNCBI` or `forgeBSgenomeDataPkgfromUCSC` returns the
path to the created package at the end of its execution. This can be used to
find the package location, and afterwards carry out the following commands to
build the package source tarball via command line (i.e. in a Linux/Unix terminal
or Windows PowerShell terminal).

```
R CMD build <pkgdir>
```

where \<pkgdir\> is the path to the source tree of the package. Then check the
package with

```
R CMD check <tarball>
```

where \<tarball\> is the path to the tarball produced by R CMD build. Finally
install the package with

```
R CMD INSTALL <tarball>
```

These operations can also be carried out within R, instead, using the
`r CRANpkg("devtools")` package

```{r}
devtools::build("./BSgenome.Ainfernus.NCBI.ASM972954v1")
```

```{r, eval=FALSE}
devtools::check_built("BSgenome.Ainfernus.NCBI.ASM972954v1_1.0.0.tar.gz")
```

```{r, eval=FALSE}
devtools::install_local("BSgenome.Ainfernus.NCBI.ASM972954v1_1.0.0.tar.gz")
```

# sessionInfo()

```{r}
sessionInfo()
```