Contents

1 Introduction

Originally, this package was written when the kallisto | bustools concept was still experimental, to test a new and fast way to generate the gene count matrix from fastq files for scRNA-seq. In the past year, kallisto | bustools has matured. Now there’s a wrapper kb-python that can download a prebuilt kallisto index for human and mice and call kallisto bus and bustools to get the gene count matrix. So largely, the old way of calling kallisto bus and bustools, and some functionalities of BUSpaRse, such as getting transcript to gene mapping, are obsolete.

So now the focus of BUSpaRse has shifted to finer control of the transcripts that go into the transcriptome and more options. Now all tr2g_* functions (except tr2g_ensembl) can filter transcripts for gene and transcript biotypes, only keep standard chromosomes (so no scaffolds and haplotypes), and extract the filtered transcripts from the transcriptome. GTF files from Ensembl, Ensembl fasta files, GFF3 files from Ensembl and RefSeq, TxDb, and EnsDb can all be used here.

library(BUSpaRse)
library(BSgenome.Hsapiens.UCSC.hg38)
#> Loading required package: BSgenome
#> Loading required package: BiocGenerics
#> 
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     Filter, Find, Map, Position, Reduce, anyDuplicated, append,
#>     as.data.frame, basename, cbind, colnames, dirname, do.call,
#>     duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
#>     lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
#>     pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
#>     tapply, union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#> Loading required package: stats4
#> 
#> Attaching package: 'S4Vectors'
#> The following objects are masked from 'package:Matrix':
#> 
#>     expand, unname
#> The following objects are masked from 'package:base':
#> 
#>     I, expand.grid, unname
#> Loading required package: IRanges
#> Loading required package: GenomeInfoDb
#> Loading required package: GenomicRanges
#> Loading required package: Biostrings
#> Loading required package: XVector
#> 
#> Attaching package: 'Biostrings'
#> The following object is masked from 'package:base':
#> 
#>     strsplit
#> Loading required package: rtracklayer
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
#> Loading required package: GenomicFeatures
#> Loading required package: AnnotationDbi
#> Loading required package: Biobase
#> Welcome to Bioconductor
#> 
#>     Vignettes contain introductory material; view with
#>     'browseVignettes()'. To cite Bioconductor, see
#>     'citation("Biobase")', and for packages 'citation("pkgname")'.
library(EnsDb.Hsapiens.v86)
#> Loading required package: ensembldb
#> Loading required package: AnnotationFilter
#> 
#> Attaching package: 'ensembldb'
#> The following object is masked from 'package:stats':
#> 
#>     filter

2 Downloading a transcriptome

The transcriptome can be downloaded from a specified version of Ensembl and filtered for biotypes and standard chromosomes, not only for the vertebrate database (www.ensembl.org and its mirrors), but also other Ensembl sites for plants, fungi, protists, and metazoa. The gene_biotype_use = "cellranger" means that the same gene biotypes Cell Ranger uses for its reference package are used here. By default, only standard chromosomes are kept. The dl_transcriptome function not only downloads the transcriptome and filters it, it also output the tr2g.tsv file of all transcripts in the filtered transcriptome, without column names, so can be directly used for bustools.

Wonder which biotypes are available? The lists of all gene and transcript biotypes from Ensembl are now provided in this package, and can be queried by data("ensembl_gene_biotypes") and data("ensembl_tx_biotypes").

Resources for common invertebrate model organisms such as Drosophila melanogaster and C. elegans are actually available on the vertebrate site (www.ensembl.org).

# For Drosophila
dl_transcriptome("Drosophila melanogaster", out_path = "fly", 
                 gene_biotype_use = "cellranger", verbose = FALSE)
#> Warning: Ensembl will soon enforce the use of https.
#> Ensure the 'host' argument includes "https://"

#> Warning: Ensembl will soon enforce the use of https.
#> Ensure the 'host' argument includes "https://"
#> Ensembl site unresponsive, trying asia mirror
#> Ensembl site unresponsive, trying asia mirror
#> Version is not applicable to IDs not of the form ENS[species prefix][feature type prefix][a unique eleven digit number].
list.files("fly")
#> [1] "Drosophila_melanogaster.BDGP6.32.cdna.all.fa.gz"
#> [2] "tr2g.tsv"                                       
#> [3] "tx_filtered.fa"

The first file is the original fasta file. The second is the tr2g file without column names. The third is the filtered fasta file.

For C. elegans, from an archived version of Ensembl

dl_transcriptome("Caenorhabditis elegans", out_path = "worm", verbose = FALSE,
                 gene_biotype_use = "cellranger", ensembl_version = 97)
#> Version is not applicable to IDs not of the form ENS[species prefix][feature type prefix][a unique eleven digit number].
list.files("worm")
#> [1] "Caenorhabditis_elegans.WBcel235.cdna.all.fa.gz"
#> [2] "tr2g.tsv"                                      
#> [3] "tx_filtered.fa"

For Saccharomyces cerevisiae. Note that the versioning of Ensembl for the plants, fungi, and etc. sites, that are actually www.ensemblgenomes.org, is different from that of the vertebrate site.

dl_transcriptome("Saccharomyces cerevisiae", out_path = "yeast", 
                 type = "fungus", gene_biotype_use = "cellranger", 
                 verbose = FALSE)
#> Warning: Ensembl will soon enforce the use of https.
#> Ensure the 'host' argument includes "https://"

#> Warning: Ensembl will soon enforce the use of https.
#> Ensure the 'host' argument includes "https://"
#> Version is not applicable to IDs not of the form ENS[species prefix][feature type prefix][a unique eleven digit number].
list.files("yeast")
#> [1] "Saccharomyces_cerevisiae.R64-1-1.cdna.all.fa.gz"
#> [2] "tr2g.tsv"                                       
#> [3] "tx_filtered.fa"

3 Obtaining transcript to gene information

3.1 From Ensembl

The transcript to gene data frame can be generated by directly querying Ensembl with biomart. This can query not only the vertebrate database (www.ensembl.org), but also the Ensembl databases for other organisms, such as plants (plants.ensembl.org) and fungi (fungi.ensembl.org). By default, this will use the most recent version of Ensembl, but older versions can also be used. By default, Ensembl transcript ID (with version number), gene ID (with version number), and gene symbol are downloaded, but other attributes available on Ensembl can be downloaded as well. Make sure that the Ensembl version matches the Ensembl version of transcriptome used for kallisto index.

# Specify other attributes
tr2g_mm <- tr2g_ensembl("Mus musculus", ensembl_version = 99, 
                        other_attrs = "description", 
                        gene_biotype_use = "cellranger")
#> Querying biomart for transcript and gene IDs of Mus musculus
head(tr2g_mm)
ABCDEFGHIJ0123456789
 
 
transcript
<chr>
gene
<chr>
3ENSMUST00000082421.1ENSMUSG00000064370.1
5ENSMUST00000082419.1ENSMUSG00000064368.1
6ENSMUST00000082418.1ENSMUSG00000064367.1
10ENSMUST00000082414.1ENSMUSG00000064363.1
11ENSMUST00000084013.1ENSMUSG00000065947.3
13ENSMUST00000082411.1ENSMUSG00000064360.1
# Plants
tr2g_at <- tr2g_ensembl("Arabidopsis thaliana", type = "plant")
#> Version is only available to vertebrates.
#> Querying biomart for transcript and gene IDs of Arabidopsis thaliana
#> Warning: Ensembl will soon enforce the use of https.
#> Ensure the 'host' argument includes "https://"
#> File /tmp/Rtmp605ceL/Rbuild35ba1e2bb78f6f/BUSpaRse/vignettes/tr2g.tsv already exists.
head(tr2g_at)
ABCDEFGHIJ0123456789
 
 
transcript
<chr>
gene
<chr>
gene_name
<chr>
chromosome_name
<chr>
1AT3G11415.1AT3G114153
2AT1G31258.1AT1G312581
3AT5G24735.1AT5G247355
4AT2G45780.1AT2G457802
5AT2G42425.1AT2G424252
6AT4G01533.1AT4G015334

3.2 From FASTA file

We need a FASTA file for the transcriptome used to build kallisto index. Transcriptome FASTA files from Ensembl contains gene annotation in the sequence name of each transcript. Transcript and gene information can be extracted from the sequence name. At present, only Ensembl FASTA files or FASTA files with sequence names formatted like in Ensembl are accepted.

By default, the tr2g.tsv file and filtered fasta file (if filtering for biotypes and chromosomes) are written to disk, but these can be turned off so only the tr2g data frame is returned into the R session.

# Subset of a real Ensembl FASTA file
toy_fasta <- system.file("testdata/fasta_test.fasta", package = "BUSpaRse")
tr2g_fa <- tr2g_fasta(file = toy_fasta, write_tr2g = FALSE, save_filtered = FALSE)
head(tr2g_fa)
ABCDEFGHIJ0123456789
transcript
<chr>
gene
<chr>
gene_name
<chr>
ENST00000362684.1ENSG00000277234.1RNU1-5P
ENST00000459215.1ENSG00000239102.1RNU7-185P
ENST00000410940.1ENSG00000222872.1RNU4-78P
ENST00000383971.1ENSG00000206698.1RNU1-73P
ENST00000516317.2ENSG00000252126.2RNU6-313P

3.3 From GTF and GFF3 files

If you have GTF or GFF3 files for other purposes, these can also be used to generate the transcript to gene file. Now tr2g_gtf and tr2g_gff3 can extract transcriptome from a genome that is either a BSgenome or a DNAStringSet.

# Subset of a reral GTF file from Ensembl
toy_gtf <- system.file("testdata/gtf_test.gtf", package = "BUSpaRse")
tr2g_tg <- tr2g_gtf(toy_gtf, Genome = BSgenome.Hsapiens.UCSC.hg38,
                    gene_biotype_use = "cellranger",
                    out_path = "gtf")
#> 635 sequences in the genome are absent from the annotation.
head(tr2g_tg)
ABCDEFGHIJ0123456789
transcript
<chr>
gene
<chr>
gene_name
<chr>
ENST00000287218.9ENSG00000156639.12ZFAND3
ENST00000373391.6ENSG00000156639.12ZFAND3
ENST00000474522.5ENSG00000156639.12ZFAND3
ENST00000373389.5ENSG00000156639.12ZFAND3
ENST00000469208.1ENSG00000156639.12ZFAND3
ENST00000440482.2ENSG00000156639.12ZFAND3

A new GTF or GFF3 file after filtering biotypes and chromosomes is also written, and this can be turned off by setting save_filtered_gtf = FALSE or save_filtered_gff = FALSE. The transcriptome, with biotypes filtered and only standard chromosomes, is in transcriptome.fa. Use compress_fa = TRUE to gzip it.

list.files("gtf")
#> [1] "gtf_filtered.gtf" "tr2g.tsv"         "transcriptome.fa"

3.4 From TxDb

TxDb is a class for storing transcript annotations from the Bioconductor package GenomicFeatures. Unfortunately, TxDb.Hsapiens.UCSC.hg38.knownGene does not have biotype information or gene symbols.

tr2g_hs <- tr2g_TxDb(TxDb.Hsapiens.UCSC.hg38.knownGene, get_transcriptome = FALSE,
                     write_tr2g = FALSE)
#> 'select()' returned 1:1 mapping between keys and columns
head(tr2g_hs)
ABCDEFGHIJ0123456789
 
 
transcript
<chr>
gene
<chr>
tx_id
<int>
seqnames
<chr>
1ENST00000456328.21002871021chr1
2ENST00000450305.21002871022chr1
3ENST00000473358.11079857303chr1
4ENST00000469289.11079857304chr1
5ENST00000607096.11003022785chr1
9ENST00000641515.2795019chr1

3.5 From EnsDb

EnsDb is a class for Ensembl gene annotations, from the Bioconductor package ensembldb. Ensembl annotations as EnsDb are available on AnnotationHub (since version 87), and some older versions are stand alone packages (e.g. EnsDb.Hsapiens.v86).

tr2g_hs86 <- tr2g_EnsDb(EnsDb.Hsapiens.v86, get_transcriptome = FALSE, 
                        write_tr2g = FALSE, gene_biotype_use = "cellranger",
                        use_gene_version = FALSE, use_transcript_version = FALSE)
head(tr2g_hs86)
ABCDEFGHIJ0123456789
 
 
transcript
<chr>
gene
<chr>
gene_name
<chr>
gene_biotype
<chr>
seqnames
<chr>
1ENST00000000233ENSG00000004059ARF5protein_coding7
2ENST00000000412ENSG00000003056M6PRprotein_coding12
3ENST00000000442ENSG00000173153ESRRAprotein_coding11
4ENST00000001008ENSG00000004478FKBP4protein_coding12
5ENST00000001146ENSG00000003137CYP26B1protein_coding2
6ENST00000002125ENSG00000003509NDUFAF7protein_coding2

3.6 Deprecation

There used to be sections about sort_tr2g and save_tr2g_bustools, but these functions have been superseded by the new version of tr2g functions and dl_transcriptome, which sort the transcriptome after extracting it so the tr2g and the transcriptome are in the same order. The new version of tr2g functions and dl_transcriptome also by default writes the tr2g.tsv without column names with the first column as transcript and the second as gene to disk.

sessionInfo()
#> R version 4.1.1 (2021-08-10)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] EnsDb.Hsapiens.v86_2.99.0               
#>  [2] ensembldb_2.18.0                        
#>  [3] AnnotationFilter_1.18.0                 
#>  [4] TxDb.Hsapiens.UCSC.hg38.knownGene_3.14.0
#>  [5] GenomicFeatures_1.46.0                  
#>  [6] AnnotationDbi_1.56.0                    
#>  [7] Biobase_2.54.0                          
#>  [8] BSgenome.Hsapiens.UCSC.hg38_1.4.4       
#>  [9] BSgenome_1.62.0                         
#> [10] rtracklayer_1.54.0                      
#> [11] Biostrings_2.62.0                       
#> [12] XVector_0.34.0                          
#> [13] GenomicRanges_1.46.0                    
#> [14] GenomeInfoDb_1.30.0                     
#> [15] IRanges_2.28.0                          
#> [16] S4Vectors_0.32.0                        
#> [17] BiocGenerics_0.40.0                     
#> [18] ggplot2_3.3.5                           
#> [19] zeallot_0.1.0                           
#> [20] Matrix_1.3-4                            
#> [21] BUSpaRse_1.8.0                          
#> [22] TENxBUSData_1.7.0                       
#> [23] BiocStyle_2.22.0                        
#> 
#> loaded via a namespace (and not attached):
#>  [1] colorspace_2.0-2              rjson_0.2.20                 
#>  [3] ellipsis_0.3.2                farver_2.1.0                 
#>  [5] bit64_4.0.5                   interactiveDisplayBase_1.32.0
#>  [7] fansi_0.5.0                   xml2_1.3.2                   
#>  [9] cachem_1.0.6                  knitr_1.36                   
#> [11] jsonlite_1.7.2                Rsamtools_2.10.0             
#> [13] dbplyr_2.1.1                  png_0.1-7                    
#> [15] shiny_1.7.1                   BiocManager_1.30.16          
#> [17] compiler_4.1.1                httr_1.4.2                   
#> [19] assertthat_0.2.1              fastmap_1.1.0                
#> [21] lazyeval_0.2.2                later_1.3.0                  
#> [23] htmltools_0.5.2               prettyunits_1.1.1            
#> [25] tools_4.1.1                   gtable_0.3.0                 
#> [27] glue_1.4.2                    GenomeInfoDbData_1.2.7       
#> [29] dplyr_1.0.7                   rappdirs_0.3.3               
#> [31] Rcpp_1.0.7                    jquerylib_0.1.4              
#> [33] vctrs_0.3.8                   ExperimentHub_2.2.0          
#> [35] xfun_0.27                     stringr_1.4.0                
#> [37] plyranges_1.14.0              mime_0.12                    
#> [39] lifecycle_1.0.1               restfulr_0.0.13              
#> [41] XML_3.99-0.8                  AnnotationHub_3.2.0          
#> [43] zlibbioc_1.40.0               scales_1.1.1                 
#> [45] hms_1.1.1                     promises_1.2.0.1             
#> [47] MatrixGenerics_1.6.0          ProtGenerics_1.26.0          
#> [49] parallel_4.1.1                SummarizedExperiment_1.24.0  
#> [51] yaml_2.2.1                    curl_4.3.2                   
#> [53] memoise_2.0.0                 sass_0.4.0                   
#> [55] biomaRt_2.50.0                stringi_1.7.5                
#> [57] RSQLite_2.2.8                 BiocVersion_3.14.0           
#> [59] highr_0.9                     BiocIO_1.4.0                 
#> [61] filelock_1.0.2                BiocParallel_1.28.0          
#> [63] rlang_0.4.12                  pkgconfig_2.0.3              
#> [65] matrixStats_0.61.0            bitops_1.0-7                 
#> [67] evaluate_0.14                 lattice_0.20-45              
#> [69] purrr_0.3.4                   GenomicAlignments_1.30.0     
#> [71] bit_4.0.4                     tidyselect_1.1.1             
#> [73] magrittr_2.0.1                bookdown_0.24                
#> [75] R6_2.5.1                      magick_2.7.3                 
#> [77] generics_0.1.1                DelayedArray_0.20.0          
#> [79] DBI_1.1.1                     pillar_1.6.4                 
#> [81] withr_2.4.2                   KEGGREST_1.34.0              
#> [83] RCurl_1.98-1.5                tibble_3.1.5                 
#> [85] crayon_1.4.1                  utf8_1.2.2                   
#> [87] BiocFileCache_2.2.0           rmarkdown_2.11               
#> [89] progress_1.2.2                grid_4.1.1                   
#> [91] blob_1.2.2                    digest_0.6.28                
#> [93] xtable_1.8-4                  tidyr_1.1.4                  
#> [95] httpuv_1.6.3                  munsell_0.5.0                
#> [97] bslib_0.3.1