Introduction to SigsPack

Sigspack provides tools for easy computation of signature exposures from mutational catalogues.

The package provides several features, allowing to read the primary mutation data, normalise the mutational catalogues if necessary and compute signature exposures with their bootstrapped variation estimates.

Loading the package

library(SigsPack)

Loading a VCF

In most cases you will want to analyse real data, e.g. fit a sample’s mutational profile to known mutational signatures. You can easily load your data and use it with the package.

if (require(BSgenome.Hsapiens.UCSC.hg19)) {
  sample <- vcf2mut_cat(
    system.file("extdata", "example.vcf.gz", package="SigsPack"),
    BSgenome.Hsapiens.UCSC.hg19
  )
}
#> Loading required package: BSgenome.Hsapiens.UCSC.hg19
#> Loading required package: BSgenome
#> Loading required package: BiocGenerics
#> 
#> Attaching package: 'BiocGenerics'
#> The following object is masked from 'package:SigsPack':
#> 
#>     normalize
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     Filter, Find, Map, Position, Reduce, anyDuplicated, append,
#>     as.data.frame, basename, cbind, colnames, dirname, do.call,
#>     duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
#>     lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
#>     pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
#>     tapply, union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#> Loading required package: stats4
#> 
#> Attaching package: 'S4Vectors'
#> The following objects are masked from 'package:base':
#> 
#>     I, expand.grid, unname
#> Loading required package: IRanges
#> Loading required package: GenomeInfoDb
#> Loading required package: GenomicRanges
#> Loading required package: Biostrings
#> Loading required package: XVector
#> 
#> Attaching package: 'Biostrings'
#> The following object is masked from 'package:base':
#> 
#>     strsplit
#> Loading required package: rtracklayer

It will be transformed into a ‘mutational profile’ sorting all single nucleotide variants according to their trinucleotide context and mutation. The result is a 96 by 1 matrix that can be used as input to the analysis functions of the package as well as in most other packages of this field.

Simulating data

Instead of using real-life data it is also possible to generate synthetic data that can be used to analyse signatures stability. The following code generates 10 mutational catalogues with exposure to the COSMIC signatures 2, 6, 15 and 27. The catalogues consist of mutation counts sampled from a distribution that is a linear combination of the aforementioned signatures. The weight of each of these signatures can optionally be specified, too. For convenience, the COSMIC reference signature matrix is included in the package and used as default in the functions. However, it is also possible to use all functions with a custom signature matrix.

data("cosmicSigs")

cats <- create_mut_catalogues(10, 500, P=cosmicSigs, sig_set = c(2,6,15,27))
knitr::kable(head(cats[['catalogues']]))
sample_1 sample_2 sample_3 sample_4 sample_5 sample_6 sample_7 sample_8 sample_9 sample_10
A[C>A]A 0 2 1 2 0 1 2 0 2 1
A[C>A]C 2 1 2 2 2 3 2 0 1 4
A[C>A]G 0 0 0 1 0 0 0 0 0 0
A[C>A]T 2 2 2 1 0 1 2 0 0 1
C[C>A]A 1 5 3 2 4 6 5 5 5 2
C[C>A]C 6 8 3 8 9 10 2 3 7 6

Estimating signature exposures and bootstrapping samples

The function bootstrap_mut_catalogues bootstraps a sample returning a specified amount of re-samples which can be used to gain a better variability estimation of the sample’s signature exposure.

The signature exposure is calculated using quadratic programming. This can be done on one or several samples at once. The COSMIC signatures have been included in the package, and are used by default. However, it is possible to use your own signature matrix, or use a sub-set of COSMIC signatures, instead of the whole matrix.


reps <- bootstrap_mut_catalogues(n = 1000, original = cats[["catalogues"]][,1])

# using only signatures 4, 17, 23 and 30 for signature estimation
sigs <- signature_exposure(reps, sig_set = c(4,17,23,30))

print(sigs$exposures[,1])
#>  Signature 1  Signature 2  Signature 3  Signature 4  Signature 5  Signature 6 
#>   0.00000000   0.00000000   0.00000000   0.31313227   0.00000000   0.00000000 
#>  Signature 7  Signature 8  Signature 9 Signature 10 Signature 11 Signature 12 
#>   0.00000000   0.00000000   0.00000000   0.00000000   0.00000000   0.00000000 
#> Signature 13 Signature 14 Signature 15 Signature 16 Signature 17 Signature 18 
#>   0.00000000   0.00000000   0.00000000   0.00000000   0.06071580   0.00000000 
#> Signature 19 Signature 20 Signature 21 Signature 22 Signature 23 Signature 24 
#>   0.00000000   0.00000000   0.00000000   0.00000000   0.02321469   0.00000000 
#> Signature 25 Signature 26 Signature 27 Signature 28 Signature 29 Signature 30 
#>   0.00000000   0.00000000   0.00000000   0.00000000   0.00000000   0.60293724

With one command you can bootstrap a mutational catalogue and obtain a table and a plot illustrating the results of the signature estimation for this sample and the bootstrapped re-samples. The plot shows the distribution of estimated signature exposure for all the re-samples, highlighting the one of the original mutational catalogue, thus providing insights on the reliability of the estimates.


report <- summarize_exposures(reps[,1])

knitr::kable(
  head(report)
  )
original exposure min 1. quartile median 3. quartile max
Signature 1 0.0000000 0.000000 0.0000000 0.0000000 0.0100734 0.1791492
Signature 2 0.1795390 0.095461 0.1586940 0.1752018 0.1918205 0.2474150
Signature 3 0.0000000 0.000000 0.0000000 0.0000000 0.0000000 0.0000000
Signature 4 0.0000000 0.000000 0.0000000 0.0000000 0.0000000 0.0181542
Signature 5 0.0000000 0.000000 0.0000000 0.0000000 0.0000000 0.0000000
Signature 6 0.2644519 0.000000 0.1888099 0.2325030 0.2761179 0.4544724

Tri-nucleotide contexts and normalization

Accurate exposures estimation requires matching tri-nucleotide frequencies between observations and the signature matrix. The COSMIC signature matrix provided in the package is relative to the whole genome (GRCh37) tri-nucleotide frequencies. So if you want to fit those signatures to exome data, the data need to be normalized to match the signatures prior to exposures estimation.

Let’s say, that the vcf we derived the mutational catalogue from contained exome data. In the case of this example, we want to scale our exome data to the genome ‘space’ to match the COSMIC reference signatures. Hence, we need both the tri-nucleotide distribution of the human genome (GRCh37) and the exome that our data were collected on.

Extract the tri-nucleotide context frequencies from a genome (BSgenome object) or exome and use them to normalize the data.

if (require(BSgenome.Hsapiens.UCSC.hg19)){
  genome_contexts <- get_context_freq(BSgenome.Hsapiens.UCSC.hg19)
  exome_contexts <- get_context_freq(
    BSgenome.Hsapiens.UCSC.hg19,
    system.file("extdata", "example.bed.gz", package="SigsPack")
  )
  normalized_mut_cat <- SigsPack::normalize(sample, exome_contexts, hg19context_freq)
}

The normalization returns frequencies, not count numbers. If you want to use them for exposure estimation or for bootstrapping, the catalogue size must be scaled, or input together with the normalized catalogue.

if (require(BSgenome.Hsapiens.UCSC.hg19)) {
  sigs_norm <- signature_exposure(sum(sample) * normalized_mut_cat)
  report_norm <- summarize_exposures(normalized_mut_cat, m=sum(sample))
  reps_norm <- bootstrap_mut_catalogues(
      n=1000,
      original=normalized_mut_cat, 
      m=sum(sample)
  )
}

sessionInfo

sessionInfo()
#> R version 4.1.1 (2021-08-10)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] BSgenome.Hsapiens.UCSC.hg19_1.4.3 BSgenome_1.62.0                  
#>  [3] rtracklayer_1.54.0                Biostrings_2.62.0                
#>  [5] XVector_0.34.0                    GenomicRanges_1.46.0             
#>  [7] GenomeInfoDb_1.30.0               IRanges_2.28.0                   
#>  [9] S4Vectors_0.32.0                  BiocGenerics_0.40.0              
#> [11] SigsPack_1.8.0                   
#> 
#> loaded via a namespace (and not attached):
#>  [1] MatrixGenerics_1.6.0        Biobase_2.54.0             
#>  [3] httr_1.4.2                  sass_0.4.0                 
#>  [5] bit64_4.0.5                 jsonlite_1.7.2             
#>  [7] bslib_0.3.1                 assertthat_0.2.1           
#>  [9] highr_0.9                   BiocFileCache_2.2.0        
#> [11] blob_1.2.2                  GenomeInfoDbData_1.2.7     
#> [13] Rsamtools_2.10.0            yaml_2.2.1                 
#> [15] progress_1.2.2              pillar_1.6.4               
#> [17] RSQLite_2.2.8               lattice_0.20-45            
#> [19] quadprog_1.5-8              glue_1.4.2                 
#> [21] digest_0.6.28               htmltools_0.5.2            
#> [23] Matrix_1.3-4                XML_3.99-0.8               
#> [25] pkgconfig_2.0.3             biomaRt_2.50.0             
#> [27] zlibbioc_1.40.0             purrr_0.3.4                
#> [29] BiocParallel_1.28.0         tibble_3.1.5               
#> [31] KEGGREST_1.34.0             generics_0.1.1             
#> [33] ellipsis_0.3.2              cachem_1.0.6               
#> [35] SummarizedExperiment_1.24.0 GenomicFeatures_1.46.0     
#> [37] magrittr_2.0.1              crayon_1.4.1               
#> [39] memoise_2.0.0               evaluate_0.14              
#> [41] fansi_0.5.0                 xml2_1.3.2                 
#> [43] tools_4.1.1                 prettyunits_1.1.1          
#> [45] hms_1.1.1                   BiocIO_1.4.0               
#> [47] lifecycle_1.0.1             matrixStats_0.61.0         
#> [49] stringr_1.4.0               DelayedArray_0.20.0        
#> [51] AnnotationDbi_1.56.0        compiler_4.1.1             
#> [53] jquerylib_0.1.4             rlang_0.4.12               
#> [55] grid_4.1.1                  RCurl_1.98-1.5             
#> [57] rappdirs_0.3.3              VariantAnnotation_1.40.0   
#> [59] rjson_0.2.20                bitops_1.0-7               
#> [61] rmarkdown_2.11              restfulr_0.0.13            
#> [63] curl_4.3.2                  DBI_1.1.1                  
#> [65] R6_2.5.1                    GenomicAlignments_1.30.0   
#> [67] knitr_1.36                  dplyr_1.0.7                
#> [69] utf8_1.2.2                  fastmap_1.1.0              
#> [71] bit_4.0.4                   filelock_1.0.2             
#> [73] stringi_1.7.5               parallel_4.1.1             
#> [75] Rcpp_1.0.7                  vctrs_0.3.8                
#> [77] png_0.1-7                   tidyselect_1.1.1           
#> [79] dbplyr_2.1.1                xfun_0.27