---
title: "_igblastr_ overview"
author:
- name: Hervé Pagès
  affiliation: Fred Hutch Cancer Center, Seattle, WA
- name: Kellie MacPhee
  affiliation: Fred Hutch Cancer Center, Seattle, WA
- name: Ollivier Hyrien
  affiliation: Fred Hutch Cancer Center, Seattle, WA
date: "Compiled `r BiocStyle::doc_date()`; Modified 7 October 2025"
package: igblastr
vignette: |
  %\VignetteIndexEntry{igblastr overview}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
output:
  BiocStyle::html_document
---


```{r setup, include=FALSE}
library(BiocStyle)
```


# Introduction

The Immunoglobulin Basic Local Alignment Search Tool (IgBLAST) is a
specialized bioinformatics tool developed by the National Center for
Biotechnology Information (NCBI) for the analysis of B-cell receptor (BCR)
and T-cell receptor (TCR) sequences (Ye et al., 2013). IgBLAST performs
sequence alignment and annotation, with key outputs including germline
V, D, and J gene assignments; detection of somatic hypermutations
introduced during affinity maturation; identification and annotation
of complementarity-determining regions (CDR1–CDR3) and framework
regions (FR1–FR4); and both nucleotide and protein-level alignments.

`r Biocpkg("igblastr")` is an R/Bioconductor package that provides functions
to conveniently install and use a local IgBLAST installation from within R.
The package is designed to make it as easy as possible to use IgBLAST in R
by streamlining the installation of both IgBLAST and its associated germline
databases. In particular, these installations can be performed with a single
function call, do not require root access, and persist across R sessions.

The main function in the package is `igblastn()`, a wrapper to the `igblastn`
_standalone executable_ included in IgBLAST. In addition to `igblastn()`,
the package provides:

- A function (`install_igblast()`) for conveniently downloading and
  installing a pre-compiled IgBLAST from NCBI.

- Functions to download and install germline databases from the
  IMGT/V-QUEST download site, and to configure them for use with `igblastn()`.

- A set of built-in germline databases from OGRDB, the AIRR Community’s
  Open Germline Reference Database.

- A set of built-in constant region (C-region) databases (a.k.a. constant
  region databases) from IMGT/V-QUEST.

- Utility functions to parse the results returned by `igblastn()`.

- A simple tool to vizualize the annotated sequences in a browser.

- Utility functions to download data from OAS, the Observed Antibody
  Space database, and prepare it for use with IgBLAST.

- Etc.

IgBLAST is described at <https://pubmed.ncbi.nlm.nih.gov/23671333/>

Online IgBLAST: <https://www.ncbi.nlm.nih.gov/igblast/>

Please use <https://github.com/HyrienLab/igblastr/issues> to report bugs,
provide feedback, request features (etc) about `r Biocpkg("igblastr")`.


# Install and load the package


Like any Bioconductor package, `r Biocpkg("igblastr")` should be installed
with `BiocManager::install()`:
```{r, eval=FALSE}
if (!require("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("igblastr")
```

`BiocManager::install()` will take care of installing the package dependencies
that are missing.

Load the package:
```{r, message=FALSE}
library(igblastr)
```


# Install IgBLAST


If IgBLAST is already installed on your system, you can tell
`r Biocpkg("igblastr")` to use it by setting the environment variable
`IGBLAST_ROOT` to the path of your IgBLAST installation.
See `?IGBLAST_ROOT` for more information.

Otherwise, simply call `install_igblast()` to install the latest
version of IgBLAST. As of March 2025, NCBI provides pre-compiled
versions of IgBLAST for Linux, Windows, Intel Mac and Mac Silicon.
`install_igblast()` will automatically download the appropriate pre-compiled
version of IgBLAST for your platform from the NCBI FTP site, and install it
in a location that will be remembered across R sessions.
```{r}
if (!has_igblast())
    install_igblast()
```
See `?install_igblast` for more information.

Note that we use `has_igblast()` to avoid reinstalling IgBLAST if
`r Biocpkg("igblastr")` already has access to a working IgBLAST installation.

Display basic information about the IgBLAST installation used
by `r Biocpkg("igblastr")`:
```{r}
igblast_info()
```


# Install and select a germline database


The `r Biocpkg("igblastr")` package includes a FASTA file containing
8,437 paired heavy and light chain human antibody sequences (16,874
individual sequences) retrieved from OAS. These sequences will serve as
our _query sequences_, that is, the immunoglobulin (Ig) sequences that
we will analyze in this vignette.

Before we can do so, the `igblastn` _standalone executable_ in IgBLAST (and,
by extension, our `igblastn()` function) needs access to the germline V, D,
and J gene sequences for humans. Before `igblastn` can use them, these
germline sequences must first be stored in three separate BLAST databases.
We will refer to these as the V-region, D-region, and J-region databases,
respectively. When considered together, we will refer to them as the
_germline database_.


## Built-in germline databases

The `r Biocpkg("igblastr")` package includes a set of built-in germline
databases for human and mouse that were obtained from the OGRDB database
(AIRR community). These can be listed with:
```{r}
list_germline_dbs(builtin.only=TRUE)
```
The last part of the database name indicates the date of the download in
YYYYMM format.

If our query sequences are from humans or from one of the mouse strains listed
above, we can already select the appropriate database with `use_germline_db()`
and skip the subsection below.

## Install additional germline databases

AIRR/OGRDB and IMGT/V-QUEST are two providers of germline databases that
can be used with IgBLAST. If, for any reason, none of the built-in
AIRR/OGRDB germline databases is suitable (e.g., if your query sequences
are not from human or mouse), you can use `install_IMGT_germline_db()`
to install additional germline databases. Below, we show how to install the
latest human germline database from IMGT/V-QUEST.

First, we list the most recent IMGT/V-QUEST releases:
```{r}
head(list_IMGT_releases())
```

The organisms included in release 202518-3 are:
```{r}
list_IMGT_organisms("202518-3")
```

Next, we install the human germline database from the latest IMGT/V-QUEST
release:
```{r, message=FALSE}
install_IMGT_germline_db("202518-3", organism="Homo sapiens", force=TRUE)
```
See `?install_IMGT_germline_db` for more information.

Finally, we select the newly installed germline database for use
with `igblastn()`:
```{r}
use_germline_db("IMGT-202518-3.Homo_sapiens.IGH+IGK+IGL")
```
See `?use_germline_db` for more information.

## Display the full list of cached germline dbs

To see the full list of cached germline dbs:
```{r}
list_germline_dbs()
```
Note that the asterisk (`*`) displayed at the far right of the output
from `list_germline_dbs()` indicates the currently selected germline
database (you may need to scroll horizontally to see the asterisk).

See `?list_germline_dbs` for more information.


# Select a constant region database (optional)


The `igblastn` _standalone executable_ in IgBLAST can also use constant
region (C region) sequences to improve the analysis. As with the
germline V, D, and J gene sequences, the C-region sequences should
typically come from the same organism as the query sequences, and they
must also be formatted as a BLAST database. We will refer to this database
as the C-region database.

The `r Biocpkg("igblastr")` package includes a set of built-in C-region
databases for human, mouse, and rabbit, obtained from IMGT/V-QUEST. These
can be listed using:
```{r}
list_c_region_dbs()
```
The last part of the database name indicates the date of the download in
YYYYMM format.

If your query sequences are from human, mouse, or rabbit, you can
select the appropriate database using `use_c_region_db()`:
```{r}
use_c_region_db("_IMGT.human.IGH+IGK+IGL.202412")
```

Calling `list_c_region_dbs()` again should display an asterisk (`*`)
at the far right of the output, indicating the currently selected
C-region database.

See `?list_c_region_dbs` for more information.


# Use igblastn()


Now that we have selected a germline and C-region database for use
with `igblastn()`, we are almost ready to call `igblastn()` to perform
the alignment.

As mentioned earlier, the `r Biocpkg("igblastr")` package includes a
FASTA file containing 8,437 paired heavy and light chain human antibody
sequences retrieved from OAS. These serve as our _query sequences_,
that is, the set of BCR sequences that we will analyse using `igblastn()`.

To get the path to the query sequences, use:
```{r}
query <- system.file(package="igblastr", "extdata",
                     "BCR", "1279067_1_Paired_sequences.fasta.gz")
```

The `r Biocpkg("igblastr")` package also includes a JSON file containing
metadata associated with the query BCR sequences:
```{r}
json <- system.file(package="igblastr", "extdata",
                    "BCR", "1279067_1_Paired_All.json")
query_metadata <- jsonlite::fromJSON(json)
query_metadata
```
The 8,347 paired sequences come from memory B cells isolated from
peripheral blood mononuclear cell (PBMC) samples of a single human
donor (age 38) with no known disease or vaccination history.
The source for these sequences is the Jaffe et al. (2022) study; the
DOI link to the publication is provided above.

Before calling `igblastn()`, we first check the selected databases by
calling `use_germline_db()` and `use_c_region_db()` with no arguments:
```{r}
use_germline_db()
use_c_region_db()
```

Now, let's call `igblastn()`. Since we are only interested in the best
alignment for each query sequence, we set `num_alignments_V` to 1. This
may take up to around 3 min on a standard laptop:
```{r}
AIRR_df <- igblastn(query, num_alignments_V=1)
```

By default, the result is a tibble in AIRR format:
```{r}
AIRR_df
```

You can call `igbrowser()` on `AIRR_df` to visualize the annotated
sequences in a browser. For each sequence, the V, D, J, and C segments
will be shown as well as the FWR1-4 and CDR1-3 regions.

See `?igblastn` and `?igbrowser` for more information.


# Downstream analysis examples


```{r, message=FALSE}
library(ggplot2)
library(dplyr)
library(scales)
library(ggseqlogo)
```

One common analysis of AIRR format data is to examine the distribution of
percent mutation across BCR sequences. Here we analyze the percent mutation
in the V segments of each chain type (heavy, kappa, and lambda).
Note that V percent mutation is 100 - v\_identity.
```{r}
AIRR_df |>
    ggplot(aes(locus, 100 - v_identity)) +
    theme_bw(base_size=14) +
    geom_point(position = position_jitter(width = 0.3), alpha = 0.1) +
    geom_boxplot(color = "blue", fill = NA, outliers = FALSE, alpha = 0.3) +
    ggtitle("Distribution of V percent mutation by locus") +
    xlab(NULL)
```

Another common analysis is to investigate the distribution of germline
genes (e.g., V genes). In this case, we typically stratify the analysis
by locus or chain type.
```{r}
plot_gene_dist <- function(AIRR_df, loc) {
    df_v_gene <- AIRR_df |>
        filter(locus == loc) |>
        mutate(v_gene = sub("\\*[0-9]*", "", v_call)) |> # drop allele info
        group_by(v_gene) |>
        summarize(n = n(), .groups = "drop") |>
        mutate(frac = n / sum(n))
    df_v_gene |>
        ggplot(aes(frac, v_gene)) +
        theme_bw(base_size=13) +
        geom_col() +
        scale_x_continuous('Percent of sequences', labels = scales::percent) +
        ylab("Germline gene") +
        ggtitle(paste0(loc, "V gene prevalence"))
}
```
```{r, fig.height=8}
plot_gene_dist(AIRR_df, "IGH")
```
```{r, fig.height=5.9}
plot_gene_dist(AIRR_df, "IGK")
```
```{r, fig.height=5.3}
plot_gene_dist(AIRR_df, "IGL")
```

A third category of analysis focuses on CDR3 sequences, including their
lengths and motifs, which are often visualized using sequence logo plots.
```{r, fig.height=4.5}
AIRR_df$cdr3_aa_length <- nchar(AIRR_df$cdr3_aa)

AIRR_df |>
    group_by(locus, cdr3_aa_length) |>
    summarize(n = n(), .groups = "drop") |>
    ggplot(aes(cdr3_aa_length, n)) +
    theme_bw(base_size=14) +
    facet_wrap(~locus) +
    geom_col() +
    ggtitle("Histograms of CDR3 length by locus")
```

```{r}
AIRR_df |>
    filter(locus == "IGK", cdr3_aa_length == 9) |>
    pull(cdr3_aa) |>
    ggseqlogo(method = "probability") +
    theme_bw(base_size=14) +
    ggtitle("Logo plot of kappa chain CDR3 sequences that are 9 AA long")
```


# Advanced usage


## Passing additional arguments to igblastn()

The `igblastn` _standalone executable_ in IgBLAST supports many command
line arguments. You can quickly list them with `igblastn_help()` (or with
`igblastn_help(TRUE)` to get an expanded list with more details --
see `?igblastn_help`):
```{r}
igblastn_help()
```

All these command line arguments can be passed to the `igblastn()`
function using the usual `argument_name=argument_value` syntax. For
example, to set the number of threads to 8 (`-num_threads` command line
argument), do `igblastn(query, num_threads=8)`.

## Restrict the search to a subset of user-supplied gene alleles

Some arguments of particular interest are the `germline_db_[VDJ]_seqidlist`
arguments. They allow restricting the search of the germline database to
a list of gene alleles supplied by the user. This list can be provided
as a character vector of gene allele identifiers (e.g. `IGHV3-23*01`,
`IGHV3-23*04`, etc..), or as the path to a file containing the identifiers
(one identifier per line). For example:
```{r, eval=FALSE}
V_alleles <- c("IGHV3-23*01", "IGHV3-23*04")
igblastn(query, germline_db_V_seqidlist=V_alleles)
```

If the gene alleles are stored in a file, say in
`path/to/my_V_gene_alleles.txt`, then use:
```{r, eval=FALSE}
igblastn(query, germline_db_V_seqidlist=file("path/to/my_V_gene_alleles.txt"))
```
Note that in this case, the path to the file containing the gene alleles
identifiers must be wrapped in a call to `file()`.

See `?igblastn` for more information.

## Add novel gene alleles to the search

The `r Biocpkg("igblastr")` package also provides facilities to add
novel gene alleles to the search. This is done in two steps:

1. Use one of the `augment_germline_db_[VDG]()` functions to create a new
   augmented V, D, or J germline database.

2. Tell `igblastn()` to use the augmented V, D, or J germline database
   by pointing the `germline_db_V`, `germline_db_D`, or `germline_db_J`
   argument to it.

Note that this functionality can also be used to combine germline
databases from two different organisms. See `?augment_germline_db` for
more information and some examples.

Finally note that augmenting and restricting can be combined e.g. by
pointing `germline_db_V` to an augmented V germline database _and_ by
supplying the `germline_db_V_seqidlist` argument.


# A TCR analysis example


Even though NCBI IgBLAST primary use case is BCR analysis, it can also be used
for TCR sequence analysis, and so does the `r Biocpkg("igblastr")` package.

File `SRR11341217.fasta.gz` included in the package contains 10,875
human beta chain TCR transcripts running from 5' of reverse transcription
reaction to beginning of constant region.
See <https://www.ncbi.nlm.nih.gov/sra/SRX7944361> for more information
about this dataset.
```{r}
filename <- "SRR11341217.fasta.gz"
query <- system.file(package="igblastr", "extdata", "TCR", filename)
```

To analyze this dataset with `igblastn()`, we need to install and select
a human TCR germline database:
```{r, message=FALSE}
db_name <- install_IMGT_germline_db("202518-3", organism="Homo sapiens",
                                    tcr.db=TRUE, force=TRUE)
use_germline_db(db_name)
list_germline_dbs()
```
See `?install_IMGT_germline_db` and `?use_germline_db` for more information.

When calling `igblastn()`, we don't need to specify the `ig_seqtype`
argument as it will be automatically set to `"TCR"` (see documentation
of the `ig_seqtype` argument in `?igblastn` for more information):
```{r}
AIRR_df <- igblastn(query)
AIRR_df
```


# Future developments


At the moment, the `r Biocpkg("igblastr")` package does not provide
access to the full functionality of the IgBLAST software. Most notably,
the `igblastp` _standalone executable_ included in IgBLAST has no
counterpart in `r Biocpkg("igblastr")`.

Some future developments include:

- Implement `igblastp()`, a wrapper to the `igblastp` _standalone executable_
  included in IgBLAST for protein-level alignments.

- Add facilities to retrieve germline databases from OGRDB, the AIRR
  Community’s Open Germline Reference Database.


# Session information

Here is the output of `sessionInfo()` on the system on which this document
was compiled:
```{r}
sessionInfo()
```