%\VignetteIndexEntry{How to forge a BSgenome data package}
%\VignetteKeywords{Genome, BSgenome, DNA, Sequence, UCSC, BSgenome data package}
%\VignettePackage{BSgenome}

%
% NOTE -- ONLY EDIT THE .Rnw FILE!!!  The .tex file is
% likely to be overwritten.
%
\SweaveOpts{keep.source=TRUE}

\documentclass[10pt]{article}

%\usepackage{amsmath}
%\usepackage[authoryear,round]{natbib}

%
% NOTE -- There is an obscure issue with the use of \url from the hyperref
% package that will trigger a MiKTeX/pdflatex error:
%   ! pdfTeX error (ext4): \pdfendlink ended up in different nesting level than \pd
%   fstartlink.
%   \AtBegShi@Output ...ipout \box \AtBeginShipoutBox
%                                                     \fi \fi
%   l.96 \end{document}
%
%   !  ==> Fatal error occurred, no output PDF file produced!
%   Transcript written on BSgenomeForge1.log.
% The error is hard to reproduce. I've observed it on the r34270 version of this
% vignette and with the following version of the MiKTeX/pdflatex command:
%   MiKTeX-pdfTeX 2.7.3147 (1.40.9) (MiKTeX 2.7)
\usepackage{hyperref}

\usepackage{underscore}

\textwidth=6.5in
\textheight=8.5in
\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}

\newcommand{\R}{\textsf{R}}
\newcommand{\code}[1]{\texttt{#1}}
\newcommand{\term}[1]{\emph{#1}}
\newcommand{\Rpackage}[1]{\textsf{#1}}
\newcommand{\Rfunction}[1]{\texttt{#1}}
\newcommand{\Robject}[1]{\texttt{#1}}
\newcommand{\Rclass}[1]{\textit{#1}}
\newcommand{\Rmethod}[1]{\textit{#1}}
\newcommand{\Rfunarg}[1]{\textit{#1}}

\bibliographystyle{plainnat}

\begin{document}

\title{How to forge a BSgenome data package}

\author{Herv\'e Pag\`es \\
  Gentleman Lab \\
  Fred Hutchinson Cancer Research Center \\
  Seattle, WA}
\date{\today}
\maketitle

\tableofcontents


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Introduction}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


IMPORTANT NOTE: Starting with Bioconductor 2.14, the
\Rfunction{forgeBSgenomeDataPkg} function doesn't handle masks anymore
i.e. now it produces a \term{BSgenome data package} that contains only
the bare sequences. A new function \Rfunction{forgeMaskedBSgenomeDataPkg}
has been added to forge a \term{BSgenome data package} with masked
sequences. This vignette has been updated to reflect these changes.

This vignette describes the process of forging a \term{BSgenome data package}.
It is intended for Bioconductor users who want to make a new \term{BSgenome
data package}, not for regular users of these packages.

Start {\R} (make sure you are using the latest release version), load the
\Rpackage{BSgenome} package, and use the \Rfunction{available.genomes}
function to get the list of \term{BSgenome data packages} available
in the current release version of Bioconductor.
So you confirm that none of those genomes suits your needs? And you want to
make your own package? If you answer yes to these 2 questions, then you've
come to the right place.

Requirements:
\begin{itemize}
\item Some basic knowledge of the Unix/Linux command line is required. The
      commands that you will most likely need are: \code{cd}, \code{mkdir},
      \code{mv}, \code{rmdir}, \code{tar}, \code{gunzip}, \code{unzip},
      \code{ftp} and \code{wget}. Also you will need to create and edit some
      text files.
\item You need access to a good Unix/Linux build machine with a decent amount
      of RAM (>= 4GB), especially if your genome is big. For smaller genomes,
      2GB or even 1GB of RAM might be enough.
\item You need the latest release versions of {\R} plus the
      \Rpackage{Biostrings} and \Rpackage{BSgenome} packages installed on the
      build machine. To check your installation, start {\R} and try to load
      the \Rpackage{BSgenome} package.
\item Finally, you need to obtain the \term{source data files} of the genome
      that you want to build a package for.
      There are 2 groups of \term{source data files}: (1) the files containing
      the sequence data (those files are required), and (2) the files
      containing the mask data (those files are optional).
      For most organisms, these files have been made publicly available on the
      internet by genome providers like UCSC, NCBI, FlyBase, TAIR, etc.
      The next sections of this vignette explain how to obtain and prepare
      these files.
\end{itemize}

Refer to the \textit{R Installation and Administration} manual
\footnote{\url{http://cran.r-project.org/doc/manuals/R-admin.html}}
if you need to install {\R} or upgrade your {\R} version,
and to the \textit{Bioconductor - Install} page
\footnote{\url{http://bioconductor.org/install/}}
on the Bioconductor website if you need to install or update the
\Rpackage{Biostrings} or \Rpackage{BSgenome} packages.

Questions, comments or bug reports about this vignette or about the
BSgenomeForge functions are welcome. Please address them to the author
(\code{hpages@fredhutch.org}) or post them on the Bioconductor support
site \footnote{\url{https://support.bioconductor.org/}}.

In the next section (``How to forge a BSgenome data package with bare
sequences''), we describe how to forge a \term{BSgenome data package}
with sequences that have no masks on them. If this is what you're after,
then you'll be done at the end of the section. Only if you need to forge a
package with masked sequences, you'll have to keep reading but you'll have
to forge a \term{BSgenome data package} with bare sequences first.

In this vignette, we call \term{target packages} the \term{BSgenome
data packages} that we're going to forge (one target package with
bare sequences and possibly a 2nd target package with masked sequences).


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{How to forge a BSgenome data package with bare sequences}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\subsection{Obtain and prepare the sequence data}

\subsubsection{What do I need?}

The sequence data must be in a single twoBit file (e.g. \textit{musFur1.2bit})
or in a collection of FASTA files (possibly gzip-compressed).

If the latter, then you need 1 FASTA file per sequence that you want to put
in the \term{target package}. In that case the name of each FASTA file must
be of the form \textit{<prefix>}\textit{<seqname>}\textit{<suffix>} where
\textit{<seqname>} is the name of the sequence in it and \textit{<prefix>}
and \textit{<suffix>} are a prefix and a suffix (possibly empty) that are
the same for all the FASTA files.

\subsubsection{Examples}

UCSC provides the genome sequences for Stickleback (gasAcu1) in various forms:
(a) as a twoBit file (\textit{gasAcu1.2bit} located in
\url{http://hgdownload.cse.ucsc.edu/goldenPath/gasAcu1/bigZips/},
(b) as a big compressed tarball containing one file per chromosome
(\textit{chromFa.tar.gz} located in the same folder), and (c) as a
collection of compressed FASTA files (\texttt{chrI.fa.gz},
\texttt{chrII.fa.gz}, \texttt{chrIII.fa.gz}, ..., \texttt{chrXXI.fa.gz},
\texttt{chrM.fa.gz} and \texttt{chrUn.fa.gz} located in
\url{http://hgdownload.cse.ucsc.edu/goldenPath/gasAcu1/chromosomes/}).
To forge a \term{BSgenome data package} for gasAcu1, the easiest would be
to use (a) (the twoBit file). However it would also be possible to use (c)
(the collection of compressed FASTA files). In that case the suffix would
need to be set to \texttt{.fa.gz}. Note that the prefix here is empty and
not \texttt{chr}, because \texttt{chr} is considered to be part of the
sequence names (a commonly used chromosome naming convention at UCSC).
Alternatively, the big compressed tarball (\texttt{chromFa.tar.gz}) could
be used. After downloading and extracting it, we should end up with the same
files as with (c).

You can use the \Rfunction{fasta.seqlengths} function from the
\Rpackage{Biostrings} package to get the lengths of the sequences in a
FASTA file:
<<>>=
library(Biostrings)
file <- system.file("extdata", "ce2chrM.fa.gz", package="BSgenome")
fasta.seqlengths(file)
@

For the \Rpackage{BSgenome.Mfuro.UCSC.musFur1} package, the twoBit file was
used and downloaded with:
\begin{verbatim}
    wget http://hgdownload.soe.ucsc.edu/goldenPath/musFur1/bigZips/musFur1.2bit
\end{verbatim}

For the \Rpackage{BSgenome.Rnorvegicus.UCSC.rn4} package, the big compressed
tarball was used. It was downloaded and extracted with:
\begin{verbatim}
    wget http://hgdownload.cse.ucsc.edu/goldenPath/rn4/bigZips/chromFa.tar.gz
    tar zxf chromFa.tar.gz
\end{verbatim}

\subsubsection{The \textit{<seqs\_srcdir>} folder}

So we assume that you've downloaded the \term{sequence data files} and that
they are now located in a folder on your machine\footnote{Note that checking
the md5sums after download is always a good idea.}.
From now on, we'll refer to this folder as the \textit{<seqs\_srcdir>} folder.

Depending on what data files you downloaded (see previous sub-subsection),
you might also need to extract and possibly rename the files.
Note that all the \term{sequence data files} should be located directly in the
\textit{<seqs\_srcdir>} folder, not in subfolders of this folder. For example,
depending on the genome, UCSC provides either a big \texttt{chromFa.tar.gz}
or \texttt{chromFa.zip} file that contains the sequence data for all the
chromosomes. But it could be that, after extraction of this big file, the
individual FASTA files for each chromosome end up being located one level
down the \textit{<seqs\_srcdir>} folder (granted that you were in this
folder when you extracted the file). If this is the case, then you will
need to move them one level up (use \code{mv -i */*.fa .} for this,
then remove all the empty subfolders with \code{rmdir *}).


\subsection{Prepare the BSgenome data package seed file}

\subsubsection{Overview}

The \term{BSgenome data package seed file} will contain all the information
needed by the \Rfunction{forgeBSgenomeDataPkg} function to forge the
\term{target package}.

The format of this file is DCF (Debian Control File), which is also the format
used for the \texttt{DESCRIPTION} file of any {\R} package. The valid fields of
a \term{seed file} are divided in 3 categories:
\begin{enumerate}
\item Standard \texttt{DESCRIPTION} fields. These fields are actually
      the mandatory fields found in any \texttt{DESCRIPTION} file.
      They will be copied to the \texttt{DESCRIPTION} file of the \term{target
      package}.

\item Non-standard \texttt{DESCRIPTION} fields. These fields are specific
      to \term{seed files} and they will also be copied to the
      \texttt{DESCRIPTION} file of the \term{target package}.
      In addition, the values of these fields will be stored in the
      \Rclass{BSgenome} object that will be contained in the \term{target
      package}. This means that the users of the \term{target package} will be
      able to retrieve these values via the accessor methods defined for
      \Rclass{BSgenome} objects. See the man page for the \Rclass{BSgenome}
      class (\code{{?}`BSgenome-class`}) for a description of these methods.

\item Additional fields that don't fall in the first 2 categories.
\end{enumerate}

The 3 following sub-subsections give an extensive description of all the
valid fields of a \term{seed file}.

Alternatively, the reader in a hurry can go directly to the last
sub-subsection of this subsection for an example of a \term{seed file}.

\subsubsection{Standard \texttt{DESCRIPTION} fields}

\begin{itemize}
\item \code{Package}: Name to give to the \term{target package}. The convention
      used for the packages built by the Bioconductor project is to use a name
      made of 4 parts separated by a dot.
      Part 1 is always \code{BSgenome}.
      Part 2 is the abbreviated name of the organism (when the name of
      the organism is made of 2 words, we put together the first letter of the
      first word in upper case followed by the entire second word in lower
      case e.g. \code{Rnorvegicus}).
      Part 3 is the name of the organisation who provided the genome
      (e.g. \code{UCSC}).
      Part 4 is the release string or number used by this organisation
      to identify this version of the genome (e.g. \code{rn4}).

\item \code{Title}: The title of the \term{target package}. E.g. \code{Full
      genome sequences for Rattus norvegicus (UCSC version rn4)}.

\item \code{Description}, \code{Version}, \code{Author}, \code{Maintainer},
      \code{License}: Like the 2 previous fields, these are mandatory fields
      found in any \texttt{DESCRIPTION} file.
      Please refer to the \textit{The DESCRIPTION file} subsection
      of the \textit{Writing R Extensions} manual
      \footnote{\url{http://cran.r-project.org/doc/manuals/R-exts.html\#The-DESCRIPTION-file}}
      for more information about these fields.
      If you plan to distribute the \term{target package} that you are about
      to forge, please pickup the license carefully and make sure that it is
      compatible with the license of the \term{sequence data files} if any.

\item \code{Suggests}: [OPTIONAL] If you're going to add examples to the
      \code{Examples} section of the \term{target package} (see
      \code{PkgExamples} field below for how to do this), use this field
      to list the packages used in these examples.
\end{itemize}

\subsubsection{Non-standard \texttt{DESCRIPTION} fields}

\begin{itemize}
\item \code{organism}: The scientific name of the organism in the format
      \code{Genus species} (e.g. \code{Rattus norvegicus}, \code{Homo sapiens})
      or \code{Genus species subspecies} (e.g. \code{Homo sapiens
      neanderthalensis}).

\item \code{common\_name}: The common name of the organism (e.g. \code{Rat}
      or \code{Human}). For organisms that have more than one commmon name
      (e.g. \code{Bovine} or \code{Cow} for Bos taurus), choose one.
      Note that for the packages built by the Bioconductor project from a
      UCSC genome, this field corresponds to the \code{SPECIES} column of
      the \textit{List of UCSC genome releases} table
      \footnote{\url{http://genome.ucsc.edu/FAQ/FAQreleases\#release1}}.

\item \code{provider}: The provider of the \term{sequence data files} e.g.
      \code{UCSC}, \code{NCBI}, \code{BDGP}, \code{FlyBase}, etc...
      Should preferably match part 3 of the package name (field \code{Package}).

\item \code{provider\_version}: The provider-side version of the genome.
      Should preferably match part 4 of the package name (field \code{Package}).
      For the packages built by the Bioconductor project from a UCSC genome,
      this field corresponds to the \code{UCSC VERSION} field of the
      \textit{List of UCSC genome releases} table.

\item \code{release\_date}: When this assembly of the genome was released.
      For the packages built by the Bioconductor project from a UCSC genome,
      this field corresponds to the \code{RELEASE DATE} field of the
      \textit{List of UCSC genome releases} table.

\item \code{release\_name}: The release name or build number of this assembly
      of the genome.
      For the packages built by the Bioconductor project from a UCSC genome,
      this field corresponds to the \code{RELEASE NAME} field of the
      \textit{List of UCSC genome releases} table.

\item \code{source\_url}: The permanent URL where the \term{sequence data
      files} used to forge the \term{target package} can be found.

\item \code{organism\_biocview}: The official biocViews term for this organism.
      This is generally the same as the \code{organism} field except that spaces
      should be replaced by underscores. The value of this field matters only if
      the \term{target package} is going to be added to a Bioconductor
      repository since it will determine under which subview of the
      \textit{Bioconductor Task View} for \textit{Organism}
      \footnote{\url{http://bioconductor.org/packages/release/Organism.html}}
      the package will appear.
      Note that this is the only field in this category that won't be stored
      in the \Rclass{BSgenome} object that will be contained in the
      \term{target package}.
\end{itemize}

\subsubsection{Other fields}

\begin{itemize}
\item \code{BSgenomeObjname}: Should match part 2 of the package name (see
      \code{Package} field above).

\item \code{seqnames}: [OPTIONAL] Needed only if you are using a collection
      of \term{sequence data files}. In that case \code{seqnames} must be an
      {\R} expression returning the names of the single sequences to forge
      (in a character vector).
      E.g. \code{paste("chr", c(1:20, "X", "M", "Un", paste(c(1:20, "X", "Un"),
      "\_random", sep="")), sep="")}.

\item \code{circ\_seqs}: [OPTIONAL] An {\R} expression returning the names of
      the circular sequences (in a character vector). If the \code{seqnames}
      field is specified, then \code{circ\_seqs} must be a subset of it.
      E.g. \code{"chrM"} for rn4 or \code{c("chrM", "2micron")} for the sacCer2
      genome (Yeast) from UCSC.
      The default value for \code{circ\_seqs} is \code{NULL} (no circular
      sequence).

\item \code{mseqnames}: [OPTIONAL] An {\R} expression returning the names of
      the multiple sequences to forge (in a character vector).
      E.g. \code{c("random", "chrUn")} for \code{rn5}.
      The default value for \code{mseqnames} is \code{NULL} (no multiple
      sequence).

\item \code{PkgDetails}: [OPTIONAL] Some arbitrary text that will be copied
      to the \code{Details} section of the man page of the \term{target
      package}.

\item \code{SrcDataFiles}: [OPTIONAL] Some arbitrary text that will be copied
      to the \code{Note} section of the man pages of the \term{target package}.
      \code{SrcDataFiles} should describe briefly where the \term{sequence
      data files} are coming from. Will typically contain URLs (permanent
      URLs are a must).

\item \code{PkgExamples}: [OPTIONAL] Some {\R} code (possibly with comments)
      that will be added to the \code{Examples} section of the man page of the
      \term{target package}. Don't forget to list the packages used in these
      examples in the \code{Suggests} field.

\item \code{seqs\_srcdir}: The absolute path to the folder containing the
      \term{sequence data files}.

\item \code{seqfile\_name}: Required if the \term{sequence data files} is a
      single twoBit file (e.g. \code{musFur1.2bit}). Ignored otherwise.

\item \code{seqfiles\_prefix}, \code{seqfiles\_suffix}: [OPTIONAL] Needed only
      if you are using a collection of \term{sequence data files}.
      The common prefix and suffix that need to be added to all the sequence
      names (fields \code{seqnames} and \code{mseqnames}) to get the name of
      the corresponding FASTA file.
      Default values are the empty prefix for \code{seqfiles\_prefix} and
      \code{.fa} for \code{seqfiles\_suffix}.

\item \code{ondisk\_seq\_format}: [OPTIONAL] Specifies how the single sequences
      should be stored in the \term{target package}. Can be \code{2bit},
      \code{rda}, \code{fa.rz}, or \code{fa}.
      If \code{2bit} (the default), then all the single sequences are stored
      in a single twoBit file.
      If \code{rda}, then each single sequence is stored in a separated
      serialized \Rclass{XString} object (one per single sequence).
      If \code{fa.rz} or \code{fa}, then all the single sequences are stored
      in a single FASTA file (compressed in the RAZip format if \code{fa.rz}).
\end{itemize}

\subsubsection{An example}

The \term{seed files} used for the packages forged by the Bioconductor project
are included in the \Rpackage{BSgenome} package:
<<>>=
library(BSgenome)
seed_files <- system.file("extdata", "GentlemanLab", package="BSgenome")
tail(list.files(seed_files, pattern="-seed$"))

## Display seed file for musFur1:
musFur1_seed <- list.files(seed_files, pattern="\\.musFur1-seed$", full.names=TRUE)
cat(readLines(musFur1_seed), sep="\n")

## Display seed file for rn4:
rn4_seed <- list.files(seed_files, pattern="\\.rn4-seed$", full.names=TRUE)
cat(readLines(rn4_seed), sep="\n")
@

From now we assume that you have managed to prepare the \term{seed file} for
your \term{target package}.


\subsection{Forge the \term{target package}}

To forge the \term{target package}, start \R, load the \Rpackage{BSgenome}
package, and call the \Rfunction{forgeBSgenomeDataPkg} function on your
\term{seed file}. For example, if the path to your \term{seed file} is
\texttt{"path/to/my/seed"}, do:
<<eval=false>>=
library(BSgenome)
forgeBSgenomeDataPkg("path/to/my/seed")
@

Depending on the size of the genome and your hardware, this can take between
2 minutes and 1 or 2 hours. By default \Rfunction{forgeBSgenomeDataPkg} will
create the source tree of the \term{target package} in the current directory.

Once \Rfunction{forgeBSgenomeDataPkg} is done, ignore the warnings (if any),
quit \R, and build the source package (tarball) with
\begin{verbatim}
    R CMD build <pkgdir>
\end{verbatim}
where \code{<pkgdir>} is the path to the source tree of the package.

Then check the package with
\begin{verbatim}
    R CMD check <tarball>
\end{verbatim}
where \code{<tarball>} is the path to the tarball produced by
\code{R CMD build}.

Finally install the package with
\begin{verbatim}
    R CMD INSTALL <tarball>
\end{verbatim}
and use it!


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{How to forge a BSgenome data package with masked sequences}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\subsection{Obtain and prepare the mask data}

\subsubsection{What do I need?}

The mask data are not available for all organisms. What you download exactly
depends of course on what's available and also on what built-in masks you
want to have in the \term{2nd target package}.
4 kinds of built-in masks are currently supported by BSgenomeForge:
\begin{itemize}
\item the masks of assembly gaps, aka ``the AGAPS masks'';
\item the masks of intra-contig ambiguities, aka ``the AMB masks'';
\item the masks of repeat regions that were determined by the RepeatMasker
      software, aka ``the RM masks'';
\item the masks of repeat regions that were determined by the Tandem Repeats
      Finder software (where only repeats with period less than or equal to
      12 were kept), aka ``the TRF masks''.
\end{itemize}

For the AGAPS masks, you need UCSC ``gap'' or NCBI ``agp'' files. It can be one
file per chromosome or a single big file containing the assembly gap information
for all the chromosomes together. In the former case, the name of each file must
be of the form \textit{<prefix>}\textit{<seqname>}\textit{<suffix>}. Like for
the FASTA files in group 1, \textit{<seqname>} must be the name of the sequence
(sequence names for FASTA files and AGAPS masks must match) and
\textit{<prefix>} and \textit{<suffix>} must be a prefix and a suffix
(possibly empty) that are the same for all the files (this prefix/suffix
doesn't need to be, and typically is not, the same as for the FASTA files in
group 1).

You don't need any file for the AMB masks.

For the RM masks, you need RepeatMasker \texttt{.out} files. Like for the AGAPS
masks, it can be one file per chromosome or a single big file containing the
RepeatMasker information for all the chromosomes together.
In the former case, the name of each file must also be of the form
\textit{<prefix>}\textit{<seqname>}\textit{<suffix>}. Same comments apply
as for the AGAPS masks above.

For the TRF masks, you need Tandem Repeats Finder \texttt{.bed} files.
Again, it can be one file per chromosome or a single big file.
In the former case, the name of each file must also be of the form
\textit{<prefix>}\textit{<seqname>}\textit{<suffix>}). Same comments apply
as for the AGAPS masks above.

Again, for some organisms none of the masks above are available or only some
of them are.

\subsubsection{An example}

Here is how the \term{mask data files} for the
\Rpackage{BSgenome.Rnorvegicus.UCSC.rn4} package were obtained and prepared:

\begin{itemize}
  \item AGAPS masks: all the \texttt{chr*\_gap.txt.gz} files (UCSC ``gap''
        files) were downloaded from the UCSC \textit{database} folder
        \footnote{\url{http://hgdownload.cse.ucsc.edu/goldenPath/rn4/database/}}
        for \code{rn4}. This was done with the standard Unix/Linux
        \code{ftp} command:
        \begin{verbatim}
            ftp hgdownload.cse.ucsc.edu # login as "anonymous"
            cd goldenPath/rn4/database
            prompt
            mget chr*_gap.txt.gz
        \end{verbatim}
        Then all the downloaded files were uncompressed with:
        \begin{verbatim}
            for file in chr*_gap.txt.gz; do gunzip $file ; done
        \end{verbatim}
  \item RM masks: file \texttt{chromOut.tar.gz} was downloaded from
        the UCSC \textit{bigZips} folder and extracted with:
        \begin{verbatim}
            tar zxf chromOut.tar.gz
        \end{verbatim}
  \item TRF masks: file \texttt{chromTrf.tar.gz} was downloaded from
        the UCSC \textit{bigZips} folder and extracted with:
        \begin{verbatim}
            tar zxf chromTrf.tar.gz
        \end{verbatim}
\end{itemize}

\subsubsection{The \textit{<masks\_srcdir>} folder}

From now we assume that you've downloaded (checking the md5sums is always a
good idea) and possibly extracted all the \term{mask data files},
and that they are located in the \textit{<masks\_srcdir>} folder.

Note that all the \term{mask data files} should be located directly in the
\textit{<masks\_srcdir>} folder, not in subfolders of this folder.


\subsection{Prepare the seed file for the masked BSgenome data package}

\subsubsection{Overview}

The \term{seed file} for the \term{BSgenome data package} with masked
sequences (i.e. \term{2nd target package}) is similar to the \term{seed file}
for the \term{BSgenome data package} containing the corresponding bare
sequences (a.k.a. the \term{reference target package} i.e. the package you
forged in the previous section). It will contain all the information needed
by the \Rfunction{forgeMaskedBSgenomeDataPkg} function to forge the
\term{2nd target package}.

The fields of this \term{seed file} are described in the 3 following
sub-subsections.

Alternatively, the reader in a hurry can go directly to the last
sub-subsection of this subsection for an example of \term{seed file}.

\subsubsection{Standard \texttt{DESCRIPTION} fields}

\begin{itemize}
\item \code{Package}: Name to give to the \term{2nd target package}.
      The convention used for the packages built by the Bioconductor project
      is to use the same name as the \term{reference target package} with the
      \code{.masked} suffix added to it.

\item \code{Title}: The title of the package. E.g. \code{Full masked genome
      sequences for Rattus norvegicus (UCSC version rn4)}.

\item \code{Description}, \code{Version}, \code{Author}, \code{Maintainer},
      \code{License}: See previous section.
\end{itemize}

\subsubsection{Non-standard \texttt{DESCRIPTION} fields}

\begin{itemize}
\item \code{organism\_biocview}: Same as for the \term{reference target
      package}.

\item \code{source\_url}: The permanent URL where the \term{mask data files}
      used to forge this package can be found.
\end{itemize}

\subsubsection{Other fields}

\begin{itemize}
\item \code{RefPkgname}: The name of the \term{reference target package}.

\item \code{nmask\_per\_seq}: The number of masks per sequence (1 to 4).

\item \code{PkgDetails}, \code{PkgExamples}: See previous section.

\item \code{SrcDataFiles}: [OPTIONAL] Some arbitrary text that will be copied
      to the \code{Note} section of the man pages of the \term{target package}.
      \code{SrcDataFiles} should describe briefly where the \term{mask data
      files} are coming from. Will typically contain URLs (permanent
      URLs are a must).

\item \code{masks\_srcdir}: The path to the \textit{<masks\_srcdir>} folder.

\item \code{AGAPSfiles\_type}: [OPTIONAL] Must be \code{gap} (the default) if
      the \term{source data files} containing the AGAPS masks information are
      UCSC ``gap'' files, or \code{agp} if they are NCBI ``agp'' files.

\item \code{AGAPSfiles\_name}: [OPTIONAL] Omit this field if you have one
      \term{source data file} per single sequence for the AGAPS masks and use
      the \code{AGAPSfiles\_prefix} and \code{AGAPSfiles\_suffix} fields below
      instead.
      Otherwise, use this field to specify the name of the single big file.

\item \code{AGAPSfiles\_prefix}, \code{AGAPSfiles\_suffix}: [OPTIONAL] Omit
      these fields if you have one single big \term{source data file} for all
      the AGAPS masks and use the \code{AGAPSfiles\_name} field above instead.
      Otherwise, use these fields to specify the common prefix and suffix that
      need to be added to all the single sequence names (field \code{seqnames})
      to get the name of the file that contains the corresponding AGAPS mask
      information.
      Default values are the empty prefix for \code{AGAPSfiles\_prefix} and
      \code{\_gap.txt} for \code{AGAPSfiles\_suffix}.

\item \code{RMfiles\_name}, \code{RMfiles\_prefix}, \code{RMfiles\_suffix}:
      [OPTIONAL] Those fields work like the \code{AGAPSfiles*} fields above
      but for the RM masks.
      Default values are the empty prefix for \code{RMfiles\_prefix} and
      \texttt{.fa.out} for \code{RMfiles\_suffix}.

\item \code{TRFfiles\_name}, \code{TRFfiles\_prefix}, \code{TRFfiles\_suffix}:
      [OPTIONAL] Those fields work like the \code{AGAPSfiles*} fields above
      but for the TRF masks.
      Default values are the empty prefix for \code{TRFfiles\_prefix} and
      \texttt{.bed} for \code{TRFfiles\_suffix}.
\end{itemize}

\subsubsection{An example}

The \term{seed files} used for the packages forged by the Bioconductor project
are included in the \Rpackage{BSgenome} package:
<<>>=
library(BSgenome)
seed_files <- system.file("extdata", "GentlemanLab", package="BSgenome")
tail(list.files(seed_files, pattern="\\.masked-seed$"))
rn4_masked_seed <- list.files(seed_files, pattern="\\.rn4\\.masked-seed$", full.names=TRUE)
cat(readLines(rn4_masked_seed), sep="\n")
@

From now we assume that you have managed to prepare the \term{seed file} for
the \term{2nd target package}.


\subsection{Forge the \term{2nd target package}}

To forge the package, start \R, make sure the \term{reference target package}
is installed (try to load it), load the \Rpackage{BSgenome} package, and call
the \Rfunction{forgeMaskedBSgenomeDataPkg} function on the \term{seed file}
you made for the \term{2nd target package}.
For example, if the path to your \term{seed file} is
\texttt{"path/to/my/seed"}, do:
<<eval=false>>=
library(BSgenome)
forgeMaskedBSgenomeDataPkg("path/to/my/seed")
@

Depending on the size of the genome and your hardware, this can take between
2 minutes and 1 or 2 hours. By default \Rfunction{forgeMaskedBSgenomeDataPkg}
will create the source tree of the \term{2nd target package} in the current
directory.

Once \Rfunction{forgeMaskedBSgenomeDataPkg} is done, ignore the warnings (if
any), quit \R, and build the source package (tarball) with
\begin{verbatim}
    R CMD build <pkgdir>
\end{verbatim}
where \code{<pkgdir>} is the path to the source tree of the package.

Then check the package with
\begin{verbatim}
    R CMD check <tarball>
\end{verbatim}
where \code{<tarball>} is the path to the tarball produced by
\code{R CMD build}.

Finally install the package with
\begin{verbatim}
    R CMD INSTALL <tarball>
\end{verbatim}
and use it!

If you want to distribute this package, you need to distribute it with
the \term{reference target package} since it depends on it.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Session information}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


The output in this vignette was produced under the following conditions:

<<>>=
sessionInfo()
@

\end{document}