%\VignetteIndexEntry{RnBeads Annotation} %\VignetteKeywords{analysis, methylation, RnBeads} \documentclass[a4paper]{scrartcl} \usepackage[OT1]{fontenc} \usepackage{hyperref} \usepackage{color} \usepackage[table]{xcolor} \usepackage{vmargin} \usepackage[english,american]{babel} \usepackage{fancyhdr} \selectlanguage{american} \definecolor{darkblue}{rgb}{0.0,0.0,0.3} \hypersetup{colorlinks,breaklinks, linkcolor=darkblue,urlcolor=darkblue, anchorcolor=darkblue,citecolor=darkblue} \setpapersize{A4} \setmargins{2.5cm}{2.0cm}% % left margin and upper margin {16cm}{23cm}% % text width and height {60pt}{22pt}% % Header height and margin {0pt}{30pt}% % \footheight (egal) and footer margin %%%%%%%%%%%%%%%%% Some style sttings %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \newcommand{\insertemptypage}[0]{\newpage \thispagestyle{empty} \vspace*{\fill}} % Layout \newcommand{\emphr}[1]{\textcolor{red}{\sffamily{\textbf{#1}}}} \newcommand{\itemr}[1]{\item \emphr{#1} \\} \newcommand{\ns}[1]{{\sffamily #1}} % non-serif font \newcommand{\nsb}[1]{\ns{\textbf{#1}}} \newcommand{\nsbc}[1]{\nsb{\textcolor{blue}{#1}}} \newcommand{\nsbcr}[1]{\nsb{\textcolor{red}{#1}}} \newcommand{\cs}[1]{\texttt{#1}} % code style \newcommand{\vocab}[1]{\textit{\ns{#1}}} \definecolor{grey}{rgb}{0.6,0.6,0.6} \definecolor{backgroundtabheader}{rgb}{0.7,0.7,0.7} \definecolor{backgroundtabrow1}{rgb}{1,1,1} \definecolor{backgroundtabrow2}{rgb}{0.9,0.9,0.9} % Ueberschriften \renewcommand*{\sectfont }{\color{blue} \sffamily \bfseries} %\renewcommand{\thesection}{\Alph{section}} %neuer absatz \newcommand{\newpara}[0]{\vskip 10pt} %simple hyperrefs without having to type the link double \newcommand{\hrefs}[1]{\href{#1}{#1}} % Mathematik-Modus % Mengen \newcommand{\setN}[0]{\ensuremath{\mathds{N}}} \newcommand{\setR}[0]{\ensuremath{\mathds{R}}} \newcommand{\setZ}[0]{\ensuremath{\mathds{Z}}} \newcommand{\set}[1]{\left\{#1\right\}} \newcommand{\tupt}[1]{\ensuremath{\langle}#1\ensuremath{\rangle}} % Statistics \newcommand{\pr}[0]{\ensuremath{\text{Pr}}} \newcommand{\E}[0]{\ensuremath{\text{E}}} \newcommand{\var}[0]{\ensuremath{\text{Var}}} \newcommand{\cov}[0]{\ensuremath{\text{Cov}}} \newcommand{\sd}[0]{\ensuremath{\text{sd}}} % Statistical Independence \newcommand\independent{\protect\mathpalette{\protect\independenT}{\perp}} \def\independenT#1#2{\mathrel{\setbox0\hbox{$#1#2$}% \copy0\kern-\wd0\mkern4mu\box0}} % Unsicher \newcommand{\chk}[0]{\textcolor{red}{(\textbf{???)}}} \newcommand{\chktxth}[1]{\textcolor{red}{(\textbf{???} #1 \textbf{???})}} \newcommand{\chkside}[0]{\marginpar{\textcolor{red}{(\textbf{???)}}}} \newcommand{\chktxt}[1]{\marginpar{\textcolor{red}{(\textbf{???} #1 \textbf{???})}}} \newcommand{\todo}[1]{\marginpar{\textcolor{red}{\textbf{TODO:}\newline #1}}} \newcommand{\todomark}[0]{\marginpar{\textcolor{red}{\textbf{TODO}}}} \newcommand{\chkm}[0]{\textcolor{red}{(\textbf{???)}} \marginpar{\textcolor{red}{\textbf{TODO}}}} \newcommand{\ph}[1]{\paragraph{} \textcolor{red}{[ #1 ]}} %placeholder \newcommand{\ra}[0]{$\rightarrow$} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \lhead{\cs{RnBeads Annotations}} \rhead{\thepage} \cfoot{} \pagestyle{fancy} \begin{document} \title{\cs{RnBeads} -- Annotations} \author{Yassen Assenov, Fabian M\"uller, Pavlo Lutsik, Michael Scherer} \thispagestyle{empty} \maketitle \section{Introduction} \cs{RnBeads} is accompanied by four annotation packages: \cs{RnBeads.hg19}, \cs{RnBeads.mm10}, \cs{RnBeads.mm9}, \cs{RnBeads.rn5}. They contain genomic locus and Infinium probe annotation tables, as well as definition tables for tiling regions, genes, promoters and CpG islands. The \cs{RnBeads} package contains routines for handling and updating these annotations. The vignette named \textbf{RnBeads -- Comprehensive DNA Methylation Analysis} also contains examples that use functions presented here.\\ The data packages described above are hosted on the \cs{RnBeads} website\footnote{\href{http://rnbeads.mpi-inf.mpg.de/}{rnbeads.mpi-inf.mpg.de}}. Note that at least one of the data packages should be installed prior to installing \cs{RnBeads}. \section{Data Extraction and Processing} \subsection{Sites} Currently, every data package examines cytosines in the context of CpG and contains an annnotation table of all CpGs in the respective genome. CpG density and GC content are also computed for the neighborhood of length 100 base pairs centered on each locus. The total number of dinucleotides annotated in HG19 is 28,217,009 represented both on the forward and reverse DNA strands. \subsection{HumanMethylation450 Probes} The Infinium HumanMethylation450 BeadChip interrogates over 470,000 loci in the human genome. Control and methylation probe annotation is obtained from online sources, validated for consistency and enriched with locations for the SNP-associated loci obtained from dbSNP. In addition, the annotation includes information about CpG density and GC content of the sequence neighborhood around every probe target as described in the section above. \subsection{dbSNP} Single-Nucleotide Polymorphisms (SNPs) in the human genome are downloaded from the respective FTP directory of dbSNP\footnote{\href{http://www.ncbi.nlm.nih.gov/snp/organisms}{www.ncbi.nlm.nih.gov/snp/organisms/}}. \cs{RnBeads} uses the provided VCF files; for mouse and rat these are one file per chromosome. The following table shows the references in dbSNP used for the supported assemblies. \\ \begin{tabular}{|c|l|} \hline \multicolumn{1}{|c}{\textbf{Package}} & \multicolumn{1}{c|}{\textbf{Reference}} \\ \hline RnBeads.hg19 & GRCh37.p10 \\ RnBeads.mm10 & GCF\_000001635.21 \\ RnBeads.rn5 & GCF\_000001895.4 \\ \hline \end{tabular} \\ The data package \cs{RnBeads.mm9} does not include SNP-related information. Only SNPs that meet all of the following criteria are considered: \begin{enumerate} \item \textbf{assembly} is \texttt{GRCh37.p5,reference}; \item \textbf{chrom} is the expected chromosome; \item \textbf{allele origin} is not \texttt{somatic}; \item (for HG19) major allele frequency is at most 0.95. \end{enumerate} These SNP positions are used to enrich the annotations for CpG dinucleoties and Infinium probes. \subsection{Regions} Every data package defines the following sets of regions for the dedicated assembly: \begin{description} \item[Tiling regions]{Tiling regions with a window size of 5 kilobases are defined over the whole genome.} \item[Genes and promoters]{Ensembl\footnote{\href{http://www.ensembl.org/}{www.ensembl.org/}} gene definitions are downloaded using the \cs{biomaRt} package. A promoter is defined as the region spanning 1,500 bases upstream and 500 bases downstream of the transcription start site of the corresponding gene.} \item[GpG islands]{The CpG island track is downloaded from the dedicated FTP directory of the UCSC Genome Browser\footnote{\href{http://genome.ucsc.edu/}{genome.ucsc.edu/}}.} \end{description} CpG density and GC content are computed for all region types listed above. \section{Annotations} <>= suppressPackageStartupMessages(library(RnBeads)) #for knitr: avoid tidy because it messes up line breaks # opts_chunk$set(tidy=FALSE) @ \subsection{Genome Assemblies} Every annotation table and other data structure is specific to a genome assembly. The list of supported assemblies can be obtained by the calling the function \cs{rnb.get.assemblies}, as shown in the example below. <>= rnb.get.assemblies() @ \subsection{Built-in Annotation Tables} Control probe information for the Illumina 450K array is stored (and can be extracted) as a \cs{data.frame} in which every row corresponds to a control probe. Every other annotation table is represented by a dedicated \cs{GRangesList} object, that is, a list of consistent \cs{GRanges} objects, one per chromosome. Every instance of \cs{GRanges} defines the genomic positions of the corresponding sites, probes or regions. Identifiers, if present, can be obtained using the \cs{names} method. Strand information is also included when applicable. Any additional annotation is stored as metadata. Please refer to the documentations of the \cs{GenomicRanges} package and the \cs{GRanges} class for more details.\\ \begin{table} \centering \rowcolors{2}{backgroundtabrow2}{backgroundtabrow1} \begin{tabular}{|c|p{5.1cm}|p{7cm}|} \hline \multicolumn{1}{|c}{\textbf{No.}} & \multicolumn{1}{c}{\textbf{Column Name}} & \multicolumn{1}{c|}{\textbf{Description}} \\ \hline 1 & ID & Probe identifier. \\ 2 & Target & Probe category. \\ 3 & Color & Probe color. \\ 4 & Description & Probe description, abbreviated. \\ 5 & AVG & ... \\ 6 & Evaluate Green & If probe is used in the evaluation of the green channel. \\ 7 & Evaluate Red & If probe is used in the evaluation of the red channel. \\ 8 & Expected Intensity & Expected intensity of the probe. \\ 9 & Sample-dependent & Flag indicating if the intensity of the probe is sample-dependent. \\ 10 & Index & Index of the probe in its category. \\ \hline \end{tabular} \caption{Columns in the HumanMethylation450 control probe annotation table.} \label{tab:controls} \end{table} In an R session, every annotation object can be accessed using the function \cs{rnb.get.annotation}. The code snippet below shows how the control probe annotations can be extracted. Table~\ref{tab:controls} briefly describes the characteristics of these probes. <>= control.annotation <- rnb.get.annotation("controls450") head(control.annotation) @ \begin{table} \centering \rowcolors{2}{backgroundtabrow2}{backgroundtabrow1} \begin{tabular}{|c|p{5.1cm}|p{7cm}|} \hline \multicolumn{1}{|c}{\textbf{No.}} & \multicolumn{1}{c}{\textbf{Column Name}} & \multicolumn{1}{c|}{\textbf{Description}} \\ \hline 1 & Design & Probe design type. \tabularnewline 2 & Color & Color channel. \tabularnewline 3 & Context & Probe context. \tabularnewline 4 & Random & Flag indicating if the probe\'s location was randomly chosen. \tabularnewline 5 & HumanMethylation27 & Flag indicating if the probe is also covered by HumanMethylation27k assay. \tabularnewline 6 & Mismatches A & Number of base mismatches between the provided and expected probe sequence. \tabularnewline 7 & Mismatches B & Number of base mismatches between the provided and expected probe sequence. \tabularnewline 8 & CGI Relation & Relation to a CpG island. \tabularnewline 9 & CpG & Number of CpG dinucleotides in the sequence neighborhood of the target. \tabularnewline 10 & GC & Percentage of C and G bases in the sequence neighborhood of the target. \tabularnewline 11 & SNPs 3 & Number of SNPs that overlap with the last 3 bases of a probe's target sequence. \tabularnewline 12 & SNPs 5 & Number of SNPs that overlap with the last 5 bases of a probe's target sequence. \tabularnewline 13 & SNPs Full & Number of SNPs that overlap with the probe's sequence. \\ \hline \end{tabular} \caption{Metadata accompanying the HumanMethylation450 probe definitions.} \label{tab:probes} \end{table} In a similar manner, the command below extracts the annotation of HumanMethylation450 probes. The metadata structure for the methylation and SNP-associated probes is presented in Table~\ref{tab:probes}. <>= probe.annotation <- rnb.get.annotation("probes450") probe.annotation @ \begin{table} \centering \rowcolors{2}{backgroundtabrow2}{backgroundtabrow1} \begin{tabular}{|c|p{5.1cm}|p{7cm}|} \hline \multicolumn{1}{|c}{\textbf{No.}} & \multicolumn{1}{c}{\textbf{Column Name}} & \multicolumn{1}{c|}{\textbf{Description}} \\ \hline 1 & symbol & Gene symbols associated with this region. \\ 2 & entrezID & Entrez gene identifiers associated with this region. \\ 3 & CpG & Number of CpG dinucleotides in the region. \\ 4 & GC & Total number of C and G bases in the region. \\ \hline \end{tabular} \caption{Metadata accompanying the Ensembl gene and promoter definitions.} \label{tab:genes} \end{table} Every data package integrates various region types with the CpG sites (and Infinium probes for HG19) presented in the previous section. Obtaining a region annotation information is very similar to accessing site or probe annotation. The code snippet below extracts the available gene definitions. Table~\ref{tab:genes} provides information about the metadata of these regions. <>= gene.annotation <- rnb.get.annotation("genes") attr(gene.annotation, "version") gene.annotation @ The function \cs{rnb.region.types} returns all supported region definitions for a given genome assembly. These currently include \cs{"cpgislands"}, \cs{"genes"}, \cs{"promoters"}, \cs{"tiling"}. The \cs{GRangesList} objects are well optimized for storage and fast access of genomic elements. However, exporting such objects to tables in human-readable form is not trivial. The function \cs{rnb.annotation2data.frame} provides an easy way to convert a probe or region annotation to a single table. The code snippet below exemplifies this function by saving the gene promoter definitions to a comma-separated value (CSV) file named \cs{promoters.csv}. <>= promoters <- rnb.annotation2data.frame(rnb.get.annotation("promoters")) write.csv(promoters, file = "promoters.csv", row.names = FALSE) @ \subsection{Custom Annotations} Custom region annotations can be included using the function \cs{rnb.set.annotation}. The information is either loaded from a BED file, or it must be presented as a \cs{data.frame} containing at least the following three columns - \textbf{chromosome}, \textbf{start} and \textbf{end}. Note that the included genomic regions remain available in the current R session only. \\ The vignette named \textbf{RnBeads -- Comprehensive DNA Methylation Analysis} contains a detailed example for including custom annotations to an analysis. \section{Data Sources} The data packages accompanying \cs{RnBeads} obtained data for the annotation tables from the following sources: \begin{description} \item[Other Bioconductor packages] Methylation and control probe annotation is obtained from the package \href{http://www.bioconductor.org/packages/release/data/annotation/html/FDb.InfiniumMethylation.hg19.html}{FDb.InfiniumMethylation.hg19}. In addition, \cs{RnBeads} uses the packages: \begin{description} \item \href{http://www.bioconductor.org/packages/release/data/annotation/html/BSgenome.Hsapiens.UCSC.hg19.html}{BSgenome.Hsapiens.UCSC.hg19} \item \href{http://www.bioconductor.org/packages/release/data/annotation/html/BSgenome.Mmusculus.UCSC.mm9.html}{BSgenome.Mmusculus.UCSC.mm9} \item \href{http://www.bioconductor.org/packages/release/data/annotation/html/BSgenome.Mmusculus.UCSC.mm10.html}{BSgenome.Mmusculus.UCSC.mm10} \item \href{http://www.bioconductor.org/packages/release/data/annotation/html/BSgenome.Rnorvegicus.UCSC.rn5.html}{BSgenome.Rnorvegicus.UCSC.rn5} \end{description} to validate the probe annotations and to enrich all annotation tables with sequence-based properties. \item[CRAN packages] Additional control probe information is obtained from the package \href{http://cran.r-project.org/web/packages/HumMeth27QCReport/index.html}{HumMeth27QCReport}. \item[dbSNP] Information on SNPs in the human genome is downloaded as BED files from the \href{ftp://ftp.ncbi.nih.gov/snp/organisms/}{FTP site of dbSNP}. \item[Gene Expression Omnibus (GEO)] Methylation and control probe annotation is obtained from GEO record \href{http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL13534}{GPL13534} and validated for consistency with the other sources. \item[Biomart] \href{http://www.ensembl.org/}{Ensembl} gene definitions (Ensembl versions 73 (for hg19, mm10 and rn5), and 67 (for mm9) are downloaded using the \href{http://www.bioconductor.org/packages/release/bioc/html/biomaRt.html}{biomaRt} package. \item[UCSC Genome Browser] The CpG island tracks are downloaded from the \href{ftp://hgdownload.cse.ucsc.edu/goldenPath/}{FTP site of the UCSC Genome Browser}. \end{description} \end{document}