%\VignetteIndexEntry{Validate genomic data with "DupChecker" package}
%\VignettePackage{DupChecker}
%\VignetteEngine{knitr::knitr}

\documentclass[12pt]{article}
\usepackage{hyperref}

\author{Quanhu Sheng$^{*}$, Yu Shyr, Xi Chen \\[1em]
\small{Center for Quantitative Sciences, Vanderbilt University, Nashville, USA} \\
\small{\texttt{$^*$shengqh (at) gmail.com}}}

\title{DupChecker: a Bioconductor package for checking high-throughput genomic data redundancy in meta-analysis}

\begin{document}

\maketitle

\begin{abstract}
Meta-analysis has become a popular approach for high-throughput genomic data analysis because it can often significantly increase the power to detect biological signals or patterns. However, when using publicly available databases for meta-analysis, sample duplication is a frequently encountered problem, especially for gene expression data. Failing to remove duplicates can lead to false positive findings, misleading clustering patterns, model over-fitting, and other issues in the subsequent data analysis. We developed a Bioconductor package, DupChecker, that efficiently identifies duplicated samples by generating MD5 fingerprints for the raw data. A real-data example demonstrates the usage and output of the package.

Researchers may not pay enough attention to checking for and removing duplicated samples, and such data contamination can make the results or conclusions of a meta-analysis questionable. We suggest applying DupChecker to examine all gene expression data sets before any data analysis step.

In this vignette, we demonstrate the application of DupChecker as a package for checking high-throughput genomic data redundancy in meta-analysis. DupChecker downloads GEO/ArrayExpress raw data files from the NCBI/EBI FTP servers, extracts the individual data files, calculates an MD5 fingerprint for each data file, and validates the redundancy of those data files.

\vspace{1em}

\textbf{DupChecker version:} \Sexpr{packageDescription("DupChecker")$Version}

\vspace{1em}

\begin{center}
\begin{tabular}{ | l | }
\hline
If you use DupChecker in published research, please cite: \\
\\
Quanhu Sheng, Yu Shyr, Xi Chen: DupChecker: a bioconductor package for checking high-throughput genomic \\
data redundancy in meta-analysis. BMC Bioinformatics 2014, 15:323. \\
\hline
\end{tabular}
\end{center}
\end{abstract}

\tableofcontents

\section{Introduction}

Meta-analysis has become a popular approach for high-throughput genomic data analysis because it can often significantly increase the power to detect biological signals or patterns. However, when using publicly available databases for meta-analysis, sample duplication is a frequently encountered problem, especially for gene expression data. Not removing duplicates would make study results questionable. We developed the package DupChecker, which efficiently identifies duplicated samples by generating an MD5 fingerprint for each individual data file.

\section{Standard workflow}

\subsection{Quick start}

Here we show the most basic steps of a validation procedure. You need to create a target directory in which to store the data. To illustrate the procedure, we use two very small datasets without redundant CEL files. You may also try a real example with redundancy in Section~\ref{realexample}.
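If DupChecker is not yet installed, it can be obtained from Bioconductor. The following is a minimal installation sketch, not part of the original workflow, assuming the package is available in your Bioconductor release (current Bioconductor releases install packages via BiocManager):

<<eval=FALSE>>=
## Installation sketch: assumes DupChecker is available in your
## Bioconductor release.
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("DupChecker")
@

With the package installed, the quick-start workflow looks like this: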
<<>>=
library(DupChecker)

## Create a "DupChecker" directory under the temporary directory
rootDir <- paste0(dirname(tempdir()), "/DupChecker")
dir.create(rootDir, showWarnings = FALSE)

message(paste0("Downloading data to ", rootDir, " ..."))
geoDownload(datasets=c("GSE1478"), targetDir=rootDir)
arrayExpressDownload(datasets=c("E-MEXP-3872"), targetDir=rootDir)

datafile <- buildFileTable(rootDir=rootDir, filePattern="cel$")
result <- validateFile(datafile)
if(result$hasdup){
  duptable <- result$duptable
  write.csv(duptable, file=paste0(rootDir, "/duptable.csv"))
}
@

\subsection{GEO/ArrayExpress data download}

First, the functions geoDownload and arrayExpressDownload download raw data from the NCBI and EBI FTP servers for the datasets the user provides. Once the compressed raw data have been downloaded, the individual data files are extracted from the archives. Some of these data files may themselves be compressed. Because the same data file compressed by different software yields different MD5 fingerprints, such files are decompressed and the redundancy check is performed on the decompressed data files.

<<>>=
datatable <- geoDownload(datasets=c("GSE1478"), targetDir=rootDir)
datatable

datatable <- arrayExpressDownload(datasets=c("E-MEXP-3872"), targetDir=rootDir)
datatable
@

The returned datatable is a data frame containing each dataset name and the number of files in that dataset.

There are two situations in which you may prefer to download and decompress the data using external tools instead: 1) the download or decompression takes too much time within the R environment; 2) for very large files, the internal tools in R may produce an incomplete or interrupted download, which causes the subsequent "untar" or "gunzip" step to fail.

\subsection{Build file table}

Second, the function buildFileTable searches the subdirectories of the root directory for all files (the default) or only for files matching the filePattern parameter. The resulting data frame contains two columns: dataset (the directory name) and filename. rootDir may also be a vector of directories. In the following example, we restrict the search to Affymetrix CEL files.

<<>>=
datafile <- buildFileTable(rootDir=rootDir, filePattern="cel$")
@

\subsection{Validate file redundancy}

The function validateFile calculates an MD5 fingerprint for each file in the table and then checks whether any two files share the same fingerprint. Files with identical fingerprints are treated as duplicates. The function returns a table containing all duplicated files and their datasets.

<<>>=
result <- validateFile(datafile)
if(result$hasdup){
  duptable <- result$duptable
  write.csv(duptable, file=paste0(rootDir, "/duptable.csv"))
}
@

\subsection{Real example}
\label{realexample}

Three colon cancer related datasets (GSE14333, GSE13067 and GSE17538) contain a serious data redundancy problem. You may run the following code to perform the validation. Depending on network bandwidth and computer performance, it may take from a few minutes to a few hours.

<<eval=FALSE>>=
library(DupChecker)

rootDir <- paste0(dirname(tempdir()), "/DupChecker_RealExample")
dir.create(rootDir, showWarnings = FALSE)

geoDownload(datasets=c("GSE14333", "GSE13067", "GSE17538"), targetDir=rootDir)
datafile <- buildFileTable(rootDir=rootDir, filePattern="cel$")
result <- validateFile(datafile)
if(result$hasdup){
  duptable <- result$duptable
  write.csv(duptable, file=paste0(rootDir, "/duptable.csv"))
}
@

Table~\ref{tab:duptable} illustrates the duplication among these three datasets. GSE13067{[}64/74{]} means that 64 of the 74 CEL files in the GSE13067 dataset were duplicated in the other two datasets.
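The bracketed counts in the table's column headers can also be recomputed from the objects produced above. The following is a hypothetical sketch, not part of the DupChecker API, assuming that duptable stores the MD5 fingerprints as row names and has one character column per dataset, with an empty string wherever a dataset does not contain the corresponding file:

<<eval=FALSE>>=
## Hypothetical sketch: for each dataset, count the files that appear in
## the duplication table (non-empty cells) and compare with the total
## number of files found by buildFileTable().
dupCounts   <- colSums(duptable != "")
totalCounts <- table(datafile$dataset)
sprintf("%s[%d/%d]", names(dupCounts), as.integer(dupCounts),
        as.integer(totalCounts[names(dupCounts)]))
@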
\begin{table}[ht]
\centering
\caption{Part of the duplication table of the three datasets}
\label{tab:duptable}
\resizebox{\columnwidth}{!}{
\begin{tabular}{|c|c|c|c|}
\hline
MD5 & GSE13067{[}64/74{]} & GSE14333{[}231/290{]} & GSE17538{[}167/244{]} \\
\hline
001ddd757f185561c9ff9b4e95563372 & & GSM358397.CEL & GSM437169.CEL \\
00b2e2290a924fc2d67b40c097687404 & & GSM358503.CEL & GSM437210.CEL \\
012ed9083b8f1b2ae828af44dbab29f0 & GSM327335.CEL & GSM358620.CEL & \\
023c4e4f9ebfc09b838a22f2a7bdaa59 & & GSM358441.CEL & GSM437117.CEL \\
\hline
\end{tabular}
}
\end{table}

\section{Discussion}

We illustrated the application using gene expression data, but the DupChecker package can also be applied to other types of high-throughput genomic data, including next-generation sequencing data.

\section{Session Info}

<<results='asis'>>=
toLatex(sessionInfo())
@

\end{document}