% -*- mode: noweb; noweb-default-code-mode: R-mode; -*-
%\VignetteIndexEntry{SScore primer}
%\VignetteKeywords{Analysis, Affymetrix}
%\VignetteDepends{GCSscore}
%\VignettePackage{GCSscore}
\documentclass[12pt]{article}

\usepackage{hyperref}
\usepackage[authoryear,round]{natbib}
\usepackage[utf8]{inputenc}

\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
%\headheight=-.3in

%\newcommand{\scscst}{\scriptscriptstyle}
%\newcommand{\scst}{\scriptstyle}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}

\author{Guy M. Harris, Shahroze Abbas,\\ and Michael F. Miles}
\begin{document}
% \SweaveOpts{concordance=TRUE}
\title{\Rpackage{GCSscore}: An R Package for Differential Expression Analysis of \\ WT-Type Affymetrix Microarrays \\ from Probe-Level Data}

\maketitle
\tableofcontents
\newpage
\section{Introduction}
\Rpackage{GCSscore} is an R package for detecting differential gene expression on whole transcriptome Affymetrix microarrays. It is based on the original S-Score algorithm, described by \cite{zhang2002}, \cite{kerns2003}, and validated by \cite{kennedy2006b}. These are novel comparative methods for microarray based-gene expression data analysis that utilizes the probe level data. It is based on an error model in which the detected signal is assumed to be proportional to the probe signal for highly expressed genes, but assumed to approach a background level (rather than 0) for genes with low levels of expression. This error model is used to calculate relative changes in probe intensities that converts probe signals into multiple measurements with equalized errors, which are summed over a probe set to form the significance score (S-score).  The original S-score method required the mismatch (MM) probes to estimate non-specific binding (NSB) for each perfect-match (PM) probes, and the MM probes were removed the arrays beginning with the Affymetrix Whole Transcriptome (WT) style arrays. This new algorithm uses a gc-content based NSB, thus eliminating the original algorithm's dependence on MM probes.  The GCS-score algorithm works of all ClariomS, ClariomD, and all other Whole Transcriptome Assays.  It also works for a select number of 3 prime IVT chip types. Assuming no expression differences between chips, the GCS-score output follows a standard normal distribution. Thus, a separate step estimating the probe set expression summary values is not needed and p-values can be easily calculated from the GCS-score output. Furthermore, in previous comparisons of dilution and spike-in microarray datasets, the original S-Score demonstrated greater sensitivity than many existing methods, without sacrificing specificity \citep{kennedy2006}. The \Rpackage{GCSscore} package \citep{harris2019} implements the GCS-score algorithm in the R programming environment, making it available to users of the Bioconductor \footnote{\url{http://www.bioconductor.org/}} project.

\section{What's new in this version}
This is the second public release of the \Rpackage{GCSscore} package.  In this release: there are improvements to algorithm, multiple bug fixes, and updates to the documentation.

\section{Reading in data and generating GCS-scores}
Affymetrix data are generated from microarrays by analyzing the scanned 
image of the chip (stored in a *.DAT file) to produce a *.CEL file. The 
*.CEL file contains, among other information, a decimal number for each 
probe on the chip that corresponds to its intensity. The GCS-score algorithm 
compares two microarrays by combining all of the probe intensities from a 
probesetID / transcriptionclusterID into a single summary statistic for each annotated gene or exon. 
The \Rpackage{GCSscore} package processes the data obtained from .CEL files, which must be 
loaded into R prior to calling the \Rfunction{GCSscore} function.  The \Rfunction{GCSscore} function utilizes the \Rfunction{readCel} function to directly access the individual .CEL files. Additional information regarding the \Rfunction{readCel} function and detailed descriptions of the structure of .CEL files can be found in the \Rpackage{affxparser} vignette. The \Rfunction{readCel} function allows the \Rpackage{GCSscore} package to access additional variables that are necessary for the noise and probe error estimations.

The examples in this vignette will demonstrate the functionality of the \Rpackage{GCSscore} package.
We begin with an example of the most basic GCS-score analysis, which utilize the .CEL data files that are supplied within the \Rpackage{GCSscore} package:

<<loadGCSscore, quiet=TRUE, message=FALSE>>=
library(GCSscore)
@ 

<<results=hide>>= 
# get the path to example CEL files in the package directory:
celpath1 <- system.file("extdata/","MN_2_3.CEL", package = "GCSscore")
celpath2 <- system.file("extdata/","MN_4_1.CEL", package = "GCSscore")

# run GCSscore() function directly on the two .CEL files above:
GCSs.single <- GCSscore(celFile1 = celpath1, celFile2 = celpath2)
@
\section{Important Parameters for the \Rfunction{GCSscore} function}

\begin{description}
\item[celFile1] -- character string giving the .CEL file name the directory in which the *.CEL files are stored. If a directory is not specified, the current working directory is used.
\item[celTable] -- A CSV file containing batch submission information. 
\item[celTab.names] -- If set to TRUE, then the GCS-score batch output is assigned the user-designated name, as specified in the first column of the batch input .CSV file. If set to FALSE, when the user submits a batch job, the column name of the run in the the batch output \Robject{data.table} will be: CEL-file-name1 vs CEL-file-name2.
\item[method] -- This determines the method used to group and tally the probeids when calculating GCS-scores.  The default method is 1.  For Whole Transcriptome arrays: method = 1 is for gene-level (transcriptclusterid-based) analysis. For exon-level (probesetid-based) analysis, set method = 2.  For the older generation arrays (3 IVT-style), if a GC-content based background correction is desired on the 3 IVT arrays, set method = 1, if a PM-MM based background correction is desired, set method = 2 (PM-MM gives identical results to the original S-score package).   
\end{description}

The \Rfunction{GCSscore} function returns an object of class \Robject{ExpressionSet}, a format defined in the \Rpackage{Biobase} package.  The GCS-score differential expression values are contained in the \Robject{assayData[["exprs"]]} section of the \Robject{ExpressionSet}.  The full set of annotation information is included in the \Robject{featureData@data} section of the \Robject{ExpressionSet}.  For viewing and exporting, it is often desirable to extract the relevant information from the \Robject{ExpressionSet} into an well structured object, such as a \Robject{data.table}.  The class \Robject{data.table} is described in the \Rpackage{data.table} package, which is available on CRAN.

<<>>= 
# view class of output:
class(GCSs.single)[1]

# convert GCSscore single-run from ExpressionSet to data.table:
GCSs.single.dt <- 
  data.table::as.data.table(cbind(GCSs.single@featureData@data,
                                  GCSs.single@assayData[["exprs"]]))

# show all column names included in the output:
colnames(GCSs.single.dt)

# show simplified output of select columns and rows:
GCSs.single.dt[10000:10005,
              c("transcriptclusterid","symbol",
                "ref_id","Sscore")]
@

\section{Using the Batch Functionality of GCSscore}
The \Rfunction{GCSscore} function is able to output mulitple GCS-score results into a single file.  This is done by leaving the \Robject{celFile1} and \Robject{celFile2} variables empty, and using the \Robject{celTable} argument instead.  The \Robject{celTable} argument accepts a three column \Robject{data.table} object, that is read into R from a .CSV file via the \Rfunction{fread} function from the \Rpackage{data.table} package.

<<>>=
# get the path to example CSV file in the package directory:
celtab_path <- system.file("extdata",
                           "GCSs_batch_ex.csv", 
                           package = "GCSscore")
# read in the .CSV file with fread():
celtab <- data.table::fread(celtab_path)
# view structure of 'celTable' input:
celtab
@

In these examples, the .CEL files will not be within the working directory.  Therefore, the path to the .CEL files must be added to allow the \Rfunction{GCSscore} function to locate the files.  This is not necessary if the .CEL files are in the working directory:

<<results=hide>>=
path <- system.file("extdata", package = "GCSscore")
celtab$CelFile1 <- celtab[,paste(path,CelFile1,sep="/")]
celtab$CelFile2 <- celtab[,paste(path,CelFile2,sep="/")]
@

The \Rfunction{GCSscore} function will process all of the runs listed in .CSV and each GCS-score run is assigned to a column in the output. If the \Robject{celTab.names} is set to TRUE, the column names of each run will correspond to the run name assigned in the first column of the .CSV batch input file.  In this example, all four .CEL files included with the package are run in pairwise fashion.

<<results=hide>>=
# run GCSscore using using all info from the batch file:
GCSs.batch <- GCSscore(celTable = celtab, celTab.names = TRUE)
@

The \Robject{ExpressionSet} returned from the \Rfunction{GCSscore} package can easily be converted back to a \Robject{data.table} structure.  This matches the structure of the .CSV file that is created if the fileout option is set to TRUE.  The conversion of the \Robject{ExpressionSet} object to \Robject{data.table} is as follows:

<<>>=
# view class of output:
class(GCSs.batch)[1]

# converting GCS-score output from'ExpressionSet' to 'data.table':
GCSs.batch.dt <-
  data.table::as.data.table(cbind(GCSs.batch@featureData@data,
                                  GCSs.batch@assayData[["exprs"]]))

# show all column names included in the output:
colnames(GCSs.batch.dt)

# show simplified output of select columns and rows:
GCSs.batch.dt[10000:10005,
              c("transcriptclusterid","symbol",
                "example01","example02","example03")]

@

\section{Natural Statistics of GCS-scores for Differential Gene Expression Analysis}
Under conditions of no differential expression, the GCS-Score output follows a standard normal (Gaussian) distribution with a mean of 0 and standard deviation of 1. This makes it straightforward to calculate p-values corresponding to rejection of the null hypothesis and acceptance of the alternative hypothesis of differential gene expression. Cutoff values for the GCS-scores can be set to achieve the desired level of significance. As an example, an absolute GCS-score value of 3 (signifying 3 standard deviations from the mean, a typical cutoff value) would correspond to a p-value of 0.003. While the GCS-score algorithm does account for the correlations among probes within a two-chip comparison, it does not adjust for multiple comparisons when comparing more than one pair of chips.  The additional steps for producing multiple test corrected lists of differentially expressed genes (DEGs) for a given study are covered in the next section.

\section{Using GCS-scores to produce DEGs in experimental datasets}

For a more biologically relevant situation, the next example will cover how to use the GCSscore package to produce DEGs with multiple test corrections, using the Significance Analysis of Microrrarys (SAM) method.  The data used in the next example is easily obtained from the GEO database.  Here, we will investigate a dataset of ClariomS mouse arrays taken from GSE103380.  Further information regarding the experimental design can be found at: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE103880.

This section of the vignette requires the following additional packages from the 'suggests' section of the package DESCRIPTION file: \Rpackage{siggenes}, \Rpackage{GEOquery}, and \Rpackage{R.utils}.

Begin by creating a temporary directory to store the downloaded files.  For clarity, the directory will have the name of the GEO identifer. 

<<results=hide, quiet=TRUE, message=FALSE>>=
GEO <- "GSE103380"
dir.geo <- paste(tempdir(),GEO,sep="/")
dir.create(dir.geo, showWarnings = FALSE)
@ 

For this example, select files will be downloaded into the temporary directory.  Here, these files are listed by the GSM ids for individual .CEL files:

\begin{itemize}
\item GSM2769665	(Naïve Microglia bio replicate 1)
\item GSM2769666	(Naïve Microglia bio replicate 2)
\item GSM2769667	(Naïve Microglia bio replicate 3)
\item GSM2769668	(Naïve Microglia bio replicate 4)
\item GSM2769669	(Day4 microglia bio replicate 1)
\item GSM2769670	(Day4 microglia bio replicate 6)
\item GSM2769671	(Day4 microglia bio replicate 7)
\item GSM2769672	(Day4 microglia bio replicate 8)
\end{itemize}

The compressed CEL files are downloaded directly from GEO using the \Rpackage{GEOquery} package.
<<results=hide, quiet=TRUE, message=FALSE>>=
list.cels <- c("GSM2769665","GSM2769666","GSM2769667","GSM2769668",
               "GSM2769669","GSM2769670","GSM2769671","GSM2769672")

# create function for pulling down the compressed .CEL files:
cels.get <- function(x)  
  GEOquery::getGEOSuppFiles(GEO = x,
                            makeDirectory = FALSE,
                            baseDir = dir.geo,
                            filter_regex = "*.CEL.gz")
@
Download the data files from GEO, into the dir.geo temp directory and unzip all of the .CEL files.
<<results=hide, quiet=TRUE, message=FALSE>>=
lapply(list.cels,cels.get)
files.geo <- paste(dir.geo,list.files(path=dir.geo, 
                                      pattern = ".gz"),sep="/")

# create function to gunzip the compressed data files:
fun.gunzip <- function(x) 
  R.utils::gunzip(filename = x,
                  overwrite=TRUE,
                  remove=FALSE)

# apply the gunzip function across the vector of compressed CEL files:
lapply(files.geo,fun.gunzip)
@

Get the path to example CSV file in the package directory.

<<results=hide, quiet=TRUE, message=FALSE>>=
celtab_path <- system.file("extdata",
                           "GSE103380_batch.csv", 
                           package = "GCSscore")

# read in the .CSV file with fread():
celtab <- data.table::fread(celtab_path)

# adds path to celFile names in batch input:
# NOTE: this is not necessary if the .CEL files 
#       are in the working directory:
celtab$CelFile1 <- celtab[,paste(dir.geo,CelFile1,sep="/")]
celtab$CelFile2 <- celtab[,paste(dir.geo,CelFile2,sep="/")]

# run GCSscore using using all info from the batch file:
GCSs.GSE103380 <- GCSscore(celTable = celtab, celTab.names = TRUE)

# convert GCS-score output from'ExpressionSet' to 'data.table':
GCSs.GSE103380.dt <- 
data.table::as.data.table(cbind(GCSs.GSE103380@featureData@data,
                                GCSs.GSE103380@assayData[["exprs"]]))
@
Create 4 averages of each experimental sample with each control.  The SAM analysis will be performed using these 4 averages.
<<results=hide, quiet=TRUE, message=FALSE>>=
GCSs.GSE103380.dt[, day4_1_vs_naive := 
                    rowMeans(GCSs.GSE103380.dt[
                      ,day4_1_vs_naive_1:day4_1_vs_naive_4])]
GCSs.GSE103380.dt[, day4_2_vs_naive := 
                    rowMeans(GCSs.GSE103380.dt[
                      ,day4_2_vs_naive_1:day4_1_vs_naive_4])]
GCSs.GSE103380.dt[, day4_3_vs_naive := 
                    rowMeans(GCSs.GSE103380.dt[
                      ,day4_3_vs_naive_1:day4_1_vs_naive_4])]
GCSs.GSE103380.dt[, day4_4_vs_naive := 
                    rowMeans(GCSs.GSE103380.dt[
                      ,day4_4_vs_naive_1:day4_1_vs_naive_4])]

@

Remove TCids without a clear symbol/name before running SAM.

<<>>=
GCSs.GSE103380.dt <- GCSs.GSE103380.dt[!is.na(symbol)]
# Set the SAM 'gene.names' to be either ('TCid' or 'symbol'):
GCSs.GSE103380.dt.SAM <- 
  siggenes::sam(GCSs.GSE103380.dt[
    ,day4_1_vs_naive:day4_4_vs_naive],
    cl = rep(1,4),rand=123,
    gene.names = GCSs.GSE103380.dt$symbol)
@
View the details of SAM analysis and write it to file.
<<>>=
GCSs.GSE103380.dt.SAM

# create name and path of SAM results:
sam.path <- paste(dir.geo,
                  "GSE103380_ex_SAM_16_6.csv",
                  sep = "/")
# Save TCids with delta >= 16.6:
siggenes::sam2excel(GCSs.GSE103380.dt.SAM,
                    delta = 16.6, 
                    file = sam.path)
@
Read the SAM output back into R and view the top 10 DEGs from the experiment.
<<>>=
sam.results <- data.table::fread(sam.path)

# View the top 10 DEGs output by SAM:
head(sam.results,10)
@

The list of DEG gene symbols produced in the SAM analysis can be used for functional enrichment, using tools such as ToppFun.  Likewise, the treatment responsive transcriptonclusterids can be input directly into software, such as Ingenity Pathway Analysis, to get information on pathway enrichment.  Further exploration of these results will be investigated in an accompanying publication.

\section{Version history}

\begin{description}
\item[1.2.0] second public release (BioC 3.11)
\item[1.0.0] first public release (BioC 3.10)
\item[0.99.1] initial development version
\end{description}

\section{Acknowledgements}
The development of the original S-Score algorithm and its original implementation in C++ is the work of Dr. Li Zhang. The Delphi implementation of the S-Score algorithm is the work of Dr. Robnet Kerns.  The original S-score R package was work of Dr. Robert Kennedy.  The calculations for the SF and SDT are performed as originally described in the Affymetrix Statistical Algorithms Description Document \citep{affy:tech:2002} and implemented in Affymetrix software (using SDT = 4 * RawQ * SF).  This work was partly supported by F30 training grant (F30AA025535) to Guy M. Harris and NIAAA research grant AA13678 to Michael F. Miles.

\bibliographystyle{plainnat}
\bibliography{sscore}

\end{document}