---
title: Vignette of the pengls package
author: Stijn Hawinkel
output: 
  rmarkdown::html_vignette:
    toc: true
    number_sections: true
    keep_md: true
vignette: >
  %\VignetteIndexEntry{Vignette of the pengls package}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Introduction

This vignette demonstrates the use of the _pengls_ package for high-dimensional data with spatial or temporal autocorrelation. It consists of an iterative loop around the _nlme_ \parencite{Pinheiro2021} and _glmnet_ \parencite{Friedman2010} packages. Currently, only continuous outcomes and $R^2$ as a performance measure are implemented.

\setcounter{tocdepth}{5}
\tableofcontents

# Installation instructions

The _pengls_ package is available from Bioconductor, and can be installed as follows:

```{r install, eval = FALSE}
library(BiocManager)
install("pengls")
```

Once installed, it can be loaded and version info printed.

```{r loadRCMpackage}
suppressPackageStartupMessages(library(pengls))
cat("pengls package version", as.character(packageVersion("pengls")), "\n")
```

# Illustration

## Spatial autocorrelation

We first create a toy dataset with spatial coordinates.

```{r spatialToy}
library(nlme)
n <- 25 #Sample size
p <- 50 #Number of features
g <- 15 #Size of the grid
#Generate grid
Grid <- expand.grid("x" = seq_len(g), "y" = seq_len(g))
# Sample points from grid without replacement
GridSample <- Grid[sample(nrow(Grid), n, replace = FALSE),]
#Generate outcome and regressors
b <- matrix(rnorm(p*n), n, p)
a <- rnorm(n, mean = b %*% rbinom(p, size = 1, p = 0.25), sd = 0.1) #25% signal
#Compile into a data frame
df <- data.frame("a" = a, "b" = b, GridSample)
```

The _pengls_ method requires prespecification of a functional form for the autocorrelation. This is done through the _corStruct_ objects defined by the _nlme_ package. We specify a correlation decaying as a Gaussian curve with distance, and with a nugget parameter. The nugget parameter is a proportion that indicates how much of the correlation structure is explained by independent errors; the rest is attributed to spatial autocorrelation. The starting values are chosen as reasonable guesses; they will be overwritten in the fitting process.

```{r spatialCorrelation}
# Define the correlation structure (see ?nlme::gls), with initial nugget 0.5 and range 5
corStruct <- corGaus(form = ~ x + y, nugget = TRUE,
                     value = c("range" = 5, "nugget" = 0.5))
```

Finally, the model is fitted with a single outcome variable and a large number of regressors, with the chosen correlation structure and a prespecified penalty parameter $\lambda=0.2$.

```{r spatialFit}
#Fit the pengls model, for simplicity with a single, fixed lambda
penglsFit <- pengls(data = df, outVar = "a",
                    xNames = grep(names(df), pattern = "b", value = TRUE),
                    glsSt = corStruct, lambda = 0.2, verbose = TRUE)
```

Standard extraction functions like print(), coef() and predict() are defined for the new "pengls" object.

```{r standardExtract}
penglsFit
penglsCoef <- coef(penglsFit)
penglsPred <- predict(penglsFit)
```
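
As a quick sanity check of the fit, one can inspect how sparse the estimated coefficient vector is and how well the fitted values track the observed outcome. The minimal sketch below assumes that coef() returns the full coefficient vector (intercept included) and that predict() returns fitted values for the training data, as extracted above; the in-sample $R^2$ it reports is optimistic and is no substitute for the cross-validation described below.

```{r inspectFit}
# Number of coefficients shrunk exactly to zero by the penalty
# (assumes penglsCoef holds the full coefficient vector, intercept included)
sum(as.vector(penglsCoef) == 0)
# Naive in-sample R2: squared correlation between fitted values and outcome
cor(as.vector(penglsPred), df$a)^2
```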

## Temporal autocorrelation

The method can also account for temporal autocorrelation by defining another correlation structure from the _nlme_ package, e.g. an autoregressive correlation structure of order 1:

```{r timeSetup}
n <- 100 #Sample size
p <- 10 #Number of features
#Generate outcome and regressors
b <- matrix(rnorm(p*n), n, p)
a <- rnorm(n, mean = b %*% rbinom(p, size = 1, p = 0.25), sd = 0.1) #25% signal
#Compile into a data frame
dfTime <- data.frame("a" = a, "b" = b, "t" = seq_len(n))
corStructTime <- corAR1(form = ~ t, value = 0.5)
```

The fitting command is similar; this time the $\lambda$ parameter is found through cross-validation of the naive _glmnet_ fit (for full cross-validation, see below). We choose $\alpha=0.5$ this time, fitting an elastic net model.

```{r timeFit}
penglsFitTime <- pengls(data = dfTime, outVar = "a", verbose = TRUE,
                        xNames = grep(names(dfTime), pattern = "b", value = TRUE),
                        glsSt = corStructTime, nfolds = 5, alpha = 0.5)
```

Show the output:

```{r}
penglsFitTime
```

## Penalty parameter and cross-validation

The _pengls_ package also provides cross-validation for finding the optimal $\lambda$ value. If the tuning parameter $\lambda$ is not supplied, the optimal $\lambda$ according to cross-validation with the naive _glmnet_ function (the one that ignores dependence) is used. Hence we recommend using the following cross-validation function. Multithreading is supported through the _BiocParallel_ package \parencite{Morgan2020}:

```{r registerMulticores}
library(BiocParallel)
register(MulticoreParam(3)) #Prepare multithreading with 3 workers
```

```{r nfolds}
nfolds <- 3 #Number of cross-validation folds
```

The function is called similarly to _cv.glmnet_:

```{r cvpengls}
penglsFitCV <- cv.pengls(data = df, outVar = "a",
                         xNames = grep(names(df), pattern = "b", value = TRUE),
                         glsSt = corStruct, nfolds = nfolds)
```

Check the result:

```{r printCV}
penglsFitCV
```

By default, the one-standard-error rule is used to determine the optimal value of $\lambda$ \parencite{Friedman2010}:

```{r 1se}
penglsFitCV$lambda.1se #Lambda according to the one-standard-error rule
penglsFitCV$cvOpt #Corresponding R2
```

Extract coefficients and fold IDs:

```{r extractCv}
head(coef(penglsFitCV))
penglsFitCV$foldid #The folds used
```

By default, blocked cross-validation is used; random cross-validation is also available, but it is not recommended for timecourse or spatial data. First we illustrate the two fold types graphically, again using the timecourse example:

```{r illustrFolds, fig.width = 8, fig.height = 7}
set.seed(5657)
randomFolds <- makeFolds(nfolds = nfolds, dfTime, "random", "t")
blockedFolds <- makeFolds(nfolds = nfolds, dfTime, "blocked", "t")
plot(dfTime$t, randomFolds, xlab = "Time", ylab = "Fold")
points(dfTime$t, blockedFolds, col = "red")
legend("topleft", legend = c("random", "blocked"), pch = 1, col = c("black", "red"))
```

To perform random cross-validation:

```{r cvpenglsTimeCourse}
penglsFitCVtime <- cv.pengls(data = dfTime, outVar = "a",
                             xNames = grep(names(dfTime), pattern = "b", value = TRUE),
                             glsSt = corStructTime, nfolds = nfolds, cvType = "random")
```

To remove baseline differences at different timepoints, it may be useful to center or scale the outcome in the cross-validation. For instance, for centering only:

```{r timeCourseScale}
penglsFitCVtimeCenter <- cv.pengls(data = dfTime, outVar = "a",
                                   xNames = grep(names(dfTime), pattern = "b", value = TRUE),
                                   glsSt = corStructTime, nfolds = nfolds, cvType = "blocked",
                                   transFun = function(x) x - mean(x))
penglsFitCVtimeCenter$cvOpt #Better performance
```

# Session info

```{r sessionInfo}
sessionInfo()
```

\clearpage

\printbibliography