Levi Waldron, CUNY School of Public Health
June 14, 2017
Based on:
table(spider$leg,spider$type)##     
##      pull push
##   L1   34   34
##   L2   15   15
##   L3   52   52
##   L4   40   40summary(spider)##  leg        type        friction     
##  L1: 68   pull:141   Min.   :0.1700  
##  L2: 30   push:141   1st Qu.:0.3900  
##  L3:104              Median :0.7600  
##  L4: 80              Mean   :0.8217  
##                      3rd Qu.:1.2400  
##                      Max.   :1.8400\[ E[y|x] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_p x_p \]
Systematic plus random components of model:
\(y_i = E[y_i|x_i] + \epsilon_i\)
Assumptions of linear models: \(\epsilon_i \stackrel{iid}{\sim} N(0, \sigma_\epsilon^2)\)
L1:
L2={1 if \(2^{nd}\) leg pair, 0 otherwise},L3={1 if \(3^{nd}\) leg pair, 0 otherwise},L4={1 if \(4^{th}\) leg pair, 0 otherwise}.aov(), lm(), glm(), and coxph() use a “model formula” interface.> response variable ~ explanatory variables
Model formula for simple linear regression:
> y ~ x
Friction coefficient for leg type of first leg pair:
spider.sub <- spider[spider$leg=="L1", ]
fit <- lm(friction ~ type, data=spider.sub)
summary(fit)## 
## Call:
## lm(formula = friction ~ type, data = spider.sub)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.33147 -0.10735 -0.04941 -0.00147  0.76853 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.92147    0.03827  24.078  < 2e-16 ***
## typepush    -0.51412    0.05412  -9.499  5.7e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2232 on 66 degrees of freedom
## Multiple R-squared:  0.5776, Adjusted R-squared:  0.5711 
## F-statistic: 90.23 on 1 and 66 DF,  p-value: 5.698e-14Regression coefficients for friction ~ type for first set of spider legs:
fit.table <- xtable::xtable(fit, label=NULL)
print(fit.table, type="html")| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | 0.9215 | 0.0383 | 24.08 | 0.0000 | 
| typepush | -0.5141 | 0.0541 | -9.50 | 0.0000 | 
Diagram of the estimated coefficients in the linear model. The green arrow indicates the Intercept term, which goes from zero to the mean of the reference group (here the ‘pull’ samples). The orange arrow indicates the difference between the push group and the pull group, which is negative in this example. The circles show the individual samples, jittered horizontally to avoid overplotting.
Remember there are positions 1-4
fit <- lm(friction ~ leg, data=spider)fit.table <- xtable::xtable(fit, label=NULL)
print(fit.table, type="html")| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | 0.6644 | 0.0538 | 12.34 | 0.0000 | 
| legL2 | 0.1719 | 0.0973 | 1.77 | 0.0784 | 
| legL3 | 0.1605 | 0.0693 | 2.32 | 0.0212 | 
| legL4 | 0.2813 | 0.0732 | 3.84 | 0.0002 | 
Additional explanatory variables can be added as follows:
> y ~ x + z
Note that “+” does not have its usual meaning, which would be achieved by:
> y ~ I(x + z)
Remember there are positions 1-4
fit <- lm(friction ~ type + leg, data=spider)fit.table <- xtable::xtable(fit, label=NULL)
print(fit.table, type="html")| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | 1.0539 | 0.0282 | 37.43 | 0.0000 | 
| typepush | -0.7790 | 0.0248 | -31.38 | 0.0000 | 
| legL2 | 0.1719 | 0.0457 | 3.76 | 0.0002 | 
| legL3 | 0.1605 | 0.0325 | 4.94 | 0.0000 | 
| legL4 | 0.2813 | 0.0344 | 8.18 | 0.0000 | 
Interaction between coffee and time of day on performance
Image credit: http://personal.stevens.edu/~ysakamot/
Interaction is modeled as the product of two covariates: \[ E[y|x] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1*x_2 \]
| symbol | example | meaning | 
|---|---|---|
| + | + x | include this variable | 
| - | - x | delete this variable | 
| : | x : z | include the interaction | 
| * | x * z | include these variables and their interactions | 
| ^ | (u + v + w)^3 | include these variables and all interactions up to three way | 
| 1 | -1 | intercept: delete the intercept | 
lm( y ~ u + v)u and v factors: ANOVAu and v numeric: multiple regression
one factor, one numeric: ANCOVA
Recall the multiple linear regression model:
\(y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} + \epsilon_i\)
Matrix notation for the multiple linear regression model:
\[ \, \begin{pmatrix} y_1\\ y_2\\ \vdots\\ y_N \end{pmatrix} = \begin{pmatrix} 1&x_1\\ 1&x_2\\ \vdots\\ 1&x_N \end{pmatrix} \begin{pmatrix} \beta_0\\ \beta_1 \end{pmatrix} + \begin{pmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_N \end{pmatrix} \]
or simply:
\[ \mathbf{Y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon} \]
group <- factor( c(1, 1, 2, 2) )
model.matrix(~ group)##   (Intercept) group2
## 1           1      0
## 2           1      0
## 3           1      1
## 4           1      1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"What if we forgot to code group as a factor?
group <- c(1, 1, 2, 2)
model.matrix(~ group)##   (Intercept) group
## 1           1     1
## 2           1     1
## 3           1     2
## 4           1     2
## attr(,"assign")
## [1] 0 1group <- factor(c(1,1,2,2,3,3))
model.matrix(~ group)##   (Intercept) group2 group3
## 1           1      0      0
## 2           1      0      0
## 3           1      1      0
## 4           1      1      0
## 5           1      0      1
## 6           1      0      1
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"group <- factor(c(1,1,2,2,3,3))
group <- relevel(x=group, ref=3)
model.matrix(~ group)##   (Intercept) group1 group2
## 1           1      1      0
## 2           1      1      0
## 3           1      0      1
## 4           1      0      1
## 5           1      0      0
## 6           1      0      0
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"diet <- factor(c(1,1,1,1,2,2,2,2))
sex <- factor(c("f","f","m","m","f","f","m","m"))
model.matrix(~ diet + sex)##   (Intercept) diet2 sexm
## 1           1     0    0
## 2           1     0    0
## 3           1     0    1
## 4           1     0    1
## 5           1     1    0
## 6           1     1    0
## 7           1     1    1
## 8           1     1    1
## attr(,"assign")
## [1] 0 1 2
## attr(,"contrasts")
## attr(,"contrasts")$diet
## [1] "contr.treatment"
## 
## attr(,"contrasts")$sex
## [1] "contr.treatment"model.matrix(~ diet + sex + diet:sex)##   (Intercept) diet2 sexm diet2:sexm
## 1           1     0    0          0
## 2           1     0    0          0
## 3           1     0    1          0
## 4           1     0    1          0
## 5           1     1    0          0
## 6           1     1    0          0
## 7           1     1    1          1
## 8           1     1    1          1
## attr(,"assign")
## [1] 0 1 2 3
## attr(,"contrasts")
## attr(,"contrasts")$diet
## [1] "contr.treatment"
## 
## attr(,"contrasts")$sex
## [1] "contr.treatment"typepush:legL2 in a model with interaction terms:fitX <- lm(friction ~ type * leg, data=spider)| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | 0.9215 | 0.0327 | 28.21 | 0.0000 | 
| typepush | -0.5141 | 0.0462 | -11.13 | 0.0000 | 
| legL2 | 0.2239 | 0.0590 | 3.79 | 0.0002 | 
| legL3 | 0.3524 | 0.0420 | 8.39 | 0.0000 | 
| legL4 | 0.4793 | 0.0444 | 10.79 | 0.0000 | 
| typepush:legL2 | -0.1039 | 0.0835 | -1.24 | 0.2144 | 
| typepush:legL3 | -0.3838 | 0.0594 | -6.46 | 0.0000 | 
| typepush:legL4 | -0.3959 | 0.0628 | -6.30 | 0.0000 | 
**What if we want to ask this question for L3 vs L2?
What if we want to contrast…
typepush:legL3 - typepush:legL2
There are many ways to construct this design, one is with library(multcomp):
names(coef(fitX))## [1] "(Intercept)"    "typepush"       "legL2"          "legL3"         
## [5] "legL4"          "typepush:legL2" "typepush:legL3" "typepush:legL4"C <- matrix(c(0,0,0,0,0,-1,1,0), 1) 
L3vsL2interaction <- multcomp::glht(fitX, linfct=C) Is there a difference in pushing friction for L3 vs L2?
summary(L3vsL2interaction)## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Fit: lm(formula = friction ~ type * leg, data = spider)
## 
## Linear Hypotheses:
##        Estimate Std. Error t value Pr(>|t|)    
## 1 == 0 -0.27988    0.07893  -3.546  0.00046 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)\[ g\left( E[y|x] \right) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} \]
Useful for log-transformed microarray data
The model: \(y_i = E[y|x] + \epsilon_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} + \epsilon_i\)
Random component of \(y_i\) is normally distributed: \(\epsilon_i \stackrel{iid}{\sim} N(0, \sigma_\epsilon^2)\)
Systematic component (linear predictor): \(\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi}\)
Link function here is the identity link: \(g(E(y | x)) = E(y | x)\). We are modeling the mean directly, no transformation.
Useful for binary outcomes, e.g. Single Nucleotide Polymorphisms or somatic variants
The model: \[ Logit(P(x)) = log \left( \frac{P(x)}{1-P(x)} \right) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} \]
Random component: \(y_i\) follows a Binomial distribution (outcome is a binary variable)
Systematic component: linear predictor \[ \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} \]
Link function: logit (log of the odds that the event occurs)
\[ g(P(x)) = logit(P(x)) = log\left( \frac{P(x)}{1-P(x)} \right) \]
\[ P(x) = g^{-1}\left( \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} \right) \]
The systematic part of the GLM is:
\[ log\left( E[y|x] \right) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} + log(t_i) \]
\[ f(k, \lambda) = e^{-\lambda} \frac{\lambda^k}{k!} \]
\[
f(k, \lambda, \theta) = \frac{\Gamma(\frac{1 + \theta k}{\theta})}{k! \, \Gamma(\frac{1}{\theta})} 
    \left(\frac{\theta m}{1+\theta m}\right)^k 
    \left(1+\theta m\right)^\theta
    \quad\text{for }k = 0, 1, 2, \dotsc
\] * where \(f\) is still the probability of \(k\) events (e.g. # of reads counted), * \(\lambda\) is still the mean number of events, so \(E[y|x]\) * An additional dispersion parameter \(\theta\) is estimated: + \(\theta \rightarrow 0\): Poisson distribution + \(\theta \rightarrow \infty\): Gamma distribution
* The Poisson model can be considered as nested within the Negative Binomial model + A likelihood ratio test comparing the two models is possible