# Using Mixed-Effects Models for Confirmatory Hypothesis Testing (FAQ)

This FAQ is intended for people using linear mixed effects models (LMEMs) as a replacement for the statistical techniques that are more traditionally used for confirmatory hypothesis testing, such as ANOVA or t-tests. Some of the points we make here may not apply to other uses of LMEMs, such as for deriving the 'most parsimonious' model for one's data.

## Table of Contents

## Misconception 1: Mixed-effects models have dependency assumptions not shared by other kinds of analysis

LMEMs share the assumption of independent observations (technically, conditional independence, i.e., independence of the residuals) with other parametric approaches. LMEMs do allow for greater flexibility in accounting for sources of random variation than more traditional approaches, and it is this additional flexibility that makes it possible to simultaneously generalize over multiple populations (e.g., subjects and items). But this greater flexibility requires us to think more explicitly about random effects.

It is easy to miss the point that assumptions about independence have been there all along, because many people follow conventional "recipes" when they analyze their data that allow them to effectively ignore choices about random effects structure they would otherwise have to confront. For most researchers, the only moment in an analysis in which they think about anything like a "random effect" is when they consider whether to perform an independent or correlated-samples t-test, or decide among the recipes for "between-subjects ANOVA", "repeated-measures ANOVA", or "mixed design ANOVA" (not to be confused with mixed model ANOVA explained below). People mostly avoid error in these cases because they know how to follow the recipes.

As these recipes are currently practiced, the first step usually
involves aggregating the data so that the test can be performed on the
subject-by-cell means (or item-by-cell means) rather than on the raw,
unaggregated data. This "aggregate-before-analysis" step creates a
dataset that meets the requirement of conditional independence for the
slope (condition effect), by having only one data point per subject
per cell. This practice of aggregating first is so ingrained that may
researchers believe that ANOVA only "allows" one data point
per-subject-per-cell, which is not correct (see below). To meet the
assumption of conditional independence for the *intercept*, it is simply
necessary to choose the right recipe, i.e., to make sure that one does
a repeated-measures ANOVA or paired-sample t-test instead of a
between-subject ANOVA or independent samples t-test.

## Misconception 2: Random intercepts and random slopes have no analogue in ANOVA

Most datasets in experimental psychology involve multiple observations on the same sampling units (subjects or stimuli). It is possible to aggregate these observations together and to analyze the means rather than the trial-level data. When one wishes to analyze a dataset containing only a single observation per sampling unit, then between-subject techniques are warranted.

Datasets including multiple observations on the same units are 'within-subject' designs, analyzed using paired t-tests or within-subjects ANOVA. Such analyses control for variation in each subject overall response level, akin to including a random intercept in a LMEM.

The notion of random slopes is functionally equivalent to the notion of a treatment-by-subject interaction in ANOVA. Most people don't know about "treatment-by-subject interactions" because they usually perform analyses on the subject means rather than on the unaggregated data. The aggregation process confounds the treatment-by-subject variance with residual error, which legitimates the use of the standard residual error term in the denominator of the \(F\) ratio.

The tendency to analyze the subject means has led to the widespread misconception that ANOVA only 'allows' one observation per subject per cell. It is, in fact, possible to analyze the unaggregated, trial-level data using ANOVA and specifying the treatment-by-subject interaction as the error term in the denominator of the \(F\) ratio (see below and also here).

Here is how one might analyze subject-by-condition means using ANOVA
in R. We are using the R package `simgen`

to generate a sample dataset
with 24 subjects and 24 items, and with a single, two-level factor
manipulated within subjects and within items.

library(simgen, quietly=TRUE) x <- mkDf(mcr.params=createParamMx(1)[1,]) x$Cond <- factor(x$Cond) x.agg <- with(x, aggregate(list(Resp=Resp), list(Cond=Cond, SubjID=SubjID), mean, na.rm=TRUE)) x.agg.aov <- aov(Resp ~ Cond + Error(SubjID), x.agg) summary(x.agg.aov)

Error: SubjID Df Sum Sq Mean Sq F value Pr(>F) Residuals 23 26.24 1.141 Error: Within Df Sum Sq Mean Sq F value Pr(>F) Cond 1 0.003 0.0029 0.003 0.956 Residuals 23 21.468 0.9334

We specified the error term using `Error(SubjID)`

, which makes this a
within-subject design. Note that we have 23 residual degrees of
freedom, one minus the number of subjects. Equivalently, we could
have run a correlated-samples t-test on the aggregated data. The
analysis we performed here is actually equivalent to a random-intercepts only
LMEM. Note that performing a random-intercepts only LMEM on the
*aggregated* data would be legitimate because the treatment-by-subject
interaction is no longer a source of dependency, since we only have
one observation per subject per condition. It is also worth noting
that using an LMEM on aggregated data might be OK for a dataset
sampled from a single population (i.e., subjects) but the aggregation
would prevent us from using LMEMs for a simultaneous by-subject and
by-item analysis, since the item information would be lost in the
aggregation over subjects, and the subject information would be lost
in the aggregation over items.

And now here is a different way to analyze the same data, using the raw, unaggregated trial-level observations rather than the subject means.

```
x.aov <- aov(Resp ~ Cond + Error(SubjID/Cond), x)
summary(x.aov)
```

Error: SubjID Df Sum Sq Mean Sq F value Pr(>F) Cond 1 0.97 0.971 0.07 0.794 Residuals 22 307.07 13.958 Error: SubjID:Cond Df Sum Sq Mean Sq F value Pr(>F) Cond 1 0.09 0.09 0.008 0.927 Residuals 23 245.01 10.65 Error: Within Df Sum Sq Mean Sq F value Pr(>F) Residuals 500 1980 3.959

Note that we have specified the treatment-by-subject error term using
the syntax `Error(SubjID/Cond)`

. We should use the error term from
this "stratum" (`SubjID:Cond`

) when assessing the reliability of the
effect. Note also that we have an \(F\) statistic 23 denominator
degrees of freedom. This is functionally equivalent to a
"random-intercept-and-random-slope" (i.e., a maximal) LMEM.

In sum, both of the above analyses control for variance in overall
performance (random intercept variance) by requesting the total error
to be partitioned by subject. Both analyses control for variance in
susceptibility to the experimental treatment, but in different ways:
the first by confounding it with residual error, and the second by
assessing the treatment effect against the treatment-by-subject
interaction. The second version is analogous to a maximal LMEM, and
is also implemented as a default in *mixed-model ANOVA*
(see here for mixed-model ANOVA in SPSS).

What would happen if we used the first method, but performed the analysis on the unaggregated, trial-level data rather than the subject means? Most people would consider such an analysis to be "wrong"—and rightly so. However, this is effectively what one does when fitting a random-intercepts-only LMEM to trial-level data. Let's see what happens in the ANOVA case.

```
x.aov.bad <- aov(Resp ~ Cond + Error(SubjID), x)
summary(x.aov.bad)
```

Error: SubjID Df Sum Sq Mean Sq F value Pr(>F) Cond 1 0.97 0.971 0.07 0.794 Residuals 22 307.07 13.958 Error: Within Df Sum Sq Mean Sq F value Pr(>F) Cond 1 0.1 0.090 0.021 0.884 Residuals 523 2224.8 4.254

Note that we have two error terms against which we can test the effect of condition: (1) the subject error term, which doesn't seem right because it corresponds to variation due to differences is overall performance; and (2) the residual error term. If we used the latter, then note that we would have 523 denominator degrees of freedom, even though we only have 24 subjects in the experiment.

In case one needs any convincing of the analogy between a random-intercepts-only LMEM on unaggregated data and this obviously wrong ANOVA, the following simulation should remove all doubt. In the simulation, we generate 1000 datasets, each containing a single two-level treatment variable administered within subjects. There are 24 subjects and 12 replications per cell per subject. For all datasets, we assume there was no effect of the experimental manipulation (i.e., \(H_0\) is true). Five analyses were performed on each data set:

`lmer.RI`

: random-intercepts-only LMEM on the unaggregated data;`aov.RI`

: "bad" ANOVA on the unaggregated data (controlling for subject variance but not treatment-by-subject variance);`aov.agg`

: within-subjects ANOVA on the subject cell means;`aov.RS`

: within-subjects ANOVA on the unaggregated data, using the treatment-by-subject interaction as the error term (denominator) for the \(F\);`lmer.RS`

: a maximal LMEM (by-subject random intercepts and slopes)

library(lme4) nsubj <- 24 # number of subjects nrep <- 12 # number of replications per cell per subj numberofruns <- 1000 # number of Monte Carlo runs p.values <- matrix(nrow=numberofruns, ncol=5, dimnames=list(1:numberofruns, c("lmer.RI", "aov.RI", "lmer.RS", "aov.RS", "aov.agg")) ) for (i in 1:numberofruns) { # generate data frames for two condition experiment # 24 subjects, 12 replications per cell per subject x <- data.frame(SubjID=factor(rep(1:nsubj, each=nrep*2)), A=factor(rep(c("A1","A2"),each=nrep))) # response contains ONLY NOISE x$Resp <- rep(rnorm(nsubj,sd=3),each=nrep*2) + # random intercept rep(rep(rnorm(nsubj,sd=3),each=2)*c(-.5,.5), each=nrep) + # random slope rnorm(nsubj*nrep*2,sd=6) # residual error # analysis on unaggregated data x.aov.RI <- summary(aov(Resp ~ A + Error(SubjID), data=x)) # WRONG: violation of independence x.aov.RS <- summary(aov(Resp ~ A + Error(SubjID/A), data=x)) # RIGHT: correct error term (A:Subject) x.lmer.RI <- lmer(Resp ~ A + (1 | SubjID), data=x) # random intercept model x.lmer.RS <- lmer(Resp ~ A + (1 + A | SubjID), data=x) # random slope model # aggregate the data x.agg <- with(x, aggregate(list(Resp=Resp), list(SubjID=SubjID,A=A), mean)) # ANOVA on aggregated data x.agg.aov <- summary(aov(Resp ~ A + Error(SubjID), data=x.agg)) # store the data in the matrix p.values[i,] <- c(2*(1-pnorm(abs((fixef(x.lmer.RI)/sqrt(diag(vcov(x.lmer.RI))))[2]))), x.aov.RI[[2]][[1]]$`Pr(>F)`[1], 2*(1-pnorm(abs((fixef(x.lmer.RS)/sqrt(diag(vcov(x.lmer.RS))))[2]))), x.aov.RS[[2]][[1]]$`Pr(>F)`[1], x.agg.aov[[2]][[1]]$`Pr(>F)`[1]) }

Here are the resulting Type I error rates:

lmer.RI | 0.193 |

aov.RI | 0.192 |

lmer.RS | 0.068 |

aov.RS | 0.06 |

aov.agg | 0.06 |

The figure below displays the correlations among \(p\) values as
further evidence for the analogy between `lmer.RI`

and `aov.RI`

(left
panel), as well as among `lmer.RS`

, `aov.agg`

, and `aov.RS`

(right
panel).

## Misconception 3: Conventional ANOVA, unlike LMEMs, doesn't allow for testing random slopes

Though rarely used in practice, model selection procedures have long
been available for mixed-model ANOVA, just as for LMEMs. It is also
possible to test random slopes. In this section, we will explain how
to analyse repeated-measures data using the mixed-model ANOVA approach
in SPSS (Version 19) *General Linear Models*. Although most
psycholinguists are well acquainted with SPSS, many may not be
familiar with the mixed-model ANOVA option because they usually follow
the “standard recipe” of performing repeated-measures ANOVAs on
aggregated data (see earlier sections). Here we will show that it
possible to perform ANOVAs on non-aggregated data as well, i.e. with
more than one observation per subject(or item)-by-condition cell. We
will further demonstrate that the mixed-model ANOVA approach provides
significance tests not only for fixed effects, but also for random
effects—with the latter corresponding to random intercepts and
slopes in LMEMs. Finally, we will show that it is possible to
*simplify* the effects structure of a mixed-model ANOVA, thereby
providing all the ingredients necessary for *model selection*. However, note that—in stark contrast to what seems to be
“accepted practise” in LMEMs—model selection has never been adopted
as a standard for confirmatory hypothesis testing in ANOVA (and
rightly so, as the analyses in our paper suggest). We hope this
tutorial will help to illustrate the strong conceptual parallels
between mixed-model ANOVAs and LMEMs—apart from the most obvious
difference that LMEMs allow for *simultaneous* generalization of
effects over subjects and items, whereas mixed-model ANOVA requires
separate subject and item analyses due to its inability to handle
“crossed” random factors.

We start with a 24-subjects/24-items (within/within) data set taken
from our large pool of simulated two-condition experiments. For
convenience, we assume a balanced data set with no missing values
(note, however, that mixed-model ANOVA can handle unbalanced data sets
as well). The data are in “long” format such that each row represents
a single trial; the variables \(SubjID\) (1-24), *ItemID* (1-24),
*Condition* (A, B), and *Response* (continuous DV) are coded in
separate columns:

According to the “standard recipe” of performing ANOVA, one would
normally proceed along the following three steps: (1) condense the
data into subject-by-condition means (for F1) respectively
item-by-condition means (for F2) using the SPSS procedure AGGREGATE;
(2) re-arrange the aggregated data into “wide” format such that each
condition is in a separate column; and finally (3) perform a
repeated-measures ANOVA using *Analyze*: *General Linear Model*:
*Repeated Measures…*.

Here, we will not follow this recipe and perform a mixed-model ANOVA
instead. To do this, we leave the data in their original “long” format
(no aggregation!) and use the procedure *Analyze*: *General Linear Model*: *Univariate…*, which opens up a command window comprising a
complete list of variables on the left and five empty fields
(“Dependent Variable”, “Fixed Factor(s)”, etc.) on the right.

In this brief demonstration, we will perform a by-subjects (F1)
analysis—a by-items (F2) analysis would essentially work the same
way. First, we need to specify the dependent variable by highlighting
*Resp* in the variable list and moving it into the “Dependent
Variable” field (via clicking on the corresponding right-arrow
button). Next, our experimental treatment variable \(Cond\) is moved
into the “Fixed Factor(s)” field (in a factorial fixed-effect design,
this field would contain more than one variable). Finally, since we
want to perform an F1-analysis, our participant-variable \(SubjID\) ends
up in the “Random Factor(s)” field. [As in LMEM, random factors are
treated differently from fixed factors, but note that LMEMs and
mixed-model ANOVAs employ different algorithms for estimating random
effect variability in the population, which we will not discuss here.]
The other two fields (“Covariate(s)” and “WLS Weight”) remain empty
because there are no continuous predictors or weight variables to
consider in our analysis:

To run the analysis, simply click “OK” and examine the results in the Output Navigator (or Statistics Viewer or whatever it’s called these days…):

SPSS tries to confuse us by labelling the critical results table
“Tests of Between-Subjects Effects”—but we won’t be fooled by that,
knowing that the table actually contains the appropriate inferential
results for the *within-subjects* treatment effect of \(Cond\).

The first important point to note is that the table provides
statistical tests not only for the fixed effect of interest (here, the
main effect of \(Cond\)), but also for (i) the random main effect of
\(SubjID\) and (ii) the random \(Cond{\times}SubjID\) interaction. As
repeatedly stated in previous sections (as well as in the main paper
itself), (i) is conceptually equivalent to the by-subject random
*intercept* and (ii) is equivalent to the by-subject random *slope* in
LMEM. The fact that all these effects appear in the SPSS output-table
already highlights an important default assumption in mixed-model
ANOVA: the procedure automatically assumes a *full-factorial* design,
which implies a *maximal* (by-subjects) random effects structure
against which the fixed effect of interest is tested! And although
this default can be overridden quite easily (see further below), we
are not aware of any published work that has deviated from the default
when testing hypotheses in mixed-model ANOVA.

Another important point to note—also observe the footnotes in the
SPSS output table—is that the fixed effect of interest is
appropriately tested against an error term that corresponds to the
random \(Cond{\times}SubjID\) interaction (equivalent to the
by-subject random *slope* in LMEM). In other words, the procedure
tests **by default** whether the experimental treatment effect
generalizes *above and beyond* subject-specific random variation in
displaying that effect. Correspondingly, the “error” degrees of
freedom for the \(Cond\) fixed effect are identical to the “effect” (or
“hypothesis”, as SPSS calls it) degrees of freedom for the random
\(Cond{\times}SubjID\) interaction, namely 23 (number of subjects
minus 1). [*NB* With *unbalanced* numbers of observations per
condition, SPSS would perform a numerical correction on these \(df\)s,
giving different (and ultimately more conservative) results depending
on the severity of the imbalance.]

For our example data, the default mixed-model ANOVA (assuming a
*maximal* by-subject random effects structure) suggests that the
experimental treatment effect is not reliably generalizable across
subjects: \(F(1,23)=2.654\); \(p=.117\). [In case one might wonder: a
repeated-measures ANOVA on by-subject aggregated data would yield
\(F(1,23) = 3.946\), \(p=.059\).] Instead of the graphical user
interface, one could also use the following syntax code to perform the
standard full-factorial mixed-model ANOVA in SPSS:

UNIANOVA Resp BY Cond SubjID /RANDOM=SubjID /METHOD=SSTYPE(3) /INTERCEPT=INCLUDE /CRITERIA=ALPHA(0.05) /DESIGN=Cond SubjID Cond*SubjID.

What if we wanted to apply a **simpler** model to our data, say, one
that corresponds to a (by-subject) *intercept-only* model in LMEM?
Although certainly not the norm for hypothesis testing in ANOVA, some
researchers might consider such an intercept-only model if, for
example, a significance test of the \(Cond{\times}SubjID\) interaction
would yield a non-significant result (cf. *model selection*). Others
may just have picked up somewhere that “exclusion of the random
interaction is ok if it’s not of theoretical interest” or that
“intercept-only models are often used in LMEM” etc. (ignoring the
pivotal role that the random interaction/slope actually plays for
capturing repeated-measures dependencies in both ANOVA and LMEM!).

Indeed, mixed-model ANOVA allows us to override the full-factorial
default setting. To do this, open up the mixed-model ANOVA command
window at *Analyze*: *General Linear Model*: *Univariate…* and specify
the DV, the fixed and the random factor(s) as before. Now, note that
in the top-right corner of the *Univariate…* command window, there is
a button labelled “Model”. Clicking on that button opens up a new
window that looks like this:

The toggle switches at the top are still in their *full factorial*
default setting. Just switch over to *custom* and you will see a
“Factors & Covariates” list on the left (containing the fixed factor
\(Cond\) and the random factor \(SubjID\)) and a blank “Model” field on
the right. To specify a (by-subject) random intercept-only effect
structure, we need to implement a model that contains the main effects
of \(Cond\) and \(SubjID\), but no interaction between the two. Simply
highlight the two factors in the “Factors & Covariates” list, then
select *main effects* in the pull-down menu under “Build Term(s)” and
click on the arrow-button to move the terms into the “Model” field.
The command window now looks like this:

After clicking “Continue”, we are back in the *Univariate* main
window, where we click on “OK” to run the analysis. Instead of the
graphical user interface, we could also have used the following syntax
to perform the intercept-only mixed-model ANOVA:

UNIANOVA Resp BY Cond SubjID /RANDOM=SubjID /METHOD=SSTYPE(3) /INTERCEPT=INCLUDE /CRITERIA=ALPHA(0.05) /DESIGN=Cond SubjID.

The results now look like this:

As can be seen, the \(Cond\) fixed effect of interest changed from
formerly “not significant” to “highly significant” in this random
intercept-only analysis. Also, note what happened to the “error” term
in the F-ratio for the fixed effect, whose degrees of freedom changed
from formerly 23 (number of subjects minus 1) to 551 (number of trials
minus: 1 \(df\) for the grand average intercept, 1 \(df\) for the \(Cond\)
fixed effect, and 23 \(df\)s for the \(SubjID\) random effect) in the
intercept-only model. The latter is identical to the residual
*by-trial error* which indicates that, just as in LMEM, the random
intercept-only model essentially treats the by-condition measurements
as subject- *independent* observations.

The "take home message" from this demonstration is that mixed-model
ANOVA and LMEM are actually not all that different. Both distinguish
between random and fixed effects and provide significance tests for
each. Moreover, both allow for flexible adjustments of the model
structure which is useful for *model building* applications (if not
necessarily for *confirmatory hypothesis testing*). The only major
difference (apart from relying on different algorithms for random
effect estimation) is that LMEM, unlike ANOVA, allows for “crossing”
of random effects, i.e. simultaneous consideration of subject- and
item- related random effects.