Using Mixed-Effects Models for Confirmatory Hypothesis Testing (FAQ)

Table of Contents

This FAQ is intended for people using linear mixed effects models (LMEMs) as a replacement for the statistical techniques that are more traditionally used for confirmatory hypothesis testing, such as ANOVA or t-tests. Some of the points we make here may not apply to other uses of LMEMs, such as for deriving the 'most parsimonious' model for one's data.

Misconception 1: Mixed-effects models have dependency assumptions not shared by other kinds of analysis

LMEMs share the assumption of independent observations (technically, conditional independence, i.e., independence of the residuals) with other parametric approaches. LMEMs do allow for greater flexibility in accounting for sources of random variation than more traditional approaches, and it is this additional flexibility that makes it possible to simultaneously generalize over multiple populations (e.g., subjects and items). But this greater flexibility requires us to think more explicitly about random effects.

It is easy to miss the point that assumptions about independence have been there all along, because many people follow conventional "recipes" when they analyze their data that allow them to effectively ignore choices about random effects structure they would otherwise have to confront. For most researchers, the only moment in an analysis in which they think about anything like a "random effect" is when they consider whether to perform an independent or correlated-samples t-test, or decide among the recipes for "between-subjects ANOVA", "repeated-measures ANOVA", or "mixed design ANOVA" (not to be confused with mixed model ANOVA explained below). People mostly avoid error in these cases because they know how to follow the recipes.

As these recipes are currently practiced, the first step usually involves aggregating the data so that the test can be performed on the subject-by-cell means (or item-by-cell means) rather than on the raw, unaggregated data. This "aggregate-before-analysis" step creates a dataset that meets the requirement of conditional independence for the slope (condition effect), by having only one data point per subject per cell. This practice of aggregating first is so ingrained that may researchers believe that ANOVA only "allows" one data point per-subject-per-cell, which is not correct (see below). To meet the assumption of conditional independence for the intercept, it is simply necessary to choose the right recipe, i.e., to make sure that one does a repeated-measures ANOVA or paired-sample t-test instead of a between-subject ANOVA or independent samples t-test.

Misconception 2: Random intercepts and random slopes have no analogue in ANOVA

Most datasets in experimental psychology involve multiple observations on the same sampling units (subjects or stimuli). It is possible to aggregate these observations together and to analyze the means rather than the trial-level data. When one wishes to analyze a dataset containing only a single observation per sampling unit, then between-subject techniques are warranted.

Datasets including multiple observations on the same units are 'within-subject' designs, analyzed using paired t-tests or within-subjects ANOVA. Such analyses control for variation in each subject overall response level, akin to including a random intercept in a LMEM.

The notion of random slopes is functionally equivalent to the notion of a treatment-by-subject interaction in ANOVA. Most people don't know about "treatment-by-subject interactions" because they usually perform analyses on the subject means rather than on the unaggregated data. The aggregation process confounds the treatment-by-subject variance with residual error, which legitimates the use of the standard residual error term in the denominator of the \(F\) ratio.

The tendency to analyze the subject means has led to the widespread misconception that ANOVA only 'allows' one observation per subject per cell. It is, in fact, possible to analyze the unaggregated, trial-level data using ANOVA and specifying the treatment-by-subject interaction as the error term in the denominator of the \(F\) ratio (see below and also here).

Here is how one might analyze subject-by-condition means using ANOVA in R. We are using the R package simgen to generate a sample dataset with 24 subjects and 24 items, and with a single, two-level factor manipulated within subjects and within items.

library(simgen, quietly=TRUE)

x <- mkDf(mcr.params=createParamMx(1)[1,])
x$Cond <- factor(x$Cond)

x.agg <- with(x, aggregate(list(Resp=Resp),
			   list(Cond=Cond, SubjID=SubjID),
			   mean, na.rm=TRUE))

x.agg.aov <- aov(Resp ~ Cond + Error(SubjID), x.agg)
Loading required package: lattice

Attaching package: ‘lme4’

The following object(s) are masked from ‘package:stats’:

    AIC, BIC

Error: SubjID
          Df Sum Sq Mean Sq F value Pr(>F)
Residuals 23  20.31  0.8829               

Error: Within
          Df Sum Sq Mean Sq F value Pr(>F)
Cond       1  0.048  0.0479   0.091  0.766
Residuals 23 12.092  0.5257

We specified the error term using Error(SubjID), which makes this a within-subject design. Note that we have 23 residual degrees of freedom, one minus the number of subjects. Equivalently, we could have run a correlated-samples t-test on the aggregated data. The analysis we performed here is actually equivalent to a random-intercepts only LMEM. Note that performing a random-intercepts only LMEM on the aggregated data would be legitimate because the treatment-by-subject interaction is no longer a source of dependency, since we only have one observation per subject per condition. It is also worth noting that using an LMEM on aggregated data might be OK for a dataset sampled from a single population (i.e., subjects) but the aggregation would prevent us from using LMEMs for a simultaneous by-subject and by-item analysis, since the item information would be lost in the aggregation over subjects, and the subject information would be lost in the aggregation over items.

And now here is a different way to analyze the same data, using the raw, unaggregated trial-level observations rather than the subject means.

x.aov <- aov(Resp ~ Cond + Error(SubjID/Cond), x)  

Error: SubjID
          Df Sum Sq Mean Sq F value Pr(>F)
Cond       1   0.97   0.971    0.07  0.794
Residuals 22 307.07  13.958               

Error: SubjID:Cond
          Df Sum Sq Mean Sq F value Pr(>F)
Cond       1   0.09    0.09   0.008  0.927
Residuals 23 245.01   10.65               

Error: Within
           Df Sum Sq Mean Sq F value Pr(>F)
Residuals 500   1980   3.959

Note that we have specified the treatment-by-subject error term using the syntax Error(SubjID/Cond). We should use the error term from this "stratum" (SubjID:Cond) when assessing the reliability of the effect. Note also that we have an \(F\) statistic 23 denominator degrees of freedom. This is functionally equivalent to a "random-intercept-and-random-slope" (i.e., a maximal) LMEM.

In sum, both of the above analyses control for variance in overall performance (random intercept variance) by requesting the total error to be partitioned by subject. Both analyses control for variance in susceptibility to the experimental treatment, but in different ways: the first by confounding it with residual error, and the second by assessing the treatment effect against the treatment-by-subject interaction. The second version is analogous to a maximal LMEM, and is also implemented as a default in mixed-model ANOVA (see here for mixed-model ANOVA in SPSS).

What would happen if we used the first method, but performed the analysis on the unaggregated, trial-level data rather than the subject means? Most people would consider such an analysis to be "wrong"—and rightly so. However, this is effectively what one does when fitting a random-intercepts-only LMEM to trial-level data. Let's see what happens in the ANOVA case.

x.aov.bad <- aov(Resp ~ Cond + Error(SubjID), x)  

Error: SubjID
          Df Sum Sq Mean Sq F value Pr(>F)
Cond       1   0.97   0.971    0.07  0.794
Residuals 22 307.07  13.958               

Error: Within
           Df Sum Sq Mean Sq F value Pr(>F)
Cond        1    0.1   0.090   0.021  0.884
Residuals 523 2224.8   4.254

Note that we have two error terms against which we can test the effect of condition: (1) the subject error term, which doesn't seem right because it corresponds to variation due to differences is overall performance; and (2) the residual error term. If we used the latter, then note that we would have 523 denominator degrees of freedom, even though we only have 24 subjects in the experiment.

In case one needs any convincing of the analogy between a random-intercepts-only LMEM on unaggregated data and this obviously wrong ANOVA, the following simulation should remove all doubt. In the simulation, we generate 1000 datasets, each containing a single two-level treatment variable administered within subjects. There are 24 subjects and 12 replications per cell per subject. For all datasets, we assume there was no effect of the experimental manipulation (i.e., \(H_0\) is true). Five analyses were performed on each data set:

  1. lmer.RI: random-intercepts-only LMEM on the unaggregated data;
  2. aov.RI: "bad" ANOVA on the unaggregated data (controlling for subject variance but not treatment-by-subject variance);
  3. aov.agg: within-subjects ANOVA on the subject cell means;
  4. aov.RS: within-subjects ANOVA on the unaggregated data, using the treatment-by-subject interaction as the error term (denominator) for the \(F\);
  5. lmer.RS: a maximal LMEM (by-subject random intercepts and slopes)
nsubj <- 24          # number of subjects
nrep <- 12           # number of replications per cell per subj
numberofruns <- 1000 # number of Monte Carlo runs

p.values <- matrix(nrow=numberofruns, ncol=5,
		     c("lmer.RI", "aov.RI", "lmer.RS", "aov.RS", "aov.agg")) )

for (i in 1:numberofruns) {
  # generate data frames for two condition experiment
  # 24 subjects, 12 replications per cell per subject
  x <- data.frame(SubjID=factor(rep(1:nsubj, each=nrep*2)),

  # response contains ONLY NOISE
  x$Resp <-
    rep(rnorm(nsubj,sd=3),each=nrep*2) + # random intercept
    rep(rep(rnorm(nsubj,sd=3),each=2)*c(-.5,.5), each=nrep) + # random slope
    rnorm(nsubj*nrep*2,sd=6) # residual error

  # analysis on unaggregated data
  x.aov.RI <- summary(aov(Resp ~ A + Error(SubjID), data=x)) # WRONG: violation of independence
  x.aov.RS <- summary(aov(Resp ~ A + Error(SubjID/A), data=x)) # RIGHT: correct error term (A:Subject)

  x.lmer.RI <- lmer(Resp ~ A + (1 | SubjID), data=x) # random intercept model
  x.lmer.RS <- lmer(Resp ~ A + (1 + A | SubjID), data=x) # random slope model

  # aggregate the data
  x.agg <- with(x, aggregate(list(Resp=Resp), list(SubjID=SubjID,A=A), mean))

  # ANOVA on aggregated data
  x.agg.aov <- summary(aov(Resp ~ A + Error(SubjID), data=x.agg))

  # store the data in the matrix
  p.values[i,] <- c(2*(1-pnorm(abs((fixef(x.lmer.RI)/sqrt(diag(vcov(x.lmer.RI))))[2]))),

Here are the resulting Type I error rates:

The figure below displays the correlations among \(p\) values as further evidence for the analogy between lmer.RI and aov.RI (left panel), as well as among lmer.RS, aov.agg, and aov.RS (right panel).


Misconception 3: Conventional ANOVA, unlike LMEMs, doesn't allow for testing random slopes

Though rarely used in practice, model selection procedures have long been available for mixed-model ANOVA, just as for LMEMs. It is also possible to test random slopes. In this section, we will explain how to analyse repeated-measures data using the mixed-model ANOVA approach in SPSS (Version 19) General Linear Models. Although most psycholinguists are well acquainted with SPSS, many may not be familiar with the mixed-model ANOVA option because they usually follow the “standard recipe” of performing repeated-measures ANOVAs on aggregated data (see earlier sections). Here we will show that it possible to perform ANOVAs on non-aggregated data as well, i.e. with more than one observation per subject(or item)-by-condition cell. We will further demonstrate that the mixed-model ANOVA approach provides significance tests not only for fixed effects, but also for random effects—with the latter corresponding to random intercepts and slopes in LMEMs. Finally, we will show that it is possible to simplify the effects structure of a mixed-model ANOVA, thereby providing all the ingredients necessary for model selection. However, note that—in stark contrast to what seems to be “accepted practise” in LMEMs—model selection has never been adopted as a standard for confirmatory hypothesis testing in ANOVA (and rightly so, as the analyses in our paper suggest). We hope this tutorial will help to illustrate the strong conceptual parallels between mixed-model ANOVAs and LMEMs—apart from the most obvious difference that LMEMs allow for simultaneous generalization of effects over subjects and items, whereas mixed-model ANOVA requires separate subject and item analyses due to its inability to handle “crossed” random factors.

We start with a 24-subjects/24-items (within/within) data set taken from our large pool of simulated two-condition experiments. For convenience, we assume a balanced data set with no missing values (note, however, that mixed-model ANOVA can handle unbalanced data sets as well). The data are in “long” format such that each row represents a single trial; the variables \(SubjID\) (1-24), ItemID (1-24), Condition (A, B), and Response (continuous DV) are coded in separate columns:


According to the “standard recipe” of performing ANOVA, one would normally proceed along the following three steps: (1) condense the data into subject-by-condition means (for F1) respectively item-by-condition means (for F2) using the SPSS procedure AGGREGATE; (2) re-arrange the aggregated data into “wide” format such that each condition is in a separate column; and finally (3) perform a repeated-measures ANOVA using Analyze: General Linear Model: Repeated Measures….

Here, we will not follow this recipe and perform a mixed-model ANOVA instead. To do this, we leave the data in their original “long” format (no aggregation!) and use the procedure Analyze: General Linear Model: Univariate…, which opens up a command window comprising a complete list of variables on the left and five empty fields (“Dependent Variable”, “Fixed Factor(s)”, etc.) on the right.



In this brief demonstration, we will perform a by-subjects (F1) analysis—a by-items (F2) analysis would essentially work the same way. First, we need to specify the dependent variable by highlighting Resp in the variable list and moving it into the “Dependent Variable” field (via clicking on the corresponding right-arrow button). Next, our experimental treatment variable \(Cond\) is moved into the “Fixed Factor(s)” field (in a factorial fixed-effect design, this field would contain more than one variable). Finally, since we want to perform an F1-analysis, our participant-variable \(SubjID\) ends up in the “Random Factor(s)” field. [As in LMEM, random factors are treated differently from fixed factors, but note that LMEMs and mixed-model ANOVAs employ different algorithms for estimating random effect variability in the population, which we will not discuss here.] The other two fields (“Covariate(s)” and “WLS Weight”) remain empty because there are no continuous predictors or weight variables to consider in our analysis:


To run the analysis, simply click “OK” and examine the results in the Output Navigator (or Statistics Viewer or whatever it’s called these days…):


SPSS tries to confuse us by labelling the critical results table “Tests of Between-Subjects Effects”—but we won’t be fooled by that, knowing that the table actually contains the appropriate inferential results for the within-subjects treatment effect of \(Cond\).

The first important point to note is that the table provides statistical tests not only for the fixed effect of interest (here, the main effect of \(Cond\)), but also for (i) the random main effect of \(SubjID\) and (ii) the random \(Cond{\times}SubjID\) interaction. As repeatedly stated in previous sections (as well as in the main paper itself), (i) is conceptually equivalent to the by-subject random intercept and (ii) is equivalent to the by-subject random slope in LMEM. The fact that all these effects appear in the SPSS output-table already highlights an important default assumption in mixed-model ANOVA: the procedure automatically assumes a full-factorial design, which implies a maximal (by-subjects) random effects structure against which the fixed effect of interest is tested! And although this default can be overridden quite easily (see further below), we are not aware of any published work that has deviated from the default when testing hypotheses in mixed-model ANOVA.

Another important point to note—also observe the footnotes in the SPSS output table—is that the fixed effect of interest is appropriately tested against an error term that corresponds to the random \(Cond{\times}SubjID\) interaction (equivalent to the by-subject random slope in LMEM). In other words, the procedure tests by default whether the experimental treatment effect generalizes above and beyond subject-specific random variation in displaying that effect. Correspondingly, the “error” degrees of freedom for the \(Cond\) fixed effect are identical to the “effect” (or “hypothesis”, as SPSS calls it) degrees of freedom for the random \(Cond{\times}SubjID\) interaction, namely 23 (number of subjects minus 1). [*NB* With unbalanced numbers of observations per condition, SPSS would perform a numerical correction on these \(df\)s, giving different (and ultimately more conservative) results depending on the severity of the imbalance.]

For our example data, the default mixed-model ANOVA (assuming a maximal by-subject random effects structure) suggests that the experimental treatment effect is not reliably generalizable across subjects: \(F(1,23)=2.654\); \(p=.117\). [In case one might wonder: a repeated-measures ANOVA on by-subject aggregated data would yield \(F(1,23) = 3.946\), \(p=.059\).] Instead of the graphical user interface, one could also use the following syntax code to perform the standard full-factorial mixed-model ANOVA in SPSS:

  /DESIGN=Cond SubjID Cond*SubjID.

What if we wanted to apply a simpler model to our data, say, one that corresponds to a (by-subject) intercept-only model in LMEM? Although certainly not the norm for hypothesis testing in ANOVA, some researchers might consider such an intercept-only model if, for example, a significance test of the \(Cond{\times}SubjID\) interaction would yield a non-significant result (cf. model selection). Others may just have picked up somewhere that “exclusion of the random interaction is ok if it’s not of theoretical interest” or that “intercept-only models are often used in LMEM” etc. (ignoring the pivotal role that the random interaction/slope actually plays for capturing repeated-measures dependencies in both ANOVA and LMEM!).

Indeed, mixed-model ANOVA allows us to override the full-factorial default setting. To do this, open up the mixed-model ANOVA command window at Analyze: General Linear Model: Univariate… and specify the DV, the fixed and the random factor(s) as before. Now, note that in the top-right corner of the Univariate… command window, there is a button labelled “Model”. Clicking on that button opens up a new window that looks like this:


The toggle switches at the top are still in their full factorial default setting. Just switch over to custom and you will see a “Factors & Covariates” list on the left (containing the fixed factor \(Cond\) and the random factor \(SubjID\)) and a blank “Model” field on the right. To specify a (by-subject) random intercept-only effect structure, we need to implement a model that contains the main effects of \(Cond\) and \(SubjID\), but no interaction between the two. Simply highlight the two factors in the “Factors & Covariates” list, then select main effects in the pull-down menu under “Build Term(s)” and click on the arrow-button to move the terms into the “Model” field. The command window now looks like this:


After clicking “Continue”, we are back in the Univariate main window, where we click on “OK” to run the analysis. Instead of the graphical user interface, we could also have used the following syntax to perform the intercept-only mixed-model ANOVA:

  /DESIGN=Cond SubjID.

The results now look like this:


As can be seen, the \(Cond\) fixed effect of interest changed from formerly “not significant” to “highly significant” in this random intercept-only analysis. Also, note what happened to the “error” term in the F-ratio for the fixed effect, whose degrees of freedom changed from formerly 23 (number of subjects minus 1) to 551 (number of trials minus: 1 \(df\) for the grand average intercept, 1 \(df\) for the \(Cond\) fixed effect, and 23 \(df\)s for the \(SubjID\) random effect) in the intercept-only model. The latter is identical to the residual by-trial error which indicates that, just as in LMEM, the random intercept-only model essentially treats the by-condition measurements as subject- independent observations.

The "take home message" from this demonstration is that mixed-model ANOVA and LMEM are actually not all that different. Both distinguish between random and fixed effects and provide significance tests for each. Moreover, both allow for flexible adjustments of the model structure which is useful for model building applications (if not necessarily for confirmatory hypothesis testing). The only major difference (apart from relying on different algorithms for random effect estimation) is that LMEM, unlike ANOVA, allows for “crossing” of random effects, i.e. simultaneous consideration of subject- and item- related random effects.

Date: March 27, 2012

Author: Dale J. Barr, Roger Levy, Christoph Scheepers, and Harry J. Tily

Created: 2023-01-31 Tue 11:31