Coding categorical variables when analyzing factorial experiments with regression
Many experimentalists who are trying to make the leap from ANOVA to linear mixed-effects models (LMEMs) in R struggle with the coding of categorical predictors. It is unexpectedly complicated, and the defaults provided in R turn out to be wholly inappropriate for factorial experiments. Indeed, using those defaults with factorial experiments can lead researchers to draw erroneous conclusions from their data.
I have limited the discussion below to factorial designs in which all factors have exactly 2 levels. I will update the post later with info on more complex cases. The general principles and recommendations put forward here also hold for those cases.
Simple vs. main effects
Below, I will assume that you understand the difference between a simple effect and a main effect, and between a simple interaction and a main interaction in a three-way design. If you already do, just skip to the next section, “Choosing a coding scheme.”
Put simply, in an \(A{\times}B\) design, the simple effect of \(A\) is the effect of \(A\) controlling for \(B\), while the main effect of \(A\) is the effect of \(A\) ignoring \(B\). Another way of looking at this is to consider the cell means and marginal means in a factorial design:
B_{1} | B_{2} | |||
A_{1} | \(\bar{Y}_{11}\) | \(\bar{Y}_{12}\) | \(\bar{Y}_{1.}\) | |
A_{2} | \(\bar{Y}_{21}\) | \(\bar{Y}_{22}\) | \(\bar{Y}_{2.}\) | |
\(\bar{Y}_{.1}\) | \(\bar{Y}_{.2}\) |
In the above table, \(\bar{Y}_{11}\), \(\bar{Y}_{12}\), \(\bar{Y}_{21}\), and \(\bar{Y}_{22}\) are all cell means, and \(\bar{Y}_{1.}\), \(\bar{Y}_{2.}\), \(\bar{Y}_{.1}\), and \(\bar{Y}_{.2}\) are all marginal means. The dot-subscript tells you to “ignore” the dimension containing the dot; e.g., \(\bar{Y}_{.1}\) tells you to take the mean of the first column ignoring the row variable. To test the main effect of A is to test the null hypothesis that \(\bar{Y}_{1.}=\bar{Y}_{2.}\). To test the simple effect of \(A\) is to test e.g. that \(\bar{Y}_{11}=\bar{Y}_{21}\).
The distinction between simple and main interactions has the same
logic: the simple interaction of \(AB\) in an \(ABC\) design is the
interaction of \(AB\) at a particular level of \(C\); the main
interaction of \(AB\) is the interaction ignoring C. The latter is
what we are usually talking about when we talk about lower-order
interactions in a three-way design. It is also what we are given in
the output from standard ANOVA procedures, e.g., the aov()
function
in R, SPSS, SAS, etc.
Choosing a coding scheme
Why does it matter?
Generally, the choice of a coding scheme impacts the interpretation of:
- the intercept term; and
- the interpretation of the tests for all but the highest-order effects and interactions in a factorial design.
It also can influence the interpretation/estimation of random effects in a mixed-effects model (see here for further discussion). If you have a design with only a single two-level factor, and are using a maximal random-effects structure, the choice of coding scheme doesn’t really matter.
What coding schemes are available?
There are many possible coding schemes (see ?contr.helmert
for more
information). The most relevant ones are treatment, sum, and
deviation coding schemes.
Coding Scheme | \(A_1\) | \(A_2\) |
---|---|---|
Treatment (dummy) | 0 | 1 |
Sum | -1 | 1 |
Deviation | -.5 | .5 |
The default in R is to use treatment coding for any variable defined
as a factor
in the model (see ?factor
and ?contrasts
for
information). To see why this is inappropriate for factorial designs,
consider a 2x2x2 factorial design with factors \(A\), \(B\) and \(C\).
We will just consider a fully between-subjects design with only one
observation per subject as this allows us to use the simplest possible
error structure. We would fit such a model using lm()
:
lm(Y ~ A * B * C)
The table below provides the interpretation for various effects in the model under the three different coding schemes. Note that \(Y\) is the dependent variable, and the dots in the subscript mean to “ignore” the corresponding dimension. Thus, \(\bar{Y}_{.1.}\) is the mean of B_{1} (ignoring factors \(A\) and \(C\)) and \(\bar{Y}_{...}\) is the “grand mean” (ignoring all factors).
Treatment | Sum | Deviation | |
---|---|---|---|
intercept | \(\bar{Y}_{111}\) | \(\bar{Y}_{...}\) | \(\bar{Y}_{...}\) |
A | \(\bar{Y}_{211}-\bar{Y}_{111}\) | \(\frac{1}{2}(\bar{Y}_{2..}-\bar{Y}_{1..})\) | \(\bar{Y}_{2..}-\bar{Y}_{1..}\) |
B | \(\bar{Y}_{121}-\bar{Y}_{111}\) | \(\frac{1}{2}(\bar{Y}_{.2.}-\bar{Y}_{.1.})\) | \(\bar{Y}_{.2.}-\bar{Y}_{.1.}\) |
C | \(\bar{Y}_{112}-\bar{Y}_{111}\) | \(\frac{1}{2}(\bar{Y}_{..2}-\bar{Y}_{..1})\) | \(\bar{Y}_{..2}-\bar{Y}_{..1}\) |
A:B | \((\bar{Y}_{221}-\bar{Y}_{121})-(\bar{Y}_{211}-\bar{Y}_{111})\) | \(\frac{1}{4}\left[(\bar{Y}_{22.}-\bar{Y}_{12.})-(\bar{Y}_{21.}-\bar{Y}_{11.})\right]\) | \((\bar{Y}_{22.}-\bar{Y}_{12.})-(\bar{Y}_{21.}-\bar{Y}_{11.})\) |
B:C | \((\bar{Y}_{122}-\bar{Y}_{112})-(\bar{Y}_{121}-\bar{Y}_{111})\) | \(\frac{1}{4}\left[(\bar{Y}_{.22}-\bar{Y}_{.12})-(\bar{Y}_{.21}-\bar{Y}_{.11})\right]\) | \((\bar{Y}_{.22}-\bar{Y}_{.12})-(\bar{Y}_{.21}-\bar{Y}_{.11})\) |
A:C | \((\bar{Y}_{212}-\bar{Y}_{211})-(\bar{Y}_{112}-\bar{Y}_{111})\) | \(\frac{1}{4}\left[(\bar{Y}_{2.2}-\bar{Y}_{1.2})-(\bar{Y}_{2.1}-\bar{Y}_{1.1})\right]\) | \((\bar{Y}_{2.2}-\bar{Y}_{1.2})-(\bar{Y}_{2.1}-\bar{Y}_{1.1})\) |
For the three way A:B:C interaction, the interpretations are:
Treatment | \(\left[(\bar{Y}_{221}-\bar{Y}_{121})-(\bar{Y}_{211}-\bar{Y}_{111})\right] - \left[(\bar{Y}_{222}-\bar{Y}_{122})-(\bar{Y}_{212}-\bar{Y}_{112})\right]\) |
Sum | \(\frac{1}{8}\left(\left[(\bar{Y}_{221}-\bar{Y}_{121})-(\bar{Y}_{211}-\bar{Y}_{111})\right] - \left[(\bar{Y}_{222}-\bar{Y}_{122})-(\bar{Y}_{212}-\bar{Y}_{112})\right]\right)\) |
Deviation | \(\left[(\bar{Y}_{221}-\bar{Y}_{121})-(\bar{Y}_{211}-\bar{Y}_{111})\right] - \left[(\bar{Y}_{222}-\bar{Y}_{122})-(\bar{Y}_{212}-\bar{Y}_{112})\right]\) |
Note that the inferential tests of A:B:C will all have the same outcome, despite the parameter estimate for sum coding being one-eighth of that for the other schemes. For all lower-order effects, sum and deviation coding will give different parameter estimates but identical inferential outcomes. Both of these schemes provide identical tests of the canonical main effects and main interactions for a three-way ANOVA. In contrast, treatment (dummy) coding will provide inferential tests of simple effects and simple interactions. So, if what you are interested in getting are the “canonical” tests from ANOVA, use sum or deviation coding.
Conclusion
The interpretation of all but the highest order effect depends on the coding scheme.
With treatment coding, you are looking at simple effects and simple interactions, not main effects and main interactions.
Sum coding differs from deviation coding only in the parameter estimates.
The parameter estimates for a given effect under sum coding will be \(E/2^{X}\), where \(E\) is the actual effect and \(x\) is its order (1 for main effects, 2 for two-way interactions, etc.)
The default coding scheme for factors is R is “treatment” coding.
So, anytime you declare a variable as type factor
and use this
variable as a predictor in your regression model, R will automatically
create treatment-coded variables.
To change this to sum coding, see ?contrasts
and ?contr.sum
. If you find yourself always having to do this, you can also change the defaults (see <i>[Update May 2017: I no longer recommend this practice as it leads to non-reproducible scripts; thanks to Nick Brown for calling this to my attention]</i>
?options
) in your .Rprofile
startup file (see here).
take home message: when analyzing factorial designs in R using regression, to obtain the canonical ANOVA-style interpretations of main effects and interactions use sum or deviation coding and NOT the default treatment coding.