# Walkthrough of an "empirical logit" analysis in R

*Note:* This is an updated version of an earlier post from September
12, 2007, with a number of revisions to improve clarity.

*Note:* Updated again on 27 July 2012 (in the earlier version, the weights
were computed but never actually used in the fitting).

We will be looking at a subset of the data from Kronmüller and Barr (2007), Experiment 2 (the "maintain precedent" condition only). This subset has a 2x2 design, with Speaker (Same, Different) crossed with Cognitive Load (None, Load). Essential background details for this walkthrough can be found in the MLR paper in JML.

You'll also need to download an archive containing the code
(`elogit-wt.R`) and the data files (`bySubj.RData` and `byItem.RData`).
They are available in GNU tar/zip format (.tgz) or in ZIP format.

## Setting up the R environment

For this walkthrough, I am using `R` version 2.14.2 and the `lme4`
package, version 0.999375.41. If you are using different versions, your
results might differ (hopefully only slightly, unless you're using a
really old version, in which case I recommend that you upgrade). For
the same reason, they will also differ slightly from the results I
presented in the MLR paper. Here's how to find out what versions you
are using.

```
R.Version()$version.string
packageVersion("lme4")
```

[1] "R version 2.14.2 (2012-02-29)" [1] ‘0.999375.41’

First, we need to load the `lme4` package and then load in the data
from the experiment. We have two files (`bySubj.RData` and
`byItem.RData`), one for the by-subject analysis and one for the
by-item analysis that we will perform.

```
library(lme4)
load("bySubj.RData")
load("byItem.RData")
ls()
```

[1] "byItem" "byItem.lmer" "bySubj" "bySubj.lmer" "ps" [6] "ts"

The data frames `bySubj` and `byItem` contain the data aggregated by
subject and by item, respectively. The portion of the data we are
looking at contains sixteen trials per subject, four in each of the
four cells of the design. The data were prepared by aggregating over
50 ms time slices (3 frames of eye data) and over all of the trials in
a given condition within each subject (or item, for the analysis by
item). Thus, each 50 ms bin consists of (4 trials x 3 frames =) 12 frames of eye data.

For example, take a look at some of the data for subject 30.

```
subset(bySubj, SubjID==30)
```

```
   SubjID AggID SpeakerT CogLoadT Bin   t1 y  N
1      30   241     Same     None   0 0.00 0 12
2      30   241     Same     None   1 0.05 0 12
3      30   241     Same     None   2 0.10 0 12
4      30   241     Same     None   3 0.15 0 12
5      30   243     Diff     None   0 0.00 3 12
6      30   243     Diff     None   1 0.05 3 12
7      30   243     Diff     None   2 0.10 5 12
8      30   243     Diff     None   3 0.15 4 12
9      30   245     Same     Load   0 0.00 3 12
10     30   245     Same     Load   1 0.05 3 12
11     30   245     Same     Load   2 0.10 7 12
12     30   245     Same     Load   3 0.15 6 12
13     30   247     Diff     Load   0 0.00 3 12
14     30   247     Diff     Load   1 0.05 3 12
15     30   247     Diff     Load   2 0.10 3 12
16     30   247     Diff     Load   3 0.15 3 12
```

Note that we have aggregated together all of the trials within a given
condition for each subject, and broken up the time variable into a
series of 50 ms bins. The bins for a given aggregated condition
(henceforth "aggregate") were assigned an ID number (to represent that
they are all from the same condition for the same subject). The
precise number that is given to a particular "aggregate" does not
matter; what matters is that the number given to all the bins in a
given aggregate is unique. This is a way of telling the model fitting
function `lmer` that these bins belong together.
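
As a quick sanity check (not part of the original analysis, just a
sketch), you can verify that the aggregate grouping has one level per
subject-by-condition cell, which for this data set should be 56 subjects
x 4 cells = 224:

```
## number of distinct aggregate groups (AggID within SubjID);
## if the coding is correct, this should equal 56 subjects x 4 cells = 224
length(unique(interaction(bySubj$AggID, bySubj$SubjID, drop = TRUE)))
```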

The variable `y` is the number of frames in the 50 ms window for which
the gaze fell within the bounds of the target object; the variable `N`
is the total number of frames within each bin.

The variable `t1` is the time variable, coded in seconds, from the onset
of the analysis window (which was 300 ms after the start of the word).

We can compute the empirical logit for each bin using the formula \( elog = \log\left(\frac{y + .5}{N - y + .5}\right) \). This is an approximation to the log odds. Because we are using a linear approximation, we will also need to calculate weights, \( \frac{1}{y+.5} + \frac{1}{N-y+.5} \), because the variance of the logit depends on the mean.
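
As a quick worked example (the values match rows 5 and 6 of the
by-subject data shown below): for a bin with y = 3 frames on target out
of N = 12 total frames,

```
## empirical logit: log(3.5 / 9.5), approximately -0.999
log((3 + .5) / (12 - 3 + .5))
## weight: 1/3.5 + 1/9.5, approximately 0.391
1/(3 + .5) + 1/(12 - 3 + .5)
```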

## Preparing the data

```
bySubj$elog <- log( (bySubj$y + .5) / (bySubj$N - bySubj$y + .5) )
bySubj$wts <- 1/(bySubj$y + .5) + 1/(bySubj$N - bySubj$y + .5)
```

We will now create condition codes for our categorical predictor variables. We will use ANOVA-style "deviation" coding.

```
bySubj$S <- ifelse(bySubj$SpeakerT=="Same", -.5, .5)
bySubj$L <- ifelse(bySubj$CogLoadT=="None", -.5, .5)
head(bySubj, 12)
```

```
   SubjID AggID SpeakerT CogLoadT Bin   t1 y  N       elog       wts    S    L
1      30   241     Same     None   0 0.00 0 12 -3.2188758 2.0800000 -0.5 -0.5
2      30   241     Same     None   1 0.05 0 12 -3.2188758 2.0800000 -0.5 -0.5
3      30   241     Same     None   2 0.10 0 12 -3.2188758 2.0800000 -0.5 -0.5
4      30   241     Same     None   3 0.15 0 12 -3.2188758 2.0800000 -0.5 -0.5
5      30   243     Diff     None   0 0.00 3 12 -0.9985288 0.3909774  0.5 -0.5
6      30   243     Diff     None   1 0.05 3 12 -0.9985288 0.3909774  0.5 -0.5
7      30   243     Diff     None   2 0.10 5 12 -0.3101549 0.3151515  0.5 -0.5
8      30   243     Diff     None   3 0.15 4 12 -0.6359888 0.3398693  0.5 -0.5
9      30   245     Same     Load   0 0.00 3 12 -0.9985288 0.3909774 -0.5  0.5
10     30   245     Same     Load   1 0.05 3 12 -0.9985288 0.3909774 -0.5  0.5
11     30   245     Same     Load   2 0.10 7 12  0.3101549 0.3151515 -0.5  0.5
12     30   245     Same     Load   3 0.15 6 12  0.0000000 0.3076923 -0.5  0.5
```

## Fitting the model

The model that we are fitting is a simple line; the regression analysis will estimate the slope and intercept for the line, as well as how the various factors (and their interaction) modulate the slope and intercept terms (see the MLR paper for details). So the next step is to fit the full model, including all random effects for the Subject analysis.

```
bySubj.lmer <- lmer(elog ~ t1*S*L +
(1 + t1 | AggID:SubjID) + (1 + t1 | SubjID),
data=bySubj, weights=1/wts)
```

Note that we have broken what is actually a single command across three lines. This makes it easier to read (and explain).

First, `bySubj.lmer` is the name of the variable that will be created
by the model fitting procedure `lmer`. The first argument to the `lmer`
function is a formula, specifically
`elog ~ t1*S*L + (1 + t1 | AggID:SubjID) + (1 + t1 | SubjID)`. Note the
tilde (`~`) where an equal sign would normally be; this is the
conventional way to write a formula in R. The terms on the first line
represent the DV (`elog`) and the predictor variables, which determine
what *fixed effects* will be estimated. We used the syntax `t1*S*L` as
shorthand to get all main effects and interactions between time (`t1`),
speaker (`S`), and load (`L`), rather than laboriously writing out
`t1+S+L+S:L+t1:S+t1:L+t1:S:L`.

The second line, `(1 + t1 | AggID:SubjID) + (1 + t1 | SubjID)`,
represents the random effects in the model. The pipe symbol `|`
divides what we want to allow to vary (slope and intercept) from the
units over which we will allow them to vary (aggregate nested within
subject, `AggID:SubjID`, and subject, `SubjID`).
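
Once the model has been fit, one way to see what these random-effects
terms do is to look at the estimated per-subject deviations from the
population intercept and slope; a quick sketch using `ranef` from
`lme4`:

```
## each row is one subject's estimated deviation from the
## population intercept and time slope
head(ranef(bySubj.lmer)$SubjID)
```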

The third line just tells `lmer` the name of the data frame where our
variables are located, and weights each observation by `1/wts`, i.e.,
by the reciprocal of the variance.

Now let's have a look at the results.

```
summary(bySubj.lmer)
```

```
 1: Linear mixed model fit by REML 
 2: Formula: elog ~ t1 * S * L + (1 + t1 | AggID:SubjID) + (1 + t1 | SubjID) 
 3:    Data: bySubj 
 4:   AIC  BIC logLik deviance REMLdev
 5:  2468 2540  -1219     2437    2438
 6: Random effects:
 7:  Groups       Name        Variance  Std.Dev. Corr   
 8:  AggID:SubjID (Intercept)  0.617466 0.78579         
 9:               t1          17.876056 4.22801  -0.452 
10:  SubjID       (Intercept)  0.011532 0.10739         
11:               t1           0.086934 0.29485   1.000 
12:  Residual                  0.112820 0.33589         
13: Number of obs: 896, groups: AggID:SubjID, 224; SubjID, 56
14: 
15: Fixed effects:
16:             Estimate Std. Error t value
17: (Intercept) -1.25874    0.05664 -22.224
18: t1           2.42555    0.32624   7.435
19: S            0.32142    0.10958   2.933
20: L           -0.02784    0.10958  -0.254
21: t1:S        -2.27101    0.64770  -3.506
22: t1:L        -0.33024    0.64770  -0.510
23: S:L         -0.11625    0.21917  -0.530
24: t1:S:L       0.87128    1.29544   0.673
25: 
26: Correlation of Fixed Effects:
27:        (Intr) t1     S      L      t1:S   t1:L   S:L   
28: t1     -0.440                                          
29: S      -0.005  0.006                                   
30: L      -0.001  0.000  0.004                            
31: t1:S    0.006 -0.006 -0.490 -0.006                     
32: t1:L    0.000 -0.001 -0.006 -0.490  0.007              
33: S:L     0.004 -0.006 -0.001 -0.005  0.000  0.006       
34: t1:S:L -0.005  0.007  0.000  0.006 -0.001 -0.006 -0.490
```

There is a lot of complicated output here, so let's take it step by step.

Line 5 tells us about the quality of the model fit, including the
Akaike Information Criterion (`AIC`), the Bayesian Information
Criterion (`BIC`), the log likelihood at convergence (`logLik`), the
deviance (which is just -2 times the log likelihood), and something
called the `REMLdev`, which is the deviance from the restricted
maximum-likelihood (REML) technique that `lme4` has used. If we wanted
to get values for the likelihood that we could use for likelihood
tests, we could fit the model using the option `REML=FALSE`. However,
we will not be doing this in the current example.
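
For completeness, here is a rough sketch of what such a likelihood-ratio
test might look like (not run in this walkthrough; the model names
`full` and `noTxS` are just illustrative). Both models have to be fit
with `REML=FALSE` for the deviance comparison to be meaningful:

```
## refit the full model with maximum likelihood rather than REML
full <- lmer(elog ~ t1*S*L + (1 + t1 | AggID:SubjID) + (1 + t1 | SubjID),
             data=bySubj, weights=1/wts, REML=FALSE)
## drop the speaker-by-time interaction and compare the two fits
noTxS <- update(full, . ~ . - t1:S)
anova(noTxS, full)  # chi-square test on the change in deviance
```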

Lines 6-12 give us information about the random effects in the model. These values characterize the variance/covariance of two bivariate distributions, one at the subject level and one that describes variation associated with individual aggregates (cells of the design). There is also a residual term.
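
If you want to work with these estimates directly rather than reading
them off the summary, they can be extracted with `VarCorr`; a minimal
sketch:

```
## variance-covariance matrices for the random effects
VarCorr(bySubj.lmer)
```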

Finally, lines 15-24 show the fixed effects in the model. Note that
we get `t` values, but no `p` values. If we want to get p-values, we
can treat the `t`s as if they were drawn from a normal distribution,
using the `pnorm` function. The code below does this.

```
ts <- fixef(bySubj.lmer) / sqrt(diag(vcov(bySubj.lmer)))
print(round(ts, 2))
ps <- 2*(1 - pnorm(abs(ts)))
print(round(ps, 3))
```

```
(Intercept)          t1           S           L        t1:S        t1:L 
     -22.22        7.43        2.93       -0.25       -3.51       -0.51 
        S:L      t1:S:L 
      -0.53        0.67 
(Intercept)          t1           S           L        t1:S        t1:L 
      0.000       0.000       0.003       0.799       0.000       0.610 
        S:L      t1:S:L 
      0.596       0.501 
```

We calculate the `t` values by taking the parameter estimates and
dividing them by their standard errors. We have to extract all of
this information from the `bySubj.lmer` object that was created as the
result of the call to the `lmer` function. We extracted the parameter
estimates using the `fixef` function, and then divided through by the
standard errors, obtained using `sqrt(diag(vcov(bySubj.lmer)))`. The
function `vcov(bySubj.lmer)` gets the variance-covariance matrix for
the fixed effects. Their variances are on the diagonal, which we
extract using the `diag` function; then we take the square root
(`sqrt`) to get the standard errors. Another way to get p-values
would be to use likelihood ratio tests, which we don't cover here.
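
Under the same normal approximation, we could also form rough 95%
confidence intervals for the fixed effects from the estimates and
standard errors; a sketch (not part of the original analysis):

```
## Wald-style 95% confidence intervals: estimate +/- 1.96 * SE
est <- fixef(bySubj.lmer)
se  <- sqrt(diag(vcov(bySubj.lmer)))
round(cbind(lower = est - 1.96*se, upper = est + 1.96*se), 3)
```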

The listing includes main effects and interactions. The effects `S`,
`L`, and `S:L` are the effects of speaker, load, and their interaction
at the intercept (where `t1` equals 0), and thus represent
anticipatory baseline effects. Note that in line 19 we see what looks
like a main effect of speaker. Because we coded the speaker variable
as Same = -.5 and Diff = .5, a positive parameter estimate of `.32142`
implies that the chance of looking at the target at the intercept was
higher when the speaker was different than when he/she was the same;
we obtained a p-value of .003 for this effect. (See the MLR paper for
a possible explanation for this baseline effect.) What we also see in
line 21 is a speaker-by-time interaction, with the slope steeper for
the same-speaker condition (since "same" was coded as negative) by
about 2.27 logits per second, with a p-value less than .001.
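
To see concretely what the deviation coding implies, we can plug the
estimates back in. At the intercept (t1 = 0, averaging over Load), the
predicted empirical logit is -1.25874 - .5(0.32142) = -1.42 for the
same-speaker condition and -1.25874 + .5(0.32142) = -1.10 for the
different-speaker condition; a quick sketch in code:

```
## predicted elog at the intercept for Same (S = -.5) and Diff (S = +.5)
fixef(bySubj.lmer)["(Intercept)"] +
  c(Same = -.5, Diff = .5) * fixef(bySubj.lmer)["S"]
```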

So, we have an anticipatory baseline effect of Speaker, as well as an effect of Speaker on the slope that goes in the opposite direction (higher slopes for the same speaker), with no other effect approaching significance. This supports the idea that listeners were facilitated by hearing the precedent spoken in the same voice.

One can replicate the same analysis by Item using the code below. This yields basically the same pattern of results.

```
byItem$elog <- log( (byItem$y + .5) / (byItem$N - byItem$y + .5) )
byItem$wts <- 1/(byItem$y + .5) + 1/(byItem$N - byItem$y + .5)
byItem$S <- ifelse(byItem$SpeakerT=="Same", -.5, .5)
byItem$L <- ifelse(byItem$CogLoadT=="None", -.5, .5)
byItem.lmer <- lmer(elog ~ t1*S*L +
                    (1 + t1 | AggID:ItemID) + (1 + t1 | ItemID),
                    data=byItem, weights=1/wts)
print(summary(byItem.lmer), corr=FALSE)
ts <- fixef(byItem.lmer) / sqrt(diag(vcov(byItem.lmer)))
print(round(ts, 2))
ps <- 2*(1 - pnorm(abs(ts)))
print(round(ps, 3))
```

```
Linear mixed model fit by REML 
Formula: elog ~ t1 * S * L + (1 + t1 | AggID:ItemID) + (1 + t1 | ItemID) 
   Data: byItem 
  AIC  BIC logLik deviance REMLdev
 1519 1582 -744.3     1490    1489
Random effects:
 Groups       Name        Variance Std.Dev. Corr   
 AggID:ItemID (Intercept) 0.319040 0.564836        
              t1          7.040989 2.653486 -0.463 
 ItemID       (Intercept) 0.020052 0.141604        
              t1          0.005126 0.071596 -1.000 
 Residual                 0.073227 0.270605        
Number of obs: 512, groups: AggID:ItemID, 128; ItemID, 32

Fixed effects:
            Estimate Std. Error t value
(Intercept) -1.29490    0.05732 -22.589
t1           2.73781    0.26735  10.241
S            0.37377    0.10314   3.624
L           -0.01247    0.10314  -0.121
t1:S        -2.41912    0.53409  -4.529
t1:L        -0.43510    0.53409  -0.815
S:L         -0.14867    0.20628  -0.721
t1:S:L       0.54104    1.06819   0.507

(Intercept)          t1           S           L        t1:S        t1:L 
     -22.59       10.24        3.62       -0.12       -4.53       -0.81 
        S:L      t1:S:L 
      -0.72        0.51 
(Intercept)          t1           S           L        t1:S        t1:L 
      0.000       0.000       0.000       0.904       0.000       0.415 
        S:L      t1:S:L 
      0.471       0.613 
```