Walkthrough of an "empirical logit" analysis in R

Note: This is an updated version of an earlier post from September 12, 2007, with a number of revisions to improve clarity.

Note: updated again on 27-July-2012 (weights were computed but never used in the fitting)

We will be looking at a subset of the data from Kronmüller and Barr (2007), Experiment 2 (the "maintain precedent" condition only). This subset has a 2x2 design, with Speaker (Same, Different) crossed with Cognitive Load (None, Load). Essential background details for this walkthrough can be found in the MLR paper in JML.

You'll also need to download an archive containing the code (elogit-wt.R) and the data files (bySubj.RData and byItem.RData). They are available in GNU tar/zip format (.tgz) or in ZIP format.

Setting up the R environment

For this walkthrough, I am using R version 2.14.2 and lme4 package version 0.999375.41. If you are using different versions, your results might differ (hopefully only slightly, unless you're using a really old version, in which case I recommend that you upgrade). For the same reason, they will also differ slightly from the results I presented in the MLR paper. Here's how to find out what versions you are using.

R.Version()$version.string
packageVersion("lme4")
[1] "R version 2.14.2 (2012-02-29)"
[1] ‘0.999375.41’

First, we need to load the lme4 package and then load in the data from the experiment. We have two files (bySubj.RData and byItem.RData), one for each of the two analyses we will perform: by subject and by item.

library(lme4)

load("bySubj.RData")
load("byItem.RData")

ls()
[1] "byItem"      "byItem.lmer" "bySubj"      "bySubj.lmer" "ps"         
[6] "ts"

The data frames bySubj and byItem contain the data aggregated by Subject and by Item respectively. The portion of the data we are looking at contains sixteen trials per subject, four in each of the four cells of the design. The data was prepared by aggregating over 50 ms time slices (3 frames of eye data) and over all of the trials in a given condition within each subject (or item, for the analysis by item). Thus, each 50 ms bin consists of (4 trials x 3 frames =) 12 frames of eye data.
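
If you are curious how this binning step might be done, here is a minimal sketch, assuming a hypothetical frame-level data frame raw with one row per frame of eye data and columns SubjID, SpeakerT, CogLoadT, Frame (the frame index from the start of the analysis window), and gaze (1 if the gaze was on the target, 0 otherwise). This is not the code that produced the files above, just an illustration of the logic.

raw$Bin <- raw$Frame %/% 3   # three frames of eye data per 50 ms bin

# y = frames on target, N = total frames, per subject, condition, and bin
y <- aggregate(gaze ~ SubjID + SpeakerT + CogLoadT + Bin, data=raw, FUN=sum)
N <- aggregate(gaze ~ SubjID + SpeakerT + CogLoadT + Bin, data=raw, FUN=length)
bins <- y
names(bins)[names(bins)=="gaze"] <- "y"
bins$N <- N$gaze

bins$t1 <- bins$Bin * .05    # time in seconds from onset of analysis window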

For example, take a look at some of the data for subject 30.

subset(bySubj, SubjID==30)
   SubjID AggID SpeakerT CogLoadT Bin   t1 y  N
1      30   241     Same     None   0 0.00 0 12
2      30   241     Same     None   1 0.05 0 12
3      30   241     Same     None   2 0.10 0 12
4      30   241     Same     None   3 0.15 0 12
5      30   243     Diff     None   0 0.00 3 12
6      30   243     Diff     None   1 0.05 3 12
7      30   243     Diff     None   2 0.10 5 12
8      30   243     Diff     None   3 0.15 4 12
9      30   245     Same     Load   0 0.00 3 12
10     30   245     Same     Load   1 0.05 3 12
11     30   245     Same     Load   2 0.10 7 12
12     30   245     Same     Load   3 0.15 6 12
13     30   247     Diff     Load   0 0.00 3 12
14     30   247     Diff     Load   1 0.05 3 12
15     30   247     Diff     Load   2 0.10 3 12
16     30   247     Diff     Load   3 0.15 3 12

Note that we have aggregated together all of the trials within a given condition for each subject, and broken up the time variable into a series of 50 ms bins. The bins for a given aggregated condition (henceforth "aggregate") were assigned an ID number, to represent that they all come from the same condition for the same subject. The precise number given to a particular aggregate does not matter; what matters is that all the bins in a given aggregate share a number, and that this number is unique to that aggregate. This is a way of telling the model fitting function lmer that these bins belong together.
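
If you are preparing your own data, one way to create such an ID is to collapse the grouping factors into a single factor and convert it to a number. A minimal sketch, again assuming the hypothetical bins data frame from above:

# each subject-by-condition cell gets its own unique integer code
bins$AggID <- as.numeric(interaction(bins$SubjID, bins$SpeakerT,
                                     bins$CogLoadT, drop=TRUE))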

The variable y is the number of frames in the 50 ms window for which the gaze fell within the bounds of the target object; the variable N is the total number of frames within each bin.

Variable t1 is the time variable, coded in seconds, from the onset of the analysis window (which was 300 ms after the start of the word).

We can compute the empirical logit for each bin using the formula \( elog = \log\left(\frac{y + .5}{N - y + .5}\right) \). This is an approximation to the log odds. Because we are using a linear approximation, we also need to calculate the variance of each empirical logit, \( \frac{1}{y + .5} + \frac{1}{N - y + .5} \), because the variance of the logit depends on the mean; each observation will later be weighted by the reciprocal of this variance.
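
As a check, consider the Diff/None bins for subject 30 where \( y = 3 \) and \( N = 12 \): the empirical logit is \( \log(3.5/9.5) \approx -0.999 \) and its variance is \( 1/3.5 + 1/9.5 \approx 0.391 \). These are the values that will appear in the elog and wts columns below.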

Preparing the data

bySubj$elog <- log( (bySubj$y + .5) / (bySubj$N - bySubj$y + .5) )
bySubj$wts <- 1/(bySubj$y + .5) + 1/(bySubj$N - bySubj$y + .5)

We will now create condition codes for our categorical predictor variables. We will use ANOVA-style "deviation" coding.

bySubj$S <- ifelse(bySubj$SpeakerT=="Same",-.5,.5)
bySubj$L <- ifelse(bySubj$CogLoadT=="None",-.5,.5)
head(bySubj, 12)
   SubjID AggID SpeakerT CogLoadT Bin   t1 y  N       elog       wts    S    L
1      30   241     Same     None   0 0.00 0 12 -3.2188758 2.0800000 -0.5 -0.5
2      30   241     Same     None   1 0.05 0 12 -3.2188758 2.0800000 -0.5 -0.5
3      30   241     Same     None   2 0.10 0 12 -3.2188758 2.0800000 -0.5 -0.5
4      30   241     Same     None   3 0.15 0 12 -3.2188758 2.0800000 -0.5 -0.5
5      30   243     Diff     None   0 0.00 3 12 -0.9985288 0.3909774  0.5 -0.5
6      30   243     Diff     None   1 0.05 3 12 -0.9985288 0.3909774  0.5 -0.5
7      30   243     Diff     None   2 0.10 5 12 -0.3101549 0.3151515  0.5 -0.5
8      30   243     Diff     None   3 0.15 4 12 -0.6359888 0.3398693  0.5 -0.5
9      30   245     Same     Load   0 0.00 3 12 -0.9985288 0.3909774 -0.5  0.5
10     30   245     Same     Load   1 0.05 3 12 -0.9985288 0.3909774 -0.5  0.5
11     30   245     Same     Load   2 0.10 7 12  0.3101549 0.3151515 -0.5  0.5
12     30   245     Same     Load   3 0.15 6 12  0.0000000 0.3076923 -0.5  0.5

Fitting the model

The model that we are fitting is a simple line; the regression analysis will estimate the slope and intercept for the line, as well as how the various factors (and their interaction) modulate the slope and intercept terms (see the MLR paper for details). So the next step is to fit the full model, including all random effects for the Subject analysis.

bySubj.lmer <- lmer(elog ~ t1*S*L +  
                    (1 + t1 | AggID:SubjID) + (1 + t1 | SubjID), 
                    data=bySubj, weights=1/wts)

Note that we have broken what is actually a single command across three lines. This makes it easier to read (and explain).

First, bySubj.lmer is the name of the variable that will be created by the model fitting procedure lmer. The first argument to the lmer function is a formula, specifically elog ~ t1*S*L + (1 + t1 | AggID:SubjID) + (1 + t1 | SubjID). Note the tilde (~) where an equals sign would normally be; this is the conventional way to write a formula in R. The terms on the first line represent the DV (elog) and the predictor variables, which determine what fixed effects will be estimated. We used the syntax t1*S*L as shorthand to get all main effects and interactions between time (t1), speaker (S), and load (L), rather than laboriously writing out t1+S+L+S:L+t1:S+t1:L+t1:S:L.

The second line, (1 + t1 | AggID:SubjID) + (1 + t1 | SubjID), represents the random effects in the model. The pipe symbol | divides what we want to allow to vary (slope and intercept) from the units over which we will allow them to vary (aggregate nested within subject, AggID:SubjID, and subject, SubjID).

The third line just tells lmer the name of the data frame where our variables are located, and weights each observation by 1/wts, i.e., by the reciprocal of the variance.

Now let's have a look at the results.

summary(bySubj.lmer)
 1:  Linear mixed model fit by REML 
 2:  Formula: elog ~ t1 * S * L + (1 + t1 | AggID:SubjID) + (1 + t1 | SubjID) 
 3:     Data: bySubj 
 4:    AIC  BIC logLik deviance REMLdev
 5:   2468 2540  -1219     2437    2438
 6:  Random effects:
 7:   Groups       Name        Variance  Std.Dev. Corr   
 8:   AggID:SubjID (Intercept)  0.617466 0.78579         
 9:                t1          17.876056 4.22801  -0.452 
10:   SubjID       (Intercept)  0.011532 0.10739         
11:                t1           0.086934 0.29485  1.000  
12:   Residual                  0.112820 0.33589         
13:  Number of obs: 896, groups: AggID:SubjID, 224; SubjID, 56
14:  
15:  Fixed effects:
16:              Estimate Std. Error t value
17:  (Intercept) -1.25874    0.05664 -22.224
18:  t1           2.42555    0.32624   7.435
19:  S            0.32142    0.10958   2.933
20:  L           -0.02784    0.10958  -0.254
21:  t1:S        -2.27101    0.64770  -3.506
22:  t1:L        -0.33024    0.64770  -0.510
23:  S:L         -0.11625    0.21917  -0.530
24:  t1:S:L       0.87128    1.29544   0.673
25:  
26:  Correlation of Fixed Effects:
27:         (Intr) t1     S      L      t1:S   t1:L   S:L   
28:  t1     -0.440                                          
29:  S      -0.005  0.006                                   
30:  L      -0.001  0.000  0.004                            
31:  t1:S    0.006 -0.006 -0.490 -0.006                     
32:  t1:L    0.000 -0.001 -0.006 -0.490  0.007              
33:  S:L     0.004 -0.006 -0.001 -0.005  0.000  0.006       
34:  t1:S:L -0.005  0.007  0.000  0.006 -0.001 -0.006 -0.490

There is a lot of complicated output here, so let's take it step by step.

Line 5 tells us about the quality of the model fit, including the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), the log likelihood at convergence (logLik), the deviance (which is just -2 times the log likelihood), and the REMLdev, which is the deviance under the restricted maximum likelihood (REML) criterion that lme4 used to fit the model. If we wanted likelihood values that we could use in likelihood-ratio tests, we could refit the model with the option REML=FALSE. However, we will not be doing this in the current example.
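
For reference, such a comparison could look something like the sketch below (not run here). The model names are made up, and this assumes that the update and anova methods in your version of lme4 behave as they do for other fitted-model objects in R.

# refit with maximum likelihood rather than REML
bySubj.ml <- update(bySubj.lmer, REML=FALSE)
# drop the three-way interaction and compare fits with a likelihood-ratio test
bySubj.ml0 <- update(bySubj.ml, . ~ . - t1:S:L)
anova(bySubj.ml0, bySubj.ml)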

Lines 6-12 give us information about the random effects in the model. These values characterize the variance/covariance of two bivariate distributions, one at the subject level and one that describes variation associated with individual aggregates (cells of the design). There is also a residual term.
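
If you want to work with these estimates as numbers rather than reading them off the printed summary, the VarCorr extractor function returns the estimated variance-covariance matrices for each grouping factor:

# variance-covariance matrices of the random effects, by grouping factor
VarCorr(bySubj.lmer)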

Finally, lines 15-24 show the fixed effects in the model. Note that we get t-values, but no p-values. If we want p-values, we can treat the ts as if they were drawn from a normal distribution, using the pnorm function. The code below does this.

ts <- fixef(bySubj.lmer) / sqrt(diag(vcov(bySubj.lmer)))
print(round(ts,2))
ps <- 2*(1-pnorm(abs(ts)))
print(round(ps,3))
(Intercept)          t1           S           L        t1:S        t1:L 
     -22.22        7.43        2.93       -0.25       -3.51       -0.51 
        S:L      t1:S:L 
      -0.53        0.67
 (Intercept)          t1           S           L        t1:S        t1:L 
      0.000       0.000       0.003       0.799       0.000       0.610 
        S:L      t1:S:L 
      0.596       0.501

We calculate the t values by taking the parameter estimates and dividing them by their standard errors. We have to extract all of this information from the bySubj.lmer object that was created as the result of the call to the lmer function. We extracted the parameter estimates using the fixef function, and then divided through by the standard error, obtained using sqrt(diag(vcov(bySubj.lmer))). The function vcov(bySubj.lmer) gets the variance-covariance matrix for the fixed effects. Their variances are in the diagonal, which we extract using the diag function; and then we take the square root (sqrt) to get the standard errors. Another way to get p-values would be to use likelihood ratio tests, which we don't cover here.

The listing includes main effects and interactions. The effects S, L, and S:L are the effects of speaker, load, and their interaction at the intercept (where t1 equals 0), and thus represent anticipatory baseline effects. Note that in line 19 we see what looks like a main effect of speaker. Because we coded the speaker variable as Same=-.5 and Diff=.5, a positive parameter estimate of .32142 implies that the chance of looking at the target at the intercept was higher when the speaker was different than when the speaker was the same; we obtained a p-value of .003 for this effect. (See the MLR paper for a possible explanation for this baseline effect.) In line 21 we also see a speaker-by-time interaction, with the slope steeper for the same-speaker condition (since "same" was coded as negative) by about 2.27 logits per second, with a p-value less than .001.
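
To make the direction of the baseline effect concrete, here is an illustrative computation (not part of the original analysis) of the model's predicted empirical logit at the intercept for each speaker condition, collapsing over load, converted back to an approximate proportion with the inverse logit function plogis:

b <- fixef(bySubj.lmer)
# predicted elog at t1=0 for each speaker condition (with L=0)
elog.same <- unname(b["(Intercept)"] + b["S"] * -.5)
elog.diff <- unname(b["(Intercept)"] + b["S"] *  .5)
# inverse logit gives approximate proportions of looks to the target
round(plogis(c(Same=elog.same, Diff=elog.diff)), 3)

This works out to roughly .19 for the same-speaker condition versus .25 for the different-speaker condition at the start of the analysis window.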

So, we have an anticipatory baseline effect of Speaker, as well as an effect of Speaker on the slope that goes in the opposite direction (higher slopes for the same speaker), with no other effect approaching significance. This supports the idea that listeners were facilitated by hearing the precedent spoken in the same voice.

One can replicate the same analysis by Item using the code below. This yields basically the same pattern of results.

byItem$elog <- log( (byItem$y + .5) / (byItem$N - byItem$y + .5) )
byItem$wts <- 1/(byItem$y + .5) + 1/(byItem$N - byItem$y + .5)
byItem$S <- ifelse(byItem$SpeakerT=="Same",-.5,.5)
byItem$L <- ifelse(byItem$CogLoadT=="None",-.5,.5)

byItem.lmer <- lmer(elog ~ t1*S*L +  
                    (1 + t1 | AggID:ItemID) + (1 + t1 | ItemID), 
                    data=byItem, weights=1/wts)

print(summary(byItem.lmer),corr=F)

ts <- fixef(byItem.lmer) / sqrt(diag(vcov(byItem.lmer)))
print(round(ts,2))
ps <- 2*(1-pnorm(abs(ts)))
print(round(ps,3))
 Linear mixed model fit by REML 
Formula: elog ~ t1 * S * L + (1 + t1 | AggID:ItemID) + (1 + t1 | ItemID) 
   Data: byItem 
  AIC  BIC logLik deviance REMLdev
 1519 1582 -744.3     1490    1489
Random effects:
 Groups       Name        Variance Std.Dev. Corr   
 AggID:ItemID (Intercept) 0.319040 0.564836        
              t1          7.040989 2.653486 -0.463 
 ItemID       (Intercept) 0.020052 0.141604        
              t1          0.005126 0.071596 -1.000 
 Residual                 0.073227 0.270605        
Number of obs: 512, groups: AggID:ItemID, 128; ItemID, 32

Fixed effects:
            Estimate Std. Error t value
(Intercept) -1.29490    0.05732 -22.589
t1           2.73781    0.26735  10.241
S            0.37377    0.10314   3.624
L           -0.01247    0.10314  -0.121
t1:S        -2.41912    0.53409  -4.529
t1:L        -0.43510    0.53409  -0.815
S:L         -0.14867    0.20628  -0.721
t1:S:L       0.54104    1.06819   0.507
 (Intercept)          t1           S           L        t1:S        t1:L 
     -22.59       10.24        3.62       -0.12       -4.53       -0.81 
        S:L      t1:S:L 
      -0.72        0.51
 (Intercept)          t1           S           L        t1:S        t1:L 
      0.000       0.000       0.000       0.904       0.000       0.415 
        S:L      t1:S:L 
      0.471       0.613

Date: 2012-07-27 12:41:09 BST

Author: Dale Barr
