# Random effects in real datasets

## General Information

For generating the simulated populations included in the manuscript, we assumed a uniform distribution over a set of parameters, reflecting our lack of information about the nature of clustering in real datasets. To gain some preliminary information about real datasets, we revisited some of our own data using linear mixed-effect modeling.

### Datasets

ID | Article | Expt. | Subjs/Items | Design | Task | Manipulation | Dependent variables |
---|---|---|---|---|---|---|---|
1–4 | Matsuki et al. (2011) | 3 | 32/48 | 2W / 2W | Silent reading (eyetracking) | High/low event prototypicality of patient noun | First fixation, gaze duration, "go past" times, total time
5 | Yao & Scheepers (2011) | 1 | 20/24 | 2Wx2W / 2Wx2W | Oral reading | Context (fast/slow), quotation style (direct/indirect) | Syllables per second
6 | Yao & Scheepers (2011) | 2 | 48/24 | 2Wx2W / 2Wx2W | Silent reading | Context (fast/slow), quotation style (direct/indirect) | "Go past" times (ms)
7–8 | Levy et al. (2011) | 1 | 41/24 | 2W / 2W | Self-paced reading | TODO | Reading times (same DV considered separately over two manipulations)
9 | Rohde et al. (2011) | 2 | 55/20 | 2W / 2W | Self-paced reading | TODO | Reading time
10 | Keysar et al. (2000) | 1 | 18/12 | 2W / 2W | Visual-world eyetracking (spoken language comprehension) | Competitor present/absent | Latency of target gaze
11 | Kronmüller & Barr (2007) | 2 | 56/32 | 2Wx2Wx2W / 2Wx2Wx2W | Spoken language comprehension | Speaker, precedent, cognitive load | Response time
12 | Barr & Seyfeddinipur (2011) | 1 | 92/12 | 2Wx2W / 2Wx2W | Spoken language comprehension | Speaker, filled/unfilled pause | Distance of mouse cursor from target
13 | Gann & Barr (in press) | 1 | 64/16 | 2Bx2Wx2W / 2Wx2Wx2B | Referential communication | Listener, new/old referent, feedback | Speech onset latency

### Parameter space used in the simulations

For convenience, the distributions of population parameters as given in the original manuscript are reproduced below.

Parameter | Description | Value |
---|---|---|
\( \beta_{0} \) | grand-average intercept | \( \sim U(-3, 3) \)
\( \beta_{1} \) | grand-average slope | 0 (\(H_{0}\) true) or .8 (\(H_{1}\) true)
\( {\tau_{00}}^2 \) | by-subject variance of \( S_{0s} \) | \( \sim U(0, 3) \)
\( {\tau_{11}}^2 \) | by-subject variance of \( S_{1s} \) | \( \sim U(0, 3) \)
\( \rho_S \) | correlation between \( (S_{0s}, S_{1s}) \) pairs | \( \sim U(-.8, .8) \)
\( {\omega_{00}}^2 \) | by-item variance of \( I_{0i} \) | \( \sim U(0, 3) \)
\( {\omega_{11}}^2 \) | by-item variance of \( I_{1i} \) | \( \sim U(0, 3) \)
\( \rho_I \) | correlation between \( (I_{0i}, I_{1i}) \) pairs | \( \sim U(-.8, .8) \)
\( \sigma^2 \) | residual error variance | \( \sim U(0, 3) \)
\( p_{missing} \) | proportion of missing observations | \( \sim U(.00, .05) \)
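
As a rough sketch of how one population parameter set could be drawn from these distributions (illustrative Python, not the original simulation code; the function and field names are our own):

```python
import random

def sample_population_params(h1_true=False, rng=None):
    """Draw one set of population parameters from the uniform
    distributions in the table above (illustrative sketch)."""
    rng = rng or random.Random()
    return {
        "beta0": rng.uniform(-3, 3),          # grand-average intercept
        "beta1": 0.8 if h1_true else 0.0,     # grand-average slope
        "tau00_sq": rng.uniform(0, 3),        # by-subject intercept variance
        "tau11_sq": rng.uniform(0, 3),        # by-subject slope variance
        "rho_s": rng.uniform(-0.8, 0.8),      # subject intercept/slope correlation
        "omega00_sq": rng.uniform(0, 3),      # by-item intercept variance
        "omega11_sq": rng.uniform(0, 3),      # by-item slope variance
        "rho_i": rng.uniform(-0.8, 0.8),      # item intercept/slope correlation
        "sigma_sq": rng.uniform(0, 3),        # residual variance
        "p_missing": rng.uniform(0.0, 0.05),  # proportion of missing data
    }
```

Each simulated population in the manuscript corresponds to one such draw, with \( \beta_1 \) fixed at 0 or .8 depending on whether the null hypothesis is true.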

## Analyses

### Slope variance relative to intercept variance

The first analysis considered how much of the total variance associated with a given sampling unit (subject or item) was attributable to the random slope versus the random intercept. For subjects, we used the formula

\[\frac{{\tau_{11}}^2}{{\tau_{00}}^2+{\tau_{11}}^2}\]

and the analogous ratio with \( \omega \) for items.

ID | Subject | Item |
---|---|---|
1 | .00003 | .73085
2 | .00564 | .37926
3 | .00362 | .29886
4 | .00035 | .04059
5 | .17032 | .25304
6 | .00463 | .03488
7 | .35727 | .51755
8 | .01663 | .39627
9 | .04805 | .44358
10 | .64403 | .04626
11 | .49218 | .49753
12 | .77840 | n/a
13 | .40245 | .57898
MIN | .00003 | .03488
MEAN | .22489 | .35147
MED | .04805 | .38777
MAX | .77840 | .73085

There is a broad range across experiments, with slope variance accounting for anywhere from less than 1% to 78% of the total by-subject variance. The by-item proportions also show broad dispersion, with slope variance accounting for 3% to 73% of the total by-item variance. The by-subject proportions appear bimodally distributed, clumping toward either end of the range.
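
Because the by-subject and by-item variances reported in the next section are scaled by a common residual variance, the residual cancels in this ratio, so the proportion can be computed from either raw or residual-scaled estimates. A minimal sketch (the function name is ours):

```python
def slope_proportion(intercept_var, slope_var):
    """Share of a unit's random-effect variance carried by the random
    slope: tau11^2 / (tau00^2 + tau11^2)."""
    return slope_var / (intercept_var + slope_var)

# Dataset 11 (by-subject): tau00^2 = 0.2286, tau11^2 = 0.2216, both as
# proportions of residual variance, reproduce (up to rounding) the
# .49218 entry in the table above.
print(round(slope_proportion(0.2286, 0.2216), 3))  # → 0.492
```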

### Random effects in relation to residual variance

One factor that became apparent from our analysis of real datasets is that our simulations assumed by-subject and by-item random-effect variances roughly comparable in magnitude to the residual variance. This assumption is unlikely to hold in actual datasets, where the random-effect variance is typically much smaller than the residual variance. In other words, actual datasets tend to be much noisier than our simulated datasets.

Below are the results for each dataset, showing the residual variance alongside the by-subject and by-item random-effect variances expressed as proportions of that residual variance. For datasets with multiple factors (e.g., 2x2 designs), we report the average of the by-subject and by-item slope variances.

ID | Residual (\(\sigma^2\)) | \({\tau_{00}}^2/\sigma^2\) | \({\tau_{11}}^2/\sigma^2\) | \({\omega_{00}}^2/\sigma^2\) | \({\omega_{11}}^2/\sigma^2\) |
---|---|---|---|---|---|
1 | 3572 | 0.2163 | 0.0000 | 0.0143 | 0.0389
2 | 8438.8531 | 0.1404 | 0.0008 | 0.0282 | 0.0172
3 | 24387.6356 | 0.1046 | 0.0004 | 0.1339 | 0.0571
4 | 29933.581 | 0.3207 | 0.0001 | 0.1473 | 0.0062
5 | 0.493532 | 1.8765 | 0.3852 | 1.0238 | 0.3468
6 | 275362.526 | 0.4492 | 0.0021 | 1.0269 | 0.0371
7 | 230191 | 0.1058 | 0.0588 | 0.0910 | 0.0976
8 | 231824.16 | 0.1721 | 0.0029 | 0.0864 | 0.0567
9 | 51371.6 | 0.4117 | 0.0208 | 0.1076 | 0.0858
10 | 7536625 | 0.0363 | 0.0656 | 0.1198 | 0.0058
11 | 406043 | 0.2286 | 0.2216 | 0.3269 | 0.3237
12 | 0.128353 | 0.1042 | 0.3661 | 0.0000 | 0.0000
13 | 242830 | 0.4258 | 0.2867 | 0.0820 | 0.1127
MEAN | | 0.3532 | 0.1085 | 0.2452 | 0.0912
MED | | 0.2163 | 0.0208 | 0.1076 | 0.0567
MIN | | 0.0363 | 0.0000 | 0.0000 | 0.0000
MAX | | 1.8765 | 0.3852 | 1.0269 | 0.3468

One thing that is apparent is that the by-subject and by-item random effects, expressed as a proportion of residual variance, vary widely across studies (from 0% to 188% of residual variance). Typically, they amount to only about 10-40% of the residual variance. Generally, we also see more variance on the intercept than on the slope. Note also that slope variance does not appear to be uniformly distributed over its range; rather, it clumps at the top and bottom of the range. It should be kept in mind that whereas intercept variances index differences in overall level, slope variances index differences in sensitivity to manipulations. It is possible that participants (or items) were simply insensitive to some of the manipulations in these studies, yielding neither slope variance nor any overall effect.
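
A small helper showing how the proportions in the table are obtained from raw model estimates (an illustrative sketch; the names and the raw values in the example are hypothetical):

```python
def scale_by_residual(variances, residual_var):
    """Express raw random-effect variance estimates as proportions of
    the residual variance, as reported in the table above."""
    return {name: v / residual_var for name, v in variances.items()}

# Hypothetical raw estimates for a dataset with residual variance 4.0:
ratios = scale_by_residual({"tau00_sq": 1.0, "tau11_sq": 0.2}, 4.0)
```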

### Subsampling from the observed ranges

The next analysis addresses how unrepresentative the main results from our simulations might be. Specifically, did the parameter space we used lead us to be too pessimistic about random-intercepts-only models and model-selection approaches and too optimistic about maximal models?

To address this, from the values reported in the previous section we derived the following plausible ranges from which to subsample our simulation data:

Parameter | Min | Max |
---|---|---|
\({\tau_{00}}^2/\sigma^2\) | 0.00 | 0.45
\({\tau_{11}}^2/\sigma^2\) | 0.00 | 0.40
\({\omega_{00}}^2/\sigma^2\) | 0.00 | 0.35
\({\omega_{11}}^2/\sigma^2\) | 0.00 | 0.35
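
The subsampling step can be sketched as a simple filter over the simulation runs (illustrative Python; the record fields and function names are our own, not the original code):

```python
# Plausible ranges for each random-effect variance, expressed as a
# proportion of the residual variance (from the table above).
PLAUSIBLE_RANGES = {
    "tau00_sq": (0.00, 0.45),
    "tau11_sq": (0.00, 0.40),
    "omega00_sq": (0.00, 0.35),
    "omega11_sq": (0.00, 0.35),
}

def in_subspace(run):
    """True if every random-effect variance in this simulation run,
    scaled by its residual variance, lies within its plausible range."""
    return all(
        lo <= run[name] / run["sigma_sq"] <= hi
        for name, (lo, hi) in PLAUSIBLE_RANGES.items()
    )

runs = [
    {"tau00_sq": 0.3, "tau11_sq": 0.2, "omega00_sq": 0.1,
     "omega11_sq": 0.1, "sigma_sq": 1.0},   # inside all ranges
    {"tau00_sq": 2.0, "tau11_sq": 0.2, "omega00_sq": 0.1,
     "omega11_sq": 0.1, "sigma_sq": 1.0},   # tau00^2 ratio too large
]
subsample = [r for r in runs if in_subspace(r)]  # keeps only the first run
```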

This resulted in the selection of 3154 runs (about 3% of the total) for further analysis. On this subsample, we compared the performance of maximal LMEMs to min-\(F'\), \(F_1 \times F_2\), RI-only LMEMs, and LMEMs using model selection for the random effects. Of the various model selection techniques for within-items designs, we chose the best-performing one (the "backward best path" model, with \(\alpha\) for inclusion set to .05) to see whether it would improve power in this region of the space relative to the maximal model. Type I error rates are given in the first table below, followed by power.

Analysis | wsbi.12 subspace | wsbi.12 original | wsbi.24 subspace | wsbi.24 original | wswi.12 subspace | wswi.12 original | wswi.24 subspace | wswi.24 original |
---|---|---|---|---|---|---|---|---|
min-\(F'\) | .0384 | .0445 | .0387 | .0446 | .0216 | .0271 | .0263 | .0307
\(F_1 \times F_2\) | .0653 | .0628 | .0770 | .0772 | .0549 | .0574 | .0656 | .0724
LMEM, Maximal | .0758 | .0703 | .0596 | .0575 | .0611 | .0589 | .0592 | .0559
LMEM, Selection | .0796 | .0702 | .0612 | .0575 | .1053 | .0683 | .0726 | .0579
LMEM, RI-only | .1055 | .1023 | .1027 | .1105 | .2483 | .4398 | .3167 | .4980

For between-items (wsbi) designs, Type I error rates on the subsample do not differ much from the original simulations for any of the analyses. For within-items (wswi) designs, ANOVA-based approaches and maximal LMEMs perform similarly on the subsample and on the original sample. However, model selection approaches become slightly more anticonservative, while random-intercepts-only LMEMs become substantially less anticonservative on the subsample. Even so, although RI-only LMEMs perform better, their Type I error rates remain intolerably high (.25 and .32).

The corresponding power estimates are shown below; rows labeled "CP" give power corrected for anticonservativity.

Analysis | wsbi.12 subspace | wsbi.12 original | wsbi.24 subspace | wsbi.24 original | wswi.12 subspace | wswi.12 original | wswi.24 subspace | wswi.24 original |
---|---|---|---|---|---|---|---|---|
min-\(F'\) | .3003 | .2099 | .4984 | .3281 | .4471 | .3268 | .6826 | .5116
\(F_1 \times F_2\) | .3675 | .2518 | .5961 | .4034 | .5961 | .4400 | .8098 | .6432
LMEM, Maximal | .3965 | .2672 | .5643 | .3636 | .6215 | .4603 | .7921 | .6104
LMEM, Selection | .4017 | .2689 | .5685 | .3636 | .6715 | .4730 | .8025 | .6120
LMEM, RI-only | .4543 | .3185 | .6368 | .4492 | .8708 | .8534 | .9610 | .9351
\(F_1 \times F_2\) (CP) | .3291 | .2236 | .5187 | .3375 | .5748 | .4158 | .7695 | .5780
LMEM, Maximal (CP) | .3242 | .2225 | .5322 | .3418 | .5830 | .4325 | .7685 | .5914
LMEM, Selection (CP) | .3266 | .2229 | .5301 | .3424 | .5200 | .4144 | .7495 | .5880
LMEM, RI-only (CP) | .3231 | .2156 | .5040 | .3140 | .6180 | .3791 | .7961 | .5313

It is notable that all approaches (including maximal LMEMs) are more powerful on the subspace than on the original dataset. When power is corrected for anticonservativity (rows labeled "CP" in the table), one interesting outcome is that in the parameter subspace, maximal LMEMs are nearly always at least as powerful as, and occasionally more powerful than, approaches using model selection. Finally, for within-items designs, RI-only LMEMs, once corrected for anticonservativity, showed only a very minor advantage relative to maximal LMEMs (6% and 4% relative increases in power for 12- and 24-item datasets, respectively). In contrast, model selection approaches showed a disadvantage in corrected power relative to maximal LMEMs (11% and 2.5% relative drops for 12- and 24-item datasets, respectively).
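
The relative power comparisons above can be reproduced directly from the CP rows of the table; for example (a sketch, with a helper name of our own):

```python
def relative_change(new, baseline):
    """Relative power difference of one analysis versus a baseline."""
    return (new - baseline) / baseline

# RI-only vs. maximal LMEM, corrected power, wswi.12 and wswi.24:
print(round(relative_change(0.6180, 0.5830), 3))  # → 0.06
print(round(relative_change(0.7961, 0.7685), 3))  # → 0.036

# Model selection vs. maximal LMEM, corrected power, wswi.12 (a drop):
print(round(relative_change(0.5200, 0.5830), 3))  # → -0.108
```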

## Summary

In closing, the analyses of actual datasets show that our simulations assumed by-subject and by-item random-effect variance made up a larger portion of the total variance than actually turned out to be the case. Yet even for the subregion of the parameter space spanning the range of the observed datasets, maximal models offered the best compromise between controlling Type I error and preserving power. Unfortunately, we have no way of knowing whether our datasets are representative of the kinds of experimental datasets analyzed in experimental psychology. Nonetheless, these findings lend further confidence to our contention that maximal LMEMs provide the best approach for confirmatory hypothesis testing.

## Sources of real datasets

Barr, D. J., & Seyfeddinipur, M. (2011). The role of fillers in
listener attributions for speaker disfluency. *Language and Cognitive Processes*, *25*, 441-455.

Gann, T. M., & Barr, D. J. (in press). Speaking from experience:
Audience design as expert performance. *Language and Cognitive Processes*.

Keysar, B., Barr, D. J., Balin, J. A., & Brauner, J. S. (2000).
Taking perspective in conversation: The role of mutual knowledge in
comprehension. *Psychological Science*, *11*, 32-38.

Kronmüller, E., & Barr, D. J. (2007). Perspective-free
pragmatics: Broken precedents and the recovery-from-preemption
hypothesis. *Journal of Memory and Language*, *56*, 436-455.

Levy, R., Fedorenko, E., Breen, M., & Gibson, E. (2011). The
processing of extraposed structures in English. *Cognition*, *122*,
12-36.

Matsuki, K., Chow, T., Hare, M., Elman, J. L., Scheepers, C., & McRae, K. (2011). Event-based plausibility immediately influences on-line
language comprehension. *Journal of Experimental Psychology: Learning, Memory and Cognition*, *37*, 913-934.

Rohde, H., Levy, R., & Kehler, A. (2011). Anticipating explanations
in relative clause processing. *Cognition*, *118*, 339-358.

Yao, B., & Scheepers, C. (2011). Contextual modulation of reading rate
for direct versus indirect speech quotations. *Cognition*, *121*,
447-453.