Random effects in real datasets

General Information

For generating the simulated populations included in the manuscript, we assumed a uniform distribution over a set of parameters, reflecting our lack of information about the nature of clustering in real datasets. To gain some preliminary information about real datasets, we revisited some of our own data using linear mixed-effect modeling.

Datasets

Description of datasets used in the analysis. The "Design" column describes the design of the study, indicating the number of levels of each factor (2) as well as whether it was within (W) or between (B) subjects / items.

| ID  | Source                      | Exp | N (subj/item) | Design (subj / item) | Task                                                     | Manipulation(s)                                        | Dependent variable(s)                                                |
|-----+-----------------------------+-----+---------------+----------------------+----------------------------------------------------------+--------------------------------------------------------+----------------------------------------------------------------------|
| 1–4 | Matsuki et al. (2011)       | 3   | 32/48         | 2W / 2W              | Silent reading (eyetracking)                             | High/low event prototypicality of patient noun         | First fixation, gaze duration, "go past" times, total time           |
| 5   | Yao & Scheepers (2011)      | 1   | 20/24         | 2Wx2W / 2Wx2W        | Oral reading                                             | Context (fast/slow), quotation style (direct/indirect) | Syllables per second                                                 |
| 6   | Yao & Scheepers (2011)      | 2   | 48/24         | 2Wx2W / 2Wx2W        | Silent reading                                           | Context (fast/slow), quotation style (direct/indirect) | "Go past" times (ms)                                                 |
| 7–8 | Levy et al. (2011)          | 1   | 41/24         | 2W / 2W              | Self-paced reading                                       | TODO                                                   | Reading times (same DV considered separately over two manipulations) |
| 10  | Keysar et al. (2000)        | 1   | 18/12         | 2W / 2W              | Visual-world eyetracking (spoken language comprehension) | Competitor present/absent                              | Latency of target gaze                                               |
| 11  | Kronmüller & Barr (2007)    | 2   | 56/32         | 2Wx2Wx2W / 2Wx2Wx2W  | Spoken language comprehension                            | Speaker, precedent, cognitive load                     | Response time                                                        |
| 12  | Barr & Seyfeddinipur (2011) | 1   | 92/12         | 2Wx2W / 2Wx2W        | Spoken language comprehension                            | Speaker, filled/unfilled pause                         | Distance of mouse cursor from target                                 |
| 13  | Gann & Barr (in press)      | 1   | 64/16         | 2Bx2Wx2W / 2Wx2Wx2B  | Referential communication                                | Listener, new/old referent, feedback                   | Speech onset latency                                                 |

Parameter space used in the simulations

For convenience, the distributions of population parameters as given in the original manuscript are reproduced below.

Ranges for the population parameters; $$\sim U(min, max)$$ means the parameter was sampled from a uniform distribution with range $$[min, max]$$.
| Parameter           | Description                                   | Value                               |
|---------------------+-----------------------------------------------+-------------------------------------|
| $$\beta_{0}$$       | grand-average intercept                       | $$\sim U(-3, 3)$$                   |
| $$\beta_{1}$$       | grand-average slope                           | 0 ($$H_0$$ true) or .8 ($$H_1$$ true) |
| $${\tau_{00}}^2$$   | by-subject variance of $$S_{0s}$$             | $$\sim U(0, 3)$$                    |
| $${\tau_{11}}^2$$   | by-subject variance of $$S_{1s}$$             | $$\sim U(0, 3)$$                    |
| $$\rho_S$$          | correlation between $$(S_{0s},S_{1s})$$ pairs | $$\sim U(-.8, .8)$$                 |
| $${\omega_{00}}^2$$ | by-item variance of $$I_{0i}$$                | $$\sim U(0, 3)$$                    |
| $${\omega_{11}}^2$$ | by-item variance of $$I_{1i}$$                | $$\sim U(0, 3)$$                    |
| $$\rho_I$$          | correlation between $$(I_{0i},I_{1i})$$ pairs | $$\sim U(-.8, .8)$$                 |
| $$\sigma^2$$        | residual error                                | $$\sim U(0, 3)$$                    |
| $$p_{missing}$$     | proportion of missing observations            | $$\sim U(.00, .05)$$                |
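
As a concrete illustration, this sampling scheme can be sketched as follows. This is a minimal Python sketch using only the standard library; the function and the dictionary keys are our own names, not the authors' original simulation code.

```python
import random

def sample_population(rng, h1_true=False):
    """Draw one set of population parameters from the uniform ranges
    in the table above (the variances are the squared tau/omega terms)."""
    u = rng.uniform
    return {
        "beta0": u(-3, 3),                  # grand-average intercept
        "beta1": 0.8 if h1_true else 0.0,   # grand-average slope
        "tau00_sq": u(0, 3),                # by-subject intercept variance
        "tau11_sq": u(0, 3),                # by-subject slope variance
        "rho_s": u(-0.8, 0.8),              # by-subject intercept/slope correlation
        "omega00_sq": u(0, 3),              # by-item intercept variance
        "omega11_sq": u(0, 3),              # by-item slope variance
        "rho_i": u(-0.8, 0.8),              # by-item intercept/slope correlation
        "sigma_sq": u(0, 3),                # residual variance
        "p_missing": u(0.00, 0.05),         # proportion of missing observations
    }

params = sample_population(random.Random(1))
```

Each simulated dataset in the manuscript corresponds to one such draw, with observations then generated under the mixed-effects model defined by these parameters.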

Analyses

Slope variance relative to intercept variance

The first analysis considered how much of the total variance related to a given sampling unit (subject or item) was attributable to the random slope versus the random intercept. We used the following formula:

$$\frac{{\tau_{11}}^2}{{\tau_{00}}^2+{\tau_{11}}^2}$$
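
Given variance components estimated from a fitted mixed-effects model, this proportion is straightforward to compute. Below is a minimal Python sketch; the function name and the example values are ours and purely illustrative.

```python
def slope_variance_proportion(intercept_var, slope_var):
    """Slope variance as a proportion of the total random-effect
    variance for a given sampling unit (subject or item)."""
    return slope_var / (intercept_var + slope_var)

# Illustrative values only: an intercept variance of 3.0 and a
# slope variance of 1.0 give a proportion of 0.25.
example = slope_variance_proportion(3.0, 1.0)  # -> 0.25
```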

Slope variance as a proportion of total variance for a given sampling unit

| ID   | Subject | Item   |
|------+---------+--------|
| 1    | .00003  | .73085 |
| 2    | .00564  | .37926 |
| 3    | .00362  | .29886 |
| 4    | .00035  | .04059 |
| 5    | .17032  | .25304 |
| 6    | .00463  | .03488 |
| 7    | .35727  | .51755 |
| 8    | .01663  | .39627 |
| 9    | .04805  | .44358 |
| 10   | .64403  | .04626 |
| 11   | .49218  | .49753 |
| 12   | .77840  | n/a    |
| 13   | .40245  | .57898 |
|------+---------+--------|
| MIN  | .00003  | .03488 |
| MEAN | .22489  | .35147 |
| MED  | .04805  | .38777 |
| MAX  | .77840  | .73085 |

There is a broad range across experiments, with slope variance accounting for anywhere from less than 1% of the total subject variance up to 78%. The by-item variance also shows broad dispersion, with slope variance accounting for between 3% and 73% of the total item variance. The by-subject measurements appear bimodally distributed, with observations clumping toward either end of the range.

Random effects in relation to residual variance

One thing that became apparent in our analysis of real datasets is that our simulations assumed the by-subject or by-item random effect variance to be roughly proportionate to the residual variance. This assumption is unlikely to hold in actual datasets, where the random effect variance is typically much smaller than the residual variance. In other words, actual datasets tend to be much noisier than our simulated datasets.

Below are the results for each dataset, showing the residual variance and the by-subject/by-item random effect variance as a proportion of this residual variance. For each dataset containing multiple factors (e.g., in 2x2 designs), we present the average by-subject and average by-item slopes.

| ID   | Residual ($$\sigma^2$$) | $${\tau_{00}}^2/\sigma^2$$ | $${\tau_{11}}^2/\sigma^2$$ | $${\omega_{00}}^2/\sigma^2$$ | $${\omega_{11}}^2/\sigma^2$$ |
|------+-------------------------+----------------------------+----------------------------+------------------------------+------------------------------|
| 1    | 3572                    | 0.2163                     | 0.0000                     | 0.0143                       | 0.0389                       |
| 2    | 8438.8531               | 0.1404                     | 0.0008                     | 0.0282                       | 0.0172                       |
| 3    | 24387.6356              | 0.1046                     | 0.0004                     | 0.1339                       | 0.0571                       |
| 4    | 29933.581               | 0.3207                     | 0.0001                     | 0.1473                       | 0.0062                       |
| 5    | 0.493532                | 1.8765                     | 0.3852                     | 1.0238                       | 0.3468                       |
| 6    | 275362.526              | 0.4492                     | 0.0021                     | 1.0269                       | 0.0371                       |
| 7    | 230191                  | 0.1058                     | 0.0588                     | 0.0910                       | 0.0976                       |
| 8    | 231824.16               | 0.1721                     | 0.0029                     | 0.0864                       | 0.0567                       |
| 9    | 51371.6                 | 0.4117                     | 0.0208                     | 0.1076                       | 0.0858                       |
| 10   | 7536625                 | 0.0363                     | 0.0656                     | 0.1198                       | 0.0058                       |
| 11   | 406043                  | 0.2286                     | 0.2216                     | 0.3269                       | 0.3237                       |
| 12   | 0.128353                | 0.1042                     | 0.3661                     | 0.0000                       | 0.0000                       |
| 13   | 242830                  | 0.4258                     | 0.2867                     | 0.0820                       | 0.1127                       |
|------+-------------------------+----------------------------+----------------------------+------------------------------+------------------------------|
| MEAN |                         | 0.3532                     | 0.1085                     | 0.2452                       | 0.0912                       |
| MED  |                         | 0.2163                     | 0.0208                     | 0.1076                       | 0.0567                       |
| MIN  |                         | 0.0363                     | 0.0000                     | 0.0000                       | 0.0000                       |
| MAX  |                         | 1.8765                     | 0.3852                     | 1.0269                       | 0.3468                       |

One thing that is apparent is that the by-subject and by-item random effects, expressed as a proportion of residual variance, vary widely across studies (from 0% to 187% of the residual variance), though typically they amount to only about 10-40% of it. Generally, we also see more variance in the intercepts than in the slopes. Note also that slope variance does not seem to be uniformly distributed over the range; rather, it clumps at the top and bottom of the range. It should be kept in mind that whereas intercept variances index differences in overall level, slope variances index differences in sensitivity to the manipulations. It is possible that participants (or items) were simply insensitive to some of the manipulations in these studies, yielding neither slope variance nor any overall effect.

Subsampling from the observed ranges

The next analysis addresses how unrepresentative the main results of our simulations might be. Specifically, did the parameter space we used lead us to be too pessimistic about random-intercepts-only models and model-selection approaches, and too optimistic about maximal models?

To address this, from the values reported in the previous section we derived the following plausible ranges from which to subsample our simulation data:

| Parameter                    | Min  | Max  |
|------------------------------+------+------|
| $${\tau_{00}}^2/\sigma^2$$   | 0.00 | 0.45 |
| $${\tau_{11}}^2/\sigma^2$$   | 0.00 | 0.40 |
| $${\omega_{00}}^2/\sigma^2$$ | 0.00 | 0.35 |
| $${\omega_{11}}^2/\sigma^2$$ | 0.00 | 0.35 |

This resulted in the selection of 3154 runs (about 3% of the total) for further analysis. On this subsample, we compared the power of maximal LMEMs to min-$$F'$$, $$F_1 \times F_2$$, RI-only LMEMs, and LMEMs using model selection for the random effects. From the various possible model selection techniques for within-items designs, we chose the best-performing model (the "backward best path" model, with $$\alpha$$ for inclusion set to .05) to see if it would improve power in this region of the space relative to the maximal model. The results are in the tables below.
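
The subsampling step can be sketched as follows; this is a hedged Python sketch in which the per-run dictionary format and all names are our own assumptions, not the authors' code.

```python
# Observed plausible ranges: parameter -> (min, max), as ratios to sigma^2.
RANGES = {
    "tau00_sq":   (0.00, 0.45),
    "tau11_sq":   (0.00, 0.40),
    "omega00_sq": (0.00, 0.35),
    "omega11_sq": (0.00, 0.35),
}

def in_subspace(run):
    """True if every random-effect variance of this run, scaled by the
    residual variance, lies within its observed range."""
    return all(lo <= run[p] / run["sigma_sq"] <= hi
               for p, (lo, hi) in RANGES.items())

# Two toy runs: the first falls inside all ranges, the second does not
# (tau00_sq / sigma_sq = 2.0 exceeds the 0.45 maximum).
runs = [
    {"tau00_sq": 0.3, "tau11_sq": 0.2, "omega00_sq": 0.1,
     "omega11_sq": 0.1, "sigma_sq": 1.0},
    {"tau00_sq": 2.0, "tau11_sq": 0.2, "omega00_sq": 0.1,
     "omega11_sq": 0.1, "sigma_sq": 1.0},
]
subsample = [r for r in runs if in_subspace(r)]  # keeps only the first run
```

Applying a filter of this kind to all simulation runs is what yields the roughly 3% subsample analyzed below.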

Type I error rate for original simulations and for the parameter subspace (wsbi = within-subject/between-item designs; wswi = within-subject/within-item designs; .12/.24 = number of items)

|                    | wsbi.12  | wsbi.12  | wsbi.24  | wsbi.24  | wswi.12  | wswi.12  | wswi.24  | wswi.24  |
|                    | Subspace | Original | Subspace | Original | Subspace | Original | Subspace | Original |
|--------------------+----------+----------+----------+----------+----------+----------+----------+----------|
| min-$$F'$$         | .0384    | .0445    | .0387    | .0446    | .0216    | .0271    | .0263    | .0307    |
| $$F_1 \times F_2$$ | .0653    | .0628    | .0770    | .0772    | .0549    | .0574    | .0656    | .0724    |
| LMEM, Maximal      | .0758    | .0703    | .0596    | .0575    | .0611    | .0589    | .0592    | .0559    |
| LMEM, Selection    | .0796    | .0702    | .0612    | .0575    | .1053    | .0683    | .0726    | .0579    |
| LMEM, RI-only      | .1055    | .1023    | .1027    | .1105    | .2483    | .4398    | .3167    | .4980    |

For between-items (wsbi) designs, the Type I error rates do not differ much from the original simulations for any of the analyses. For the within-items designs, ANOVA-based approaches and maximal LMEMs perform similarly on the subsample as they do on the original sample. However, model selection approaches become slightly more anticonservative, while random-intercepts-only LMEMs become substantially less anticonservative on the subsample. But even though RI-only LMEMs perform better, their Type I error rates still remain intolerably high (.25 and .32).

Power (and corrected power, CP) for original simulations and for the parameter subspace

|                         | wsbi.12  | wsbi.12  | wsbi.24  | wsbi.24  | wswi.12  | wswi.12  | wswi.24  | wswi.24  |
|                         | Subspace | Original | Subspace | Original | Subspace | Original | Subspace | Original |
|-------------------------+----------+----------+----------+----------+----------+----------+----------+----------|
| min-$$F'$$              | .3003    | .2099    | .4984    | .3281    | .4471    | .3268    | .6826    | .5116    |
| $$F_1 \times F_2$$      | .3675    | .2518    | .5961    | .4034    | .5961    | .4400    | .8098    | .6432    |
| LMEM, Maximal           | .3965    | .2672    | .5643    | .3636    | .6215    | .4603    | .7921    | .6104    |
| LMEM, Selection         | .4017    | .2689    | .5685    | .3636    | .6715    | .4730    | .8025    | .6120    |
| LMEM, RI-only           | .4543    | .3185    | .6368    | .4492    | .8708    | .8534    | .9610    | .9351    |
| $$F_1 \times F_2$$ (CP) | .3291    | .2236    | .5187    | .3375    | .5748    | .4158    | .7695    | .5780    |
| LMEM, Maximal (CP)      | .3242    | .2225    | .5322    | .3418    | .5830    | .4325    | .7685    | .5914    |
| LMEM, Selection (CP)    | .3266    | .2229    | .5301    | .3424    | .5200    | .4144    | .7495    | .5880    |
| LMEM, RI-only (CP)      | .3231    | .2156    | .5040    | .3140    | .6180    | .3791    | .7961    | .5313    |

It is notable that all approaches (including maximal LMEMs) are more powerful on the subspace than on the original dataset. When power is corrected for anticonservativity (rows labeled "CP" in the table), one interesting outcome is that in the parameter subspace, maximal LMEMs are nearly always just as powerful as, and occasionally even more powerful than, approaches using model selection. Finally, for within-items designs, RI-only LMEMs, once corrected for anticonservativity, showed only a very minor advantage relative to maximal LMEMs (6% and 4% increases in power for 12- and 24-item datasets, respectively). In contrast, for within-items designs, model selection approaches showed a disadvantage in corrected power relative to maximal LMEMs (11% and 2.5% drops for 12- and 24-item datasets, respectively).
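
The corrected-power idea can be sketched as follows. This reflects our assumption about the general procedure, not the authors' exact code: derive from the null (H0) simulation runs the empirical critical value that would hold the Type I error rate at $$\alpha$$, then count how many H1 runs exceed that stricter threshold.

```python
def corrected_power(null_stats, alt_stats, alpha=0.05):
    """Power evaluated against the empirical (1 - alpha) quantile of the
    null distribution of the test statistic, rather than the nominal
    critical value."""
    ranked = sorted(null_stats)
    # index of the empirical (1 - alpha) quantile of the null runs
    k = int((1 - alpha) * len(ranked))
    critical = ranked[min(k, len(ranked) - 1)]
    return sum(t > critical for t in alt_stats) / len(alt_stats)
```

With toy inputs, `corrected_power(list(range(100)), [90, 96, 97, 100])` uses the 95th-ranked null value (95) as the threshold, so three of the four alternative statistics exceed it, giving 0.75.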

Summary

In closing, the analyses of actual datasets show that our simulations assumed by-subject and by-item random variance to be a bigger portion of the total variance than actually turned out to be the case. Yet it was clear that even in the subregion of the parameter space spanning the range of the observed datasets, maximal models offered the best compromise between controlling Type I error and preserving power. Admittedly, we have no way of knowing whether our datasets are representative of the kinds of experimental datasets analyzed in experimental psychology. Nonetheless, these findings lend further confidence to our contention that maximal LMEMs provide the best approach for confirmatory hypothesis testing.

Sources of real datasets

Barr, D. J., & Seyfeddinipur, M. (2011). The role of fillers in listener attributions for speaker disfluency. Language and Cognitive Processes, 25, 441-455.

Gann, T. M., & Barr, D. J. (in press). Speaking from experience: Audience design as expert performance. Language and Cognitive Processes.

Keysar, B., Barr, D. J., Balin, J. A., & Brauner, J. S. (2000). Taking perspective in conversation: The role of mutual knowledge in comprehension. Psychological Science, 11, 32-38.

Levy, R., Fedorenko, E., Breen, M., & Gibson, E. (2011). The processing of extraposed structures in English. Cognition, 122, 12-36.

Kronmüller, E., & Barr, D. J. (2007). Perspective-free pragmatics: Broken precedents and the recovery-from-preemption hypothesis. Journal of Memory and Language, 56, 436-455.

Matsuki, K., Chow, T., Hare, M., Elman, J. L., Scheepers, C., & McRae, K. (2011). Event-based plausibility immediately influences on-line language comprehension. Journal of Experimental Psychology: Learning, Memory and Cognition, 37, 913-934.

Rohde, H., Levy, R., & Kehler, A. (2011). Anticipating explanations in relative clause processing. Cognition, 118, 339-358.

Yao, B., & Scheepers, C. (2011). Contextual modulation of reading rate for direct versus indirect speech quotations. Cognition, 121, 447-453.

Date: March 27, 2012
