# Aggregation: How much is enough?

One problem with eyetracking data sets is that the response is
*oversampled*. The term *oversampling* is usually used in
the context of <a
href="http://en.wikipedia.org/wiki/Oversampling">audio sampling</a>,
rather than in statistics, but it captures an important idea. We
record frames of eye data at a very high rate (60-250 Hz), and the
time between two adjacent frames (17 ms or less) is possibly shorter
than the interval at which decisions to move the eyes (or to remain
in place) are made and sent to the oculomotor system. Even though a
listener's beliefs about the identity of the target may be rapidly
changing based on the incoming speech, it takes time both to program
an eye movement (around 180 ms or so) as well as to move the eyes
through space from one region to another. These factors create
dependencies in the data set, making it difficult to apply standard
statistical techniques. One trick to overcome this problem is to
aggregate frames over time and over multiple trials, calculating a
single independent number that will stand in for many dependent
individual observations. The use of aggregation for these ends is
characteristic both of the <a
href="http://magnuson.psy.uconn.edu/GCA/">Mirman et al. 'Growth Curve'
approach</a> (which uses a proportional transformation) and the
quasi-MLR approach (which uses empirical logit).
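To make the transform concrete, here is a minimal sketch in Python (the original analyses may well have been run in other software). The empirical logit uses the standard +0.5 correction so the log stays finite even when a bin contains all (or no) target fixations:

```python
import math

def empirical_logit(y, n):
    """Empirical logit of y target-fixation frames out of n frames
    in a bin. The +0.5 in numerator and denominator keeps the log
    finite when y is 0 or n."""
    return math.log((y + 0.5) / (n - y + 0.5))

# A bin where the listener fixated the target on 14 of 18 frames:
print(empirical_logit(14, 18))  # positive: mostly target fixations
# A bin split evenly yields exactly zero:
print(empirical_logit(5, 10))   # 0.0
```

Each bin thus contributes one quasi-continuous value in place of many dependent frame-level 0/1 observations.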

The question is: how much aggregation is needed to achieve independence?

To address this question, I have conducted Monte Carlo randomization tests on a large eyetracking data set (20 participants, 72 items per participant); the data will be described in detail in an upcoming paper. What I would like to do here is present the results in preliminary fashion and offer some heuristics for deciding how much aggregation you might need.

Here are the basic things that you need to know about what I did. A randomization test was used to explore the effects of bin size, as well as the number of trials over which data are aggregated, on the Type I error rate (detecting an effect when none exists). The eye data were sampled at a rate of 60 Hz (one frame about every 17 ms). The data are shown in the figure below. It is evident that there is a difference in anticipation across conditions, but that both conditions show the same rate of increase (i.e., the slopes are equal).

The basic idea was to randomly shuffle the condition labels within
each subject's data to create a 'pseudo-experiment' and then run
quasi-MLR analyses on each of these shuffled data sets. The
simulation was repeated numerous times to determine both the Type I
error rate and power. In these simulated data sets, there should be
no anticipation effect (since the condition labels are determined by a
coin flip) nor any difference in rate between the conditions.
Not finding any difference when none exists is good, but that could
mean two different things: (1) observations are independent at these
parameter settings; (2) too much aggregation has undermined the power
of our analysis to detect *any* effect. One way of
distinguishing these possibilities is to see whether the analysis
detects a rising slope in both conditions. We want to choose the
parameter settings that minimize Type I error while maximizing our
chances of detecting effects that are actually there.
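In code, the two quantities being estimated can be sketched as follows (Python, with hypothetical names; the p-values would come from the quasi-MLR fits on each shuffled data set):

```python
def type1_and_power(condition_pvals, slope_pvals, alpha=0.05):
    """Type I error rate: fraction of pseudo-experiments in which the
    (spurious, shuffled) condition effect reached significance.
    Power: fraction of runs detecting the genuinely rising slope."""
    type1 = sum(p < alpha for p in condition_pvals) / len(condition_pvals)
    power = sum(p < alpha for p in slope_pvals) / len(slope_pvals)
    return type1, power

# e.g., 2 of 4 runs falsely significant; 1 of 2 slopes detected:
print(type1_and_power([0.01, 0.20, 0.30, 0.04], [0.001, 0.20]))  # (0.5, 0.5)
```

A good parameter setting drives the first number down toward (or below) alpha while keeping the second number high.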

The analysis window spanned from 433 ms to 733 ms, a total of 18 frames. This window was chosen because it was the first moment at which language-driven eye movements were observed, and a 300 ms window seemed small enough to allow the fitting of a simple linear model to the data. (Although 433 ms seems late, in this experiment, listeners were given no preview of the pictures before hearing the speech. So it's actually on the fast side, given the additional processing that would have had to take place for recognizing the pictures.)

In the simulations, the bin size variable could take on the values of 3, 6, or 9 frames (50, 100, and 150 ms, respectively). The number of items in each 'pseudo' data set ranged from 2-6, with the items randomly sampled from the larger set of 72 used in the full experiment.
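Binning itself is simple aggregation. A sketch (Python, with a hypothetical helper) of turning frame-level 0/1 target-fixation codes into per-bin counts, ready for the empirical logit transform:

```python
def bin_frames(frames, bin_size):
    """Aggregate 0/1 frame-level fixation codes into bins,
    returning a (hits, frames) count pair per bin. Assumes the
    window length is a multiple of the bin size, as with the
    18-frame window used here."""
    return [(sum(frames[i:i + bin_size]), bin_size)
            for i in range(0, len(frames), bin_size)]

# The 18-frame (433-733 ms) window in 3-frame (50 ms) bins:
frames = [0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
print(bin_frames(frames, 3))  # six (hits, 3) pairs
```

With 3-frame bins the 18-frame window yields six data points per trial; with 9-frame bins, only two.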

There were 1000 simulations (runs) at each parameter setting. The procedure was as follows: (1) choose the number of items for the pseudo-experiment (2-6 items); (2) randomly select that many items from the larger set of 72; (3) query out participants' data for those items; (4) randomly assign the data to conditions (by shuffling the labels across trials within each participant); (5) group the data into bins (3, 6, or 9 frames per bin); (6) analyze the data using quasi-MLR (empirical logit regression); (7) repeat steps (5)-(6) for each remaining bin size; (8) repeat steps (2)-(7) 999 more times before moving on to the next item count and starting all over again.
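The label shuffle in step (4) is the core of the randomization test. A sketch in Python (the data structures are hypothetical, chosen for illustration):

```python
import random

def shuffle_labels(trials, rng=random):
    """Within one participant, randomly reassign condition labels
    across trials while leaving the eye data untouched.
    `trials` is a list of (condition_label, trial_data) pairs."""
    labels = [label for label, _ in trials]
    rng.shuffle(labels)
    return [(new, data) for new, (_, data) in zip(labels, trials)]

trials = [("early", [0, 0, 1]), ("early", [0, 1, 1]),
          ("late",  [0, 0, 0]), ("late",  [0, 0, 1])]
pseudo = shuffle_labels(trials)
# Same labels and same eye data, but the pairing is now random.
```

Because the shuffle happens within each participant, any between-participant differences are preserved; only the label-to-data pairing is randomized, which is exactly what makes a real condition effect impossible by construction.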

Here are the results.

The left panel shows the Type I error rate, with the gray horizontal bar marking the conventional error rate of .05. The right panel shows the proportion of simulations detecting a significantly increasing slope, a measure of the power of the analysis (note that the two graphs use different scales). What is immediately notable in the left panel is that the Type I error rate was at or below the .05 level for all simulations aggregated over 3 or more items per condition. There does not seem to be much benefit to using bins larger than 50 ms, given the cost to power, which was especially large for the 150 ms bin size. The only case in which the larger bin size might be useful is when you have only one item per condition; in that case, it appears to adequately protect against Type I error without any compromise in power.

Bottom line: if you are going to use MLR with an empirical logit transform, use 50 ms bins with at least three items per condition. To have adequate power, it is recommended that you use 5 or 6 items per condition, but increasing the number of items beyond that may not yield additional benefits to power because of the information loss due to aggregation. If you have only one item per condition, a larger bin size (at least 150 ms) is recommended.

Now, I should offer a caveat: I have only tried a simple linear model. It is not immediately clear, though, why things would differ with a more complex model (e.g., quadratic or cubic) or a larger analysis window. In conclusion, the results are good news for an aggregative approach, showing that just a modicum of aggregation can overcome the oversampling problem.

Date: 2010-12-03 11:34:50 GMT