Random effects structure for confirmatory hypothesis testing: Keep it maximal

Dale J Barr, Roger Levy, Christoph Scheepers, Harry J Tily

Abstract

Linear mixed-effects models (LMEMs) have become increasingly prominent in psycholinguistics and related areas. However, many researchers do not seem to appreciate how random effects structures affect the generalizability of an analysis. Here, we argue that researchers using LMEMs for confirmatory hypothesis testing should minimally adhere to the standards that have been in place for many decades. Through theoretical arguments and Monte Carlo simulation, we show that LMEMs generalize best when they include the maximal random effects structure justified by the design. The generalization performance of LMEMs including data-driven random effects structures strongly depends upon modeling criteria and sample size, yielding reasonable results on moderately-sized samples when conservative criteria are used, but with little or no power advantage over maximal models. Finally, random-intercepts-only LMEMs used on within-subjects and/or within-items data from populations where subjects and/or items vary in their sensitivity to experimental manipulations always generalize worse than separate F1 and F2 tests, and in many cases, even worse than F1 alone. Maximal LMEMs should be the 'gold standard' for confirmatory hypothesis testing in psycholinguistics and beyond.
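The generative model implied by the abstract and the figure captions (by-subject and by-item random intercepts, plus by-subject random slopes for the experimental manipulation) can be illustrated with a short simulation. This is an illustrative sketch only, not the authors' simulation code; the function name and all parameter values (grand mean, effect size, variance components) are arbitrary choices for demonstration.

```python
import numpy as np

def simulate_rts(n_subj=24, n_items=24, grand_mean=800.0, effect=40.0,
                 sd_subj_int=100.0, sd_subj_slope=40.0,
                 sd_item_int=80.0, sd_noise=50.0, seed=0):
    """Simulate RTs for a within-subject, between-item design.

    Each subject sees every item; half the items are in condition A
    (coded 0) and half in condition B (coded 1). Generative model:
      RT = grand_mean + S0_s + I0_i + (effect + S1_s) * cond + e
    where S0 and S1 are by-subject random intercepts and slopes,
    I0 is a by-item random intercept, and e is trial-level noise.
    """
    rng = np.random.default_rng(seed)
    S0 = rng.normal(0, sd_subj_int, n_subj)    # by-subject intercepts
    S1 = rng.normal(0, sd_subj_slope, n_subj)  # by-subject slopes
    I0 = rng.normal(0, sd_item_int, n_items)   # by-item intercepts
    cond = np.repeat([0, 1], n_items // 2)     # between-item condition code
    subj = np.repeat(np.arange(n_subj), n_items)   # subject index per trial
    item = np.tile(np.arange(n_items), n_subj)     # item index per trial
    rt = (grand_mean + S0[subj] + I0[item]
          + (effect + S1[subj]) * cond[item]
          + rng.normal(0, sd_noise, n_subj * n_items))
    return subj, item, cond[item], rt

subj, item, cond, rt = simulate_rts()
```

A random-intercepts-only analysis of such data omits the `S1` term, which is exactly the mismatch the abstract identifies as the source of anti-conservativity.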

Keywords: Monte Carlo simulation; generalization; linear mixed-effects models; statistics.

Figures

Figure 1
Example RT data (open symbols) and model predictions (filled symbols) for a hypothetical lexical decision experiment with two within-subject/between-item conditions, A (triangles) and B (circles), including four subjects (S1–S4) and four items (I1–I4). Panel (a) illustrates a model with no random effects, considering only the baseline average RT (response to word type A) and treatment effect; panel (b) adds random subject intercepts to the model; panel (c) adds by-subject random slopes; and panel (d) illustrates the additional inclusion of by-item random intercepts. Panel (d) represents the maximal random-effects structure justified for this design; any remaining discrepancies between observed data and model estimates are due to trial-level measurement error (e_si).
Figure 2
Performance of model selection approaches for within-items designs, as a function of selection algorithm and α level for testing slopes. The p-values for all LMEMs in the figure are from likelihood-ratio tests. Top row: 12 items; bottom row: 24 items. BB = Backwards, “best path”; BI = Backwards, Item-Slope First; BS = Backwards, Subject-Slope First; FB = Forwards, “best path”; FI = Forwards, Item-Slope First; FS = Forwards, Subject-Slope First.
Figure 3
Type I error (top two rows) and power (bottom two rows) for between-items design with 24 items, as a function of by-subject random slope variance τ₁₁² and by-item random intercept variance ω₀₀². The p-values for all LMEMs in the figure are from likelihood-ratio tests. All model selection approaches in the figure had α = .05 for slope inclusion. The heatmaps from the 12-item datasets show similar patterns, and are presented in the appendix.
Figure 4
Type I error (top three rows) and power (bottom three rows) for within-items design with 24 items, as a function of by-subject random slope variance τ₁₁² and by-item random slope variance ω₁₁². The p-values for all LMEMs in the figure are from likelihood-ratio tests. All model selection approaches in the figure had α = .05 for slope inclusion. The heatmaps from the 12-item datasets show similar patterns, and are presented in the online appendix.
Figure 5
Type I error (top two rows) and power (bottom two rows) for design-driven approaches on within-items data, as a function of by-subject random slope variance τ₁₁² and by-item random slope variance ω₁₁². The p-values for all LMEMs in the figure are from likelihood-ratio tests. All approaches in the figure tested random slopes at α = .05.
Figure 6
Statistical power for maximal LMEMs, random-intercepts-only LMEMs, min-F′, and F1 × F2 as a function of effect size, when the generative model underlying the data is a random-intercepts-only LMEM.
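Several of the captions above note that p-values for the LMEMs came from likelihood-ratio tests, and that the model-selection algorithms tested random slopes at a given α level. The test itself compares nested models: twice the log-likelihood difference is referred to a χ² distribution with degrees of freedom equal to the number of parameters dropped. A minimal sketch (the helper name and the log-likelihood values below are hypothetical, chosen only for illustration):

```python
from scipy.stats import chi2

def lrt_pvalue(loglik_full, loglik_reduced, df_diff):
    """Likelihood-ratio test for two nested models.

    The statistic 2 * (ll_full - ll_reduced) is asymptotically
    chi-squared distributed with df_diff degrees of freedom (the
    number of parameters the reduced model drops).
    """
    stat = 2.0 * (loglik_full - loglik_reduced)
    return chi2.sf(stat, df_diff)

# E.g., deciding whether to keep a by-item random slope: compare a fit
# with the slope (hypothetical ll = -1234.5) to one without (ll = -1238.0);
# dropping the slope removes its variance plus one covariance term.
p = lrt_pvalue(-1234.5, -1238.0, df_diff=2)
keep_slope = p < 0.05  # the slope-inclusion criterion used in the figures
```

In a slope-testing algorithm such as those compared in Figure 2, this p-value would be compared against the α level on the x-axis to decide whether the random slope stays in the model.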

Source: PubMed
