Analyzing Clustered Data: Why and How to Account for Multiple Observations Nested within a Study Participant?
Erika L Moen, Catherine J Fricano-Kugler, Bryan W Luikart, A James O'Malley
Abstract
A conventional study design among medical and biological experimentalists involves collecting multiple measurements from each study subject. For example, experiments utilizing mouse models in neuroscience often involve collecting multiple neuron measurements per mouse to increase the number of observations without requiring a large number of mice. This leads to a form of statistical dependence referred to as clustering. Inappropriate analyses of clustered data have prompted several recent critiques of neuroscience research, which suggest that the bar for statistical analyses within the field is set too low. We compare naïve analytical approaches to marginal, fixed-effect, and mixed-effect models and provide guidelines for when each of these models is most appropriate based on study design. We demonstrate the influence of clustering on a between-mouse treatment effect, a within-mouse treatment effect, and an interaction effect between the two. Our analyses demonstrate that these statistical approaches can give substantially different results, primarily when the analyses include a between-mouse treatment effect. In an analysis that is novel from a neuroscience perspective, we also refine the mixed-effect approach by including, as an additional predictor, an aggregate mouse-level counterpart to a within-mouse (neuron-level) treatment. This adapts an advanced modeling technique that has been used in social science research, and we show that it yields more informative results. Based on these findings, we emphasize the importance of appropriate analyses of clustered data, and we aim for this work to serve as a resource for deciding which approach best suits a given study.
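To make the effect of clustering on a between-mouse treatment comparison concrete, the following is a minimal sketch on simulated data, not the authors' analysis. It assumes a hypothetical design (5 mice per arm, 20 neurons per mouse), hypothetical variance settings, and a true treatment effect of zero. It contrasts a naïve standard error that treats every neuron as independent with one computed from per-mouse means, which is one simple way to respect the clustering; the marginal and mixed-effect models discussed in the abstract are more general alternatives that this sketch does not implement.

```python
import random
import statistics


def simulate(n_mice_per_arm=5, n_neurons=20, mouse_sd=2.0, neuron_sd=1.0,
             effect=0.0, seed=0):
    """Simulate neuron measurements nested within mice (all settings hypothetical)."""
    rng = random.Random(seed)
    data = {}
    for arm, shift in (("control", 0.0), ("treated", effect)):
        mice = []
        for _ in range(n_mice_per_arm):
            # Each mouse has its own baseline; this shared baseline is
            # what makes neurons from the same mouse correlated.
            mouse_mean = shift + rng.gauss(0.0, mouse_sd)
            mice.append([mouse_mean + rng.gauss(0.0, neuron_sd)
                         for _ in range(n_neurons)])
        data[arm] = mice
    return data


def se_of_difference(a, b):
    """Standard error of the difference between two sample means."""
    return (statistics.variance(a) / len(a)
            + statistics.variance(b) / len(b)) ** 0.5


data = simulate()

# Naive analysis: pool all neurons and ignore which mouse they came from.
control_neurons = [y for mouse in data["control"] for y in mouse]
treated_neurons = [y for mouse in data["treated"] for y in mouse]
naive_se = se_of_difference(control_neurons, treated_neurons)

# Cluster-aware analysis: reduce each mouse to its mean, so the unit of
# analysis matches the unit that received the treatment.
control_means = [statistics.mean(mouse) for mouse in data["control"]]
treated_means = [statistics.mean(mouse) for mouse in data["treated"]]
mouse_level_se = se_of_difference(control_means, treated_means)

print(f"naive SE (neurons as independent): {naive_se:.3f}")
print(f"mouse-level SE (per-mouse means):  {mouse_level_se:.3f}")
```

Under these assumed settings the naïve standard error is much smaller than the mouse-level one, because the pooled analysis counts 100 neurons per arm as if they were 100 independent subjects when only 5 mice per arm were treated. Averaging within mice is used here only because it needs no external libraries; it discards the neuron-level information that a mixed-effect model would retain.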
Conflict of interest statement
Competing Interests: The authors have declared that no competing interests exist.
Source: PubMed