Biostatistics Advance Access published online on April 11, 2007
Biostatistics, doi:10.1093/biostatistics/kxm008
Published by Oxford University Press 2007.
Fitting semiparametric random effects models to large data sets
Division of Biostatistics, College of Public Health, The Ohio State University, B-115 Starling-Loving Hall, 320 West 10th Avenue, Columbus, OH 43210, USA mpennell{at}cph.osu.edu
Biostatistics Branch, MD A3-03, National Institute of Environmental Health Sciences, PO Box 12233, Research Triangle Park, NC 27709, USA
* To whom correspondence should be addressed.
For large data sets, it can be difficult or impossible to fit models with random effects using standard algorithms due to memory limitations or high computational burdens. In addition, it would be advantageous to use the abundant information to relax assumptions, such as normality of random effects. Motivated by data from an epidemiologic study of childhood growth, we propose a 2-stage method for fitting semiparametric random effects models to longitudinal data with many subjects. In the first stage, we use a multivariate clustering method to identify G ≪ N groups of subjects whose data have no scientifically important differences, as defined by subject matter experts. Then, in stage 2, group-specific random effects are assumed to come from an unknown distribution, which is assigned a Dirichlet process prior, further clustering the groups from stage 1. We use our approach to model the effects of maternal smoking during pregnancy on growth in 17 518 girls.
Keywords: Cluster analysis; Dirichlet process; Latent variables; Longitudinal data; Mixed effects model; Prior elicitation
Received October 2, 2006; revised February 19, 2007; accepted for publication March 6, 2007.
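The 2-stage idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not specify the stage 1 clustering method, so a simple greedy "leader" rule is assumed here, with a hypothetical per-coordinate tolerance `tol` standing in for the expert-defined threshold of scientifically unimportant differences. For stage 2, the sketch only draws one partition from the Chinese restaurant process, which is the partition prior implied by a Dirichlet process; the paper's stage 2 involves posterior inference for the random effects, which is not attempted here. All function names and the data values are invented for illustration.

```python
import random

def leader_cluster(subjects, tol):
    """Hypothetical stage 1: a subject joins the first group whose leader
    differs from it by at most tol[k] in every coordinate k."""
    leaders, labels = [], []
    for x in subjects:
        for g, lead in enumerate(leaders):
            if all(abs(a - b) <= t for a, b, t in zip(x, lead, tol)):
                labels.append(g)
                break
        else:
            # No existing group is close enough: x starts a new group.
            leaders.append(x)
            labels.append(len(leaders) - 1)
    return leaders, labels

def crp_partition(n_items, alpha, rng):
    """Draw one partition of n_items from the Chinese restaurant process
    with precision alpha (the partition prior induced by a Dirichlet process)."""
    sizes, assignment = [], []
    for i in range(n_items):
        # Existing cluster c has weight sizes[c]; a new cluster has weight alpha.
        u = rng.random() * (i + alpha)
        acc, choice = 0.0, len(sizes)
        for c, s in enumerate(sizes):
            acc += s
            if u < acc:
                choice = c
                break
        if choice == len(sizes):
            sizes.append(0)
        sizes[choice] += 1
        assignment.append(choice)
    return assignment

# Made-up summary vectors for N = 5 subjects (illustrative values only).
subjects = [(50.0, 3.4), (50.2, 3.5), (54.0, 4.1), (50.1, 3.3), (53.8, 4.0)]
tol = (0.5, 0.2)  # assumed tolerances, one per coordinate

leaders, labels = leader_cluster(subjects, tol)    # stage 1: N subjects -> G groups
z = crp_partition(len(leaders), alpha=1.0, rng=random.Random(0))  # stage 2 prior draw
print(len(leaders), labels, z)
```

The point of the design is that stage 2 operates on the G group representatives rather than on all N subjects, which is what would reduce memory and computation when G is much smaller than N.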