Skip Navigation


Biostatistics Advance Access originally published online on April 11, 2007
Biostatistics 2007 8(4):821-834; doi:10.1093/biostatistics/kxm008
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow All Versions of this Article:
8/4/821    most recent
kxm008v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Right arrow Disclaimer
Google Scholar
Right arrow Articles by Pennell, M. L.
Right arrow Articles by Dunson, D. B.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Pennell, M. L.
Right arrow Articles by Dunson, D. B.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

Published by Oxford University Press 2007.

Fitting semiparametric random effects models to large data sets

Michael L. Pennell*

Division of Biostatistics, College of Public Health, The Ohio State University, B-115 Starling-Loving Hall, 320 West 10th Avenue, Columbus, OH 43210, USA mpennell{at}cph.osu.edu

David B. Dunson

Biostatistics Branch, MD A3-03, National Institute of Environmental Health Sciences, PO Box 12233, Research Triangle Park, NC 27709, USA

* To whom correspondence should be addressed.


    SUMMARY
 TOP
 SUMMARY
 1. INTRODUCTION
 2. METHODS
 3. SIMULATION STUDIES
 4. ANALYSIS OF THE...
 5. DISCUSSION
 REFERENCES
 
For large data sets, it can be difficult or impossible to fit models with random effects using standard algorithms due to memory limitations or high computational burdens. In addition, it would be advantageous to use the abundant information to relax assumptions, such as normality of random effects. Motivated by data from an epidemiologic study of childhood growth, we propose a 2-stage method for fitting semiparametric random effects models to longitudinal data with many subjects. In the first stage, we use a multivariate clustering method to identify G<<N groups of subjects whose data have no scientifically important differences, as defined by subject matter experts. Then, in stage 2, group-specific random effects are assumed to come from an unknown distribution, which is assigned a Dirichlet process prior, further clustering the groups from stage 1. We use our approach to model the effects of maternal smoking during pregnancy on growth in 17 518 girls.

Keywords: Cluster analysis; Dirichlet process; Latent variables; Longitudinal data; Mixed effects model; Prior elicitation


    1. INTRODUCTION
 TOP
 SUMMARY
 1. INTRODUCTION
 2. METHODS
 3. SIMULATION STUDIES
 4. ANALYSIS OF THE...
 5. DISCUSSION
 REFERENCES
 
When data are compiled from a large population or multiple centers, the number of observations can become massive. In these situations, memory limitations or high computational burdens can make fitting random effects models difficult using standard frequentist (e.g. Wolfinger and others, 1994Go) and Bayesian (e.g. Zeger and Karim, 1991Go) methods. These difficulties are illustrated by data collected in the Collaborative Perinatal Project (CPP), a prospective epidemiologic study of pregnant women and their children in the United States from 1959–1974. Chen and others (2006)Go examined the relationship between maternal smoking habits during pregnancy and childhood obesity within N = 34866 children in the CPP using generalized estimating equations (GEE; Liang and Zeger, 1986Go). Although GEE allowed the authors to perform inferences on the average effects in the population, it would have also been interesting to assess how smoking varied in its effect across the children. Unfortunately, the size of the data impeded a random effects analysis. For instance, SAS PROC MIXED ran out of memory when we attempted to fit a model with random smoking effects.

Recently, Guha and Ryan (2006)Go and Z. Huang and A. Gelman (in preparation) proposed efficient methods for fitting parametric random effects models to large data sets. However, when a data set is large, as in the CPP, it would be advantageous to use the abundant information to relax distribution assumptions, such as normality of random effects. Several authors have proposed methods which assume a smooth, unknown random effects distribution (see, e.g. Magder and Zeger, 1996Go; Tao and others, 1999Go; Zhang and Davidian, 2001Go; Chen and others, 2002Go and related papers by Verbeke and Lesaffre, 1996Go; Ghidey and others, 2004Go), though computation may be impractical for large N. A more tractable approach is to assume a discrete distribution, thereby reducing the number of random effects to be estimated. Aitkin (1999)Go proposed an EM algorithm for fitting these models, though the approach converges to a local mode.

Following a Bayesian approach, a Dirichlet process prior (DPP) may be used to allow an unknown, discrete random effects distribution and to automatically cluster individuals into latent classes (West and others, 1994Go, Bush and MacEachern, 1996Go, Mukhopadhyay and Gelfand, 1997Go, Kleinman and Ibrahim, 1998Go). Unfortunately, it is not feasible to use the DPP in large data sets using current Markov chain Monte Carlo (MCMC) algorithms (e.g. West and others, 1994Go; MacEachern, 1994Go; Ishwaran and James, 2001Go). Although the variational Bayes method of Blei and Jordan (2006)Go can substantially reduce computation time relative to MCMC, the approach can be sensitive to starting values and relies on replacing the true posterior density with a lower bound having unknown accuracy.

An alternate approach is to reduce the size of the data set prior to performing MCMC. For instance, one could fit models to a random subsample (e.g. Owen, 2003Go). DuMouchel and others (1999)Go proposed a 2-stage method for "data squashing." First, the complete data is partitioned into compact subregions. Then, within each region, one generates a set of "pseudo-data" and weights so that the weighted moments on the squashed data match the unweighted moments on the original data. The method of DuMouchel and others is less sensitive to outliers than random sampling, but their moment matching procedure is computationally intensive. Recently, Madigan and others (2002)Go proposed a squashing method which first groups subjects based on their contribution to the likelihood and then fits models to the mean of each group. Although this approach may be promising for some models, it is unwieldy under the DPP due the complicated structure of the likelihood.

Motivated by the CPP data, we propose a new data squashing procedure for fitting semiparametric random effects models to large, longitudinal data sets. In the first stage, we construct G<<N clusters of "scientifically indistinguishable" subjects, meaning that differences between subjects in each group are so small that they would not be considered significant by an expert of the subject matter. In the second stage, we use a DPP to model the G cluster means, further clustering the groups from the first stage. By applying the DPP to the cluster means instead of the complete data, we reduce both the computation time and the number of latent classes. In addition, our use of expert opinion improves the scientific justification of clustering. For discussion of the importance of expert elicitation, refer to Kadane and Wolfson (1998)Go, Meyer and Booker (2001)Go, and Garthwaite and others (2005)Go.

In Section 2, we propose the method. Section 3 contains simulation examples, Section 4 applies the approach to the CPP data, and Section 5 discusses the results.


    2. METHODS
 TOP
 SUMMARY
 1. INTRODUCTION
 2. METHODS
 3. SIMULATION STUDIES
 4. ANALYSIS OF THE...
 5. DISCUSSION
 REFERENCES
 

2.1 General motivation

For i = 1,...,N, let yi = (yi1,...,yini)' denote a set of ni longitudinal measurements on subject i. Letting Xi = (xi1,...,xip) denote a set of predictors, we focus on the linear random effects model


Formula (2.1)

where Ini is an nixni identity matrix and bi = (bi1,...,bip)'~H, an unknown distribution with mean ß and covariance V.

As N becomes very large and both ni and p remain modest, many subjects have essentially identical values with yi{approx}yj and Xi{approx}Xj for many different pairs i,j. Outcomes, such as weight, that are treated as continuous are often truncated or rounded when recorded, limiting the number of unique values in the data. In addition, values which are so close that a subject matter expert would consider them scientifically indistinguishable can be grouped together without loss of important information. Under these circumstances, the data are adequately summarized by values for G<<N clusters. For an observation i in cluster g, let


Formula (2.2)

where Formula, Formula, and Formula are the cluster-specific means of the response, predictors, and random effects, {phi}gi and {varphi}gi are random variables, and {Delta}gi is a random matrix. When the G clusters adequately represent the heterogeneity in the data, the observed values of {phi}gi, {varphi}gi, and {Delta}gi are all approximately zero. Thus, ß = E(bi) can be estimated by


Formula (2.3)

where mg is the number of subjects in cluster g.

Instead of fitting models to all N subjects, we propose an alternative approach in which we fit our model to the pseudo-sample, (yFormula,XFormula),...,(yFormula,XFormula), where (yFormula,XFormula) represents the typical subject in cluster g (i.e. Formula and Formula). In Section 2.2, we recommend a strategy for initial clustering of the N subjects into G groups. In Section 2.3, we define our semiparametric model and discuss the implications of the DPP on clustering. In Section 2.4, we discuss our approach to inference.

2.2 Stage 1 Clustering

Subject-specific data are first divided into q strata based on categorical predictors. For example, if there are 2 categorical predictors, one dichotomous and one with 3 levels, q would equal 6. Within each stratum, we wish to develop clusters of scientifically indistinguishable subjects based on the values of the continuous variables, i.e. the longitudinal responses and continuous predictors. For subject i in stratum j, we denote the values of these variables as wji = (wji1,...,wjipji)'. For ease in exposition, we will temporarily assume that pji = pj for i = 1,...,Mj, where pj is the number of continuous variables for each subject in stratum j and Mj is the stratum frequency. Prior to clustering, we transform wji to zji = (zji1,...,zjipj)', where


Formula

and Formula and swjk denote the mean and standard deviation, respectively, of the kth continuous variable in stratum j.

Let the z-scores in stratum j be divided into Gj clusters whose locations in Rpj are represented by a set of data points or "seeds," cj1,...,cjGj, where cjl = (cjl1,...,cjlpj)' and cjlk is the average value of the kth standardized variable in cluster l. We assume that both the number of clusters and locations are unknown a priori, but through expert elicitation, we define a threshold r such that


Formula (2.4)

for each subject j,i in cluster j,l.

To elicit r, we recommend performing a set of exploratory cluster analyses and presenting the results to one or more subject matter experts. These analyses may be performed using a set of historical data, or alternatively, one stratum of the current data. In the latter method, the data used to elicit r will also be used in the second stage of the analysis, thus creating a sort of an empirical Bayes approach. Since, the data are standardized prior to clustering, the r based on one stratum should be valid for a different stratum even if the mean and variance differ across groups.

For the CPP data, we treated the data on boys whose mothers never smoked as historical data and used it to choose an appropriate r for the girls. In our exploratory analyses, we used a range of r values to cluster the longitudinal weight of boys with complete data (i.e. with follow-ups at ages 0, 1, 3, 4, 7, and 8). Following each analysis, we plotted the growth curves from subjects in the cluster with largest radius (see Figure 1). We presented these plots to a panel of experts on body weight research and asked them to identify clusters (each indexed by a radius, r) whose growth curves have potentially significant differences. Most panelists agreed that when r ≤ 2.14, the growth curves in each cluster were not significantly different, while 1 panelist chose r = 2.47 as the threshold. Hence, we went with the majority and used r = 2.14. In other applications where there is substantial disagreement across the experts, the average elicited value could be used instead. Our method for choosing r is similar to the use of "opinion pools" in prior elicitation (see, e.g. Cooke and Goossens, 2000Go).


Figure 1
View larger version (18K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. Plots used to elicit maximum radius, r, for CPP data. The subjects used in the analyses were male children of non-smoking mothers who were measured at each follow-up (N = 1,115). Each plot consists of 10 growth curves from the cluster with the largest radius, r. These curves correspond to the subjects furthest from and closest to the cluster seed as well as 8 randomly chosen subjects.

 
When (2.1) contains more than 2 continuous predictors, it may be difficult to summarize each cluster graphically as in Figure 1. Hence, to make expert elicitation more tractable, we recommend using no more than 2 continuous predictors, and to discretize predictors when possible.

Given r, we use the following 3 steps to cluster the data in stratum j:

Step 1: Initialize cluster seeds.

The goal of this step is to initialize the cluster centroids using data points that are separated by a distance of r or more. Subjects are not assigned to clusters until steps 2 and 3. To begin, initialize Gj at 1 and let cFormula = zj1, the data for subject (j,1). For i = 2,...,Mj, if dFormula = minld(zji,cFormula) > r, then increment Gj by 1 and create a new cluster seed, cFormula = zji.

Step 2: Update the seeds.

Let index variable, t, equal 1 and do the following:

2.1 For i = 1,...,Mj, if dFormula = minld(zji,cFormula) ≤ r, assign zji to the cluster with the closest seed.

2.2 For l = 1,...,Gj, compute


Formula

where mjl is the number of subjects currently in cluster j,l. Let 0 ≤ {nu} < 1 denote a prespecified convergence criterion such that changes in the cluster seeds less than or equal to {nu}·dFormula are permissible, where dFormula denotes the minimum distance between the initial seeds. If maxld(cFormula,cFormula) > {nu}·dFormula, then increment t by 1 and repeat steps 2.1 and 2.2, otherwise proceed to step 3. Note that in step 2.1, a subject may be assigned to a cluster even if it wasn't assigned to one in the previous iteration.

Step 3: Construct final clusters.

3.1 Repeat step 2.1 using cFormula,...,cFormula.

3.2 For all i:dFormula > r, assign zji to its own cluster and increment Gj by 1.

Steps 1–3 may be implemented using SAS PROC FASTCLUS, and sample code is available upon request. Step 1 is related to the leader algorithm (Hartigan, 1975Go), while step 2 is a form of k-means clustering (MacQueen, 1967Go). Though step 1 is sensitive to the order of the data, we have found that when there are several hundred subjects in each stratum, the results are fairly consistent across orderings; Gj differs by no more than a few clusters, and changes in the cluster means are small. After completing steps 1–3 for j = 1,...,q, we compute the means of the untransformed variables in each cluster, Formula. As mentioned in Section 2.1, these data (plus the categorical predictors) will constitute our G = {sum}FormulaGj pseudo-subjects.

Though related to the data squashing method of DuMouchel and others (1999)Go, our approach has several advantages. The method of DuMouchel and others requires that the number of clusters be known a priori. By choosing r based on expert elicitation, we induce a prior on G, thus reducing the sensitivity of our method to subjectively chosen values of this hyperparameter. In addition, our method is geared toward producing clusters with low between-subject variability. Thus, unlike the method of DuMouchel and others, we can justify using 1 pseudo-subject per cluster. In some cases, the method of DuMouchel and others may be faster since it does not require any consultation with scientists. However, it should be relatively easy for a biostatistician to obtain expert opinion since they are usually collaborating with the scientists who provided them with the data. Hence, we expect that the advantages of expert elicitation should outweigh this inconvenience.

In many longitudinal studies, including the CPP, pji!=pji' for several pairs (j,i),(j,i') due to missing follow-ups. A simple solution is to stratify by missingness, but this may be infeasible when there are several different patterns. For instance, there are 60 different missingness patterns in the CPP data. Thus, to resolve this problem, we recommend stratifying by the most common patterns and assigning the remaining subjects to the stratum for which they have the least number of missing variables. In each of these strata, the initial cluster seeds are chosen using subjects with complete data. Then, in steps 2 and 3, subjects with missing observations are assigned to clusters based on adjusted distances,


Formula (2.5)

where the sum is taken over the pji nonmissing variables for subject i in cluster j. As before, these subjects may still be assigned to their own cluster if dFormula > r in step 3, and thus, we do not ignore any important outliers. Since previous methods for data squashing were developed with univariate data in mind (see DuMouchel and others, 1999Go; Madigan and others, 2002Go), ours is the first to accommodate missing measurements.

2.3 Semiparametric model

In the remaining sections of this chapter, we will drop the stratum index from the stage 1 clusters and refer to the pseudo-data as (yFormula,XFormula),...,(yFormula,XFormula), which were defined in Section 2.1. For pseudo-subject g = 1,...,G, we assume


Formula (2.6)

where nFormula is the number of measurements on pseudo-subject g, H0 is a known distribution, and {alpha} is a precision parameter. In all our examples, H0 = N(µ, D).

If we marginalize over the DPP for H, the sequence of random effects, bFormula,...,bFormula, follows a Polya urn scheme (Blackwell and MacQueen, 1973Go), i.e.


Formula (2.7)

for j < k and k = 2,...,G. Thus, under the DPP, the random effects are clustered into K ≤ G groups whose random effects are {theta}1,...,{theta}K, where {theta}l~H0 for l = 1,...,K.

Let S1,iisin{1,...,G} and S2,iisin{1,...,K} index the stage 1 and 2 clusters of subject i, respectively. Given the frequencies of our stage 1 clusters, m1,...,mG, the probability that 2 randomly selected subjects are in the same stage 1 cluster is


Formula (2.8)

which follows from the multivariate hypergeometric distribution. Also, under the DPP, the probability that 2 pseudo-subjects are grouped together is 1/({alpha} + 1) (Antoniak, 1974Go). Therefore, a priori,


Formula (2.9)

Thus, our method increases the prior probability that 2 subjects are clustered together, relative to a DPP applied to N subjects. As a result, our prior favors a smaller, but more scientifically justified, number of clusters.

Posterior computation proceeds using the Polya urn Gibbs sampler for Dirichlet process mixture models (MacEachern, 1994Go; West and others, 1994Go). Details regarding implementation may be found in Appendix A available at Biostatistics online.

To reduce the sensitivity of our model to subjectively chosen hyperparameters, we recommend the following hyperpriors for µ, D, {tau}, and {alpha}:


Formula

where W(·;d0,dFormulaDFormula) is the Wishart density with degrees of freedom d0 and mean DFormula and Ga(·;a,b) is the Gamma density with mean a/b. As seen in Appendix A available at Biostatistics online, we update {alpha} in our MCMC using a simple data augmentation approach (West, 1992Go), while the remaining hyperparameters are sampled directly from their full conditional posteriors.

2.4 Methods for inference

As seen in (2.3), a reasonable estimate for the population-average effects, ß, is the weighted mean of the average random effects in each stage 1 cluster, Formula. Although we fit our model to mean response and predictor values obtained from m1,...,mG subjects in each cluster, the random effects, bFormula,...,bFormula, are based on 1 pseudo-subject per cluster. Hence


Formula

However, given ß, a transformation can be made:


Formula

which preserves the mean for bFormula but changes the covariance to V/mg so that


Formula

Based on the above results, we make a similar posterior transformation of bFormula,...,bFormula which ensures that the variability of the population effects reflects the sample size. Following convergence, let bFormula denote the value of bFormula observed at iteration t, t = 1,...,T. Prior to calculating the population mean, we replace bFormula with


Formula (2.10)

where Formula. Note that for large T, CovFormula approaches 0, and thus, we do not (significantly) inflate the variances of Formula by replacing ß with Formula. Note that as the stage 1 cluster sizes grow, we shrink back toward the mean of the samples. By doing shrinkage within the first-stage clusters instead of across the clusters, we do not obscure or mask non-normal features in the random effects distribution.

Now that we have corrected our estimates of the cluster-specific means, population effects can be estimated at each iteration of the MCMC as


Formula (2.11)

Thus, linear combinations of Formula can be used to test hypotheses about the average effects of the predictors, similar to what is done with fixed effects in mixed models.

Our method can also be used to perform inferences on between-subject heterogeneity. For example, general features of the random effects distribution can be examined using the following approximation for H:


Formula (2.12)

Inferences may also be based on the clustering of random effects. As seen in Appendix A (available at Biostatistics online), cluster membership changes at each iteration of the MCMC. Hence, we post-process our results; using single linkage (Sneath, 1957Go), we obtain a new set of stage 2 clusters k = 1,...,K*, where for each subject g in cluster k, there exists a different pseudo-subject in k, g*, such that Pr(Sg = Sg*) ≥ 0.5 where, as in Appendix A available at Biostatistics online, Sg indicates the cluster membership of pseudo-subject g assigned by the Dirichlet process. These clusters can be constructed using the linkage and cluster functions in MATLAB. As seen in Section 4.2, the cluster-specific longitudinal trajectories and the number of subjects per cluster are useful in identifying outliers.


    3. SIMULATION STUDIES
 TOP
 SUMMARY
 1. INTRODUCTION
 2. METHODS
 3. SIMULATION STUDIES
 4. ANALYSIS OF THE...
 5. DISCUSSION
 REFERENCES
 
We applied the approach to 3 simulated data examples. In each case, the true model for yi given bi was yi~N(Xibi,I6), where Xi = (xi0,xi1,xi2) with xi0 = 16, xi1 = ui·16, uiisin{0,1}, and xi2 = (0,1,3,4,7,8)'/8 for i = 1,...,N. The predictor xi2 can be thought of as the ages at follow-up for subject i and ui as an exposure indicator, where {sum}Formulaui = N/2.

3.1 Case 1: discrete random effects

In the first case, we simulated a single data set of size N = 2000 using the discrete distribution


Formula

which has mean ß = (3.25,1.45,23.06)'.

We applied our approach for r = 2.14 (elicited value), r = 1.66, and r = 0 (complete data). Diffuse priors were chosen for µ and {tau} with µ0 = (15,0,0)', {Sigma}0 = 100·I3, {tau}0 = 1, and {psi} = 0.1. The prior for D was centered on the identity matrix with d0 = 3. We also let {alpha}~Ga(a,1), where we let a = 0.25 for r = 2.14 and r = 1.66 but chose a = 0.1 for the complete data to induce a similar prior for K across G. The MCMC was run for 25 000 iterations in each analysis with the first 5000 iterations discarded as a burn-in and with every 10th sample collected to thin the chain.

Table 1 provides estimates of K and the population effects from our MCMC. Both the number of clusters and the values of the regression parameters are similar across r. In addition, it took only 23 min to fit the model under the elicited r, while 19.5 h were needed to fit the model to the complete data. We were also relatively successful in recovering the original clusters (i.e. the 5 atoms of the discrete distribution) using our post-processing method. The results of this secondary analysis are available in Appendix B available at Biostatistics online.


View this table:
[in this window]
[in a new window]

 
Table 1. Means and 95% credible intervals for K and Formula from simulation Cases 1–3. The true values of the population effects were Formula = (3.25, 1.45, 23.06)' in Case 1 and approximately ß = (3.3, 1.5, 23.2)' in Cases 2 and 3

 
3.2 Cases 2–3: continuous random effects

In Case 2, bi~N(ß,diag({omega})), where ß = (3.3,1.5,23.2)' and {omega} = (0.4,0.4,3)', while in Case 3,


Formula

where ß1 = (2.9,1.1,22.2)', ß2 = (4,2.25,25)', {omega}1 = (0.075,0.1,1)', and {omega}2 = (0.175,0.2,2)'. Since computation was more intensive than in Case 1, we reduced our sample sizes to 1000 in each study. The stage 1 clustering and MCMC proceeded as in Case 1, but with different priors for {alpha}; to impose a similar prior for K across the analyses, in Case 2 we chose a = 1 for r > 0, while in Case 3, we chose a = 3 for r = 2.14 and a = 2 for r = 1.66. In both cases, we chose a = 0.5 when r = 0.

Under normal random effects, the parameter estimates under r = 2.14 were virtually identical to those under r = 0 (see Table 1). However, when the random effects came from a mixture of normals, it appears as if r = 2.14 leads to some underestimation of the variability in the population, resulting in population effects which are slightly biased. This is not unexpected given that random effects come from a continuum, which implies that the true r = 0. Hence, these results demonstrate the robustness of our method to r. Also, even for a sample size of 1000, choosing r = 2.14 instead of r = 0 reduced the computation time from approximately 1.5 days to less than 19 min.


    4. ANALYSIS OF THE CPP DATA
 TOP
 SUMMARY
 1. INTRODUCTION
 2. METHODS
 3. SIMULATION STUDIES
 4. ANALYSIS OF THE...
 5. DISCUSSION
 REFERENCES
 

4.1 Methods

We now return to the CPP data discussed in Section 1. In our analysis, we considered modeling the longitudinal weight of girls by age and exposure category: children of never smoker (N1 = 6684), ex-smoker (N2 = 1849), or current smoker (N3 = 8985). In stage 1, we stratified by exposure and the 4 most common missingness patterns: no missing data (i.e. measurements at birth and years 1, 3, 4, 7, and 8), missing follow-up at year 8, missing follow-ups at years 3 and 8, and lost to follow-up following year 1. Within each stratum, we clustered under r = 2.14(pj/6)1/2, where pj is the number of follow-ups under the missingness pattern in stratum j. Note that (pj/6)1/2 is the reciprocal of the correction used in (2.5). These stage 1 analyses generated G = 526 clusters across 12 strata.

In stage 2, we modeled the weight of pseudo-subject g using an intercept, xFormula, indicators of smoking exposure (xFormula for ex-smokers and xFormula for current smokers), age at follow-up (xFormula), and ex-smoker by age (xFormula) and current smoker by age (xFormula) interactions. Age was centered around the mean value among the pseudo-subjects (3.16) and was assumed to have a linear effect due to the relatively few ages at which measurements were collected.

We used the same priors for {tau}, µ, and D as in the simulations and assigned a Ga(0.5,1) prior to {alpha}. We ran our MCMC for 45 000 iterations following a burn-in of 10 000, otherwise implementing as in Section 3.

4.2 Results

As in Chen and others (2006)Go, our population effects suggest that a mother's smoking habit during pregnancy had a significant impact on child growth. As seen in Table 2, the 95% credible intervals for the smoking–age interactions (ß4 and ß5) obtained using our method (denoted G-DPP) are above 0, suggesting that the effects of smoking increased with age. To describe the smoking effect, we provide estimates of the ex-smoker and current smoker effects at birth ({eta}E0 and {eta}C0) and age 8 ({eta}E8 and {eta}C8). At birth, the children of ex-smokers and current smokers were leaner than the children of never smokers, with the decrease being highly significant, Pr({eta}C0 < 0) and Pr({eta}E0 < 0) > 0.99, but similar across the 2 groups, Pr({eta}C0 < {eta}E0) = 0.668. However, at the age of 8, children in both exposure groups were significantly heavier, Pr({eta}C8 > 0) and Pr({eta}E8 > 0) > 0.999, with the increase in weight being greater in the children of ex-smokers, Pr({eta}E8 > {eta}C8) = 0.997. It is possible that some of the ex-smoker effects are due to confounding as Chen and others found that adjustment for covariates such as center and pre-pregnancy weight resulted in an insignificant ex-smoker effect.


View this table:
[in this window]
[in a new window]

 
Table 2. Population effects of smoking in CPP analysis. The DPP estimates listed are means and 95% credible intervals; 95% confidence intervals are listed for the results from GEE and the marginal model

 
Table 2 also presents estimates obtained using GEE. Although the GEE results suggest a significant effect of smoking on child weight, there is no significant ex-smoker by age interaction (p = 0.141). It is not surprising that GEE provides a flatter slope for the ex-smoker effect since, under the assumption of missing completely at random, it does not allow a child's observed weight to be related to her missingness pattern, which appears to be the case in the CPP. An alternative approach, which does not suffer from the missing completely at random assumption, is to fit a marginal model with an unstructured covariance matrix for the repeated weight measurements using the REPEATED statement in PROC MIXED. As seen in Table 2, the marginal model suggests a significant ex-smoker–age interaction but an insignificant current smoker–age interaction (p = 0.161). It is likely that the discrepancies from our approach are due to differences in distribution assumptions; though our method imposes some structure on the covariance matrix by assuming a linear trend in age, we do not require Gaussian weights. In addition, our method estimates random effects, while the marginal model only provides fixed effects.

Another common method for large data sets is to fit a model to a random subsample of the data. Thus, we compared our population effect estimates to those obtained from fitting a semiparametric random effects model to 2 random samples of size 1752 (denoted RS1-DPP and RS2-DPP in Table 2). In each case, the ex-smoker effects were insignificant. However, the results for the current smoker effects were inconsistent; in one sample the effect increased with age, while in the other, the effect was insignificant. These results demonstrate 2 key weaknesses of random sampling: a loss of power to detect an effect of a rare exposure and, since the method is sensitive to outliers in the data, dependence on the sample chosen. Our method does not suffer from either weakness since we preserve all scientifically important differences in stage 1 and, by weighing our population effects by cluster size, we ensure that our estimates are reflective of the complete data. The 2-stage methodology is also more computationally efficient; in this example, it took approximately 5.5 h to complete the MCMC under our approach and more than 35 h for RS1- and RS2-DPP.

We also examined heterogeneity in the random effects using our post-processing method. We found that 15 740 girls belonged to a subpopulation with "normal" traits, which we will refer to as cluster 1, and 40 girls (20 non-smokers, 2 ex-smokers, and 18 current smokers) belonged to a small outlier cluster, which we will refer to as cluster 2. The remaining 1738 girls belonged to 77 stage 1 clusters which were not combined with other stage 1 clusters by our post-processing method. Although some of the girls in these clusters appear to be outliers with unusual growth patterns, most (1722) were lost to follow-up following birth or year 1, and the DPP could not accurately classify them due to their limited data. Had we not stratified by missingness in stage 1, it is likely that many of these girls would be in cluster 1.

Figure 2 provides the posterior mean of the ex-smoker and current smoker effects within each of our stage 2 clusters. As expected, the posterior means for cluster 1 are similar to the population estimates, while other clusters have larger effect values. In particular, the average ex-smoker effect in cluster 2 is 7.8 kg at age 7 and the average current smoker effect is 2.7 kg at age 8. In addition to exhibiting unusual growth, the children in cluster 2 also had mothers who were, on average, 17.2 kg heavier prior to pregnancy than cluster 1 mothers.


Figure 2
View larger version (11K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. Mean smoking effects in the CPP data. The solid, unlabeled lines correspond to pseudo-subjects not assigned to cluster 1 or 2. Effect estimates were computed up to the last follow-up of the exposed subjects. Estimates for unexposed pseudo-subjects are omitted.

 

    5. DISCUSSION
 TOP
 SUMMARY
 1. INTRODUCTION
 2. METHODS
 3. SIMULATION STUDIES
 4. ANALYSIS OF THE...
 5. DISCUSSION
 REFERENCES
 
We have proposed a 2-stage method for fitting semiparametric random effects models to large data sets. Our method uses expert elicitation to generate a small pseudo-sample that summarizes the biologically important differences in the complete data. By applying the DPP to these data, we substantially decrease the computational burden and generate scientifically interesting clusters in the posterior. In simulation studies, our method detected true trends in the data under discrete and normal random effects and was slightly biased under continuous, multimodal random effects.

Although our method was motivated by a specific example, it can be applied broadly to large longitudinal data sets. In some prospective epidemiology studies, there may be interest in fitting a model with many potential confounders for the exposure. In these settings, it may be necessary to modify our first-stage clustering to improve efficiency. For example, the clustering could be stratified-based quintiles of propensity scores (Rosenbaum and Rubin, 1983Go). The model fit to the pseudo-data would then include propensity score quintile as a covariate. In future work, it would also be interesting to extend our approach to accommodate generalized linear mixed models for non-normal outcomes.


    ACKNOWLEDGMENTS
 
We thank Aimin Chen and Matthew Longnecker, National Institute of Environmental Health Sciences, for providing the data used in our example and our panel of subject matter experts, Walter Rogan, MD, Allen Wilcox, MD, National Institute of Environmental Health Sciences; Robert McMurray, Department of Exercise Physiology, University of North Carolina at Chapel Hill and Diane Holditch-Davis, School of Nursing, Duke University. This research was supported in part by the Intramural Research Program of the National Institutes of Health, National Institute of Environmental Health Sciences. Conflict of Interest: None declared.


    REFERENCES
 TOP
 SUMMARY
 1. INTRODUCTION
 2. METHODS
 3. SIMULATION STUDIES
 4. ANALYSIS OF THE...
 5. DISCUSSION
 REFERENCES
 

    Aitkin M. A general maximum likelihood analysis of variance components in generalized linear models. Biometrics (1999) 55:117–128.[CrossRef][Web of Science][Medline]

    Antoniak CE. Mixtures of Dirichlet processes with applications to nonparametric problems. Annals of Statistics (1974) 2:1152–1174.[CrossRef][Web of Science]

    Blackwell D, MacQueen JB. Ferguson distributions via Pólya urn schemes. Annals of Statistics (1973) 1:353–355.[CrossRef][Web of Science]

    Blei DM, Jordan MI. Variational inference for Dirichlet process mixtures. Bayesian Analysis (2006) 1:1–23.[CrossRef]

    Bush CA, MacEachern SN. A semiparametric Bayesian model for randomized block designs. Biometrika (1996) 83:175–185.

    Chen A, Pennell ML, Klebanoff MA, Rogan WJ, Longnecker MP. Maternal smoking during pregnancy in relation to child overweight: follow-up to age 8 years. International Journal of Epidemiology (2006) 35:121–130.[Abstract/Free Full Text]

    Chen J, Zhang D, Davidian M. A Monte Carlo EM algorithm for generalized linear mixed models with flexible random effects distribution. Biostatistics (2002) 3:347–360.[Abstract]

    Cooke RM, Goossens LHJ. Procedures guide for structured expert judgement in accident consequence modelling. Radiation Protection Dosimetry (2000) 90:303–309.[Abstract]

    DuMouchel W, Volinsky C, Johnson T, Cortes C, Pregibon D. Squashing flat files flatter. In: Proceedings of the Fifth ACM Conference on Knowledge Discovery and Data Mining (1999) New York, NY: ACM Press. 6–15.

    Garthwaite PH, Kadane JB, O'Hagan A. Statistical methods for eliciting probability distributions. Applied Statistics (2005) 100:680–700.

    Ghidey W, Lesaffre E, Eilers P. Smooth random effects distribution in a linear mixed model. Biometrics (2004) 60:945–953.[CrossRef][Web of Science][Medline]

    Guha S, Ryan L. Gauss-Seidel estimation of generalized linear mixed models with application to Poisson modeling of spatially varying disease rates. Working Paper 33 (2006) Boston, MA: Department of Biostatistics, Harvard School of Public Health. http://www.bepress.com/harvardbiostat/paper33.

    Hartigan JA. Clustering Algorithms (1975) New York: John Wiley & Sons, Inc. 74–78.

    Ishwaran H, James LF. Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association (2001) 96:161–173.[CrossRef][Web of Science]

    Kadane JB, Wolfson LJ. Experiences in elicitation. The Statistician (1998) 47:3–19.

    Kleinman KP, Ibrahim JG. A semiparametric Bayesian approach to the random effects model. Biometrics (1998) 54:921–938.[CrossRef][Web of Science][Medline]

    Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika (1986) 73:13–22.[Abstract/Free Full Text]

    MacEachern SN. Estimating normal means with a conjugate style Dirichlet process prior. Communications in Statistics—Simulation and Computation (1994) 23:727–741.[CrossRef]

    MacQueen JB. Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability—LucienLeCam M, Neyman Jerzy, eds. (1967) Berkeley, CA: University of California Press. 281–297.

    Madigan D, Raghavan N, DuMouchel W, Nason M, Posse C, Ridgeway G. Likelihood-based data squashing: a modeling approach to instance construction. Data Mining and Knowledge Discovery (2002) 6:173–190.[CrossRef][Web of Science]

    Magder LS, Zeger SL. A smooth nonparametric estimate of a mixing distribution using mixtures of Gaussians. Journal of the American Statistical Association (1996) 91:1141–1151.[CrossRef][Web of Science]

    Meyer MA, Booker JM. Eliciting and Analyzing Expert Judgment: A Practical Guide (2001) Philadelphia, PA: ASA/Society of Industrial and Applied Mathematics.

    Mukhopadhyay S, Gelfand AE. Dirichlet process mixed generalized linear models. Journal of the American Statistical Association (1997) 92:633–639.[CrossRef][Web of Science]

    Owen A. Data squashing by empirical likelihood. Data Mining and Knowledge Discovery (2003) 7:101–113.[CrossRef][Web of Science]

    Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika (1983) 70:41–55.[Abstract/Free Full Text]

    Sneath PHA. The application of computers to taxonomy. Journal of General Microbiology (1957) 17:201–226.[Abstract/Free Full Text]

    Tao H, Palta M, Yandell BS, Newton MA. An estimation method for the semiparametric mixed effects model. Biometrics (1999) 55:102–110.[CrossRef][Web of Science][Medline]

    Verbeke G, Lesaffre E. A linear mixed-effects model with heterogeneity in the random-effects population. Journal of the American Statistical Association (1996) 91:217–221.[CrossRef][Web of Science]

    West M. Hyperparameter estimation in Dirichlet process mixture models. Discussion Paper 92-A03 (1992) Durham, NC: Institute of Statistics and Decision Sciences, Duke University. http://ftp.stat.duke.edu/WorkingPapers/92–A03.ps.

    West M, Müller P, Escobar MD. Hierarchical priors and mixture models with application in regression and density estimation. In: Aspects of Uncertainty: A Tribute to D.V. Lindley—Smith A, Freeman P, eds. (1994) New York: Wiley. 363–386.

    Wolfinger R, Tobias R, Sall J. Computing Gaussian likelihoods and their derivatives for generalized linear mixed models. SIAM Journal on Scientific Computing (1994) 15:1294–1131.[CrossRef][Web of Science]

    Zeger SL, Karim MR. Generalized linear models with random effects; a Gibbs sampling approach. Journal of the American Statistical Association (1991) 86:79–86.[CrossRef][Web of Science]

    Zhang D, Davidian M. Linear mixed models with flexible distributions of random effects for longitudinal data. Biometrics (2001) 57:795–802.[CrossRef][Web of Science][Medline]

    Received October 2, 2006; revised February 19, 2007; accepted for publication March 6, 2007.


    Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



    This Article
    Right arrow Abstract Freely available
    Right arrow FREE Full Text (PDF) Freely available
    Right arrow All Versions of this Article:
    8/4/821    most recent
    kxm008v1
    Right arrow Alert me when this article is cited
    Right arrow Alert me if a correction is posted
    Services
    Right arrow Email this article to a friend
    Right arrow Similar articles in this journal
    Right arrow Similar articles in PubMed
    Right arrow Alert me to new issues of the journal
    Right arrow Add to My Personal Archive
    Right arrow Download to citation manager
    Right arrowRequest Permissions
    Right arrow Disclaimer
    Google Scholar
    Right arrow Articles by Pennell, M. L.
    Right arrow Articles by Dunson, D. B.
    Right arrow Search for Related Content
    PubMed
    Right arrow PubMed Citation
    Right arrow Articles by Pennell, M. L.
    Right arrow Articles by Dunson, D. B.
    Social Bookmarking
     Add to CiteULike   Add to Connotea   Add to Del.icio.us  
    What's this?