Skip Navigation


Biostatistics Advance Access originally published online on February 17, 2006
Biostatistics 2006 7(4):551-568; doi:10.1093/biostatistics/kxj025
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow All Versions of this Article:
7/4/551    most recent
kxj025v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Right arrow Disclaimer
Google Scholar
Right arrow Articles by Dunson, D. B.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Dunson, D. B.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

Published by Oxford University Press 2006.

Bayesian dynamic modeling of latent trait distributions

David B. Dunson

Biostatistics Branch, National Institute of Environmental Health Sciences, MD A3-03, PO Box 12233, Research Triangle Park, NC 27709, USA dunson1{at}niehs.nih.gov


    SUMMARY
 TOP
 SUMMARY
 1. INTRODUCTION
 2. SEMIPARAMETRIC HIERARCHICAL...
 3. POSTERIOR COMPUTATION AND...
 4. APPLICATION TO DNA...
 5. DISCUSSION
 APPENDIX
 REFERENCES
 
Studies of latent traits often collect data for multiple items measuring different aspects of the trait. For such data, it is common to consider models in which the different items are manifestations of a normal latent variable, which depends on covariates through a linear regression model. This article proposes a flexible Bayesian alternative in which the unknown latent variable density can change dynamically in location and shape across levels of a predictor. Scale mixtures of underlying normals are used in order to model flexibly the measurement errors and allow mixed categorical and continuous scales. A dynamic mixture of Dirichlet processes is used to characterize the latent response distributions. Posterior computation proceeds via a Markov chain Monte Carlo algorithm, with predictive densities used as a basis for inferences and evaluation of model fit. The methods are illustrated using data from a study of DNA damage in response to oxidative stress.

Keywords: Dynamic Dirichlet process; Factor analysis; Hierarchical model; Latent variables; Measurement error; Random effect; Surrogate data


    1. INTRODUCTION
 TOP
 SUMMARY
 1. INTRODUCTION
 2. SEMIPARAMETRIC HIERARCHICAL...
 3. POSTERIOR COMPUTATION AND...
 4. APPLICATION TO DNA...
 5. DISCUSSION
 APPENDIX
 REFERENCES
 
In many applications, the primary response variable of interest cannot be measured directly and one must instead rely on multiple surrogates. For example, in studying DNA damage and repair, it is not feasible to directly measure the frequency of DNA strand breaks for each cell in a sample. However, using single-cell gel electrophoresis (also known as the comet assay), one can obtain multiple measures that relate directly to the frequency of strand breaks. In such settings, it is natural to use a latent response model in which the different measured outcomes are assumed to be manifestations of a latent variable, which in turn may depend on covariates, such as the dose of an exposure.

Often, in applying such models, one assumes that both the latent and manifest variables are normally distributed (see Roy and Lin, 2000Go, 2002Go; Xu and Zeger, 2001Go, for recent references). However, a number of approaches have been proposed which allow the measured outcomes to have different parametric distributions, typically restricted to be underlying normal (Muthén, 1984Go; Shi and Lee, 2000Go) or in the exponential family (Muthén, 1984Go; Sammel et al., 1997Go; Moustaki and Knott, 2000Go; Dunson, 2000Go, 2003Go). In addition, one can potentially use a latent class model in which the underlying response is categorical (see Miglioretti, 2003Go, for a recent reference). Since full likelihood approaches are often difficult to implement and may be sensitive to distributional assumptions, some authors have advocated the use of robust score tests (Sammel and Ryan, 2002Go) or estimating equation-based approaches (Reboussin et al., 1999Go; Roy et al., 2003Go). Such methods are most useful when interest focuses on assessing changes in the overall mean response profile with covariates. However, in certain applications, one may anticipate possible changes in not only the mean but also the distributional shape across levels of a predictor. For example, in the molecular epidemiology studies which motivated this article, the distribution of DNA damage across cells in a sample may have different shapes depending on the level of oxidative stress induced by a chemical exposure, the amount of time after exposure damage is measured, and the presence of polymorphisms in genes involved in the base excision repair pathway.

To allow the surrogate outcome distributions to be unknown, Dunson et al. (2003)Go proposed an approximate Bayesian approach for quantile regression. Following Lavine (1995)Go, they replaced the likelihood function with a substitution likelihood based on quantiles. Covariate effects were incorporated on the level of the surrogate outcomes and residual dependency was accommodated through a shared latent normal variable. In order to reduce dimensionality in assessing covariate effects on the latent response of interest, it may be preferable to allow covariates to affect the location of the latent variable, while avoiding parametric assumptions about the latent variable distribution.

This article proposes a Bayesian semiparametric approach to this problem. The surrogate outcomes are related to a latent response variable through a factor analytic model, with a scale mixture (West, 1987Go) of underlying normals used to characterize flexibly the measurement error distributions. Our primary focus is on developing an approach for assessing dynamic changes in the latent response distribution across levels of a predictor, Formula. For example, X may represent the level of a treatment, age, or time since exposure. To allow for uncertainty in the latent response distribution conditional on X, we propose a dynamic mixture of Dirichlet processes (DMDP). In particular, the latent response distribution in group h is represented as a mixture of the distribution in group Formula and an unknown innovation distribution, which is assigned a Dirichlet process (DP) prior (Ferguson, 1973Go, 1974Go). This structure accommodates autocorrelation in the distributions, and results in a flexible dynamic mixture structure for the surrogate outcomes.

The proposed approach is an alternative to the dependent Dirichlet process (DDP) of MacEachern (1999Go, 2000)Go, which is a class of priors for a collection of unknown distributions (see also De Iorio et al., 2002Go, 2004; Gelfand et al., 2004Go). The DDP characterizes dependence through a stochastic process for a fixed number of atoms in the unknown distributions. The proposed DMDP instead allows evolving changes in the number of atoms through a weighted mixture of independent DP measures. This approach extends the weighted mixture formulation of Müller et al. (2004)Go, which was used to borrow strength across studies, to a time series setting. An innovative alternative approach to defining dependent nonparametric measures was recently proposed by Griffin and Steel (2006)Go. Their order-based DDP allows the weights in the Sethuraman (1994)Go stick-breaking representation of the DP to be dependent on covariates. They considered a nonparametric time series model as a special case, though our proposed DMDP has advantages in terms of ease of implementation.

For a recent review of Bayesian nonparametric inference, refer to Müller and Quintana (2004)Go. Several authors have used DP priors for intermediate variables in hierarchical models, without allowing the unknown distributions to vary with covariates. Bush and MacEachern (1996)Go used a DP mixture for a random block factor, and Kleinman and Ibrahim (1998)Go applied a related approach to random effects distributions in mixed effects models. Also considering semiparametric linear mixed models, Ishwaran and Takahara (2002)Go developed an iid weighted Chinese restaurant algorithm for inference. Mukhopadhyay and Gelfand (1997)Go proposed a general class of DP mixtures of hierarchical generalized linear models. Recent authors have considered improved approaches for computation and inference (Neal, 2000Go; Gelfand and Kottas, 2002Go; Ishwaran and James, 2002Go).

Section 2 proposes the semiparametric latent response model and prior structure. Section 3 outlines a hybrid Gibbs sampler and Metropolis algorithm for posterior computation, and discusses inferences. Section 4 applies the approach to data from a study of DNA damage in relationship with oxidative stress, and Section 5 discusses the results.


    2. SEMIPARAMETRIC HIERARCHICAL MODEL
 TOP
 SUMMARY
 1. INTRODUCTION
 2. SEMIPARAMETRIC HIERARCHICAL...
 3. POSTERIOR COMPUTATION AND...
 4. APPLICATION TO DNA...
 5. DISCUSSION
 APPENDIX
 REFERENCES
 

2.1 Data structure and measurement model

Let Formula denote a Formula vector of surrogate measurements for the latent response of the ith (Formula) subject in group h (Formula). For example, in the DNA damage study, Formula denotes surrogates of DNA damage for the ith cell in dose group h. The elements of Formula are ordered so that the first Formula (Formula) elements are continuous and the remaining Formula elements are categorical. To facilitate joint modeling, we link the categorical surrogates to the underlying continuous variables as in Muthén (1984)Go. Formally, let Formula, for Formula, where Formula is a continuous variable underlying Formula. For the continuous surrogates, we have Formula for Formula. For the categorical surrogates, with Formula, we have Formula for Formula, where Formula are thresholds satisfying Formula. Hence, Formula is the identity link for continuous surrogates, and is otherwise a threshold link mapping from Formula, where Formula is the number of categories of the jth surrogate.

Letting Formula, we relate the underlying continuous variables to the latent response through the following measurement model:

Formula 2(2.1)

where Formula 2 is a vector of intercept parameters, Formula 2 are factor loadings, Formula 2 is a latent response variable for subject i in group h, and Formula 2 is a vector of independently distributed measurement errors measuring idiosyncratic features of the different surrogates. A primary goal in considering this model is to assess how the latent response distribution changes between groups.

To address this goal, one could potentially use a mean regression model in which Formula 2 and Formula 2, where Formula 2 is a vector of group indicator variables and the variance of the latent variable density is fixed at 1 for identifiability. In fitting the model and performing inferences, one could avoid parametric assumptions on the residual measurement error and latent variable distributions by using two-stage least squares procedures (with some risk of bias). Alternatively, one could follow a full likelihood-based or Bayesian approach after specifying a distribution for Formula 2 and Formula 2. For example, an obvious choice that satisfies the moment constraints would be Formula 2 and Formula 2, where Formula 2 is the measurement error precision matrix and Formula 2 for the categorical surrogates, Formula 2, to ensure identifiability. Related approaches have been considered by Sammel et al. (1997)Go and Dunson (2003)Go, among others.

As a more flexible approach for modeling of the residual distributions, we use a scale mixture of normal distributions (Fernandez and Steel, 2001Go) by letting Formula 2, where Formula 2 with Formula 2. This specification results in a t density with Formula 2 degrees of freedom and, for Formula 2, mean 0 and variance Formula 2 for the measurement error Formula 2 in the continuous variable underlying the jth surrogate. For continuous surrogates, this accounts for heavier tails than expected under the normal distribution, while for categorical surrogates, the use of a t-distribution results in a more flexible link function than the probit form: Formula 2, where Formula 2 denotes the standard normal distribution function, implied by assuming Formula 2. Here, we set Formula 2, for Formula 2, for identifiability. See Johnson and Albert (1999)Go for an overview of the Bayesian modeling of categorical data using latent variable formulations.

2.2 Nonparametric latent response model

Our focus is primarily on developing a nonparametric specification for the latent response distribution. Let Formula 2, where Formula 2 is an unknown distribution drawn from a DP centered on nonatomic base distribution Formula 2 and with precision parameter Formula 2 (as in Antoniak, 1974Go). Following Sethuraman's (1994)Go stick-breaking representation, we specify

Formula 2(2.2)

where Formula 2 denotes the degenerate distribution with all its mass at Formula 2, Formula 2 is an infinite sequence of random weights, and Formula 2 is a corresponding sequence of random atoms generated from Formula 2. It can be shown that Formula 2 is almost surely discrete.

Letting Formula 2, the DP structure implies that the elements of Formula 2 are allocated to Formula 2 unique values (or clusters), which we denote as Formula 2. Letting Formula 2 denote the subvector of Formula 2 excluding the ith element, the conditional distribution of Formula 2 given Formula 2 is

Formula 2(2.3)

where the unique values of Formula 2 are denoted as Formula 2, and Formula 2 denotes the number of elements of Formula 2 having value Formula 2. This distribution is the mixture of the base distribution Formula 2 and a uniform distribution with support on Formula 2, with the mixture weights depending on Formula 2 and the sample size Formula 2.

The form of (2.3) simplifies posterior computation and prediction of the latent response for an additional subject in group 1, denoted Formula 2. In particular, the conditional predictive density of Formula 2 given Formula 2 is simply

Formula 2(2.4)

Potentially, this predictive distribution could be used as a reasonable best guess for Formula 2, the distribution of Formula 2, the latent response for a subject in the second group. Such an approach would indirectly account for dependency between Formula 2 and Formula 2 by modeling Formula 2 conditionally on Formula 2. We prefer to specify explicitly the dependence between Formula 2 and Formula 2. In particular, it is reasonable to assume that Formula 2 shares features with Formula 2 but that innovations may have occurred. This can be modeled using the mixture structure Formula 2, where Formula 2 and Formula 2 is a DP-distributed `innovation' distribution. Note that this formulation randomly modifies the discrete distribution Formula 2 by (i) reducing the probabilities allocated to the atoms in Formula 2 by a multiplicative factor Formula 2 and (ii) incorporating new atoms drawn from the nonatomic base distribution Formula 2.

Letting Formula 2 denote Borel sets partitioning Formula 2, we have

Formula 2(2.5)

where Formula 2 denotes the finite K-dimensional Dirichlet distribution, and

Formula 2

is a random innovation on Formula 2. For any Formula 2, we have

Formula 2

The hyperparameters Formula 2 and Formula 2 control the magnitude of the expected change from Formula 2 to Formula 2, with Formula 2 in the limit as Formula 2 and Formula 2 as Formula 2. The variance of the change is controlled by Formula 2 and Formula 2, with Formula 2 in the limit as Formula 2 or Formula 2. We do not consider the case in which Formula 2 because that corresponds to the degenerate case in which Formula 2 places all its mass at a single point.

Extending this approach to later groups (Formula 2), we let Formula 2, with

Formula 2(2.6)

where Formula 2, for Formula 2, with Formula 2 and Formula 2, are probability weights on the different components in the mixture. Note that this model can be expressed equivalently as

Formula 2(2.7)

where Formula 2 for Formula 2 and Formula 2 for Formula 2. This formulation expresses Formula 2 as equal to a randomly selected element out of a set of independent DP-distributed latent factors Formula 2.

The Appendix derives the correlation coefficient between Formula 2 and Formula 2 and the marginal mean and variance of Formula 2. Focusing on the special case in which Formula 2, for Formula 2, so that the same base distribution is chosen for each component in the mixture, we have

Formula 2(2.8)

with the Formula 2 terms dropping out in the special case in which Formula 2, for Formula 2. Because the Formula 2 values depend on Formula 2, it is clear from this expression that the correlation in the unknown distributions is driven by these mixture weights. This expression is particularly useful due to its simplicity and its lack of dependency on Formula 2.

Because factors drawn from a common DP cluster together as characterized in the Pólya urn scheme of Expression (2.3), it is clear that latent variables for subjects in different groups can belong to the same cluster. For example, if Formula 2, then subjects i and Formula 2 can potentially belong to the same cluster, so that Formula 2. The prior probability of clustering together two subjects Formula 2 and Formula 2 in the same or different groups is simply

Formula 2(2.9)

which is the probability that they are sampled from the same mixture component and are then grouped together, summed across the different possibilities for the mixture component. It is clear from the above expressions that Formula 2 and Formula 2 are key hyperparameters controlling the clustering process and dynamic changes in the latent variable distribution across groups. Potentially, a reasonable simplifying assumption in some applications may be Formula 2 and Formula 2, for Formula 2. This special case may be particularly useful when group sizes are small. However, for greater flexibility, we choose hyperprior distributions for Formula 2 and Formula 2 as follows:

Formula 2(2.10)

where Formula 2 denotes the gamma density with mean Formula 2 and variance Formula 2. In most applications, one may expect a priori that correlation between Formula 2 and Formula 2 is moderate to high. Such belief corresponds to the expectation that the Formula 2 values are less than 0.5 and may be close to 0, which can be reasonably expressed using Formula 2 and Formula 2. It is also reasonable, in most cases, to anticipate that a small to moderate number of atoms are added in moving between two groups, which can be expressed by choosing a prior that assigns high probability to small values of Formula 2 (e.g. Formula 2 and Formula 2).

2.3 Identifiability and prior specification

An important issue in latent variable models is the incorporation of constraints to ensure identifiability of the model from the observed data. Although this is different from formal Bayesian identifiability, it is nonetheless an appealing property for a Bayesian model. As a starting point for a discussion of identifiability, consider the expectation and covariance of Formula 2 integrating out the latent variables Formula 2 and Formula 2

Formula 2(2.11)

The correlation coefficient between the underlying variables, Formula 2 and Formula 2, denoted Formula 2 can be used as a measure of the correlation between Formula 2 and Formula 2. Clearly, there is a potential nonidentifiability problem, since the model is invariant to transformations that (i) multiply Formula 2 by any positive constant Formula 2 while dividing Formula 2 by Formula 2 for Formula 2 or (ii) add any real number Formula 2 to Formula 2 while subtracting Formula 2 from Formula 2 for Formula 2. To eliminate this problem, we recommend fixing the values of one of the elements of both Formula 2 and Formula 2; say by letting Formula 2 and Formula 2.

It is important to consider carefully the sources of information about the group-specific latent variable distributions, Formula 2. For purposes of discussion, first consider the simple case in which Formula 2 and Formula 2, so there is one continuous outcome and a single group. Since the factor loadings, Formula 2, characterize dependency among the outcomes, we recommend fixing Formula 2 for identifiability when Formula 2. With this constraint, there is information in the data about the shape of Formula 2, since the distribution of Formula 2 is characterized as the mixture of a t-distribution across Formula 2. Lack of fit of the t-distribution, such as a positively skewed shape or multimodality, can be accommodated through a nonnormal mixing distribution, Formula 2. Extending to the Formula 2 group case, the mixing distribution will change dynamically from Formula 2 to Formula 2 across the range of the group index (e.g. across dose groups or time points). Hence, the model can accommodate systematic differences in lack of fit. For example, in the presence of heterogeneity in a dose or treatment effect, there may be increasing skewness in higher treatment groups.

Finally, considering the general case in which Formula 2 and Formula 2, the density of the jth surrogate in group h can be expressed as the mixture of a t-distribution with mean Formula 2 across the Formula 2-distribution for Formula 2:

Formula 2

Hence, the marginal distribution of Formula 2 will be increasingly driven by the characteristics of Formula 2 as the factor loading Formula 2 and the correlation with the other surrogates increase. If the surrogates in group h tend to be positively skewed or to have other characteristics inconsistent with the t-distribution, these features will be reflected in Formula 2. Common features of the surrogate distributions which tend to change across the groups will be reflected in differences among Formula 2. In this manner, the data clearly inform about the latent variable distributions Formula 2.

A Bayesian specification of the model is completed with priors for the threshold parameters Formula 2 and additional unknowns in the measurement model (2.1). Letting Formula 2 and Formula 2, Expression (2.1) is equivalent to Formula 2. Following Albert and Chib (1993)Go, we choose a uniform improper prior for the threshold parameters, Formula 2, for all Formula 2 such that Formula 2, with the Formula 2 values known for Formula 2. For the remaining parameters, our prior can be expressed as follows:

Formula 2(2.12)

We choose the first and Formula 2st diagonal elements of Formula 2 to be Formula 20 in effect fixing Formula 2 and Formula 2, for identifiability purposes. This is done to simplify book-keeping in developing the computational algorithm, and yields essentially identical results to treating Formula 2 and Formula 2 as strictly fixed constants. Focusing on the case in which the surrogates all have the same direction, we constrain Formula 2, though this sign restriction is not necessary for identifiability.


    3. POSTERIOR COMPUTATION AND INFERENCES
 TOP
 SUMMARY
 1. INTRODUCTION
 2. SEMIPARAMETRIC HIERARCHICAL...
 3. POSTERIOR COMPUTATION AND...
 4. APPLICATION TO DNA...
 5. DISCUSSION
 APPENDIX
 REFERENCES
 
This section outlines a Markov chain Monte Carlo algorithm for posterior computation and predictive inference. In the absence of outside information about Formula 2 and systematic changes that occur across groups, a natural choice for Formula 2 is the standard normal distribution. This choice results in a semiparametric model, which is centered on a parametric model having normally distributed factors. Given the focus in the literature on normal latent variable models, this form is particularly appealing and will be our focus in developing a computational algorithm. The form of the algorithm is motivated by efficiency considerations, and we utilize approaches for efficient sampling in DP models while also using a block-updating approach for the unknowns in the measurement model. The steps involved in these two components are described separately in the following two subsections. We focus on the case in which all the surrogates are continuous since the extension to the general case is straightforward using the Albert and Chib (1993)Go approach.

3.1 Updating the unknowns in the latent response model

In this section, we describe our algorithm for updating the latent variables Formula 2, the mixture weights Formula 2, and precisions Formula 2, for Formula 2, integrating out the infinite-dimensional Formula 2. Our approach is related to the Pólya urn Gibbs sampler described by MacEachern (1994)Go and West et al. (1994)Go. Complications arise due to the DMDP structure of Expression (2.6), but most of these are alleviated by using the characterization of Expression (2.7).

Let Formula 2 denote the unique values of the latent variable in the lth mixture component Formula 2, and let Formula 2 denote that Formula 2 and Formula 2, so that subject Formula 2 belongs to the rth cluster in the lth mixture component, with Formula 2. Let Formula 2 and Formula 2 denote the total number of subjects having Formula 2 and Formula 2, respectively. Also, let Formula 2, Formula 2, Formula 2, Formula 2, and Formula 2 denote the values obtained excluding subject Formula 2. Then, the conditional prior distribution of Formula 2 given the latent variable values and mixture component indicators for all other subjects is

Formula 3(3.1)

We first derive the conditional posterior distribution of Formula 3, updating this prior with the data. Introducing shorthand notation, let Formula 3 and Formula 3 denote the respective multipliers on Formula 3 and Formula 3 in Expression (3.1). Multiplying the conditional prior (3.1) by the conditional likelihood, Formula 3, with Formula 3, and normalizing results in the following full conditional posterior distribution for Formula 3:

Formula 3(3.2)

where Formula 3 and Formula 3 are the conditional posterior variance and mean derived under the base parametric prior Formula 3, the updated component weights are defined as follows:

Formula 3

Computation can potentially proceed by Gibbs steps, which successively sample from the full conditional distribution (3.2) for each Formula 3. However, since this approach results in slow mixing, we suggest an alternative approach following MacEachern (1994)Go. In particular, letting Formula 3 if Formula 3 is allocated to a new cluster in mixture component l and Formula 3 if Formula 3, we alternate between the following steps:

  1. Update Formula 3 by sampling each Formula 3 from its full conditional distribution, which is multinomial with Formula 3 for Formula 3, Formula 3. Whenever Formula 3, for any l, we replace Formula 3 with a draw from Formula 3 to assign subject Formula 3 to their own cluster in component l.
  2. Generate new values for Formula 3, for Formula 3, conditional on the current configuration of subjects to clusters by sampling Formula 3, for Formula 3, from its full conditional posterior distribution, which is Formula 3, with

    Formula 3


  3. Update Formula 3, for Formula 3, from its full conditional posterior distribution which is

    Formula 3

    where Formula 3 if Formula 3 for any r.

  4. Update Formula 3, for Formula 3, using the procedure proposed by West (1992)Go, noting that for updating Formula 3 the relevant number of clusters is Formula 3 and the relevant sample size is Formula 3, which varies from iteration to iteration.

3.2 Updating the unknowns in the measurement model

Sampling of the coefficients, Formula 3, and unknowns, Formula 3, proceeds as follows:

Step 2,a. Update Formula 3 in a single block by sampling from the joint conditional distribution, which is Formula 3 subject to the constraint that Formula 3 for Formula 3, where

Formula 3

where Formula 3 is the jth row vector of Formula 3.

Step 2,b. Update the measurement error parameters by sampling from the full conditional distribution of Formula 3 (for Formula 3):

Formula 3

sampling from the full conditional distribution of Formula 3 (for all Formula 3):

Formula 3

and finally updating the Formula 3 values in a Metropolis step.

3.3 Inferences on the latent response distribution

Posterior summaries of Formula 3 calculated from the MCMC output can be used as a basis for inferences on correlation between the surrogates. However, the primary focus is typically on assessing changes in the distribution of the latent response as a function of the predictors. For this purpose, it will be useful to have estimates of the predicted density function of Formula 3 in each group, as well as measures of the magnitude, location, and weight of evidence of changes between groups. To address these goals, we recommend collecting draws from the conditional predictive distributions of Formula 3 for a future subject in dose group h, for Formula 3. Generalizing Expression (2.4), this distribution is simply

Formula 3(3.3)

One can collect samples from this distribution after apparent convergence, along with the mean and selected percentiles. Samples can be obtained easily due to the simple mixture structure, the mean is available in closed form, and percentiles can be estimated by calculating the cdf at a dense grid of values.

After convergence, the samples of Formula 3 represent draws from the predictive density of the latent response in group h, and inferences can be based on comparing these densities between groups. One can also estimate marginal posterior densities of differences in quantiles between groups, which is useful in summarizing group differences, as we illustrate in Section 4. In addition, by also collecting draws from the conditional distributions of the surrogate outcomes for additional subjects in each group, we can estimate predictive densities of the surrogates. By comparing these predictive densities to the empirical distributions, one can assess goodness of fit of the procedure. In particular, it is of interest to look for surrogates which do not follow the trajectory predicted by the model, since this may suggest that the simple one-factor structure may be insufficient. Although we have focused on the one-factor case throughout this article, the generalization to multiple factors is straightforward.


    4. APPLICATION TO DNA DAMAGE STUDY
 TOP
 SUMMARY
 1. INTRODUCTION
 2. SEMIPARAMETRIC HIERARCHICAL...
 3. POSTERIOR COMPUTATION AND...
 4. APPLICATION TO DNA...
 5. DISCUSSION
 APPENDIX
 REFERENCES
 
We illustrate the methodology using data from a genotoxicity experiment analyzed previously by Dunson et al. (2003)Go. The study assessed the effect of oxidative stress, induced by hydrogen peroxide exposure, on the frequency of DNA strand breaks using single-cell gel electrophoresis. Human lymphoblastoid cells (Formula 3) drawn from an immortalized cell line were randomized to one of the five dose groups (0, 5, 20, 50, or 100 micromoles Formula 3), resulting in 100 cells per dose group. For each cell (Formula 3), we have Formula 3 surrogate measures of DNA damage, including (1) % tail DNA, (2) tail extent divided by head extent, (3) extent tail moment, (4) Olive tail moment, and (5) tail extent. For a detailed description of these variables and diagnostics demonstrating nonnormality, refer to Dunson et al. (2003)Go.

Letting Formula 3 index the dose group and normalizing each of the surrogates, Formula 3, prior to analysis, we apply the approach proposed in Sections 2 and 3 to assess the effect of hydrogen peroxide exposure on the frequency of DNA strand breaks. To complete a Bayesian specification of the model, we let Formula 3, Formula 3 for Formula 3, and Formula 3 for Formula 3. A prior mean of 0 for the Formula 3 values seems reasonable since the surrogates are normalized. The prior for the factor loadings expresses our belief in a moderate degree of dependency in the surrogates. In addition, we let Formula 3 and Formula 3, since we anticipate heavier tails than normal, and we let Formula 3 and Formula 3 to express our belief in the magnitude of the residual variance component. Finally, for the hyperparameters in the priors for Formula 3 and Formula 3 in Expression (2.10), we follow the recommendation of Section 2.2.

We ran the MCMC algorithm for Formula 3 iterations, discarding the first 5000 iterations as a burn-in, and collecting every 10th sample to thin the chain. Based on examination of trace plots, convergence was rapid and autocorrelation was low to moderate, reflecting good computational efficiency of our MCMC algorithm under the constraints imposed on the model to ensure identifiability. Posterior summaries are provided in Table 1.


View this table:
[in this window]
[in a new window]

 
Table 1 Posterior summaries of the parameters in the DNA damage application

 
Applying the approach described in Section 3.3, we estimated the predictive distribution of the latent response variable Formula 3 for cells in each of the dose groups. The results are shown in Figure 1. Interestingly, the densities change considerably in shape as dose increases, with the density in the control group having an approximately log-normal shape with relatively low variance. As dose increases, the mean and variance increase substantially, the distribution flattens out, and the right tail becomes increasingly fat. Figure 2 shows boxplots for samples from the posterior distributions for changes in the 10th, 25th, 50th, 75th, and 90th percentiles of the predictive distribution for Formula 3 attributable to increasing Formula 3 exposure from 0 (Formula 3) to the maximum dose of 100 (Formula 3). The differences clearly increase as one moves further into the right tail, reflecting failure of a mean or median regression model. This pattern likely reflects heterogeneity among the cells in their response to oxidative stress induced by hydrogen peroxide exposure, with some cells having little or no induced damage. The ability to make inferences on changes in quantiles of the response distribution is an appealing feature of our approach.


Figure 1
View larger version (15K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1 Histogram and kernel smoothed density estimates for the latent DNA damage variable (Formula 3) among cells in each of the dose groups. The solid vertical lines represent the mean and the dashed lines represent 2.5th and 97.5th percentiles.

 

Figure 2
View larger version (7K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2 Boxplots for samples from the posterior distribution of differences in the 10th, 25th, 50th, 75th, and 90th percentiles between the latent response density for Formula 3 and Formula 3. Formula 3 in quantiles between the conditional distribution of Formula 3 and Formula 3.

 
It is important to assess how well the model fits the data since we make some parametric assumptions and use informative priors. In particular, we assume that a single latent response variable can be used to characterize departures from the t-distribution and effects of hydrogen peroxide exposure on each of the surrogate variables. To assess model fit, we estimated predictive distributions for each of the surrogate variables at each dose level and compared these estimates to histograms of the raw data. The results for the first four surrogates are included in Figure 3, with the fifth excluded to make the plot readable (though the results are similar).


Figure 3
View larger version (21K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3 Empirical histograms and estimated predictive densities for different surrogate variables and dose levels.

 
Overall, the model-based predictive distributions do a good job at capturing the distributional shifts that occur as dose increases. The best fit is for the Olive tail moment, which was recommended by Dunson et al. (2003)Go as the best single surrogate of DNA damage. In contrast, the fit is not as good for tail extent/head extent, which is known to be a poor surrogate due to sensitivity to individual pixels in the tail of the image. At 100 micromoles, there is some evidence of a second mode in the right tail which is picked up in Figure 1 but not in Figure 3, possibly due to over-smoothing by the t-kernel.

An important issue is sensitivity of the results to the choice of hyperparameters. To assess robustness, we repeated the analysis using alternative priors with (i) Formula 3 and Formula 3 to correspond to lower autocorrelation across dose groups, (ii) Formula 3 to correspond to less uncertainty in the normal base distribution and a higher rate of adding atoms between dose groups, (iii) the prior variance doubled for all the parameters in the measurement model, and (iv) the prior precision doubled for all the parameters in the measurement model. In each of these cases, the figures were essentially identical to those obtained in the primary analysis. Estimates of quantiles of the predicted latent variable densities in each dose group under priors (i) and (ii) are shown in Table 2; results for priors (iii) and (iv) differed by only 1.6%, on average, from the primary analysis estimates and are not shown.


View this table:
[in this window]
[in a new window]

 
Table 2 Estimated percentiles of the predictive distribution of the latent response for cells in each dose group in the DNA damage application

 

    5. DISCUSSION
 TOP
 SUMMARY
 1. INTRODUCTION
 2. SEMIPARAMETRIC HIERARCHICAL...
 3. POSTERIOR COMPUTATION AND...
 4. APPLICATION TO DNA...
 5. DISCUSSION
 APPENDIX
 REFERENCES
 
This article has proposed a Bayesian semiparametric latent response model in which the latent variable density can shift dynamically across groups. Although inferences on covariate effects on the mean response profile may be somewhat robust to the parametric form for the latent variable density, linear mean regression structures may be insufficiently flexible in many applications. For example, this is often the case in epidemiologic and toxicologic studies in which there may be increasing variance and skewness at higher exposure levels. The genotoxicity application presents one striking example of this scenario, though we anticipate many other applications in which the latent linear mean regression model fails. Our proposed Bayesian approach is quite flexible, and should provide a useful alternative to existing methods, such as the semiparametric median regression modeling approach of Kottas and Gelfand (2001)Go.

The proposed DMDP should prove useful in other applications in which a distribution or random function can change across levels of a predictor. Note that one can either apply the DMDP directly to the distribution of interest, in order to model the distribution as discrete with an unknown number and location of atoms, or use the DMDP for a mixture distribution, in order to characterize a continuous density.

The DMDP should also prove useful when interest focuses on clustering of observations within and across groups. For example, in a time course or dose response gene expression study, one may want to cluster genes having similar levels of differential expression, both within a given time or dose group and across times or doses.

An interesting area for future research is the generalization to broader classes of factor analytic and structural equation models, allowing for uncertainty in the number of factors as in Lopes and West (2004)Go for the normal linear factor model. It is possible that fewer factors may be needed to characterize the covariance structure if one allows the factors to have nonparametric distributions. However, issues of interpretation and identifiability need to be carefully considered. Potentially, a rich class of multivariate distributions could be generated by including a few factors having unknown distributions. This idea is conceptually related to models based on mixtures of factor analyzers (Utsugi and Kumagai, 2001Go; Fokoue and Titterington, 2003Go; McLachlan et al., 2003Go), though previous methods have assumed fully parametric mixture structures.


    APPENDIX
 TOP
 SUMMARY
 1. INTRODUCTION
 2. SEMIPARAMETRIC HIERARCHICAL...
 3. POSTERIOR COMPUTATION AND...
 4. APPLICATION TO DNA...
 5. DISCUSSION
 APPENDIX
 REFERENCES
 

A.1 Characterization of correlation in the unknown distributions

From (2.6) and (2.7), it is straightforward to derive Formula 3:

Formula 1(A.1)

The expectation and variance terms are as follows:

Formula 2(A.2)

for Formula 2. The numerator in Expression (A.1) can be calculated as follows:

Formula 3(A.3)

Hence, focusing on the special case in which Formula 3, for Formula 3, Expression (2.8) follows from straightforward algebra.


    ACKNOWLEDGMENTS
 
The author thanks Mary Watson and Jack Taylor for generously providing the data used in the example. Thanks also to Alan Gelfand, David Umbach, and Gregg Dinse for their helpful comments on an early draft of the manuscript.


    REFERENCES
 TOP
 SUMMARY
 1. INTRODUCTION
 2. SEMIPARAMETRIC HIERARCHICAL...
 3. POSTERIOR COMPUTATION AND...
 4. APPLICATION TO DNA...
 5. DISCUSSION
 APPENDIX
 REFERENCES
 

    Albert JH and Chib S. (1993) Bayesian-analysis of binary and polychotomous response data. Journal of the American Statistical Association 88:669–679.[CrossRef]

    Antoniak CE. (1974) Mixtures of Dirichlet processes with applications to nonparametric problems. Annals of Statistics 2:1152–1174.[Web of Science]

    Bush CA and MacEachern SN. (1996) A semiparametric Bayesian model for randomised block designs. Biometrika 83:275–285.[Abstract/Free Full Text]

    De Iorio M, Müller P, Rosner GL, MacEachern SN. (2002) ANOVA DDP models: a review. In Denison DD, Hansen MH, Holmes CC, Mallick B, Yu B (Eds.). Nonlinear Estimation and Classification(Springer, New York) pp. 467.

    De Iorio M, Müller P, Rosner GL, MacEachern SN. (2004) An ANOVA model for dependent random measures. Journal of the American Statistical Association 99:205–215.[CrossRef]

    Dunson DB. (2000) Bayesian latent variable models for clustered mixed outcomes. Journal of the Royal Statistical Society, Series B 62:355–366.[CrossRef]

    Dunson DB. (2003) Dynamic latent trait models for multidimensional longitudinal data. Journal of the American Statistical Association 98:555–563.[CrossRef]

    Dunson DB, Watson M, Taylor JA. (2003) Bayesian latent variable models for median regression on multiple outcomes. Biometrics 59:296–304.[CrossRef][Web of Science][Medline]

    Ferguson TS. (1973) A Bayesian analysis of some nonparametric problems. Annals of Statistics 1:209–230.[CrossRef][Web of Science]

    Ferguson TS. (1974) Prior distributions on spaces of probability measures. Annals of Statistics 2:615–629.[CrossRef][Web of Science]

    Fernandez C and Steel MFJ. (2001) Bayesian regression analysis with scale mixtures of normals. Econometric Theory 16:80–101.[CrossRef]

    Fokoue E and Titterington DM. (2003) Mixtures of factor analysers: Bayesian estimation and inference by stochastic simulation. Machine Learning 50:73–94.[CrossRef]

    Gelfand AE and Kottas A. (2002) A computational approach for full nonparametric Bayesian inference under Dirichlet process mixture models. Journal of Computational and Graphical Statistics 11:289–305.

    Gelfand AE, Kottas A, MacEachern SN. (2004) Bayesian nonparametric spatial modeling with Dirichlet process mixing. Technical Report AMS 2004-5. (Department of Applied Math and Statistics, University of California, Santa Cruz).

    Griffin, J.E. and Steel, M.F.J., (2006). Order-based dependent Dirichlet process. Journal of the American Statistical Association (in press).

    Ishwaran H and James LF. (2002) Approximate Dirichlet process computing in finite normal mixtures: smoothing and prior information. Journal of Computational and Graphical Statistics 11:508–532.[CrossRef]

    Ishwaran H and Takahara G. (2002) Independent and identically distributed Monte Carlo algorithms for semiparametric linear mixed models. Journal of the American Statistical Association 97:1154–1166.[CrossRef]

    Johnson VE and Albert JH. (1999) Ordinal Data Modeling(Springer., New York).

    Kleinman KP and Ibrahim JG. (1998) A semiparametric Bayesian approach to the random effects model. Biometrics 54:921–938.[CrossRef][Web of Science][Medline]

    Kottas A and Gelfand AE. (2001) Bayesian semiparametric median regression modeling. Journal of the American Statistical Association 96:1458–1468.[CrossRef]

    Lavine M. (1995) On an approximate likelihood for quantiles. Biometrika 82:220–222.[Abstract/Free Full Text]

    Lopes HF and West M. (2004) Bayesian model assessment in factor analysis. Statistica Sinica 14:41–67.[Web of Science]

    MacEachern SN. (1994) Estimating normal means with a conjugate style Dirichlet process prior. Communications in Statistics: Simulation and Computation 23:727–741.[CrossRef][Web of Science]

    MacEachern SN. (1999) Dependent nonparametric process, ASA Proceeding of the Section on Bayesian Statistical Science. (American Statistical Association, Alexandria, VA).

    MacEachern SN. (2000) Dependent Dirichlet processes (unpublished)Department of Statistics, The Ohio State University.

    McLachlan GJ, Peel D, Bean RW. (2003) Modelling high-dimensional data by mixtures of factor analyzers. Computational Statistics and Data Analysis 41:379–388.[CrossRef]

    Miglioretti DL. (2003) Latent transition models for mixed outcomes. Biometrics 59:710–720.[CrossRef][Web of Science][Medline]

    Moustaki I and Knott M. (2000) Generalized latent trait models. Psychometrika 65:391–411.[CrossRef][Web of Science]

    Mukhopadhyay S and Gelfand AE. (1997) Dirichlet process mixed generalized linear models. Journal of the American Statistical Association 92:633–639.[CrossRef]

    Müller P and Quintana FA. (2004) Nonparametric Bayesian data analysis. Statistical Science 19:95–110.

    Müller P, Quintana FA, Rosner G. (2004) A method for combining inference across related nonparametric Bayesian models. Journal of the Royal Statistical Society, Series B 66:735–749.

    Muthén D. (1984) A general structural equation model with dichotomous, ordered categorical, and continuous latent variable indicators. Psychometrika 49:115–132.[CrossRef][Web of Science]

    Neal RM. (2000) Markov chain sampling methods for Dirichlet process mixture. Journal of Computational and Graphical Statistics 9:249–265.

    Reboussin BA, Liang KY, Reboussin DM. (1999) Estimating equations for a latent transition model with multiple discrete indicators. Biometrics 55:839–845.[CrossRef][Web of Science][Medline]

    Roy J and Lin XH. (2000) Latent variable models for longitudinal data with multiple continuous outcomes. Biometrics 56:1047–1054.[CrossRef][Web of Science][Medline]

    Roy J and Lin XH. (2002) Analysis of multivariate longitudinal outcomes with nonignorable dropout and missing covariates: changes in methadone treatment practices. Journal of the American Statistical Association 97:40–52.[CrossRef]

    Roy J, Lin XH, Ryan LM. (2003) Scaled marginal models for multiple continuous outcomes. Biostatistics 4:371–383.[Abstract]

    Sammel MD and Ryan LM. (2002) Effects of covariance misspecification in a latent variable model for multiple outcomes. Statistica Sinica 12:1207–1222.[Web of Science]

    Sammel MD, Ryan LM, Legler JM. (1997) Latent variable models for mixed discrete and continuous outcomes. Journal of the Royal Statistical Society B 59:667–678.[CrossRef]

    Sethuraman J. (1994) A constructive definition of the Dirichlet process prior. Statistica Sinica 2:639–650.

    Shi JQ and Lee SY. (2000) Latent variable models with mixed continuous and polytomous data. Journal of the Royal Statistical Society, Series B 62:77–87.[CrossRef]

    Utsugi A and Kumagai T. (2001) Bayesian analysis of mixtures of factor analyzers. Neural Computation 13:993–1002.[CrossRef][Web of Science][Medline]

    West M. (1987) On scale mixtures of normal-distributions. Biometrika 74:646–648.[Abstract/Free Full Text]

    West M. (1992) Hyperparameter estimation in Dirichlet process mixture model. ISDS Discussion Paper No. 92-03. (Duke University, Durham, NC).

    West M, Müller P, Escobar MD. (1994) Hierarchical priors and mixture models, with application in regression and density estimation. In Smith AFM and Freeman PR (Eds.). Aspects of Uncertainty: A Tribute to D.V. Lindley(Wiley., London).

    Xu J and Zeger SL. (2001) The evaluation of multiple surrogate endpoints. Biometrics 57:81–87.[CrossRef][Web of Science][Medline]

    Received December 4, 2004; revised June 7, 2005; revised December 4, 2005; revised January 25, 2006; accepted for publication February 15, 2006.


    Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


    This article has been cited by other articles:


    Home page
    BiometrikaHome page
    A. Rodriguez, D. B. Dunson, and A. E. Gelfand
    Bayesian nonparametric functional data analysis through density estimation
    Biometrika, March 1, 2009; 96(1): 149 - 162.
    [Abstract] [PDF]


    Home page
    BiostatisticsHome page
    A. Rodriguez, D. B. Dunson, and J. Taylor
    Bayesian hierarchically weighted finite mixture models for samples of distributions
    Biostat., January 1, 2009; 10(1): 155 - 171.
    [Abstract] [Full Text] [PDF]


    Home page
    BiometrikaHome page
    D. B. Dunson and S. D. Peddada
    Bayesian nonparametric inference on stochastic ordering
    Biometrika, December 1, 2008; 95(4): 859 - 874.
    [Abstract] [PDF]


    Home page
    BiometrikaHome page
    D. B. Dunson and J.-H. Park
    Kernel stick-breaking processes
    Biometrika, June 1, 2008; 95(2): 307 - 323.
    [Abstract] [PDF]


    Home page
    Stat Methods Med ResHome page
    D. B. Dunson
    Bayesian methods for latent trait modelling of longitudinal data
    Statistical Methods in Medical Research, October 1, 2007; 16(5): 399 - 415.
    [Abstract] [PDF]


    This Article
    Right arrow Abstract Freely available
    Right arrow FREE Full Text (PDF) Freely available
    Right arrow All Versions of this Article:
    7/4/551    most recent
    kxj025v1
    Right arrow Alert me when this article is cited
    Right arrow Alert me if a correction is posted
    Services
    Right arrow Email this article to a friend
    Right arrow Similar articles in this journal
    Right arrow Similar articles in PubMed
    Right arrow Alert me to new issues of the journal
    Right arrow Add to My Personal Archive
    Right arrow Download to citation manager
    Right arrowRequest Permissions
    Right arrow Disclaimer
    Google Scholar
    Right arrow Articles by Dunson, D. B.
    Right arrow Search for Related Content
    PubMed
    Right arrow PubMed Citation
    Right arrow Articles by Dunson, D. B.
    Social Bookmarking
     Add to CiteULike   Add to Connotea   Add to Del.icio.us  
    What's this?