Biostatistics Advance Access originally published online on February 17, 2006
Biostatistics 2006 7(4):551-568; doi:10.1093/biostatistics/kxj025
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Published by Oxford University Press 2006.
Bayesian dynamic modeling of latent trait distributions
Biostatistics Branch, National Institute of Environmental Health Sciences, MD A3-03, PO Box 12233, Research Triangle Park, NC 27709, USA dunson1{at}niehs.nih.gov
| SUMMARY |
|---|
|
|
|---|
Studies of latent traits often collect data for multiple items measuring different aspects of the trait. For such data, it is common to consider models in which the different items are manifestations of a normal latent variable, which depends on covariates through a linear regression model. This article proposes a flexible Bayesian alternative in which the unknown latent variable density can change dynamically in location and shape across levels of a predictor. Scale mixtures of underlying normals are used in order to model flexibly the measurement errors and allow mixed categorical and continuous scales. A dynamic mixture of Dirichlet processes is used to characterize the latent response distributions. Posterior computation proceeds via a Markov chain Monte Carlo algorithm, with predictive densities used as a basis for inferences and evaluation of model fit. The methods are illustrated using data from a study of DNA damage in response to oxidative stress.
Keywords: Dynamic Dirichlet process; Factor analysis; Hierarchical model; Latent variables; Measurement error; Random effect; Surrogate data
| 1. INTRODUCTION |
|---|
|
|
|---|
In many applications, the primary response variable of interest cannot be measured directly and one must instead rely on multiple surrogates. For example, in studying DNA damage and repair, it is not feasible to directly measure the frequency of DNA strand breaks for each cell in a sample. However, using single-cell gel electrophoresis (also known as the comet assay), one can obtain multiple measures that relate directly to the frequency of strand breaks. In such settings, it is natural to use a latent response model in which the different measured outcomes are assumed to be manifestations of a latent variable, which in turn may depend on covariates, such as the dose of an exposure.
Often, in applying such models, one assumes that both the latent and manifest variables are normally distributed (see Roy and Lin, 2000
, 2002
; Xu and Zeger, 2001
, for recent references). However, a number of approaches have been proposed which allow the measured outcomes to have different parametric distributions, typically restricted to be underlying normal (Muthén, 1984
; Shi and Lee, 2000
) or in the exponential family (Muthén, 1984
; Sammel et al., 1997
; Moustaki and Knott, 2000
; Dunson, 2000
, 2003
). In addition, one can potentially use a latent class model in which the underlying response is categorical (see Miglioretti, 2003
, for a recent reference). Since full likelihood approaches are often difficult to implement and may be sensitive to distributional assumptions, some authors have advocated the use of robust score tests (Sammel and Ryan, 2002
) or estimating equation-based approaches (Reboussin et al., 1999
; Roy et al., 2003
). Such methods are most useful when interest focuses on assessing changes in the overall mean response profile with covariates. However, in certain applications, one may anticipate possible changes in not only the mean but also the distributional shape across levels of a predictor. For example, in the molecular epidemiology studies which motivated this article, the distribution of DNA damage across cells in a sample may have different shapes depending on the level of oxidative stress induced by a chemical exposure, the amount of time after exposure damage is measured, and the presence of polymorphisms in genes involved in the base excision repair pathway.
To allow the surrogate outcome distributions to be unknown, Dunson et al. (2003)
proposed an approximate Bayesian approach for quantile regression. Following Lavine (1995)
, they replaced the likelihood function with a substitution likelihood based on quantiles. Covariate effects were incorporated on the level of the surrogate outcomes and residual dependency was accommodated through a shared latent normal variable. In order to reduce dimensionality in assessing covariate effects on the latent response of interest, it may be preferable to allow covariates to affect the location of the latent variable, while avoiding parametric assumptions about the latent variable distribution.
This article proposes a Bayesian semiparametric approach to this problem. The surrogate outcomes are related to a latent response variable through a factor analytic model, with a scale mixture (West, 1987
) of underlying normals used to characterize flexibly the measurement error distributions. Our primary focus is on developing an approach for assessing dynamic changes in the latent response distribution across levels of a predictor,
. For example, X may represent the level of a treatment, age, or time since exposure. To allow for uncertainty in the latent response distribution conditional on X, we propose a dynamic mixture of Dirichlet processes (DMDP). In particular, the latent response distribution in group h is represented as a mixture of the distribution in group
and an unknown innovation distribution, which is assigned a Dirichlet process (DP) prior (Ferguson, 1973
, 1974
). This structure accommodates autocorrelation in the distributions, and results in a flexible dynamic mixture structure for the surrogate outcomes.
The proposed approach is an alternative to the dependent Dirichlet process (DDP) of MacEachern (1999
, 2000)
, which is a class of priors for a collection of unknown distributions (see also De Iorio et al., 2002
, 2004; Gelfand et al., 2004
). The DDP characterizes dependence through a stochastic process for a fixed number of atoms in the unknown distributions. The proposed DMDP instead allows evolving changes in the number of atoms through a weighted mixture of independent DP measures. This approach extends the weighted mixture formulation of Müller et al. (2004)
, which was used to borrow strength across studies, to a time series setting. An innovative alternative approach to defining dependent nonparametric measures was recently proposed by Griffin and Steel (2006)
. Their order-based DDP allows the weights in the Sethuraman (1994)
stick-breaking representation of the DP to be dependent on covariates. They considered a nonparametric time series model as a special case, though our proposed DMDP has advantages in terms of ease of implementation.
For a recent review of Bayesian nonparametric inference, refer to Müller and Quintana (2004)
. Several authors have used DP priors for intermediate variables in hierarchical models, without allowing the unknown distributions to vary with covariates. Bush and MacEachern (1996)
used a DP mixture for a random block factor, and Kleinman and Ibrahim (1998)
applied a related approach to random effects distributions in mixed effects models. Also considering semiparametric linear mixed models, Ishwaran and Takahara (2002)
developed an iid weighted Chinese restaurant algorithm for inference. Mukhopadhyay and Gelfand (1997)
proposed a general class of DP mixtures of hierarchical generalized linear models. Recent authors have considered improved approaches for computation and inference (Neal, 2000
; Gelfand and Kottas, 2002
; Ishwaran and James, 2002
).
Section 2 proposes the semiparametric latent response model and prior structure. Section 3 outlines a hybrid Gibbs sampler and Metropolis algorithm for posterior computation, and discusses inferences. Section 4 applies the approach to data from a study of DNA damage in relationship with oxidative stress, and Section 5 discusses the results.
| 2. SEMIPARAMETRIC HIERARCHICAL MODEL |
|---|
|
|
|---|
Let
denote a
vector of surrogate measurements for the latent response of the ith (
) subject in group h (
). For example, in the DNA damage study,
denotes surrogates of DNA damage for the ith cell in dose group h. The elements of
are ordered so that the first
(
) elements are continuous and the remaining
elements are categorical. To facilitate joint modeling, we link the categorical surrogates to the underlying continuous variables as in Muthén (1984)
. Formally, let
, for
, where
is a continuous variable underlying
. For the continuous surrogates, we have
for
. For the categorical surrogates, with
, we have
for
, where
are thresholds satisfying
. Hence,
is the identity link for continuous surrogates, and is otherwise a threshold link mapping from
, where
is the number of categories of the jth surrogate.
Letting
, we relate the underlying continuous variables to the latent response through the following measurement model:
![]() | (2.1) |
where
is a vector of intercept parameters,
are factor loadings,
is a latent response variable for subject i in group h, and
is a vector of independently distributed measurement errors measuring idiosyncratic features of the different surrogates. A primary goal in considering this model is to assess how the latent response distribution changes between groups.
To address this goal, one could potentially use a mean regression model in which
and
, where
is a vector of group indicator variables and the variance of the latent variable density is fixed at 1 for identifiability. In fitting the model and performing inferences, one could avoid parametric assumptions on the residual measurement error and latent variable distributions by using two-stage least squares procedures (with some risk of bias). Alternatively, one could follow a full likelihood-based or Bayesian approach after specifying a distribution for
and
. For example, an obvious choice that satisfies the moment constraints would be
and
, where
is the measurement error precision matrix and
for the categorical surrogates,
, to ensure identifiability. Related approaches have been considered by Sammel et al. (1997)
and Dunson (2003)
, among others.
As a more flexible approach for modeling of the residual distributions, we use a scale mixture of normal distributions (Fernandez and Steel, 2001
) by letting
, where
with
. This specification results in a t density with
degrees of freedom and, for
, mean 0 and variance
for the measurement error
in the continuous variable underlying the jth surrogate. For continuous surrogates, this accounts for heavier tails than expected under the normal distribution, while for categorical surrogates, the use of a t-distribution results in a more flexible link function than the probit form:
, where
denotes the standard normal distribution function, implied by assuming
. Here, we set
, for
, for identifiability. See Johnson and Albert (1999)
for an overview of the Bayesian modeling of categorical data using latent variable formulations.
Our focus is primarily on developing a nonparametric specification for the latent response distribution. Let
, where
is an unknown distribution drawn from a DP centered on nonatomic base distribution
and with precision parameter
(as in Antoniak, 1974
). Following Sethuraman's (1994)
stick-breaking representation, we specify
![]() | (2.2) |
where
denotes the degenerate distribution with all its mass at
,
is an infinite sequence of random weights, and
is a corresponding sequence of random atoms generated from
. It can be shown that
is almost surely discrete.
Letting
, the DP structure implies that the elements of
are allocated to
unique values (or clusters), which we denote as
. Letting
denote the subvector of
excluding the ith element, the conditional distribution of
given
is
![]() | (2.3) |
where the unique values of
are denoted as
, and
denotes the number of elements of
having value
. This distribution is the mixture of the base distribution
and a uniform distribution with support on
, with the mixture weights depending on
and the sample size
.
The form of (2.3) simplifies posterior computation and prediction of the latent response for an additional subject in group 1, denoted
. In particular, the conditional predictive density of
given
is simply
![]() | (2.4) |
Potentially, this predictive distribution could be used as a reasonable best guess for
, the distribution of
, the latent response for a subject in the second group. Such an approach would indirectly account for dependency between
and
by modeling
conditionally on
. We prefer to specify explicitly the dependence between
and
. In particular, it is reasonable to assume that
shares features with
but that innovations may have occurred. This can be modeled using the mixture structure
, where
and
is a DP-distributed `innovation' distribution. Note that this formulation randomly modifies the discrete distribution
by (i) reducing the probabilities allocated to the atoms in
by a multiplicative factor
and (ii) incorporating new atoms drawn from the nonatomic base distribution
.
Letting
denote Borel sets partitioning
, we have
![]() | (2.5) |
where
denotes the finite K-dimensional Dirichlet distribution, and
![]() |
is a random innovation on
. For any
, we have
![]() |
The hyperparameters
and
control the magnitude of the expected change from
to
, with
in the limit as
and
as
. The variance of the change is controlled by
and
, with
in the limit as
or
. We do not consider the case in which
because that corresponds to the degenerate case in which
places all its mass at a single point.
Extending this approach to later groups (
), we let
, with
![]() | (2.6) |
where
, for
, with
and
, are probability weights on the different components in the mixture. Note that this model can be expressed equivalently as
![]() | (2.7) |
where
for
and
for
. This formulation expresses
as equal to a randomly selected element out of a set of independent DP-distributed latent factors
.
The Appendix derives the correlation coefficient between
and
and the marginal mean and variance of
. Focusing on the special case in which
, for
, so that the same base distribution is chosen for each component in the mixture, we have
![]() | (2.8) |
with the
terms dropping out in the special case in which
, for
. Because the
values depend on
, it is clear from this expression that the correlation in the unknown distributions is driven by these mixture weights. This expression is particularly useful due to its simplicity and its lack of dependency on
.
Because factors drawn from a common DP cluster together as characterized in the Pólya urn scheme of Expression (2.3), it is clear that latent variables for subjects in different groups can belong to the same cluster. For example, if
, then subjects i and
can potentially belong to the same cluster, so that
. The prior probability of clustering together two subjects
and
in the same or different groups is simply
![]() | (2.9) |
which is the probability that they are sampled from the same mixture component and are then grouped together, summed across the different possibilities for the mixture component. It is clear from the above expressions that
and
are key hyperparameters controlling the clustering process and dynamic changes in the latent variable distribution across groups. Potentially, a reasonable simplifying assumption in some applications may be
and
, for
. This special case may be particularly useful when group sizes are small. However, for greater flexibility, we choose hyperprior distributions for
and
as follows:
![]() | (2.10) |
where
denotes the gamma density with mean
and variance
. In most applications, one may expect a priori that correlation between
and
is moderate to high. Such belief corresponds to the expectation that the
values are less than 0.5 and may be close to 0, which can be reasonably expressed using
and
. It is also reasonable, in most cases, to anticipate that a small to moderate number of atoms are added in moving between two groups, which can be expressed by choosing a prior that assigns high probability to small values of
(e.g.
and
).
An important issue in latent variable models is the incorporation of constraints to ensure identifiability of the model from the observed data. Although this is different from formal Bayesian identifiability, it is nonetheless an appealing property for a Bayesian model. As a starting point for a discussion of identifiability, consider the expectation and covariance of
integrating out the latent variables
and
![]() | (2.11) |
The correlation coefficient between the underlying variables,
and
, denoted
can be used as a measure of the correlation between
and
. Clearly, there is a potential nonidentifiability problem, since the model is invariant to transformations that (i) multiply
by any positive constant
while dividing
by
for
or (ii) add any real number
to
while subtracting
from
for
. To eliminate this problem, we recommend fixing the values of one of the elements of both
and
; say by letting
and
.
It is important to consider carefully the sources of information about the group-specific latent variable distributions,
. For purposes of discussion, first consider the simple case in which
and
, so there is one continuous outcome and a single group. Since the factor loadings,
, characterize dependency among the outcomes, we recommend fixing
for identifiability when
. With this constraint, there is information in the data about the shape of
, since the distribution of
is characterized as the mixture of a t-distribution across
. Lack of fit of the t-distribution, such as a positively skewed shape or multimodality, can be accommodated through a nonnormal mixing distribution,
. Extending to the
group case, the mixing distribution will change dynamically from
to
across the range of the group index (e.g. across dose groups or time points). Hence, the model can accommodate systematic differences in lack of fit. For example, in the presence of heterogeneity in a dose or treatment effect, there may be increasing skewness in higher treatment groups.
Finally, considering the general case in which
and
, the density of the jth surrogate in group h can be expressed as the mixture of a t-distribution with mean
across the
-distribution for
:
![]() |
Hence, the marginal distribution of
will be increasingly driven by the characteristics of
as the factor loading
and the correlation with the other surrogates increase. If the surrogates in group h tend to be positively skewed or to have other characteristics inconsistent with the t-distribution, these features will be reflected in
. Common features of the surrogate distributions which tend to change across the groups will be reflected in differences among
. In this manner, the data clearly inform about the latent variable distributions
.
A Bayesian specification of the model is completed with priors for the threshold parameters
and additional unknowns in the measurement model (2.1). Letting
and
, Expression (2.1) is equivalent to
. Following Albert and Chib (1993)
, we choose a uniform improper prior for the threshold parameters,
, for all
such that
, with the
values known for
. For the remaining parameters, our prior can be expressed as follows:
![]() | (2.12) |
We choose the first and
st diagonal elements of
to be
0 in effect fixing
and
, for identifiability purposes. This is done to simplify book-keeping in developing the computational algorithm, and yields essentially identical results to treating
and
as strictly fixed constants. Focusing on the case in which the surrogates all have the same direction, we constrain
, though this sign restriction is not necessary for identifiability.
| 3. POSTERIOR COMPUTATION AND INFERENCES |
|---|
|
|
|---|
This section outlines a Markov chain Monte Carlo algorithm for posterior computation and predictive inference. In the absence of outside information about
and systematic changes that occur across groups, a natural choice for
is the standard normal distribution. This choice results in a semiparametric model, which is centered on a parametric model having normally distributed factors. Given the focus in the literature on normal latent variable models, this form is particularly appealing and will be our focus in developing a computational algorithm. The form of the algorithm is motivated by efficiency considerations, and we utilize approaches for efficient sampling in DP models while also using a block-updating approach for the unknowns in the measurement model. The steps involved in these two components are described separately in the following two subsections. We focus on the case in which all the surrogates are continuous since the extension to the general case is straightforward using the Albert and Chib (1993)
In this section, we describe our algorithm for updating the latent variables
, the mixture weights
, and precisions
, for
, integrating out the infinite-dimensional
. Our approach is related to the Pólya urn Gibbs sampler described by MacEachern (1994)
and West et al. (1994)
. Complications arise due to the DMDP structure of Expression (2.6), but most of these are alleviated by using the characterization of Expression (2.7).
Let
denote the unique values of the latent variable in the lth mixture component
, and let
denote that
and
, so that subject
belongs to the rth cluster in the lth mixture component, with
. Let
and
denote the total number of subjects having
and
, respectively. Also, let
,
,
,
, and
denote the values obtained excluding subject
. Then, the conditional prior distribution of
given the latent variable values and mixture component indicators for all other subjects is
![]() | (3.1) |
We first derive the conditional posterior distribution of
, updating this prior with the data. Introducing shorthand notation, let
and
denote the respective multipliers on
and
in Expression (3.1). Multiplying the conditional prior (3.1) by the conditional likelihood,
, with
, and normalizing results in the following full conditional posterior distribution for
:
![]() | (3.2) |
where
and
are the conditional posterior variance and mean derived under the base parametric prior
, the updated component weights are defined as follows:
![]() |
Computation can potentially proceed by Gibbs steps, which successively sample from the full conditional distribution (3.2) for each
. However, since this approach results in slow mixing, we suggest an alternative approach following MacEachern (1994)
. In particular, letting
if
is allocated to a new cluster in mixture component l and
if
, we alternate between the following steps:
- Update
by sampling each
from its full conditional distribution, which is multinomial with
for
,
. Whenever
, for any l, we replace
with a draw from
to assign subject
to their own cluster in component l.
- Generate new values for
, for
, conditional on the current configuration of subjects to clusters by sampling
, for
, from its full conditional posterior distribution, which is
, with 
- Update
, for
, from its full conditional posterior distribution which is 
where
if
for any r.
- Update
, for
, using the procedure proposed by West (1992)
, noting that for updating
the relevant number of clusters is
and the relevant sample size is
, which varies from iteration to iteration.
Sampling of the coefficients,
, and unknowns,
, proceeds as follows:
Step 2,a. Update
in a single block by sampling from the joint conditional distribution, which is
subject to the constraint that
for
, where
![]() |
where
is the jth row vector of
.
Step 2,b. Update the measurement error parameters by sampling from the full conditional distribution of
(for
):
![]() |
sampling from the full conditional distribution of
(for all
):
![]() |
and finally updating the
values in a Metropolis step.
Posterior summaries of
calculated from the MCMC output can be used as a basis for inferences on correlation between the surrogates. However, the primary focus is typically on assessing changes in the distribution of the latent response as a function of the predictors. For this purpose, it will be useful to have estimates of the predicted density function of
in each group, as well as measures of the magnitude, location, and weight of evidence of changes between groups. To address these goals, we recommend collecting draws from the conditional predictive distributions of
for a future subject in dose group h, for
. Generalizing Expression (2.4), this distribution is simply
![]() | (3.3) |
One can collect samples from this distribution after apparent convergence, along with the mean and selected percentiles. Samples can be obtained easily due to the simple mixture structure, the mean is available in closed form, and percentiles can be estimated by calculating the cdf at a dense grid of values.
After convergence, the samples of
represent draws from the predictive density of the latent response in group h, and inferences can be based on comparing these densities between groups. One can also estimate marginal posterior densities of differences in quantiles between groups, which is useful in summarizing group differences, as we illustrate in Section 4. In addition, by also collecting draws from the conditional distributions of the surrogate outcomes for additional subjects in each group, we can estimate predictive densities of the surrogates. By comparing these predictive densities to the empirical distributions, one can assess goodness of fit of the procedure. In particular, it is of interest to look for surrogates which do not follow the trajectory predicted by the model, since this may suggest that the simple one-factor structure may be insufficient. Although we have focused on the one-factor case throughout this article, the generalization to multiple factors is straightforward.
| 4. APPLICATION TO DNA DAMAGE STUDY |
|---|
|
|
|---|
We illustrate the methodology using data from a genotoxicity experiment analyzed previously by Dunson et al. (2003)
) drawn from an immortalized cell line were randomized to one of the five dose groups (0, 5, 20, 50, or 100 micromoles
), resulting in 100 cells per dose group. For each cell (
), we have
surrogate measures of DNA damage, including (1) % tail DNA, (2) tail extent divided by head extent, (3) extent tail moment, (4) Olive tail moment, and (5) tail extent. For a detailed description of these variables and diagnostics demonstrating nonnormality, refer to Dunson et al. (2003)
Letting
index the dose group and normalizing each of the surrogates,
, prior to analysis, we apply the approach proposed in Sections 2 and 3 to assess the effect of hydrogen peroxide exposure on the frequency of DNA strand breaks. To complete a Bayesian specification of the model, we let
,
for
, and
for
. A prior mean of 0 for the
values seems reasonable since the surrogates are normalized. The prior for the factor loadings expresses our belief in a moderate degree of dependency in the surrogates. In addition, we let
and
, since we anticipate heavier tails than normal, and we let
and
to express our belief in the magnitude of the residual variance component. Finally, for the hyperparameters in the priors for
and
in Expression (2.10), we follow the recommendation of Section 2.2.
We ran the MCMC algorithm for
iterations, discarding the first 5000 iterations as a burn-in, and collecting every 10th sample to thin the chain. Based on examination of trace plots, convergence was rapid and autocorrelation was low to moderate, reflecting good computational efficiency of our MCMC algorithm under the constraints imposed on the model to ensure identifiability. Posterior summaries are provided in Table 1.
|
Applying the approach described in Section 3.3, we estimated the predictive distribution of the latent response variable
for cells in each of the dose groups. The results are shown in Figure 1. Interestingly, the densities change considerably in shape as dose increases, with the density in the control group having an approximately log-normal shape with relatively low variance. As dose increases, the mean and variance increase substantially, the distribution flattens out, and the right tail becomes increasingly fat. Figure 2 shows boxplots for samples from the posterior distributions for changes in the 10th, 25th, 50th, 75th, and 90th percentiles of the predictive distribution for
attributable to increasing
exposure from 0 (
) to the maximum dose of 100 (
). The differences clearly increase as one moves further into the right tail, reflecting failure of a mean or median regression model. This pattern likely reflects heterogeneity among the cells in their response to oxidative stress induced by hydrogen peroxide exposure, with some cells having little or no induced damage. The ability to make inferences on changes in quantiles of the response distribution is an appealing feature of our approach.
|
|
It is important to assess how well the model fits the data since we make some parametric assumptions and use informative priors. In particular, we assume that a single latent response variable can be used to characterize departures from the t-distribution and effects of hydrogen peroxide exposure on each of the surrogate variables. To assess model fit, we estimated predictive distributions for each of the surrogate variables at each dose level and compared these estimates to histograms of the raw data. The results for the first four surrogates are included in Figure 3, with the fifth excluded to make the plot readable (though the results are similar).
|
Overall, the model-based predictive distributions do a good job at capturing the distributional shifts that occur as dose increases. The best fit is for the Olive tail moment, which was recommended by Dunson et al. (2003)
An important issue is sensitivity of the results to the choice of hyperparameters. To assess robustness, we repeated the analysis using alternative priors with (i)
and
to correspond to lower autocorrelation across dose groups, (ii)
to correspond to less uncertainty in the normal base distribution and a higher rate of adding atoms between dose groups, (iii) the prior variance doubled for all the parameters in the measurement model, and (iv) the prior precision doubled for all the parameters in the measurement model. In each of these cases, the figures were essentially identical to those obtained in the primary analysis. Estimates of quantiles of the predicted latent variable densities in each dose group under priors (i) and (ii) are shown in Table 2; results for priors (iii) and (iv) differed by only 1.6%, on average, from the primary analysis estimates and are not shown.
|
| 5. DISCUSSION |
|---|
|
|
|---|
This article has proposed a Bayesian semiparametric latent response model in which the latent variable density can shift dynamically across groups. Although inferences on covariate effects on the mean response profile may be somewhat robust to the parametric form for the latent variable density, linear mean regression structures may be insufficiently flexible in many applications. For example, this is often the case in epidemiologic and toxicologic studies in which there may be increasing variance and skewness at higher exposure levels. The genotoxicity application presents one striking example of this scenario, though we anticipate many other applications in which the latent linear mean regression model fails. Our proposed Bayesian approach is quite flexible, and should provide a useful alternative to existing methods, such as the semiparametric median regression modeling approach of Kottas and Gelfand (2001)
The proposed DMDP should prove useful in other applications in which a distribution or random function can change across levels of a predictor. Note that one can either apply the DMDP directly to the distribution of interest, in order to model the distribution as discrete with an unknown number and location of atoms, or use the DMDP for a mixture distribution, in order to characterize a continuous density.
The DMDP should also prove useful when interest focuses on clustering of observations within and across groups. For example, in a time course or dose response gene expression study, one may want to cluster genes having similar levels of differential expression, both within a given time or dose group and across times or doses.
An interesting area for future research is the generalization to broader classes of factor analytic and structural equation models, allowing for uncertainty in the number of factors as in Lopes and West (2004)
for the normal linear factor model. It is possible that fewer factors may be needed to characterize the covariance structure if one allows the factors to have nonparametric distributions. However, issues of interpretation and identifiability need to be carefully considered. Potentially, a rich class of multivariate distributions could be generated by including a few factors having unknown distributions. This idea is conceptually related to models based on mixtures of factor analyzers (Utsugi and Kumagai, 2001
; Fokoue and Titterington, 2003
; McLachlan et al., 2003
), though previous methods have assumed fully parametric mixture structures.
| APPENDIX |
|---|
|
|
|---|
From (2.6) and (2.7), it is straightforward to derive
:
![]() | (A.1) |
The expectation and variance terms are as follows:
![]() | (A.2) |
for
. The numerator in Expression (A.1) can be calculated as follows:
![]() | (A.3) |
Hence, focusing on the special case in which
, for
, Expression (2.8) follows from straightforward algebra.
| ACKNOWLEDGMENTS |
|---|
The author thanks Mary Watson and Jack Taylor for generously providing the data used in the example. Thanks also to Alan Gelfand, David Umbach, and Gregg Dinse for their helpful comments on an early draft of the manuscript.
| REFERENCES |
|---|
|
|
|---|
-
Albert JH and Chib S. (1993) Bayesian-analysis of binary and polychotomous response data. Journal of the American Statistical Association 88:669679.[CrossRef]
Antoniak CE. (1974) Mixtures of Dirichlet processes with applications to nonparametric problems. Annals of Statistics 2:11521174.[Web of Science]
Bush CA and MacEachern SN. (1996) A semiparametric Bayesian model for randomised block designs. Biometrika 83:275285.
De Iorio M, Müller P, Rosner GL, MacEachern SN. (2002) ANOVA DDP models: a review. In Denison DD, Hansen MH, Holmes CC, Mallick B, Yu B (Eds.). Nonlinear Estimation and Classification(Springer, New York) pp. 467.
De Iorio M, Müller P, Rosner GL, MacEachern SN. (2004) An ANOVA model for dependent random measures. Journal of the American Statistical Association 99:205215.[CrossRef]
Dunson DB. (2000) Bayesian latent variable models for clustered mixed outcomes. Journal of the Royal Statistical Society, Series B 62:355366.[CrossRef]
Dunson DB. (2003) Dynamic latent trait models for multidimensional longitudinal data. Journal of the American Statistical Association 98:555563.[CrossRef]
Dunson DB, Watson M, Taylor JA. (2003) Bayesian latent variable models for median regression on multiple outcomes. Biometrics 59:296304.[CrossRef][Web of Science][Medline]
Ferguson TS. (1973) A Bayesian analysis of some nonparametric problems. Annals of Statistics 1:209230.[CrossRef][Web of Science]
Ferguson TS. (1974) Prior distributions on spaces of probability measures. Annals of Statistics 2:615629.[CrossRef][Web of Science]
Fernandez C and Steel MFJ. (2001) Bayesian regression analysis with scale mixtures of normals. Econometric Theory 16:80101.[CrossRef]
Fokoue E and Titterington DM. (2003) Mixtures of factor analysers: Bayesian estimation and inference by stochastic simulation. Machine Learning 50:7394.[CrossRef]
Gelfand AE and Kottas A. (2002) A computational approach for full nonparametric Bayesian inference under Dirichlet process mixture models. Journal of Computational and Graphical Statistics 11:289305.
Gelfand AE, Kottas A, MacEachern SN. (2004) Bayesian nonparametric spatial modeling with Dirichlet process mixing. Technical Report AMS 2004-5. (Department of Applied Math and Statistics, University of California, Santa Cruz).
Griffin, J.E. and Steel, M.F.J., (2006). Order-based dependent Dirichlet process. Journal of the American Statistical Association (in press).
Ishwaran H and James LF. (2002) Approximate Dirichlet process computing in finite normal mixtures: smoothing and prior information. Journal of Computational and Graphical Statistics 11:508532.[CrossRef]
Ishwaran H and Takahara G. (2002) Independent and identically distributed Monte Carlo algorithms for semiparametric linear mixed models. Journal of the American Statistical Association 97:11541166.[CrossRef]
Johnson VE and Albert JH. (1999) Ordinal Data Modeling(Springer., New York).
Kleinman KP and Ibrahim JG. (1998) A semiparametric Bayesian approach to the random effects model. Biometrics 54:921938.[CrossRef][Web of Science][Medline]
Kottas A and Gelfand AE. (2001) Bayesian semiparametric median regression modeling. Journal of the American Statistical Association 96:14581468.[CrossRef]
Lavine M. (1995) On an approximate likelihood for quantiles. Biometrika 82:220222.
Lopes HF and West M. (2004) Bayesian model assessment in factor analysis. Statistica Sinica 14:4167.[Web of Science]
MacEachern SN. (1994) Estimating normal means with a conjugate style Dirichlet process prior. Communications in Statistics: Simulation and Computation 23:727741.[CrossRef][Web of Science]
MacEachern SN. (1999) Dependent nonparametric process, ASA Proceeding of the Section on Bayesian Statistical Science. (American Statistical Association, Alexandria, VA).
MacEachern SN. (2000) Dependent Dirichlet processes (unpublished)Department of Statistics, The Ohio State University.
McLachlan GJ, Peel D, Bean RW. (2003) Modelling high-dimensional data by mixtures of factor analyzers. Computational Statistics and Data Analysis 41:379388.[CrossRef]
Miglioretti DL. (2003) Latent transition models for mixed outcomes. Biometrics 59:710720.[CrossRef][Web of Science][Medline]
Moustaki I and Knott M. (2000) Generalized latent trait models. Psychometrika 65:391411.[CrossRef][Web of Science]
Mukhopadhyay S and Gelfand AE. (1997) Dirichlet process mixed generalized linear models. Journal of the American Statistical Association 92:633639.[CrossRef]
Müller P and Quintana FA. (2004) Nonparametric Bayesian data analysis. Statistical Science 19:95110.
Müller P, Quintana FA, Rosner G. (2004) A method for combining inference across related nonparametric Bayesian models. Journal of the Royal Statistical Society, Series B 66:735749.
Muthén D. (1984) A general structural equation model with dichotomous, ordered categorical, and continuous latent variable indicators. Psychometrika 49:115132.[CrossRef][Web of Science]
Neal RM. (2000) Markov chain sampling methods for Dirichlet process mixture. Journal of Computational and Graphical Statistics 9:249265.
Reboussin BA, Liang KY, Reboussin DM. (1999) Estimating equations for a latent transition model with multiple discrete indicators. Biometrics 55:839845.[CrossRef][Web of Science][Medline]
Roy J and Lin XH. (2000) Latent variable models for longitudinal data with multiple continuous outcomes. Biometrics 56:10471054.[CrossRef][Web of Science][Medline]
Roy J and Lin XH. (2002) Analysis of multivariate longitudinal outcomes with nonignorable dropout and missing covariates: changes in methadone treatment practices. Journal of the American Statistical Association 97:4052.[CrossRef]
Roy J, Lin XH, Ryan LM. (2003) Scaled marginal models for multiple continuous outcomes. Biostatistics 4:371383.[Abstract]
Sammel MD and Ryan LM. (2002) Effects of covariance misspecification in a latent variable model for multiple outcomes. Statistica Sinica 12:12071222.[Web of Science]
Sammel MD, Ryan LM, Legler JM. (1997) Latent variable models for mixed discrete and continuous outcomes. Journal of the Royal Statistical Society B 59:667678.[CrossRef]
Sethuraman J. (1994) A constructive definition of the Dirichlet process prior. Statistica Sinica 2:639650.
Shi JQ and Lee SY. (2000) Latent variable models with mixed continuous and polytomous data. Journal of the Royal Statistical Society, Series B 62:7787.[CrossRef]
Utsugi A and Kumagai T. (2001) Bayesian analysis of mixtures of factor analyzers. Neural Computation 13:9931002.[CrossRef][Web of Science][Medline]
West M. (1987) On scale mixtures of normal-distributions. Biometrika 74:646648.
West M. (1992) Hyperparameter estimation in Dirichlet process mixture model. ISDS Discussion Paper No. 92-03. (Duke University, Durham, NC).
West M, Müller P, Escobar MD. (1994) Hierarchical priors and mixture models, with application in regression and density estimation. In Smith AFM and Freeman PR (Eds.). Aspects of Uncertainty: A Tribute to D.V. Lindley(Wiley., London).
Xu J and Zeger SL. (2001) The evaluation of multiple surrogate endpoints. Biometrics 57:8187.[CrossRef][Web of Science][Medline]
Received December 4, 2004; revised June 7, 2005; revised December 4, 2005; revised January 25, 2006; accepted for publication February 15, 2006.
![]()
CiteULike
Connotea
Del.icio.us What's this?
This article has been cited by other articles:
![]() |
A. Rodriguez, D. B. Dunson, and A. E. Gelfand Bayesian nonparametric functional data analysis through density estimation Biometrika, March 1, 2009; 96(1): 149 - 162. [Abstract] [PDF] |
||||
![]() |
A. Rodriguez, D. B. Dunson, and J. Taylor Bayesian hierarchically weighted finite mixture models for samples of distributions Biostat., January 1, 2009; 10(1): 155 - 171. [Abstract] [Full Text] [PDF] |
||||
![]() |
D. B. Dunson and S. D. Peddada Bayesian nonparametric inference on stochastic ordering Biometrika, December 1, 2008; 95(4): 859 - 874. [Abstract] [PDF] |
||||
![]() |
D. B. Dunson and J.-H. Park Kernel stick-breaking processes Biometrika, June 1, 2008; 95(2): 307 - 323. [Abstract] [PDF] |
||||
![]() |
D. B. Dunson Bayesian methods for latent trait modelling of longitudinal data Statistical Methods in Medical Research, October 1, 2007; 16(5): 399 - 415. [Abstract] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||























) among cells in each of the dose groups. The solid vertical lines represent the mean and the dashed lines represent 2.5th and 97.5th percentiles.
and
.
in quantiles between the conditional distribution of
and
.





