Biostatistics Advance Access originally published online on October 10, 2006
Biostatistics 2007 8(3):609-624; doi:10.1093/biostatistics/kxl032
An informative Bayesian structural equation model to assess source-specific health effects of air pollution
Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115, USA meg.nikolov{at}gmail.com
Department of Environmental Health, Harvard School of Public Health, Boston, MA 02115, USA
* To whom correspondence should be addressed.
| SUMMARY |
|---|
A primary objective of current air pollution research is the assessment of health effects related to specific sources of air particles or particulate matter (PM). Quantifying source-specific risk is a challenge because most PM health studies do not directly observe the contributions of the pollution sources themselves. Instead, given knowledge of the chemical characteristics of known sources, investigators infer pollution source contributions via a source apportionment or multivariate receptor analysis applied to a large number of observed elemental concentrations. Although source apportionment methods are well established for exposure assessment, little work has been done to evaluate the appropriateness of characterizing unobservable sources thus in health effects analyses. In this article, we propose a structural equation framework to assess source-specific health effects using speciated elemental data. This approach corresponds to fitting a receptor model and the health outcome model jointly, such that inferences on the health effects account for the fact that uncertainty is associated with the source contributions. Since the structural equation model (SEM) typically involves a large number of parameters, for small-sample settings, we propose a fully Bayesian estimation approach that leverages historical exposure data from previous related exposure studies. We compare via simulation the performance of our approach in estimating source-specific health effects to that of 2 existing approaches, a tracer approach and a 2-stage approach. Simulation results suggest that the proposed informative Bayesian SEM is effective in eliminating the bias incurred by the 2 existing approaches, even when the number of exposures is limited. 
We employ the proposed methods in the analysis of a concentrator study investigating the association between ST-segment, a cardiovascular outcome, and major sources of Boston PM and discuss the implications of our findings with respect to the design of future PM concentrator studies.
Keywords: Factor analysis; Latent variables; Measurement error; Multivariate receptor model; Source apportionment
| 1. INTRODUCTION |
|---|
Epidemiological studies have consistently demonstrated increased morbidity and mortality outcomes associated with elevated levels of airborne particulate matter (PM; Dockery and others, 1993
The exposure assessment literature contains an ample amount of research that focuses on estimation of source-specific contributions from a complex mixture of air pollution (for a review, see Hopke, 2003). Methods such as source apportionment and multivariate receptor modeling use factor analytic techniques to estimate the contributions of a small number of pollution sources from the measured mixture components, elements and other compounds. Although receptor modeling is well developed for exposure assessment, little work has been done to evaluate the appropriateness of characterizing unobservable sources in this way to estimate source-specific health effects. The problem can be thought of as an exposure measurement error problem (Carroll and others, 1995), whereby the PM exposure generated from a particular source is estimated rather than known or measured directly.
At present, existing source-specific health effects analyses rely on approaches that do not take into account the uncertainty associated with estimated source contributions. A "2-stage" strategy uses estimated source contributions from a factor analysis to assess the impacts of specific pollution sources on health effects (Laden and others, 2000; Clarke and others, 2000). A variation on this 2-stage strategy is the tracer approach, whereby the estimated source contributions are replaced by the elemental concentrations of a distinct set of tracers in the health effects analysis (Wellenius and others, 2003).
The statistical properties of the PM health effects estimates obtained from these existing approaches are not well understood. Previous statistical research has shown that, in simpler models, measurement error associated with estimated latent variables can lead to bias in the subsequent regression coefficient estimates (Tsiatis and others, 1995). It is unclear whether this bias will occur in PM research as the mixtures typically observed in PM exposures are quite different from those in other latent variable settings. As noted by Dominici and others (2003), there remains much work to be done in order to understand these estimates from a statistical standpoint and to assess the reliability of these estimates of association between pollution sources and health outcomes.
In this paper, we propose a structural equation framework for assessing source-specific health effects using speciated data in the form of elemental concentrations. This approach corresponds to jointly fitting a multivariate receptor model to the exposure data and a model for the health outcome given source contributions. Because the source contributions and health effects are modeled jointly, resulting inferences on the health effects account for the fact that uncertainty is associated with the exposures of interest.
This work is motivated by animal toxicology studies evaluating the mechanisms of morbidity and mortality associated with inhalation of concentrated air particles (CAPs) conducted at the Harvard School of Public Health (HSPH) (Godleski and others, 2000). Harvard researchers have implemented multiple animal toxicology studies to investigate the adverse effects of PM on cardiopulmonary and respiratory activities in canines and rats. Samples of ambient Boston aerosol are collected and concentrated approximately 30 times by the Harvard ambient particle concentrator (Sioutas and others, 1995; Godleski and others, 2000) without altering the physical and chemical composition of the mixture. Animals are then exposed to the concentrated complex mixture for a given period of time, and cardiac and respiratory outcomes are monitored on each exposed animal. Because exposure is generated from ambient pollution, exposures are essentially random across days, and hence, a complete exposure assessment is made for each concentrated exposure. Data from these studies consist of the measured elemental concentrations of the concentrated air pollution mixture and the recorded health outcomes on the exposed animals. Because of the complexity of these studies, in any one study, investigators typically expose animals on approximately 20 unique exposure days.
The structural equation model (SEM), as well as the factor analysis model used in the 2-stage approach, typically involves a large number of parameters. Given the high dimensionality of the model, the typical study has an insufficient number of exposures to obtain reliable parameter estimates using maximum likelihood (ML). An approach for handling this problem is to consider a reduced number of elemental species for a health effects analysis (Clarke and others, 2000). To overcome the small-sample problem, we propose a fully Bayesian estimation approach that leverages historical exposure data from previous concentrator studies in defining informative priors on the parameters relating the measured exposures to the source contributions. This serves to pool exposure information from studies which are consistent in their collection and analysis of CAPs data.
The remainder of this paper is arranged as follows: Section 2 describes in detail the design and data from a study evaluating the effects of CAPs on myocardial ischemia in dogs (Wellenius and others, 2003). Section 3 presents the SEM, and Section 4 discusses the informative Bayesian approach to estimation. Section 5 presents a simulation study to examine the statistical properties of health effects estimates obtained with the tracer, 2-stage, and structural equation methodologies. Section 6 demonstrates an application of the informative Bayesian SEM to analyze the study of Wellenius and others (2003). Finally, in Section 7 we discuss our findings along with implications for the design of future PM concentrator studies.
| 2. DATA |
|---|
Wellenius and others (2003)
Samples of the CAPs exposures were collected and analyzed. Each CAPs exposure was measured for sulfate (SULF) via ion chromatography, black carbon (BC) using an aethalometer, elemental carbon (EC) and organic carbon (OC) determined with a thermal and optical reflectance method, and elemental concentrations (in µg/m3) collected via X-ray fluorescence, specifically: aluminum (Al), arsenic (As), barium (Ba), bromine (Br), calcium (Ca), cadmium (Cd), chlorine (Cl), chromium (Cr), copper (Cu), iron (Fe), potassium (K), manganese (Mn), nickel (Ni), sodium (Na), lead (Pb), sulfur (S), selenium (Se), silicon (Si), titanium (Ti), vanadium (V), and zinc (Zn). The Sham exposures were assumed to have zero concentration of all elements, as this had been confirmed in earlier test runs of the concentrator (Sioutas and others, 1995
Extensive research into the composition of Boston aerosol (Oh and others, 1997; Clarke and others, 2000; Batalha and others, 2002) has revealed 4 major sources of PM pollution in this area: resuspended road dust consisting mainly of the crustal elements (Si and Al), coal-fired power plants (S and SULF), oil combustion primarily for home heating (Ni and V), and motor vehicle exhaust (BC, OC, and EC). Wellenius and others (2003) considered linear mixed models for log transformed peak ST-segment, using individual elemental concentrations as tracer representatives of known sources of air pollution in Boston. These authors chose silicon, sulfur, nickel, and BC to represent road dust, power plants, oil combustion, and motor vehicles, respectively, and found a strong positive association between log peak ST-segment and resuspended road dust. A question that naturally arises is whether measurement error associated with tracer representatives of the source contributions obscured the relationship between log peak ST-segment and the other pollution sources. In this article, we analyze the data using methods that account for the uncertainty in estimated source contributions.
| 3. MODEL AND NOTATION |
|---|
We propose a full-likelihood approach that estimates the health effects by fitting the receptor and health outcome models jointly. The joint model, which falls within the SEM framework (Bollen, 1989), is

$$X_t = \Lambda \eta_t + \epsilon^{X}_{t}, \tag{3.1}$$

$$Y_t = \alpha + \beta^{T} \eta_t + \epsilon^{Y}_{t}, \tag{3.2}$$

where for a given time t, $X_t$ is the vector of p elemental concentrations, $\eta_t$ is the vector of the k unobserved source contributions, and $Y_t$ is the health outcome. We assume that $Y_t$ represents a single continuous variable and that k is known. The model for $X_t$ is the factor analysis model for the exposure analysis (Park, 2001), where $\Lambda$ is the $(p \times k)$ matrix of factor loadings, also known as the factor pattern, and $\epsilon^{X}_{t} \sim N_p(0, \Psi)$ for diagonal $\Psi$. The vector $(\lambda_{1j}, \lambda_{2j}, \ldots, \lambda_{pj})$ may be viewed as the profile of pollution source j. The parameters $\beta$ quantify the k source-specific health effects, and $\epsilon^{Y}_{t} \sim N(0, \sigma_Y^2)$. Standard factor analysis assumes $\eta_t \sim N_k(0, \Phi)$.
We extend the standard SEM defined by (3.1)–(3.2) in 2 ways. First, we truncate a normal distribution for $\eta_t$ to ensure the physical nonnegativity of source contributions,

$$\eta_t \sim N_k(\mu, \Phi)\, I(\eta_t \geq 0).$$

Alternatively, we could specify a lognormal distribution. In our application, we assess the sensitivity of our conclusions to distributional assumptions on $\eta_t$ by fitting the SEM both ways. The specification of nonnegative source contributions extends the standard factor analysis, which does not restrict the domain of $\eta_t$ in the model and allows negative source contributions. Positive matrix factorization (Paatero and Tapper, 1994) is an alternative method that uses constrained weighted least squares to ensure nonnegative source contributions.
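The two distributional choices for the source contributions can be sketched numerically. The following is only an illustration under stated assumptions: the sources are taken to be independent (diagonal covariance), the means and standard deviations are hypothetical values, and the function names `sample_truncated_normal` and `sample_lognormal` are invented for this sketch rather than taken from the paper or its software.

```python
import numpy as np

def sample_truncated_normal(mu, sd, n, rng):
    """Draw n vectors whose independent components are N(mu_j, sd_j^2) truncated to [0, inf).

    Simple rejection sampling: any component that comes out negative is redrawn.
    """
    mu = np.asarray(mu, dtype=float)
    sd = np.asarray(sd, dtype=float)
    draws = rng.normal(mu, sd, size=(n, mu.size))
    negative = draws < 0
    while negative.any():
        # Redraw only the rejected entries, each from its own component's distribution.
        mu_rej = np.broadcast_to(mu, draws.shape)[negative]
        sd_rej = np.broadcast_to(sd, draws.shape)[negative]
        draws[negative] = rng.normal(mu_rej, sd_rej)
        negative = draws < 0
    return draws

def sample_lognormal(log_mu, log_sd, n, rng):
    """Draw n vectors with independent lognormal components (strictly positive)."""
    log_mu = np.asarray(log_mu, dtype=float)
    return np.exp(rng.normal(log_mu, log_sd, size=(n, log_mu.size)))

rng = np.random.default_rng(0)
# Hypothetical settings for k = 4 sources.
eta_tn = sample_truncated_normal(mu=[1.0, 0.5, 0.2, 0.8], sd=[1.0, 1.0, 1.0, 1.0], n=1000, rng=rng)
eta_ln = sample_lognormal(log_mu=[0.0, -0.5, -1.0, 0.0], log_sd=1.0, n=1000, rng=rng)
```

Both samplers return nonnegative contributions, which is the property the extension is meant to guarantee; they differ in tail behavior, which is why a sensitivity analysis over the two choices is informative.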
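Second, to accommodate the repeated measures design of these studies, we build random effects into the model for the health outcome,

$$Y_{st} = \alpha + \beta^{T} \eta_t + Z_{st}^{T} b_s + \epsilon^{Y}_{st},$$

where $Y_{st}$ is the health outcome, $Z_{st}$ is the vector of covariates for unit s at time t, $b_s$ is the vector of random effects for unit s, $b_s \sim N(0, \Sigma_b)$, $\epsilon^{Y}_{st} \sim N(0, \sigma_Y^2)$, and $b \perp \epsilon^{Y}$.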
The SEM defined by (3.1)–(3.2) is not identifiable without further assumptions. Because the source profiles are unknown and the source contributions are unobserved, the SEM does not have a unique solution. However, the model may be made identifiable by constraining parameters in $\Lambda$. We consider the following 2 sets of identifiability conditions, which result in a confirmatory factor analysis (Park and others, 2002):
C1: There are at least k − 1 zero elements in each column of $\Lambda$.
C2: The rank of $\Lambda^{(j)}$ is k − 1, where $\Lambda^{(j)}$ is the matrix composed of the rows containing the assigned 0's in the jth column with those assigned 0's deleted.
C3: $\lambda_{ij} = 1$ for some i (i = 1,2,...,p) for each j = 1,2,...,k.
D1: There are at least k rows in $\Lambda$ with each of the k rows containing only one nonzero element.
D2: Same as C2.
D3: Same as C3.
The C1–C3 conditions assume that there are at least k − 1 elements per source that are not associated with that source and that this group of elements is not the same for all sources. The D1–D3 conditions assume that there is at least one "tracer" element for each source in the sense that the tracer does not load on other sources. Further, as noted by Park and others (2001), these conditions identify the loadings up to normalization. Thus, for each source, we specify one loading to be equal to 1, effectively placing the source contribution on the scale of the element having the constrained loading of 1 for that source. Combining all these constraints, the D1–D3 conditions effectively use the concept of tracers in the factor analysis model but for each source supplement this tracer information with that contained in nontracer elements.
The C1–C3 conditions and the D1–D3 conditions are each sufficient but not necessary to establish identifiability. While there exist alternative conditions, other commonly used proposals are also sufficient but not necessary. For instance, Park and others (2002) proposed sufficient conditions which, instead of placing constraints on the factor loadings, assume that some sources are absent on some days. These authors argued that in some settings, this alternative set of constraints may be plausible if one knows that a particular source, such as a power plant in the region, has been shut down for some period of time. In the same vein, Bandeen-Roche (1994) considered situations in which a subset of the source contributions is known. In our setting, however, we do not have information on the presence or absence of a particular source on a particular day. Thus, given the existing literature on the pollution mixture in the Boston area (Oh and others, 1997), it seems safer to assume that certain elements are not markers for certain sources. Bollen (1989) also gave some rules that help to determine whether a model is identifiable, but, as noted by this author, these rules are also either necessary or sufficient, but not both.
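As an informal aid to reading the identifiability conditions of this section, the sketch below checks them on a candidate loading matrix. It rests on assumptions that are not in the paper: an entry equal to exactly 0 is treated as a constrained zero, an entry equal to exactly 1 as a normalization constraint, and the function name `check_identifiability` and the example matrix are hypothetical. D2 and D3 are not reported separately because they coincide with C2 and C3.

```python
import numpy as np

def check_identifiability(loadings):
    """Check conditions C1, C2, C3 and D1 for a p x k loading matrix.

    Assumption: exact zeros are constrained zeros, exact ones are fixed loadings.
    """
    lam = np.asarray(loadings, dtype=float)
    p, k = lam.shape
    zero = lam == 0

    # C1: at least k - 1 constrained zeros in every column.
    c1 = bool((zero.sum(axis=0) >= k - 1).all())

    # C2: for column j, keep the rows with a zero in column j, delete column j,
    # and require the remaining matrix to have rank k - 1.
    c2 = True
    for j in range(k):
        sub = np.delete(lam[zero[:, j], :], j, axis=1)
        if sub.shape[0] == 0 or np.linalg.matrix_rank(sub) != k - 1:
            c2 = False

    # C3: every column has some loading fixed at 1.
    c3 = bool((lam == 1).any(axis=0).all())

    # D1: at least k rows that contain exactly one nonzero element (tracer rows).
    d1 = bool(((~zero).sum(axis=1) == 1).sum() >= k)

    return {"C1": c1, "C2": c2, "C3": c3, "D1": d1}

# Hypothetical tracer pattern: p = 5 elements, k = 3 sources, one tracer per source.
tracer_pattern = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.3, 0.2],
    [0.1, 0.7, 0.4],
])
result = check_identifiability(tracer_pattern)
unconstrained = check_identifiability(np.full((5, 3), 0.5))
```

In this example the tracer pattern satisfies every condition, whereas a fully unconstrained matrix satisfies none, which mirrors the statement that the model is not identifiable without constraints on the loadings.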
| 4. ESTIMATION |
|---|
Standard SEMs may be fit via ML using existing latent variable software, such as Mplus (Muthen and Muthen, 1998
To overcome the small number of unique exposure days, we propose an informative Bayesian approach to model fitting. This approach is especially appealing considering that HSPH researchers have conducted multiple concentrator studies, all of which are consistent in their collection and analysis of exposure data. It is reasonable to pool the exposure data from prior studies to estimate the profiles of PM sources in Boston. An informative Bayesian approach leverages historical exposure data to obtain more reliable estimates of the source profiles, thus improving our ability to estimate the health effects investigated in an individual study.
The Bayesian approach incorporates information from previous studies through specification of the priors. In this case, a preliminary factor analysis of the historical exposure data provides prior information on the unknown factor pattern $\Lambda$. Let $\lambda$ be the $k \times (p - k)$ vector of all unconstrained factor loadings, $\lambda = (\lambda_1^{T}, \lambda_2^{T}, \ldots, \lambda_k^{T})^{T}$, where $\lambda_j$ is the vector of unconstrained loadings for source j, j = 1,...,k. Let $\lambda^{*}$ and $\Sigma^{*}$ represent the posterior mean and covariance of the factor loadings obtained from a Bayesian factor analysis of the historical data. For the informative Bayesian SEM, the prior distribution on the free parameters in $\Lambda$ may be defined as

$$\lambda \sim N(\lambda^{*}, \Sigma^{*}), \tag{4.1}$$

while the constrained loadings are treated as fixed constants in the likelihood.
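One way to picture how the prior in (4.1) could be assembled is to summarize posterior draws of the free loadings from the historical factor analysis by their mean vector and covariance matrix. The sketch below assumes the draws are stored with one MCMC sample per row; the random numbers standing in for historical output, the dimensions, and the function name `prior_from_posterior_draws` are all hypothetical.

```python
import numpy as np

def prior_from_posterior_draws(draws):
    """Summarize posterior draws (rows = MCMC samples, columns = free loadings)
    by their mean vector and covariance matrix."""
    draws = np.asarray(draws, dtype=float)
    mean = draws.mean(axis=0)
    cov = np.cov(draws, rowvar=False)
    return mean, cov

# Hypothetical stand-in for historical posterior output: 1000 draws of 6 free loadings.
rng = np.random.default_rng(1)
historical_draws = rng.normal(loc=0.4, scale=0.05, size=(1000, 6))
prior_mean, prior_cov = prior_from_posterior_draws(historical_draws)
```

Keeping the full covariance matrix, rather than only marginal variances, preserves any posterior dependence between loadings when the summary is carried forward as a prior.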
| 5. SIMULATION STUDY |
|---|
We conducted a simulation study to examine the statistical properties of the health effects estimates obtained via the tracer, 2-stage, and structural equation approaches. In the interest of direct comparison between the various approaches, we assume a normal mean zero distribution on the source contributions to ensure that our assessments are not confounded by distributional assumptions made by different implementations of SEMs.
In order to make our findings most relevant to the HSPH concentrator studies, we based our simulations on the known sources of Boston PM pollution described previously. We obtained realistic parameter settings for $\Lambda$, $\Phi$, and $\Psi$ from a confirmatory factor analysis of the complete aggregated exposure data (n = 178). We conducted our analysis on a subset of p = 13 elements deemed to be major components of the 4 known sources of Boston PM; silicon (Si), sulfur (S), nickel (Ni), OC, aluminum (Al), titanium (Ti), calcium (Ca), SULF, selenium (Se), vanadium (V), bromine (Br), BC, and EC. Since convergence problems are common when elemental concentrations are on widely different scales, each element was scaled by its sample standard deviation, which is equivalent to conducting a factor analysis on the sample correlation matrix, as opposed to the sample covariance matrix.
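The equivalence between scaling each element by its sample standard deviation and working with the sample correlation matrix can be verified numerically. The data below are hypothetical draws on deliberately mismatched scales, not the concentrator measurements.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical concentrations for 178 days and 3 elements on very different scales.
X = rng.normal(size=(178, 3)) * np.array([0.01, 1.0, 100.0])
X[:, 1] += 50.0 * X[:, 0]  # induce correlation between the first two elements

# Divide each element (column) by its sample standard deviation.
X_scaled = X / X.std(axis=0, ddof=1)

cov_scaled = np.cov(X_scaled, rowvar=False)  # covariance of the scaled data
corr_raw = np.corrcoef(X, rowvar=False)      # correlation of the raw data
```

The two matrices agree up to floating point error, so any method that takes the covariance of the scaled data as input is effectively analyzing the correlation matrix of the raw data.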
We constrained one "tracer" element for each of the k = 4 sources according to the D1–D3 identifiability conditions. We chose silicon, sulfur, nickel, and OC to identify road dust, power plants, oil combustion, and motor vehicles, respectively. A preliminary exploratory factor analysis justified the "tracer" identifiability conditions, since the estimated factor loadings of silicon, sulfur, nickel, and OC were low (< 0.2) on all but a single source. Due to space constraints, values of the parameter settings for $\Lambda$, $\Phi$, and $\Psi$ may be found in an accompanying online technical report (Nikolov and others, 2006). Settings for the health model parameters were motivated by the investigation of Wellenius and others (2003) of PM effects on heart rate: $\alpha = 86$, $\beta = 2$, and $\sigma_Y = 8$, yielding an effect size of $\beta/\sigma_Y = 0.25$.
To simulate exposure, we generated source contributions from $\eta \sim N_4(0, \Phi)$ and elemental concentrations from $X \mid \eta \sim N_{13}(\Lambda\eta, \Psi)$. We generated health outcomes assuming a health effect from a single source. Specifically, for a given source j, the health outcome was simulated from a simple linear regression model assuming only an effect of that source. For example, $y_1$ is a vector of simulated health outcomes where the health effect is associated with the first factor, road dust. For each set of generated exposures, we generate 4 sets of health outcomes, $y_1$, $y_2$, $y_3$, and $y_4$, where the health effect corresponds to the different pollution sources, road dust, power plants, oil combustion, and motor vehicles, respectively. We analyzed these simulated health outcomes separately.
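The data-generating steps of this simulation can be sketched as follows. This is a minimal illustration, not the authors' code: the loading matrix is a small hypothetical example with p = 6 elements and k = 2 sources (the paper's settings live in its technical report), the source and error covariances are assumed diagonal, and `simulate_study` is an invented name. The intercept, effect, and residual standard deviation (86, 2, and 8) are the values quoted in the text.

```python
import numpy as np

def simulate_study(Lambda, phi, psi, alpha, beta, sigma_y, source, n_days, n_animals, rng):
    """Simulate one study: exposures for n_days and outcomes for n_animals per day.

    Lambda : (p, k) loading matrix; phi : (k,) source variances (diagonal Phi);
    psi : (p,) element error variances (diagonal Psi); source : index of the single
    source that carries the health effect beta.
    """
    Lambda = np.asarray(Lambda, dtype=float)
    p, k = Lambda.shape
    # Source contributions: eta ~ N_k(0, diag(phi)), one row per exposure day.
    eta = rng.normal(0.0, np.sqrt(phi), size=(n_days, k))
    # Elemental concentrations: X | eta ~ N_p(Lambda eta, diag(psi)).
    X = eta @ Lambda.T + rng.normal(0.0, np.sqrt(psi), size=(n_days, p))
    # Health outcomes: simple linear regression on one source, shared within a day.
    mean_y = alpha + beta * eta[:, source]
    y = mean_y[:, None] + rng.normal(0.0, sigma_y, size=(n_days, n_animals))
    return eta, X, y

# Hypothetical loadings (not the paper's values): rows 0 and 1 act as tracers.
Lambda = np.array([[1.0, 0.0], [0.0, 1.0], [0.8, 0.1], [0.6, 0.2], [0.1, 0.9], [0.3, 0.7]])
phi = np.array([1.0, 1.0])
psi = np.full(6, 0.1)

rng = np.random.default_rng(3)
eta, X, y = simulate_study(Lambda, phi, psi, alpha=86.0, beta=2.0, sigma_y=8.0,
                           source=0, n_days=20, n_animals=2, rng=rng)
```

Calling the function once per value of `source` on the same exposures would give the separate outcome vectors, one per pollution source, that the text analyzes individually.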
Exposures, source contributions and elemental concentrations, were generated for $n \in \{20, 100\}$ days. Although the HSPH concentrated particle experiments typically do not run with 100 exposure days, we included this hypothetical scenario to confirm that any deficiencies of the ML SEM are due to a small number of exposure days. The health outcomes were generated for 2 animals per exposure day, for a total of $2n \in \{40, 200\}$ outcomes.
We obtained health effects estimates using 5 different strategies:
- Known source contributions: Although source contributions are not directly measured in the studies motivating this research, here we simulate them so that they are effectively known. We estimate the health effects using these source contributions as predictors in the health effects model. This setting represents an unobtainable "gold standard" and is provided for reference only.
- Tracer approach: We estimated the health effects using the elemental concentrations of the distinct set of the tracers, silicon, sulfur, nickel, and OC, as covariates in the health effects model.
- 2-stage approach: We first conducted a confirmatory factor analysis on all simulated elements, constraining the factor loadings for the tracer elements, silicon, sulfur, nickel, and OC, according to the D1–D3 identifiability conditions. We then fit the health effects model using the estimated source contributions as predictors in the health model.
- ML SEM: We estimated the receptor and health effects models (3.1) and (3.2) jointly using ML in Mplus (Muthen and Muthen, 1998). This approach imposed the D1–D3 identifiability conditions but did not use any historical exposure information.
- Bayesian SEM: We estimated the receptor and health effects models jointly using an informative Bayesian approach. To obtain informative priors on the source profiles, we conducted a confirmatory factor analysis on a single simulated historical data set of n = 200 exposures, based on the D1–D3 identifiability constraints and tracers, silicon, sulfur, nickel, and OC. We defined informative priors on the source profiles using (4.1) and set vague priors on the remaining parameters: IG(0.01,0.01) on $\{\phi_{jj}\}$, $\{\psi_{ii}\}$, and $\sigma_Y^2$ and N(0,1000) on $\alpha$ and $\{\beta_j\}$. The Bayesian SEM was fit using the Markov chain Monte Carlo (MCMC) method in WinBUGS (Spiegelhalter and others, 2000); for each fit, we ran 25 000 iterations, discarding 20 000 as burn-in and thinning by 5, for a total of 1000 posterior samples for estimation and inference. We randomly checked convergence on multiple simulated data sets and saw evidence of good mixing and convergence in every case. We ran 500 simulations to assess the statistical properties of the health effects estimates of a typical concentrator study with n = 20 days of exposure and an additional 500 simulations to evaluate the estimates of a hypothetical study with n = 100 exposure days.
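The MCMC bookkeeping quoted for the Bayesian SEM (25 000 iterations, 20 000 discarded, thinning by 5) can be checked with a short sketch. The array `chain` is only a stand-in for the stored draws of one parameter; it is not output from WinBUGS.

```python
import numpy as np

n_iter, burn_in, thin = 25_000, 20_000, 5
chain = np.arange(n_iter)      # stand-in for one parameter's MCMC draws
kept = chain[burn_in::thin]    # drop the burn-in, then keep every 5th draw
n_kept = kept.size
```

The slice retains (25 000 − 20 000) / 5 = 1000 draws, matching the number of posterior samples reported for estimation and inference.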
Table 2 displays the health effects estimates and corresponding simulation standard errors, obtained with the 5 different methodologies. Although the health effects were estimated with a model that included terms to represent all 4 sources, the table presents only the estimate for the source on which the health effect was simulated. For example, the first column contains the health effects estimates corresponding to road dust, since this column reflects the analysis of the $y_1$ outcome, where the health effect was simulated on the road dust source. The estimated coefficients for the other 3 sources, power plants, oil combustion, and motor vehicles, were always all approximately zero and, hence, are not included in the tables. In all cases, our estimates of the null coefficients were unbiased, and therefore, we display only the estimates for which the truth is $\beta = 2$.
In Table 2, we see that the health effects estimates obtained with the tracer approach demonstrate the typical attenuation of effect associated with measurement error in this simple setting. This finding is consistent in the simulation study based on n = 100 exposure days as well (results not shown due to space constraints). In fact, we can calculate the attenuation factor $\gamma$ associated with each tracer estimate, since we know the amount of measurement error associated with each of the tracer elements, quantified by $\psi_{Si}$, $\psi_{S}$, $\psi_{Ni}$, and $\psi_{OC}$. Because our simulations are based on a factor pattern with a unique tracer for each source, uncorrelated factors, and normality, for a given source j, $\gamma_j = \phi_{jj}/(\phi_{jj} + \psi_{jj})$. Here, the attenuation factors are 0.97, 0.97, 0.87, and 0.78 for the road dust, power plants, oil combustion, and motor vehicles effects, respectively. In our simulation study, we are able to correct for the bias induced by measurement error and obtain reliable health effects estimates using the tracer approach. However, in practical settings, the variance parameters $\{\phi_{jj}\}$ and $\{\psi_{jj}\}$ are typically unknown, and the small number of exposure days prevents us from obtaining reliable estimates of these correction factors.
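A short numerical sketch may help make the attenuation argument concrete. The first part multiplies the reported attenuation factors by the true effect of 2 to obtain the tracer estimates one would expect on average. The second part is a Monte Carlo check of the ratio "source variance over source variance plus tracer error variance"; its variances, residual standard deviation, and sample size are illustrative assumptions rather than the paper's settings, and the function name `attenuation_factor` is hypothetical.

```python
import numpy as np

def attenuation_factor(phi_jj, psi_jj):
    """Attenuation of a tracer-based slope: source variance over total tracer variance."""
    return phi_jj / (phi_jj + psi_jj)

# Expected tracer estimates implied by the reported factors when the true effect is 2.
beta_true = 2.0
reported = {"road dust": 0.97, "power plants": 0.97,
            "oil combustion": 0.87, "motor vehicles": 0.78}
expected = {source: round(beta_true * g, 2) for source, g in reported.items()}

# Monte Carlo check with illustrative (assumed) variances.
rng = np.random.default_rng(4)
phi_jj, psi_jj, n = 1.0, 0.25, 200_000
eta = rng.normal(0.0, np.sqrt(phi_jj), size=n)            # true source contribution
tracer = eta + rng.normal(0.0, np.sqrt(psi_jj), size=n)   # tracer with loading fixed at 1
y = 86.0 + beta_true * eta + rng.normal(0.0, 1.0, size=n)
slope = np.cov(tracer, y)[0, 1] / np.var(tracer, ddof=1)  # least squares slope of y on tracer
```

With these assumed variances the factor is 0.8, so the tracer slope settles near 1.6 rather than 2; dividing the slope by the factor recovers the true effect, which is the correction that becomes unreliable when the variances must be estimated from few exposure days.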
Alternatively, the 2-stage approach amounts to estimating the correction terms and adjusting the health effects estimates accordingly by using $\hat{\eta}$ in the health outcome model. In this way, the 2-stage approach may be viewed as a form of regression calibration (Carroll and others, 1995). The simulation study demonstrates attenuation in the 2-stage health effects estimates based on n = 20 exposure days; however, we attribute this bias to the small number of exposures. In the study based on n = 20 exposures, the receptor model failed to converge in 19 out of 500 (3.8%) simulations. However, in the study based on n = 100 days of exposure, all receptor models converged. Furthermore, the 2-stage estimates based on n = 100 exposure days are all very similar to the estimates obtained with the known source contributions, and all estimates are within twice the simulation standard error of the truth.
The SEM approach appears to offer a clear advantage over the tracer and 2-stage approaches, particularly in the case of a small number of exposures. However, in this small-sample context, the SEM estimates are distinguished by the method of estimation. The health effects estimates obtained with the informative Bayesian SEM are most similar to those obtained with known source contributions and are within twice the simulation standard error of the truth in almost all cases for n = 20. In contrast, the estimates obtained via ML appear to be biased downward, and the ML SEM estimate for oil combustion is well beyond twice the simulation standard error from the truth. As in the case of the 2-stage approach, we attribute these deficiencies in the ML SEM to the small number of exposures. In the study based on n = 20 exposure days, the ML method failed to converge in approximately 4% of the simulations. (The numbers of failures are 22, 22, 20, and 21 for the analysis of $y_1$, $y_2$, $y_3$, and $y_4$, respectively.) However, when we increase the number of exposures to n = 100 (results not shown), all 500 simulations converged, and the ML SEM performs almost exactly the same as the gold standard that uses known $\eta$.
Table 3 provides the estimated power, defined as the proportion of 95% confidence (or credible, for the Bayesian approaches) intervals that do not contain the null value of zero when $\beta_{HE} = 2$, as well as the estimated size, the proportion of 95% confidence (credible) intervals that do not contain a true value of $\beta^{*}_{HE} = 0$, for the study based on n = 20 exposure days. In this small study context, the 2-stage approach and the ML SEM exceed the expected size of 0.05 in all cases, indicating that these approaches are too liberal when the number of exposures is limited. For methods of approximately the same size (excluding the 2-stage and ML SEM), the Bayesian SEM is comparable to the "gold standard" based on known source contributions and has virtually the same sensitivity for detecting a true effect as the tracer approach. Likewise, in the study based on n = 100 exposure days (results not shown), the 2-stage approach and ML SEM are of the appropriate size ($\approx 0.05$), and all methods are very powerful (> 95%) at detecting a true effect in this setting.
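Power and size as defined here are both proportions of simulated intervals that exclude zero; they differ only in whether the data were generated with a true effect or a null effect. A minimal sketch of that calculation follows, using a handful of made-up interval endpoints and a hypothetical function name `rejection_rate`.

```python
import numpy as np

def rejection_rate(lower, upper, null_value=0.0):
    """Proportion of intervals [lower, upper] that do not contain null_value."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    excludes_null = (lower > null_value) | (upper < null_value)
    return float(excludes_null.mean())

# Made-up 95% intervals from four simulated data sets with a true effect.
lower = [0.5, -0.2, 1.0, 0.1]
upper = [3.0, 2.5, 3.5, 4.0]
estimated_power = rejection_rate(lower, upper)
```

Applied to intervals from simulations with a true effect, the function estimates power; applied to intervals for a coefficient whose true value is zero, the same function estimates size.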
Finally, we conducted an additional simulation study designed to investigate the impact of choosing incorrect identifiability constraints. One could argue that setting a single factor loading to zero when it is really greater than zero should have little impact on the resulting health effect estimates, whereas incorrectly setting $k \times (k - 1)$ loadings equal to zero may collectively have a larger effect. To check the impact of this misspecification, we conducted a simulation study assuming all loadings were nonzero, replacing the zero loadings in the previous simulation study with randomly generated values from a uniform(0, 0.2) distribution. For each simulated data set, we estimated the health effects using known source contributions, the tracer approach, the informative Bayesian SEM fit with the D1–D3 conditions, and the informative Bayesian SEM fit with the less restrictive C1–C3 identifiability conditions. Table 4 presents the source-specific health effects estimates and standard errors. Table 5 provides the corresponding power and size. These results are consistent with the results from the previous simulations for n = 20 exposure days, in that the Bayesian SEM approaches yield estimates similar to those obtained if the true source contributions are known and tests of the appropriate size. Thus, the second study suggests that the proposed identifiability constraints do not have a large impact on inference as long as the loadings set to zero are not much larger than 0.2.
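The misspecification scenario, in which loadings that were zero in the generating model are replaced by small positive values, can be sketched as below. The loading matrix is a hypothetical tracer pattern with k = 3 sources rather than the paper's 4-source pattern, and `perturb_zero_loadings` is an invented name.

```python
import numpy as np

def perturb_zero_loadings(Lambda, upper, rng):
    """Return a copy of Lambda with every exact zero replaced by a uniform(0, upper) draw."""
    lam = np.array(Lambda, dtype=float)  # copy so the input matrix is left untouched
    zero = lam == 0
    lam[zero] = rng.uniform(0.0, upper, size=int(zero.sum()))
    return lam

# Hypothetical tracer pattern: p = 5 elements, k = 3 sources.
tracer_pattern = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.3, 0.2],
    [0.1, 0.7, 0.4],
])
rng = np.random.default_rng(5)
perturbed = perturb_zero_loadings(tracer_pattern, upper=0.2, rng=rng)
n_replaced = int((tracer_pattern == 0).sum())
```

Data would then be generated from the perturbed matrix while the fitted models keep the original zero constraints, so that the constraints are wrong by a controlled amount. In this example the number of replaced loadings is k × (k − 1) = 6, the count the text associates with a one-tracer-per-source pattern.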
| 6. DATA ANALYSIS |
|---|
In this section, we implement our informative Bayesian SEM to analyze the source-specific PM health effects on myocardial ischemia in dogs (Wellenius and others, 2003). Analyses of the Wellenius data did not detect any pairing or period (day) effects but did suggest there may be a carryover effect from the CAPs exposure. As a result, the authors excluded Sham exposures following the CAPs exposure in their analyses. The final analysis was based on a total of 43 measured health outcomes, corresponding to 18 CAPs and 25 Sham exposures.
Wellenius and others (2003) found a large amount of variability within each dog by cycle combination and therefore included a random effect for sequence (see Table 1). The authors estimated the source-specific health effects of PM with the following model:

$$Y_{td} = \alpha + \beta^{T} x_{td} + b_t + \epsilon^{Y}_{td},$$

where $Y_{td}$ is log(peak ST-segment), $x_{td}$ is the vector of elemental concentrations for silicon, sulfur, nickel, and BC for sequence t and day d, $b_t$ is the random sequence effect, $b_t \sim N(0, \sigma_b^2)$, $\epsilon^{Y}_{td} \sim N(0, \sigma_Y^2)$, and $b \perp \epsilon^{Y}$.
Accordingly, we fit the following informative SEM model:

$$X_{td} = \Lambda \eta_{td} + \epsilon^{X}_{td},$$

$$Y_{td} = \alpha + I_{CAPS,td}\, \beta^{T} \eta_{td} + b_t + \epsilon^{Y}_{td},$$

where $\epsilon^{X}_{td} \sim N_p(0, \Psi)$ for diagonal $\Psi$, $b_t \sim N(0, \sigma_b^2)$, $\epsilon^{Y}_{td} \sim N(0, \sigma_Y^2)$, and $b \perp \epsilon^{Y}$. $I_{CAPS,td}$ is an indicator of CAPs exposure on day d in sequence t. This indicator provides for the Sham exposures and operates on the assumption that all source contributions are null in filtered air, that is, if the dog in sequence t is exposed to Sham on day d, $I_{CAPS,td} = 0$ and $Y_{td} = \alpha + b_t + \epsilon^{Y}_{td}$. Based on this specification, the receptor model applies to the CAPs exposures only, while the health effects regression is fit on all outcomes. Finally, to respect the nonnegativity of source contributions, we truncate the normal distribution on the latent variables,

$$\eta_{td} \sim N_k(\mu, \Phi)\, I(\eta_{td} \geq 0).$$

As noted in Section 3.2, we also fit the model assuming lognormal source contributions to assess the sensitivity of our conclusions to distributional assumption on $\eta_{td}$.
To obtain prior information on the source profiles, we fit a Bayesian confirmatory factor analysis on the scaled historical data (n = 160). We assumed the k = 4 major sources of Boston PM, and we assumed that these sources are independent. To be consistent with the Wellenius and others (2003) analysis, here we chose silicon to identify road dust, sulfur for power plants, nickel for oil combustion, and BC for motor vehicles.
It is thought that BC is a better marker of motor vehicles than OC. However, exploratory factor analyses consistently estimated moderate loadings for BC on several factors. Therefore, in our analysis, we apply the more flexible C1–C3 identifiability conditions described in Section 3.2. According to these conditions, we need to constrain k − 1 = 3 loadings to zero on each profile, while ensuring a distinct set of constraints for each source. In order to set meaningful constraints, we consulted our exploratory results and identified distinct sets of three near zero (< 0.2) loadings per factor. We constrain to zero the following factor loadings: sulfur, nickel, and EC on road dust; silicon, vanadium, and OC on power plants; aluminum, SULF, and OC on oil combustion; and aluminum, SULF, and nickel on motor vehicles.
We fit the confirmatory factor analysis to the historical data using MCMC in WinBUGS. We set vague priors on all parameters: IG(0.01, 0.01) on the diagonal variance elements (those indexed jj and ii) and N(0, 1000) on {µj}. Following Park and others (2001), we placed truncated normal priors, N(0, 10000) restricted to nonnegative values, on the unconstrained factor loadings, since negative components of source profiles are not interpretable. We ran 25 000 iterations, discarding 20 000 as burn-in and thinning by 5, for a total of 1000 posterior samples for estimation. Evaluation of autocorrelation and trace plots supported convergence.
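The sample count quoted above follows directly from the burn-in and thinning settings. A minimal sketch of that bookkeeping, using a stand-in array in place of a real WinBUGS chain (the helper `thin_chain` is hypothetical):

```python
import numpy as np

def thin_chain(chain, burn_in, thin):
    """Drop the first `burn_in` draws, then keep every `thin`-th draw."""
    return np.asarray(chain)[burn_in::thin]

# Stand-in for 25 000 MCMC draws of one parameter (iteration indices only).
chain = np.arange(25_000)
kept = thin_chain(chain, burn_in=20_000, thin=5)
n_kept = kept.shape[0]  # (25 000 - 20 000) / 5 retained draws
```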
Table 6 provides the posterior source profiles along with the estimated standard errors. Figure 1 displays the posterior correlations between the unconstrained factor loadings in 2 ways. The frequency distribution demonstrates that most of the posterior correlations are low, with the absolute values of a few correlations as large as 0.6. The temperature grid arranges these correlations by chemical species. This plot reveals that the few moderate correlations correspond primarily to pairs of loadings on the same element. Overall, the standard errors and correlations of the unconstrained factor loadings suggest that the additional exposure information provided by the nontracer elements makes a substantial contribution to the differentiation of the pollution sources.
We fit the informative Bayesian SEM to the data of Wellenius and others (2003) using MCMC in WinBUGS. We defined informative priors on the source profiles based on the posterior means in Table 6 and the posterior covariances from the Bayesian factor analysis of the historical data. Here again, we truncated the normal distribution for the factor loadings at zero. We set vague priors on all other parameters: IG(0.01, 0.01) on the variance parameters and N(0, 1000) on the mean and regression parameters, including {µj} and {βj}. We ran 25 000 iterations, discarding 20 000 as burn-in and thinning by 5, for a total of 1000 posterior samples. We examined diagnostic trace and autocorrelation plots and found satisfactory convergence.
Table 7 displays the posterior means, standard errors, and 95% credible (confidence) intervals for the source-specific health effects estimated with the informative Bayesian SEM and the tracer approach. In the SEM analysis, for each pollution source, the health effect estimate is on the scale of the element whose factor loading is constrained to 1. For example, we interpret the health effect estimate of road dust as the change in log peak ST-segment associated with an increase in the "contribution of road dust" on the scale of one standard deviation increase in the concentration of silicon. However, in the tracer analysis, we interpret the health effect estimates as the change in log peak ST-segment associated with one standard deviation increase in the concentration of the corresponding tracer. For instance, here we interpret the health effect estimate of road dust as the change in log peak ST-segment associated with one standard deviation increase in the "concentration of silicon." The tracer estimates provided in Table 7 do not precisely correspond to those reported in Wellenius and others (2003), which were based on ECG recordings from 2 precordial leads. Since readings from the 2 leads were highly correlated (r > 0.8), we restrict our analysis to log peak ST-segment recorded on a single lead (V5).
In accordance with Wellenius and others (2003)
Thus, the informative Bayesian structural equation results suggest that the conclusions in the original analysis were not driven by unequal amounts of measurement error associated with the tracer representations of the 4 pollution sources. Given that the SEM analysis yields an estimate of the road dust coefficient similar to the silicon coefficient in the tracer analysis, one might wonder whether the more complex analysis was worth the effort. Then again, the SEM analysis does more than reinforce the significance of the road dust effect; it also reaffirms the lack of evidence of an association between this cardiac outcome and pollution from the other sources, in particular motor vehicles. Not surprisingly, given the results of our simulation study, the SEM health effect estimate for motor vehicles is more than 4 times the corresponding attenuated tracer estimate for BC; however, this estimated association remains nonsignificant. Confirmation of a nonsignificant effect of motor vehicles is equally important in this setting.
To ensure that our conclusions are not sensitive to distributional assumptions, we reran the informative Bayesian SEM under the specification of lognormal source contributions. We estimated the four source-specific health effects and detected a significant effect of road dust only. Thus, both sets of distributional assumptions yield findings that agree with those in Wellenius and others (2003).
| 7. DISCUSSION |
|---|
In this paper, we considered methods to assess source-specific health effects of complex mixtures of PM pollution. One objective was to evaluate the statistical properties of estimates obtained with methods currently used in practice, the tracer approach and the 2-stage approach, for multivariate pollution patterns typical of Boston aerosol. In a simulation study, we showed that the health effects estimates obtained using the tracer approach are attenuated, both in small- and large-sample cases, which was expected having framed the problem from a measurement error perspective. Our results suggest that the common marker for traffic particles, BC, has a relatively large degree of error associated with it, which may reflect a regional component of BC in addition to local traffic in the Boston area; as such the tracer approach may underestimate the effect associated with motor vehicles. The 2-stage approach is similarly susceptible to bias, although only in the case of small samples. For large samples, the 2-stage estimates appear unbiased as we would expect of a regression calibration.
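The attenuation of tracer-based estimates can be illustrated with a small simulation under the classical additive measurement error model. This is a simplified sketch, not the simulation design used in the paper: the single normally distributed source, the unit variances, and the function name `tracer_slope` are all assumptions made for illustration.

```python
import numpy as np

def tracer_slope(n, beta, sd_source, sd_error, rng):
    """OLS slope of the outcome on a noisy tracer of the true source."""
    source = rng.normal(0.0, sd_source, n)           # unobserved contribution
    tracer = source + rng.normal(0.0, sd_error, n)   # observed surrogate
    y = beta * source + rng.normal(0.0, 1.0, n)      # health outcome
    cov = np.cov(tracer, y)
    return cov[0, 1] / cov[0, 0]

rng = np.random.default_rng(1)
slope = tracer_slope(200_000, beta=1.0, sd_source=1.0, sd_error=1.0, rng=rng)

# Under this model the slope shrinks toward beta * var(source) / var(tracer).
expected = 1.0 * 1.0**2 / (1.0**2 + 1.0**2)
```

With equal source and error variances, the estimated slope falls near half of the true effect, and a tracer with more error (as the text suggests for BC) would shrink it further.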
As an alternative to the tracer and 2-stage approaches, we proposed an SEM to account for the uncertainty associated with latent source contributions, along with a Bayesian approach to model fitting. This approach leverages exposure information from previous related concentrator studies. Simulations suggest that the proposed informative Bayesian SEM is effective in eliminating bias in estimated source-specific health effects, even when the number of exposures is limited. We demonstrated the flexibility of the Bayesian approach to accommodate complex study designs and nonnormality in our analysis of the study of Wellenius and others (2003). As an added advantage, the informative Bayesian SEM may be implemented in freely available software.
Our findings in this paper have implications for the design of future PM concentrator studies. The results demonstrate the benefits of using exposure data from existing, relevant exposure studies where possible. However, not all studies have the benefit of such prior knowledge. When historical data are not available, investigators should maximize the number of unique exposure days, subject to cost constraints. A large number of exposure days will allow one to use 2-stage models that employ source apportionment techniques to address errors in the tracer characterization of the pollution sources. Unfortunately, because even one run of a concentrator exposure can be costly, such "many-exposure" designs may be prohibitive. In cases where both historic and current exposure information is limited, the tracer approach is preferable to the 2-stage analysis, since the latter may be unstable and experience convergence problems; however, one should take care to choose "good" tracers that minimize the error associated with these surrogates. In these settings, the tracer approach is still useful in screening for source-specific health effects but is likely to yield attenuated estimates of effects.
The purpose of this article was to assess the performance of methods for source characterization in concentrator studies. We therefore focused on a specific form of a factor analysis model in the structural equations framework and have not addressed all the interesting modeling issues that arise in the development of a "good" receptor model. For instance, although the estimation of the number of sources can often be challenging (Park and others, 2002), we assumed that we have good prior knowledge on the number of major pollution sources in the Boston area. This assumption is probably reasonable in our setting as the pollution mixture in this area has been studied for almost 2 decades (Oh and others, 1997). We note that existing exposure studies suggest that there exists an additional particle source in the Boston aerosol composed of sodium and chloride, often referred to as sea salt. However, from a regulatory perspective this exposure is not of primary importance and hence of less interest in PM health studies. Second, the majority of the source apportionment literature for exposure assessment of PM (Park and others, 2001) assumes that the sources of exposure are independent. To maintain consistency with existing methodology, we specified independent priors on the latent source contributions in the implementation of our Bayesian SEM. However, while source-specific exposures are assumed to be independent a priori, the Bayesian approach uses data to update the priors, thus allowing for correlation in the posterior distributions of the source contributions. Given that we might expect source contributions to be correlated, at least in part due to meteorologic conditions, this is an appealing feature of our approach. Furthermore, the methods proposed in this paper extend naturally to account for systematic factors, like meteorology, that are likely to affect pollution levels from different sources similarly, by allowing the means of the unobserved sources to depend on covariates.
| ACKNOWLEDGMENTS |
|---|
This research was supported by the National Institute of Environmental Health Sciences (NIEHS) grant ES07142 (Margaret C. Nikolov), American Chemistry Council grant 2843 and NIEHS grant ES012044 (Brent A. Coull), and National Institutes of Health grant ES012972 (John J. Godleski). The authors thank a referee for insightful comments that improved the manuscript. Conflict of Interest: None declared.
| REFERENCES |
|---|
Bandeen-Roche K. Resolution of additive mixtures into source components and contributions: a compositional approach. Journal of the American Statistical Association (1994) 89:1450–1458.
Batalha JRF, Saldiva PHN, Clarke RW, Coull BA, Stearns RC, Lawrence J, Murthy GGK, Koutrakis P, Godleski JJ. Concentrated ambient air particles induce vasoconstriction of small pulmonary arteries in rats. Environmental Health Perspectives (2002) 110:1191–1197.
Bollen KA. Structural Equations with Latent Variables (1989) New York: Wiley.
Carroll RJ, Ruppert D, Stefanski LA. Measurement Error in Nonlinear Models (1995) London: Chapman & Hall.
Clarke RW, Coull BA, Reinisch U, Catalano P, Killingsworth CR, Koutrakis P, Kavouras I, Murthy GGK, Lawrence J, Lovett E, and others. Inhaled concentrated ambient particles are associated with hematologic and bronchoalveolar lavage changes in canines. Environmental Health Perspectives (2000) 12:1179–1187.
Dockery DW, Pope CA, Xu X, Spengler JD, Ware JH, Fay ME, Ferris BG, Speizer FE. An association between air pollution and mortality in six U.S. cities. New England Journal of Medicine (1993) 329:1753–1759.
Dominici F, Daniels M, Zeger SL, Samet JM. Air pollution and mortality: estimating regional and national dose-response relationships. Journal of the American Statistical Association (2002) 97:100–111.
Dominici F, Sheppard L, Clyde M. Health effects of air pollution: a statistical review. International Statistical Review (2003) 71:243–276.
Godleski JJ, Clarke RW, Coull BA, Saldiva PHN, Jiang N-F, Lawrence J, Koutrakis P. Composition of inhaled urban air particles determines acute pulmonary responses. Annals of Occupational Hygiene (2002) 46:419–424.
Godleski JJ, Verrier RL, Koutrakis P, Catalano P, Coull BA, Reinisch U, Lovett EG, Lawrence J, Murthy GGK, Wolfson JM, Clarke RW, Nearing BD. Mechanisms of morbidity and mortality from exposure to ambient air particulate. Health Effects Institute Research Report (2000) 91:1–103.
Hopke PK. Recent developments in receptor modeling. Journal of Chemometrics (2003) 17:255–265.
Laden F, Neas LM, Dockery DW, Schwartz J. Association of fine particulate matter from different sources with daily mortality in six U.S. cities. Environmental Health Perspectives (2000) 108:941–947.
Lawrence J, Wolfson JM, Ferguson S, Koutrakis P, Godleski JJ. Performance stability of the Harvard ambient particle concentrator. Aerosol Science and Technology (2004) 38:219–227.
Muthen LK, Muthen BO. Mplus User's Guide (1998) 3rd edition. Los Angeles, CA: Muthen & Muthen.
Nikolov MC, Coull BA, Catalano PJ, Godleski JJ. An informative Bayesian structural equation model to assess source-specific health effects of air pollution. Harvard University Biostatistics Working Paper Series 46 (2006) http://www.bepress.com/harvardbiostat/paper46. Accessed 25 October 2006.
Oh JA, Suh HH, Lawrence JE, Allen GA, Koutrakis P. Characterization of particulate mass concentrations in South Boston, MA. In: Proceedings of AWMA/EPA Symposium on "Measurement of Toxic and Related Air Pollutants", April 29-May 1, 1997, Research Triangle Park, NC (1997) Pittsburgh, PA: AWMA publication number VIP-74. 397–407.
Paatero P, Tapper U. Positive matrix factorization: a nonnegative factor model with optimal utilization of error-estimates of data values. Environmetrics (1994) 5:111–126.
Park ES, Guttorp P, Henry RC. Multivariate receptor modeling for temporally correlated data by using MCMC. Journal of the American Statistical Association (2001) 96:1171–1183.
Park ES, Spiegelman CH, Henry RC. Bilinear estimation of pollution source profiles and amounts by using multivariate receptor models. Environmetrics (2002) 13:775–798.
R Development Core Team. R: A Language and Environment for Statistical Computing (2003) Vienna, Austria: R Foundation for Statistical Computing. http://www.R-project.org. Accessed 21 February 2006.
Sioutas C, Koutrakis P, Burton RM. A technique to expose animals to concentrated fine ambient aerosols. Environmental Health Perspectives (1995) 103:172–177.
Spiegelhalter D, Thomas A, Best N. WinBUGS Version 1.3. User's Manual (2000) Cambridge: MRC Biostatistics Unit, Institute of Public Health. http://www.math.helsinki.fi/openbugs/data/Docu/WinBugs%20Manual.html.
Tsiatis AA, De Gruttola V, Wulfsohn MS. Modeling the relationship of survival to longitudinal data measured with error; applications to survival and CD4 counts in patients with AIDS. Journal of the American Statistical Association (1995) 90:27–37.
Wellenius GA, Coull BA, Godleski JJ, Koutrakis P, Okabe K, Savage ST, Lawrence JE, Murthy GGK, Verrier RL. Inhalation of concentrated ambient air particles exacerbates myocardial ischemia in conscious dogs. Environmental Health Perspectives (2003) 111:402–408.
Received August 25, 2005; revised May 1, 2006; revised September 13, 2006; accepted for publication October 3, 2006.

