Biostatistics Advance Access originally published online on March 23, 2006
Biostatistics 2007 8(1):32-52; doi:10.1093/biostatistics/kxj030
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Bayesian logistic regression using a perfect phylogeny
Department of Statistics, University of Oxford, Oxford, UK and Department of Epidemiology and Public Health, Imperial College, London, UK taane.clark{at}imperial.ac.uk
Department of Epidemiology and Public Health, Imperial College, London
Department of Statistics, University of Oxford, Oxford
* To whom correspondence should be addressed.
| SUMMARY |
|---|
|
|
|---|
Haplotype data capture the genetic variation among individuals in a population and among populations. An understanding of this variation and the ancestral history of haplotypes is important in genetic association studies of complex disease. We introduce a method for detecting associations between disease and haplotypes in a candidate gene region or candidate block with little or no recombination. A perfect phylogeny demonstrates the evolutionary relationship between single-nucleotide polymorphisms (SNPs) in the haplotype blocks. Our approach extends the logic regression technique of Ruczinski and others (2003) to a Bayesian framework, and constrains the model space to that of a perfect phylogeny. Environmental factors, as well as their interactions with SNPs, may be incorporated into the regression framework. We demonstrate our method on simulated data from a coalescent model, as well as data from a candidate gene study of sarcoidosis.
Keywords: Gene tree; Gibbs sampling; Haplotype association; Logic regression; SNP data
| 1. INTRODUCTION |
|---|
|
|
|---|
The identification of genes involved in complex disease has the potential to reduce its prevalence and morbidity through the development of new therapies, as well as early screening. A dense genome map consisting of nearly four million single-nucleotide polymorphisms (SNPs), 0.1% of the genome, will become available to facilitate this task (Cardon and Abecasis, 2003
Complex diseases have both genetic and environmental components, and we consider a Bayesian (logistic) regression framework to model both types of effect. Advantages of a regression framework include the interpretability of coefficients, diagnostics, and its multiple extensions. A Bayesian framework offers advantages in terms of model selection and incorporating prior information from earlier empirical studies. We consider a situation where a number of potential environmental factors and multiple haplotype blocks, and interactions between the two, may be associated with a "binary outcome": diseased or not diseased. These types of data may result from casecontrol studies, a study design being increasingly applied in genetic association studies using population-based data. The (conditional) logistic model is the analysis tool of choice for analyzing binary response data, due to the attractive interpretation of the regression coefficients in terms of the change in log odds of one class (diseased) over another (non-diseased) for unit change in the associated covariate. The natural likelihood to use for a casecontrol study is a "retrospective" likelihood, i.e. a likelihood based on the probability of exposure given disease status. Prentice and Pyke (1979)
showed that, when a logistic regression form is assumed for the probability of disease given exposure (and co-factors are treated completely non-parametrically), the maximum likelihood estimators and asymptotic covariance matrix of the log odds ratios obtained from the retrospective likelihood are the same as those obtained from the "prospective" likelihood, i.e. that based on probability of disease given exposure.
Evidence is mounting that multiple (and not single) mutations within a gene, occurring on the same chromosome, can have a large effect on phenotype (Seltman and others, 2003
). In our regression framework, combinations of SNPs within haplotype blocks are represented by "logic trees" (Ruczinski and others, 2003)
structures particularly useful at representing interactions. Our ultimate goal is to find SNPs, environmental factors, and interactions between them that are associated with disease. In Section 2, we describe how logic trees model interactions between SNPs. In Section 3, we describe the perfect phylogeny (or equivalently the gene tree) constraint on the model space of the logic trees. The description of how trees and environmental effects are incorporated into a Bayesian logistic regression framework is discussed in Section 4. The method is demonstrated on simulated data from a coalescent model in Section 5, as well as data from a candidate gene study of smoking persistence in Section 6. The focus is on haplotype data consisting of SNPs. Association analyses using haplotypes tend to assume (a) an additive effect of the pair of alleles, (b) HardyWeinberg equilibrium for each SNP holding to ensure that haplotypes for individuals are independent (Sasieni, 1997
), and (c) that the phase of haplotypes is known (Stephens and Donnelly, 2003
). However, we discuss how phasing of haplotypes may be incorporated as well as how genotypic data may be used to determine the mode of disease inheritance. Other extensions to our methodology, such as modeling a continuous phenotype, are also discussed.
| 2. LOGIC TREES |
|---|
|
|
|---|
A locus is a specified site or short region on a chromosome. Complex traits may be caused by the interaction of many loci, each with varying effect. Patterns of interactions between several loci, for example, disease phenotype caused by locus A and locus B, or A but not B, or A and (B or C), make identification of the loci involved more difficult. Ruczinski and others (2003)
). An example of a logic tree T is |
|
where
is an OR and
is an AND operator. T denotes the condition when either S2 and/or S3 mutants are present with the wild-type variant of S1. T is a column vector with dimension 2n, whose elements, Ti, take value one if the Boolean expression is true for the ith haplotype, zero otherwise. The interpretation of the coefficient of T within a logistic regression model is a log odds ratio. Logic trees can be presented graphically (see Figure 1), where SNPs are leaves on branches with operators (e.g. AND or
, OR or
), and may be modified using the following steps:
- Birth step
- Growth at an operator, e.g. T = S

S5
(S2
S3)
- Growth at a SNP, e.g. T = S

[S2
(S3
S
)]
- Growth at an operator, e.g. T = S
- Death step
- Deleting a SNP, e.g. T = S

S2
- Deleting a SNP, e.g. T = S
- Move step
- Changing SNPs, e.g. T = S

(S
S3)
- Changing operators, e.g. T = S

(S2
S3).
- Changing SNPs, e.g. T = S
|
Using the logic tree representation and the modifying operations on trees above, the implementation of Ruczinski and others (2003)
| 3. PERFECT PHYLOGENIES |
|---|
|
|
|---|
If mutation and recombination rates are not high, it is reasonable that the haplotypes observed within a block have evolved according to a perfect phylogeny, in which at most one mutation event has occurred within any site (called the infinitely many sites model (Griffiths, 2001)), and no recombination occurred at the given region. Consider an incidence matrix for our haplotype data, that is, a matrix with rows that represent unique haplotypes in the data and the columns individual SNPs with two base types (see Figure 2). Data are compatible with a rooted perfect phylogeny if, and only if, for any two SNPs (columns) in the incidence matrix, not all three combinations (01, 10, 11) between them exist. Recombination leads to the possible existence of all three combinations. If the perfect phylogeny condition holds between all pairwise combinations of SNPs, it is possible to construct a gene tree (Griffiths, 2001). This unique tree describes the ancestry of a sample of 2n haplotypes, consisting of at most m + 1 distinct haplotypes or lineages/branches, where m is the number of segregating (i.e. mutant) SNPs. Gusfield's (1991)
|
Because sets of nearby SNPs on the same chromosome may be inherited in haplotype blocks within each of which there is little evidence of recombination (Gabriel and others, 2002
operator in a logic tree. Therefore, the infinitely many sites model imposes constraints on the combination of mutant types of SNPs. For the data in Figure 2, the logic tree S1
S3 is not possible because a haplotype containing mutations at both SNPs 1 and 3 cannot exist in the gene tree. A logic tree with the expression S1
S
is compatible with the gene tree because a haplotype with a mutant SNP 1 and non-mutant SNP 3 can exist. In general, a logic expression |
|
is consistent with the gene tree if and only if there is a haplotype in the sample with mutant sites at i1, i2,..., ik and wild-type sites at positions j1, j2,..., jl. Using a gene tree constraint reduces the possible moves in the model space, therefore speeding up the model selection process. The reduction of the number of possible SNP combinations to consider is a major point of the model, as well as a genuine attempt to include evolutionary information from the sample. In Section 4, the logic trees and the gene tree constraint for haplotype blocks are included with environmental factors in a Bayesian logistic model.
| 4. BAYESIAN LOGISTIC FRAMEWORK |
|---|
|
|
|---|
We consider a situation where a number of potential environmental factors and multiple haplotype blocks, and interactions between the two, may be associated with a "binary outcome": diseased or not diseased. Prior knowledge is often available about the effects of particular combinations of SNPs, environmental factors, or interactions between them. Hence, we consider a Bayesian logistic regression model,
![]() | (4.1) |
where yi
{0,1}, i = 1,..., 2n, is a binary response variable for a collection of 2n haplotypes with an associated intercept and p environmental measurements Xi = (xi0, xi1, ...,xip). There are q logic trees Tj, j = 1,...,q, constructed from nSNP possible SNPs;
= (T1,T2,...,Tq) and
i = (Ti1,...,Tiq); g(u) = log(u/(1 u)) is the logistic link function,
i is the linear predictor, ßpx1 and
qx1 represent regression coefficients for environmental factors and logic trees, respectively, with prior distributions
(ß) and
(
), respectively. Unlike normal linear regression, inference on
and ß is complicated by the fact that no conjugate priors exist. Holmes and Held (2003)
consider a latent variable representation of the logistic model, and (4.1) may be re-written as
![]() | (4.2) |
where KS is the KolmogorovSmirnov distribution (Devroye, 1986
). Using this framework, conjugate priors are available to the conditional likelihood and full posterior inference can be performed using a block Gibbs sampler (Holmes and Held, 2003
). In particular, in the case of priors
(ß) = N(mß,vß) and
(
) = N(m
,v
),
![]() | (4.3) |
Alternatively, we could update
and ß jointly. The full conditional distribution for zi is truncated normal,
![]() | (4.4) |
W is the precision matrix. As the model is fixed, the block Gibbs sampler processes
updating W, and then updates Z,
, and ß at each iteration. This algorithm is used to estimate the posterior distributions of
and ß in the selected model. The conditional distribution of the variance estimator
i does not have a standard form, but updating is achieved efficiently using rejection sampling with the acceptance probability of a proposed
* being
|
| (4.5) |
where KS(·) denotes KolmogorovSmirnov distribution density (Holmes and Held, 2003
). In Section 4.2, we discuss the prior distribution defined on the space of logic trees.
In this section, we focus on the prior for the parameters that define the tree structure. In specifying the prior on a logic tree, we follow the Bayesian Classification and Regression Tree (CART) literature (Chipman and others, 1998
; Denison and others, 1998
). We define a probability distribution over the space of possible logic trees which satisfies the Perfect Phylogeny constraint (PPC). Any logic tree can be defined by the position of the Boolean operators present, the values they assume (
or
) together with the SNPs included, as well as their respective values (0 or 1). Let K be the number of SNPs in the tree. In the remainder of the section, we will denote the SNPs with Si, i = 1,..., K, and the operator variables with Oi. It is trivial to show that if K SNPs are included in the model, then there will be exactly K 1 operators in the tree. Denison and others (2002)
argue that simple prior distributions are most appropriate when fitting complex non-linear models. We embrace this principle and assume that each tree structure with the same number of SNPs which also satisfies the PPC is equally likely. There is a natural hierarchical structure in this setup which we formalize as
|
|
where I(·) is the usual indicator function that is equal to 1 if the PPC is satisfied and 0 otherwise; p(S1,..., SK,O1,...,OK 1|K) is the probability of a particular tree configuration conditional on there being K SNPs in the tree, and it is given by the uniform distribution defined over the space of all the possible tree configurations with K SNPs. A geometric distribution with parameter
is used to specify the prior distribution on the number of SNPs:
|
|
Therefore, any logic tree must contain at least one SNP. We will assume that
is a constant, but it is possible to specify a prior for it. Note also that Denison and others (1998)
prefer a truncated Poisson distribution, but we feel that a geometric distribution is more effective in favoring trees with a small number of SNPs. We also have a finite number of possible trees as there cannot be more than nSNP 1 operator variables, where nSNP is the number of SNPs in the data set.
The model consists of environmental factors (X) and trees (
) consisting of SNPs. Rather than proposing moves to both X and
and updating ß and
jointly, we consider each separately. There are three possible moves in the environmental factor space, chosen with equal probability:
- Birth stepadding an environmental factor
- Death stepdeleting an environmental factor
- Move stepadding and deleting different environmental factors, but retaining the same number of environmental factors.
(X) that places mass on the 2p possible sub-models made up of different columns of X. In particular, consider the covariate indicator vector
= {
1,...,
p},
i
{0,1}, i = 1,..., p, such that
i = 1 if the ith environmental covariate is present in the model and
i = 0 if it is not. The intercept is always present in the model, i.e
0 = 1. A prior on the model space can be specified via a prior on the indicator,
(
), and this is the same as the proposal distribution of
. We select an environmental factor at random, and propose 
= 1, if the current
i = 0, or 
= 0, otherwise. A key advantage of using representation (4.2) is that when updating the environmental set defined by
it is possible to condition on z,
and jointly update the ß values as well, from their full conditional distribution given the new model structure. Holmes and Held (2003)
, and assuming a uniform prior on the environmental space and the proposal in the same space being the same as the prior, that the new move
* is accepted with probability
![]() | (4.6) |
where
and Vß(
) are defined in (4.3), and the subscripts indicate that they are conditioned on the environmental factor set defined by
. The acceptance probability resembles a Bayes' factor (BF) of a standard Bayesian linear model. This implicit marginalization of ß in the proposal step leads to efficient dimension sampling, as the ß values are being updated from their full conditional distributions given the change to the covariate set. After considering a move in environmental factor space, we propose a move in tree space. The space of trees is potentially large and the number of trees, q, is not known in advance, but the Bayesian framework contains a natural penalty against over-complex models. We assume that q has a value between 0 and tmax, where tmax is the maximum number of trees that can be included in the model. We assume a uniform prior for q. We consider two moves in tree space, chosen with equal probability:
- Adding a new logic tree consisting of one SNP
- Modify an existing logic tree: choosing with equal probability between the birth, death, or move steps for the logic trees. The birth and move steps must be consistent with the constraint of the gene tree.
denote the configuration of trees, and assume that the proposed move results in a new configuration
*. By configuration,
, we simply mean the set of all logic trees included in the model and their topologies. As above, it is possible to condition on z, X, ß and jointly update the
values as well, from their full conditional distribution given the new model structure. In this context, the acceptance probability for the proposed move
* is
![]() | (4.7) |
where
(
) denotes the joint prior distribution on the number of trees and on the tree structure. To incorporate interactions between SNPs and environmental effects, we extend X to include such interactions, and process them in the same way as the environmental factors. For example, if there are five environmental measurements and 10 SNPs, the inclusion of all first-order interactions requires 50 extra terms in X.
Using the components discussed previously, our algorithm is
- 0. Initialize all parameters in the model, including
and
.
- 1. Generate
i and
i, for i = 1,..., 2n. Use (4.5) to accept or reject 
, and calculate W.
- 2. Generate z from (4.4).
- 3. Propose a move among environmental factors and interactions,
*, and propose a move for ß|
*. Calculate the acceptance probability using (4.6) and, if accepted, set
=
* and ß = ß*.
- 4. Propose a move among trees,
*, and propose a move for
|
*. Calculate the acceptance probability using (4.7) and, if accepted, set
=
* and
=
*.
- 5. Repeat Steps 1 to 4 until convergence.
- 1. Generate
Genotypes comprise the combined information for the two homologous chromosomes present in a diploid individual, and are the genetic information most commonly supplied for an individual following DNA typing. In a sample of n individuals, a segregating SNP site will have at least two of the three possible genotypes 0/0, 0/1, and 1/1, where 1 refers to a mutant allele and 0 the wild type. Genotypic data may be applied in our algorithm to determine the mode of inheritance for the disease. When compared to a haplotype analysis, the number of observations is halved (n instead of 2n) and the number of potential predictors doubled. In particular, two predictors, say Si1 and Si2, are created for each SNP i (Si):
![]() |
Whether Si1 or Si2 is present in the (final) logic trees depends on whether a recessive or dominance model, respectively, best fits the data. Si1 and Si2 cannot be in the same logic tree, but they may be in separate trees effectively fitting an additive model on the logit scale or multiplicative model on the odds scale.
Another approach when using genotype data is to reconstruct the haplotype configuration or "phase." The determination of phase uses the nhap possible sets of haplotypes (H) forming nhap perfect phylogenies (represented by a gene tree G) constructed from the genotype data. For a perfect phylogeny, this is possible using existing algorithms (Bafna and others, 2003
; Eskin and others, 2003
). We can model the uncertainty of phase determination by adding a new step in the hierarchy in Model (4.1). In particular, H becomes a random variable, that can assume one of the nhap possible configurations. Note that in the case of a perfect phylogeny, the number of possible haplotypes, nhap, is finite. A priori we assume each of these configurations is equally likely. Therefore, we can determine the posterior distribution of H given the data, P(H|y), by adding a new MH step within our block Gibbs sampling framework. G is the gene tree associated with a set of haplotypes H. A new haplotype configuration, H*, results in a new gene tree G* because it implies a different evolutionary model for the loci. Therefore, when we propose H* given the rest of the parameters in the model, we must ensure that the current haplotype configuration is consistent with the logic trees currently in the model. Also, by changing the haplotype, we also modify the set of individuals that satisfy the relation defined by the logic tree and therefore
i, i = 1,..., 2n. To simulate from the posterior distribution of H, we need to add a new step in the algorithm described in Section 4.4.
Let H be the current haplotype configuration. Propose a new haplotype configuration H* from the set of haplotypes consistent with the logic trees already in the model. Accept H* with probability
![]() | (4.8) |
where
H* and
H denote the dependence of the covariates
i on the phasing.
More often than not, the outcome or phenotype y is continuous. We will assume that y (possibly after transformation) may be distributed with a Gaussian distribution. In this case, we can use the standard Bayesian linear model (Lindley and Smith, 1972
), where the Model (4.1) can be specified by
![]() | (4.9) |
where IG is an inverse gamma probability distribution. Posterior inference and model selection can be performed by slight modification of the algorithm in Section 4.4.
The model space consisting of logic trees and environmental factors is potentially large. Chipman and others (1998)
and Denison and others (1998)
argue that simulation from the true posterior distribution cannot be achieved for CART models without a prohibitive number of iterations, because of the huge number of possible trees and the difficulty traversing the posterior surface. This implies that it is difficult to determine exactly which trees have largest posterior probability. The same type of concern holds in the case of logic trees because of the similarity of the search space. Moreover, with the difficulty of sampling from the posterior distribution, it is not clear how to think of the "posterior mean" and of "the posterior mode" found using the MCMC output. Although the entire posterior distribution cannot be efficiently computed in non-trivial problems, the reversible jump algorithm can still be used to effectively explore the posterior distribution. Therefore, we prefer to classify the different models visited by using an optimality criterion such as "Bayesian Information Criterion (BIC)" (Schwarz, 1978
), which is widely applied in the Bayesian model selection literature, or "marginal likelihood." Denison and others (2002)
suggest to use the marginal likelihood to compare tree models, while Chipman and others (1998)
propose to use the marginal likelihood along with the misclassification rate to determine good models. Note that the BIC and the marginal likelihood criterion are asymptotically equivalent. The marginal likelihood contains a natural dimension penalty, it is consistent with the Bayesian paradigm, and has been applied to CART models (Denison and others, 1998
) and logic trees (Clark and others, 2005
). Nevertheless, it would be expected that averaging over the models visited during the MCMC run (as described in Section 4.10) would lead to better predictions than single models. We refer to Chipman and others (1998)
and Denison and others (1998)
for two different strategies which try to eliminate some of the problems associated with the sampling on the space of trees.
In the following sections, we evaluate the performance of the BIC and marginal likelihood on simulated and epidemiological data. The BIC for a model consisting of p components (q logic trees, p q environmental factors including the intercept) fitted to 2n haplotypes is
|
|
where
is the maximum log likelihood for the model,
and
are the maximum likelihood estimates (mles) of
and ß, respectively, and p is the total number of parameters in the model. The mles are estimated using iteratively re-weighted least squares (McCullagh and Nelder, 1989
). In the setting of a logistic regression, the marginal likelihood for a model
consisting of logic trees and environmental factors, p(y|
), is not available in closed form. However, the marginal likelihood can be approximated via the Laplace approximation (DiCiccio and others, 1997
)
|
|
where h(
,ß|
) = p(y|
,ß,
)p(
,ß|
) and
![]() |
Both
and
, the maximum posteriori estimate of
and ß, respectively, are found via Newton's method, and
.
The logistic model produces (log) odds ratios which are meaningful quantities in relating disease to risk factors in epidemiological studies, and are easily interpreted by clinicians and genetic epidemiologists. The mean, standard deviation and, indeed, the distribution, of the (log) odds ratios for the logic trees, environmental, and interaction effects can be calculated for the selected model using the algorithm described earlier.
One application of a model is to predict the probability of being diseased for any given environmental factor (x) and SNP configuration (t). This prediction is achieved through Bayesian model averaging (Hoeting and others, 1999
; Denison and others, 2002
). The posterior predictive distribution for the binary outcome, y, for x and t can be written as
![]() |
where
is the data and
j is the model at iteration j of M (post burn-in). The equation is simply a mixture of individual predictive distributions for y given each model, weighted by the posterior probability of each model. The predicted outcome or expectation of the posterior predictive distribution for y,
, is
![]() |
Other summary statistics of the posterior predictive distribution may be calculated in a similar way.
| 5. SIMULATION STUDY |
|---|
|
|
|---|
Here we compare the performance of our method with logic regression and stepwise logistic regression using data from seven simulation scenarios. Using a coalescent model (Hudson, 2002
- Model 0:
= 0
- Model 1:
= 2 + log(3)J([S6
S37])
- Model 2:
= 2 + log(2) J([S6
S37])
- Model 3:
= 2 + log(3) J([S
S37])
- Model 4:
= 2 + log(2) J([S
S37])
- Model 5:
= 2 + log(3) J([(S6
S41)
S37])
- Model 6:
= 2 + log(2) J([(S6
S41)
S37]),
refers to the logit(P(Y = 1)), and the probability of being diseased (P(Y = 1)) is 1/(1 + exp(
)). Model 0 consists of no causal SNP, i.e. there is no SNP associated with the disease outcome. An additive disease model was defined at the genotype level using J(·), which takes values 0, 1, and 2, according to whether zero, one, or two haplotypes constituting a single genotype are consistent with the logic tree. The intercept relates to the underlying risk of disease in the population and the slope estimate mimics a form of (log) genotypic relative risk for the logic tree for each individual. Note that the genetic signal is relatively weak in all the simulation scenarios to mimic a potentially realistic situation.
For our approach, we assumed
(
) = N(0,100),
(ß) = N(0,100), the maximum number of trees (tmax) to be 5, and the number of SNPs on a tree to be Geometric (0.3). We ran the algorithm for 1 000 000 iterations, with the first 10 000 being discarded as burn-in. Computational time using a single Pentium IV processor for a single replicate was less than 15 min. We compared our method to (i) logistic regression with stepwise selection using Akaike's Information Criterion (AIC) and BIC, where up to two-way interactions could be included, and (ii) logic regression. The stepwise logistic approaches and logic regression were implemented in the R statistical software package (R Development Core Team, 2005
). We have used the R package LogicReg to implement logic regression; to avoid time-consuming cross-validation experiments, we prespecified the number of trees and leaves to be those of the true model. In particular, for Models 1, 2, 3, and 4, we set the number of trees to be 1 and the number of leaves to be 2. Similarly, for Models 5 and 6, we set the number of trees and leaves to be 1 and 3, respectively.
Table 1 shows the proportion of replicates where the methods have either (a) identified the correct set of SNPs used to simulate the phenotype, with no extra SNPs (i.e. the final model from each statistical procedure contains all the correct SNPs and no others), or (b) identified the correct set of SNPs with perhaps extra SNPs. Table 2 shows (c) the proportion of replicates where the methods have not identified any of the correct set of SNPs used to simulate the phenotype in the model (i.e. the final model selected from each statistical procedure does not contain any of the correct SNPs) and (d) under simulation scenario 0, the proportion of false positives (i.e. the number of replicates for which the statistical procedure selected at least one SNP). The stepwise logistic approach using the AIC tends to select models with more SNPs than other approaches (see Table 1); in fact, the method performs best when considering (c) and worst when considering (a) and (d). The stepwise logistic approach with the BIC performs well in simulation scenarios 1 and 2, where the logic tree structure is simpler. The performance of the stepwise approaches is worst in simulations 3, 4, 5, and 6, where some or all the SNPs involved in simulating the phenotype are in the same block, and the search is affected by high LD. Comparatively, logic regression seems to perform better in simulations 3, 4, 5, and 6, and this may be due to the increased flexibility of the logic tree structure, and its ability to be model interactions. Logic regression was constrained to have the correct number of trees and leaves, and therefore models larger than the correct one are never selected. Hence, the mean number of extra SNPs other than the correct ones is zero, and the proportions in (b) are underestimated (see second column of Table 1). For the same reason, the false-positive rates for logic regression in Table 2 have not been calculated. Our method uses logic trees jointly with the perfect phylogeny constraint and an optimality criterion, and generally tends to perform better than the other methods; in the worst cases its performance is comparable to the best approach. An advantage of the proposed approach is that it keeps the size of the model relatively small. In our simulations, both the BIC and marginal likelihood criteria showed favorable properties with marginal likelihood performing slightly better. However, although BIC performs well, the properties of such a criterion are not fully understood especially in complex settings.
|
|
Table 3 shows the median and interquartile range of selection proportions from our method for the causal SNPs in simulations 16.
|
From the MCMC run, we can estimate the posterior probability of inclusion for each SNP. We can then use the BF (Kass and Raftery, 1995
i = 1)" against H0: "variable/SNP i should not be included in the model (i.e.
i = 0)". Let p0 = P(
i = 1) be the prior probability of H1 (and therefore 1 p0 is the prior probability of H0), and let p1 = P(
i = 1|Data) be the posterior probability of H1. The BF is then defined as the ratio of the posterior odds and the prior odds: |
|
Kass and Raftery (1995)
suggest that a BF greater than 10 provides substantial evidence against H0, but this threshold should always be interpreted with caution. Due to convergence issues discussed in Section 4.8, the estimate of p1 is also problematic. Let us consider, for example, one of the simulation replicates from Model 1. The posterior probability of inclusion for S6 is equal to one, leading to a BF of infinity. In the case of S37, p1 = 0.948, and the BF is greater than 100. For most of the remaining SNPs, p1 is less than 0.01, and BF is less than 0.06. The prior probability p0 was evaluated by simulations.
| 6. SARCOIDOSIS STUDY |
|---|
|
|
|---|
Sarcoidosis is a rare disease due to inflammation, particularly in the lung and lymph nodes, and is characterized by the presence of granulomas, small areas of inflamed cells. Sarcoidosis is thought to result from the interaction between an unknown environmental antigenic trigger and the host's genetic susceptibility. Clinical onset and progression vary widely in sarcoidosis, ranging from benign disease to progressive pulmonary fibrosis leading to respiratory failure. A subset of patients have Löfgren's syndrome, which is characterized by the symptoms, such as acute presentation of fever. It has been found that some relevant genes on the human leukocyte antigen (HLA) region, such as DQB1*0201, may be associated with Löfgren's syndrome and not other types of sarcoidosis (Spagnolo and others, 2003
Here we present data from 141 Dutch patients with sarcoidosis, 47 (33%) with Löfgren's syndrome. The candidate gene regions of interest are TNF (4 SNPs: T-1032C, G-308A, G-238A, G488A), CCR5 (6 SNPs: A-5652G, C-3890A, T-3452G, T-2129C, T-1829C, G32A), and CCR2 (5 SNPs: G-6928T, A-6752G, G190A, A3610G, C3671G). These represent regions of approximately 1.5, 5.5, and 10.5 kb, respectively, and will be referred to collectively as SNPs S1S15. The gene trees for the Sarcoidosis study haplotypes using all 282 haplotypes are shown in Figure 3, and represent five blocks constructed using the imperfect phylogeny method (Halperin and Eskin, 2003
). The plots indicate that mutations at S2, S12, and S14 may be associated with Löfgren's syndrome. Table 4 shows other covariates of interest: gender, age at diagnosis, history of smoking, and various HLA regions. There appears to be a large difference in the distribution of gender and DRB1*201 between the phenotype groups. Although the HLA regions are genetic in nature, they involve potentially complex combinations of genotypes, and in practice they tend not to be converted to SNPs. Here we consider them as binary covariates indicating whether the particular combination of genotypes of interest is present or absent for an individual. The HLA covariates could be viewed as existing risk factors that the new candidate SNPs may potentially displace from an association model.
|
|
An analysis of the data by Spagnolo and others (2003)
(
),
(ß) = N(0,100), tmax = 5, and the number of SNPs on a tree to be distributed as a Geometric (0.3). The block Gibbs sampler was run for 1 000 000 iterations with a 10 000 burn-in.
The selection frequencies in Table 5 indicate that gender and DQB1*201 were present in most post burn-in models, while S12 was the most selected SNP. The three selected most frequently were
|
|
|
The results were robust to changes in the random number seed, tmax, and the parameter in the geometric distribution for tree size. The best model under both the BIC and marginal likelihood contained gender, a history of smoking, DQB1*201, and S12
S14. The results of fitting this model in a Bayesian logistic regression are presented in Table 6, and are consistent with the findings of Spagnolo and others (2003)
|
Using model averaging, the predicted probability of Löfgren's syndrome for a male with a history of smoking, the DQB1*201 allele, and with mutations at SNPs 12 or 14 was 0.702 (variance 0.209). Application of logic regression to the haplotype data suggested that the trees (log OR) S
( 1.22), S12
S14 (1.52), and S15 ( 0.863) were associated with the response. However, when including the three non-SNP covariates in Table 6 as fixed covariates, the tree component (log OR) of the model was S12
S
(1.27). A Bayesian logistic regression with gender, a history of smoking, DQB*201, and S12
S
resulted in log ORs (standard error) of 1.78 (0.37), 0.79 (0.36), 3.09 (0.37), and 1.71 (0.73), respectively. | 7. DISCUSSION |
|---|
|
|
|---|
We have presented a Bayesian logistic regression method that performs an evolution-based association analysis using haplotype data, where these data are consistent with a perfect phylogeny. Haplotype data capture the genetic variation among individuals in a population and among populations, and an understanding of this variation is essential in genetic association studies of complex disease. A "gene tree" is equivalent to the SNP configuration of mutations when there is no recombination and recurrent or parallel mutations do not occur. A natural choice for modeling genealogies is the coalescent model, which induces a stochastic distribution for a gene tree. Although, we only use the basic gene tree topology, full likelihood inference to estimate mutation rates, the age of mutations, or the time to the most recent common ancestor is also possible. Cladistic approaches (Templeton and others, 1992
We applied our method to both simulated and epidemiological data. The simulated data had varying effect sizes, and our approach was in general able to detect the correct logic tree with or without other SNPs more often than the alternative approaches, while keeping the total number of leaves small. In the Sarcoidosis study, we identified genetic and non-genetic factors associated with Löfgren's syndrome. Although, the data set is small, our results replicate the findings in previous work. In general, large sample sizes are required to detect modest genetic effects and outcomes must be robust (Cardon and Bell, 2001
). Our method performs well because of the link between the data matrix, logic trees, and gene trees, when the perfect phylogeny assumption holds. The gene tree enhances speed in the model search space. Our methodology is an extension of logic regression, an approach that has been shown to be effective at identifying disease-predisposing SNPs when compared to other approaches (Witte and Fijal, 2001
). Logic regression is also related directly to other graphical methods (Ruczinski and others, 2003
), such as CART. In practice, the presence of recombination will mean the perfect phylogeny assumption is unlikely to hold, but like other LD mapping methods mentioned earlier (except Larribe and others, 2002
) assuming (little or) no recombination is a starting point, and may be potentially useful when considering haplotype blocks of high LD. It is possible to apply our current algorithm where there is a small amount of recombination, using either the prune algorithm to obtain a reduced data set that has a unique gene tree, or the imperfect phylogeny methodology (Halperin and Eskin, 2003
) to create blocks. It is also possible to use knowledge of a genome map of haplotype blocks (e.g. the HAPMAP project described in Cardon and Abecasis, 2003
) in our methodology.
Another potential limitation is that genetic data often consist of diploid genotypes and not haplotypes. Haplotypes need to be reconstructed and any uncertainty associated with this process must be incorporated into the analysis. Haplotype reconstruction programs, such as PHASE (Stephens and Donnelly, 2003
), output the posterior probabilities or population frequencies for haplotypes, and these may be used as a probability distribution of the possible haplotype configurations in a more general hierarchical modeling framework. We have presented an extension to the block Gibbs sampler which will estimate the posterior distribution of the haplotype configuration. Moreover, haplotypes do not usually inform us about a mechanism of disease. The use of haplotypes will detect a signal for an association (our objective), and our methodology may be applied to genotypes to detect the mechanism by using the approach described earlier.
The Bayesian nature of the method implies that model selection may be less problematic than in a frequentist setting, where models may be overfitted and multiple testing can lead to false positives. There are also advantages in terms of incorporating prior beliefs, and interpretability of posterior probabilities and distributions. Extensions of the Bayesian linear model (Lindley and Smith, 1972
) are well developed, and it is possible to model a continuous or a count phenotype as well as incorporate cluster effects for family data. Our framework incorporates evolutionary relationships among haplotypes, and interpretability of results in association studies are enhanced substantially. The methodology described is useful for haplotype data with little recombination, and should prove to be useful for genetic epidemiologists who are searching for the genetic and environmental basis of complex disease in this setting.
Software to implement the method is available from the corresponding author.
| ACKNOWLEDGMENTS |
|---|
We wish to thank Ken Welsh for providing the data from the Sarcoidosis study, Jonathan Marchini for introducing us to the idea of logic regression and for some assistance with C++ programming, Chris Holmes for providing a copy of his technical report, and Yvonne Griffiths for comments. Taane G. Clark was funded by a National Health Service (UK) Training Fellowship, and is now funded by the Medical Research Council (UK). Conflict of Interest: None declared.
| REFERENCES |
|---|
|
|
|---|
-
Bafna V, Gusfield D, Lancia G, Yooseph S. (2003) Haplotyping as a perfect phylogeny: a direct approach. Journal of Computational Biology 3:32340.[CrossRef]
Cardon LR and Abecasis GR. (2003) Using haplotype blocks to map human complex trait loci. Trends in Genetics 19:13540.[CrossRef][Web of Science][Medline]
Cardon LR and Bell JI. (2001) Association study designs for complex diseases. Nature Reviews Genetics 2:918.[CrossRef][Web of Science][Medline]
Chipman HA, George EI, McCulloch RE. (1998) Bayesian CART model search. Journal of the American Statistical Association 93:93560.[CrossRef]
Clark TG, De Iorio M, Griffiths RC, Farrall M. (2005) Finding associations in dense genetic maps: a genetic algorithm approach. Human Heredity 50:97108.[Medline]
Daly MJ, Rioux JD, Schaffner SF, Hudson TJ, LES TJ. (2001) High-resolution haplotype structure in the human genome. Nature Genetics 29:22932.[CrossRef][Web of Science][Medline]
Denison DGT, Holmes CC, Mallick BK, Smith AFM. (2002) Bayesian Methods for Nonlinear Classification and Regression(John Wiley and Sons Ltd., England).
Denison DGT, Mallick BK, Smith AFM. (1998) A Bayesian CART algorithm. Biometrika 85:36377.
Devroye L. (1986) Non-Uniform Random Variate Generation(Springer, New York) pp. 3618.
DiCiccio TJ, Kass RE, Raftery A, Wasserman L. (1997) Computing Bayes factors by combining simulation and asymptotic approximations. Journal of the American Statistical Association 92:90315.[CrossRef]
Eskin E, Halperin E, Karp RM. (2003) Efficient reconstruction of haplotype structure via perfect phylogeny. Journal of Bioinformatics and Computational Biology 1:120.[CrossRef][Medline]
Estabrook GF, Johnson CS, McMorris FR. (1975) An idealized concept of the true cladistic character. Mathematical Biosciences 23:26372.[Medline]
Felsenstein J. (1983) Method for inferring phylogenies: a statistical view. In Felsenstein J (Ed.). Numerical Taxonomy(Springer, Berlin) pp. 31534.
Gabriel SB, Schaffner SF, Nguyen H, Moore JM, Roy J, Blumenstiel B, Higgins J, DeFelice M, Lochner A, Faggart M. and others. (2002) The structure of haplotype blocks in the human genome. Science 296:22259.
Griffiths RC. (2001) Ancestral inference from gene trees. In Donnelly P and Foley R (Eds.). Genes, Fossils, and Behaviour: An Integrated Approach to Human Evolution(IOS Press, Netherlands) pp. 13772.
Gusfield D. (1991) Efficient algorithms for inferring evolutionary trees. Networks 21:1928.
Halperin E and Eskin E. (2003) Haplotype reconstruction from genotype data using imperfect phylogeny. Bioinformatics 1:18.
Hoeting JA, Madigan D, Raftery AE, Volinsky CT. (1999) Bayesian model averaging: a tutorial (with discussion). Statistical Science 14:382417.[CrossRef]
Holmes CC and Held L. (2003) On the simulation of Bayesian binary and polychotomous regression models using auxiliary variables. Technical Report(Department of Mathematics, Imperial College, London).
Hudson RR. (2002) Generating samples under a Wright-Fisher neutral model. Bioinformatics 18:3378.
Kass RE and Raftery A. (1995) Bayes factors. Journal of the American Statistical Association 90:77395.[CrossRef]
Kingman JFC. (1982) The coalescent. Stochastic Processes and Their Applications 13:23548.[CrossRef]
Larribe F, Lessard S, Schork NJ. (2002) Gene mapping via the ancestral recombination graph. Theoretical Population Biology 62:21529.[CrossRef][Web of Science][Medline]
Lindley DV and Smith AFM. (1972) Bayes estimates for the linear model (with discussion). Journal of the Royal Statistical Society, Series B 34:141.
McCullagh P and Nelder JA. (1989) Generalized Linear Models(Chapman and Hall, London).
Morris AP, Whittaker J, Balding DJ. (2002) Fine-scale mapping of disease loci via shattered coalescent modelling of genealogies. American Journal of Human Genetics 70:686707.[CrossRef][Web of Science][Medline]
Prentice RL and Pyke R. (1979) Logistic disease incidence and case-control studies. Biometrika 66:40311.
R DEVELOPMENT CORE TEAM. (2005) R: A Language and Environment for Statistical Computing.
Reich DE, Schaffner SF, Daly MJ, McVean G, Mullikin JC, Higgins JM, Richter DJ, Lander ES, Altshuler D. (2002) Human genome sequence variation and the influence of gene history, mutation and recombination. Nature Genetics 32:13542.[CrossRef][Web of Science][Medline]
Rosenberg NA and Nordborg M. (2002) Genealogical trees, coalescent theory and analysis of genetic polymorphisms. Nature Genetics 3:38090.
Ruczinski I, Kooperberg C, LeBlanc M. (2003) Logic regression. Journal of Computational and Graphical Statistics 12:475511.[CrossRef]
Sasieni PD. (1997) From genotypes to genes: doubling the sample size. Biometrics 53:125361.[CrossRef][Web of Science][Medline]
Schwarz G. (1978) Estimating the dimension of a model. Annals of Statistics 6:4614.[Web of Science]
Seltman H, Roeder K, Devlin B. (2003) Evolutionary-based association analysis using haplotype data. Genetic Epidemiology 25:4858.[CrossRef][Web of Science][Medline]
Spagnolo P, Renzoni EA, Wells AU, Sato H, Grutters JC, Sestini P, Abdallah A, Gramiccioni E, Ruven HJT, du Bois RM. and others. (2003) CC chemokine receptor 2 and sarcoidosis: association with Löfgren's syndrome. American Journal of Respiratory and Critical Care Medicine 168:11626.
Stephens M and Donnelly P. (2003) A comparison of Bayesian methods for haplotype reconstruction from population genotype data. American Journal of Human Genetics 73:11629.[CrossRef][Web of Science][Medline]
Templeton AR, Crandall KA, Sing CF. (1992) A cladistic analysis of phenotypic associations with haplotypes inferred from restriction endonuclease mapping and DNA sequence data. III. Cladogram estimation. Genetics 132:61933.[Abstract]
Witte JS and Fijal BA. (2001) Introduction: analysis of sequence data and population structure. Genetic Epidemiology S1:62631.
Received November 25, 2004; revised September 16, 2005; revised February 13, 2006; accepted for publication March 22, 2006.
![]()
CiteULike
Connotea
Del.icio.us What's this?
This article has been cited by other articles:
![]() |
Z. Ding, T. Mailund, and Y. S. Song Efficient whole-genome association mapping using local phylogenies for unphased genotype data Bioinformatics, October 1, 2008; 24(19): 2215 - 2221. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||















