Biostatistics Advance Access originally published online on April 11, 2007
Biostatistics 2008 9(1):30-50; doi:10.1093/biostatistics/kxm010
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Penalized logistic regression for detecting gene interactions
Google Inc., 1600 Amphitheatre Parkway, Mountain View, CA 94043, USA meeyoung{at}google.com
Department of Statistics and Department of Health Research & Policy, Stanford University, Stanford, CA 94305, USA
* To whom correspondence should be addressed.
| SUMMARY |
|---|
We propose using a variant of logistic regression (LR) with
-regularization to fit gene–gene and gene–environment interaction models. Studies have shown that many common diseases are influenced by interaction of certain genes. LR models with quadratic penalization not only correctly characterizes the influential genes along with their interaction structures but also yields additional benefits in handling high-dimensional, discrete factors with a binary response. We illustrate the advantages of using an
-regularization scheme and compare its performance with that of "multifactor dimensionality reduction" and "FlexTree," 2 recent tools for identifying gene–gene interactions. Through simulated and real data sets, we demonstrate that our method outperforms other methods in the identification of the interaction structures as well as prediction accuracy. In addition, we validate the significance of the factors selected through bootstrap analyses.
Keywords: Discrete factors; Gene interactions; High dimensional; Logistic regression; L2-regularization
| 1. INTRODUCTION |
|---|
Because many common diseases are known to be affected by certain genotype combinations, there is a growing demand for methods to identify the influential genes along with their interaction structures. We propose a forward stepwise method based on penalized logistic regression (LR). Our method primarily targets data consisting of single-nucleotide polymorphisms (SNPs) measurements and a binary response variable separating the affected subjects from the unaffected ones.
LR is a standard tool for modeling effects and interactions with binary response data. However, for the SNP data here, LR models have significant drawbacks:
- The 3-level genotype factors and their interactions can create many parameters, and with relatively small data sets, problems with overfitting arise.
- With many candidate loci, factors can be correlated leading to further degradation of the model.
- Often cells that define an interaction can be empty or nearly empty, which would require special parametrization.
- These problems are exacerbated as the interaction order is increased.
In this paper, we show that some simple modifications of standard LR overcome the problems. We modify the LR criterion by combining it with a penalization of the
-norm of the coefficients; this adjustment yields significant benefits. Because of the quadratic penalization, collinearity among the variables does not degrade fitting much, and the number of factors in the model is essentially not limited by sample size. When the levels of discrete factors are sparse or high-order interaction terms are considered, the contingency tables for the factors may easily include cells with zeros or near zeros. Again, with the help of quadratic penalization, these situations do not diminish the stability of the fits.
We compare our method with multifactor dimensionality reduction (MDR; Ritchie and others, 2001
), a widely used tool for detecting gene interactions. The authors of MDR propose it as an alternative to LR, primarily for the reasons mentioned above. Their method screens pure interactions of various orders, using cross-validation to reduce the bias of overfitting. Once an interaction is found, the inventors propose using LR to tease it apart.
In the following sections, we describe and support our approach in more detail with examples and justifications. We review MDR and several other related methods in Section 2. We explore the use of penalized LR in Section 3. Our methods are illustrated with simulated and real data sets in Sections 4 and 5. We conclude with a summary and possible extensions of our studies in Section 6.
| 2. RELATED WORK |
|---|
MDR proposed by Ritchie and others (2001)
, Ritchie and others (2003)
, Hahn and others (2003)
and Coffey and others (2004)
is a popular technique for detecting and characterizing gene–gene/gene–environment interactions that affect complex but common genetic diseases.
MDR finds both the optimal interaction order K and the corresponding K factors that are significant in determining the disease status. The algorithm is as follows:
- 1) For each
run 10-fold cross-validation to find the optimal set of K factors (described below).
- 2) Compare the prediction errors (on the left out set) and the consistencies (how many times out of 10 folds the optimal set of factors was selected) for different K.
- 3) Select the K with the smallest estimate of prediction error and/or the largest consistency. This K is the final size of the model, and the optimal set for the chosen order forms the best multifactor model.
- 2) Compare the prediction errors (on the left out set) and the consistencies (how many times out of 10 folds the optimal set of factors was selected) for different K.
In step 1 above, MDR uses cross-validation to find the optimal set of factors for each K. The following steps are repeated for each cross-validation fold:
- 1) Construct a contingency table among every possible set of K factors.
- 2) Label the cells of the table "high risk" if the cases/control ratio is greater than 1 in the training part
and "low risk" otherwise.
- 3) Compute the training error for the
data, by classifying high risk as a case and low risk a control.
- 4) For the set of K factors that yields the lowest training error, compute the prediction error using the remaining
- 2) Label the cells of the table "high risk" if the cases/control ratio is greater than 1 in the training part
A strong selling point of MDR is that it can simultaneously detect and characterize multiple genetic loci associated with diseases. It searches through any levels of interaction regardless of the significance of the main effects. It is therefore able to detect high-order interactions even when the underlying main effects are statistically insignificant. However, this "strength" is also its weakness; MDR can "only" identify interactions and hence will suffer severely from lack of power if the real effects are additive. For example, if there are 3 loci active, and their effect is additive, MDR can only see them all as a 3-factor interaction. Typically, the power for detecting interactions decreases with K since the number of parameters grows exponentially with K, so this is a poor approach if the real effects are additive and lower dimensional. Of course, one can post-process a 3-factor interaction term and find that it is additive, but the real art here is in discovering the relevant factors involved.
MDR suffers from several technical disadvantages. First, cells in high-dimensional tables will often be empty; these cells cannot be labeled based on the cases/control ratio. Second, the binary assignment (high risk/low risk) is highly unstable when the proportions of cases and controls are similar.
The authors of MDR claim (Ritchie and others, 2003
- 1) MDR searches for the optimal interaction order K.
- 2) MDR searches for an optimal set of K factors, among
possibilities.
- 3) Given K factors, MDR "searches" for the optimal binary assignment of the cells of a table into high risk and low risk.
- 2) MDR searches for an optimal set of K factors, among
All these amount to an "effective dimension" or "degrees of freedom" that is typically much larger than 1. We demonstrate, through a simulation, a more realistic assessment of the dimensionality of MDR. Here, we present the results and refer the readers to Appendix 6 for the details of the simulation procedure.
We simulated 500 samples with 10 factors, each having 3 levels; the responses were randomly chosen to be
and thus, none of the factors was relevant to the response. Changing the order of interaction (
) and the total number of available factors (from K to 10), we computed the deviance changes for the fitted models. We estimated the degrees of freedom of the models by repeating the simulation 200 times and averaging the deviance measures.
Figure 1 captures the results. The horizontal lines mark the degrees of freedom from MDR (the lower dotted line) and LR (the upper dotted line) using a fixed set of factors. These are summarized as well in Table 1. For example, an MDR model with a third-order interaction of 3-level factors has an effective dimension of
—above half way between the claimed 1 and the 26 of LR.
|
|
Conditional LR is an essential tool for the analysis of categorical factors with binary responses. Unlike MDR, LR is able to fit additive and other lower order effects as well as full-blown interactions. Therefore, LR can yield a more precise interpretation that distinguishes the presence of additive effects from the presence of interaction effects. In fact, the users of MDR fit LR models using the factors selected by MDR precisely for this reason; to simplify the high-order interactions into its component effects. LR is sometimes criticized due to the difficulties of estimating a large number of parameters with a relatively small number of samples (Ritchie and others, 2001
); however, we provide a solution to overcome this drawback. Biologists (Coffey and others, 2004
, e.g.) have shown that LR performs as well as other methods in cases where it is able to be fit.
Huang and others (2004)
proposed a tree-structured learning method, FlexTree, to identify the genes related to the cause of complex diseases along with their interactions. It is a rather complex procedure that aims to build a tree with splits in the form of a linear combination of multiple factors. Beginning from the root node that contains all the observations, each node is recursively split into 2 daughter nodes through the following steps:
- 1) Use backward shaving to select the optimal set of predictors for splitting the specific node. For the backward shaving, form a decreasing series of candidate subsets based on the bootstrapped scores. Then, determine the best subset among the series that yields the largest cross-validated impurity measure.
- 2) Perform a permutation test to see if the linear relationship between the selected subset of predictors and the outcome is strong enough. If so, go to the next step. If not, stop splitting the node.
- 3) Use the selected subset of predictors to compute the regression coefficients
and the splitting threshold
such that a binary split is determined based on
The optimal scoring method is used for estimating
and C is chosen to maximize the resulting Gini index for the node.
- 2) Perform a permutation test to see if the linear relationship between the selected subset of predictors and the outcome is strong enough. If so, go to the next step. If not, stop splitting the node.
Huang and others (2004)
compared FlexTree to other methods such as classification and regression trees, quick, unbiased and efficient statistical tree, logic regression, bagging, multiple additive regression trees, and random forest; they showed that FlexTree performed better than or as well as these competing methods. Using a very similar data set, we compared the performance of our method with that of FlexTree (in Section 5).
| 3. PENALIZED LR |
|---|
The generic LR model has the form
|
| (3.1) |
where X is a vector of predictors (typically dummy variables derived from factors in the present setting). LR coefficients are typically estimated by maximum likelihood (McCullagh and Nelder, 1989
); in fact, the deviance (A.1) that we used in Section 2.1 is twice the negative log-likelihood. Here, we maximize the log-likelihood subject to a size constraint on
-norm of the coefficients (excluding the intercept) as proposed in Lee and Silvapulle (1988)
and Le Cessie and Van Houwelingen (1992)
. This amounts to minimizing the following equation:
|
| (3.2) |
where l indicates the binomial log-likelihood and
is a positive constant. The coefficients are regularized in the same manner as in ridge regression (Hoerl and Kennard, 1970
). The importance of the quadratic penalty, particularly in our application, will be elaborated in subsequent sections.
To fit penalized LR models, we repeat the Newton–Raphson steps, which result in the "iteratively reweighted ridge regressions" (IRRR) algorithm:
![]() | (3.3) |
|
| (3.4) |
|
| (3.5) |
is the
matrix of the predictors (n and p are the numbers of the samples and the predictors, respectively),
is the vector of
responses,
is the vector of probability estimates that the responses are equal to 1,
is the diagonal matrix with the diagonal elements
for
,
is the diagonal matrix with the diagonal elements
and
is the current "working" response in the IRRR algorithm.
As a result of the quadratic penalization, the norm of the coefficient estimates is smaller than in the case of regular LR; however, none of the coefficients is zero. As in ridge regression, the amount of shrinkage that gets applied to each coefficient depends on the variance of the corresponding factor. This analogy to ridge regression is easily seen from (3.3)–(3.5).
Using the values from the final Newton–Raphson step of the IRRR algorithm, we estimate the effective degrees of freedom of the model (Hastie and Tibshirani, 1990
) and the variance of the coefficient estimates (Gray, 1992
). The effective degrees of freedom are approximated by
|
| (3.6) |
where
is obtained from the final step of the algorithm. This representation is based on similar ideas to those described in Appendix 6. The variance of the coefficients is also estimated from the final iteration:
|
| (3.7) |
|
| (3.8) |
![]() | (3.9) |
where
denotes the information in
This is referred to as a "sandwich estimate" (Gray, 1992
).
We now elaborate on the use of penalized LR specifically as it relates to our problem.
Using quadratic regularization with LR has a number of attractive properties:
- 1) When we fit interactions between categorical factors, the number of parameters can grow large. The penalization nevertheless enables us to fit the coefficients in a stable fashion.
- 2) We can code factors in a symmetric fashion using dummy variables, without the usual concern for multicollinearity. (In Section 3.5, we introduce a missing value imputation method taking advantage of this coding scheme.)
- 3) Zero cells are common in multifactor contingency tables. These situations are handled gracefully.
- 2) We can code factors in a symmetric fashion using dummy variables, without the usual concern for multicollinearity. (In Section 3.5, we introduce a missing value imputation method taking advantage of this coding scheme.)
Since quadratic regularization overcomes collinearity among the variables, a penalized LR model can be fit with a large number of factors or high-order interaction terms. The sample size does not limit the number of parameters. In Section 3.2, we illustrate our variable selection strategy; a growing number of variables in the model is not detrimental to the variable search.
The quadratic penalty makes it possible to code each level of a factor by a dummy variable, yielding coefficients with direct interpretations. Each coefficient reveals the significance of a particular level of a factor. This coding method cannot be applied to regular LR because the dummy variables representing a factor are perfectly collinear (they sum to 1). To overcome this, one of the levels is omitted or else the levels of the factors are represented as contrasts.
It turns out that the penalized criterion (3.2) creates the implicit constraint that the coefficients of the dummy variables representing any discrete factor/interaction of factors must sum to zero. Consider the model
|
| (3.10) |
where D is a vector of dummy variables that represent the levels of a 3-level categorical factor. As can be seen from (3.10), adding a constant vector to
and subtracting the same constant from
would not change the probability estimate. However, because our criterion minimizes
the coefficients are identifiable in such a way that the elements of
sum to zero. Given a data set of n observations
, we differentiate the objective function (3.2) with respect to the coefficients and obtain
|
| (3.11) |
|
| (3.12) |
These equations, in turn, imply
Zhu and Hastie (2004)
explored this property of the
-penalization in (multinomial) penalized LR using continuous factors. We can learn more from (3.12):
- Higher order interactions have particular marginal constraints. For example, consider an interaction between factor 1 and 2:

(3.13) 
(3.14) We can see this by summing over appropriate subsets of the equations as in (3.12).
- Each of these imply that
.
- A generalization is that for any penalized interaction term, if the coefficients of a marginal are "not" penalized, the sums as in (3.13) and (3.14) are zero.
When a column of
is so unbalanced that it contains no observations at a particular level (or combination of levels), the corresponding dummy variable is zero for all n observations. This phenomenon is common in SNP data because one allele of a locus can easily be prevalent over the other allele on the same locus. The lack of observations for certain levels of factors occurs even more frequently in high-order interaction models. We cannot fit a regular LR model with an input matrix that contains a column of zeros. However, when LR is accompanied by any small amount of quadratic penalization, the coefficient of the zero column will automatically be zero.
We demonstrate this for a simple 2-way interaction term in a model. As in (3.12),
|
| (3.15) |
where
is the number of observations with
and
,
is the number among these with
, and
is the coefficient for the jth level of variable 1, etc. The equivalence implies that if
then
and, thus,
for any
An analogous equality holds at any interaction order.
Penalizing the norm of the coefficients results in a smoothing effect in most cases. However, with an
-penalization, none of the coefficients is set to zero unless the distribution of the factors is extremely sparse as illustrated in Section 3.1. For prediction accuracy and interpretability, we often prefer using only a subset of the features in the model, and thus we need to include a method for variable selection. For that purpose, we use a classic approach: forward selection, followed by backward deletion.
In each forward step, a factor/interaction of factors is added to the model. A preset total number of forward steps are repeated. In the subsequent backward steps, a factor/interaction of factors is deleted, beginning with the final (and largest) model from the forward steps. Backward deletion continues until only one factor remains in the active set (the set of variables in the model). The choice of factor to be added or deleted in each step is based on the cost-complexity statistic
where cp is "complexity parameter." Popular choices are
and
for Akaike information criterion (AIC) and Bayesian information cirterion (BIC), respectively.
When adding or deleting variables, we obey the hierarchy principle: when an interaction of multiple factors is in the model, the lower order factors comprising the interaction should also be present. This is symmetric in the 2 lower order factors; we in fact implement a less stringent asymmetric version. To allow interaction terms to enter the model more easily, we modify the convention, such that any factor/interaction of factors in the active set can form a new interaction with any other single factor, even when the single factor is not yet in the active set. This asymmetric hierarchy was proposed by Friedman (1991)
in the context of Multivariate Adaptive Regression Splines models. To add more flexibility, an option we allow is to provide all possible second-order interactions as well as main effect terms as candidate factors at the beginning of the forward steps.
In the backward deletion process, we again respect the asymmetric hierarchy; not all components (of lower order) of an interaction term can be dropped before the interaction term. As the size of the active set is reduced monotonically, the backward deletion process produces a series of models, from the most complex model to the NULL model. Using the list of corresponding scores, we select the model size that generated the minimum score C.
Our quadratic penalization scheme has an unwelcome side effect, particularly when the regularization parameter
is large. Suppose factor
has a strong effect and factor
has no effect, either as a main effect or interaction with
. With only
in the model and a sizeable quadratic penalty parameter
, its coefficients
are not allowed to enter at full strength. If we include the interaction term
, its coefficients obey (3.13)
, and hence they can augment the reduced strength main effects. Since, for example,
, it is easy to see that quadratic penalization "favors" breaking up a large main effect coefficient into a sum of smaller interaction pieces. This suggests that interaction terms might be included in the model to compensate for the regularized main effects, rather than for the purpose intended1.
We often use a small value of
, largely as a device for controlling numerical stability; in these cases, this effect will be negligible. When larger values are used, we propose an alternative and more restrictive forward "stagewise" selection approach.
That is, when a variable is added to the model, the coefficients for all the previously included factors are fixed and only the coefficients for the new variable are penalized. This prevents the earlier coefficients from readjusting themselves to take advantage of the phenomenon described above.
We compare the forward stagewise and stepwise procedures in Section 5; the sets of factors selected by the 2 procedures are very similar and often identical because the factors that enter the model in the forward stagewise procedure are often in the same sequence as in the forward selection steps. Our software offers both choices, but based on our experience so far the original stepwise procedure is our preferred choice.

Here, we explore the smoothing effect of
-penalization mentioned briefly in Section 3.2. When building factorial models with interactions, there is always a risk of overfitting the data even with a selected subset of the features. In addition to the advantages of using quadratic regularization emphasized in Section 3.1, it can also be used to smooth a model and thus control its effective size or degrees of freedom (3.6). As heavier regularization is imposed by increasing
the deviance of the fit increases (the fit degrades), but the variance (3.7)–(3.9) of the coefficients and the effective degrees of freedom (3.6) of the model decrease. As a result, when the model size is determined by AIC/BIC, a larger value of
tends to choose a model with more variables and allows complex interaction terms to join the model more easily.
We used simulation to illustrate the patterns of models selected by varying
and for guidance in choosing an appropriate value. We ran 3 sets of simulation analyses, for each one generating data with a different magnitude of interaction effect. For all 3 cases, we generated data sets consisting of a binary response and 6 categorical factors with 3 levels. Only the first 2 of the 6 predictors affected the response with the conditional probabilities of belonging to class 1 as in the tables below. (AA, Aa, aa) and (BB, Bb, bb) are the possible genotypes for the first 2 factors;
meaning that the proportions of the alleles A and B in the population are assumed to be the same as the alleles a and
respectively. Figure 2 displays the log odds for class
for all possible combinations of levels of the first 2 factors; the log odds are additive for the first model, while the next 2 show interaction effects.
|
For each model, we generated 30 data sets of size
with balanced class labels. Then, we applied our procedure with
for each
selecting a model based on BIC. Table 2 summarizes how many times
(the additive model with the first 2 factors) and
(the interaction between the first 2 factors) were selected. The models that were not counted in the table include 
or the ones with the terms other than
and
Given that
is the true model for the first set while
is appropriate for the second and the third, ideally we should be fitting with small values of
for the additive model but increasing
as stronger interaction effects are added.
|
We can cross-validate to choose the value of
for each fold, we obtain a series of optimal models (based on AIC/BIC) corresponding to the candidate values of
and compute the log-likelihoods using the omitted fold. Then, we choose the value of
that yields the largest average (cross-validated) log-likelihood. We demonstrate this selection strategy in Section 4.
The coding method we implemented suggests an easy, but reasonable, method of imputing any missing values. If there are any samples lacking an observation for a factor
we then compute the sample proportions of the levels of
among the remaining samples. These proportions, which are the expected values of the dummy variables, are assigned to the samples with missing cells.
In this scheme, the fact that the dummy variables representing any factor sum to 1 is retained. In addition, our approach offers a smoother imputation than does filling the missing observations with the level that occurred most frequently in the remaining data. Through simulations in Section 4, we show that this imputation method yields a reasonable result.
| 4. SIMULATION STUDY |
|---|
To compare the performance of penalized LR to that of MDR under various settings, we generated 3 epistatic models and a heterogeneity model, some of which are based on the suggestions in Neuman and Rice (1992)
Figure 3 contains the plots of the log odds for all the conditional probabilities in the tables. (Zeros are replaced by
to compute the log odds.) As can be seen, we designed the third epistatic model so that the log odds are additive (the odds are multiplicative) in the first 2 factors; the interaction effect is more obvious in the first 2 epistatic models than in the heterogeneity model. Our distinction of the heterogeneity and epistatic models is based on Vieland and Huang (2003)
and Neuman and Rice (1992)
. We discuss how it is different from the additive/interaction scheme in LR in Appendix 6.
|
In addition to this initial simulation, we added noise to the data as described in Ritchie and others (2003)
- 1) Missing cells (MS): For
of the samples, one of the significant factors is missing.
- 2) Genetic heterogeneity (GH): For
of the cases, the third and the fourth factors, instead of the first 2, are significant.
- 2) Genetic heterogeneity (GH): For
Under each scenario, we simulated 30 sets of training and test data sets. For each training set, we selected the regularization parameter
through cross-validation and, using the chosen
built a model based on the BIC criterion. For each cross-validation, we provided candidate values of
in an adaptive way. We first applied a small value,
to the whole training data set and achieved models of different sizes from the backward deletion. Based on the series of models, we defined a set of reasonable values for the effective degrees of freedom. Then, we computed the values of
that would reduce the effective degrees of freedom of the largest model to the smaller values in the set.
We measured the prediction errors by averaging the 30 test errors. Table 3 summarizes the prediction accuracy comparison of the penalized LR and the MDR; the standard errors of the error estimates are parenthesized. The table shows that for both methods, the error rates increase when the data contain errors. The prediction accuracies are similar between the 2 methods although MDR yields slightly larger error rates in most situations.
|
Table 4 contains the numbers counting the cases (out of 30) for which the correct factors were identified. For PLR, the number of cases for which the interaction terms were also selected is parenthesized; the numbers vary reflecting the magnitude of interaction effect imposed in these 4 models as shown in Figure 3.
|
For the heterogeneity model, main effects exist for both the 2 significant factors. In addition, as one is stronger than the other, MDR was not successful in identifying them simultaneously even for the data with no error, as shown in Table 4. In the case of the heterogeneity model or the second epistatic model, MDR suffered from a decrease in power, especially with GH perturbations. When GH perturbations were added to the second epistatic model, MDR correctly specified the 4 factors only 16 out of 30 times, while our method did so in all 30 simulations. These results show that the penalized LR method is more powerful than MDR, especially when multiple sets of significant factors exist; in these situations, MDR often identifies only a subset of the significant factors.
| 5. REAL DATA EXAMPLE |
|---|
We compared our method to FlexTree and MDR using the data from the SAPPHIRe (Stanford Asian Pacific Program for Hypertension and Insulin Resistance) project. The goal of the SAPPHIRe project was to detect the genes that predispose individuals to hypertension. A similar data set was used in Huang and others (2004)
to show that the FlexTree method outperforms many competing methods. The data set contains the menopausal status and the genotypes on 21 distinct loci of 216 hypotensive and 364 hypertensive Chinese women. The subjects family information is also available; samples belonging to the same family are included in the same cross-validation fold for all the analyses.
We applied 5-fold cross-validation to estimate the misclassification rates using penalized LR with forward stepwise variable selection, FlexTree, and MDR. For penalized LR, a complexity parameter was chosen for each fold through an internal cross-validation. For MDR, we used internal cross-validations to select the most significant sets of features.
Huang and others (2004)
initially used an unequal loss for the 2 classes: misclassifying a hypotension sample was twice as costly as misclassifying a hypertension sample. We fit penalized LR, FlexTree, and MDR with an equal as well as an unequal loss. For MDR with an unequal loss, we used "1/(1+2)" as threshold for the class proportion when labeling the cells of the tables, instead of "1/2" as in the usual cases of equal loss.
The results are compared in Table 5. Penalized LR achieved lower misclassification cost than other methods with either loss function. When an equal loss was used, FlexTree and MDR generated highly unbalanced predictions, assigning most samples to the larger class. Although penalized LR also achieved a low specificity, it was not so serious as in the other 2 methods.
|
Figure 4 shows the receiver operating characteristic (ROC) curves for penalized LR with an unequal (left panel) and an equal (right panel) loss function. For both plots, vertical and horizontal axes indicate the sensitivity and the specificity, respectively. Because penalized LR yields the predicted probabilities of a case, we could compute different sets of sensitivity and specificity by changing the classification threshold between 0 and
The solid circles on the curves represent the values we achieved with the usual threshold
The plus signs corresponding to FlexTree and the squares corresponding to MDR are all located toward the lower left corner, away from the ROC curves. In other words, penalized LR would achieve a higher sensitivity (specificity) than other methods if the specificity (sensitivity) were fixed the same as theirs.
|
Applying our forward stepwise procedure to the whole data set yields a certain set of significant features as listed in Table 6. However, if the data were perturbed, a different set of features would be selected. Through a bootstrap analysis (Efron and Tibshirani, 1993
|
We illustrate the bootstrap analysis using a fixed value of
For each of
bootstrap data sets, we ran a forward stepwise procedure with
which is a value that was frequently selected in previous cross-validation. At the end of the B bootstrap runs, we counted the frequency for every factor/interaction of factors that has been included in the model at least once. The second and the fourth columns of Table 6 contain the counts for the corresponding features; some of them were rarely selected. Table 7 lists the factors/interactions of factors that were selected with relatively high frequencies.
|
Not all the commonly selected factors listed in Table 7 were included in the model when we used the whole data set. It is possible that some factors/interactions of factors were rarely selected simultaneously because of a strong correlation among them. To detect such instances, we propose using the co-occurrence matrix (after normalizing for the individual frequencies) among all the factors/interactions of factors listed in Table 7 as a dissimilarity matrix and applying hierarchical clustering. Then, any group of factors that tends not to appear simultaneously would form tight clusters.
Using the 11 selected features in Table 7, we first constructed the
co-occurrence matrix so that the
th element was the number of the bootstrap runs in which the ith and the jth features were selected simultaneously. Then, we normalized the matrix by dividing the
th entry by the number of bootstrap runs in which either the ith or the jth feature was selected. That is, denoting the
th entry as
we divided it by
for every i and
As we performed hierarchical clustering with the normalized co-occurrence distance measure, PTPN1i4INV and MLRI2V were in a strong cluster: they were in the model simultaneously for only 2 bootstrap runs. Analogously,
and AGT2R1A1166C appeared 106 and 35 times, respectively, but only twice simultaneously. For both clusters, one of the elements was selected in our model (Table 6) while the other was not. Hence, the pairs were presumably used as alternatives in different models.
To compare the forward stagewise variable selection scheme to the forward stepwise method that we have used so far, we again summarize the factors/interactions of factors that were selected in 300 bootstrap runs of forward stagewise procedure in Table 8. Although the forward stagewise procedure tends to select single factors more frequently and the 2-way interaction terms less frequently compared to the previous scheme, the 2 methods share common factors/interactions of factors. For instance, menopause and MLRI2V were the 2 most frequent main effect terms, and
and
were among the most common interaction terms.
|
| 5.2 Bladder cancer data set |
|---|
We show a further comparison of different methods with another data set, which was used by Hung and others (2004)
We compared the prediction error rate of penalized LR with those of FlexTree and MDR through 5-fold cross-validation. As summarized in Table 9, penalized LR achieved higher sensitivity and specificity than FlexTree and better balanced class predictions than MDR.
|
As done in Section 5.1, we generated the ROC curve (Figure 5) for penalized LR by varying the classification threshold between 0 and
Both sensitivity and specificity of FlexTree are lower than those of penalized LR; therefore, penalized LR would achieve higher sensitivity (specificity) than FlexTree if its specificity (sensitivity) is adjusted to be the same as FlexTree. Although the sensitivity of MDR is higher than that of penalized LR, the square is still off the ROC curve, toward the lower left corner. In addition, sensitivity and specificity are more even for penalized LR.
|
When we fit a penalized LR model with forward stepwise selection using this bladder cancer data set, the 4 terms in Table 10 were selected. To validate their significance, we performed a similar bootstrap analysis as in Section 5.1. The second and the fourth columns of Table 10 record the number of bootstrap runs (out of
) in which the factors were chosen.
|
The factors/interactions of factors that were frequently selected through the bootstrap runs are listed in Table 11. The factors in Table 10 form the subset with the highest ranks among the ones listed in Table 11, providing evidence of reliability. The latter half of Table 11 shows that even the interaction terms with the largest counts were not as frequent as other common main effect terms. When we applied MDR, the second-order interaction term
was often selected; however, according to the bootstrap results, LR method can explain their effect in a simpler, additive model. In addition, MDR was not able to identify other potentially important features.
|
We also used the co-occurrence matrix of the factors in Table 11 as a dissimilarity measure and applied hierarchical clustering. One of the tightest clusters was the pair of GSTM1 and NAT2: they were in the model 133 and 88 times, respectively, but coincided only 33 times, implying that NAT2 was often used to replace GSTM1.
These results from the bootstrap analysis are consistent with the findings in Hung and others (2004)
in several ways. The factors with high frequencies (the first column of Table 11) are among the ones that were shown to be significantly increasing the risk of bladder cancer, through conventional analyses reported in Hung and others (2004)
. Hung and others also incorporated some known facts about the functional similarities of the genes and improved the estimates of the odds ratio. From this hierarchical modeling, MPO, GSTM1, and MnSOD achieved high odds ratios with improved accuracy. In addition, their analysis of gene–environment interaction showed that although smoking status itself was a significant factor, none of its interaction with other genes was strikingly strong. Similarly, as can be seen from Table 11, our bootstrap runs did not detect any critical interaction effect.
| 6. DISCUSSION |
|---|
We have proposed using LR with a penalization on the size of the
-norm of the coefficients. The penalty was imposed not only for the usual smoothing effect but also for convenient and sometimes necessary features that the quadratic penalization accompanied. In regular LR models, a small sample size prohibits high-order interaction terms, and variables with constant zero entries are often not allowed. Because these situations are common in modeling gene–gene interactions, LR is limited in its applications. However, the quadratic penalization scheme yields a stable fit, even with a large number of parameters, and automatically assigns zero to the coefficients of zero columns.
We modified the hierarchy rule of the forward stepwise procedure to allow the interaction terms to enter the model more easily. One strategy was to accept an interaction term as a candidate if either component was already in the model. If a strong interaction effect with negligible main effects is suspected, more flexible rules, such as accepting an interaction term even with no main effect terms, should be applied. However, the forward stepwise procedure selects variables in a greedy manner. A less greedy selection through
-regularization will allow the terms to enter the model more smoothly.
LR yields a reasonable prediction accuracy and identification of significant factors along with their interaction structures. We have shown that adding a quadratic penalization is a simple but powerful remedy that makes it possible to use LR in building gene–gene interaction models.
| APPENDIX A: DIMENSIONALITY OF MDR6 |
|---|
Here, we present the details of how the simulations were done to evaluate the effective degrees of freedom of the MDR fits, as briefly discussed in Section 2.1. First, we review a standard scenario for comparing nested LR models. Suppose we have n measurements on two 3-level factors
and
and a binary (case/control) response Y generated "completely at random"—that is, as a coin flip, totally independent of
or
. We then fit 2 models for the probabilities
of a "case" in cell i of
and cell j of
:- 1)
: a constant model
, which says the probability of a case is fixed and independent of the factors (the correct model).
- 2)
: a second-order interaction LR model, which allows for a separate probability
of a case in each cell of the
table formed by the factors.
- 2)
is the observed binary response for observation
and the model probability is
, then the "deviance" measures the discrepancy between the data and the model:
|
| (A.1) |
We now fit the 2 models separately, by minimizing the deviance above for each, yielding fitted models
and
. In this case, the "change in deviance"
|
| (A.2) |
measures the improvement in fit from using the richer model over the constant model. Since the smaller model is correct in this case, the bigger model is "fitting the noise." Likelihood theory tells us that as the sample size n gets large, the change in deviance has a
-distribution with degrees of freedom equal to
, the difference in the number of parameters in the 2 models. If we fit an additive LR model for
instead, the change in deviance would have an asymptotic
-distribution. Two important facts emerge from this preamble:
- The more parameters we fit, the larger the change in deviance from a null model and the more we overfit the data.
- The degrees-of-freedom measures the average amount of overfitting; indeed, the degrees of freedom d is the "mean" of the
-distribution.
In the scenario above, MDR would examine each of the 9 cells in the 2-way table and, based on the training data responses, create its "1-dimensional" binary factor
with levels high risk and low risk. With this factor in hand, we could go ahead and fit a 2-parameter model with probabilities of a case
and
in each of these groups. We could then fit this model to the data, yielding a fitted probability vector
, and compute the change in deviance
. Ordinarily for a single 2-level factor fit to a null model, we would expect a
-distribution. However, the 2-level factor was not predetermined but "fit to the data." Hence, we expect change of deviances bigger than predicted by a
. The idea of the simulation is to fit these models many times to null data and estimate the effective degrees of freedom as the average change in the deviance (Hastie and Tibshirani, 1990
).
We used this simulation model to examine 2 aspects of the effective dimension of MDR:
- For a fixed set of K factors, the effective degrees-of-freedom cost for creating the binary factor
.
- The additional cost for searching among all possible sets of size K from a pool of p available factors.
| APPENDIX B: TWO-LOCUS MODELING6 |
|---|
Many researchers have suggested methods to categorize 2-locus models for genetic diseases and to mathematically formulate the corresponding probabilities of influencing the disease status. The 2-locus models are often divided into 2 classes: "heterogeneity models" for which a certain genotype causes the disease independently of the genotype on the other locus and "epistatis models" for which the 2 genotypes are dependent. To represent the 2 distinct classes, the concept of "penetrance" is often used. Penetrance is a genetic term meaning the proportion of individuals with a disease-causing gene that actually shows the symptoms of the disease.
Let A and B denote potential disease-causing genotypes on 2 different loci, and use the following notation to formulate different genetic models:
|
|
That is,
and
denote the penetrance, or the conditional probability of resulting in the disease, for the individuals carrying the genotypes
and both A and
respectively.
Vieland and Huang (2003)
defined "heterogeneity" between 2 loci to be the relationship with the following "fundamental heterogeneity equation":
|
| (B.1) |
which is directly derived from the following more obvious representation of the independent effect of a pair of genotypes:
|
|
They referred to any other 2-locus relationship for which (B.1) does not hold as "epistatic." Neuman and Rice (1992)
distinguished the heterogeneity models likewise. Risch (1990)
also characterized "multiplicative" and "additive" 2-locus models; the disease penetrance for carriers of both genotypes A and B was multiplicative or additive in the penetrance scores for single genotypes A and
Risch considered the additive model to be a reasonable approximation of a heterogeneity model.
As we demonstrated in Sections 3.4 and 4, LR identifies the relationship among the active genes as either additive or having an interaction; however, this distinction is not equivalent to that of heterogeneity and the epistatic relationship described above. For example, "epistatic model III" in Section 4 has a probability distribution such that the conditional probabilities of disease are additive in log odds; we expect an additive model when applying LR. Although in genetics, heterogeneity models are often characterized by no interaction among loci in affecting the disease, the factors are not necessarily conceived to be additive in LR. However, in the example illustrated in Section 4, the interaction effect in the heterogeneity model was not as critical as in other epistatic models, and LR found an additive model in more than
of the repeats.
| ACKNOWLEDGMENTS |
|---|
The authors thank Richard Olshen for valuable comments and for a version of data from the SAPPHIRe project, John Witte for the bladder cancer dataset, and Amir Najmi and Balasubramanian Narasimhan for the codes for FlexTree. The authors are also grateful to Robert Tibshirani and other Hastie–Tibshirani Lab members for helpful discussions. Trevor Hastie was partially supported by grant DMS-0505676 from the National Science Foundation and grant 2R01 CA 72028-07 from the National Institutes of Health. Conflict of Interest: None declared.
| FOOTNOTES |
|---|
1 We thank one of the referees for alerting us to this phenomenon.
| REFERENCES |
|---|
-
Coffey C, Hebert P, Ritchie M, Krumholz H, Gaziano J, Ridker P, Brown N, Vaughan D, Moore J. An application of conditional logistic regression and multifactor dimensionality reduction for detecting gene-gene interactions on risk of myocardial infarction: the importance of model validation. BMC Bioinformatics (2004) 5:49.[CrossRef][Medline]
Efron B, Tibshirani R. An Introduction to the Bootstrap (1993) Boca Raton, FL: Chapman & Hall.
Friedman J. Multivariate adaptive regression splines. The Annals of Statistics (1991) 19:1–67.
Gray R. Flexible methods for analyzing survival data using splines, with applications to breast cancer prognosis. Journal of the American Statistical Association (1992) 87:942–951.[CrossRef][Web of Science]
Hahn L, Ritchie M, Moore J. Multifactor dimensionality reduction software for detecting gene-gene and gene-environment interaction. Bioinformatics (2003) 19:376–382.
Hastie T, Tibshirani R. Generalized Additive Models (1990) London: Chapman & Hall.
Hoerl AE, Kennard R. Ridge regression: biased estimation for nonorthogonal problems. Technometrics (1970) 12:55–67.[CrossRef][Web of Science]
Huang J, Lin A, Narasimhan B, Quertermous T, Hsiung C, Ho L, Grove J, Oliver M, Ranade K, Risch N, et al. Tree-structured supervised learning and the genetics of hypertension. Proceedings of the National Academy of Sciences of the United States of America (2004) 101:10529–10534.
Hung R, Brennan P, Malaveille C, Porru S, Donato F, Boffetta P, Witte J. Using hierarchical modeling in genetic association studies with multiple markers: application to a case-control study of bladder cancer. Cancer Epidemiology, Biomarkers & Prevention (2004) 13:1013–1021.
Le Cessie S, Van Houwelingen J. Ridge estimators in logistic regression. Applied Statistics (1992) 41:191–201.[CrossRef][Web of Science]
Lee A, Silvapulle M. Ridge estimation in logistic regression. Communications in Statistics, Simulation and Computation (1988) 17:1231–1257.[CrossRef][Web of Science]
McCullagh P, Nelder J. Generalized Linear Models (1989) Boca Raton, FL: Chapman & Hall.
Neuman R, Rice J. Two-locus models of disease. Genetic Epidemiology (1992) 9:347–365.[CrossRef][Web of Science][Medline]
Risch N. Linkage strategies for genetically complex traits. i. multilocus models. American Journal of Human Genetics (1990) 46:222–228.[Web of Science][Medline]
Ritchie M, Hahn L, Moore J. Power of multifactor dimensionality reduction for detecting gene-gene interactions in the presence of genotyping error, missing data, phenocopy, and genetic heterogeneity. Genetic Epidemiology (2003) 24:150–157.[CrossRef][Web of Science][Medline]
Ritchie M, Hahn L, Roodi N, Bailey L, Dupont W, Parl F, Moore J. Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. American Journal of Human Genetics (2001) 69:138–147.[CrossRef][Web of Science][Medline]
Vieland V, Huang J. Two-locus heterogeneity cannot be distinguished from two-locus epistasis on the basis of affected-sib-pair data. American Journal of Human Genetics (2003) 73:223–232.[CrossRef][Web of Science][Medline]
Zhu J, Hastie T. Classification of gene microarrays by penalized logistic regression. Biostatistics (2004) 46:505–510.
Received May 26, 2006; revised February 6, 2007; accepted for publication March 2, 2007.
![]()
CiteULike
Connotea
Del.icio.us What's this?
This article has been cited by other articles:
![]() |
T. T. Wu, Y. F. Chen, T. Hastie, E. Sobel, and K. Lange Genome-wide association analysis by lasso penalized logistic regression Bioinformatics, March 15, 2009; 25(6): 714 - 721. [Abstract] [Full Text] [PDF] |
||||
![]() |
C. Yang, Z. He, X. Wan, Q. Yang, H. Xue, and W. Yu SNPHarvester: a filtering-based approach for detecting epistatic interactions in genome-wide association studies Bioinformatics, February 15, 2009; 25(4): 504 - 511. [Abstract] [Full Text] [PDF] |
||||
![]() |
V. B. Guzman, A. Yambartsev, A. Goncalves-Primo, I. D.C.G. Silva, C. R.N. Carvalho, J. C.L. Ribalta, L. R. Goulart, N. Shulzhenko, M. Gerbase-DeLima, and A. Morgun New approach reveals CD28 and IFNG gene interaction in the susceptibility to cervical cancer Hum. Mol. Genet., June 15, 2008; 17(12): 1838 - 1844. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||








