Skip Navigation


Biostatistics Advance Access originally published online on April 11, 2007
Biostatistics 2008 9(1):30-50; doi:10.1093/biostatistics/kxm010
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow All Versions of this Article:
9/1/30    most recent
kxm010v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Right arrow Disclaimer
Google Scholar
Right arrow Articles by Park, M. Y.
Right arrow Articles by Hastie, T.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Park, M. Y.
Right arrow Articles by Hastie, T.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2007. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org.

Penalized logistic regression for detecting gene interactions

Mee Young Park*

Google Inc., 1600 Amphitheatre Parkway, Mountain View, CA 94043, USA meeyoung{at}google.com

Trevor Hastie

Department of Statistics and Department of Health Research & Policy, Stanford University, Stanford, CA 94305, USA

* To whom correspondence should be addressed.


    SUMMARY
 TOP
 SUMMARY
 1. INTRODUCTION
 2. RELATED WORK
 3. PENALIZED LR
 4. SIMULATION STUDY
 5. REAL DATA EXAMPLE
 5.2 Bladder cancer data...
 6. DISCUSSION
 APPENDIX A: DIMENSIONALITY OF...
 APPENDIX B: TWO-LOCUS MODELING6
 REFERENCES
 
We propose using a variant of logistic regression (LR) with Formula-regularization to fit gene–gene and gene–environment interaction models. Studies have shown that many common diseases are influenced by interaction of certain genes. LR models with quadratic penalization not only correctly characterizes the influential genes along with their interaction structures but also yields additional benefits in handling high-dimensional, discrete factors with a binary response. We illustrate the advantages of using an Formula-regularization scheme and compare its performance with that of "multifactor dimensionality reduction" and "FlexTree," 2 recent tools for identifying gene–gene interactions. Through simulated and real data sets, we demonstrate that our method outperforms other methods in the identification of the interaction structures as well as prediction accuracy. In addition, we validate the significance of the factors selected through bootstrap analyses.

Keywords: Discrete factors; Gene interactions; High dimensional; Logistic regression; L2-regularization


    1. INTRODUCTION
 TOP
 SUMMARY
 1. INTRODUCTION
 2. RELATED WORK
 3. PENALIZED LR
 4. SIMULATION STUDY
 5. REAL DATA EXAMPLE
 5.2 Bladder cancer data...
 6. DISCUSSION
 APPENDIX A: DIMENSIONALITY OF...
 APPENDIX B: TWO-LOCUS MODELING6
 REFERENCES
 
Because many common diseases are known to be affected by certain genotype combinations, there is a growing demand for methods to identify the influential genes along with their interaction structures. We propose a forward stepwise method based on penalized logistic regression (LR). Our method primarily targets data consisting of single-nucleotide polymorphisms (SNPs) measurements and a binary response variable separating the affected subjects from the unaffected ones.

LR is a standard tool for modeling effects and interactions with binary response data. However, for the SNP data here, LR models have significant drawbacks:

  • The 3-level genotype factors and their interactions can create many parameters, and with relatively small data sets, problems with overfitting arise.
  • With many candidate loci, factors can be correlated leading to further degradation of the model.
  • Often cells that define an interaction can be empty or nearly empty, which would require special parametrization.
  • These problems are exacerbated as the interaction order is increased.
For these and other reasons, researchers have looked for alternative methods for identifying interactions.

In this paper, we show that some simple modifications of standard LR overcome the problems. We modify the LR criterion by combining it with a penalization of the Formula-norm of the coefficients; this adjustment yields significant benefits. Because of the quadratic penalization, collinearity among the variables does not degrade fitting much, and the number of factors in the model is essentially not limited by sample size. When the levels of discrete factors are sparse or high-order interaction terms are considered, the contingency tables for the factors may easily include cells with zeros or near zeros. Again, with the help of quadratic penalization, these situations do not diminish the stability of the fits.

We compare our method with multifactor dimensionality reduction (MDR; Ritchie and others, 2001Go), a widely used tool for detecting gene interactions. The authors of MDR propose it as an alternative to LR, primarily for the reasons mentioned above. Their method screens pure interactions of various orders, using cross-validation to reduce the bias of overfitting. Once an interaction is found, the inventors propose using LR to tease it apart.

In the following sections, we describe and support our approach in more detail with examples and justifications. We review MDR and several other related methods in Section 2. We explore the use of penalized LR in Section 3. Our methods are illustrated with simulated and real data sets in Sections 4 and 5. We conclude with a summary and possible extensions of our studies in Section 6.


    2. RELATED WORK
 TOP
 SUMMARY
 1. INTRODUCTION
 2. RELATED WORK
 3. PENALIZED LR
 4. SIMULATION STUDY
 5. REAL DATA EXAMPLE
 5.2 Bladder cancer data...
 6. DISCUSSION
 APPENDIX A: DIMENSIONALITY OF...
 APPENDIX B: TWO-LOCUS MODELING6
 REFERENCES
 

2.1 Multifactor dimensionality reduction

MDR proposed by Ritchie and others (2001)Go, Ritchie and others (2003)Go, Hahn and others (2003)Go and Coffey and others (2004)Go is a popular technique for detecting and characterizing gene–gene/gene–environment interactions that affect complex but common genetic diseases.

The MDR algorithm.

MDR finds both the optimal interaction order K and the corresponding K factors that are significant in determining the disease status. The algorithm is as follows:
1) For each Formula run 10-fold cross-validation to find the optimal set of K factors (described below).
2) Compare the prediction errors (on the left out set) and the consistencies (how many times out of 10 folds the optimal set of factors was selected) for different K.
3) Select the K with the smallest estimate of prediction error and/or the largest consistency. This K is the final size of the model, and the optimal set for the chosen order forms the best multifactor model.

In step 1 above, MDR uses cross-validation to find the optimal set of factors for each K. The following steps are repeated for each cross-validation fold:

1) Construct a contingency table among every possible set of K factors.
2) Label the cells of the table "high risk" if the cases/control ratio is greater than 1 in the training part Formula and "low risk" otherwise.
3) Compute the training error for the Formula data, by classifying high risk as a case and low risk a control.
4) For the set of K factors that yields the lowest training error, compute the prediction error using the remaining Formula
The set of K factors that achieves the lowest training error most frequently is named the "optimal set of size K," and the largest frequency is referred to as the consistency for size Formula

A strong selling point of MDR is that it can simultaneously detect and characterize multiple genetic loci associated with diseases. It searches through any levels of interaction regardless of the significance of the main effects. It is therefore able to detect high-order interactions even when the underlying main effects are statistically insignificant. However, this "strength" is also its weakness; MDR can "only" identify interactions and hence will suffer severely from lack of power if the real effects are additive. For example, if there are 3 loci active, and their effect is additive, MDR can only see them all as a 3-factor interaction. Typically, the power for detecting interactions decreases with K since the number of parameters grows exponentially with K, so this is a poor approach if the real effects are additive and lower dimensional. Of course, one can post-process a 3-factor interaction term and find that it is additive, but the real art here is in discovering the relevant factors involved.

MDR suffers from several technical disadvantages. First, cells in high-dimensional tables will often be empty; these cells cannot be labeled based on the cases/control ratio. Second, the binary assignment (high risk/low risk) is highly unstable when the proportions of cases and controls are similar.

Dimensionality of MDR.

The authors of MDR claim (Ritchie and others, 2003Go) that MDR reduces a p-dimensional model to a 1-dimensional model, where p is the total number of available factors. This statement is apparently based on the binary partitioning of the samples into the high-risk and low-risk groups—a 1-dimensional description. This characterization is flawed at several levels because in order to produce this reduction, MDR searches in potentially very high-dimensional spaces:
1) MDR searches for the optimal interaction order K.
2) MDR searches for an optimal set of K factors, among Formula possibilities.
3) Given K factors, MDR "searches" for the optimal binary assignment of the cells of a table into high risk and low risk.

All these amount to an "effective dimension" or "degrees of freedom" that is typically much larger than 1. We demonstrate, through a simulation, a more realistic assessment of the dimensionality of MDR. Here, we present the results and refer the readers to Appendix 6 for the details of the simulation procedure.

We simulated 500 samples with 10 factors, each having 3 levels; the responses were randomly chosen to be Formula and thus, none of the factors was relevant to the response. Changing the order of interaction (Formula) and the total number of available factors (from K to 10), we computed the deviance changes for the fitted models. We estimated the degrees of freedom of the models by repeating the simulation 200 times and averaging the deviance measures.

Figure 1 captures the results. The horizontal lines mark the degrees of freedom from MDR (the lower dotted line) and LR (the upper dotted line) using a fixed set of factors. These are summarized as well in Table 1. For example, an MDR model with a third-order interaction of 3-level factors has an effective dimension of Formula—above half way between the claimed 1 and the 26 of LR.


Figure 1
View larger version (13K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. Plots of the average differences in deviance between the fitted and the null models: the solid curves in the 3 panels represent the MDR models with the interaction orders 1, 2, and 3, respectively. The vertical segments at the junction points are the standard error bars. As the interaction order increased, the effective degrees of freedom increased as well. In addition, each curve monotonically increased along with the number of available factors as the optimal set of factors was searched over a larger space. The horizontal lines mark the degrees of freedom from MDR (the lower dotted line) and LR (the upper dotted line) without searching (we used a fixed set of factors so that there was no effect due to searching for an optimal set of factors).

 

View this table:
[in this window]
[in a new window]

 
Table 1. The estimated degrees of freedom for MDR and LR, using K = 1, 2, and 3 factors (standard errors in parentheses). LR exact refers to the asymptotic exact degrees of freedom

 
2.2 Conditional LR

Conditional LR is an essential tool for the analysis of categorical factors with binary responses. Unlike MDR, LR is able to fit additive and other lower order effects as well as full-blown interactions. Therefore, LR can yield a more precise interpretation that distinguishes the presence of additive effects from the presence of interaction effects. In fact, the users of MDR fit LR models using the factors selected by MDR precisely for this reason; to simplify the high-order interactions into its component effects. LR is sometimes criticized due to the difficulties of estimating a large number of parameters with a relatively small number of samples (Ritchie and others, 2001Go); however, we provide a solution to overcome this drawback. Biologists (Coffey and others, 2004Go, e.g.) have shown that LR performs as well as other methods in cases where it is able to be fit.

2.3 FlexTree

Huang and others (2004)Go proposed a tree-structured learning method, FlexTree, to identify the genes related to the cause of complex diseases along with their interactions. It is a rather complex procedure that aims to build a tree with splits in the form of a linear combination of multiple factors. Beginning from the root node that contains all the observations, each node is recursively split into 2 daughter nodes through the following steps:

1) Use backward shaving to select the optimal set of predictors for splitting the specific node. For the backward shaving, form a decreasing series of candidate subsets based on the bootstrapped scores. Then, determine the best subset among the series that yields the largest cross-validated impurity measure.
2) Perform a permutation test to see if the linear relationship between the selected subset of predictors and the outcome is strong enough. If so, go to the next step. If not, stop splitting the node.
3) Use the selected subset of predictors to compute the regression coefficients Formula and the splitting threshold Formula such that a binary split is determined based on Formula The optimal scoring method is used for estimating Formula and C is chosen to maximize the resulting Gini index for the node.

Huang and others (2004)Go compared FlexTree to other methods such as classification and regression trees, quick, unbiased and efficient statistical tree, logic regression, bagging, multiple additive regression trees, and random forest; they showed that FlexTree performed better than or as well as these competing methods. Using a very similar data set, we compared the performance of our method with that of FlexTree (in Section 5).


    3. PENALIZED LR
 TOP
 SUMMARY
 1. INTRODUCTION
 2. RELATED WORK
 3. PENALIZED LR
 4. SIMULATION STUDY
 5. REAL DATA EXAMPLE
 5.2 Bladder cancer data...
 6. DISCUSSION
 APPENDIX A: DIMENSIONALITY OF...
 APPENDIX B: TWO-LOCUS MODELING6
 REFERENCES
 
The generic LR model has the form

Formula (3.1)

where X is a vector of predictors (typically dummy variables derived from factors in the present setting). LR coefficients are typically estimated by maximum likelihood (McCullagh and Nelder, 1989Go); in fact, the deviance (A.1) that we used in Section 2.1 is twice the negative log-likelihood. Here, we maximize the log-likelihood subject to a size constraint on Formula-norm of the coefficients (excluding the intercept) as proposed in Lee and Silvapulle (1988)Go and Le Cessie and Van Houwelingen (1992)Go. This amounts to minimizing the following equation:

Formula (3.2)

where l indicates the binomial log-likelihood and Formula is a positive constant. The coefficients are regularized in the same manner as in ridge regression (Hoerl and Kennard, 1970Go). The importance of the quadratic penalty, particularly in our application, will be elaborated in subsequent sections.

To fit penalized LR models, we repeat the Newton–Raphson steps, which result in the "iteratively reweighted ridge regressions" (IRRR) algorithm:

Formula (3.3)


Formula (3.4)


Formula (3.5)

Formula is the Formula matrix of the predictors (n and p are the numbers of the samples and the predictors, respectively), Formula is the vector of Formula responses, Formula is the vector of probability estimates that the responses are equal to 1, Formula is the diagonal matrix with the diagonal elements Formula for Formula, Formula is the diagonal matrix with the diagonal elements Formula and Formula is the current "working" response in the IRRR algorithm.

As a result of the quadratic penalization, the norm of the coefficient estimates is smaller than in the case of regular LR; however, none of the coefficients is zero. As in ridge regression, the amount of shrinkage that gets applied to each coefficient depends on the variance of the corresponding factor. This analogy to ridge regression is easily seen from (3.3)–(3.5).

Using the values from the final Newton–Raphson step of the IRRR algorithm, we estimate the effective degrees of freedom of the model (Hastie and Tibshirani, 1990Go) and the variance of the coefficient estimates (Gray, 1992Go). The effective degrees of freedom are approximated by

Formula (3.6)

where Formula is obtained from the final step of the algorithm. This representation is based on similar ideas to those described in Appendix 6. The variance of the coefficients is also estimated from the final iteration:

Formula (3.7)


Formula (3.8)


Formula (3.9)

where Formula denotes the information in Formula This is referred to as a "sandwich estimate" (Gray, 1992Go).

We now elaborate on the use of penalized LR specifically as it relates to our problem.

3.1 Advantages of quadratic penalization

Using quadratic regularization with LR has a number of attractive properties:

1) When we fit interactions between categorical factors, the number of parameters can grow large. The penalization nevertheless enables us to fit the coefficients in a stable fashion.
2) We can code factors in a symmetric fashion using dummy variables, without the usual concern for multicollinearity. (In Section 3.5, we introduce a missing value imputation method taking advantage of this coding scheme.)
3) Zero cells are common in multifactor contingency tables. These situations are handled gracefully.

Since quadratic regularization overcomes collinearity among the variables, a penalized LR model can be fit with a large number of factors or high-order interaction terms. The sample size does not limit the number of parameters. In Section 3.2, we illustrate our variable selection strategy; a growing number of variables in the model is not detrimental to the variable search.

The quadratic penalty makes it possible to code each level of a factor by a dummy variable, yielding coefficients with direct interpretations. Each coefficient reveals the significance of a particular level of a factor. This coding method cannot be applied to regular LR because the dummy variables representing a factor are perfectly collinear (they sum to 1). To overcome this, one of the levels is omitted or else the levels of the factors are represented as contrasts.

It turns out that the penalized criterion (3.2) creates the implicit constraint that the coefficients of the dummy variables representing any discrete factor/interaction of factors must sum to zero. Consider the model

Formula (3.10)

where D is a vector of dummy variables that represent the levels of a 3-level categorical factor. As can be seen from (3.10), adding a constant vector to Formula and subtracting the same constant from Formula would not change the probability estimate. However, because our criterion minimizes Formula the coefficients are identifiable in such a way that the elements of Formula sum to zero. Given a data set of n observations Formula, we differentiate the objective function (3.2) with respect to the coefficients and obtain

Formula (3.11)


Formula (3.12)

These equations, in turn, imply Formula Zhu and Hastie (2004)Go explored this property of the Formula-penalization in (multinomial) penalized LR using continuous factors. We can learn more from (3.12):

  • Higher order interactions have particular marginal constraints. For example, consider an interaction between factor 1 and 2:

    Formula (3.13)


    Formula (3.14)

    We can see this by summing over appropriate subsets of the equations as in (3.12).

  • Each of these imply that Formula.
  • A generalization is that for any penalized interaction term, if the coefficients of a marginal are "not" penalized, the sums as in (3.13) and (3.14) are zero.

When a column of Formula is so unbalanced that it contains no observations at a particular level (or combination of levels), the corresponding dummy variable is zero for all n observations. This phenomenon is common in SNP data because one allele of a locus can easily be prevalent over the other allele on the same locus. The lack of observations for certain levels of factors occurs even more frequently in high-order interaction models. We cannot fit a regular LR model with an input matrix that contains a column of zeros. However, when LR is accompanied by any small amount of quadratic penalization, the coefficient of the zero column will automatically be zero.

We demonstrate this for a simple 2-way interaction term in a model. As in (3.12),

Formula (3.15)

where Formula is the number of observations with Formula and Formula, Formula is the number among these with Formula, and Formula is the coefficient for the jth level of variable 1, etc. The equivalence implies that if Formula then Formula and, thus, Formula for any Formula An analogous equality holds at any interaction order.

3.2 Variable selection: forward stepwise procedure

Penalizing the norm of the coefficients results in a smoothing effect in most cases. However, with an Formula-penalization, none of the coefficients is set to zero unless the distribution of the factors is extremely sparse as illustrated in Section 3.1. For prediction accuracy and interpretability, we often prefer using only a subset of the features in the model, and thus we need to include a method for variable selection. For that purpose, we use a classic approach: forward selection, followed by backward deletion.

In each forward step, a factor/interaction of factors is added to the model. A preset total number of forward steps are repeated. In the subsequent backward steps, a factor/interaction of factors is deleted, beginning with the final (and largest) model from the forward steps. Backward deletion continues until only one factor remains in the active set (the set of variables in the model). The choice of factor to be added or deleted in each step is based on the cost-complexity statistic Formula where cp is "complexity parameter." Popular choices are Formula and Formula for Akaike information criterion (AIC) and Bayesian information cirterion (BIC), respectively.

When adding or deleting variables, we obey the hierarchy principle: when an interaction of multiple factors is in the model, the lower order factors comprising the interaction should also be present. This is symmetric in the 2 lower order factors; we in fact implement a less stringent asymmetric version. To allow interaction terms to enter the model more easily, we modify the convention, such that any factor/interaction of factors in the active set can form a new interaction with any other single factor, even when the single factor is not yet in the active set. This asymmetric hierarchy was proposed by Friedman (1991)Go in the context of Multivariate Adaptive Regression Splines models. To add more flexibility, an option we allow is to provide all possible second-order interactions as well as main effect terms as candidate factors at the beginning of the forward steps.

In the backward deletion process, we again respect the asymmetric hierarchy; not all components (of lower order) of an interaction term can be dropped before the interaction term. As the size of the active set is reduced monotonically, the backward deletion process produces a series of models, from the most complex model to the NULL model. Using the list of corresponding scores, we select the model size that generated the minimum score C.

3.3 Variable selection: forward stagewise procedure

Our quadratic penalization scheme has an unwelcome side effect, particularly when the regularization parameter Formula is large. Suppose factor Formula has a strong effect and factor Formula has no effect, either as a main effect or interaction with Formula. With only Formula in the model and a sizeable quadratic penalty parameter Formula, its coefficients Formula are not allowed to enter at full strength. If we include the interaction term Formula, its coefficients obey (3.13) Formula, and hence they can augment the reduced strength main effects. Since, for example, Formula, it is easy to see that quadratic penalization "favors" breaking up a large main effect coefficient into a sum of smaller interaction pieces. This suggests that interaction terms might be included in the model to compensate for the regularized main effects, rather than for the purpose intended1.

We often use a small value of Formula, largely as a device for controlling numerical stability; in these cases, this effect will be negligible. When larger values are used, we propose an alternative and more restrictive forward "stagewise" selection approach.

That is, when a variable is added to the model, the coefficients for all the previously included factors are fixed and only the coefficients for the new variable are penalized. This prevents the earlier coefficients from readjusting themselves to take advantage of the phenomenon described above.

We compare the forward stagewise and stepwise procedures in Section 5; the sets of factors selected by the 2 procedures are very similar and often identical because the factors that enter the model in the forward stagewise procedure are often in the same sequence as in the forward selection steps. Our software offers both choices, but based on our experience so far the original stepwise procedure is our preferred choice.

3.4 Choosing the regularization parameter {lambda}

Here, we explore the smoothing effect of Formula-penalization mentioned briefly in Section 3.2. When building factorial models with interactions, there is always a risk of overfitting the data even with a selected subset of the features. In addition to the advantages of using quadratic regularization emphasized in Section 3.1, it can also be used to smooth a model and thus control its effective size or degrees of freedom (3.6). As heavier regularization is imposed by increasing Formula the deviance of the fit increases (the fit degrades), but the variance (3.7)–(3.9) of the coefficients and the effective degrees of freedom (3.6) of the model decrease. As a result, when the model size is determined by AIC/BIC, a larger value of Formula tends to choose a model with more variables and allows complex interaction terms to join the model more easily.

We used simulation to illustrate the patterns of models selected by varying Formula and for guidance in choosing an appropriate value. We ran 3 sets of simulation analyses, for each one generating data with a different magnitude of interaction effect. For all 3 cases, we generated data sets consisting of a binary response and 6 categorical factors with 3 levels. Only the first 2 of the 6 predictors affected the response with the conditional probabilities of belonging to class 1 as in the tables below. (AA, Aa, aa) and (BB, Bb, bb) are the possible genotypes for the first 2 factors; Formula meaning that the proportions of the alleles A and B in the population are assumed to be the same as the alleles a and Formula respectively. Figure 2 displays the log odds for class Formula for all possible combinations of levels of the first 2 factors; the log odds are additive for the first model, while the next 2 show interaction effects.


Figure 2
View larger version (10K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2 The patterns of log odds for class 1, for different levels of the first 2 factors.

 
Formula

For each model, we generated 30 data sets of size Formula with balanced class labels. Then, we applied our procedure with Formula for each Formula selecting a model based on BIC. Table 2 summarizes how many times Formula (the additive model with the first 2 factors) and Formula (the interaction between the first 2 factors) were selected. The models that were not counted in the table include FormulaFormula or the ones with the terms other than Formula and Formula Given that Formula is the true model for the first set while Formula is appropriate for the second and the third, ideally we should be fitting with small values of Formula for the additive model but increasing Formula as stronger interaction effects are added.


View this table:
[in this window]
[in a new window]

 
Table 2. The number of times that the additive and the interaction models were selected. A + B is the true model for the first set, while A x B is the true model for the second and the third

 
We can cross-validate to choose the value of Formula for each fold, we obtain a series of optimal models (based on AIC/BIC) corresponding to the candidate values of Formula and compute the log-likelihoods using the omitted fold. Then, we choose the value of Formula that yields the largest average (cross-validated) log-likelihood. We demonstrate this selection strategy in Section 4.

3.5 Missing value imputation

The coding method we implemented suggests an easy, but reasonable, method of imputing any missing values. If there are any samples lacking an observation for a factor Formula we then compute the sample proportions of the levels of Formula among the remaining samples. These proportions, which are the expected values of the dummy variables, are assigned to the samples with missing cells.

In this scheme, the fact that the dummy variables representing any factor sum to 1 is retained. In addition, our approach offers a smoother imputation than does filling the missing observations with the level that occurred most frequently in the remaining data. Through simulations in Section 4, we show that this imputation method yields a reasonable result.


    4. SIMULATION STUDY
 TOP
 SUMMARY
 1. INTRODUCTION
 2. RELATED WORK
 3. PENALIZED LR
 4. SIMULATION STUDY
 5. REAL DATA EXAMPLE
 5.2 Bladder cancer data...
 6. DISCUSSION
 APPENDIX A: DIMENSIONALITY OF...
 APPENDIX B: TWO-LOCUS MODELING6
 REFERENCES
 
To compare the performance of penalized LR to that of MDR under various settings, we generated 3 epistatic models and a heterogeneity model, some of which are based on the suggestions in Neuman and Rice (1992)Go. Each training data set contained 400 samples (200 cases and 200 controls) and 10 factors, only 2 of which were significant. Three levels of the 2 significant factors were distributed so that the conditional probabilities of being diseased were as in the tables below; the levels of the remaining 8 insignificant factors were in Hardy–Weinberg equilibrium. For all 4 examples, the overall proportion of the diseased population was Formula

Formula

Figure 3 contains the plots of the log odds for all the conditional probabilities in the tables. (Zeros are replaced by Formula to compute the log odds.) As can be seen, we designed the third epistatic model so that the log odds are additive (the odds are multiplicative) in the first 2 factors; the interaction effect is more obvious in the first 2 epistatic models than in the heterogeneity model. Our distinction of the heterogeneity and epistatic models is based on Vieland and Huang (2003)Go and Neuman and Rice (1992)Go. We discuss how it is different from the additive/interaction scheme in LR in Appendix 6.


Figure 3
View larger version (16K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3 The patterns of the log odds for case, for different levels of the first 2 factors.

 
In addition to this initial simulation, we added noise to the data as described in Ritchie and others (2003)Go. The data were perturbed to create the following errors:
1) Missing cells (MS): For Formula of the samples, one of the significant factors is missing.
2) Genetic heterogeneity (GH): For Formula of the cases, the third and the fourth factors, instead of the first 2, are significant.
We used the initial data with no error and the perturbed data to compare the prediction accuracy and power in detecting the significant factors between our method and MDR.

Under each scenario, we simulated 30 sets of training and test data sets. For each training set, we selected the regularization parameter Formula through cross-validation and, using the chosen Formula built a model based on the BIC criterion. For each cross-validation, we provided candidate values of Formula in an adaptive way. We first applied a small value, Formula to the whole training data set and achieved models of different sizes from the backward deletion. Based on the series of models, we defined a set of reasonable values for the effective degrees of freedom. Then, we computed the values of Formula that would reduce the effective degrees of freedom of the largest model to the smaller values in the set.

We measured the prediction errors by averaging the 30 test errors. Table 3 summarizes the prediction accuracy comparison of the penalized LR and the MDR; the standard errors of the error estimates are parenthesized. The table shows that for both methods, the error rates increase when the data contain errors. The prediction accuracies are similar between the 2 methods although MDR yields slightly larger error rates in most situations.


View this table:
[in this window]
[in a new window]

 
Table 3. The prediction accuracy comparison of penalized LR and MDR (the standard errors are parenthesized): the prediction accuracies are similar between the 2 methods although MDR yields slightly larger error rates in most situations

 
Table 4 contains the numbers counting the cases (out of 30) for which the correct factors were identified. For PLR, the number of cases for which the interaction terms were also selected is parenthesized; the numbers vary reflecting the magnitude of interaction effect imposed in these 4 models as shown in Figure 3.


View this table:
[in this window]
[in a new window]

 
Table 4. The number of cases (out of 30) for which the correct factors were identified. For PLR, the number of cases that included the interaction terms is in the parentheses

 
For the heterogeneity model, main effects exist for both the 2 significant factors. In addition, as one is stronger than the other, MDR was not successful in identifying them simultaneously even for the data with no error, as shown in Table 4. In the case of the heterogeneity model or the second epistatic model, MDR suffered from a decrease in power, especially with GH perturbations. When GH perturbations were added to the second epistatic model, MDR correctly specified the 4 factors only 16 out of 30 times, while our method did so in all 30 simulations. These results show that the penalized LR method is more powerful than MDR, especially when multiple sets of significant factors exist; in these situations, MDR often identifies only a subset of the significant factors.


    5. REAL DATA EXAMPLE
 TOP
 SUMMARY
 1. INTRODUCTION
 2. RELATED WORK
 3. PENALIZED LR
 4. SIMULATION STUDY
 5. REAL DATA EXAMPLE
 5.2 Bladder cancer data...
 6. DISCUSSION
 APPENDIX A: DIMENSIONALITY OF...
 APPENDIX B: TWO-LOCUS MODELING6
 REFERENCES
 

5.1 Hypertension data set

We compared our method to FlexTree and MDR using the data from the SAPPHIRe (Stanford Asian Pacific Program for Hypertension and Insulin Resistance) project. The goal of the SAPPHIRe project was to detect the genes that predispose individuals to hypertension. A similar data set was used in Huang and others (2004)Go to show that the FlexTree method outperforms many competing methods. The data set contains the menopausal status and the genotypes on 21 distinct loci of 216 hypotensive and 364 hypertensive Chinese women. The subjects’ family information is also available; samples belonging to the same family are included in the same cross-validation fold for all the analyses.

Prediction performance.

We applied 5-fold cross-validation to estimate the misclassification rates using penalized LR with forward stepwise variable selection, FlexTree, and MDR. For penalized LR, a complexity parameter was chosen for each fold through an internal cross-validation. For MDR, we used internal cross-validations to select the most significant sets of features.

Huang and others (2004)Go initially used an unequal loss for the 2 classes: misclassifying a hypotension sample was twice as costly as misclassifying a hypertension sample. We fit penalized LR, FlexTree, and MDR with an equal as well as an unequal loss. For MDR with an unequal loss, we used "1/(1+2)" as threshold for the class proportion when labeling the cells of the tables, instead of "1/2" as in the usual cases of equal loss.

The results are compared in Table 5. Penalized LR achieved lower misclassification cost than other methods with either loss function. When an equal loss was used, FlexTree and MDR generated highly unbalanced predictions, assigning most samples to the larger class. Although penalized LR also achieved a low specificity, it was not so serious as in the other 2 methods.


View this table:
[in this window]
[in a new window]

 
Table 5. Comparison of prediction performance among different methods

 
Figure 4 shows the receiver operating characteristic (ROC) curves for penalized LR with an unequal (left panel) and an equal (right panel) loss function. For both plots, vertical and horizontal axes indicate the sensitivity and the specificity, respectively. Because penalized LR yields the predicted probabilities of a case, we could compute different sets of sensitivity and specificity by changing the classification threshold between 0 and Formula The solid circles on the curves represent the values we achieved with the usual threshold Formula The plus signs corresponding to FlexTree and the squares corresponding to MDR are all located toward the lower left corner, away from the ROC curves. In other words, penalized LR would achieve a higher sensitivity (specificity) than other methods if the specificity (sensitivity) were fixed the same as theirs.


Figure 4
View larger version (10K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4 ROC curves for penalized LR with an unequal (left panel) and an equal (right panel) loss function: penalized LR would achieve a higher sensitivity (specificity) than other methods if the specificity (sensitivity) were fixed the same as theirs.

 
Bootstrap analysis of the feature selection.

Applying our forward stepwise procedure to the whole data set yields a certain set of significant features as listed in Table 6. However, if the data were perturbed, a different set of features would be selected. Through a bootstrap analysis (Efron and Tibshirani, 1993Go), we provide a measure of how likely the features were to be selected and examine what other factors could have been preferred.


View this table:
[in this window]
[in a new window]

 
Table 6. Significant factors selected from the whole data set and their frequencies in 300 bootstrap runs

 
We illustrate the bootstrap analysis using a fixed value of Formula For each of Formula bootstrap data sets, we ran a forward stepwise procedure with Formula which is a value that was frequently selected in previous cross-validation. At the end of the B bootstrap runs, we counted the frequency for every factor/interaction of factors that has been included in the model at least once. The second and the fourth columns of Table 6 contain the counts for the corresponding features; some of them were rarely selected. Table 7 lists the factors/interactions of factors that were selected with relatively high frequencies.


View this table:
[in this window]
[in a new window]

 
Table 7. Factors/interactions of factors that were selected in 300 bootstrap runs of forward stepwise procedure with relatively high frequencies

 
Not all the commonly selected factors listed in Table 7 were included in the model when we used the whole data set. It is possible that some factors/interactions of factors were rarely selected simultaneously because of a strong correlation among them. To detect such instances, we propose using the co-occurrence matrix (after normalizing for the individual frequencies) among all the factors/interactions of factors listed in Table 7 as a dissimilarity matrix and applying hierarchical clustering. Then, any group of factors that tends not to appear simultaneously would form tight clusters.

Using the 11 selected features in Table 7, we first constructed the Formula co-occurrence matrix so that the Formulath element was the number of the bootstrap runs in which the ith and the jth features were selected simultaneously. Then, we normalized the matrix by dividing the Formulath entry by the number of bootstrap runs in which either the ith or the jth feature was selected. That is, denoting the Formulath entry as Formula we divided it by Formula for every i and Formula

As we performed hierarchical clustering with the normalized co-occurrence distance measure, PTPN1i4INV and MLRI2V were in a strong cluster: they were in the model simultaneously for only 2 bootstrap runs. Analogously, Formula and AGT2R1A1166C appeared 106 and 35 times, respectively, but only twice simultaneously. For both clusters, one of the elements was selected in our model (Table 6) while the other was not. Hence, the pairs were presumably used as alternatives in different models.

To compare the forward stagewise variable selection scheme to the forward stepwise method that we have used so far, we again summarize the factors/interactions of factors that were selected in 300 bootstrap runs of forward stagewise procedure in Table 8. Although the forward stagewise procedure tends to select single factors more frequently and the 2-way interaction terms less frequently compared to the previous scheme, the 2 methods share common factors/interactions of factors. For instance, menopause and MLRI2V were the 2 most frequent main effect terms, and Formula and Formula were among the most common interaction terms.


View this table:
[in this window]
[in a new window]

 
Table 8. Factors/interactions of factors that were selected in 300 bootstrap runs of forward stagewise procedure with relatively high frequencies

 

    5.2 Bladder cancer data set
 TOP
 SUMMARY
 1. INTRODUCTION
 2. RELATED WORK
 3. PENALIZED LR
 4. SIMULATION STUDY
 5. REAL DATA EXAMPLE
 5.2 Bladder cancer data...
 6. DISCUSSION
 APPENDIX A: DIMENSIONALITY OF...
 APPENDIX B: TWO-LOCUS MODELING6
 REFERENCES
 
We show a further comparison of different methods with another data set, which was used by Hung and others (2004)Go for a case–control study of bladder cancer. The data set consisted of genotypes on 14 loci and the smoke status of 201 bladder cancer patients and 214 controls.

Prediction performance.

We compared the prediction error rate of penalized LR with those of FlexTree and MDR through 5-fold cross-validation. As summarized in Table 9, penalized LR achieved higher sensitivity and specificity than FlexTree and better balanced class predictions than MDR.


View this table:
[in this window]
[in a new window]

 
Table 9. Comparison of prediction performance among different methods

 
As done in Section 5.1, we generated the ROC curve (Figure 5) for penalized LR by varying the classification threshold between 0 and Formula Both sensitivity and specificity of FlexTree are lower than those of penalized LR; therefore, penalized LR would achieve higher sensitivity (specificity) than FlexTree if its specificity (sensitivity) is adjusted to be the same as FlexTree. Although the sensitivity of MDR is higher than that of penalized LR, the square is still off the ROC curve, toward the lower left corner. In addition, sensitivity and specificity are more even for penalized LR.


Figure 5
View larger version (9K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 5 ROC curve for penalized LR. Penalized LR achieved higher sensitivity and specificity than FlexTree.

 
Bootstrap analysis of the feature selection.

When we fit a penalized LR model with forward stepwise selection using this bladder cancer data set, the 4 terms in Table 10 were selected. To validate their significance, we performed a similar bootstrap analysis as in Section 5.1. The second and the fourth columns of Table 10 record the number of bootstrap runs (out of Formula) in which the factors were chosen.


View this table:
[in this window]
[in a new window]

 
Table 10. Significant factors selected from the whole data set and their frequencies in 300 bootstrap runs

 
The factors/interactions of factors that were frequently selected through the bootstrap runs are listed in Table 11. The factors in Table 10 form the subset with the highest ranks among the ones listed in Table 11, providing evidence of reliability. The latter half of Table 11 shows that even the interaction terms with the largest counts were not as frequent as other common main effect terms. When we applied MDR, the second-order interaction term Formula was often selected; however, according to the bootstrap results, LR method can explain their effect in a simpler, additive model. In addition, MDR was not able to identify other potentially important features.


View this table:
[in this window]
[in a new window]

 
Table 11. Factors/interactions of factors that were selected in 300 bootstrap runs of forward stepwise procedure with relatively high frequencies

 
We also used the co-occurrence matrix of the factors in Table 11 as a dissimilarity measure and applied hierarchical clustering. One of the tightest clusters was the pair of GSTM1 and NAT2: they were in the model 133 and 88 times, respectively, but coincided only 33 times, implying that NAT2 was often used to replace GSTM1.

These results from the bootstrap analysis are consistent with the findings in Hung and others (2004)Go in several ways. The factors with high frequencies (the first column of Table 11) are among the ones that were shown to be significantly increasing the risk of bladder cancer, through conventional analyses reported in Hung and others (2004)Go. Hung and others also incorporated some known facts about the functional similarities of the genes and improved the estimates of the odds ratio. From this hierarchical modeling, MPO, GSTM1, and MnSOD achieved high odds ratios with improved accuracy. In addition, their analysis of gene–environment interaction showed that although smoking status itself was a significant factor, none of its interaction with other genes was strikingly strong. Similarly, as can be seen from Table 11, our bootstrap runs did not detect any critical interaction effect.


    6. DISCUSSION
 TOP
 SUMMARY
 1. INTRODUCTION
 2. RELATED WORK
 3. PENALIZED LR
 4. SIMULATION STUDY
 5. REAL DATA EXAMPLE
 5.2 Bladder cancer data...
 6. DISCUSSION
 APPENDIX A: DIMENSIONALITY OF...
 APPENDIX B: TWO-LOCUS MODELING6
 REFERENCES
 
We have proposed using LR with a penalization on the size of the Formula-norm of the coefficients. The penalty was imposed not only for the usual smoothing effect but also for convenient and sometimes necessary features that the quadratic penalization accompanied. In regular LR models, a small sample size prohibits high-order interaction terms, and variables with constant zero entries are often not allowed. Because these situations are common in modeling gene–gene interactions, LR is limited in its applications. However, the quadratic penalization scheme yields a stable fit, even with a large number of parameters, and automatically assigns zero to the coefficients of zero columns.

We modified the hierarchy rule of the forward stepwise procedure to allow the interaction terms to enter the model more easily. One strategy was to accept an interaction term as a candidate if either component was already in the model. If a strong interaction effect with negligible main effects is suspected, more flexible rules, such as accepting an interaction term even with no main effect terms, should be applied. However, the forward stepwise procedure selects variables in a greedy manner. A less greedy selection through Formula-regularization will allow the terms to enter the model more smoothly.

LR yields a reasonable prediction accuracy and identification of significant factors along with their interaction structures. We have shown that adding a quadratic penalization is a simple but powerful remedy that makes it possible to use LR in building gene–gene interaction models.


    APPENDIX A: DIMENSIONALITY OF MDR6
 TOP
 SUMMARY
 1. INTRODUCTION
 2. RELATED WORK
 3. PENALIZED LR
 4. SIMULATION STUDY
 5. REAL DATA EXAMPLE
 5.2 Bladder cancer data...
 6. DISCUSSION
 APPENDIX A: DIMENSIONALITY OF...
 APPENDIX B: TWO-LOCUS MODELING6
 REFERENCES
 
Here, we present the details of how the simulations were done to evaluate the effective degrees of freedom of the MDR fits, as briefly discussed in Section 2.1. First, we review a standard scenario for comparing nested LR models. Suppose we have n measurements on two 3-level factors Formula and Formula and a binary (case/control) response Y generated "completely at random"—that is, as a coin flip, totally independent of Formula or Formula. We then fit 2 models for the probabilities Formula of a "case" in cell i of Formula and cell j of Formula:

1) Formula: a constant model Formula, which says the probability of a case is fixed and independent of the factors (the correct model).
2) Formula: a second-order interaction LR model, which allows for a separate probability Formula of a case in each cell of the Formula table formed by the factors.
If Formula is the observed binary response for observation Formula and the model probability is Formula, then the "deviance" measures the discrepancy between the data and the model:

Formula (A.1)

We now fit the 2 models separately, by minimizing the deviance above for each, yielding fitted models Formula and Formula. In this case, the "change in deviance"

Formula (A.2)

measures the improvement in fit from using the richer model over the constant model. Since the smaller model is correct in this case, the bigger model is "fitting the noise." Likelihood theory tells us that as the sample size n gets large, the change in deviance has a Formula-distribution with degrees of freedom equal to Formula, the difference in the number of parameters in the 2 models. If we fit an additive LR model for Formula instead, the change in deviance would have an asymptotic Formula-distribution. Two important facts emerge from this preamble:

  • The more parameters we fit, the larger the change in deviance from a null model and the more we overfit the data.
  • The degrees-of-freedom measures the average amount of overfitting; indeed, the degrees of freedom d is the "mean" of the Formula-distribution.
This analysis works for models fit in linear subspaces of the parameter space. However, we can generalize it in a natural way to assess more complex models, such as MDR.

In the scenario above, MDR would examine each of the 9 cells in the 2-way table and, based on the training data responses, create its "1-dimensional" binary factor Formula with levels high risk and low risk. With this factor in hand, we could go ahead and fit a 2-parameter model with probabilities of a case Formula and Formula in each of these groups. We could then fit this model to the data, yielding a fitted probability vector Formula, and compute the change in deviance Formula. Ordinarily for a single 2-level factor fit to a null model, we would expect a Formula-distribution. However, the 2-level factor was not predetermined but "fit to the data." Hence, we expect change of deviances bigger than predicted by a Formula. The idea of the simulation is to fit these models many times to null data and estimate the effective degrees of freedom as the average change in the deviance (Hastie and Tibshirani, 1990Go).

We used this simulation model to examine 2 aspects of the effective dimension of MDR:

  • For a fixed set of K factors, the effective degrees-of-freedom cost for creating the binary factor Formula.
  • The additional cost for searching among all possible sets of size K from a pool of p available factors.
In our experiments, we varied both K and p. The results are summarized in Section 2.1 along with Figure 1 and Table 1.


    APPENDIX B: TWO-LOCUS MODELING6
 TOP
 SUMMARY
 1. INTRODUCTION
 2. RELATED WORK
 3. PENALIZED LR
 4. SIMULATION STUDY
 5. REAL DATA EXAMPLE
 5.2 Bladder cancer data...
 6. DISCUSSION
 APPENDIX A: DIMENSIONALITY OF...
 APPENDIX B: TWO-LOCUS MODELING6
 REFERENCES
 
Many researchers have suggested methods to categorize 2-locus models for genetic diseases and to mathematically formulate the corresponding probabilities of influencing the disease status. The 2-locus models are often divided into 2 classes: "heterogeneity models" for which a certain genotype causes the disease independently of the genotype on the other locus and "epistatis models" for which the 2 genotypes are dependent. To represent the 2 distinct classes, the concept of "penetrance" is often used. Penetrance is a genetic term meaning the proportion of individuals with a disease-causing gene that actually shows the symptoms of the disease.

Let A and B denote potential disease-causing genotypes on 2 different loci, and use the following notation to formulate different genetic models:

Formula

That is, Formula and Formula denote the penetrance, or the conditional probability of resulting in the disease, for the individuals carrying the genotypes Formula and both A and Formula respectively.

Vieland and Huang (2003)Go defined "heterogeneity" between 2 loci to be the relationship with the following "fundamental heterogeneity equation":

Formula (B.1)

which is directly derived from the following more obvious representation of the independent effect of a pair of genotypes:

Formula

They referred to any other 2-locus relationship for which (B.1) does not hold as "epistatic." Neuman and Rice (1992)Go distinguished the heterogeneity models likewise. Risch (1990)Go also characterized "multiplicative" and "additive" 2-locus models; the disease penetrance for carriers of both genotypes A and B was multiplicative or additive in the penetrance scores for single genotypes A and Formula Risch considered the additive model to be a reasonable approximation of a heterogeneity model.

As we demonstrated in Sections 3.4 and 4, LR identifies the relationship among the active genes as either additive or having an interaction; however, this distinction is not equivalent to that of heterogeneity and the epistatic relationship described above. For example, "epistatic model III" in Section 4 has a probability distribution such that the conditional probabilities of disease are additive in log odds; we expect an additive model when applying LR. Although in genetics, heterogeneity models are often characterized by no interaction among loci in affecting the disease, the factors are not necessarily conceived to be additive in LR. However, in the example illustrated in Section 4, the interaction effect in the heterogeneity model was not as critical as in other epistatic models, and LR found an additive model in more than Formula of the repeats.


    ACKNOWLEDGMENTS
 
The authors thank Richard Olshen for valuable comments and for a version of data from the SAPPHIRe project, John Witte for the bladder cancer dataset, and Amir Najmi and Balasubramanian Narasimhan for the codes for FlexTree. The authors are also grateful to Robert Tibshirani and other Hastie–Tibshirani Lab members for helpful discussions. Trevor Hastie was partially supported by grant DMS-0505676 from the National Science Foundation and grant 2R01 CA 72028-07 from the National Institutes of Health. Conflict of Interest: None declared.


    FOOTNOTES
 
1 We thank one of the referees for alerting us to this phenomenon. Back


    REFERENCES
 TOP
 SUMMARY
 1. INTRODUCTION
 2. RELATED WORK
 3. PENALIZED LR
 4. SIMULATION STUDY
 5. REAL DATA EXAMPLE
 5.2 Bladder cancer data...
 6. DISCUSSION
 APPENDIX A: DIMENSIONALITY OF...
 APPENDIX B: TWO-LOCUS MODELING6
 REFERENCES
 

    Coffey C, Hebert P, Ritchie M, Krumholz H, Gaziano J, Ridker P, Brown N, Vaughan D, Moore J. An application of conditional logistic regression and multifactor dimensionality reduction for detecting gene-gene interactions on risk of myocardial infarction: the importance of model validation. BMC Bioinformatics (2004) 5:49.[CrossRef][Medline]

    Efron B, Tibshirani R. An Introduction to the Bootstrap (1993) Boca Raton, FL: Chapman & Hall.

    Friedman J. Multivariate adaptive regression splines. The Annals of Statistics (1991) 19:1–67.

    Gray R. Flexible methods for analyzing survival data using splines, with applications to breast cancer prognosis. Journal of the American Statistical Association (1992) 87:942–951.[CrossRef][Web of Science]

    Hahn L, Ritchie M, Moore J. Multifactor dimensionality reduction software for detecting gene-gene and gene-environment interaction. Bioinformatics (2003) 19:376–382.[Abstract/Free Full Text]

    Hastie T, Tibshirani R. Generalized Additive Models (1990) London: Chapman & Hall.

    Hoerl AE, Kennard R. Ridge regression: biased estimation for nonorthogonal problems. Technometrics (1970) 12:55–67.[CrossRef][Web of Science]

    Huang J, Lin A, Narasimhan B, Quertermous T, Hsiung C, Ho L, Grove J, Oliver M, Ranade K, Risch N, et al. Tree-structured supervised learning and the genetics of hypertension. Proceedings of the National Academy of Sciences of the United States of America (2004) 101:10529–10534.[Abstract/Free Full Text]

    Hung R, Brennan P, Malaveille C, Porru S, Donato F, Boffetta P, Witte J. Using hierarchical modeling in genetic association studies with multiple markers: application to a case-control study of bladder cancer. Cancer Epidemiology, Biomarkers & Prevention (2004) 13:1013–1021.[Abstract/Free Full Text]

    Le Cessie S, Van Houwelingen J. Ridge estimators in logistic regression. Applied Statistics (1992) 41:191–201.[CrossRef][Web of Science]

    Lee A, Silvapulle M. Ridge estimation in logistic regression. Communications in Statistics, Simulation and Computation (1988) 17:1231–1257.[CrossRef][Web of Science]

    McCullagh P, Nelder J. Generalized Linear Models (1989) Boca Raton, FL: Chapman & Hall.

    Neuman R, Rice J. Two-locus models of disease. Genetic Epidemiology (1992) 9:347–365.[CrossRef][Web of Science][Medline]

    Risch N. Linkage strategies for genetically complex traits. i. multilocus models. American Journal of Human Genetics (1990) 46:222–228.[Web of Science][Medline]

    Ritchie M, Hahn L, Moore J. Power of multifactor dimensionality reduction for detecting gene-gene interactions in the presence of genotyping error, missing data, phenocopy, and genetic heterogeneity. Genetic Epidemiology (2003) 24:150–157.[CrossRef][Web of Science][Medline]

    Ritchie M, Hahn L, Roodi N, Bailey L, Dupont W, Parl F, Moore J. Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. American Journal of Human Genetics (2001) 69:138–147.[CrossRef][Web of Science][Medline]

    Vieland V, Huang J. Two-locus heterogeneity cannot be distinguished from two-locus epistasis on the basis of affected-sib-pair data. American Journal of Human Genetics (2003) 73:223–232.[CrossRef][Web of Science][Medline]

    Zhu J, Hastie T. Classification of gene microarrays by penalized logistic regression. Biostatistics (2004) 46:505–510.

    Received May 26, 2006; revised February 6, 2007; accepted for publication March 2, 2007.


    Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


    This article has been cited by other articles:


    Home page
    BioinformaticsHome page
    T. T. Wu, Y. F. Chen, T. Hastie, E. Sobel, and K. Lange
    Genome-wide association analysis by lasso penalized logistic regression
    Bioinformatics, March 15, 2009; 25(6): 714 - 721.
    [Abstract] [Full Text] [PDF]


    Home page
    BioinformaticsHome page
    C. Yang, Z. He, X. Wan, Q. Yang, H. Xue, and W. Yu
    SNPHarvester: a filtering-based approach for detecting epistatic interactions in genome-wide association studies
    Bioinformatics, February 15, 2009; 25(4): 504 - 511.
    [Abstract] [Full Text] [PDF]


    Home page
    Hum Mol GenetHome page
    V. B. Guzman, A. Yambartsev, A. Goncalves-Primo, I. D.C.G. Silva, C. R.N. Carvalho, J. C.L. Ribalta, L. R. Goulart, N. Shulzhenko, M. Gerbase-DeLima, and A. Morgun
    New approach reveals CD28 and IFNG gene interaction in the susceptibility to cervical cancer
    Hum. Mol. Genet., June 15, 2008; 17(12): 1838 - 1844.
    [Abstract] [Full Text] [PDF]


    This Article
    Right arrow Abstract Freely available
    Right arrow FREE Full Text (PDF) Freely available
    Right arrow All Versions of this Article:
    9/1/30    most recent
    kxm010v1
    Right arrow Alert me when this article is cited
    Right arrow Alert me if a correction is posted
    Services
    Right arrow Email this article to a friend
    Right arrow Similar articles in this journal
    Right arrow Similar articles in PubMed
    Right arrow Alert me to new issues of the journal
    Right arrow Add to My Personal Archive
    Right arrow Download to citation manager
    Right arrowRequest Permissions
    Right arrow Disclaimer
    Google Scholar
    Right arrow Articles by Park, M. Y.
    Right arrow Articles by Hastie, T.
    Right arrow Search for Related Content
    PubMed
    Right arrow PubMed Citation
    Right arrow Articles by Park, M. Y.
    Right arrow Articles by Hastie, T.
    Social Bookmarking
     Add to CiteULike   Add to Connotea   Add to Del.icio.us  
    What's this?