Biostatistics Advance Access published online on June 19, 2007
Biostatistics, doi:10.1093/biostatistics/kxm024
Identification of SNP interactions using logic regression
Collaborative Research Center 475, Department of Statistics, University of Dortmund, 44221 Dortmund, Germany holger.schwender{at}udo.edu
* To whom correspondence should be addressed.
| SUMMARY |
|---|
Interactions of single nucleotide polymorphisms (SNPs) are assumed to be responsible for complex diseases such as sporadic breast cancer. Important goals of studies concerned with such genetic data are thus to identify combinations of SNPs that lead to a higher risk of developing a disease and to measure the importance of these interactions. There are many approaches based on classification methods such as CART and random forests that allow measuring the importance of single variables, but none of these methods allows the importance of combinations of variables to be quantified directly. In this paper, we show how logic regression can be employed to identify SNP interactions explanatory for the disease status in a case-control study and propose 2 measures for quantifying the importance of these interactions for classification. These approaches are then applied both to simulated data sets and to the SNP data of the GENICA study, a study dedicated to the identification of genetic and gene-environment interactions associated with sporadic breast cancer.
Keywords: Feature selection; GENICA; Single nucleotide polymorphism; Variable importance measure
| 1. INTRODUCTION |
|---|
Although all humans share far more than 99% of their DNA, there are still millions of differences between the DNA of 2 individuals. The most common, and so far the best investigated, genetic variations are single nucleotide polymorphisms (SNPs). A SNP occurs when a single nucleotide is altered, that is, when (usually 2) different sequence alternatives exist at a single base-pair position. To distinguish a SNP from a mutation, the less frequent variant has to occur in at least 1% of the population. Since the human genome is diploid, that is, consists of pairs of chromosomes, each SNP is explained by 2 bases. Therefore, each SNP can take one of the following 3 forms:
- "Homozygous reference genotype": both bases explaining the SNP are the more frequent variant.
- "Heterozygous variant genotype": one of the bases is the more frequent and the other the less frequent variant.
- "Homozygous variant genotype": both bases are the less frequent variant.
SNPs are assumed to alter the risk for developing a particular disease. It is, however, very unlikely that individual SNPs play an important role in the development of complex diseases such as sporadic breast cancer. Instead, high-order interactions of SNPs are supposed to explain the differences between low- and high-risk groups (Garte, 2001).
In an association study concerned with SNP data, it is thus of interest to construct classification rules of the following type:
- If SNP A is of the heterozygous variant genotype AND SNP B is of the homozygous variant genotype OR both SNP C AND D are NOT of the homozygous reference genotype, then a person has a higher risk for the disease of interest.
A procedure developed for solving exactly this type of problem is logic regression (Ruczinski and others, 2003), which attempts to identify Boolean combinations of binary variables for the prediction of, for example, the case-control status of an observation.
Other classification methods such as Classification and Regression Trees (CART) (Breiman and others, 1984), bagging (Breiman, 1996), random forests (Breiman, 2001), and support vector machines (Vapnik, 1995) can also be applied to SNP data (Schwender and others, 2004). But in comparisons with CART, random forests (Ruczinski and others, 2004), and other regression procedures (Kooperberg and others, 2001; Witte and Fijal, 2001), logic regression has shown a good performance when applied to SNP data.
Another goal in the analysis of SNP data is the quantification of the importance of the identified SNPs and SNP interactions for classification. Many discrimination methods provide approaches to measure the importance of a single variable. Examples are the variable importance measures of random forests and CART and the squared weights used for recursive feature elimination with support vector machines (Guyon and others, 2002). These methods, however, do not quantify the importance of interactions of variables directly unless these interactions are included as variables into the procedure. This, however, is impractical since analyzing only 50 variables would lead to more than 250 000 input variables if we were interested in up to 4-way interactions.
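The figure of more than 250 000 input variables can be checked with a short calculation. The sketch below assumes that an interaction is counted simply as an unordered set of 1 to 4 of the 50 variables, ignoring negations; the function name `n_inputs` is introduced here for illustration only.

```python
from math import comb

def n_inputs(n_vars, max_order):
    """Count all subsets of 1..max_order variables out of n_vars."""
    return sum(comb(n_vars, k) for k in range(1, max_order + 1))

# 50 + 1225 + 19600 + 230300 single variables and 2-, 3-, 4-way combinations
print(n_inputs(50, 4))  # 251175
```

Allowing each variable to enter either negated or not would increase this count further.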
Thus, methods are needed that just use the variables themselves as inputs into the model but enable us to identify combinations of variables and quantify the importance of these interactions. In this paper, we propose approaches based on logic regression that exactly fulfill these needs.
The paper is organized as follows: In Section 2, a brief introduction to Boolean algebra and logic regression is given. We then describe a method based on logic regression and bootstrapping for identifying potentially interesting interactions in Section 3 and propose 2 measures for quantifying the importance of such interactions in Section 4. In Section 5, these approaches are applied to simulated data sets and to the SNP data of the GENICA study, a study dedicated to the identification of genetic and gene-environment interactions associated with sporadic breast cancer. In Section 6, they are compared with another variable selection procedure based on logic regression.
| 2. LOGIC REGRESSION |
|---|
Logic regression is an adaptive regression methodology for predicting the outcome in classification and regression problems based on Boolean combinations of logic variables such as
- S1: "SNP S is not of the homozygous reference genotype"
- S2: "SNP S is of the homozygous variant genotype."
These variables can be negated by the complement operator $^C$ (e.g. $S_2^C$ means SNP S is NOT of the homozygous variant genotype) and combined to a logic expression by the operators $\wedge$ (AND) and $\vee$ (OR). In logic regression, these logic expressions are represented by logic trees (e.g. see Figure 1). Logic trees, however, can be employed not only as graphical representations of logic expressions but also to generate new logic trees in the search for the best model. Permissible moves in this tree-growing process are alternating an operator or a variable, respectively, pruning or growing a branch, and adding or removing variables (for details, see Ruczinski and others, 2003).
For example, in a case-control study, logic regression searches for the logic expression $L$ that best explains the cases. If $L$ is true for a new observation, this observation will be classified as a case. Besides this single-tree approach, logic regression also provides a multiple-tree method in which several logic expressions $L_i$, $i = 1,\dots,p$, are adaptively constructed and combined by a generalized linear model

$$g(E[Y]) = \beta_0 + \sum_{i=1}^{p} \beta_i L_i$$

with response $Y$, parameters $\beta_i$, $i = 0,\dots,p$, and link function $g$. Since our interest centers on case-control studies, we assume $g$ to be the logit function.
Although the logic expression displayed in Figure 1 is relatively easy to interpret, such expressions become harder to interpret the more variables they contain. Therefore, we propose to convert each logic expression into a disjunctive normal form (DNF), that is, an OR-combination of AND-combinations. The DNF of the logic tree $L = (A \wedge B^C) \vee ((C \vee D) \wedge E^C)$ displayed in Figure 1 is, for example, given by

$$L = (A \wedge B^C) \vee (C \wedge E^C) \vee (D \wedge E^C).$$

The advantage of the DNF is that interactions are directly identifiable since they are given by the AND-combinations. For example, the above logic expression consists of the 3 interactions $A \wedge B^C$, $C \wedge E^C$, and $D \wedge E^C$ and is true if at least one of these conjunctions is true.
To avoid redundancy, the DNF should consist only of prime implicants, that is, minimal AND-combinations. If, for example, $A \wedge B \wedge C$ and $A \wedge B \wedge C^C$ are part of the DNF, then $C$ will be redundant and only the prime implicant $A \wedge B$ is needed.
Our goal is to identify all interactions that might have an influence on the risk of developing a disease. Therefore, we are not interested in obtaining a minimal DNF, that is a DNF consisting of a minimum number of prime implicants, but a DNF containing all prime implicants. In Schwender (2007), an algorithm based on matrix algebra for generating such a DNF of a logic expression is presented.
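For small expressions, the set of all prime implicants can be found by brute force, which may help to make the definition concrete. The following sketch is not the matrix-algebra algorithm of Schwender (2007); it simply enumerates every conjunction of literals, keeps those that imply the expression, and discards the non-minimal ones. The example expression is read here as (A AND NOT B) OR ((C OR D) AND NOT E), and all function names are illustrative.

```python
from itertools import product

def implies(term, f, variables):
    """True if f holds for every assignment that agrees with the partial assignment `term`."""
    free = [x for x in variables if x not in term]
    for values in product([False, True], repeat=len(free)):
        v = dict(term)
        v.update(zip(free, values))
        if not f(v):
            return False
    return True

def prime_implicants(f, variables):
    """All minimal conjunctions of literals that imply f (exponential cost, small inputs only)."""
    implicants = []
    # Each variable is absent (None), required true, or required false (negated).
    for choice in product([None, True, False], repeat=len(variables)):
        term = {x: c for x, c in zip(variables, choice) if c is not None}
        if term and implies(term, f, variables):
            implicants.append(term)
    # A term is prime if no other implicant uses a proper subset of its literals.
    return [t for t in implicants
            if not any(set(s.items()) < set(t.items()) for s in implicants)]

def show(term):
    return " & ".join(x if val else "!" + x for x, val in sorted(term.items()))

def example(v):
    return (v["A"] and not v["B"]) or ((v["C"] or v["D"]) and not v["E"])

def redundant(v):
    return (v["A"] and v["B"] and v["C"]) or (v["A"] and v["B"] and not v["C"])

primes = sorted(show(t) for t in prime_implicants(example, ["A", "B", "C", "D", "E"]))
print(primes)  # ['A & !B', 'C & !E', 'D & !E']

reduced = sorted(show(t) for t in prime_implicants(redundant, ["A", "B", "C"]))
print(reduced)  # ['A & B']
```

The second call reproduces the redundancy example: the two conjunctions that differ only in C collapse to the single prime implicant A AND B.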
| 3. IDENTIFICATION OF INTERESTING INTERACTIONS |
|---|
One of the search algorithms used in logic regression is based on the Markov chain Monte Carlo (MCMC) approach. Kooperberg and Ruczinski (2005) run this algorithm, called MC logic regression, on the whole data set not to find a single best logic regression model but to obtain a large collection of models that fit almost as well as the best one. This set is then used to identify combinations of variables occurring frequently in these models.
In contrast to Kooperberg and Ruczinski (2005), we propose a subset-based approach in which the default search algorithm of logic regression, that is, simulated annealing, is applied to different subsets of the data (for a comparison of both methods, see Section 6). More precisely, we suggest the following procedure, called logicFS, for the identification of variables and interactions that might be explanatory for the case-control status of an observation.
ALGORITHM 1 (logicFS: identification of interesting interactions)
- 1) Draw a bootstrap sample of size n from the n observations of the data set of interest.
- 2) Construct a logic regression model based on the bootstrap sample.
- 3) Convert each of the logic expressions into a DNF consisting of prime implicants.
- 4) Repeat steps 1-3 B times.
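The loop of Algorithm 1 can be sketched as follows. This is only an outline of the resampling logic: the callable `fit_prime_implicants` is a hypothetical stand-in for steps 2 and 3 (fitting a logic regression model and converting it into prime implicants), and the names `logic_fs` and `proportions` are introduced here for illustration. The indices not drawn in an iteration are stored as well, since they form the out-of-bag observations of that iteration.

```python
import random
from collections import Counter

def logic_fs(n, B, fit_prime_implicants, seed=0):
    """Run B bootstrap iterations; fit_prime_implicants(in_bag) is a user-supplied stand-in."""
    rng = random.Random(seed)
    runs = []
    for _ in range(B):                                   # step 4: repeat B times
        in_bag = [rng.randrange(n) for _ in range(n)]    # step 1: bootstrap sample of size n
        oob = sorted(set(range(n)) - set(in_bag))        # observations not drawn
        primes = frozenset(fit_prime_implicants(in_bag)) # steps 2-3 (hypothetical fitter)
        runs.append({"in_bag": in_bag, "oob": oob, "primes": primes})
    return runs

def proportions(runs):
    """Fraction of iterations in which each prime implicant was found."""
    counts = Counter(p for r in runs for p in r["primes"])
    return {p: c / len(runs) for p, c in counts.items()}

# Toy usage with a dummy fitter that always returns the same interaction.
runs = logic_fs(n=20, B=5, fit_prime_implicants=lambda in_bag: {"A & !B"})
print(proportions(runs))  # {'A & !B': 1.0}
```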
Some of the interactions identified by logicFS are very important for the prediction. Others are not important at all or might actually be obstructive for a good classification. It is therefore necessary to quantify the importance of each of these potentially interesting interactions.
| 4. MEASURING THE IMPORTANCE OF IDENTIFIED INTERACTIONS |
|---|
For a first impression on which variables or interactions might be important or not, the proportion of models generated by logicFS that contain a specific interaction can be computed for each identified interaction. This is similar to the measure used in MC logic regression to quantify the importance of the variables and combinations of variables. In this approach, the models visited after the burn-in are employed to compute for each variable, each pair, and each triplet of variables, the proportion of models in which the respective variables appear jointly in the same logic tree. The combinations of variables occurring most frequently are then assumed to be the most important interactions.
However, some SNP interactions are explanatory for only a small subset of patients. Such interactions will hardly be found, and it is likely that they appear only in very few of the models. They would thus be called unimportant by the above measure although they are actually very important for the correct prediction of some of the patients. Moreover, a suitable measure should quantify how much a particular interaction improves the classification. This improvement should not be computed on the same data set on which the classification rule has been trained but on an independent data set containing new observations.
Since in logicFS a logic regression model is constructed based on a subset of the data, the out-of-bag (oob) observations, that is, the observations not contained in the bootstrap sample, can be employed to estimate the importance of the interactions.
As mentioned in Section 2, there exist both single- and multiple-tree approaches of logic regression. While logicFS can handle either of these methods, different importance measures are employed for the 2 approaches.
In the single-tree case, the importance of a prime implicant $P$, that is, a variable or an interaction, for classification is computed by

$$\mathrm{VIM}_{\mathrm{Single}}(P) = \frac{1}{B}\left[\sum_{b:\,P \in L_b} \left(N_b - N_b^{-}\right) + \sum_{b:\,P \notin L_b} \left(N_b^{+} - N_b\right)\right] \qquad (4.1)$$

where $L_b$ is the set of prime implicants identified in the $b$th iteration of logicFS, $b = 1,\dots,B$; $N_b$ is the number of oob observations in the $b$th iteration that are correctly classified by the logic regression model constructed in the $b$th iteration; and $N_b^{-}$/$N_b^{+}$ is the number of oob observations correctly classified by the $b$th model after $P$ has been removed from/added to the model. In other words, to get a measurement of the influence of $P$ on the correct classification, we compare how well the logic regression models perform when $P$ is part of the logic expressions or not.
In the multiple-tree case, it is not possible to add an interaction unambiguously to one of the logic trees since it is not clear to which of the logic expressions it should be appended. The prime implicant $P$ is, therefore, only removed from (and not added to) the models, and the multiple-tree measure is determined by: (a) calculating the number $N_b$ of correctly classified oob observations for each of the $B$ iterations, (b) removing $P$ from all models, (c) recalculating the number of correctly classified oob observations, now denoted by $N_b^{-}$, for each of the $B$ iterations, (d) computing

$$\mathrm{VIM}_{\mathrm{Multiple}}(P) = \frac{1}{B}\sum_{b=1}^{B}\left(N_b - N_b^{-}\right). \qquad (4.2)$$

The multiple-tree measure is similar to the variable importance measure of random forests. The only difference is that Breiman (2001) permutes the outcome of the variable once and computes $N_b^{-}$ based on the permuted outcomes. By contrast, we remove the variable/interaction and calculate $N_b^{-}$ based on the model without this variable/interaction since a prime implicant $P$ can be removed from a logic tree in DNF without destroying the structure of the remaining tree.
For a particular interaction, a large value of both (4.1) and (4.2) corresponds to a high importance of this interaction, whereas a value of about zero leads to the assumption that the interaction has no importance for classification. A prime implicant showing a negative importance is obstructive for a good classification since the number of misclassifications will increase if this interaction is added to the model.
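A minimal sketch of both measures is given below. It assumes that a single-tree model is stored as a set of prime implicants, each a frozenset of (variable, required value) pairs, that an observation is classified as a case when at least one prime implicant is true, and that the differences in correctly classified oob observations are averaged over the B iterations. The data structures and function names are illustrative choices, not those of the logicFS package.

```python
def predict(dnf, x):
    """Classify x as case (True) if at least one conjunction of the DNF holds."""
    return any(all(x[var] == val for var, val in term) for term in dnf)

def n_correct(dnf, X, y):
    """Number of observations whose predicted status equals the observed status."""
    return sum(predict(dnf, x) == label for x, label in zip(X, y))

def vim_single(P, runs):
    """runs: one (dnf, X_oob, y_oob) triple per bootstrap iteration."""
    total = 0
    for dnf, X, y in runs:
        n_b = n_correct(dnf, X, y)
        if P in dnf:   # P is part of the model: remove it
            total += n_b - n_correct(dnf - {P}, X, y)
        else:          # P is not part of the model: add it
            total += n_correct(dnf | {P}, X, y) - n_b
    return total / len(runs)

def vim_multiple(n_b, n_b_minus):
    """Average loss in correctly classified oob observations after removing P."""
    return sum(a - b for a, b in zip(n_b, n_b_minus)) / len(n_b)

# Toy oob data in which the case status equals A.
X = [{"A": True, "B": False}, {"A": False, "B": True},
     {"A": True, "B": True}, {"A": False, "B": False}]
y = [True, False, True, False]
P = frozenset({("A", True)})
Q = frozenset({("B", True)})
runs = [({P}, X, y), ({Q}, X, y)]

print(vim_single(P, runs))  # 1.5
print(vim_single(Q, runs))  # -0.5
```

In this toy example, the explanatory variable receives a positive importance, while the irrelevant one receives a negative importance because adding it to a correct model causes a misclassification.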
| 5. APPLICATION TO SNP DATA |
|---|
In this section, we apply logicFS using 50 000 iterations in each run of simulated annealing and the 2 variable importance measures (4.1) and (4.2) to simulated and real SNP data. Since the input variables of logic regression, and hence of logicFS, have to be binary, each SNP $S_i$, $i = 1,\dots,m$, is split into the 2 variables
- $S_{i1}$: "At least one of the bases explaining $S_i$ is the less frequent variant."
- $S_{i2}$: "Both bases explaining $S_i$ are the less frequent variant."
These dummy variables are used instead of the SNPs themselves, where $S_{i1}$ codes for a dominant and $S_{i2}$ for a recessive effect.
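The coding into 2 binary variables can be illustrated with a small helper. The sketch assumes that a genotype is stored as the number of copies of the less frequent variant (0, 1, or 2); the function name `code_snp` is hypothetical.

```python
def code_snp(genotype):
    """Map a genotype (0, 1, or 2 copies of the less frequent variant) to (S_i1, S_i2)."""
    if genotype not in (0, 1, 2):
        raise ValueError("genotype must be 0, 1, or 2")
    s_i1 = genotype >= 1   # at least one base is the less frequent variant (dominant coding)
    s_i2 = genotype == 2   # both bases are the less frequent variant (recessive coding)
    return s_i1, s_i2

for g in (0, 1, 2):
    print(g, code_snp(g))
```

Under this coding, the homozygous reference genotype corresponds to the negation of the first variable, and the heterozygous genotype to the first variable being true while the second is false.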
To investigate whether our procedures are able to identify the influential interactions in case-control studies, we employ 2 simulations. In the first simulation, we are particularly interested in the stability of the approaches, that is, whether logicFS always identifies the interactions intended to be influential and whether the importance of an interaction provided by either (4.1) or (4.2) is always about the same when the approaches are applied to the same data set. The goal of the second simulation is to determine whether our procedure can cope with real association studies in which single interactions might have moderate effects and a relatively high percentage of the cases cannot be classified by the measured SNPs.
To examine the former issue, data of 1000 observations (500 cases and 500 controls) and 50 SNPs are simulated, where an observation is classified as a case if one of the following 4 logic expressions is true: $S_{12}$ (explaining 100 cases), a 2-way interaction of $S_{21}$ and $S_{32}$ (150), $S_{42} \wedge S_{52} \wedge S_{62}$ (100), and $S_{72} \wedge S_{82}$ (150). Apart from the SNP values explaining the cases, the values of each of the 50 SNPs are randomly drawn such that the minor allele frequency, that is, the frequency of the less frequent variant, of each SNP lies between 0.2 and 0.4 and the Hardy-Weinberg equilibrium is fulfilled. Using 100 bootstrap samples and allowing a maximum of 20 variables in each of the logic expression models, logicFS is applied to this data set twice, once with the single-tree approach and once with the multiple-tree approach allowing 3 logic trees to grow. Afterwards, VIMSingle and VIMMultiple are computed for each of the interactions in the respective approaches. This procedure is repeated 50 times.
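Drawing genotypes under Hardy-Weinberg equilibrium amounts to drawing the 2 bases of a SNP independently, each being the less frequent variant with probability equal to the minor allele frequency. The sketch below covers only this background noise and does not plant the explanatory interactions; drawing the minor allele frequencies uniformly from [0.2, 0.4] is an assumption, as are the function name and the seed.

```python
import random

def simulate_snps(n_obs, n_snps, maf_range=(0.2, 0.4), seed=1):
    """Return (mafs, data); data[i][j] is the number of minor alleles of SNP j in observation i."""
    rng = random.Random(seed)
    # Assumption: one minor allele frequency per SNP, drawn uniformly from maf_range.
    mafs = [rng.uniform(*maf_range) for _ in range(n_snps)]
    # Two independent allele draws per SNP give genotype frequencies (1-q)^2, 2q(1-q), q^2.
    data = [[(rng.random() < q) + (rng.random() < q) for q in mafs]
            for _ in range(n_obs)]
    return mafs, data

mafs, data = simulate_snps(1000, 50)
first_snp_freq = sum(row[0] for row in data) / (2 * len(data))
print(round(mafs[0], 3), round(first_snp_freq, 3))
```

The printed pair compares the target and the empirical minor allele frequency of the first SNP.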
Table 1 shows how many of the identified interactions appear in how many of the 50 iterations. For example, in the single-tree approach, 72 interactions appear only once and 15 in 2 of 50 iterations. Only 9 of the interactions found in the single-tree approach and 16 of the interactions in the multiple-tree approach are detected in all 50 iterations. Figure 2 displays the median and the 25% and 75% quantiles of the 50 values of the importance measures for each of these 9 and 16 interactions, respectively.
In the single-tree case, only the 4 interactions explanatory for the cases and the three 2-way interactions contained in the explanatory 3-way interaction are identified with a positive importance in all iterations. As expected from the fact that typically about 37% of the observations are oob, the 2-way interactions have an importance a little smaller than 0.37 × 150 = 55.5, and $S_{12}$ shows an importance slightly smaller than 37. Figure 2 also reveals that the single-tree estimates of the importances are very stable since they do not differ much between the 50 iterations.
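The figure of about 37% follows from the probability that a given observation is never drawn in a bootstrap sample of size n, which is (1 - 1/n)^n and approaches exp(-1) for large n. A short check, with the helper name `oob_fraction` introduced for illustration:

```python
import math

def oob_fraction(n):
    """Probability that a fixed observation is absent from a bootstrap sample of size n."""
    return (1 - 1 / n) ** n

p = oob_fraction(1000)
print(round(p, 3))              # 0.368
print(round(math.exp(-1), 3))   # 0.368
# Rough upper bounds on the importance of an interaction explaining 150 or 100 cases:
print(round(p * 150, 1), round(p * 100, 1))
```

The text rounds this fraction to 0.37, which gives the stated reference values of 55.5 and 37.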
In the multiple-tree case, as in the single-tree case, the top 3 logic expressions are the 2 explanatory 2-way interactions and $S_{12}$. The 3-way interaction shows the sixth highest importance and is surrounded by the binary variables belonging to the 2-way interactions. The latter also explains why the importance of the 2-way interactions is smaller in the multiple-tree case than in the single-tree case: even though the importances of both the variables themselves and the corresponding 2-way interactions are computed separately, they are considered jointly in the computation.
As a second simulation, SNP data are considered that are more realistic for a genetic association study. Data of 1000 observations and 50 SNPs are generated, where each SNP exhibits a minor allele frequency of 0.25. The case-control status y of each observation is randomly drawn from a Bernoulli distribution with a success probability that depends on 2 logic expressions: $L_1$, a 2-way interaction involving $S_{61}$, and $L_2$, a 3-way interaction.
Thus, the probability of being a case in this association study is 0.378 even if an observation exhibits none of the 2 interactions intended to be influential for the risk of developing a disease. A reason for this might be that there are other genetic or environmental factors that have not been surveyed in this study but that also have an impact on the disease risk.
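The baseline probability of 0.378 is what a logistic model yields for an intercept of -0.5. Both the logit link (taken from Section 2) and the value of the intercept are inferences from this number and should be read as assumptions, not as the stated simulation model.

```python
import math

def expit(x):
    """Inverse of the logit function."""
    return 1 / (1 + math.exp(-x))

# Assumed intercept of -0.5 when neither of the 2 interactions is true.
print(round(expit(-0.5), 3))  # 0.378
```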
This procedure is repeated 50 times such that 50 data sets are generated. The mean numbers of cases and controls over these data sets for the different probabilities of being a case are summarized in Table 2.
Both the single-tree approach with a maximum of 6 variables and the multiple-tree approach with 2 trees and a maximum of 8 variables are applied to each of these data sets using B = 50 iterations.
Table 3 reveals that the 2 SNP interactions intended to be influential for the disease risk are detected in all 50 data sets. Moreover, they are identified as the 2 most important logic expressions in almost all of these data sets, where the 2-way interaction $L_1$ mostly ranks first with a mean importance of 18.88 in the single-tree and 15.19 in the multiple-tree approach and the 3-way interaction $L_2$ ranks second with a mean importance of 12.21 or 6.44, respectively. If one of these interactions ranks third (or lower), then the expressions identified to be more important typically contain this or the other influential interaction plus another variable.
The GENICA study is carried out by the Interdisciplinary Study Group on Gene Environment Interaction and Breast Cancer in Germany (http://www.genica.de), a joint initiative of researchers dedicated to the identification of genetic and environmental risk factors associated with sporadic breast cancer. This age-matched and population-based case-control study was launched within the activities of the German Human Genome Project (DHGP) and continues at the time of writing.
Although exogenous risk factors such as reproduction variables, hormone variables, and lifestyle factors have also been assessed, we here focus our interest on a subset of the genotype data from the GENICA study. More precisely, data from 1258 women (609 cases and 649 controls) and 40 SNPs belonging to the DNA repair or the xenobiotic and drug metabolism pathway are available for our analysis.
All observations showing more than 5 missing values as well as SNPs having more than 10% missing values or fewer than 30 women not showing the homozygous reference genotype are removed from the analysis leading to a total of 35 SNPs and 1191 women (561 cases and 630 controls). The remaining missing values are replaced SNP-wise by random draws from the marginal distribution.
For the application of logicFS to the GENICA SNP data, each of the SNPs is again coded by 2 dummy variables. Only 59 of these 70 binary variables are used in the analysis since for each of the other 11 variables, there are fewer than 10 women for whom this variable is true.
Using B = 200 iterations, logicFS is applied to this data set twice, once with a single tree containing a maximum of 10 variables and once allowing 3 trees to grow with a maximum of 16 variables in all 3 trees combined.
In the single-tree case, this leads to the detection of 1052 potentially interesting SNPs and SNP interactions, whereas in the multiple-tree case, 1589 SNPs and SNP interactions are identified. Figure 3, however, reveals that just one interaction, namely !X18&X20 or, decoded, ERCC2_6540$_1^C$ $\wedge$ ERCC2_18880$_1$, consisting of 2 SNPs from the gene ERCC2 (Excision Repair Cross-Complementing group 2; formerly XPD) seems to be associated with the case-control status. Thus, if ERCC2_6540 (refSNP ID: rs1799793) is of the homozygous reference genotype and ERCC2_18880 (rs1052559) is not of this genotype, then a woman will have a slightly higher risk of developing breast cancer.
As indicated by the multiple-tree measure in Figure 3, ERCC2_6540 itself has a slight effect on breast cancer risk. This also explains the importances of the other interactions in the single-tree case, which all include ERCC2_6540.
| 6. COMPARISON WITH MC LOGIC REGRESSION |
|---|
To compare logicFS with MC logic regression, the latter is applied to all data sets used in Section 5, and VIMSingle, VIMMultiple, and the fractions of models in which variables, pairs, and triplets of variables jointly occur are determined. Since VIMSingle and VIMMultiple should be computed on a data set independent from the one employed for building the models, each of the data sets is randomly split into a training and a test set. For better comparability, each training set is composed of 63.2% of the observations as the bootstrap samples used in logicFS are expected to contain 63.2% of the observations. MC logic regression is applied to each of these training sets using the same settings as in Section 5 and 500 000 iterations. The last 100 000 models are kept in memory to compute the 3 measures.
In Figure 4, the results of the analysis of the GENICA data set with MC logic regression are displayed. This figure shows that in the application of the single-tree approach, X44, a dummy variable of GSTP1_313, appears in more models than ERCC2_18880$_1$ (i.e. X20). However, GSTP1_313 has no relevance for classification, neither as a single variable nor in interaction with another variable. As mentioned before, ERCC2_6540 is the only individual variable showing a slight effect. It therefore occurs in virtually all models found by both the single- and the multiple-tree approach. In the latter application, VIMMultiple of both ERCC2_18880$_1$ and ERCC2_6540$_1$ is larger than the importance of their interaction, which is due to the fact that these 2 variables mostly appear in different trees. Reducing the number of trees might be a solution to this problem, a step that is not necessary when using logicFS. There, ERCC2_6540$_1^C$ $\wedge$ ERCC2_18880$_1$ is detected as the most important variable no matter how many trees are allowed to grow (cf. right panel of Figure 3).
In the applications of MC logic regression to the data sets of Simulation 2, the 2-way interaction $L_1$ is always identified by the single-tree approach and in about 60% of the data sets also by the multiple-tree approaches, whereas the 3-way interaction $L_2$ is found in about 40% of the applications (cf. Table 4). If identified, $L_1$ is the most important variable only about 50% of the time. By contrast, logicFS always identifies these 2 prime implicants intended to be explanatory for the case-control status, and they are detected as the 2 most important interactions in virtually every application (cf. Table 3).
$S_{12}$, the interaction of $S_{21}$ and $S_{32}$, and $S_{72} \wedge S_{82}$ are detected in all applications of the single-tree approach to the data set of Simulation 1, whereas $S_{42} \wedge S_{52} \wedge S_{62}$ is identified in none of the repeats, even though $S_{42}$, $S_{52}$, and $S_{62}$ appear frequently in the same model in some of the iterations, but never as the specific interaction $S_{42} \wedge S_{52} \wedge S_{62}$. The median values of VIMSingle for $S_{12}$, the interaction of $S_{21}$ and $S_{32}$, and $S_{72} \wedge S_{82}$ are 39.06, 52.42, and 50.67, respectively, and the median proportions of models containing $S_{12}$, the pair $\{S_{21}, S_{32}\}$, or the pair $\{S_{72}, S_{82}\}$ are each larger than 99%. However, $\{S_{32}, S_{72}\}$ or $\{S_{21}, S_{82}\}$, for example, occasionally also appear jointly in more than 99% of the models. They would, therefore, wrongly be considered to be as important as $\{S_{21}, S_{32}\}$ and $\{S_{72}, S_{82}\}$ when using the proportion as importance measure. By contrast, none of the 2-way interactions including either the pair $\{S_{32}, S_{72}\}$ or $\{S_{21}, S_{82}\}$ exhibits a large value of VIMSingle. The application of the multiple-tree approach leads to similar results.
These 2 simulations show the drawback of MC logic regression: SNP interactions are identified by applying a search algorithm once to the whole (training) data set. If an interaction explanatory for the case-control status is in one of the models visited during this search, then it is very likely that it will be identified to be important. However, if the variables composing this interaction do not jointly occur in any of these models, this interaction will not be found.
By contrast, in logicFS, a search algorithm is applied several times to different subsets of the data such that an interesting interaction is not lost even if it is not identified in some of the runs. Thus, logicFS stabilizes the search for interesting variables and interactions.
| 7. DISCUSSION |
|---|
A common and important task in genetic association studies is the identification of SNPs and SNP interactions associated with a covariate of interest, for example, a disease. Since SNP interactions are assumed to be more influential than individual SNPs, there is a need for a method which is able to identify such interactions. For a good prediction of the covariate of interest, this method should, in addition, be able to quantify the importance of interactions.
In this paper, we have introduced a procedure called logicFS based on a combination of bootstrap and logic regression for the identification of potentially interesting logic expressions that represent SNP interactions and 2 measures for quantifying the importance of these features for classification in case-control studies.
In the applications to simulated SNP data, all logic expressions intended to be explanatory for the case-control status of the observations are identified in all repeats and always have the highest importances. In the analysis of the GENICA SNP data set, only one interaction between 2 SNPs of the ERCC2 gene could be detected that slightly increases the risk of developing breast cancer. This supports the findings of Justenhoven and others (2004).
Other importance measures such as the one of random forests would also identify ERCC2_6540 and ERCC2_18880 as the 2 most important variables. Advantages of our approaches are that they allow us to compute the importances of interactions of such variables without using the interactions as input variables and that the genotype responsible for the higher disease risk is directly revealed by the prime implicants.
An advantage of logicFS over MC logic regression (Kooperberg and Ruczinski, 2005) is that the search for interesting interactions is stabilized by running a search algorithm several times on several subsets of the data. In contrast to using the fraction of models containing a particular set of variables as importance measure, VIMSingle and VIMMultiple can quantify the importance of a specific interaction for classification and can be employed to compare interactions of different orders. The latter is not possible directly when using fractions since each subset of variables contained in the set of interest will show at least the same (or a higher) frequency/importance as this set of variables.
Since the goal of a case-control study is the construction of a classification rule based on as few variables as possible, the identification of SNP interactions associated with case-control status is just the first step, but a very important one. In a next step, one could, for example, take the k most important features or all interactions exceeding a specific importance and use these as binary variables in logic regression or in any other classification procedure.
The variable importance measures are currently restricted to analyses of data with a binary outcome. They can, however, be extended, for example, to quantitative trait loci studies in which the covariate of interest is continuous. In this case, sums of squares would replace the numbers of correctly classified observations in (4.1) and (4.2) and the signs of the differences in (4.1) and (4.2) would have to be changed. Since logic regression already includes linear regression (Ruczinski and others, 2003), logicFS can be used as is to identify interactions associated with the quantitative trait.
Apart from its use for the identification of interactions, logicFS can be employed as a classification procedure since it is actually a bagging (Breiman, 1996) version of logic regression. Using the output of logicFS, the case-control status of a new observation can be predicted by majority voting, that is, by assigning the observation to the class predicted by the majority of the B logic regression models, or by averaging over the class probabilities. Since logic trees and CART trees are related (each logic tree can be transformed into a CART tree and vice versa), logic trees might also be unstable classifiers. It is, therefore, likely that the bagging version of logic regression might improve the classification.
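Majority voting over the B single-tree models can be sketched as follows. As an illustrative assumption, each model is represented as a set of prime implicants, each a frozenset of (variable, required value) pairs, and ties are resolved in favor of the control class; neither choice is prescribed by the text.

```python
def predict(dnf, x):
    """Classify x as case (True) if at least one conjunction of the DNF holds."""
    return any(all(x[var] == val for var, val in term) for term in dnf)

def majority_vote(models, x):
    """Predict case if more than half of the models classify x as a case."""
    votes = sum(predict(dnf, x) for dnf in models)
    return votes > len(models) / 2

# Three toy models, as they might result from three bootstrap iterations.
models = [
    {frozenset({("A", True)})},
    {frozenset({("A", True), ("B", False)})},
    {frozenset({("B", True)})},
]
print(majority_vote(models, {"A": True, "B": False}))  # True
print(majority_vote(models, {"A": False, "B": True}))  # False
```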
The computation time of the proposed approaches depends not only on the number B of repeats of the search with simulated annealing and the number of iterations used in simulated annealing but also on the maximum number of variables and trees allowed in the models. For example, the application of logicFS to the data set of Simulation 1 on an AMD Athlon XP 3000+ machine with 1 GB of RAM takes about 400 s when using the single-tree approach and about 520 s when employing the multiple-tree method. The computation of VIMSingle requires 2.5 s and the calculation of VIMMultiple about 93 s. Since logicFS is not restricted to simulated annealing, other faster algorithms such as a greedy search can be employed to reduce the computation time considerably.
All the approaches presented in this paper have been implemented in the R package logicFS that can be downloaded from http://www.bioconductor.org, the web page of the Bioconductor project (Gentleman and others, 2004). This package also contains a version of logicFS which allows bagging on logic regression models.
| ACKNOWLEDGMENTS |
|---|
Financial support of the Deutsche Forschungsgemeinschaft (SFB 475, "Reduction of Complexity in Multivariate Data Structures") is gratefully acknowledged. The authors would also like to thank all partners within the GENICA research network for their cooperation. Conflict of Interest: None declared.
| REFERENCES |
|---|
Breiman L. Bagging predictors. Machine Learning (1996) 24:123-140.
Breiman L. Random forests. Machine Learning (2001) 45:5-32.
Breiman L, Friedman JH, Olshen RA, Stone CJ. Classification and Regression Trees (1984) Belmont, CA: Wadsworth.
Garte S. Metabolic susceptibility genes as cancer risk factors: time for a reassessment? Cancer Epidemiology, Biomarkers and Prevention (2001) 10:1233-1237.
Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, and others. Bioconductor: open software development for computational biology and bioinformatics. Genome Biology (2004) 5:R80.
Guyon I, Weston J, Barnhill S, Vapnik V. Gene selection for cancer classification using support vector machines. Machine Learning (2002) 46:389-422.
Justenhoven C, Hamann U, Pesch B, Harth V, Rabstein S, Baisch C, Vollmert C, Illig T, Ko Y, Brüning T, Brauch H. ERCC2 genotypes and a corresponding haplotype are linked with breast cancer risk in a German population. Cancer Epidemiology, Biomarkers and Prevention (2004) 13:2059-2064.
Kooperberg C, Ruczinski I. Identifying interacting SNPs using Monte Carlo logic regression. Genetic Epidemiology (2005) 28:157-170.
Kooperberg C, Ruczinski I, LeBlanc M, Hsu L. Sequence analysis using logic regression. Genetic Epidemiology (2001) 21:626-631.
Ruczinski I, Kooperberg C, LeBlanc M. Logic regression. Journal of Computational and Graphical Statistics (2003) 12:475-511.
Ruczinski I, Kooperberg C, LeBlanc M. Exploring interactions in high-dimensional genomic data: an overview of logic regression, with applications. Journal of Multivariate Analysis (2004) 90:178-195.
Schwender H. Minimization of Boolean expressions using matrix algebra. In: Technical Report, SFB 475 (2007) Dortmund, Germany: Department of Statistics, University of Dortmund.
Schwender H, Zucknick M, Ickstadt K, Bolt HM. A pilot study on the application of statistical classification procedures to molecular epidemiological data. Toxicology Letters (2004) 151:291-299.
Vapnik V. The Nature of Statistical Learning Theory (1995) New York: Springer.
Witte JS, Fijal BA. Introduction: analysis of sequence data and population structure. Genetic Epidemiology (2001) 21:600-601.
Received July 5, 2006; revised November 29, 2006; revised March 2, 2007; accepted for publication April 25, 2007.