Biostatistics Advance Access originally published online on June 19, 2007
Biostatistics 2008 9(1):187-198; doi:10.1093/biostatistics/kxm024
Identification of SNP interactions using logic regression
Collaborative Research Center 475, Department of Statistics, University of Dortmund, 44221 Dortmund, Germany holger.schwender{at}udo.edu
* To whom correspondence should be addressed.
Interactions of single nucleotide polymorphisms (SNPs) are assumed to be responsible for complex diseases such as sporadic breast cancer. Important goals of studies concerned with such genetic data are thus to identify combinations of SNPs that lead to a higher risk of developing a disease and to measure the importance of these interactions. There are many approaches based on classification methods such as CART and random forests that allow the importance of single variables to be measured. None of these methods, however, enables the importance of combinations of variables to be quantified directly. In this paper, we show how logic regression can be employed to identify SNP interactions that explain the disease status in a case–control study, and we propose 2 measures for quantifying the importance of these interactions for classification. These approaches are then applied both to simulated data sets and to the SNP data of the GENICA study, a study dedicated to the identification of genetic and gene–environment interactions associated with sporadic breast cancer.
Keywords: Feature selection; GENICA; Single nucleotide polymorphism; Variable importance measure
Received July 5, 2006; revised November 29, 2006; revised March 2, 2007; accepted for publication April 25, 2007.