Biostatistics Advance Access published online on May 8, 2007
Biostatistics, doi:10.1093/biostatistics/kxm011
Retrospective analysis of haplotype-based case-control studies under a flexible model for gene-environment association
Institute of Statistical Science, Academia Sinica, Taipei 11529, Taiwan, People's Republic of China
Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, 6120 Executive Boulevard, EPS 8038, Rockville, MD 20852, USA chattern{at}mail.nih.gov
Department of Statistics, Texas A&M University, TAMU 3143, College Station, TX 77843-3143, USA
* To whom correspondence should be addressed.
Genetic epidemiologic studies often involve investigation of the association of a disease with a genomic region in terms of the underlying haplotypes, that is, the combinations of alleles at multiple loci along homologous chromosomes. In this article, we consider the problem of estimating haplotype-environment interactions from case-control studies when some of the environmental exposures themselves may be influenced by genetic susceptibility. We specify the distribution of the diplotypes (haplotype pairs) given environmental exposures for the underlying population based on a novel semiparametric model that allows haplotypes to be related to environmental exposures, while allowing the marginal distribution of the diplotypes to maintain certain population genetics constraints such as Hardy-Weinberg equilibrium. The marginal distribution of the environmental exposures is allowed to remain completely nonparametric. We develop a semiparametric estimating equation methodology and related asymptotic theory for estimating the disease odds ratios associated with the haplotypes, environmental exposures, and their interactions; the parameters that characterize haplotype-environment associations; and the marginal haplotype frequencies. The problem of phase ambiguity of genotype data is handled using a suitable expectation-maximization algorithm. We study the finite-sample performance of the proposed methodology using simulated data. An application of the methodology is illustrated using a case-control study of colorectal adenoma, designed to investigate how the smoking-related risk of colorectal adenoma can be modified by "NAT2," a smoking-metabolism gene that may potentially influence susceptibility to smoking itself.
Keywords: Case-control studies; EM algorithm; Gene-environment interactions; Haplotype; Semiparametric methods
Received October 13, 2006; revised March 2, 2007; accepted for publication March 14, 2007.
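
The phase-ambiguity problem mentioned in the abstract can be illustrated with a textbook expectation-maximization sketch. This is not the retrospective algorithm of the article, which also involves the disease model and the environmental exposures. It is a minimal illustration that assumes a single random sample of unphased genotypes at biallelic loci, Hardy-Weinberg equilibrium, and no covariates. The function names `compatible_pairs` and `em_haplotype_freqs` and the genotype coding (number of copies of allele 1 at each locus) are assumptions made for this example.

```python
from itertools import product


def compatible_pairs(genotype):
    """Return all unordered haplotype pairs consistent with an unphased genotype.

    genotype: tuple of allele-1 counts (0, 1, or 2), one entry per biallelic locus.
    """
    haps = list(product((0, 1), repeat=len(genotype)))
    pairs = []
    for i, h1 in enumerate(haps):
        for h2 in haps[i:]:
            if all(a + b == g for a, b, g in zip(h1, h2, genotype)):
                pairs.append((h1, h2))
    return pairs


def em_haplotype_freqs(genotypes, n_iter=200, tol=1e-10):
    """Estimate haplotype frequencies from unphased genotypes by EM under HWE."""
    n_loci = len(genotypes[0])
    haps = list(product((0, 1), repeat=n_loci))
    freq = {h: 1.0 / len(haps) for h in haps}  # uniform starting values
    for _ in range(n_iter):
        counts = {h: 0.0 for h in haps}
        # E-step: spread each subject over the haplotype pairs compatible
        # with the observed genotype, weighted by the HWE diplotype
        # probability (theta_j^2 if homozygous, 2 * theta_j * theta_k otherwise).
        for g in genotypes:
            pairs = compatible_pairs(g)
            weights = [(1 if h1 == h2 else 2) * freq[h1] * freq[h2]
                       for h1, h2 in pairs]
            total = sum(weights)
            for (h1, h2), w in zip(pairs, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: expected haplotype counts divided by the number of chromosomes.
        new_freq = {h: c / (2 * len(genotypes)) for h, c in counts.items()}
        converged = max(abs(new_freq[h] - freq[h]) for h in haps) < tol
        freq = new_freq
        if converged:
            break
    return freq


if __name__ == "__main__":
    # Two loci: only the double heterozygote (1, 1) is phase-ambiguous.
    data = [(0, 0)] * 4 + [(2, 2)] * 4 + [(1, 1)] * 2
    for hap, f in sorted(em_haplotype_freqs(data).items()):
        print(hap, round(f, 4))
```

In this toy data set the unambiguous homozygotes carry only haplotypes (0, 0) and (1, 1), so the iterations assign the two double heterozygotes almost entirely to that pair as well.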