Skip Navigation


Biostatistics Advance Access first published online on September 19, 2006
This version published online on May 14, 2007

Biostatistics, doi:10.1093/biostatistics/kxl028
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow Supplementary Material
Right arrow All Versions of this Article:
8/3/546    most recent
kxl028v2
kxl028v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Right arrow Disclaimer
Google Scholar
Right arrow Articles by Johnson, T.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Johnson, T.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org.

Bayesian method for gene detection and mapping, using a case and control design and DNA pooling

Toby Johnson

School of Biological Sciences, The University of Edinburgh, Edinburgh EH9 3JT, UK and Rothamsted Research, Harpenden, Hertfordshire AL5 2JQ, UK

toby.johnson{at}ed.ac.uk


    SUMMARY
 TOP
 SUMMARY
 1. INTRODUCTION
 2. METHODS
 2.2 Method for Bayesian...
 2.3 Single-point analyses
 3. RESULTS
 3.1 Power to detect...
 3.2 Estimation of QTL...
 3.3 Effect of experimental...
 3.4 Application to real...
 4. DISCUSSION
 SOFTWARE
 REFERENCES
 
Association mapping studies aim to determine the genetic basis of a trait. A common experimental design uses a sample of unrelated individuals classified into 2 groups, for example cases and controls. If the trait has a complex genetic basis, consisting of many quantitative trait loci (QTLs), each group needs to be large. Each group must be genotyped at marker loci covering the region of interest; for dense coverage of a large candidate region, or a whole-genome scan, the number of markers will be very large. The total amount of genotyping required for such a study is formidable. A laboratory effort efficient technique called DNA pooling could reduce the amount of genotyping required, but the data generated are less informative and require novel methods for efficient analysis. In this paper, a Bayesian statistical analysis of the classic model of McPeek and Strahs is proposed. In contrast to previous work on this model, I assume that data are collected using DNA pooling, so individual genotypes are not directly observed, and also account for experimental errors. A complete analysis can be performed using analytical integration, a propagation algorithm for a hidden Markov model, and quadrature. The method developed here is both statistically and computationally efficient. It allows simultaneous detection and mapping of a QTL, in a large-scale association mapping study, using data from pooled DNA. The method is shown to perform well on data sets simulated under a realistic coalescent-with-recombination model, and is shown to outperform classical single-point methods. The method is illustrated on data consisting of 27 markers in an 880-kb region around the CYP2D6 gene.

Keywords: Association mapping; Case and control study; DNA pooling; Genetic association; LD mapping; QTL mapping


    1. INTRODUCTION
 TOP
 SUMMARY
 1. INTRODUCTION
 2. METHODS
 2.2 Method for Bayesian...
 2.3 Single-point analyses
 3. RESULTS
 3.1 Power to detect...
 3.2 Estimation of QTL...
 3.3 Effect of experimental...
 3.4 Application to real...
 4. DISCUSSION
 SOFTWARE
 REFERENCES
 
For many traits of interest, including susceptibility to many genetic diseases, the genetic basis is complex, meaning that many genes of individually small effect (quantitative trait loci [QTLs]) contribute. For detecting and mapping such QTLs, association mapping studies that use large samples of essentially unrelated individuals may have 2 advantages over linkage studies that use pedigrees or families: For a given sample size, association studies may have more power to detect a QTL (e.g. Risch and Merikangas, 1996; Risch, 2000), and they may allow the QTL to be fine mapped with greater resolution or precision (e.g. Terwilliger, 1995; McPeek and Strahs, 1999). Often the size of the genomic region under study will be large, and many markers will need to be genotyped. In a large-scale association study, the probability of directly observing a given QTL is small. For example, there may be over 12 million common single-nucleotide polymorphisms (SNPs) in the human genome (dbSNP Build 126; see http://www.ncbi.nlm.nih.gov/projects/SNP), but even the largest studies currently plan to genotype only half a million SNPs (Thomas and others, 2005). Thus, most QTLs will have to be detected and mapped indirectly, via linkage disequilibrium (LD; nonrandom association) between the QTL and one or more of the genotyped markers. Because of nongenetic factors and the effects of other genetically distant QTLs, each QTL will explain only a small fraction of the total variance in phenotype. If the trait is binary (such as presence or absence of a disease), the difference in QTL allele frequencies between the 2 trait groups will be small. The combination of individually small effect sizes and the expectation of imperfect LD or association between QTLs and genotyped markers mean that each marker will need to be genotyped in large numbers of individuals.

Thus, large numbers of markers will need to be genotyped in large numbers of individuals to detect QTLs, and to infer their positions, frequencies, and effects on the trait. In order to reduce the cost of such a study, an experimental strategy called DNA pooling has been proposed (e.g. Arnheim and others, 1985; Barcellos and others, 1997; Sham and others, 2002; Norton and others, 2004; Prentice and Qi, 2006). Here DNA from individuals with similar phenotypes is physically mixed together into a pool before genotyping. After overheads to do with construction of pools and assay development, the cost of genotyping an entire pool at a given marker is reduced to the cost of genotyping a single individual. Thus, costs may be reduced by a factor close to the number of individuals in a pool (divided by the number of experimental replicates used for each pool), when the overheads can be spread across many markers and many disease studies.

Compared with individual genotyping, a DNA-pooling strategy incurs 3 types of information loss and error. At each marker, only the marginal counts of the alleles present within each pool are available, and so there is (i) no information about deviations from Hardy–Weinberg proportions within each marker within each pool (Risch and Teng, 1998) and (ii) no information about phase or LD across markers within each pool (Johnson, 2005b). Further, (iii) the marginal counts are only estimated from some kind of quantitative genotyping experiment (e.g. Germer and others, 2000; Le Hellard and others, 2002), rather than by counting individual genotyping experiments. Thus, additional challenges arise when developing statistically and computationally efficient methods for analysis of data from pooled DNA.

Many ongoing and planned genomewide association studies are using (or plan to use) multistage designs, using individual genotyping at every stage (Thomas and others, 2005). However, DNA pooling has been used for preliminary stages in some large-scale studies (e.g. Hinds and others, 2004), and even for whole-genome scans (Zubenko and others, 2002; Sawcer and Compston, 2003), and the entire volume 143 of Journal of Neuroimmunology describing the GAMES project, Shiffman and others, 2004; Butcher and others, 2005; Prentice and Qi, 2006). Efficient methods for analyzing data from pooled DNA are therefore desirable for at least 2 reasons. First, they are a prerequisite for large-scale simulation and empirical studies to determine whether DNA pooling is the most cost-effective strategy for some stages in a multistage design. Second, they will allow us to make maximum use of data already collected.

In this paper, I propose a new Bayesian method for QTL detection and fine-scale mapping, assuming an association study with data collected using DNA pooling. I consider the simplest experimental design where there are 2 trait groups. These could have been classified by the presence or absence of a binary trait such as a disease. Alternatively, if the trait is quantitative, individuals from each tail of the trait distribution could make up the 2 groups, for example the upper and lower 10–15% tails (e.g. Darvasi and Soller, 1994; Bader and others, 2001; Hinds and others, 2004). For simplicity, I will refer throughout the paper to the 2 trait groups as cases and controls and to one of the alleles at the QTL as the disease allele. (It is worth noting here that data collected from more than 2 groups can be analyzed using the Bayesian method proposed here, by discarding or combining data from some of the groups. Such an approach is valid and may be useful, but will almost certainly not make most efficient use of the available data, since it will be based on only a marginal observation that is almost certainly not sufficient.)

Full likelihood or Bayesian analysis of large data sets collected using DNA pooling, assuming an adequately realistic model, does not appear to be computationally realistic at present. The method proposed here makes an approximation by using a likelihood function derived using a simplified "decay of haplotype sharing" model (McPeek and Strahs, 1999). This approximation may be acceptable because it allows all the necessary computations to be done analytically or using simple numerical algorithms. This means that the method proposed here can be used on very large data sets, with hundreds of cases and hundreds of controls genotyped at thousands of markers. Avoiding the use of more complicated algorithms such as Markov chain Monte Carlo (MCMC) means that there are no concerns about mixing and convergence, and no difficulties with computing normalizing constants and Bayes factors (BFs) for model testing and model comparison. When the model is correct, the BF is (in various classical senses, see O'Hagan and Forster, 2004, Chapter 5) an optimal test statistic for detecting a QTL (see also Patterson and others, 2004). Therefore, the approach proposed here can be viewed as choosing an approximate model that is simple enough to allow an exact and optimal analysis. It will therefore complement alternative approaches that perform approximate or suboptimal analyses for more realistic models.

One example of such an alternative approach is to perform a classical single-point analysis, which applies a separate test to the data for each marker. Such an approach can be made essentially free of any population genetic model, meaning that not only no worring assumptions are made but also there is no efficient framework for combining analyses across many markers (e.g. to correct for multiple testing). When genotype data are available and when the QTL is not one of the genotyped markers, such single-point methods are known to lack power (e.g. Risch, 2000; Mott and others, 2000; Zollner and Pritchard, 2005), and may produce inefficient point estimates and inefficiently large region estimates for the position of the QTL (e.g. Morris and others, 2002; Morris and others, 2003).

The main question addressed in this paper is to identify in exactly what respects the Bayesian method proposed here can usefully supplement classical single-point analyses. To this end, I analyzed a large sample of synthetic data sets simulated under a realistic coalescent-with-recombination (Hudson, 1983) model, which is almost identical in specification to models used to test other recently proposed statistical methods for association data (Morris and others, 2004; Zollner and Pritchard, 2005; Waldron and others, 2006). Analyses of these data sets show that the Bayesian method proposed here offers at least 3 advantages. First, the Bayesian method has improved power to detect a QTL. Second, point estimates for the position of the QTL derived from the proposed method outperform the simple procedure of choosing the map position of the marker with the most significant single-point test result (the "minimum P value method," e.g. Kaplan and Morris, 2001a,b). Third, after a heuristic "flattening correction" (based on a derivation by McPeek and Strahs, 1999), credibility intervals for the position of the QTL cover its true position with frequencies close to their nominal size.

The structure of this paper is as follows. Section 2.1 reviews the model used to generate simulated data sets. Section 2.2 briefly summarizes the new Bayesian method of analysis proposed here. Technical details are relegated to online supporting material available at Biostatistics online (Appendices A and B). Section 2.3 gives some necessary details about the single-point methods of analysis against which the Bayesian method is compared. The main result in this paper is actually the computational method described in the online supporting material available at Biostatistics online (Appendix B). However, the results presented in Section 3 have greater practical interest. The performance of the proposed Bayesian method is studied in terms of QTL detection in Section 3.1, and in terms of fine-scale QTL mapping in Section 3.2. Some more technical aspects of the results are discussed only in the preprint version of this paper (Johnson, 2005a). In Section 3.3, I briefly discuss the effects of different experimental designs on the performance of the proposed method. In Section 3.4, I illustrate the method on the real data set of Hosking and others (2002). Finally, in Section 4, I discuss general strategies for statistical analysis in large-scale genetic association studies.


    2. METHODS
 TOP
 SUMMARY
 1. INTRODUCTION
 2. METHODS
 2.2 Method for Bayesian...
 2.3 Single-point analyses
 3. RESULTS
 3.1 Power to detect...
 3.2 Estimation of QTL...
 3.3 Effect of experimental...
 3.4 Application to real...
 4. DISCUSSION
 SOFTWARE
 REFERENCES
 

2.1 Simulation model

Each simulated data set corresponds to a 1-Mb region, and was simulated in 3 steps. First, haplotypes were simulated according to a standard population genetic model, the coalescent-with-recombination (Hudson, 1983). Second, disease phenotypes were simulated using a standard genetic epidemiological model. Third, allele frequency estimates were simulated assuming a model for experimental errors that is consistent with assays currently in use. The first 2 steps are relatively standard and so I elaborate only briefly. A more detailed description of the third step then follows.

First, I used the "mksamples" program of Hudson (2002) to simulate a sample of 20 000 (1-Mb long) regions, assuming the standard neutral coalescent model with population recombination rate 4Nec = 400 (Ne = 10 000 assuming 1cM/Mb), and assuming the infinite sites mutation model with population mutation rate either {theta} = 4Neu = 10 or {theta} = 40. (These are unrealistically low mutation rates, the idea is to simulate some of the SNPs in the region rather than all of them.) Chromosomes were paired at random to generate a sample of 10 000 individuals.

Second, one SNP with a minor allele frequency between 10% and 20% was chosen as the disease QTL. The disease status of each individual was simulated assuming multiplicative risks, so that {gamma}01/{gamma}00 = {gamma}11/{gamma}01 = g. Here {gamma}G is the penetrance (probability of having the disease) given genotype G at the disease QTL, with 1 the minor allele. The parameter g is called the allelic relative risk (see, e.g. Risch and Merikangas, 1996). Simulations for this paper used values of g = 4 and g = 1. The penetrance of the wild-type homozygote, {gamma}00, was set so that the marginal probability of having the disease was 0.02. (Thus the number of case chromosomes nd was random with expectation 0.02 x 10 000 x 2 = 400.) Data from all nd/2 case individuals, and an equal number nc/2 of randomly chosen control individuals, were used to make up the 2 pools. Excluding the disease QTL, all simulated SNPs with a minor allele frequency greater than 0.05 in the control pool were analyzed, so the number of SNPs, L, was also random.

Third, the estimated allele frequencies at each SNP were simulated assuming a model for experimental errors. Gaussian errors on a raw allele frequency scale (as assumed, e.g. by Visscher and Le Hellard, 2003, or Simpson and others, 2005) were not used for simulating data because this can produce allele frequency estimates outside the range zero to one. As described in the online supporting material available at Biostatistics online (Appendix A), the model for experimental errors assumed here can be motivated by a protocol for genotyping pooled DNA using kinetic polymerase chain reaction (PCR) growth curves (Germer and others, 2000), assuming r = 2 experimental replicates, and errors in lag estimation that have standard deviation {sigma} = 0.2PCRcycles independent of true allele frequency. A large data set comparing genotyped individuals and pools (Shiffman and others, 2004) supports this model. However, this is in fact a much more general model that assumes simply that there are Gaussian errors on a logistic scale. As such, it is compatible with a wide range of protocols for genotyping pooled DNA. Brohede and others (2005) and Butcher and others (2005) genotyped pooled DNA using a microarray protocol, and observed absolute errors in the range 0.03–0.06 by comparing against individual genotypes. Brohede and others (2005) surveyed the literature and found that, for a range of other protocols, allele frequency can be estimated with a mean absolute error standard error generally in the range 0.01–0.03. These figures suggest that the error model assumed here, in which the standard error for allele frequency estimation is about 0.025 for intermediate allele frequencies (or more precisely, 0.1 x p(1 – p) where p is the true allele frequency), will be representative for protocols other than those based on kinetic PCR.


    2.2 Method for Bayesian analysis
 TOP
 SUMMARY
 1. INTRODUCTION
 2. METHODS
 2.2 Method for Bayesian...
 2.3 Single-point analyses
 3. RESULTS
 3.1 Power to detect...
 3.2 Estimation of QTL...
 3.3 Effect of experimental...
 3.4 Application to real...
 4. DISCUSSION
 SOFTWARE
 REFERENCES
 
The new method analyzes data consisting of allele frequency estimates at L marker SNPs, obtained from a single pool of nd case chromosomes and a single pool of nc control chromosomes. The genetic map positions of all marker SNPs are assumed known. The method allows a flexible model for experimental errors in allele frequency estimation, where different error distributions are allowed for different SNPs.

The variable of main interest is µ, which is the unknown map position of the disease QTL relative to the leftmost marker SNP. I assume that a unique mutation event generated the disease allele at the disease QTL, so that some number of haplotypes in the case pool carry the disease allele and are identical by descent (i.b.d.) at this position. The variable {rho} is the expected frequency of i.b.d.haplotypes in the case pool. For technical reasons, I assume that the disease allele and i.b.d. haplotypes are absent from the control pool. This assumption is biologically very unrealistic, but actually seems to be reasonably innocuous in practice. Informally, it seems that {rho} can be thought of roughly as describing the excess of i.b.d. haplotypes in the case pool compared with the control pool.

The model of McPeek and Strahs (1999) makes a biologically unrealistic assumption that the genealogy under the disease allele is star shaped. The variable {tau} is the depth of this genealogy, and can be thought of as the age of the disease allele. This assumption has the technically convenient consequence that each disease allele is embedded in a block of i.b.d. haplotype, called ancestral haplotype, the endpoints of which are independently and exponentially distributed.

Outside the blocks of ancestral haplotype, Hardy–Weinberg and linkage equilibrium are assumed to apply. Thus, the control pool is assumed to be entirely in linkage equilibrium, and the case pool is assumed to deviate from linkage equilibrium only due to blocks of ancestral haplotype. This assumption is certainly a very bad one when haplotype or genotype data are available (Liu and others, 2001; Morris and others, 2002; Li and Stephens, 2003), but when only multilocus allele frequency data are available it may be more innocuous. The informal argument for this is that because allele frequency data alone do not contain any information about LD, it would not make a big difference to include additional LD parameters in the model and then to integrate them out.5pt

For technical reasons, 3 potentially very worrying modeling assumptions have just been introduced. It is worth emphasizing that all these assumptions are violated in the simulation model described in Section 2.1. Therefore, to some extent, performance when analyzing simulated data sets measures the practical consequences, for making inference, of these nonbiological assumptions.

The key to the computations (described in the online supporting material available at Biostatistics online, Appendix B) is that, under the modeling assumptions above, the whole data set can be modeled as a single hidden Markov model (HMM), conditional only on the 3 parameters µ, {rho}, and {tau}. In contrast, previous approaches have modeled haplotypes or genotypes as separate HMMs that are independent conditional on a large number of parameters including allele frequencies and the unknown ancestral haplotype (McPeek and Strahs, 1999; Morris and others, 2000; Zhang and Zhao, 2000; Zhang and Zhao, 2002; Liu and others, 2001). The Bayesian analysis proposed here involves an independent precomputation for each SNP. This precomputation integrates out experimental errors, population allele frequencies, and the ancestral haplotype at that SNP. The values thus computed are the emission probabilities of the HMM. A main part of the analysis involves integrating out the hidden variables, which are the number of chromosomes in the case pool that carry ancestral haplotype, at each marker SNP. This integration is achieved using a minor variation on the standard backwards algorithm for HMMs. This computation is then repeated over a lattice of design points in the space for (µ, {rho}, {tau}). By traversing the design points in a particular order, many of the HMM computations can be reused, which speeds up the whole computation massively. Any or all of the variables µ, {rho}, and {tau} can then be integrated out using quadrature. For low-dimensional spaces such as the one considered here, this method of computation is more reliable and more efficient than MCMC methods.

Unless otherwise stated, Bayesian analyses performed here assumed weak priors: µ was uniform on the 1-Mb region being studied, {rho} was uniform on (0, 1), and {tau} was exponential with mean 1000 generations.

The assumption of a star-shaped genealogy is not realistic. McPeek and Strahs (1999) derived a "flattening correction," which was justified by them by the use of a quasi-score function, and was used in a Bayesian context by Morris and others (2000). Essentially, this involves flattening the likelihood function by raising all likelihoods to a power wn = (1 + (n – 1)cn)–1 < 1. Here cn is the pairwise correlation over sampled chromosomes of the conditional score function for the position of the QTL. An expression for cn for a coalescent model is given in Appendix D of McPeek and Strahs (1999). The n in this equation (which cn also depends on) is the number of chromosomes carrying the QTL. It is not at all clear whether or how this correction should be applied in the present context, because (i) as noted by Morris and others (2000), the quasi-score justification used by McPeek and Strahs (1999) does not apply in a Bayesian setting, (ii) in the present work, the likelihood is never written as a conditional product across chromosomes carrying the disease allele, (iii) it is not known how many chromosomes carry the disease allele, and (iv) in my computational implementation, no proper likelihoods are ever computed, only likelihoods marginal to allele frequencies. However, as shown in the results, the following ad-hoc approach produces good results. The procedure is to first estimate n by ndE({rho}|data), the product of the number of case chromosomes and the posterior expectation of {rho}, and then to flatten the marginal posterior for µ by raising it to a power wn and renormalizing.


    2.3 Single-point analyses
 TOP
 SUMMARY
 1. INTRODUCTION
 2. METHODS
 2.2 Method for Bayesian...
 2.3 Single-point analyses
 3. RESULTS
 3.1 Power to detect...
 3.2 Estimation of QTL...
 3.3 Effect of experimental...
 3.4 Application to real...
 4. DISCUSSION
 SOFTWARE
 REFERENCES
 
The inferences from the Bayesian method proposed here can be compared against simple but widely used classical single-point analyses. When allele frequencies in each pool are known exactly, a {chi}2 test can be used on the counts of the 2 alleles in the 2 pools (see e.g. Clayton, 2001), at each marker SNP separately. Several authors (e.g. Visscher and Le Hellard, 2003; Simpson and others, 2005) have considered how to perform an equivalent test when the errors in allele frequency estimates are Gaussian. They have shown that a "shrunk" test statistic is approximately distributed as {chi}2 with one degree of freedom. This shrunk statistic is equal to the ordinary {chi}2 statistic computed using a point estimate of the counts times a factor 2Vs/(2Vs + Ve) where Vs is the estimated sampling variance of the allele frequency due to sampling a finite number of cases and controls, under the null hypothesis of equal allele frequency in cases and controls, and Ve is the variance of the allele frequency in either pool due to experimental error.

This shrunk {chi}2 method can be applied to the simulated data sets considered here because the simulated errors are relatively small. There is therefore a local approximately linear relationship between the logistically transformed and the raw allele frequencies, so the simulated Gaussian errors on a logistic scale produce errors on a raw allele frequency scale that are approximately Gaussian. Using the Jacobian (given by (A4) in the online supporting material available at Biostatistics online) to describe this local linear relationship, for the simulations performed here, the appropriate shrinking factor is (approximately, for small errors)

Formula (2.1)

where Formula is the allele frequency estimated under the null hypothesis, that is, by combining the case and control pools.

The most widely considered statistics from a classical single-point analysis are as follows: Let Pmin be the smallest P value of the L (shrunk) {chi}2 tests applied to the L SNPs in a given data set, and let Formula be the map position of the marker with the smallest P value.

Although it has been suggested that Formula would be a "good" point estimate for the position of the QTL (Kaplan and Morris, 2001a,b), it is worth pointing out that it is asymptotically inadmissible for a model very similar to the one assumed here. As discussed in the preprint version of this paper (Johnson(2005a)), the point estimate Formula (when the marker SNPs lie on the interval [0, 1]) has uniformly lower expected loss, under both absolute and squared error loss, in the limit of a QTL with small effect. Although this argument does not technically apply for the model simulated here, it does suggest that better point estimates than Formula may be found, and suggests what their asymptotic behavior ought to be, at least approximately.


    3. RESULTS
 TOP
 SUMMARY
 1. INTRODUCTION
 2. METHODS
 2.2 Method for Bayesian...
 2.3 Single-point analyses
 3. RESULTS
 3.1 Power to detect...
 3.2 Estimation of QTL...
 3.3 Effect of experimental...
 3.4 Application to real...
 4. DISCUSSION
 SOFTWARE
 REFERENCES
 
I analyzed 1000 simulated data sets for each of the following 3 experimental designs. In design I, there was a modest SNP density, simulated using {theta} = 10. The median number of SNPs (with minor allele frequency at least 0.05 in the control pool) was 28, with interquartile range 24–32 and range 12–51. For design I, errors in allele frequency estimation were simulated as described in Section 2.1 (see also the online supporting material available at Biostatistics online, Appendix A), with {sigma} = 0.2 PCR cycles and r = 2 experimental replicates. Design II examines the effect of more precise allele frequency estimates, assuming that allele frequencies were known exactly, which might be considered an idealization of having an effectively infinite number of experimental replicates. Design III examines the effect of increased SNP density, simulated using {theta} = 40. The median number of SNPs was 117 (with interquartile range 108–126 and range 75–157). Errors in allele frequency estimation were as for design I.

For each design, 500 data sets were simulated assuming that there was a QTL with a genotype relative risk g = 4, and 500 data sets were simulated assuming a null model with no QTL (g = 1, so the penetrances {gamma}00 = {gamma}01 = {gamma}11 = 0.02 are all equal). The median number of case or control individuals, nd/2 = nc/2, was 200 (with interquartile range 191–209 and range 154–248).

These simulations assumed relative risks that are higher, and correspondingly sample sizes that are smaller than may be realistic for many studies of QTLs influencing complex genetic diseases. This reflects the need to analyze a reasonably large number of simulated data sets with the computing resources currently available. For designs I and II, with low SNP densities, the Bayesian analysis was conducted by evaluating the posterior at 106 design points on a 100 x 100 x 100 lattice (as described in the online supporting material available at Biostatistics online, (B27)). For these analyses, the mean run time was about 36 min on a 2.4-GHz Intel® XeonTM processor. For design III, with higher SNP density, the posterior was evaluated at 2 x 106 design points on a 200 x 100 x 100 lattice, and the mean run time was about 4.7 h. Thus, a total of about 250 processor-days were used to analyze all the simulated data sets.

It is worth emphasizing that all the tests described in the following sections concern the classical sense performance of statistics computed from the Bayesian analysis. Strictly, the Bayesian sense performance can only be tested by conducting a Bayesian analysis assuming a more realistic model or prior.


    3.1 Power to detect a QTL
 TOP
 SUMMARY
 1. INTRODUCTION
 2. METHODS
 2.2 Method for Bayesian...
 2.3 Single-point analyses
 3. RESULTS
 3.1 Power to detect...
 3.2 Estimation of QTL...
 3.3 Effect of experimental...
 3.4 Application to real...
 4. DISCUSSION
 SOFTWARE
 REFERENCES
 
In this section, I compare the power of tests to detect a QTL. I consider 2 different test statistics, and different methods of determining critical regions. The first test statistic is 2 ln BF, twice the logarithm of the Bayes factor for a model with a QTL compared against a model with no QTL (computed using (B28) in the online supporting material available at Biostatistics online). The second test statistic is Pmin x L. Multiplying the smallest single-point P value by L achieves a simple Bonferroni correction for multiple testing that makes the critical region independent of L. I compare methods for determining critical regions analytically (by arbitrary or approximate methods) or empirically (from analyses of data sets simulated under a null model). I report the performance of tests with nominal sizes of {alpha} = 0.05 and {alpha} = 0.01 (a more general comparison is made in the online supporting material available at Biostatistics online, Figures 6–8). For each test, I report the estimated true size and power, along with exact 95% binomial confidence intervals for their values. True size was estimated using 500 simulations under the null model with genotype relative risk g = 1, i.e. where risk is independent of genotype at the QTL. Power was estimated using 500 simulations under an alternative with g = 4.

From a Bayesian perspective, 2 ln BF > 0 indicates evidence in favor of the model with a QTL over the model with no QTL. As Table 1 shows, the test with this critical region has quite small size and high power. However, much more simulation work, for different models and combinations of parameters, is required to establish the generality of this result. Also, at least from a classical perspective it is desirable to be able to choose a critical region according to the size (or perhaps power) that is desired.


View this table:
[in this window]
[in a new window]

 
Table 1. Performance of tests to detect a disease QTL. Values in parentheses are 95% confidence intervals for the true size and power

 
An arbitrary critical region is Formula. I say arbitrary because this in fact guarantees nothing about the classical sense error rate, but does bound the Bayesian sense error rate: The posterior probability that there is no QTL is less than {alpha}, Pr(noQTL|data) < {alpha}. This bound assumes that the correct prior for different models is Pr(QTL) = Pr(noQTL), and that the models and the priors for all parameters within the models are all correctly specified. It can be seen from Table 1 that these arbitrary critical regions give tests with true sizes that are generally smaller than {alpha}. (The exception is for {alpha} = 0.01 for design III, and may simply be due to estimating size using only 500 simulations.) Such tests are therefore conservative. Causes may include the simplified model used to compute the BF, the dependence of the BF on the prior specification for the age of the disease QTL, and the fact that this critical region bounds the Bayesian sense error rate rather than controls the classical sense error rate. The use of these arbitrary critical regions Formula entails a loss of power due to the actual size of the test being smaller than intended, so a better method for determining a critical region is desirable.

Assuming goodness of a {chi}2 approximation for the shrunk single-point test statistic (Visscher and Le Hellard, 2003; Simpson and others, 2005; Section 2.3), and using a simple Bonferroni correction, suggests an approximate critical region Pmin x L < {alpha}. These tests have true sizes equal to or smaller than their nominal sizes, which is expected because the Bonferroni correction is conservative. This effect increases in severity as the SNP density increases, because there are a greater number of more positively correlated tests.

These, respectively, arbitrary and approximate methods for determining critical regions are not very accurate, and cannot really be recommended. If such a method is to be used, it would be reasonable to prefer a test based on the 2 ln BF statistic. For data sets simulated using design III, such tests are more powerful. For data sets simulated using designs I and II, there is no clear difference in power between the 2 test statistics, but tests based on 2 ln BF appear to be more conservative.

It is not very meaningful to compare the power of tests with different sizes. Therefore, I used simulations to estimate exact critical regions, so that the power of different tests with the same true size {alpha} could be compared. For these tests, the estimated size is exactly equal to the nominal size because the same set of simulations are used to compute both values. Although the critical region for {alpha} = 0.01 is unlikely to be well estimated using only 500 simulations, by a simple symmetry argument this procedure still allows a fair comparison across the different test statistics. In every case, the test based on 2 ln BF is substantially more powerful than the test based on Pmin x L (Table 1).

By combining simulations in which there is not and is a QTL, we can view test statistics as"classifiers," and ask how well they discriminate between the 2 cases. The receiver operating characteristics (ROC) for the 2 statistics are compared in the online supporting material available at Biostatistics online (Figures 6–8). The ROC curves are equivalent to plotting estimated power (= sensitivity) against size of test (= 1 – specificity) for all possible tests (in fact, only all tests with nondisjoint critical regions). When viewed in this way, it can be seen that the BF-based statistic is uniformly equal to or superior to the minimum P-value-based statistic. The advantage of the BF seems greatest for design I, with low density of SNPs and errors in allele frequency estimation. This may be because when the data set is less informative, it may be more important to have a model-based way to combine information across SNPs.

For comparison, the online supporting material available at Biostatistics online (Figures 6–8) also shows ROC curves for a nonparametric likelihood ratio (NLR) test statistic based on an approach that I have described previously Johnson (2005b). In these simulations, tests based on the NLR are superior to tests based on Pmin x L, and for designs I and II they are not clearly distinguishable from tests based on the BF. The advantage of NLR test statistic is that it can be computed much more quickly than the BF. The disadvantages are that there is no theoretical support for using the NLR, that I have not found a quick way of computing approximate or arbitrary critical values, and that the NLR statistic has no Bayesian interpretation and is not a very good approximation to the BF. In particular, the NLR can never be negative and so can never indicate Bayesian sense evidence in favor of no disease QTL.

Although the BF test statistic does not depend on the prior probabilities for the 2 models (QTL or no QTL), it does depend on the prior distribution for parameters within each model (see, e.g. O'Hagan and Forster(2004), Section 7.17–7.21). To examine typical levels of sensitivity to prior specification, I recomputed the BF for the 1000 data sets simulated using design I. I used 3 alternative log-normal priors for {tau}, assuming ln({tau}) had mean either 5.8, 6.8, or 7.8, respectively, and variance 0.742 in every case. The absolute values of the BFs only changed a little (see online supporting material available at Biostatistics online, Figure 9). This reflects the fact that, for these data sets, the likelihoods have good support for wide ranges of {tau}. There was no discernible effect on test performance or ROC curves (results not shown). Therefore, at least the average classical sense performance of the proposed method seems robust to sensible levels of prior misspecification for {tau}.

Although the simulations described here are adequate for demonstrating the superiority of the BF-based test over the Pmin-based test, we should be cautious about extrapolating from the current results. In particular, it seems that the arbitrary critical region Formula becomes less conservative as SNP density increases, and the approximate (Bonferroni) critical region Pmin x L < {alpha} becomes more conservative as SNP density increases. In a real situation, a critical region should be determined using simulations conditioned on as many ancilliary statistics of the observed data as possible, which obviously include sample size and SNP density, although for complex models it may be a matter of guesswork which other statistics are approximately ancilliary. An approach that could be most useful in practice is a variant of the permutation test of Churchill and Doerge (1994). This could be applied if there were matched pairs of pools of cases and controls, and each pair was genotyped in separate DNA-pooling experiments (Law and others, 2004; Shiffman and others, 2004). Then the phenotype labels could be permuted within each pair, giving a set of equiprobable values for any test statistic under the null hypothesis. Such an approach could not be explored here because it would require too much computation.


    3.2 Estimation of QTL position
 TOP
 SUMMARY
 1. INTRODUCTION
 2. METHODS
 2.2 Method for Bayesian...
 2.3 Single-point analyses
 3. RESULTS
 3.1 Power to detect...
 3.2 Estimation of QTL...
 3.3 Effect of experimental...
 3.4 Application to real...
 4. DISCUSSION
 SOFTWARE
 REFERENCES
 
Figures 1 and 2 show analyses of 4 randomly chosen simulated data sets with QTLs (g = 4). These illustrate the fact that these data sets contain only weak information about the position of the QTL (or at least that the Bayesian method proposed here only extracts weak information). It nonetheless seems worthwhile to examine how much information is present.


Figure 1
View larger version (12K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. Example simulated data sets with g = 4 and where allele frequencies are known exactly. Points are – log 10(P) for single-point {chi}2 tests. Dotted lines are posterior density and solid lines are posterior density with the flattening correction of McPeek and Strahs (1999) (which is not visibly different for the third data set). Vertical dashed lines show position of disease QTL.

 

Figure 2
View larger version (13K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. The same simulated data sets as shown in Figure 1, but with errors in allele frequency estimation with {sigma} = 0.2 PCR cycles and r = 2 experimental replicates. Points are – log 10(P) for single-point shrunk (Visscher and Le Hellard, 2003) {chi}2 tests. Dotted lines are posterior density and solid lines are posterior density with the flattening correction of McPeek and Strahs (1999). Vertical dashed lines show position of disease QTL.

 
The performance of different methods for estimating the position of the QTL was assessed using the 500 simulations with g = 4, for each of the 3 experimental designs considered. Due to the nature of the simulations performed here, the errors reported are averaged over the distribution of the true value of µ. They are therefore not classical expected losses in the strict sense, but expected losses averaged with respect to a distribution of parameter values. Bayesian point estimators have uniquely best performance when measured in this way (O'Hagan and Forster, 2004, Chapter 5); the theory requires that the model and prior are both correct. In particular, under squared error losses, the average expected loss is minimized by the posterior mean, and under absolute error losses, the average expected loss is minimized by the posterior median. As shown in Table 2, point estimators derived from the posterior computed using the Bayesian method described here are superior to Formula, the map location of the marker with the smallest P value. For design I, with low SNP density and errors in allele frequency estimation, the Baysian analysis produces a 18% reduction in root average mean squared error and a 11% reduction in average mean absolute error. For design II, when allele frequencies are known exactly, the figures are similar, 21% and 11%, respectively. For design III, with high SNP density, the relative improvements are greater, 33% and 28%, respectively. The nonparametric method developed previously by me Johnson (2005b) produces point estimates that are competitive with the Bayesian method under squared error losses, for designs I and II.


View this table:
[in this window]
[in a new window]

 
Table 2. Performance of point estimators of QTL position

 
Figure 3 shows coverage frequencies estimated using the 500 data sets simulated with g = 4. The coverage of credibility intervals constructed from the marginal posterior for µ falls well below nominal levels. This suggests strongly that the simple model at the core of the proposed Bayesian method is not a good approximation to the more realistic model the data were simulated under. One way to improve the model would be to allow a more realistic model for the shape of the genealogy at the QTL. This can be achieved, in a very ad-hoc way, by applying a "flattening correction" based on a result derived by (1999, see Section 2.2). Here, this was applied to the posterior density for disease QTL position by raising all values of density to a power wn (which depends on other variables in the analysis), and then renormalizing. For the simulations performed here, wn had median 0.56 and interquartile range 0.48–0.63. When this procedure was applied, the agreement between nominal and achieved coverage was excellent for designs I and II (Figure 3), suggesting that at low SNP density the most serious misspecification of the current model is the assumption of a star-shaped genealogy. The agreement between nominal and achieved coverage was somewhat poorer for design III (Figure 3). This suggests that at high SNP density, the assumption of linkage equilibrium outside blocks of ancestral haplotype becomes an important aspect of model misspecification.


Figure 3
View larger version (11K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3. Nominal and achieved coverage of credibility intervals for position of QTL. Highest posterior density credibility intervals were constructed from the marginal posterior, either without (dotted lines) or with (solid lines) the flattening correction of McPeek and Strahs (1999). Dashed line shows equality between nominal and achieved coverage.

 

    3.3 Effect of experimental design
 TOP
 SUMMARY
 1. INTRODUCTION
 2. METHODS
 2.2 Method for Bayesian...
 2.3 Single-point analyses
 3. RESULTS
 3.1 Power to detect...
 3.2 Estimation of QTL...
 3.3 Effect of experimental...
 3.4 Application to real...
 4. DISCUSSION
 SOFTWARE
 REFERENCES
 
Although only 3 experimental designs were considered here, a tentative conclusion can be drawn. For the parameter values chosen here, it would be better to devote experimental resources to genotyping a larger number of SNPs (design III versus design I) than replicate genotyping of a smaller number of SNPs (design II versusdesign I). This result obviously depends on the magnitude of experimental errors assumed for design I, and so may not be general. It also assumes that the data are analyzed using a correctly calibrated error model. The results do show that errors in allele frequency estimation are not a problem in themselves, and that if an appropriate analysis is performed, then errors in allele frequency estimation can be more than compensated for by genotyping at greater density.


    3.4 Application to real data
 TOP
 SUMMARY
 1. INTRODUCTION
 2. METHODS
 2.2 Method for Bayesian...
 2.3 Single-point analyses
 3. RESULTS
 3.1 Power to detect...
 3.2 Estimation of QTL...
 3.3 Effect of experimental...
 3.4 Application to real...
 4. DISCUSSION
 SOFTWARE
 REFERENCES
 
It is not really possible to examine the effectiveness of the Bayesian method described here on real data, due to a lack of relevant published data sets. Primarily for the purpose of comparison with other fine-scale mapping methods, I have applied it to the data set of Hosking and others (2002) and to quasi-synthetic data sets generated from that data set. Hosking and others (2002) collected data using individual genotyping. In order to pretend that the data were acquired using DNA pooling, I use a hypergeometric error model to relate the observed counts with missing data to the underlying full data that were not observed. This assumes that the data are missing at random within and across SNPs.

To my knowledge, no fine-scale mapping method has been published that does not perform well on the data of Hosking and others (2002). Therefore, observing that the present method performs acceptably, as shown in Figure 4, is not necessarily encouraging. To simulate a disease with a complex genetic basis, I generated 3 data sets by randomly relabeling controls as cases with probability 10%, 20%, or 30%. As shown in Figure 4, on all 4 data sets 95% credibility intervals covered the true location of the CYP2D6 gene after the flattening correction of (1999, see Section 2.2) had been applied. This provides weak evidence that the method developed here may be reliable for mapping QTLs from real data.


Figure 4
View larger version (12K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4. Analysis of data of (top panel Hosking and others(2002)), and quasi-synthetic data sets generated by randomly relabeling controls as cases with probability 10%, 20%, or 30% (lower 3 panels, top to bottom). Points are – log 10(P) for single-point {chi}2 tests, and dotted and solid lines are posterior density for disease gene position, without and with the flattening correction of McPeek and Strahs (1999). Vertical dashed lines show the true position of CYP2D6.

 

    4. DISCUSSION
 TOP
 SUMMARY
 1. INTRODUCTION
 2. METHODS
 2.2 Method for Bayesian...
 2.3 Single-point analyses
 3. RESULTS
 3.1 Power to detect...
 3.2 Estimation of QTL...
 3.3 Effect of experimental...
 3.4 Application to real...
 4. DISCUSSION
 SOFTWARE
 REFERENCES
 
In this paper, I have described and tested a Bayesian method for detecting and mapping a QTL, using multilocus data collected using DNA pooling within 2 trait groups.

Relatively recently, likelihood-based fine-scale mapping methods have been developed for genotype data that build on previous haplotype-based analyses by treating the unobserved haplotypes as missing data and integrating over all possible haplotypes that are consistent with the observed genotypes. This integration can be performed either using MCMC (Liu and others, 2001; Reeve and Rannala, 2002; Morris and others, 2003) or using exact numerical methods for HMMs (Zhang and Zhao, 2002). Data from pooled DNA are estimated counts of alleles at each locus with no phase information. Fine-scale mapping from genotype data and pooled DNA can in theory be regarded as closely related missing data problems. The approach taken in this paper combines elements of the approaches of Zhang and Zhao (2002), Morris and others (2000) and Liu and others (2001). Like Zhang and Zhao (2002), I use a model that is sufficiently simple that I can use HMM methods to sum over all possible haplotypes that are consistent with the observed data. However, after computing the likelihood using a propagation algorithm, Zhang and Zhao (2002) then maximize that likelihood with respect to the remaining model parameters. In contrast and like Morris and others (2000) and Liu and others (2001), I embed the HMM within a fully Bayesian approach and compute posterior probability distributions for the quantities of interest.

One advantage of a Bayesian approach is that probability statements can be made directly about quantities of interest. For example, we can state the probability that there is a QTL in any given region, including the whole region under study. Thus, mapping and detecting a QTL are intimately related aspects of the same analysis. They are different inferences that are made from the same posterior probability distribution. Within the Bayesian framework, there is no need to choose between a bewildering array of estimators, test statistics, and methods for correcting for multiple testing; the approach has a pleasing simplicity, at least conceptually.

However, the probabilities computed in a Bayesian analysis are only meaningful if the model and prior are realistic. The Catch-22 is that in order to compute Baysian posterior probabilities, I had to assume a model that was worringly oversimplified and not very believable. The present work is therefore best regarded as a step toward Bayesian analysis of data collected using DNA pooling. It may be helpful to draw parallels with methods for analysis of genotype data (collected using individual genotyping). Sadly, the present method allows us to make inferences assuming a model less elaborate than the one of Morris and others (2000), whereas we might aspire to being able to assume a model like the one of Morris and others (2002) or Zollner and Pritchard (2005). However, Bayesian analysis of such realistic models has required MCMC to integrate over high-dimensional spaces of auxiliary variables or missing data. Such computationally intensive approaches may have difficulty in handling large data sets. In contrast, the method described here is relatively fast, and large data sets could be analyzed with realistic computational resources. For example, 27 processor-days would be required to analyze data from a whole-genome scan with 100 cases, 500 000 SNPs, and evaluation of the posterior at points 50 kb apart. In contrast, Zollner and Pritchard (2005) estimate that their MCMC-based algorithm for data from individual genotypings would take 85 processor-years for the same scale of analysis. A further advantage of avoiding Monte Carlo methods is that the large numbers of analyses needed for a sliding window analysis, or a permutation test, can be performed without needing human intervention to adjust mixing parameters or monitor convergence. Finally and perhaps most significantly, I am able to compute a BF to compare models in which there is, and is not, a disease QTL in the whole region of interest. To my knowledge, no association mapping method using genotype data is able to do this, although Patterson and others (2004) are able to compute a BF for "admixture" mapping using genotype data.

There is a Bayesian justification for the present method. (This is the best model for which a Bayesian analysis of data from pooled DNA is currently possible.) However, serious concerns about model inadequacy (Well, that model simply isn't good enough!) mean that, in this paper, I have mostly focused on a classical frequentist justification. Using simulations assuming a more realistic model, I have shown that the present method is uniformly superior to classical single-point methods of analysis. The simulation results demonstrate that the BF computed using the present method makes a more powerful test for the presence of a QTL than the minimum P value from single-point tests, that the posterior density for the position of the QTL leads to a better point estimator than the position of the marker with the minimum P value, and that well-calibrated credibility intervals can be derived from the posterior density for the position of the QTL, after applying a heuristic flattening correction (McPeek and Strahs, 1999,see Section 2.2).

Single-point methods are the most obvious way to analyze data collected using DNA pooling, although composite likelihood (CL) methods (Terwilliger, 1995; Xiong and Guo, 1997; Collins and Morton, 1998; Maniatis and others, 2004, 2005) could also be used. The performance of CL methods was not examined here because no CL method has been developed that allows errors in allele frequency estimation, no CL method assumes a model more realistic than the one assumed here, and CL methods do not produce well-calibrated confidence intervals. Perhaps because of this, Maniatis and others (2005) state that "the main objective in positional cloning is to estimate the kb location of a causal SNP as accurately as possible, with its support interval an important but secondary objective." However, it seems to me that we should focus on methods for computing well-calibrated credibility intervals, and ideally a well-calibrated posterior density. The acid test is to ask whether a statistical method informs us about what is a good action or decision to be taken subsequently. A point estimate for QTL position, without a reliable measure of precision, is not very helpful for planning future experiments to further refine the position of that QTL.

Underlying the Bayesian approach proposed here is a likelihood function derived from a model in which a number of simplifications made, as detailed in the online supporting material available at Biostatistics online (Appendix A). Given these simplifications, readers familiar with the more realistic models currently used to analyze individual genotype data may wonder why the method proposed here works at all. The short answer is that the simplified model seems adequate for approximating the likelihood function for the marginal observation obtained from genotyping pooled DNA, even though it does not adequately approximate the likelihood function for individual genotype data. Further discussion of why this might be the case, and how a Bayesian analysis assuming a more realistic method might be achieved, can be found in the longer preprint version of this paper (Johnson, 2005a).

For the parameters chosen for the simulations performed here, the benefits of the present Bayesian method are somewhat modest. It remains unclear whether there would be larger benefits for other values of the simulation parameters, in particular more SNPs in the data set, and/or larger benefits from a Baysian analysis with a more realistic model. Clarification of both points awaits access to substantial computational resources.

One feature of the posteriors computed using the present method (and especially after the flattening correction of McPeek and Strahs, 1999) is that they are very heavy tailed, and so large credibility intervals (99%, 99.9%) tend to be very wide, perhaps almost as wide as credibility intervals computed from the prior! This suggests that, if a multistage fine-scale mapping experiment was conducted using DNA pooling, we would not be making radical reductions in the size of the region under study at each stage, but rather would be increasing the density of markers in some regions more than others after each stage of analysis.

The simulated data sets studied here were generated under a model that lacks realism in several respects. For example, in simulating errors in allele frequency estimation, I have ignored differential amplification of the 2 alleles, which may cause estimates of allele frequencies obtained using pooled DNA to be biased. This manifests itself as only a second-order effect on the difference in allele frequency between case and control pools (Visscher and Le Hellard, 2003)). Data about differential amplification can be accomodated easily in the present method of analysis. Even if no data from heterozygotes are available, it is possible to use an error model computed by integrating over a distribution of differential amplification constants, following the approach of Moskvina and others (2005).

The simulations studied here assume a uniform recombination rate of 1 cM/Mb over a 1-Mb region, which is not realistic (e.g. Jeffreys and others, 2001; McVean and others, 2004). However, since the method is parameterized in terms of genetic rather than physical map positions of the marker SNPs, it might be expected to work reasonably well if positions were instead measured on a map of population recombination rates, such as the one estimated by Myers and others (2005).

Another important difference between real studies and the simulations reported here is that, in any real study, the SNPs used would have been ascertained and would have been selected for high information content. Possible criteria for selection include SNP call and error rates in individual genotyping, allele frequency, performance of genotyping assays on pooled DNA, functional information, and selection of tagging SNPs (e.g. Carlson and others (2003). It is likely that the relative performances of different tests for the presence of a QTL will depend on how SNPs are selected. The simulation results reported here describe the performance of different tests when the SNPs are selected essentially at random, requiring them only to have minor allele frequency greater than 5% in a sample of ~200 control genotypes. Their primary value is in demonstrating that the Bayesian method developed here works, rather than that it is necessarily best for all experimental designs. It is clear that a great deal of further work, studying both simulated and real data sets, is needed before we can identify the most cost-effective experimental design.


    SOFTWARE
 TOP
 SUMMARY
 1. INTRODUCTION
 2. METHODS
 2.2 Method for Bayesian...
 2.3 Single-point analyses
 3. RESULTS
 3.1 Power to detect...
 3.2 Estimation of QTL...
 3.3 Effect of experimental...
 3.4 Application to real...
 4. DISCUSSION
 SOFTWARE
 REFERENCES
 
A software package implementing the methods described here is available from the web site http://homepages.ed.ac.uk/tobyj/software/. Source code is available and the package can be distributed freely under the terms of the GNU general public licence (Free Software Foundation, 1991).


    ACKNOWLEDGMENTS
 
This work was supported by Biotechnology and Biological Sciences Research Council grant number 206/D16977 to Kevin Dawson. I am grateful to several referees for helpful comments on the manuscript. I am especially grateful to the referee who found an error in the previous version of equation (B13). Conflict of Interest: None declared.


    REFERENCES
 TOP
 SUMMARY
 1. INTRODUCTION
 2. METHODS
 2.2 Method for Bayesian...
 2.3 Single-point analyses
 3. RESULTS
 3.1 Power to detect...
 3.2 Estimation of QTL...
 3.3 Effect of experimental...
 3.4 Application to real...
 4. DISCUSSION
 SOFTWARE
 REFERENCES
 

    Arnheim N, Strange C, Erlich H. Use of pooled DNA samples to detect linkage disequilibrium of polymorphic restriction fragments and human disease: studies of the HLA class II loci. Proceedings of the National Academy of Sciences of the United States of America (1985) 82:6970–6974.[Abstract/Free Full Text]

    Bader JS, Bansal A, Sham P. Efficient SNP-based tests of association for quantitative phenotypes using pooled DNA. GeneScreen (2001) 1:143–150.[CrossRef]

    Barcellos LF, Klitz W, Field LL, Tobias R, Bowcock AM, Wilson R, Nelson MP, Nagatomi J, Thomson G. Association mapping of disease loci, by use of a pooled DNA genomic screen. American Journal of Human Genetics (1997) 61:734–747.[Web of Science][Medline]

    Brohede J, Dunne R, McKay JD, Hannan GN. PPC: an algorithm for accurate estimation of SNP allele frequencies in small equimolar pools of DNA using data from high density microarrays. Nucleic Acids Research (2005) 33, e142.

    Butcher LM, Meaburn E, Knight J, Sham PC, Schalkwyk LC, Craig IW, Plomin R. SNPs, microarrays and pooled DNA: identification of four loci associated with mild mental impairment in a sample of 6000 children. Human Molecular Genetics (2005) 14:1315–1325.[Abstract/Free Full Text]

    Carlson CS, Eberle MA, Rieder MJ, Smith JD, Kruglyak L, Nickerson DA. Additional SNPs and linkage-disequilibrium analyses are necessary for whole-genome association studies in humans. Nature Genetics (2003) 33:518–521.[CrossRef][Web of Science][Medline]

    Churchill GA, Doerge RW. Empirical threshold values for quantitative trait mapping. Genetics (1994) 138:963–971.[Abstract]

    Clayton D. Population association. In: Handbook of Statistical Genetics—Balding DJ, Bishop M, Cannings C, eds. (2001) Chichester, UK: Wiley. 519–540.

    Collins A, Morton NE. Mapping a disease locus by allelic association. Proceedings of the National Academy of Sciences of the United States of America (1998) 95:1741–1745.[Abstract/Free Full Text]

    Darvasi A, Soller M. Selective DNA pooling for determination of linkage between a molecular marker and a quantitative trait locus. Genetics (1994) 138:1365–1373.[Abstract]

    FREE SOFTWARE FOUNDATION. GNU general public license. (1991) http://www.gnu.org/licenses/gpl.html. Accessed 3 May 2005.

    Germer S, Holland MJ, Higuchi R. High-throughput SNP allele-frequency determination in pooled DNA samples by kinetic PCR. Genome Research (2000) 10:258–266.[Abstract/Free Full Text]

    Hinds DA, Seymour AB, Durham LK, Banerjee P, Ballinger DG, Milos PM, Cox DR, Thompson JF, Frazer KA. Application of pooled genotyping to scan candidate regions for association with HDL cholesterol levels. Human Genomics (2004) 1:421–434.[Medline]

    Hosking LK, Boyd PR, Xu CF, Nissum M, Cantone K, Purvis IJ, Khakhar R, Barnes MR, Liberwirth U, Hagen-Mann K, and others. Linkage disequilibrium mapping identifies a 390 Kb region associated with CYP2D6 poor drug metabolising activity. The Pharmacogenomics Journal (2002) 2:165–175.[CrossRef][Medline]

    Hudson RR. Properties of a neutral allele model with intragenic recombination. Theoretical Population Biology (1983) 23:183–201.[CrossRef][Web of Science][Medline]

    Hudson RR. Generating samples under a Wright–Fisher neutral model of genetic variation. Bioinformatics (2002) 18:337–338.[Abstract/Free Full Text]

    Jeffreys AJ, Kauppi L, Neumann R. Intensely punctate meiotic recombination in the class II region of the major histocompatibility complex. Nature Genetics (2001) 29:217–222.[CrossRef][Web of Science][Medline]

    Johnson T. Bayesian method for disease QTL detection and mapping, using a case and control design and DNA pooling. Preprint (2005a) http://www.arxiv.org/abs/q-bio.GN/0507018. Accessed 13 July 2005.

    Johnson T. Multipoint linkage disequilibrium mapping using multilocus allele frequency data. Annals of Human Genetics (2005b) 69:474–498.[CrossRef][Web of Science][Medline]

    Kaplan N, Morris R. Issues concerning association studies for fine mapping a susceptibility gene for a complex disease. Genetic Epidemiology (2001a) 20:432–457.[CrossRef][Web of Science][Medline]

    Kaplan N, Morris R. Prospects for association-based fine mapping of a susceptibility gene for a complex disease. Theoretical Population Biology (2001b) 60:181–191.[CrossRef][Web of Science][Medline]

    Law GR, Rollinson S, Feltbower R, Allan JM, Morgan GJ, Roman E. Application of DNA pooling to large studies of disease. Statistics in Medicine (2004) 23:3841–3850.[CrossRef][Web of Science][Medline]

    Le Hellard S, Ballereau SJ, Visscher PM, Torrance HS, Pinson J, Morris SW, Thomson ML, Semple CAM, Muir WJ, Blackwood DHR, and others. SNP genotyping on pooled DNAs: comparison of genotyping technologies and a semi automated method for data storage and analysis. Nucleic Acids Research (2002) 30, e74.

    Li N, Stephens M. Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics (2003) 165:2213–2233.[Abstract/Free Full Text]

    Liu JS, Sabatti C, Teng J, Keats BJ, Risch N. Bayesian analysis of haplotypes for linkage disequilibrium mapping. Genome Research (2001) 11:1716–1724.[Abstract/Free Full Text]

    Maniatis N, Collins A, Gibson J, Zhang W, Tapper W, Morton NE. Positional cloning by linkage disequilibrium. American Journal of Human Genetics (2004) 74:846–855.[CrossRef][Web of Science][Medline]

    Maniatis N, Morton NE, Gibson J, Xu C-F, Hosking LK, Collins A. The optimal measure of linkage disequilibrium reduces error in association mapping of affection status. Human Molecular Genetics (2005) 14:145–153.[Abstract/Free Full Text]

    McPeek MS, Strahs A. Assessment of linkage disequilibrium by the decay of haplotype sharing, with application to fine-scale genetic mapping. American Journal of Human Genetics (1999) 65:858–875.[CrossRef][Web of Science][Medline]

    McVean GAT, Myers SR, Hunt S, Deloukas P, Bentley DR, Donnelly P. The fine-scale structure of recombination rate variation in the human genome. Science (2004) 204:581–584.

    Meaburn E, Butcher LM, Liu L, Fernandes C, Hansen V, Al-Chalabi A, Plomin R, Craig I, Schalkwyk LC. Genotyping DNA pools on microarrays: tackling the QTL problem of large samples and large numbers of SNPs. BMC Genomics (2005) 6:52.[CrossRef][Medline]

    Morris AP, Whittaker JC, Balding DJ. Bayesian fine-scale mapping of disease loci, by hidden Markov models. American Journal of Human Genetics (2000) 67:155–169.[CrossRef][Web of Science][Medline]

    Morris AP, Whittaker JC, Balding DJ. Fine-scale mapping of disease loci via shattered coalescent modeling of genealogies. American Journal of Human Genetics (2002) 70:686–707.[CrossRef][Web of Science][Medline]

    Morris AP, Whittaker JC, Balding DJ. Little loss of information due to unknown phase for fine-scale linkage-disequilibrium mapping with single-nucleotide-polymorphism genotype data. American Journal of Human Genetics (2004) 74:945–953.[CrossRef][Web of Science][Medline]

    Morris AP, Whittaker JC, Xu C-F, Hosking LK, Balding DJ. Multipoint linkage-disequilibrium mapping narrows location interval and identifies mutation. Proceedings of the National Academy of Sciences of the United States of America (2003) 100:13442–13446.[Abstract/Free Full Text]

    Moskvina V, Norton N, Williams N, Holmans P, Owen M, O'Donovan M. Streamlined analysis of pooled genotype data in SNP-based association studies. Genetic Epidemiology (2005) 28:273–282.[CrossRef][Web of Science][Medline]

    Mott R, Talbot CJ, Turri MG, Collins AC, Flint J. A method for fine mapping quantitative trait loci in outbred animal stocks. Proceedings of the National Academy of Sciences of the United States of America (2000) 97:12649–12654.[Abstract/Free Full Text]

    Myers S, Bottolo L, Freeman C, McVean G, Donnelly P. A fine-scale map of recombination rates and hotspots across the human genome. Science (2005) 310:321–324.[Abstract/Free Full Text]

    Norton N, Williams NM, O'Donovan MC, Owen MJ. DNA pooling as a tool for large-scale association studies in complex traits. Annals of Medicine (2004) 36:146–152.[CrossRef][Web of Science][Medline]

    O'Hagan A, Forster J. Kendall's Advanced Theory of Statistics, Volume 2B: Bayesian Inference (2004) 2nd edition. London: Arnold.

    Patterson N, Hattangadi N, Lane B, Lohmueller KE, Hafler DA, Oksenberg JR, Hauser SL, Smith MM, O'Brien SJ, Altshuler D, and others. Methods for high-density admixture mapping of disease genes. American Journal of Human Genetics (2004) 74:979–1000.[CrossRef][Web of Science][Medline]

    Prentice RL, Qi L. Aspects of the design and analysis of high-dimensional SNP studies for disease risk estimation. Biostatistics (2006) 7:339–354.[Abstract/Free Full Text]

    Reeve JP, Rannala B. DMLE+: Bayesian linkage disequilibrium gene mapping. Bioinformatics (2002) 18:894–895.[Abstract/Free Full Text]

    Risch N, Merikangas K. The future of genetic studies of complex human diseases. Science (1996) 273:1516–1517.[Abstract/Free Full Text]

    Risch N, Teng J. The relative power of family-based and case-control designs for linkage disequilibrium studies of complex human diseases. I. DNA pooling. Genome Research (1998) 8:1273–1288.[Abstract/Free Full Text]

    Risch NJ. Searching for genetic determinants in the new millennium. Nature (2000) 405:847–856.[CrossRef][Medline]

    Sawcer S, Compston A. The genetic analysis of multiple sclerosis in Europeans: concepts and design. Journal of Neuroimmunology (2003) 143:13–16.[CrossRef][Web of Science][Medline]

    Sham P, Bader JS, Craig I, O'Donovan M, Owen M. DNA pooling: a tool for large-scale association studies. Nature Reviews Genetics (2002) 3:862–871.[CrossRef][Web of Science][Medline]

    Shiffman D, Luke MM, Lakoubova OA, Pullinger CR, Aouizerat BE, Zellner CA, Ports TA, Michaels AD, Drew DW, Catanese JJ, and others. Novel genetic markers associated with myocardial infarction: a genomic scale scan of putative functional polymorphisms. Poster from American College of Cardiology Meeting (2004) http://www.celera.com/pdf/ACC04_poster_final_Page2.pdf. Accessed 7 September 2004.

    Simpson CL, Knight J, Butcher LM, Hansen VK, Meaburn E, Schalkwyk LC, Craig IW, Powell JF, Sham PC, Al-Chalabi A. A central resource for accurate allele frequency estimation from pooled DNA genotyped on DNA microarrays. Nucleic Acids Research (2005) 33:e25.[Abstract/Free Full Text]

    Terwilliger J. A powerful likelihood method for the analysis of linkage disequilibrium between trait loci and one or more polymorphic marker loci. American Journal of Human Genetics (1995) 56:777–787.[Web of Science][Medline]

    Thomas DC, Haile RW, Duggan D. Recent developments in genomewide association scans: a workshop summary and review. American Journal of Human Genetics (2005) 77:337–345.[CrossRef][Web of Science][Medline]

    Visscher PM, Le Hellard S. Simple method to analyze SNP-based association studies using DNA pools. Genetic Epidemiology (2003) 24:291–296.[CrossRef][Web of Science][Medline]

    Waldron ERB, Whittaker JC, Balding DJ. Fine mapping of disease genes via haplotype clustering. Genetic Epidemiology (2006) 30:170–179.[CrossRef][Web of Science][Medline]

    Xiong M, Guo S-W. Fine-scale genetic mapping based on linkage disequilibrium: theory and applications. American Journal of Human Genetics (1997) 60:1513–1531.[Web of Science][Medline]

    Zhang S, Zhao H. Linkage disequilibrium mapping in populations of variable size using the decay of haplotype sharing and a stepwise-mutation model. Genetic Epidemiology (2000) 9:S99–S105.

    Zhang S, Zhao H. Linkage disequilibrium mapping with genotype data. Genetic Epidemiology (2002) 22:66–77.[CrossRef][Web of Science][Medline]

    Zollner S, Pritchard JK. Coalescent-based association mapping and fine mapping of complex trait loci. Genetics (2005) 169:1071–1092.[Abstract/Free Full Text]

    Zubenko GS, Hughes HB, Stiffler JS, Zubenko WN, Kaplan BB. Genome survey for susceptibility loci for recurrent, early-onset major depression: results at 10cM resolution. American Journal of Medical Genetics (2002) 114:413–422.[CrossRef][Web of Science][Medline]

    Received January 11, 2006; revised July 11, 2006; revised August 24, 2006; accepted for publication September 13, 2006.


    Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


    This article has been cited by other articles:


    Home page
    BioinformaticsHome page
    N. Homer, W. D. Tembe, S. Szelinger, M. Redman, D. A. Stephan, J. V. Pearson, S. F. Nelson, and D. Craig
    Multimarker analysis and imputation of multiple platform pooling-based genome-wide association studies
    Bioinformatics, September 1, 2008; 24(17): 1896 - 1902.
    [Abstract] [Full Text] [PDF]


    This Article
    Right arrow Abstract Freely available
    Right arrow FREE Full Text (PDF) Freely available
    Right arrow Supplementary Material
    Right arrow All Versions of this Article:
    8/3/546    most recent
    kxl028v2
    kxl028v1
    Right arrow Alert me when this article is cited
    Right arrow Alert me if a correction is posted
    Services
    Right arrow Email this article to a friend
    Right arrow Similar articles in this journal
    Right arrow Similar articles in PubMed
    Right arrow Alert me to new issues of the journal
    Right arrow Add to My Personal Archive
    Right arrow Download to citation manager
    Right arrowRequest Permissions
    Right arrow Disclaimer
    Google Scholar
    Right arrow Articles by Johnson, T.
    Right arrow Search for Related Content
    PubMed
    Right arrow PubMed Citation
    Right arrow Articles by Johnson, T.
    Social Bookmarking
     Add to CiteULike   Add to Connotea   Add to Del.icio.us  
    What's this?