Research Article

A Flexible and Accurate Genotype Imputation Method for the Next Generation of Genome-Wide Association Studies

  • Bryan N. Howie,

    Affiliation: Department of Statistics, University of Oxford, Oxford, United Kingdom

    Current address: Department of Human Genetics, University of Chicago, Chicago, Illinois, United States of America

  • Peter Donnelly,

    Affiliations: Department of Statistics, University of Oxford, Oxford, United Kingdom, Wellcome Trust Centre for Human Genetics, Roosevelt Drive, Oxford, United Kingdom

  • Jonathan Marchini

    Affiliation: Department of Statistics, University of Oxford, Oxford, United Kingdom

  • Published: June 19, 2009
  • DOI: 10.1371/journal.pgen.1000529

Reader Comments (4)

Conclusions regarding BEAGLE

Posted by SharonBrowning on 16 Jul 2009 at 14:11 GMT

The authors of this paper have slightly misunderstood BEAGLE 3.0, and this affects their conclusions regarding the program.

The paper states: “When BEAGLE encounters multiple reference panels, as in Scenario B, it simply downweights the less complete panels during the burn-in stage of its model-fitting procedure. Specifically, every individual in the dataset is assigned a weight that reflects the completeness of that individual's genotypes – individuals with more missing data get lower weights, and therefore have less influence on the early steps of the model-fitting algorithm.” (end of page 7, in Results: Scenario B: Modeling considerations).
In fact, this is only partly true. BEAGLE assigns each individual to one of two categories: reference panel (< 7% missing genotype data) or sample (> 7% missing genotype data). Individuals in the sample category receive lower weights in the early steps of the model-fitting algorithm. When there are two levels of missing data, as in Scenario B of this paper, the less complete reference panel will not be treated as a reference with respect to the weighting scheme. Thus it is not surprising that BEAGLE performs better with the restricted data sets than with the full Scenario B data set.
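The effect of the 7% cutoff on Scenario B can be illustrated with a small sketch. The Python below is not BEAGLE's code: the function names, the use of None to encode a missing genotype, and the handling of a missing fraction of exactly 7% (which the description above leaves unspecified) are all assumptions made for illustration.

```python
# Illustrative sketch only; not taken from BEAGLE's source code.
MISSING_CUTOFF = 0.07  # the 7% cutoff described in the comment


def missing_fraction(genotypes):
    """Fraction of genotypes that are missing (None is an assumed encoding)."""
    if not genotypes:
        raise ValueError("no genotypes supplied")
    return sum(g is None for g in genotypes) / len(genotypes)


def category(genotypes, cutoff=MISSING_CUTOFF):
    """Return 'reference' below the cutoff, otherwise 'sample'.

    Assumption: exactly 7% missing is treated as 'sample'; the comment
    does not say how that boundary case is handled.
    """
    return "reference" if missing_fraction(genotypes) < cutoff else "sample"


# Hypothetical individuals typed at 100 markers.
complete_panel_member = ["AA"] * 100              # 0% missing
sparse_panel_member = ["AA"] * 40 + [None] * 60   # 60% missing

print(category(complete_panel_member))  # reference
print(category(sparse_panel_member))    # sample
```

Under such a rule, an individual from the less complete panel in Scenario B lands in the same category as the study sample, and so would receive the lower early-stage weights.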

There is a simple alternative procedure for performing imputation when two nested reference panels are available (Scenario B). Run two analyses: one analysis with markers present in both reference panels using both reference panels, and one analysis with markers present in only one reference panel using only that reference panel. The additional effort required to run two analyses is negligible in the context of a genome-wide association study. For BEAGLE, we recommend this two-part procedure for nested reference panels because it generally gives faster, more accurate imputation than a single analysis using nested reference panels (as demonstrated in this paper).
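The marker bookkeeping behind this two-part procedure can be sketched as follows. This is a minimal Python illustration, not part of BEAGLE: the function name, the dictionary keys, and the marker IDs are hypothetical, and "nested" is assumed to mean that every marker in the sparser panel also appears in the denser panel.

```python
# Illustrative sketch only; names and marker IDs are hypothetical.
def plan_two_part_analysis(dense_panel_markers, sparse_panel_markers):
    """Split markers into the two analyses described in the comment."""
    dense = set(dense_panel_markers)
    sparse = set(sparse_panel_markers)
    # Assumed meaning of "nested": the sparse panel's markers are a
    # subset of the dense panel's markers.
    if not sparse <= dense:
        raise ValueError("panels are not nested")
    return {
        # Analysis 1: markers in both panels, run with both panels.
        "both_panels": sorted(dense & sparse),
        # Analysis 2: markers in one panel only, run with that panel alone.
        "dense_panel_only": sorted(dense - sparse),
    }


plan = plan_two_part_analysis(
    dense_panel_markers=["rs1", "rs2", "rs3", "rs4"],
    sparse_panel_markers=["rs2", "rs4"],
)
print(plan["both_panels"])       # ['rs2', 'rs4']
print(plan["dense_panel_only"])  # ['rs1', 'rs3']
```

Each marker list would then be passed to its own imputation run, so that no single run mixes reference individuals with very different levels of missing data.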

We would also like to point out that the statement that BEAGLE required “2.5 GB per 7.5 Mb region of the genome” (page 12, in Results: Scenario B: Computational requirements) could be misleading, because the memory requirements flatten out as the region size increases. In our experience, BEAGLE will perform imputation in the same set of samples in a 245 Mb region (all of chromosome 1) using 3 GB of memory with default options, or 2 GB of memory with the low-memory option. Also, the low-memory option increases computing time by at most a factor of 2, so even with this option BEAGLE will still be considerably faster than the most accurate competing methods.
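A quick calculation shows why the quoted figure could mislead a reader who scales it linearly with region size. The linear extrapolation below is a hypothetical reading used only for illustration (neither the paper nor this comment makes that claim); the numbers themselves are the ones quoted above.

```python
# Figures quoted in the comment above.
quoted_gb = 2.5          # memory reported in the paper
quoted_region_mb = 7.5   # region size for that figure
chr1_region_mb = 245     # all of chromosome 1
reported_chr1_gb = 3.0   # usage reported here with default options

# Hypothetical naive assumption: memory grows in proportion to region size.
naive_gb = quoted_gb * chr1_region_mb / quoted_region_mb

print(f"naive linear estimate: {naive_gb:.1f} GB")  # naive linear estimate: 81.7 GB
print(f"reported usage: {reported_chr1_gb:.1f} GB")  # reported usage: 3.0 GB
```

The gap between the two numbers is what "flatten out" refers to: the reported chromosome-wide usage is far below what proportional scaling would suggest.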

No competing interests declared.