Research Article

Reconstructing Native American Migrations from Whole-Genome and Whole-Exome Data

  • Simon Gravel,

    simon.gravel@mcgill.ca

    Affiliations: Department of Human Genetics, McGill University, Montréal, Québec, Canada, McGill University and Génome Québec Innovation Centre, Montréal, Québec, Canada

  • Fouad Zakharia,

    Contributed equally to this work with: Fouad Zakharia, Andres Moreno-Estrada, Jake K. Byrnes

    Affiliation: Department of Genetics, Stanford University, Stanford, California, United States of America

  • Andres Moreno-Estrada,

    Contributed equally to this work with: Fouad Zakharia, Andres Moreno-Estrada, Jake K. Byrnes

    Affiliation: Department of Genetics, Stanford University, Stanford, California, United States of America

  • Jake K. Byrnes,

    Contributed equally to this work with: Fouad Zakharia, Andres Moreno-Estrada, Jake K. Byrnes

    Affiliations: Department of Genetics, Stanford University, Stanford, California, United States of America, Ancestry.com DNA LLC, San Francisco, California, United States of America

  • Marina Muzzio,

    Affiliations: Department of Genetics, Stanford University, Stanford, California, United States of America, Laboratorio de Genética Molecular Poblacional, Instituto Multidisciplinario de Biología Celular (IMBICE). CCT- CONICET-La Plata, Argentina and Facultad de Ciencias Naturales y Museo, Universidad Nacional de La Plata, La Plata, Argentina

  • Juan L. Rodriguez-Flores,

    Affiliation: Weill Cornell Medical College, New York, New York, United States of America

  • Eimear E. Kenny,

    Affiliations: Department of Genetics, Stanford University, Stanford, California, United States of America, Department of Genetics and Genomic Sciences, The Charles Bronfman Institute for Personalized Medicine, Center for Statistical Genetics, and Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America

  • Christopher R. Gignoux,

    Affiliation: Department of Bioengineering and Therapeutic Sciences and Medicine, University of California San Francisco, San Francisco, California, United States of America

  • Brian K. Maples,

    Affiliation: Department of Genetics, Stanford University, Stanford, California, United States of America

  • Wilfried Guiblet,

    Affiliation: Department of Biology, University of Puerto Rico at Mayaguez, Mayaguez, Puerto Rico

  • Julie Dutil,

    Affiliation: Department of Biochemistry, Ponce School of Medicine and Health Sciences, Ponce, Puerto Rico

  • Marc Via,

    Affiliations: Department of Bioengineering and Therapeutic Sciences and Medicine, University of California San Francisco, San Francisco, California, United States of America, Department of Psychiatry and Clinical Psychobiology, University of Barcelona, Barcelona, Spain

  • Karla Sandoval,

    Affiliation: Department of Genetics, Stanford University, Stanford, California, United States of America

  • Gabriel Bedoya,

    Affiliation: Universidad de Antioquia, Medellín, Colombia

  • The 1000 Genomes Project,

    Membership of The 1000 Genomes Project can be found in the acknowledgments or at http://www.1000genomes.org/participants.

  • Taras K. Oleksyk,

    Affiliation: Department of Biology, University of Puerto Rico at Mayaguez, Mayaguez, Puerto Rico

  • Andres Ruiz-Linares,

    Affiliation: Department of Genetics, Evolution and Environment, University College London, London, United Kingdom

  • Esteban G. Burchard,

    Affiliation: Department of Bioengineering and Therapeutic Sciences and Medicine, University of California San Francisco, San Francisco, California, United States of America

  • Juan Carlos Martinez-Cruzado,

    Affiliation: Department of Biology, University of Puerto Rico at Mayaguez, Mayaguez, Puerto Rico

  • Carlos D. Bustamante

    Affiliation: Department of Genetics, Stanford University, Stanford, California, United States of America

  • Published: December 26, 2013
  • DOI: 10.1371/journal.pgen.1004023

Abstract

There is great scientific and popular interest in understanding the genetic history of populations in the Americas. We wish to understand when different regions of the continent were inhabited, where settlers came from, and how current inhabitants relate genetically to earlier populations. Recent studies unraveled parts of the genetic history of the continent using genotyping arrays and uniparental markers. The 1000 Genomes Project provides a unique opportunity for improving our understanding of population genetic history by providing over a hundred sequenced low-coverage genomes and exomes from Colombian (CLM), Mexican-American (MXL), and Puerto Rican (PUR) populations. Here, we explore the genomic contributions of African, European, and especially Native American ancestry to these populations. Estimated Native American ancestry is in MXL, in CLM, and in PUR. Native American ancestry in PUR is most closely related to populations surrounding the Orinoco River basin, confirming the South American ancestry of the Taíno people of the Caribbean. We present new methods to estimate the allele frequencies in the Native American fraction of the populations, and model their distribution using a demographic model for three ancestral Native American populations. These ancestral populations likely split in close succession: the most likely scenario, based on a peopling of the Americas thousand years ago (kya), supports that the MXL ancestors split kya, with a subsequent split of the ancestors to CLM and PUR kya. The model also features effective populations of in Mexico, in Colombia, and in Puerto Rico. Modeling identity-by-descent (IBD) and ancestry tract length, we show that post-contact populations also differ markedly in their effective sizes and migration patterns, with Puerto Rico showing the smallest effective size and the earlier migration from Europe. Finally, we compare IBD and ancestry assignments to find evidence for relatedness among European founders to the three populations.

Author Summary

Populations of the Americas have a rich and heterogeneous genetic and cultural heritage that draws from a diversity of pre-Columbian Native American, European, and African populations. Characterizing this diversity facilitates the development of medical genetics research in diverse populations and the transfer of medical knowledge across populations. It also represents an opportunity to better understand the peopling of the Americas, from the crossing of Beringia to the post-Columbian era. Here, we take advantage of the sequencing of individuals of Colombian (CLM), Mexican (MXL), and Puerto Rican (PUR) origin by the 1000 Genomes Project to improve our demographic models for the peopling of the Americas. The divergence among African, European, and Native American ancestors to these populations enables us to infer the continent of origin at each locus in the sampled genomes. The resulting patterns of ancestry suggest complex post-Columbian migration histories, starting later in CLM than in MXL and PUR. Whereas European ancestral segments show evidence of relatedness, a demographic model of synonymous variation suggests that the Native American ancestors to the MXL, PUR, and CLM panels split within a few hundred years of each other, over 12 thousand years ago. Together with early archeological sites in South America, these results support rapid divergence during the initial peopling of the Americas.

Introduction

The 1000 Genomes project [1] released sequence data for 66 Mexican-American (MXL), 60 Colombian (CLM), and 55 Puerto Rican (PUR) individuals using an array of technologies including low-coverage whole genome sequence data, high-coverage exome capture data, and OMNI 2.5 genotyping data. These data provide a unique window into the settlement of the Americas that complements the archeological data and the more limited genetic data previously available. Here we interpret these data to answer basic questions about the pre- and post-Columbian demographic history of the Americas.

People reached the Americas by crossing Beringia during the Last Glacial Maximum, likely 16–20 kya (see, e.g., [2], [3], [4], [5]). The presence of early South American sites such as Monte Verde [6] suggests a rapid occupation of the continent, which is also supported by recent mitochondrial DNA studies [7]. A coastal route has been proposed to explain this rapid expansion (e.g., [8],[6],[7]), but other migration routes, possibly concurrent, have also been proposed (see, e.g., [5],[9], and references therein). This original peopling of the Americas, followed by European contact starting in 1492 and substantial African slave trade starting in 1502, has created a diverse genetic heritage in American populations.

The initial settlement of the Caribbean has been much debated (e.g. [10],[11],[12] and references therein). People reached the islands around 7 kya, probably from a Mesoamerican source [11]. Around 4.5 kya, a second wave of migrants probably reached the islands, likely coming from the Orinoco Delta or the Guianas in South America and speaking Arawakan languages (see [13] and references therein). By approximately 1.3 kya, they had established large Taíno communities through the Greater Antilles, including Puerto Rico.

The earliest available account reports 600,000 Native Americans in Puerto Rico at the time of European arrival, not counting women and children (Vázquez de Espinosa 1629). More conservative estimates suggest 110,000 individuals [14], and as few as 30,000 inhabitants in 1508 [15]. All references agree that the Native American population was subsequently largely decimated through disease, forced labor, emigration, and war. Despite the bottleneck at contact, admixture and the subsequent population growth on the Island resulted in a Native American genetic contribution averaging of the modern population of million [16].

The MXL were sampled in Los Angeles, USA and the CLM in Medellin, Colombia. These panels represent urban populations, but recent urbanization means that they derive ancestry from larger geographic areas. Among respondents to the 2005 Colombia Census in Medellin, were born in the city, and were born in another part of Colombia, with a sizable proportion from the surrounding Department of Antioquia. Given this high rate of within-country migration, but a relatively low rate of migration from outside Colombia, we can think of the sample as representing a diverse sample from Antioquia. Similarly, the 1.2M Angelenos of Mexican origin in the 2010 US census represent the added contributions of multiple waves of migrations starting with the city's foundation in 1781 and received contributions from diverse states.

The use of genetic data to study Native American history is well established. The bulk of these studies rely on Y chromosome [17],[18],[19],[20],[21],[22],[23],[24] and mitochondrial DNA (mtDNA) [25],[26],[27],[28],[29],[30],[31],[22],[32],[7],[33],[34],[35], with a number of studies using increasingly dense sets of autosomal markers [22],[36],[37],[38],[39],[40]. Such studies provided evidence for a bottleneck recovery into the Americas 16–12 kya (e.g., [34],[35]), and for complex models of migrations and admixture within Native groups [40].

In this article, we use the 1000 Genomes data and a diversity of population genetic tools to delve deeper into the founding of the Puerto Rican, Mexican, and Colombian populations. To propose models for Native American demography, we must first quantify the African, European, and Native American contributions to these populations. Because of strong sex-asymmetric migrations, autosomal and sex-linked markers exhibit substantial differences in ancestry proportions [41],[42],[43],[44],[45],[46]. Focusing on the autosomal regions, we infer the locus-specific pre-Columbian continental ancestry in each sample, and estimate the timing and intensity of different migration waves that contributed to these populations. Using identity-by-descent analysis, we identify relatedness among the different ancestral groups and estimate recent effective population sizes.

We also propose a three-population model based on the diffusion approximation to study the distribution of allele frequencies across the Native American ancestors of the MXL, PUR, and CLM. We present statistical methods that take advantage of admixture linkage patterns to disentangle the histories of each continental group. The large sample of sequence data allows for the joint inference of split times and effective population sizes among the Native ancestors to the three panels. Finally, through an expectation maximization (EM) framework, we estimate genome-wide allele frequencies in the inferred Native components of MXL, CLM, and PUR genomes.

A broad summary of the data and analysis pipelines used in this article is displayed in Figure 1.


Figure 1. Schematic of the data and analysis pipelines used in this article.

The three types of 1000 Genomes data are shown in orange: whole-genome, low-coverage data; exome capture; and genotyping chip. Only genotyping chip data was available in trio-phased form; for the other two datasets we used unphased genotypes. Among the analysis approaches (black arrows), the EM and the negative ascertainment analysis are novel: they are presented in the Methods section.

doi:10.1371/journal.pgen.1004023.g001

Results

Global ancestry proportions and clustering

To estimate the global proportions of African, European, and Native American ancestry in the CLM, MXL, and PUR, we combined them with YRI, CEU, and a panel of Native American samples [40] and performed an admixture [47] analysis (Figure 2(a)) and principal component analysis (Figure S1). Dense genotyping arrays allow for inference of ancestry at the level of individual loci, using software such as RFMix [48]. Trio-phased OMNI data was used to generate such locus-specific ancestry calls for 66 CLM, 68 MXL, and 64 PUR individuals, including all sequenced individuals, as part of the 1000 Genomes Project. Summing up the local ancestry contribution inferred by RFMix provides an alternate estimate of ancestry proportions.
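The last step above, summing local ancestry calls into genome-wide proportions, can be sketched as follows. This is a minimal illustration rather than the pipeline used in the study: the tuple layout `(start, end, label)`, the use of centimorgans as the length unit, and the function name `global_ancestry` are all assumptions made for the example.

```python
from collections import defaultdict


def global_ancestry(tracts):
    """Return the fraction of total tract length assigned to each ancestry.

    `tracts` is an iterable of (start, end, label) tuples, one per local
    ancestry segment, pooled over both haplotypes of an individual.
    The layout and units (cM) are hypothetical, chosen for illustration.
    """
    totals = defaultdict(float)
    for start, end, label in tracts:
        if end < start:
            raise ValueError("tract end precedes tract start")
        totals[label] += end - start
    genome_length = sum(totals.values())
    if genome_length == 0:
        raise ValueError("no tract length to summarise")
    return {label: length / genome_length for label, length in totals.items()}


# Toy individual: two 100 cM haplotypes with made-up ancestry calls.
example_tracts = [
    (0.0, 40.0, "EUR"),
    (40.0, 70.0, "NAT"),
    (70.0, 100.0, "EUR"),
    (0.0, 10.0, "AFR"),
    (10.0, 100.0, "NAT"),
]
proportions = global_ancestry(example_tracts)
```

Weighting by segment length rather than by segment count matters here, because a few long tracts can carry more of the genome than many short ones.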


Figure 2. Genome-wide ancestry patterns.

(a) Individual ancestry proportions in the 1000 Genomes CLM, MXL, and PUR populations according to admixture. (b) Map showing the sampling locations for the populations most closely related to the Native components of the 1000 Genomes populations. (c) Principal component analysis restricted to genomic segments inferred to be of Native ancestry in these populations, compared to a reference panel of Native American groups from [40], pooled according to country of origin as a proxy for geography. Populations sampled across many locations are labeled according to the country of the centroid of locations. (d) Zoomed version of the PCA plot, showing specific Native American population labels, colored according to country of origin.

doi:10.1371/journal.pgen.1004023.g002

Using admixture, we find Native American proportions being in PUR, in CLM, and in MXL (Figure 2a). RFMix finds values falling within percentage points of these values, and within one percentage point of the values inferred in the 1000 Genomes project through related methods [1]. Estimates of African ancestry showed a larger difference across methods, with admixture (RFMix) estimates at in PUR, in CLM, and in MXL.

The inferred Native American ancestry proportions are in good agreement with results from the GALA study [49], which reported proportions of in Puerto Rico and in Mexico. The PUR result is also comparable to the of Native ancestry inferred in a different Puerto Rican sample [16]. By contrast, none of the populations from Colombia in [37] show median ancestry proportions quite similar to the CLM sample from Medellin, the closest being the sample from the surrounding Department of Antioquia, with Native, African and European.

Figure 2(c–d) shows a principal component analysis restricted to segments of inferred Native ancestry [50]. We find that the MXL individuals cluster primarily with southern Mexican Native groups (mostly Mixe), and the CLM cluster primarily with the Embera, Kogii, and Wayu, all of which were sampled in Colombia north-west of the Andes, where Medellin is also located. The PUR cluster principally with populations south-east of the Andes, surrounding the Guianas and the Orinoco River basin (Ticuna, Guahibo, Palikur, Jamamadi, Piapoco), although a few populations from further south are also close in PCA space, particularly the Guaraní and the Chané, together with some Kaqchikel, Toba, and Wichi individuals. The Piapoco and the Palikur speak Arawakan languages. The other groups with known Arawakan-speaking ancestors in our panel are the Chané, whose ancestors spoke Arawakan and likely originated in Guiana [51], and the Guaraní, through gene flow from the Chané [52]. Taken together, these clustering patterns support a demic diffusion of the Arawakan/Taínos into Puerto Rico by a South American route, and reduced gene flow between Native American groups living in the Andes or to the west, and groups living east of the Andes.

Ancestry tracts analysis

Because continuous tracts of local ancestry are progressively broken down by recombination, the length distribution of continuous ancestry tracts can reveal details of the timing and mode of the migration processes. We used RFMix to infer ancestry tracts (Text S1), and the software tracts [53] to infer the migration rates and model likelihoods under different scenarios. Tracts can predict the distribution of ancestry block length for arbitrary models of time-varying migration, under the assumptions that the migrants are themselves not admixed, and that the admixed population follows Wright-Fisher reproduction. Since admixture only begins after two populations are in contact, the admixed population is founded when the second population arrives. Tracts determines the time and ancestry proportions at the onset of admixture and the time and magnitude of subsequent migrations by maximum likelihood. Because of limited statistical power, we start with a simple model in which each population contributes a single pulse of migration. We then progressively introduce models with additional periods of migration when justified by information criteria, as described in Text S1. The models that best describe the data are shown in Figures 3 and S2. Parameters for these, together with confidence intervals obtained through bootstrap over individuals, are provided in Table S1 in the Text S1 file.
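The maximum-likelihood comparison of observed and predicted tract-length histograms can be illustrated with a short sketch. It assumes that the counts in each length bin are independent Poisson variables centered on the model prediction, which is consistent with the Poisson treatment of bin counts mentioned in the caption of Figure 3 but is not a description of the internals of the tracts software. The function name and the toy numbers are hypothetical.

```python
import math


def poisson_log_likelihood(observed, expected):
    """Log-likelihood of binned tract counts under independent Poisson bins.

    `observed` holds integer tract counts per length bin and `expected`
    holds the counts a candidate migration model predicts for those bins.
    Both are illustrative inputs, not output of any real software.
    """
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same number of bins")
    total = 0.0
    for count, mean in zip(observed, expected):
        if mean <= 0:
            raise ValueError("expected counts must be positive")
        # log of Poisson pmf: k*log(lambda) - lambda - log(k!)
        total += count * math.log(mean) - mean - math.lgamma(count + 1)
    return total


# Made-up histogram and two made-up model predictions.
observed_counts = [120, 60, 25, 8]
model_a = [118.0, 63.0, 24.0, 9.0]
model_b = [90.0, 80.0, 40.0, 2.0]
better_fit = "A" if poisson_log_likelihood(observed_counts, model_a) > \
    poisson_log_likelihood(observed_counts, model_b) else "B"
```

In a full analysis the model prediction would depend on migration times and proportions, and those parameters would be varied to maximize this quantity; information criteria then penalize models with additional migration pulses.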


Figure 3. Ancestry tract length distribution in PUR (a) and CLM (b) compared to the predictions of the best-fitting migration model.

Solid lines represent model predictions and shaded areas are one standard deviation confidence regions surrounding the predictions, assuming a Poisson distribution of counts per bin. The best-fitting models are displayed under each graph. Pie chart sizes indicate the proportion of migrants at each generation, and the pie parts represent the fraction of migrants of each origin at a given generation. Migrants are taken to have uniform continental ancestry. ‘Single-pulse’ admixture events occurring at non-integer times in generations are distributed among neighboring generations: in the CLM, the inferred onset was 13.02 generations ago (ga). The model involves founding 14 ga, but almost complete replacement 13 ga. At 30 years per generation [68], 14.9 ga corresponds to , and 13 to . Model parameters and confidence intervals are displayed in Table S1 in the Text S1 file.

doi:10.1371/journal.pgen.1004023.g003

For MXL, we considered a model introduced in [54]: three populations start contributing migrants at the same time, but Europeans and Native Americans keep contributing at a constant rate. The best-fitting model has an onset of admixture 15.1 generations ago (ga), with a CI of , in good agreement with [54] despite a different genotyping chip and local ancestry inference method.

In PUR, we found evidence for two periods of European and African migration, the first ga ( CI ) and the most recent period at ga ( CI 5.9–8.8). This model is in excellent agreement with historical records, which suggest that isolated Native populations contributed little gene flow to the colony after the initial contact period, and that substantial slave trade and European immigration continued until the second half of the 19th century. We do not mean to imply that migrations actually occurred in exactly two distinct pulses; we do not have the resolution to distinguish more than two pulses per population. However, the inference of a migration pulse 6.8 ga indicates that migrations occurred during a period spanning this date. This complex scenario, with multiple waves of migration from African and European individuals, is consistent with the observation that European and African ancestries vary across the island, whereas no evidence of such variation was found in Native ancestry [16].

The inferred onset of admixture in CLM is 13.0 ga ( CI ), significantly later than that in both MXL and PUR and consistent with later European settlement in western Colombia compared to Mexico and Puerto Rico. We also find evidence for a small but statistically significant second wave of Native American migration, 4.8 ga ( CI 4–6). As above, this does not necessarily indicate a single, punctual event, but probable contact between an admixed population and Native American individuals during that period. By contrast, we find no evidence for continuing African gene flow in CLM.

Identity by descent analysis

We used germline [55] and the trio-phased OMNI data above to identify segments identical-by-descent (IBD) within and across populations (see Text S1). Not surprisingly, we found more IBD segments within populations (23936) compared to across populations (1440), and within-population segments were longer (Figure S3).

The MXL population exhibits significantly less within-population IBD compared to the other two panels (Figure 4). The amount of IBD among unrelated individuals can be used to infer the underlying population size under the assumption of panmixia: the larger a population, the more distant the expected relationship between any two individuals [56]. Using IBD segments longer than 4 cM, we infer effective population sizes of 140,000 in MXL, 15,000 in CLM, and 10,000 in PUR. As we will show, these largely reflect post-admixture population sizes.


Figure 4. Number of IBD tracts by length bin in the three panel populations (independent of ancestry estimations), normalized by the number of individual pairs.

The lower level of IBD in the MXL population indicates a much larger effective population size.

doi:10.1371/journal.pgen.1004023.g004

We expect long IBD segments to be inherited from a recent common ancestor, and therefore to have identical continental ancestry. Comparing the RFMix ancestry assignments on chromosomes that have been identified as IBD by germline thus provides a measure of the consistency of the two methods (see [57] for a related metric). Rates of IBD-Ancestry mismatch ranged from in segments of to less than for segments longer than 40 Mb (Figure S4).
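A per-segment version of this consistency check can be sketched as below. The representation is an assumption made for illustration: each haplotype in an IBD pair is given as a list of ancestry labels at the same ordered markers within the shared segment, and the mismatch rate is the fraction of markers where the two labels disagree. The published metric may be defined per unit of genetic length rather than per marker.

```python
def ancestry_mismatch_rate(calls_a, calls_b):
    """Fraction of markers in an IBD segment with discordant ancestry calls.

    `calls_a` and `calls_b` are equal-length sequences of ancestry labels
    for the two haplotypes inferred to be IBD (hypothetical input layout).
    """
    if len(calls_a) != len(calls_b):
        raise ValueError("both haplotypes must cover the same markers")
    if len(calls_a) == 0:
        raise ValueError("segment contains no markers")
    mismatches = sum(a != b for a, b in zip(calls_a, calls_b))
    return mismatches / len(calls_a)


# Toy segment of four markers; the last marker is called differently.
haplotype_1 = ["EUR", "EUR", "NAT", "NAT"]
haplotype_2 = ["EUR", "EUR", "NAT", "AFR"]
rate = ancestry_mismatch_rate(haplotype_1, haplotype_2)
```

A mismatch can reflect an error in either the ancestry call or the IBD call, so the rate bounds the joint error of the two methods rather than attributing it to one of them.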

Patterns of ancestry in IBD segments within a population differ markedly from those across populations (Figure 5): IBD segments within populations contain many ancestry switches. This indicates that many common ancestors lived after contact, and that the effective population sizes estimated using IBD largely reflect post-contact demography. The IBD patterns in cross-population IBD segments exhibited fewer ancestry switches than a random control (Figure S5), as may be expected if common ancestors often predate the onset of admixture. Cross-population IBD segments were also found to be overwhelmingly of European origin: among the 120 longest cross-population IBD segments, 117 are in European-inferred segments, two are among Native segments, and one is among African segments. This is not due to overall ancestry proportions, as can be observed by considering the alternate (non-IBD) haplotypes at the same positions (Figure S5). This is likely a result of the colonization history, in which European colonists rapidly spread from a relatively specific region over a large continent. This interpretation is supported by the admixture analysis (Figure S6), showing a common cluster of ancestry for the European component dominant in PUR, CLM, MXL, and Andean populations, but not in CEU, Eskimo-Aleut, and Na-Dene. Finally, we were interested in testing whether the relationship between IBD and ancestry can be used to date recombination events. The ancestry within an IBD segment represents the ancestry state of the most recent common ancestor. The shorter the IBD segment, the older the ancestor, and the less time available since the onset of admixture to create ancestry switch points through recombination. Indeed, we find that the density of ancestry switch-points on IBD tracts increases with IBD tract length in PUR (bootstrap , see Text S1) and in MXL (bootstrap ), whereas the results are not significant in CLM. Thus we can use ancestry patterns in admixed populations not only to recognize recombination events but also to help date most recent common ancestors and recombination events (see Text S1 for details). The small amount of cross-population IBD among Native American tracts tells us that the ancestral Native populations were not as closely related as European founders, consistent with historical and anthropological data.


Figure 5. Continental origin of IBD segments.

(a) Local ancestry assignments in the neighborhood of the 120 longest inferred IBD segments within a population, (b) Local ancestry assignments in the neighborhood of the 120 longest inferred IBD segments across populations. Within inferred IBD segments, ancestry mismatches correspond error rate within population, and error rate across population.

doi:10.1371/journal.pgen.1004023.g005

Demographic inference from sequence data

To infer split times and population sizes of the Native ancestors, we consider the joint site frequency spectrum (SFS). The SFS is informative of demography because stochastic differences in allele frequencies accumulate over time and at a rate that depends on population sizes. We use the diffusion-approximation framework implemented in [58] to perform the inference. We focus on synonymous sites in the 1000 Genomes exome capture data of 60 CLM, 66 MXL, and 55 PUR individuals because the high coverage reduces sequencing artifacts and synonymous sites are less affected by selection compared to non-synonymous sites. A complete model with admixture would require at least one European, one African, and three Native American populations, which is beyond the 3-population limit of the framework in [58]. We therefore wish to focus on variants within Native American backgrounds.
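For readers unfamiliar with the joint SFS, the following sketch shows how it can be tabulated from per-site derived-allele counts. It is an illustration of the data structure only, not of the inference software in [58]; the function name, the input layout (one row per site, one column per population), and the toy sample sizes are assumptions.

```python
import numpy as np


def joint_sfs(derived_counts, sample_sizes):
    """Tabulate a joint site frequency spectrum.

    `derived_counts` has one row per variable site and one column per
    population, holding the number of derived alleles observed.
    `sample_sizes` gives the number of sampled chromosomes per population.
    Entry [i, j, k] of the result counts sites with i, j, and k derived
    alleles in the three populations.
    """
    counts = np.asarray(derived_counts, dtype=int)
    sizes = np.asarray(sample_sizes, dtype=int)
    if counts.ndim != 2 or counts.shape[1] != sizes.size:
        raise ValueError("expected one column of counts per population")
    if (counts < 0).any() or (counts > sizes).any():
        raise ValueError("allele count outside the range 0..sample size")
    sfs = np.zeros(tuple(sizes + 1), dtype=int)
    # np.add.at accumulates correctly when the same cell is hit repeatedly.
    np.add.at(sfs, tuple(counts.T), 1)
    return sfs


# Toy data: four sites, three populations with two chromosomes each.
toy_counts = [[1, 0, 0], [1, 0, 0], [0, 2, 1], [2, 2, 2]]
toy_sfs = joint_sfs(toy_counts, (2, 2, 2))
```

With real sample sizes the array is large and sparse, which is one reason such spectra are often projected down to smaller sample sizes before fitting.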

Unfortunately, trio-phased sequencing data was not available for most samples. Because of phasing uncertainty, the actual ancestry assignment for variants at ancestry-heterozygous loci is uncertain. To overcome this, we introduce a negative ascertainment scheme, in which we only consider variable sites that have not been observed in any of the non-Native populations in the 1000 Genomes data set. The effect of this ascertainment scheme is to remove the majority of variants that predate the split of Native Americans from the rest of the populations. An additional benefit of this approach is that the impact of European and African tracts incorrectly assigned as Native American will be substantially reduced. We hypothesized that the effect of negative ascertainment could be approximately modeled by a strict bottleneck at the Native/non-Native split time. This was confirmed through simulations (see Text S1).
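The filtering step of the negative ascertainment scheme can be sketched as a simple predicate over per-panel allele counts. The record layout, the function name, and the panel labels used as the non-Native set are illustrative assumptions; the actual list of panels and the handling of missing data are described in the Methods and Text S1.

```python
def negatively_ascertain(sites, non_native_panels):
    """Keep only sites whose variant allele is unseen in every non-Native panel.

    Each site is a dict with a "counts" mapping from panel label to the
    number of copies of the variant allele observed in that panel
    (hypothetical layout). A panel missing from the mapping is treated
    as having zero copies, which is an assumption of this sketch.
    """
    kept = []
    for site in sites:
        counts = site["counts"]
        if all(counts.get(panel, 0) == 0 for panel in non_native_panels):
            kept.append(site)
    return kept


# Toy records; panel labels stand in for the non-Native reference set.
toy_sites = [
    {"pos": 101, "counts": {"YRI": 0, "CEU": 0, "MXL": 3}},
    {"pos": 202, "counts": {"YRI": 5, "CEU": 0, "MXL": 1}},
    {"pos": 303, "counts": {"YRI": 0, "CEU": 2, "PUR": 4}},
    {"pos": 404, "counts": {"YRI": 0, "CEU": 0, "CLM": 1, "PUR": 2}},
]
retained = negatively_ascertain(toy_sites, non_native_panels=("YRI", "CEU"))
retained_positions = [site["pos"] for site in retained]
```

Because the filter conditions on absence from finite reference samples, some old variants at low frequency elsewhere will still pass, which is why the effect is modeled as an approximate rather than exact bottleneck.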

We considered a simple 3-population demographic model starting with a constant population of size . At time the population size changes to . From this population of size , population diverged with size at time and populations and diverge at a later time with respective sizes and . We considered all three split orderings, with . In the optimal model, illustrated on Figure 6, we have , , . This model is a vast oversimplification of the historical demographic processes. However, given the limited statistical power to reconstruct time-dependent demographic histories using allele frequency data (e.g. [59]), such simple models with step-wise constant population sizes provide useful coarse-grained pictures of human demography. The population sizes in this model are effective population sizes: they are the size of Wright-Fisher populations that best explain the observed patterns of polymorphism. They differ from census sizes because of population size fluctuations, overlapping generations, sex bias, offspring number dispersion, and other departures from the Wright-Fisher assumptions. The ratio is expected to converge to large values to reflect both the negative ascertainment scheme (see Methods) and the expansion post-founding of the Americas. The current data does not enable us to model these two effects separately, so the recovery time can be thought of as an interpolation between the two events. When performing likelihood optimization, tended to slowly increase without bound. Beyond a value of 100, this had minimal impact on the likelihood function and other parameter estimates. We therefore fixed this value to to facilitate optimization and prevent numerical instabilities. All other parameters, and the order of population splits, were chosen to maximize the model likelihood.


Figure 6. An illustration of the maximum likelihood demographic model for the Native American ancestors to the CLM, MXL, and PUR panels.

Parameter values are provided in Table 1. The ordering of the split shown (i.e., MXL splitting first) maximized the likelihood, but among the bootstrap replicates all three orders were observed.

doi:10.1371/journal.pgen.1004023.g006

We find dramatic differences in the inferred population sizes of the Native Ancestors to the MXL, CLM, and PUR (see Table 1), with the MXL showing by far the largest effective population size at 64,000, times larger than the CLM and 32 times larger than the PUR. Given the many sources of uncertainty and model limitations, these ratios are in good qualitative agreement with pre-Columbian populations estimated at 14M in central Mexico [60], 3M in Colombia [60], and somewhat over 110,000 in Puerto Rico [61]. This could largely be a coincidence, given that the Native ancestors to the MXL and CLM were not panmictic populations over present-day political divisions. Another possible explanation for the differences in effective population sizes is a serial founder model after the crossing of Beringia: CLM and PUR would have experienced stricter and longer bottlenecks compared to MXL due to greater distances traveled from Beringia. The crossing to Puerto Rico is likely to have introduced intense bottlenecks in PUR, resulting in a smaller recent effective population size.


Table 1. Parameter estimates for the model displayed on Figure 6, assuming a bottleneck at the foundation of the Americas 16,000 years ago.

doi:10.1371/journal.pgen.1004023.t001

The model suggests that PUR and CLM ancestral populations did not share serial founding events past the split with the MXL ancestors and split well before the expected arrival of the Arawak people of the Caribbean. Indeed, the first and second split times ( and , respectively) are remarkably close to each other, with (bootstrap CI: , see Text S1, Figure S7, and Table 1). This corresponds to a difference of about 500 years, 12,000 years ago. In fact, the splits are so close that it is impossible to distinguish which population split first, with bootstrap instances supporting all three orderings: the Taíno ancestry does not appear much more closely related to either CLM or MXL Native ancestors. This is also consistent with the PCA results shown in Figure 2, showing a clear distinction between Native American groups in eastern and western Colombia.

Despite strong historical evidence for extensive population bottlenecks suffered by Native American populations following the arrival of Europeans [62], we could not detect the presence of such bottlenecks through allele frequency analysis. However, the presence of such bottlenecks may affect our interpretation of effective population sizes. To quantify this, we fixed the timing and magnitudes of bottlenecks using non-genetic sources, and re-inferred model parameters. Dobyns [62] proposed a maximum population reduction of in the Native American population after European contact, but this number is expected to vary from location to location. Because we are studying admixed populations, the size of the bottleneck is related to the number of individuals that contributed to the admixed population, so Dobyns' estimate may not apply. In PUR, where the decline was particularly abrupt, we considered a decline of spanning years (see S1). We found that inferred parameters were little affected by the existence of such a bottleneck, with the exception of the effective population size in the pre-bottleneck PUR population, which would be 3.9 times larger than in the no-bottleneck model. Assuming an additional bottleneck in the CLM population led to a similar 4-fold increase in inferred pre-bottleneck CLM population size, with little effect on inferred split times. These effects are substantial, but they are smaller than the inferred differences in effective population sizes. Thus, in the absence of extreme differences in the recent bottlenecks experienced by the three populations, the observed differences in population sizes likely point to differences in pre-Columbian demography.

By calibrating our results using , towards the most recent end of the range of plausible values for the peopling of the Americas (see e.g., [6] and references therein), we find a mutation rate of (bootstrap CI: ), within the range of recently published human mutation rates [63]. The narrowest confidence interval reported in [63] was , obtained from a de novo exome sequencing study [64]. Our sampling confidence interval is narrower than this value, but the main source of uncertainty here is the degree to which the bottleneck in our model reflects the bottleneck at the founding of the Americas, or the earlier split with the ancestors to the Chinese (CHB) and Japanese (JPT) sample, as well as uncertainty with respect to the timing of these two events (see Figure 7). The effect of changing the founding time or mutation rate assumptions would be to scale all parameters and confidence intervals according to Thus the absolute uncertainty on individual parameters is larger than the sampling uncertainty suggests.
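The scaling argument can be made concrete with a short sketch. It assumes the convention common in diffusion-based frequency-spectrum inference, where the fit returns a scaled mutation parameter theta = 4 N_ref mu L and times in units of 2 N_ref generations. That convention, the function name, the generation time, and every numerical input below are assumptions made for illustration, not values taken from this article.

```python
def to_physical_units(theta, tau_founding, seq_length, founding_years, gen_years):
    """Convert genetic-unit estimates to physical units.

    theta          : scaled mutation parameter, 4 * N_ref * mu * L (assumed convention)
    tau_founding   : founding time in units of 2 * N_ref generations
    seq_length     : number of sequenced sites L
    founding_years : assumed founding time in years (the calibration)
    gen_years      : assumed generation time in years
    """
    n_ref = founding_years / (2.0 * gen_years * tau_founding)
    mu = theta / (4.0 * n_ref * seq_length)  # per site per generation
    return n_ref, mu

# Hypothetical inputs, for illustration only.
THETA, TAU_F, SEQ_LEN, GEN_YEARS = 1000.0, 0.05, 1.0e7, 25.0

for years in (16_000, 20_000):
    n_ref, mu = to_physical_units(THETA, TAU_F, SEQ_LEN, years, GEN_YEARS)
    print(f"T_f = {years} y: N_ref = {n_ref:.0f}, mu = {mu:.3e}")
```

Under this convention, population sizes and all times scale linearly with the assumed founding time while the mutation rate scales inversely, so the product of mutation rate and founding time is fixed by the data.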

Figure 7. Plausible parameter range for the human mutation rate and the founding time of the Native American populations.

The shaded blue area is the confidence interval from the current analysis. The horizontal line shows the lowest mutation rate estimate from [63], and the vertical line shows the lowest plausible date for the founding of the ancestral Native American populations according to [6]. The plausible region, given by the overlap of the three areas, would correspond to a mutation rate of and a Native American founding time .

doi:10.1371/journal.pgen.1004023.g007

Estimating Native American allele frequencies

Publicly available genome-wide data on Native American genomic diversity are scarce. The 1000 Genomes dataset offers the opportunity to provide a diversity resource for Native American genomics by reconstructing the genetic makeup of Native American populations ancestral to the PUR, CLM, and MXL. This is particularly interesting in the case of the Puerto Rican population, where such reconstruction may be the only way to understand the genetic make-up of the pre-Columbian inhabitants of the Islands. Using the expectation maximization method presented in the Methods section, we estimated the allele frequencies in the Native-American-inferred part of the genomes of the sequenced individuals. These estimates are available at http://genomes.uprm.edu/Taino/.

Figure 8 shows the distribution of the number of Native American haplotypes per site and the resulting confidence intervals for allele frequency in each population for exome capture target regions. Absolute confidence intervals are narrow for rare variants, and reach a maximum for SNPs at intermediate frequency; the leftmost peak in the bimodal distribution corresponds to the large number of rare variants, whereas the rightmost peak encompasses a broader range of frequencies.
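The dependence of interval width on allele frequency can be illustrated with a simple binomial interval. The sketch below uses the Wilson score interval as a stand-in; the intervals reported in this article come from the posterior distribution described in the Methods, so this is an analogy under a simpler model, and the haplotype counts are hypothetical.

```python
from math import sqrt

def wilson_width(p_hat: float, n: int, z: float = 1.96) -> float:
    """Width of the Wilson score interval for a proportion p_hat from n haplotypes."""
    denom = 1.0 + z * z / n
    half = z * sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)) / denom
    return 2.0 * half

# Hypothetical number of Native American haplotypes observed at a site.
n_haplotypes = 40
for p_hat in (0.0, 0.025, 0.1, 0.3, 0.5):
    print(f"sample frequency {p_hat:5.3f}: width {wilson_width(p_hat, n_haplotypes):.3f}")
```

For a fixed number of haplotypes, the width grows with p(1 - p) and peaks at a sample frequency of 0.5, and it shrinks as more haplotypes are observed.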

Figure 8. Estimating Native American allele frequencies.

(a) Number of inferred Native American haplotypes per site, out of 120 CLM, 132 MXL, and 110 PUR haplotypes. (b) Distribution of confidence interval widths for allele frequency estimates among the exomic Native American segments of the three panels.

doi:10.1371/journal.pgen.1004023.g008

Among the variants with observations in all populations and within the exome capture regions, where coverage and accuracy were highest, the variant most significantly differentiated among Native groups is rs11183610 on chromosome 12, with an estimated frequency of in MXL Native ancestry, in CLM Native ancestry, and in PUR Native ancestry. The MXL-PUR difference remains significant after Bonferroni correction (bootstrap , see Methods). The bulk of the differentiation among populations is likely due to genetic drift, but such sub-continental ancestry informative markers are also interesting candidates for further selection scans.

Discussion

The bottleneck at the founding of the Americas provides a unique opportunity to obtain precise estimates of the human autosomal mutation rate, as reported in Table 1 and Figure 7. One remaining challenge in interpretation is whether the ‘founding time’ studied here corresponds to the bottleneck at the founding of the Americas, or the split time of the Native Americans with the Asian populations. Fortunately, this uncertainty can be addressed by sequencing either trio-phased populations from the Americas, or individuals of Native American ancestry without large amounts of recent European and African ancestry. In either case, the dramatic events that led to the initial peopling of the Americas, together with the early dates of South American archaeological sites, provide us with estimates of the human mutation rate that are more precise than pedigree-based estimates. A more thorough study of the robustness of these estimates to model assumptions is therefore desirable.

We find a substantially larger effective population size in Mexico than in the other two populations through both IBD-based and allele-frequency-based estimates. These methods are sensitive to different time-scales: IBD analysis largely reflects post-Columbian events, as evidenced by the large number of mixed-ancestry IBD segments in Figure 5(a). Allele frequencies reflect older events as well, and we showed that recent bottlenecks alone are unlikely to be responsible for the much larger effective MXL population size. To interpret the population size differences, we must consider the recent histories of the populations studied here. The MXL panel was recruited in Los Angeles among Mexican-American individuals, who may come from different regions in Mexico, a much wider geographical region than Puerto Rico and thus likely more populous. A natural question is whether the larger effective population sizes in MXL reflect a large panmictic population in Mexico, or a large number of small, previously isolated populations. Figure 2 and references [65],[40] provide compelling evidence that there is substantial population structure within Native groups of Mexico. However, Figure 2 also shows that the Native component of the MXL forms a relatively homogeneous cluster together with populations from southern Mexico. The much larger Native populations in central and southern Mexico are likely to have contributed the most to the Native American ancestry of Mexican mestizos, and thus Mexican-Americans. Even though the MXL may have ancestors in different parts of Mexico, their Native genetic origins likely reflect the demographic history of the areas in Mexico with the highest Native American population sizes.

Because Puerto Rico is an island, building a relatively complete population genetic model for the population may be more tractable. Clearly, our model of single idealized pre-Columbian Native American, European, and African populations, joining to form a panmictic admixed population, is an oversimplification. African and European ancestry proportions vary along the island [16] and eastern parts of Puerto Rico, with elevated proportions of African ancestry, are underrepresented in this study. By contrast, we do not have evidence for variation in the amount or composition of the Native American ancestry across the island, and it is likely that the conclusions about the pre-Columbian Native American fraction of the population are robust to sampling ascertainment. Interestingly, we find that the distribution of ancestry tract length in a sample of individuals of Puerto Rican descent in south Florida gave very similar results, despite different location, sequencing platform, and local ancestry inference method [50]. Historical gene flow inference using individuals of Colombian descent in south Florida provided comparable estimates of the time of admixture onset, but different patterns of recent gene flow; as is typical in demographic inference, inference of recent events is more sensitive to population structure.

Our analyses largely rely on accurate estimates of local ancestry patterns along the genome obtained through RFMix. This method has been shown to provide more than accuracy on three-way admixture using comparable reference panels [48], an accuracy level that enables accurate estimation of genome-wide diversity [54]. To ensure that our results are robust to residual errors, we further took into account the difficulty of calling short ancestry tracts in our migration estimates, and performed negative ascertainment of non-Native American alleles in the demographic inference. Some of these results can be verified independently by sequencing contemporary or ancient individuals with more uniform ancestry. However, understanding the genetic history of admixed populations will continue to rely on statistically picking apart the contributions of different ancestral populations, and the development of improved statistical methods, particularly for admixture that is ancient or between closely related populations, remains highly desirable.

The genetic heterogeneity in continental ancestry proportions among populations of the Americas is well appreciated [66],[67],[43]. Our results emphasize more fine-scale aspects of this diversity: because of the similarity between European founders of different populations and the high divergence among the Native American ancestors, populations that appear similar under classical tests such as or principal component analysis may still harbor population-specific Native American haplotypes that must be carefully accounted for when performing rare-variant association testing in cosmopolitan cohorts. Similarly, the choice of a replication cohort for an identified risk variant should be guided by the ancestral background on which the variant is found. The PUR may be an excellent replication cohort for a result found in CLM if the background is European. If the background is Native American, a different cohort with related Native ancestry would likely be much more appropriate. Understanding the genetics of the different ancestral populations of the Americas, and the relatedness among these ancestral groups, will therefore facilitate the development of association methods that account for and take advantage of this rich diversity.

Methods

Negative ascertainment

Ideally, we would have been able to directly model the joint site-frequency spectrum (SFS) of all the ancestral populations to the PUR, CLM, and MXL. However, because we are interested in distinguishing the Native American ancestries to the three populations, this would require modeling at least 5 populations, which is beyond the scope of current methods. We would like to use the inferred local ancestry to focus on the Native American ancestry only, but this is difficult because most Native American haplotypes are in segments heterozygous for ancestry. Because of phasing errors, allele-specific ancestry can be incorrectly assigned. To minimize the impact of such mis-assigned ancestry and to ensure that we focused on variants of genuine Native American ancestry, we discarded all variants observed in 1000 Genomes individuals of African, European, and Asian ancestry, as well as variants observed in Hispanic/Latino populations in segments with no Native American ancestry inferred.

We then considered all remaining variable sites that were assigned Nat/Nat diploid ancestry and Nat/Eur ancestry, and calculated the expected frequency distribution under the assumption of perfect negative ascertainment, that is, that all remaining variants were on the Native American background. Because the European backgrounds are expected to carry a number of singletons, this would result in an overestimate of the number of singletons in the Native Ancestry. Fortunately, this bias is easy to estimate empirically: we first choose segments of Eur/Eur ancestry to mimic the European haplotypes in our sample. After performing the negative ascertainment scheme on these genotypes, we can directly estimate the bias in the negative ascertainment scheme. In practice, this correction is very low except for singletons, as expected. The number of excess singletons was 129 for CLM, 73 for PUR, and 40 for MXL. The largest non-singleton correction is 1.3 for doubletons in CLM.
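The filtering and correction steps described above can be sketched on toy allele counts. The function names, array layout, and toy numbers are assumptions made for illustration; the actual pipeline works on ancestry-specific segments and is more involved.

```python
import numpy as np

def negatively_ascertain(target_counts, outgroup_counts):
    """Keep only sites whose variant allele is never observed in the outgroup panels."""
    target_counts = np.asarray(target_counts)
    keep = np.asarray(outgroup_counts) == 0
    return target_counts[keep]

def site_frequency_spectrum(counts, n_haplotypes):
    """Entry k is the number of sites where the variant allele is seen k times."""
    return np.bincount(counts, minlength=n_haplotypes + 1)

def subtract_control_excess(observed_sfs, control_excess):
    """Remove the excess estimated by running the same scheme on Eur/Eur segments."""
    corrected = np.asarray(observed_sfs, dtype=float) - np.asarray(control_excess)
    return np.clip(corrected, 0.0, None)  # counts cannot be negative

# Toy data: variant-allele counts per site in 4 target haplotypes and in the outgroup.
target = [1, 1, 2, 0, 3, 1]
outgroup = [0, 1, 0, 0, 2, 0]

kept = negatively_ascertain(target, outgroup)
sfs = site_frequency_spectrum(kept, n_haplotypes=4)
corrected = subtract_control_excess(sfs, control_excess=[0, 1, 0, 0, 0])
print(kept, sfs, corrected)
```

In the analysis described in the text, the control excess is concentrated in the singleton class (129 for CLM, 73 for PUR, and 40 for MXL), with the largest non-singleton correction being 1.3 for doubletons in CLM.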

Because negative ascertainment removes a significant proportion of the variants that were present at the Native American split from other populations, we hypothesized that this effect could be well-approximated by a severe bottleneck at the time of split between non-Native and Native American ancestry.

Figure 9 provides a simulated example, wherein a marginal spectrum (top) is compared to a spectrum negatively ascertained using 100 diploid individuals from the ‘outgroup’ population (middle) and to a bottleneck approximation equivalent (bottom). More quantitatively, we simulated a two-population sample that diverged 12.1 kya, negatively ascertained it using a population that diverged 16.5 kya, and attempted to model the result as a two-population model with an early bottleneck. The inferred bottleneck timing was within of the split time with the outgroup, and the three population sizes and split time between populations 1 and 2 were within of the correct values. These biases are well within the acceptable range given other biases and uncertainties.

Figure 9. Illustration of the negative ascertainment scheme, with simulation.

(a) A basic three population model, showing the joint site-frequency spectrum for populations 1 and 2 as a heat map. (b) Conditioning on variants not being observed in the out-population results in a SFS skewed towards rare variants. (c) A quantitatively similar effect can be obtained by introducing a drastic bottleneck at the root of the tree and considering only two populations.

doi:10.1371/journal.pgen.1004023.g009

Allele frequencies in Native American segments

We wish to estimate the allele frequencies at each site among segments of Native American origin, but we have to contend with a finite sample and inaccurate phasing. We therefore choose to model the underlying population frequency across all populations using Bayes rule(1)
where is the observed genotype data, , and is the diploid local ancestry calls (e.g., for populations A and B). From this distribution we can calculate expected frequency and confidence intervals. We report inferred frequencies and confidence intervals at non-monomorphic sites.

To estimate , we write as the frequencies of the non reference allele in populations and . We have , for ancestry and genotype heterozygous segments, , and so forth. To estimate , we first observe that because we are considering population frequencies, rather than sample frequencies, is independent of : . This suggests the use of a self-consistent, expectation-maximization procedure. We estimate the underlying frequency distribution as(2)
the sum over the estimated probabilities at each site. We can thus iterate Equations (1) and (2) until self-consistency is reached to estimate both allele frequency distributions and single-site allele frequencies in each population.

A final caveat is that the sum runs over all sites, including monomorphic ones. If we only observe the subset of sites that are polymorphic, an additional step is needed. If is the number of monomorphic (unobserved) sites (denoted as ), and represents the sum over polymorphic sites, we have(3)
and, therefore,
Intuitively, we are correcting for the proportions of sites at every frequency that might have gone undetected. Results are reported using 20 EM iterations, for sites where all individuals had both ancestry and genotype calls, and data can be downloaded at http://genomes.uprm.edu/Taino/.
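The self-consistent iteration can be sketched for two ancestries, A and B, on a discrete frequency grid. This is a simplified illustration rather than the implementation used for the released estimates: the grid resolution, the function names, the toy data, and the omission of the correction for unobserved monomorphic sites are all assumptions of the sketch. The posterior at each site is taken proportional to the genotype likelihood times a shared prior over the pair of population frequencies, and the prior is re-estimated as the average of the per-site posteriors.

```python
import numpy as np

def genotype_likelihood(g, a, f_a, f_b):
    """P(g | f, a) for one diploid individual with unphased genotype.

    g is the non-reference allele count (0, 1, or 2); a is the diploid
    ancestry call ("AA", "AB", or "BB"); f_a and f_b may be arrays.
    """
    p, q = {"AA": (f_a, f_a), "AB": (f_a, f_b), "BB": (f_b, f_b)}[a]
    if g == 0:
        return (1 - p) * (1 - q)
    if g == 1:
        return p * (1 - q) + q * (1 - p)
    return p * q

def em_allele_frequencies(genotypes, ancestries, n_grid=21, n_iter=20):
    """Return posterior-mean frequencies per site for A and B, and the final prior."""
    grid = np.linspace(0.0, 1.0, n_grid)
    f_a, f_b = np.meshgrid(grid, grid, indexing="ij")
    lik = np.ones((len(genotypes), n_grid, n_grid))
    for s, (site_g, site_a) in enumerate(zip(genotypes, ancestries)):
        for g, a in zip(site_g, site_a):
            lik[s] *= genotype_likelihood(g, a, f_a, f_b)

    def posterior(prior):
        post = lik * prior
        return post / post.sum(axis=(1, 2), keepdims=True)

    prior = np.full((n_grid, n_grid), 1.0 / n_grid**2)  # flat starting prior
    for _ in range(n_iter):
        prior = posterior(prior).mean(axis=0)  # average of per-site posteriors
    post = posterior(prior)
    mean_a = (post * f_a).sum(axis=(1, 2))
    mean_b = (post * f_b).sum(axis=(1, 2))
    return mean_a, mean_b, prior

# Toy data: three sites, individuals listed with genotype and ancestry call.
genotypes = [[2] * 10 + [0] * 10, [0] * 10 + [2] * 10, [1] * 10]
ancestries = [["AA"] * 10 + ["BB"] * 10, ["AA"] * 10 + ["BB"] * 10, ["AB"] * 10]
mean_a, mean_b, prior = em_allele_frequencies(genotypes, ancestries)
print(np.round(mean_a, 3), np.round(mean_b, 3))
```

Because the likelihood is written for unphased genotypes, individuals that are heterozygous for both ancestry and genotype contribute information without requiring the allele to be assigned to either haplotype. Confidence intervals could be read off the same per-site posterior.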

To test this method, we considered 84 diploid individuals, each formed by drawing two chromosomes (without replacement) from 84 CEU and 84 YRI individuals, resulting in a simulated 50–50 admixture proportion. We considered 100,000 sites on chromosome 22, and performed the EM inference as described.

Among the 85,677 sites that were found to be polymorphic, only 13 had a sample allele frequency departing from the confidence interval for the European ancestry, and 51 for the African ancestry. Confidence intervals encompass much more than of sample allele frequencies, emphasizing that the width of the confidence interval largely reflects the uncertainty about the population frequency given a fixed sample frequency, rather than the phasing uncertainty.

Optimizing the demographic model

Because the demographic model considered here does not involve migrations between Native groups, we considered the composite likelihood of three pairwise two-population allele frequency distributions, rather than the full three-population spectrum. This allows for much faster inference and better convergence of the numerical optimization. In principle, it also enables the joint inference of more than three populations. We showed through simulations that the use of a composite likelihood had an effect on inferred parameters that was much smaller than other sources of uncertainty. We used grids of 20, 40, and 60 grid points per population, and projected Native American allele frequencies to sample sizes of 10 in PUR, 20 in CLM, and 40 in MXL.
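A composite likelihood of this kind can be sketched as the sum of three Poisson log-likelihoods, one per pairwise joint spectrum. The Poisson form, the function names, and the random toy spectrum are assumptions made for illustration. Real analyses would compute each two-population model spectrum directly from the demographic model and would typically mask the fixed and absent entries, which is not done here.

```python
from math import lgamma

import numpy as np

_lgamma = np.vectorize(lgamma, otypes=[float])

def poisson_loglik(model, data):
    """Poisson log-likelihood of observed spectrum entries given expected entries."""
    model = np.asarray(model, dtype=float)
    data = np.asarray(data, dtype=float)
    mask = model > 0  # entries with zero expectation are skipped
    m, d = model[mask], data[mask]
    return float(np.sum(-m + d * np.log(m) - _lgamma(d + 1.0)))

def pairwise_spectra(sfs3):
    """Marginalize a three-population joint spectrum into its three pairwise spectra."""
    sfs3 = np.asarray(sfs3)
    return [sfs3.sum(axis=2), sfs3.sum(axis=1), sfs3.sum(axis=0)]

def composite_loglik(model_pairs, data_pairs):
    """Sum of the pairwise log-likelihoods, treating the three pairs as independent."""
    return sum(poisson_loglik(m, d) for m, d in zip(model_pairs, data_pairs))

# Toy three-population spectrum (hypothetical counts).
rng = np.random.default_rng(0)
data3 = rng.poisson(5, size=(3, 4, 5)) + 1
data_pairs = pairwise_spectra(data3)

ll_exact = composite_loglik(data_pairs, data_pairs)
ll_scaled = composite_loglik([1.5 * p for p in data_pairs], data_pairs)
print(ll_exact, ll_scaled)
```

Because the three pairwise spectra share data, the composite likelihood is not a true likelihood; this is one reason to assess uncertainty by bootstrap rather than from the likelihood surface.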

Supporting Information

Figure S1.

The first two principal components for 1000 Genomes populations, showing the distribution of admixed populations.

doi:10.1371/journal.pgen.1004023.s001

(EPS)

Figure S2.

Ancestry tract length distribution in MXL compared to the predictions of the best-fitting migration model (displayed below). Solid lines represent model predictions and shaded areas are one-sigma confidence regions surrounding the predictions, assuming a Poisson distribution [54].

doi:10.1371/journal.pgen.1004023.s002

(EPS)

Figure S3.

Distribution of IBD lengths within populations (red) and across populations (purple).

doi:10.1371/journal.pgen.1004023.s003

(EPS)

Figure S4.

IBD inconsistency rate as a function of IBD length. Long IBD segments exhibit significantly fewer ancestry inconsistencies. The line represents within-population IBD; the red dots represent across-population IBD.

doi:10.1371/journal.pgen.1004023.s004

(EPS)

Figure S5.

Ancestry assignments in a control formed by taking the non-IBD matching haplotypes at loci where the alternate haplotypes are IBD.

doi:10.1371/journal.pgen.1004023.s005

(EPS)

Figure S6.

Results of admixture analysis with K = 3 to K = 12, with Native American populations grouped by geographic origin.

doi:10.1371/journal.pgen.1004023.s006

(EPS)

Figure S7.

(a) Bootstrap distributions and (b) pairwise correlations for demographic inference parameters. Vertical red bars mark the optimal parameters.

doi:10.1371/journal.pgen.1004023.s007

(EPS)

Text S1.

Supplementary methods include additional description of statistical and filtering methods used in this article.

doi:10.1371/journal.pgen.1004023.s008

(PDF)

Acknowledgments

The members of the 1000 Genomes project are: Altshuler DM, Durbin RM, Abecasis GR, Bentley DR, Chakravarti A, Clark AG, Donnelly P, Eichler EE, Flicek P, Gabriel SB, Gibbs RA, Green ED, Hurles ME, Knoppers BM, Korbel JO, Lander ES, Lee C, Lehrach H, Mardis ER, Marth GT, McVean GA, Nickerson DA, Schmidt JP, Sherry ST, Wang J, Wilson RK, Gibbs RA, Dinh H, Kovar C, Lee S, Lewis L, Muzny D, Reid J, Wang M, Wang J, Fang X, Guo X, Jian M, Jiang H, Jin X, Li G, Li J, Li Y, Li Z, Liu X, Lu Y, Ma X, Su Z, Tai S, Tang M, Wang B, Wang G, Wu H, Wu R, Yin Y, Zhang W, Zhao J, Zhao M, Zheng X, Zhou Y, Lander ES, Altshuler DM, Gabriel SB, Gupta N, Flicek P, Clarke L, Leinonen R, Smith RE, Zheng-Bradley X, Bentley DR, Grocock R, Humphray S, James T, Kingsbury Z, Lehrach H, Sudbrak R, Albrecht MW, Amstislavskiy VS, Borodina TA, Lienhard M, Mertes F, Sultan M, Timmermann B, Yaspo ML, Sherry ST, McVean GA, Mardis ER, Wilson RK, Fulton L, Fulton R, Weinstock GM, Durbin RM, Balasubramaniam S, Burton J, Danecek P, Keane TM, Kolb-Kokocinski A, McCarthy S, Stalker J, Quail M, Schmidt JP, Davies CJ, Gollub J, Webster T, Wong B, Zhan Y, Auton A, Gibbs RA, Yu F, Bainbridge M, Challis D, Evani US, Lu J, Muzny D, Nagaswamy U, Reid J, Sabo A, Wang Y, Yu J, Wang J, Coin LJ, Fang L, Guo X, Jin X, Li G, Li Q, Li Y, Li Z, Lin H, Liu B, Luo R, Qin N, Shao H, Wang B, Xie Y, Ye C, Yu C, Zhang F, Zheng H, Zhu H, Marth GT, Garrison EP, Kural D, Lee WP, Leong WF, Ward AN, Wu J, Zhang M, Lee C, Griffin L, Hsieh CH, Mills RE, Shi X, von Grotthuss M, Zhang C, Daly MJ, DePristo MA, Altshuler DM, Banks E, Bhatia G, Carneiro MO, del Angel G, Gabriel SB, Genovese G, Gupta N, Handsaker RE, Hartl C, Lander ES, McCarroll SA, Nemesh JC, Poplin RE, Schaffner SF, Shakir K, Yoon SC, Lihm J, Makarov V, Jin H, Kim W, Kim KC, Korbel JO, Rausch T, Flicek P, Beal K, Clarke L, Cunningham F, Herrero J, McLaren WM, Ritchie GR, Smith RE, Zheng-Bradley X, Clark AG, Gottipati S, Keinan A, Rodriguez-Flores JL, Sabeti PC, Grossman 
SR, Tabrizi S, Tariyal R, Cooper DN, Ball EV, Stenson PD, Bentley DR, Barnes B, Bauer M, Cheetham R, Cox T, Eberle M, Humphray S, Kahn S, Murray L, Peden J, Shaw R, Ye K, Batzer MA, Konkel MK, Walker JA, MacArthur DG, Lek M, Sudbrak R, Amstislavskiy VS, Herwig R, Shriver MD, Bustamante CD, Byrnes JK, De La Vega FM, Gravel S, Kenny EE, Kidd JM, Lacroute P, Maples BK, Moreno-Estrada A, Zakharia F, Halperin E, Baran Y, Craig DW, Christoforides A, Homer N, Izatt T, Kurdoglu AA, Sinari SA, Squire K, Sherry ST, Xiao C, Sebat J, Bafna V, Ye K, Burchard EG, Hernandez RD, Gignoux CR, Haussler D, Katzman SJ, Kent WJ, Howie B, Ruiz-Linares A, Dermitzakis ET, Lappalainen T, Devine SE, Liu X, Maroo A, Tallon LJ, Rosenfeld JA, Michelson LP, Abecasis GR, Kang HM, Anderson P, Angius A, Bigham A, Blackwell T, Busonero F, Cucca F, Fuchsberger C, Jones C, Jun G, Li Y, Lyons R, Maschio A, Porcu E, Reinier F, Sanna S, Schlessinger D, Sidore C, Tan A, Trost MK, Awadalla P, Hodgkinson A, Lunter G, McVean GA, Marchini JL, Myers S, Churchhouse C, Delaneau O, Gupta-Hinch A, Iqbal Z, Mathieson I, Rimmer A, Xifara DK, Oleksyk TK, Fu Y, Liu X, Xiong M, Jorde L, Witherspoon D, Xing J, Eichler EE, Browning BL, Alkan C, Hajirasouliha I, Hormozdiari F, Ko A, Sudmant PH, Mardis ER, Chen K, Chinwalla A, Ding L, Dooling D, Koboldt DC, McLellan MD, Wallis JW, Wendl MC, Zhang Q, Durbin RM, Hurles ME, Tyler-Smith C, Albers CA, Ayub Q, Balasubramaniam S, Chen Y, Coffey AJ, Colonna V, Danecek P, Huang N, Jostins L, Keane TM, Li H, McCarthy S, Scally A, Stalker J, Walter K, Xue Y, Zhang Y, Gerstein MB, Abyzov A, Balasubramanian S, Chen J, Clarke D, Fu Y, Habegger L, Harmanci AO, Jin M, Khurana E, Mu XJ, Sisu C, Li Y, Luo R, Zhu H, Lee C, Griffin L, Hsieh CH, Mills RE, Shi X, von Grotthuss M, Zhang C, Marth GT, Garrison EP, Kural D, Lee WP, Ward AN, Wu J, Zhang M, McCarroll SA, Altshuler DM, Banks E, del Angel G, Genovese G, Handsaker RE, Hartl C, Nemesh JC, Shakir K, Yoon SC, Lihm J, Makarov V, Degenhardt 
J, Flicek P, Clarke L, Smith RE, Zheng-Bradley X, Korbel JO, Rausch T, Sttz AM, Bentley DR, Barnes B, Cheetham R, Eberle M, Humphray S, Kahn S, Murray L, Shaw R, Ye K, Batzer MA, Konkel MK, Walker JA, Lacroute P, Craig DW, Homer N, Church D, Xiao C, Sebat J, Bafna V, Michaelson JJ, Ye K, Devine SE, Liu X, Maroo A, Tallon LJ, Lunter G, Iqbal Z, Witherspoon D, Xing J, Eichler EE, Alkan C, Hajirasouliha I, Hormozdiari F, Ko A, Sudmant PH, Chen K, Chinwalla A, Ding L, McLellan MD, Wallis JW, Hurles ME, Blackburne B, Li H, Lindsay SJ, Ning Z, Scally A, Walter K, Zhang Y, Gerstein MB, Abyzov A, Chen J, Clarke D, Khurana E, Mu XJ, Sisu C, Gibbs RA, Yu F, Bainbridge M, Challis D, Evani US, Kovar C, Lewis L, Lu J, Muzny D, Nagaswamy U, Reid J, Sabo A, Yu J, Guo X, Li Y, Wu R, Marth GT, Garrison EP, Leong WF, Ward AN, del Angel G, DePristo MA, Gabriel SB, Gupta N, Hartl C, Poplin RE, Clark AG, Rodriguez-Flores JL, Flicek P, Clarke L, Smith RE, Zheng-Bradley X, MacArthur DG, Bustamante CD, Gravel S, Craig DW, Christoforides A, Homer N, Izatt T, Sherry ST, Xiao C, Dermitzakis ET, Abecasis GR, Kang HM, McVean GA, Mardis ER, Dooling D, Fulton L, Fulton R, Koboldt DC, Durbin RM, Balasubramaniam S, Keane TM, McCarthy S, Stalker J, Gerstein MB, Balasubramanian S, Habegger L, Garrison EP, Gibbs RA, Bainbridge M, Muzny D, Yu F, Yu J, del Angel G, Handsaker RE, Makarov V, Rodriguez-Flores JL, Jin H, Kim W, Kim KC, Flicek P, Beal K, Clarke L, Cunningham F, Herrero J, McLaren WM, Ritchie GR, Zheng-Bradley X, Tabrizi S, MacArthur DG, Lek M, Bustamante CD, De La Vega FM, Craig DW, Kurdoglu AA, Lappalainen T, Rosenfeld JA, Michelson LP, Awadalla P, Hodgkinson A, McVean GA, Chen K, Tyler-Smith C, Chen Y, Colonna V, Frankish A, Harrow J, Xue Y, Gerstein MB, Abyzov A, Balasubramanian S, Chen J, Clarke D, Fu Y, Harmanci AO, Jin M, Khurana E, Mu XJ, Sisu C, Gibbs RA, Fowler G, Hale W, Kalra D, Kovar C, Muzny D, Reid J, Wang J, Guo X, Li G, Li Y, Zheng X, Altshuler DM, Flicek P, Clarke L, Barker 
J, Kelman G, Kulesha E, Leinonen R, McLaren WM, Radhakrishnan R, Roa A, Smirnov D, Smith RE, Streeter I, Toneva I, Vaughan B, Zheng-Bradley X, Bentley DR, Cox T, Humphray S, Kahn S, Sudbrak R, Albrecht MW, Lienhard M, Craig DW, Izatt T, Kurdoglu AA, Sherry ST, Ananiev V, Belaia Z, Beloslyudtsev D, Bouk N, Chen C, Church D, Cohen R, Cook C, Garner J, Hefferon T, Kimelman M, Liu C, Lopez J, Meric P, O'Sullivan C, Ostapchuk Y, Phan L, Ponomarov S, Schneider V, Shekhtman E, Sirotkin K, Slotta D, Xiao C, Zhang H, Haussler D, Abecasis GR, McVean GA, Alkan C, Ko A, Dooling D, Durbin RM, Balasubramaniam S, Keane TM, McCarthy S, Stalker J, Chakravarti A, Knoppers BM, Abecasis GR, Barnes KC, Beiswanger C, Burchard EG, Bustamante CD, Cai H, Cao H, Durbin RM, Gharani N, Gibbs RA, Gignoux CR, Gravel S, Henn B, Jones D, Jorde L, Kaye JS, Keinan A, Kent A, Kerasidou A, Li Y, Mathias R, McVean GA, Moreno-Estrada A, Ossorio PN, Parker M, Reich D, Rotimi CN, Royal CD, Sandoval K, Su Y, Sudbrak R, Tian Z, Timmermann B, Tishkoff S, Toji LH, Tyler-Smith C, Via M, Wang Y, Yang H, Yang L, Zhu J, Bodmer W, Bedoya G, Ruiz-Linares A, Ming CZ, Yang G, You CJ, Peltonen L, Garcia-Montero A, Orfao A, Dutil J, Martinez-Cruzado JC, Oleksyk TK, Brooks LD, Felsenfeld AL, McEwen JE, Clemm NC, Duncanson A, Dunn M, Green ED, Guyer MS, Peterson JL.

Contributions

Conceived and designed the experiments: SG MV KS TKO ARL EGB JCMC CDB. Analyzed the data: SG FZ JKB MM AME JLRF EEK CRG WG. Contributed reagents/materials/analysis tools: SG FZ JKB AME CRG BKM JD GB TKO ARL EGB. Wrote the paper: SG MM AME CDB.

References

  1. 1000 Genomes Project Consortium (2012) An integrated map of genetic variation from 1,092 human genomes. Nature 491: 56–65.
  2. O'Rourke DH, Raff JA (2010) The human genetic history of the Americas: the final frontier. Curr Biol 20: R202–R207. doi: 10.1016/j.cub.2009.11.051
  3. Luis Lanata J, Martino L, Osella A, Garcia-Herbst A (2008) Demographic conditions necessary to colonize new spaces: the case for early human dispersal in the Americas. World Archaeology 40: 520–537. doi: 10.1080/00438240802452890
  4. Goebel T, Waters MR, O'Rourke DH (2008) The late Pleistocene dispersal of modern humans in the Americas. Science 319: 1497–1502. doi: 10.1126/science.1153569
  5. Dillehay TD (2009) Probing deeper into first American studies. Proc Natl Acad Sci USA 106: 971–978. doi: 10.1073/pnas.0808424106
  6. Dillehay TD, Ramírez C, Pino M, Collins MB, Rossen J, et al. (2008) Monte Verde: seaweed, food, medicine, and the peopling of South America. Science 320: 784–786. doi: 10.1126/science.1156533
  7. Bodner M, Perego UA, Huber G, Fendt L, Rock AW, et al. (2012) Rapid coastal spread of First Americans: Novel insights from South America's Southern Cone mitochondrial genomes. Genome Res 22: 811–820. doi: 10.1101/gr.131722.111
  8. Hurst CT (1943) A Folsom site in a mountain valley of Colorado. American Antiquity 8: 250–253. doi: 10.2307/275905
  9. Meltzer DJ (2009) First Peoples in a New World. Colonizing Ice Age America. Univ of California Press.
  10. Rouse I (1992) The Tainos: Rise and decline of the people who greeted Columbus. Yale University Press.
  11. Rodríguez-Ramos R (2010) Rethinking Puerto Rican Precolonial History. University of Alabama Press.
  12. Veloz Maggiolo M (1991) Panorama histórico del caribe precolombino. Quinto Centenario del Descubrimiento de América Banco Central de la República Dominicana.
  13. Hopper R (2008) Taino Indians: Settlements of the Caribbean. Lambda Alpha Journal 38: 62–69.
  14. Moscoso F (2008) Caciques, aldeas y población taína de Boriquén (Puerto Rico), 1492–1582. Academia Puertorriqueña de la Historia.
  15. Alegría RE, Quiñones ER (1999) Historia y cultura de Puerto Rico: Desde la época pre-colombina hasta nuestros días. Fundación Francisco Carvajal.
  16. Via M, Gignoux CR, Roth LA, Fejerman L, Galanter J, et al. (2011) History Shaped the Geographic Distribution of Genomic Admixture on the Island of Puerto Rico. PLoS ONE 6: e16513. doi: 10.1371/journal.pone.0016513
  17. Underhill PA, Jin L, Zemans R, Oefner PJ, Cavalli-Sforza LL (1996) A pre-Columbian Y chromosome-specific transition and its implications for human evolutionary history. Proc Natl Acad Sci USA 93: 196–200. doi: 10.1073/pnas.93.1.196
  18. Lell JT, Brown MD, Schurr TG, Sukernik RI, Starikovskaya YB, et al. (1997) Y chromosome polymorphisms in native American and Siberian populations: identification of native American Y chromosome haplotypes. Hum Genet 100: 536–543. doi: 10.1007/s004390050548
  19. Bianchi NO, Catanesi CI, Bailliet G, Martinez-Marignac VL, Bravi CM, et al. (1998) Characterization of ancestral and derived Y-chromosome haplotypes of New World native populations. Am J Hum Genet 63: 1862–1871. doi: 10.1086/302141
  20. Karafet TM, Zegura SL, Posukh O, Osipova L, Bergen A, et al. (1999) Ancestral Asian source(s) of new world Y-chromosome founder haplotypes. Am J Hum Genet 64: 817–831. doi: 10.1086/302282
  21. Bortolini MC, Salzano FM, Thomas MG, Stuart S, Nasanen SPK, et al. (2003) Y-chromosome evidence for differing ancient demographic histories in the Americas. Am J Hum Genet 73: 524–539. doi: 10.1086/377588
  22. Mesa NR, Mondragón MC, Soto ID, Parra MV, Duque C, et al. (2000) Autosomal, mtDNA, and Y-chromosome diversity in Amerinds: pre- and post-Columbian patterns of gene flow in South America. Am J Hum Genet 67: 1277–1286. doi: 10.1086/321214
  23. Bortolini MC, Salzano FM, Bau CHD, Layrisse Z, Petzl-Erler ML, et al. (2002) Y-chromosome biallelic polymorphisms and Native American population structure. Ann Hum Genet 66: 255–259. doi: 10.1046/j.1469-1809.2002.00114.x
  24. Bailliet G, Ramallo V, Muzzio M, García A, Santos MR, et al. (2009) Brief communication: Restricted geographic distribution for Y-Q* paragroup in South America. Am J Phys Anthropol 140: 578–582. doi: 10.1002/ajpa.21133
  25. Torroni A, Schurr TG, Cabell MF, Brown MD, Neel JV, et al. (1993) Asian affinities and continental radiation of the four founding Native American mtDNAs. Am J Hum Genet 53: 563–590.
  26. Achilli A, Perego UA, Bravi CM, Coble MD, Kong QP, et al. (2008) The Phylogeny of the Four Pan-American MtDNA Haplogroups: Implications for Evolutionary and Disease Studies. PLoS ONE 3: e1764. doi: 10.1371/journal.pone.0001764
  27. Kumar S, Bellis C, Zlojutro M, Melton PE, Blangero J, et al. (2011) Large scale mitochondrial sequencing in Mexican Americans suggests a reappraisal of Native American origins. BMC Evol Biol 11: 293. doi: 10.1186/1471-2148-11-293
  28. Malhi RS, Cybulski JS, Tito RY, Johnson J, Harry H, et al. (2010) Brief communication: mitochondrial haplotype C4c confirmed as a founding genome in the Americas. Am J Phys Anthropol 141: 494–497. doi: 10.1002/ajpa.21238
  29. Perego UA, Achilli A, Angerhofer N, Accetturo M, Pala M, et al. (2009) Distinctive Paleo-Indian migration routes from Beringia marked by two rare mtDNA haplogroups. Curr Biol 19: 1–8. doi: 10.1016/j.cub.2008.11.058
  30. Perego UA, Angerhofer N, Pala M, Olivieri A, Lancioni H, et al. (2010) The initial peopling of the Americas: a growing number of founding mitochondrial genomes from Beringia. Genome Res 20: 1174–1179. doi: 10.1101/gr.109231.110
  31. 31. Tamm E, Kivisild T, Reidla M, Metspalu M, Smith DG, et al. (2007) Beringian standstill and spread of Native American founders. PLoS ONE 2: e829. doi: 10.1371/journal.pone.0000829
  32. 32. Sandoval K, Buentello-Malo L, Peñaloza-Espinosa R, Avelino H, Salas A, et al. (2009) Linguistic and maternal genetic diversity are not correlated in Native Mexicans. Hum Genet 126: 521–531. doi: 10.1007/s00439-009-0693-y
  33. 33. Bonatto SL, Salzano FM (1997) A single and early migration for the peopling of the Americas supported by mitochondrial DNA sequence data. Proc Natl Acad Sci USA 94: 1866–1871. doi: 10.1073/pnas.94.5.1866
  34. 34. Mulligan CJ, Kitchen A, Miyamoto MM (2008) Updated three-stage model for the peopling of the Americas. PLoS ONE 3: e3199. doi: 10.1371/journal.pone.0003199
  35. 35. Fagundes NJR, Kanitz R, Eckert R, Valls ACS, Bogo MR, et al. (2008) Mitochondrial population genomics supports a single pre-Clovis origin with a coastal route for the peopling of the Americas. Am J Hum Genet 82: 583–592. doi: 10.1016/j.ajhg.2007.11.013
  36. 36. Wang S, Lewis CM, Jakobsson M, Ramachandran S, Ray N, et al. (2007) Genetic variation and population structure in native Americans. PLoS Genet 3: e185. doi: 10.1371/journal.pgen.0030185
  37. 37. Rojas W, Parra MV, Campo O, Caro MA, Lopera JG, et al. (2010) Genetic make up and structure of Colombian populations by means of uniparental and biparental DNA markers. Am J Phys Anthropol 143: 13–20. doi: 10.1002/ajpa.21270
  38. 38. Yang NN, Mazieres S, Bravi C, Ray N, Wang S, et al. (2010) Contrasting patterns of nuclear and mtDNA diversity in Native American populations. Ann Hum Genet 74: 525–538. doi: 10.1111/j.1469-1809.2010.00608.x
  39. 39. Scliar MO, Soares-Souza GB, Chevitarese J, Lemos L, Magalhães WCS, et al. (2012) The population genetics of Quechuas, the largest native South American group: Autosomal sequences, SNPs, and microsatellites evidence high level of diversity. Am J Phys Anthropol 147: 443–451. doi: 10.1002/ajpa.22013
  40. 40. Reich D, Patterson N, Campbell D, Tandon A, Mazieres S, et al. (2012) Reconstructing Native American population history. Nature 488: 370–374.
  41. 41. Martínez-Cortés G, Salazar-Flores J, Fernández-Rodríguez LG, Rubi-Castellanos R, Rodríguez-Loya C, et al. (2012) Admixture and population structure in Mexican-Mestizos based on paternal lineages. J Hum Genet 57: 568–574. doi: 10.1038/jhg.2012.67
  42. 42. Rubi-Castellanos R, Martínez-Cortés G, Muñoz-Valle JF, González-Martín A, Cerda-Flores RM, et al. (2009) Pre-Hispanic Mesoamerican demography approximates the present-day ancestry of Mestizos throughout the territory of Mexico. Am J Phys Anthropol 139: 284–294. doi: 10.1002/ajpa.20980
  43. 43. Bedoya G, Montoya P, García J, Soto I, Bourgeois S, et al. (2006) Admixture dynamics in Hispanics: a shift in the nuclear genetic ancestry of a South American population isolate. Proc Natl Acad Sci USA 103: 7234–7239. doi: 10.1073/pnas.0508716103
  44. 44. Martínez-Cruzado JC, Toro-Labrador G, Viera-Vera J, Rivera-Vega MY, Startek J, et al. (2005) Reconstructing the population history of Puerto Rico by means of mtDNA phylogeographic analysis. Am J Phys Anthropol 128: 131–155. doi: 10.1002/ajpa.20108
  45. 45. Bolnick DA, Bolnick DI, Smith DG (2006) Asymmetric male and female genetic histories among Native Americans from Eastern North America. Mol Biol Evol 23: 2161–2174. doi: 10.1093/molbev/msl088
  46. 46. Carvajal-Carmona LG, Soto ID, Pineda N, Ortíz-Barrientos D, Duque C, et al. (2000) Strong Amerind/White Sex Bias and a Possible Sephardic Contribution among the Founders of a Population in Northwest Colombia. The American Journal of Human Genetics 67: 1287–1295. doi: 10.1016/s0002-9297(07)62956-5
  47. 47. Alexander DH, Novembre J, Lange K (2009) Fast model-based estimation of ancestry in unrelated individuals. Genome Res 19: 1655–1664. doi: 10.1101/gr.094052.109
  48. 48. Maples BK, Gravel S, Kenny EE, Bustamante CD (2013) RFMix: A Discriminative Modeling Approach for Rapid and Robust Local-Ancestry Inference. Am J Hum Genet 93: 278–288. doi: 10.1016/j.ajhg.2013.06.020
  49. 49. Galanter JM, Fernandez-Lopez JC, Gignoux CR, Barnholtz-Sloan J, Fernandez-Rozadilla C, et al. (2012) Development of a panel of genome-wide ancestry informative markers to study admixture throughout the Americas. PLoS Genet 8: e1002554. doi: 10.1371/journal.pgen.1002554
  50. 50. Moreno-Estrada A, Gravel S, Zakharia F, McCauley JL, Byrnes JK, et al. (2013) Reconstructing the Population Genetic History of the Caribbean. arXiv doi: 10.1371/journal.pgen.1003925
  51. 51. Moseley C (2004) Encyclopedia of the World's Endangered Languages. Routledge doi: 10.4324/9780203645659
  52. 52. Combes I, Lowrey K (2006) Slaves without Masters? Arawakan Dynasties among the Chiriguano (Bolivian Chaco, Sixteenth to Twentieth Centuries). Ethnohistory 53: 689–714. doi: 10.1215/00141801-2006-019
  53. 53. Gravel S (2012) Population genetics models of local ancestry. Genetics 191: 607–619. doi: 10.1534/genetics.112.139808
  54. 54. Kidd JM, Gravel S, Byrnes J, Moreno-Estrada A, Musharoff S, et al. (2012) Population genetic inference from personal genome data: impact of ancestry and admixture on human genomic variation. Am J Hum Genet 91: 660–671. doi: 10.1016/j.ajhg.2012.08.025
  55. 55. Gusev A, Lowe JK, Stoffel M, Daly MJ, Altshuler D, et al. (2008) Whole population, genome-wide mapping of hidden relatedness. Genome Res 19: 318–326. doi: 10.1101/gr.081398.108
  56. 56. Palamara PF, Lencz T, Darvasi A, Pe'er I (2012) Length distributions of identity by descent reveal fine-scale demographic history. Am J Hum Genet 91: 809–822. doi: 10.1016/j.ajhg.2012.08.030
  57. 57. Baran Y, Pasaniuc B, Sankararaman S, Torgerson DG, Gignoux C, et al. (2012) Fast and accurate inference of local ancestry in Latino populations. Bioinformatics 28: 1359–1367. doi: 10.1093/bioinformatics/bts144
  58. 58. Gutenkunst RN, Hernandez RD, Williamson SH, Bustamante CD (2009) Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLoS Genet 5: e1000695. doi: 10.1371/journal.pgen.1000695
  59. 59. Myers S, Fefferman C, Patterson N (2008) Can one learn history from the allelic spectrum? Theor Popul Biol 73: 342–348. doi: 10.1016/j.tpb.2008.01.001
  60. 60. Salzano FM, Bortolini MC (2005) The Evolution and Genetics of Latin American Populations. Cambridge University Press.
  61. 61. Moscoso F (2008) Caciques, aldeas y población taína de Boriquén (Puerto Rico), 1492–1582. Academia Puertorriqueña de la Historia.
  62. 62. Dobyns HF (1966) An appraisal of techniques with a new hemispheric estimate. Current Anthropology 7: 395–416. doi: 10.1086/200749
  63. 63. Scally A, Durbin R (2012) Revising the human mutation rate: implications for understanding human evolution. Nat Rev Genet 13: 745–753. doi: 10.1038/nrg3295
  64. 64. Sanders SJ, Murtha MT, Gupta AR, Murdoch JD, Raubeson MJ, et al. (2012) De novo mutations revealed by whole-exome sequencing are strongly associated with autism. Nature 485: 237–241. doi: 10.1038/nature10945
  65. 65. Gorostiza A, Acunha-Alonzo V, Regalado-Liu L, Tirado S, Granados J, et al. (2012) Reconstructing the history of Mesoamerican populations through the study of the mitochondrial DNA control region. PLoS ONE 7: e44666. doi: 10.1371/journal.pone.0044666
  66. 66. Bryc K, Velez C, Karafet T, Moreno-Estrada A, Reynolds A, et al. (2010) Colloquium paper: genome-wide patterns of population structure and admixture among Hispanic/Latino populations. Proc Natl Acad Sci USA 107 Suppl 2: 8954–8961. doi: 10.1073/pnas.0914618107
  67. 67. Wang S, Ray N, Rojas W, Parra MV, Bedoya G, et al. (2008) Geographic patterns of genome admixture in Latin American Mestizos. PLoS Genet 4: e1000037. doi: 10.1371/journal.pgen.1000037
  68. 68. Tremblay M, Vézina H (2000) New estimates of intergenerational time intervals for the calculation of age and origins of mutations. Am J Hum Genet 66: 651–658. doi: 10.1086/302770