Open Access

Research Article

Genetic Variation and Population Structure in Native Americans

Sijia Wang 1#, Cecil M. Lewis, Jr. 2#, Mattias Jakobsson 2,3#, Sohini Ramachandran 4, Nicolas Ray 5, Gabriel Bedoya 6, Winston Rojas 6, Maria V. Parra 6, Julio A. Molina 7, Carla Gallo 8, Guido Mazzotti 8, Giovanni Poletti 9, Kim Hill 10, Ana M. Hurtado 10, Damian Labuda 11, William Klitz 12,13, Ramiro Barrantes 14, Maria Cátira Bortolini 15, Francisco M. Salzano 15, Maria Luiza Petzl-Erler 16, Luiza T. Tsuneto 16, Elena Llop 17, Francisco Rothhammer 17,18, Laurent Excoffier 5, Marcus W. Feldman 4, Noah A. Rosenberg 2,3*, Andrés Ruiz-Linares 1

1 The Galton Laboratory, Department of Biology, University College London, London, United Kingdom, 2 Department of Human Genetics, University of Michigan, Ann Arbor, Michigan, United States of America, 3 Center for Computational Medicine and Biology, University of Michigan, Ann Arbor, Michigan, United States of America, 4 Department of Biological Sciences, Stanford University, Stanford, California, United States of America, 5 Computational and Molecular Population Genetics Lab, University of Bern, Bern, Switzerland, 6 Laboratorio de Genética Molecular, Universidad de Antioquia, Medellín, Colombia, 7 Center for Neurobehavioral Genetics, University of California Los Angeles, Los Angeles, United States of America, 8 Laboratorios de Investigación y Desarrollo, Facultad de Ciencias y Filosofía, Universidad Peruana Cayetano Heredia, Lima, Perú, 9 Facultad de Medicina, Universidad Peruana Cayetano Heredia, Lima, Perú, 10 Department of Anthropology, University of New Mexico, Albuquerque, New Mexico, United States of America, 11 Département de Pédiatrie, CHU Sainte-Justine, Université de Montréal, Montréal, Quebec, Canada, 12 School of Public Health, University of California Berkeley, Berkeley, California, United States of America, 13 Public Health Institute, Oakland, California, United States of America, 14 Escuela de Biología, Universidad de Costa Rica, San José, Costa Rica, 15 Departamento de Genética, Instituto de Biociências, Universidade Federal do Rio Grande do Sul, Porto Alegre, Rio Grande do Sul, Brazil, 16 Departamento de Genética, Universidade Federal do Paraná, Curitiba, Paraná, Brazil, 17 Programa de Genética Humana, Instituto de Ciencias Biomédicas, Facultad de Medicina, Universidad de Chile, Santiago, Chile, 18 Instituto de Alta Investigación, Universidad de Tarapacá, Arica, Chile

Abstract

We examined genetic diversity and population structure in the American landmass using 678 autosomal microsatellite markers genotyped in 422 individuals representing 24 Native American populations sampled from North, Central, and South America. These data were analyzed jointly with similar data available in 54 other indigenous populations worldwide, including an additional five Native American groups. The Native American populations have lower genetic diversity and greater differentiation than populations from other continental regions. We observe gradients both of decreasing genetic diversity as a function of geographic distance from the Bering Strait and of decreasing genetic similarity to Siberians—signals of the southward dispersal of human populations from the northwestern tip of the Americas. We also observe evidence of: (1) a higher level of diversity and lower level of population structure in western South America compared to eastern South America, (2) a relative lack of differentiation between Mesoamerican and Andean populations, (3) a scenario in which coastal routes were easier for migrating peoples to traverse in comparison with inland routes, and (4) a partial agreement on a local scale between genetic similarity and the linguistic classification of populations. These findings offer new insights into the process of population dispersal and differentiation during the peopling of the Americas.

Author Summary

Studies of genetic variation have the potential to provide information about the initial peopling of the Americas and the more recent history of Native American populations. To investigate genetic diversity and population relationships in the Americas, we analyzed genetic variation at 678 genome-wide markers genotyped in 29 Native American populations. Comparing Native Americans to Siberian populations, both genetic diversity and similarity to Siberians decrease with geographic distance from the Bering Strait. The widespread distribution of a particular allele private to the Americas supports a view that much of Native American genetic ancestry may derive from a single wave of migration. The pattern of genetic diversity across populations suggests that coastal routes might have been important during ancient migrations of Native American populations. These and other observations from our study will be useful alongside archaeological, geological, and linguistic data for piecing together a more detailed description of the settlement history of the Americas.

Introduction

Patterns of genetic diversity and population structure in human populations constitute an important foundation for many areas of research in human genetics. Most noticeably, they provide an invaluable source of data for inferences about human evolutionary history [1–3]. In addition, the distribution of genetic variation informs the design and interpretation of studies that search for genes that confer an increased susceptibility to disease [4–6].

Recent genomic studies have produced detailed genome-wide descriptions of genetic diversity and population structure for a wide variety of human populations, both at the global level [7–19] and for individual geographic regions, including East Asia [20], Europe [21,22], and India [23]. Here we report the first such analysis of indigenous populations from the American landmass, using 678 microsatellites genotyped in 530 individuals from 29 Native American populations. The study is designed to investigate several questions about genetic variation in Native Americans: what records of the original colonization from Siberia are retained in Native American genetic variation? What geographic routes were taken in the Americas by migrating peoples? What is the genetic structure of Native American populations? To what extent does genetic differentiation among populations parallel the differentiation of Native American languages? In addressing these questions, our analyses identify several surprising features of genetic variation and population history in the Americas.

Results

We collected genome-wide microsatellite genotype data for 751 autosomal markers in 422 individuals from 24 Native American populations spanning ten countries and seven linguistic “stocks” (Tables S1 and S2). We also collected data on 14 individuals from a Siberian population, Tundra Nentsi. To enable comparisons with data previously reported in the worldwide collection of populations represented by the Human Genome Diversity Project–Centre d'Etude du Polymorphisme Humain (HGDP–CEPH) cell line panel [7,11,13], data analysis was restricted to 678 loci typed across all populations (see Methods). The combined dataset contains genotypes for 1,484 individuals from 78 populations, including 29 Native American groups and two Siberian groups (Figure 1).

Figure 1. Populations Included in This Study

The world map shows the 78 populations investigated in the combined dataset, with the locations of the 29 populations studied in the Americas shown in detail in the larger map. The 25 newly examined populations, including the Siberian Tundra Nentsi, are marked in red, and the previously genotyped HGDP-CEPH populations are marked in yellow.

doi:10.1371/journal.pgen.0030185.g001

Genetic Diversity

We compared levels of genetic diversity across geographic regions worldwide (Table 1). A serial founding African-origin model of human evolution [10,11]—in which each successive human migration involved only a subset of the genetic variation available at its source location, and in which the Bering Strait formed the only entry point to the American landmass—predicts reduced genetic diversity in Native Americans compared to other populations, as well as a north-to-south decline in genetic diversity among Native American populations. Indeed, Native Americans were found to have lower genetic diversity, as measured by heterozygosity, than was seen in populations from other continents (Table 1). Additionally, applying a sample size-corrected measure of the number of distinct alleles in a population [24,25], Native Americans had fewer distinct alleles per locus compared to populations in other geographic regions (Figure 2A). Among Native American populations, the highest heterozygosities were observed in the more northerly populations, and the lowest values were seen in South American populations (Table 2). The lowest heterozygosities of any populations worldwide occurred in isolated Amazonian and eastern South American populations, such as Surui and Ache. More generally, heterozygosity was reduced in eastern populations from South America compared to western populations (Table 1, p = 0.02, Wilcoxon rank sum test). Eastern South American populations also had fewer distinct alleles per locus than populations elsewhere in the Americas (Figure 2B).
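
As a rough illustration of the heterozygosity measure discussed above, the following minimal sketch computes a sample-size-corrected expected heterozygosity at one locus and averages it across loci. The exact estimator used in the study is not stated in this section, so the n/(n − 1) correction, the function names, and the toy allele sizes are assumptions made for illustration only.

```python
from collections import Counter


def expected_heterozygosity(alleles):
    """Expected heterozygosity at one locus with an n/(n-1) correction.

    `alleles` is a list of allele labels, one entry per sampled chromosome.
    The correction factor is an assumption; the paper's estimator may differ.
    """
    n = len(alleles)
    if n < 2:
        raise ValueError("need at least two sampled chromosomes")
    counts = Counter(alleles)
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1.0 - sum_sq)


def mean_heterozygosity(loci):
    """Average the per-locus values over a list of loci."""
    values = [expected_heterozygosity(a) for a in loci]
    return sum(values) / len(values)


# Hypothetical microsatellite allele sizes for four chromosomes at one locus.
print(expected_heterozygosity([275, 275, 279, 283]))
```

A population fixed for a single allele at every locus would score 0 under this measure, which is the direction in which the low-diversity populations named above are shifted.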

Table 1.

Heterozygosity and FST (×100) for Various Geographic Regions

doi:10.1371/journal.pgen.0030185.t001

Figure 2. The Mean and Standard Error Across 678 Loci of the Number of Distinct Alleles as a Function of the Number of Sampled Chromosomes

(A) Geographic regions worldwide. (B) Subregions within the Americas. For a given locus, region, and sample size g, the number of distinct alleles averaged over all possible subsamples of g chromosomes from the given region is computed according to the rarefaction method [24,25]. For each sample size g, loci were considered only if their sample sizes were at least g in each geographic region. Error bars denote the standard error of the mean across loci.
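
The rarefaction calculation described in this caption can be sketched as follows. For an allele with count c among n sampled chromosomes, the probability that a random subsample of g chromosomes misses it is C(n − c, g) / C(n, g), so the expected number of distinct alleles is the sum over alleles of one minus that probability. The function name and the toy data are hypothetical; this is a sketch of the standard rarefaction formula, not the authors' code.

```python
from collections import Counter
from math import comb


def rarefied_allele_count(alleles, g):
    """Expected number of distinct alleles in a random subsample of g chromosomes."""
    n = len(alleles)
    if not 1 <= g <= n:
        raise ValueError("g must be between 1 and the sample size")
    total = comb(n, g)
    # comb(n - c, g) is 0 when fewer than g chromosomes lack the allele,
    # meaning the allele is then certain to appear in the subsample.
    return sum(1 - comb(n - c, g) / total for c in Counter(alleles).values())


# Two alleles, two copies each, subsampled down to two chromosomes.
print(rarefied_allele_count([1, 1, 2, 2], 2))
```

Evaluating every region at the same g is what allows regions with different sample sizes to be compared on a common footing.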

doi:10.1371/journal.pgen.0030185.g002

Table 2.

Heterozygosity for Newly Sampled Populations and for the Five Previously Sampled Native American Populations (Pima, Maya, Piapoco, Karitiana, and Surui)

doi:10.1371/journal.pgen.0030185.t002

Assuming a single source for a collection of populations, the serial founding model predicts a linear decline of genetic diversity with geographic distance from the source location [11,26]. Such a pattern is observed at the worldwide level, as a linear reduction of heterozygosity is seen with increasing distance from Africa, where distance to Native American populations is measured via a waypoint near the Bering Strait (Figure 3A). To investigate the source location for Native Americans, we considered only the Native American data and allowed the source to vary, measuring the correlation of heterozygosity with distance from putative points of origin. Consistent with the founding from across the Bering Strait, the correlation of heterozygosity with geographic distance from a hypothesized source location had the most strongly negative values (r = −0.436) when the source for Native Americans was placed in the northernmost part of the American landmass (Figure 3B). The smallest value for the correlation coefficient was seen at 55.6°N 98.8°W, in central Canada, but as a result of relatively sparse sampling in North America, all correlations in the quartile with the smallest values, plotted in the darkest shade in Figure 3B, were within a narrow range (−0.436 to −0.424).
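
The source-location scan described above can be sketched in a few lines: for each candidate origin, compute the distance to every population and correlate those distances with heterozygosity, then keep the candidate with the most strongly negative correlation. This sketch uses plain great-circle distances, whereas the study measures distance via waypoints or effective paths; the coordinates and heterozygosity values below are invented for illustration.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance in kilometres."""
    p1, p2 = radians(lat1), radians(lat2)
    dphi, dlmb = p2 - p1, radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(min(1.0, sqrt(a)))


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def best_source(candidates, populations):
    """Return (r, (lat, lon)) for the candidate with the most negative r.

    `populations` is a list of (lat, lon, heterozygosity) tuples.
    """
    het = [h for _, _, h in populations]
    scored = []
    for lat, lon in candidates:
        dist = [great_circle_km(lat, lon, plat, plon) for plat, plon, _ in populations]
        scored.append((pearson(dist, het), (lat, lon)))
    return min(scored)


# Hypothetical populations: diversity declines from north to south.
pops = [(60.0, -100.0, 0.70), (20.0, -100.0, 0.65), (-20.0, -60.0, 0.60)]
print(best_source([(65.0, -165.0), (-50.0, -70.0)], pops))
```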

Figure 3. Heterozygosity in Relation to Geography

(A) Relationship between heterozygosity and geographic distance from East Africa. Populations in Sub-Saharan Africa and Oceania are marked with gray triangles and squares, respectively, and the remaining non-American populations from Europe, Asia, and northern Africa are marked with gray pentagons. Within the Americas, populations are color-coded and symbol-coded by language stock (see Figure 8). Denoting heterozygosity by H and geographic distance in thousands of kilometers by D, the regression line for the graph is H = 0.7679 − 0.00658D, with correlation coefficient −0.862. (B) The fit of a linear decline of heterozygosity with increasing distance from a putative source, considering Native American populations only. The color of a point indicates a correlation coefficient r between expected heterozygosity and geographic distance from the point, with darker colors denoting more strongly negative correlations. Across the Americas, the correlation ranges from −0.436 to 0.575, and color bins are set to equalize the number of points drawn in the four colors. From darkest to lightest, the four colors represent points with correlations in (−0.436, −0.424), (−0.424, −0.316), (−0.316, 0.494), and (0.494, 0.575), respectively.

doi:10.1371/journal.pgen.0030185.g003

One way to examine the support for particular colonization routes within the American landmass is to determine if a closer relationship between heterozygosity and geography is observed when “effective” geographic distances are computed along these routes, rather than along shortest-distance paths. Using PATHMATRIX [27] to take the precise locations of continental boundaries into consideration in effective geographic distance calculations (see Methods), rather than using a waypoint approach [11] to measure distance, does not substantially alter the correlation of heterozygosity with distance from the Bering Strait (r = −0.430, 1:1 coastal/inland cost ratio in Figure 4A). However, when coastlines are treated as preferred routes of migration in comparison with inland routes, the percentage of variance in heterozygosity explained by effective distance increases to 34% (r = −0.585 for a coastal/inland cost ratio of 1:10 in Figure 4A). In contrast, all scenarios tested that had coastal/inland cost ratios greater than 1 explain a smaller proportion of the variance in heterozygosity than do the scenarios with coastal/inland cost ratio of 1 or less.
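
The idea of an effective distance under a coastal/inland cost ratio can be illustrated with a least-cost path on a small grid. The sketch below uses Dijkstra's algorithm with a hypothetical cell coding ("C" for coastal, "I" for inland) and charges the cost of each cell entered. It is not the PATHMATRIX implementation, whose cost surface and neighborhood rules are not described in this section.

```python
import heapq


def least_cost(grid, start, goal, cost):
    """Least accumulated cost from start to goal using 4-neighbour moves.

    `grid` is a list of equal-length strings; `cost` maps a cell code to the
    cost of entering that cell. Cell codes missing from `cost` are impassable.
    """
    rows, cols = len(grid), len(grid[0])
    best = {start: 0}
    heap = [(0, start)]
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if (r, c) == goal:
            return d
        if d > best.get((r, c), float("inf")):
            continue  # stale queue entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] in cost:
                nd = d + cost[grid[nr][nc]]
                if nd < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = nd
                    heapq.heappush(heap, (nd, (nr, nc)))
    return None  # goal unreachable


# A toy landmass with an inland column between two coastal columns.
grid = ["CIC",
        "CIC",
        "CCC"]
print(least_cost(grid, (0, 0), (0, 2), {"C": 1, "I": 10}))  # 1:10 ratio
print(least_cost(grid, (0, 0), (0, 2), {"C": 1, "I": 1}))   # 1:1 ratio
```

With a 1:10 ratio the cheapest route follows the coast even though it covers more cells, which is the kind of detour visible for some populations in Figure 4B. The 34% figure quoted above is simply the square of the correlation: (−0.585)² ≈ 0.342.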

Figure 4. Heterozygosity and Least-Cost Paths in a Coastal Migration Scenario

(A) R² (square of the correlation) between heterozygosity H and effective geographic distance (least-cost distance), assuming differential permeability of coastal regions compared to inland regions. Correlations significant at the 0.05 level are indicated by closed symbols, and those that are not significant are indicated by open symbols. (B) Least-cost routes for the scenario with 1:10 coastal/inland cost ratio.

doi:10.1371/journal.pgen.0030185.g004

The preferred routes in the optimal scenario of a 1:10 coastal/inland cost ratio include a path to the Ache, Guarani, and Kaingang populations that travels around northern South America (Figure 4B). With these three populations excluded, the role of coastlines is almost unchanged (Figure S1), and a 1:10 ratio continues to explain the largest fraction of variation in heterozygosity (r = −0.595). Applying a reduced cost only to the Pacific coast, a preference is still seen for ratios slightly less than 1 compared to ratios greater than 1, and the scenario producing the closest fit is a 1:2 ratio (Figure S2). A stronger preference for a Pacific coastal route was observed when the Chipewyan, Cree, and Ojibwa populations, three groups that follow an Arctic route in Figure 4B, were excluded from the computations, and also when Ache, Guarani, and Kaingang were excluded in addition to Chipewyan, Cree, and Ojibwa (Figure S3). We did not find a closer fit of heterozygosity and effective distance assuming a reduced cost for travel along major rivers, and indeed we observed that a higher cost for riverine routes was preferred (Figure S3).

Intercontinental Population Structure

To investigate population structure at the worldwide level, we used unsupervised model-based clustering as implemented in the STRUCTURE program [28,29]. Using STRUCTURE, we applied a mixture model that allows for allele frequency correlation across a set of K genetic clusters, with respect to which individual membership coefficients are estimated (see Methods).

As has been observed previously [7,9,13,16,23], cluster analysis with worldwide populations identifies a major genetic cluster corresponding to Native Americans (Figure 5), indicating an excess similarity of individual genomes within the Americas compared to genomes in other regions. Inclusion of the Native American data collected here did not substantially alter the clusters identified in previous analyses. When the genotypes were analyzed using a model with five clusters, the clusters corresponded to Sub-Saharan Africa, Eurasia west of the Himalayas, Asia east of the Himalayas, Oceania, and the Americas. For a model with six clusters, the sixth cluster corresponded mainly to the isolated Ache and Surui populations from South America. Almost no genetic membership from the cluster containing Africans and a relatively small amount of membership from the cluster containing Europeans were detected in the Native Americans, indicating that with relatively few exceptions, the samples examined here represent populations that have experienced little recent European and African admixture.

Figure 5. Unsupervised Analysis of Worldwide Population Structure

The number of clusters in a given plot is indicated by the value of K. Individuals are represented as thin vertical lines partitioned into segments corresponding to their membership in genetic clusters indicated by the colors.

doi:10.1371/journal.pgen.0030185.g005

To search for signals of similarity to Siberians in the Native American populations, we used a supervised cluster analysis [28,29] in which Native Americans were distributed over five clusters (Figure 6). Four of these clusters were forced to correspond to Africans, Europeans, East Asians excluding Siberians, and Siberians (Tundra Nentsi and Yakut), and the fifth cluster was not associated with any particular group a priori. Most Native American individuals were seen to have majority membership in this fifth cluster, and considering their estimated membership in the remaining clusters, Native Americans were genetically most similar to Siberians. A noticeable north-to-south gradient of decreasing similarity to Siberians was observed, as can be seen in the declining membership in the red cluster from left to right in Figure 6. Genetic similarity to Siberia is greatest for the Chipewyan population from northern Canada and for the more southerly Cree and Ojibwa populations. Detectable Siberian similarity is visible to a greater extent in Mesoamerican and Andean populations than in the populations from eastern South America.

Figure 6. Supervised Population Structure Analysis, Using Five Clusters, Four of Which Were Forced to Correspond to Africans, Europeans, East Asians Excluding Siberians, and Siberians

doi:10.1371/journal.pgen.0030185.g006

Intracontinental Population Structure

The level of population structure observed among Native Americans, as determined using FST [30], was 0.081, exceeding that of other geographic regions (Table 1). Comparing regions within the Americas, the highest FST value was observed in eastern South America, with intermediate values occurring in western South America and Central America and with the smallest value occurring in North America (Table 1). These results are compatible with the lower overall level of Native American genetic variation, particularly in eastern South America, as the mathematical connection between heterozygosity and FST predicts that low heterozygosities will tend to produce higher FST values [11,31–33].
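
The connection between heterozygosity and FST mentioned above is easiest to see in the simple heterozygosity-based form FST = (HT − HS) / HT, where HS is the mean within-population heterozygosity and HT is the heterozygosity of the pooled allele frequencies. The sketch below implements that form for one locus with equally weighted populations. It is an assumption made for illustration and is not necessarily the estimator of reference [30] used in the study.

```python
def fst_from_frequencies(pop_freqs):
    """Heterozygosity-based FST for one locus.

    `pop_freqs` is a list of dicts mapping allele -> frequency, one dict per
    population; populations are weighted equally (an illustrative assumption).
    """
    k = len(pop_freqs)
    alleles = set().union(*pop_freqs)
    # Mean within-population expected heterozygosity.
    hs = sum(1 - sum(p ** 2 for p in f.values()) for f in pop_freqs) / k
    # Heterozygosity of the averaged (pooled) allele frequencies.
    mean = {a: sum(f.get(a, 0.0) for f in pop_freqs) / k for a in alleles}
    ht = 1 - sum(p ** 2 for p in mean.values())
    if ht == 0:
        raise ValueError("locus is monomorphic across all populations")
    return (ht - hs) / ht


# One variable population and one fixed population (hypothetical frequencies).
print(fst_from_frequencies([{"A": 0.5, "B": 0.5}, {"A": 1.0}]))
```

Because HS appears with a negative sign in the numerator, populations that have drifted toward fixation (low HS) push the ratio upward when HT stays appreciable.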

Applying unsupervised model-based clustering [28,29] to the Native Americans, considerable population substructure is detectable (Figure 7). For a model with two clusters, one cluster corresponds largely to the northernmost populations, while the other corresponds to populations from eastern South America; the remaining populations are partitioned between these two clusters, with greater membership of the more northerly populations in the “northern” cluster. As the number of clusters is increased, the least genetically variable groups form distinctive clusters (for example, the Ache, Karitiana, and Surui populations). However, variation exists across replicates in the nature of the partitioning, and to illustrate the range of solutions observed, Figure 7 summarizes each clustering solution that was seen in at least 12% of replicate analyses for each K from two to nine. These summaries indicate that the main clustering solutions with a given K “refine” the partitions observed with K − 1 clusters, in the sense that each of the K clusters is either identical to, or is a subset of, one of the K − 1 clusters. A likely explanation for the multimodality is the presence of several population subgroups that are roughly equally likely to form individual clusters. For small K, not enough slots are available, and only when K is sufficiently large is each of these groups able to occupy its own cluster.
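
The notion of one clustering solution "refining" another can be made concrete with a small check: a solution refines a coarser one if every one of its clusters is contained in some cluster of the coarser solution. The sketch below treats clusters as hard sets of population labels, which is a simplification, since STRUCTURE estimates fractional membership coefficients. The example partitions are hypothetical.

```python
def refines(finer, coarser):
    """True if every cluster in `finer` is a subset of some cluster in `coarser`."""
    return all(any(cluster <= parent for parent in coarser) for cluster in finer)


# Hypothetical hard assignments of five populations.
k2 = [{"Ache", "Surui", "Karitiana"}, {"Chipewyan", "Cree"}]
k3 = [{"Ache"}, {"Surui", "Karitiana"}, {"Chipewyan", "Cree"}]
k3_other = [{"Ache", "Chipewyan"}, {"Surui", "Karitiana"}, {"Cree"}]

print(refines(k3, k2))        # the K = 3 solution splits one K = 2 cluster
print(refines(k3_other, k2))  # this solution cuts across the K = 2 clusters
```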

Figure 7. Unsupervised Analysis of Native American Population Structure

The colored plots at the left show the estimated population structure of Native Americans, obtained using STRUCTURE. The number of clusters in a given plot is indicated by the value of K on the right side of the figure. Next to the K = 7 plot, the population names and the major language stocks of the populations are also displayed. The left-to-right order of the individuals is the same in all plots. The diagram on the right summarizes the outcomes of 100 replicate STRUCTURE runs for each of several values of K. Each row represents a value of K, and within each row, each box represents a clustering solution that appeared at least 12 times in 100 replicates (see Methods). The number of appearances of a solution is listed above the box, and the boxes are arrayed from left to right in decreasing order of the frequencies of the solutions to which they correspond. The DISTRUCT plot shown on the left corresponds to the leftmost box on the right side of the figure. An approximate description of the clusters is located inside the box, with each row in the box representing a different cluster. The numbers 1, 2, and 3 are used to refer to the green cluster in the K = 2 DISTRUCT plot, the blue cluster in the K = 2 DISTRUCT plot, and the yellow cluster in the K = 9 DISTRUCT plot, respectively. The following population abbreviations are also used: A, Ache; Arh, Arhuaco; Cab, Cabecar; Chip, Chipewyan; E, Embera; G, Guaymi; K, Karitiana; Kog, Kogi; P, Pima; S, Surui; T, Ticuna (both Ticuna groups combined); W, Waunana. Clusters are indicated using set notation; for example {A} represents a cluster containing Ache only, and 2\{A,S} represents a cluster that corresponds to cluster 2 (the blue cluster for K = 2), excluding Ache and Surui. An asterisk indicates approximately 50% membership of a population in a cluster. 
A line is drawn from a box representing a solution with K clusters to a box representing a solution with K+1 clusters if the solution with K+1 clusters refines the solution with K clusters—that is, if all of the clusters in the solution with K+1 clusters subdivide the clusters in the solution with K clusters. In case of ties for the highest-frequency solution (K = 4 and K = 5), boxes are oriented in order to avoid the crossing of lines between them.

doi:10.1371/journal.pgen.0030185.g007

Figure 8. Neighbor-Joining Tree of Native American Populations

Each language stock is given a color, and if all populations subtended by an edge belong to the same language stock, the clade is given the color that corresponds to that stock. Branch lengths are scaled according to genetic distance, but for ease of visualization, a different scale is used on the left and right sides of the middle tick mark at the bottom of the figure. The tree was rooted along the branch connecting the Siberian populations and the Native American populations, and for convenience, the forced bootstrap score of 100% for this rooting is indicated twice.

doi:10.1371/journal.pgen.0030185.g008

For K = 7, a relatively stable clustering solution is observed, appearing in 44 of 100 replicates (compared to seven of 100 for the next most frequently observed solution). This clustering solution has distinctive clusters for three of the smallest and least genetically variable groups in the sample—Karitiana and Surui from Brazil, and Ache from Paraguay. Two separate samples from the Amazonian Ticuna group of Colombia form the basis for a cluster, as does the Pima group from Mexico. The remaining two clusters include one centered on the North American groups and one centered on the Chibchan–Paezan language stock from Central and South America. The cluster containing Chibchan–Paezan populations—the only cluster at K = 7 that corresponds well to a major language stock—separates into two subclusters when K is increased to nine. Despite the large geographic distance between Mesoamerica and the Andes, Mesoamerican populations (Mixtec, Zapotec, Mixe, and Maya from Mexico and Kaqchikel from Guatemala) and Andean populations (Inga from Colombia, Quechua from Peru, and Aymara and Huilliche from Chile) have similar estimated membership across clusters when K = 7, and together with five additional populations (Zenu, Wayuu, and Piapoco from Colombia, and Kaingang and Guarani from Brazil), they comprise a single cluster when K = 9.

Genes and Languages

We compared the classification of the populations into linguistic “stocks” [34,35] (Table S2) with their genetic relationships as inferred on a neighbor-joining tree constructed from Nei genetic distances [36] between pairs of populations (Figure 8). As the use of a single-family grouping (Amerind) of all languages not belonging to the Na–Dene or Eskimo–Aleutian families is controversial [37], we focused our analysis on the taxonomically lower level of linguistic stocks.
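
For readers unfamiliar with the distance underlying the tree, Nei's D_A distance between two populations is one minus the average over loci of the sum over alleles of sqrt(x · y), where x and y are the frequencies of an allele in the two populations. The sketch below computes it from per-locus frequency dictionaries. The data layout and function name are assumptions, and tree construction itself is not shown.

```python
from math import sqrt


def nei_da(pop_x, pop_y):
    """Nei's D_A distance between two populations.

    Each population is a list with one dict per locus mapping
    allele -> frequency; both lists must cover the same loci in the same order.
    """
    if len(pop_x) != len(pop_y) or not pop_x:
        raise ValueError("populations must share a non-empty set of loci")
    total = 0.0
    for fx, fy in zip(pop_x, pop_y):
        # Alleles absent from either population contribute sqrt(0) = 0.
        total += sum(sqrt(fx[a] * fy[a]) for a in fx if a in fy)
    return 1.0 - total / len(pop_x)


# Hypothetical frequencies at two loci.
x = [{275: 0.5, 279: 0.5}, {100: 1.0}]
y = [{275: 0.5, 279: 0.5}, {104: 1.0}]
print(nei_da(x, y))  # identical at the first locus, disjoint at the second
```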

In the neighbor-joining tree (Figure 8), a reasonably well-supported cluster (86%) includes all non-Andean South American populations, together with the Andean-speaking Inga population from southern Colombia. Within this South American cluster, strong support exists for separate clustering of Chibchan–Paezan (97%) and Equatorial–Tucanoan (96%) speakers (except for the inclusion of the Equatorial–Tucanoan Wayuu population with its Chibchan–Paezan geographic neighbors, and the inclusion of Kaingang, the single Ge–Pano–Carib population, with its Equatorial–Tucanoan geographic neighbors). Within the Chibchan–Paezan and Equatorial–Tucanoan subclusters several subgroups have strong support, including Embera and Waunana (96%), Arhuaco and Kogi (100%), Cabecar and Guaymi (100%), and the two Ticuna groups (100%). When the tree-based clustering is repeated with alternate genetic distance measures, despite the high Mantel correlation coefficients [38] between distance matrices (0.98, 0.98, and 0.99 for comparisons of the Nei and Reynolds matrices, the Nei and chord matrices, and the Reynolds and chord matrices, respectively), higher-level groupings tend to differ slightly or to have reduced bootstrap support (Figures S4 and S5). However, local groupings such as Cabecar and Guaymi, Arhuaco and Kogi, Aymara and Quechua, and Ticuna (Arara) and Ticuna (Tarapaca) continue to be supported (100%). This observation of strongly supported genetic relationships for geographically proximate linguistically similar groups coupled with smaller support at the scale of major linguistic groupings is also seen in Native American mitochondrial data [39].

To more quantitatively test the correspondence of genetic and linguistic variation in the Americas, we computed the Mantel correlation of genetic and linguistic distances (Table 3). Nei's Da distance [36] was used for the genetic computations, and linguistic distances were measured along a discrete scale (see Methods). Considering all of the Native American populations and treating all linguistic stocks as equidistant (Table S3), the Mantel correlation of Nei genetic distance with linguistic distance is small (r = 0.04). The correlation is also small when using between-stock linguistic distance measures (Tables S4S11) that make use of shared etymologies identified by Greenberg [34]. For two ways of computing linguistic distance, using the Dice and Jaccard indices (see Methods), respectively, the correlations are r = −0.01 and r = −0.02. When the effects of geography are controlled, or when stocks are excluded from the computation individually, the partial correlations of linguistic and genetic distance [40] remain low.
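
A Mantel correlation of the kind reported here can be sketched as the Pearson correlation between the off-diagonal entries of two distance matrices, with significance assessed by permuting the population labels of one matrix. The permutation count, the seed, and the one-sided p-value convention below are illustrative choices and are not taken from the study's Methods.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def upper_triangle(matrix, order):
    """Off-diagonal entries above the diagonal, after relabelling by `order`."""
    n = len(matrix)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def mantel(a, b, permutations=999, seed=0):
    """Return (r, p): Mantel correlation and one-sided permutation p-value."""
    n = len(a)
    identity = list(range(n))
    fixed = upper_triangle(b, identity)
    r_obs = pearson(upper_triangle(a, identity), fixed)
    rng = random.Random(seed)
    order = identity[:]
    hits = 1  # the observed arrangement counts as one permutation
    for _ in range(permutations):
        rng.shuffle(order)
        if pearson(upper_triangle(a, order), fixed) >= r_obs:
            hits += 1
    return r_obs, hits / (permutations + 1)


# Hypothetical symmetric distance matrices for four populations.
genetic = [[0, 1, 2, 3], [1, 0, 1, 2], [2, 1, 0, 1], [3, 2, 1, 0]]
linguistic = [[0, 1, 1, 2], [1, 0, 1, 2], [1, 1, 0, 1], [2, 2, 1, 0]]
print(mantel(genetic, linguistic, permutations=199))
```

Rows and columns are permuted together because the pairwise entries of a distance matrix are not independent observations.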

Table 3.

Correlation of Genetic and Linguistic Distances

doi:10.1371/journal.pgen.0030185.t003

A potential explanation for the low correlation coefficients—suggested by the apparent genetic and linguistic correspondence in the neighbor-joining tree for closely related groups—is that sizeable correlation between genetic and linguistic distance may exist only below a certain level of linguistic distance. Considering genetic and linguistic differentiation only for pairs of populations within linguistic stocks, the correlation of genetic distance and linguistic distance increases (r = 0.53). The partial correlation of genetic distance and linguistic distance remains fairly high when the effect of geographic distance is controlled (r = 0.40), although 11% of random matrix permutations produce higher values (Table 3).
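
The partial correlations quoted above control for a third variable, here geographic distance. For three pairwise correlations, the standard first-order partial correlation is r_xy·z = (r_xy − r_xz · r_yz) / sqrt((1 − r_xz²)(1 − r_yz²)). The sketch below evaluates that formula with made-up inputs. The study's pairwise correlations with geography are not given in this section, so its reported values cannot be reproduced from it.

```python
from math import sqrt


def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for z."""
    denom = sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    if denom == 0:
        raise ValueError("z is perfectly correlated with x or y")
    return (r_xy - r_xz * r_yz) / denom


# Hypothetical values: genetic vs linguistic (x, y), each vs geography (z).
print(partial_correlation(0.5, 0.5, 0.5))
```

When both variables are themselves correlated with geography, part of their raw correlation is attributed to geography, so the partial value is lower than the raw one in this example.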

By excluding language stocks from the computation individually, it is possible to investigate the extent to which individual linguistic stocks are responsible for the within-stock correlation of genetic and linguistic distance. When the Equatorial–Tucanoan stock is excluded, the correlation increases to 0.68, and the partial correlation controlling for geographic distance increases to 0.66. Excluding the Andean stock, however, both the correlation and the partial correlation decrease (to 0.46 and 0.26, respectively). Excluding any of the three other stocks for which more than one population is represented (Northern Amerind, Central Amerind, Chibchan–Paezan) does not lead to a sizeable change in either the correlation coefficient (0.54, 0.51, 0.55) or the partial correlation coefficient (0.40, 0.39, 0.40).

Native American Private Alleles

Considering alleles found only in one major geographic region worldwide, Native Americans have the fewest private alleles (Figure 9A). Private alleles, which lie at the extreme ends of the allele size range more often than expected by chance (p < 0.023), usually have low frequencies in the geographic region where they are found (≤13%). Within the Americas, counting alleles private to one of four subregions, northern populations have the most and eastern South American populations have the fewest private alleles, with western South American populations having slightly more than Central American populations (Figure 9B).
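
The sample-size-corrected private allele counts (see the Figure 9 caption) extend the rarefaction idea: an allele counts as private to a region in a set of subsamples if it appears in that region's subsample of g chromosomes and in none of the other regions' subsamples. Assuming subsamples are drawn independently per region, the expected count is a sum over alleles of a product of inclusion and exclusion probabilities. The function names and toy counts below are hypothetical.

```python
from math import comb


def prob_absent(count, n, g):
    """Probability that an allele with `count` copies among n is missed by a subsample of g."""
    return comb(n - count, g) / comb(n, g)


def rarefied_private_alleles(region_counts, target, g):
    """Expected number of alleles private to `target` at one locus.

    `region_counts` maps region name -> dict of allele -> copy count.
    Every region must have at least g sampled chromosomes.
    """
    sizes = {r: sum(c.values()) for r, c in region_counts.items()}
    if any(n < g for n in sizes.values()):
        raise ValueError("every region needs at least g chromosomes")
    alleles = set().union(*region_counts.values())
    total = 0.0
    for allele in alleles:
        # Present in the target region's subsample ...
        p = 1 - prob_absent(region_counts[target].get(allele, 0), sizes[target], g)
        # ... and absent from every other region's subsample.
        for region, counts in region_counts.items():
            if region != target:
                p *= prob_absent(counts.get(allele, 0), sizes[region], g)
        total += p
    return total


# Hypothetical counts at one locus in two regions.
counts = {"Americas": {275: 2, 279: 2}, "Elsewhere": {279: 2, 283: 2}}
print(rarefied_private_alleles(counts, "Americas", 2))
```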

Figure 9. The Mean and Standard Error Across 678 Loci of the Number of Private Alleles as a Function of the Number of Sampled Chromosomes

For a given locus, region, and sample size g, the number of private alleles in the region—averaging over all possible subsamples that contain g chromosomes each from the five regions—is computed according to an extension of the rarefaction method [25]. For each sample size g, loci were considered only if their sample sizes were at least g in each geographic region. Error bars denote the standard error of the mean across loci.

doi:10.1371/journal.pgen.0030185.g009

Despite this general lack of high-frequency private alleles, especially in Native Americans, we observed that the only common (>13%) regionally private variant in the worldwide dataset was a Native American private allele. This allele, corresponding to a length of 275 base pairs at locus D9S1120, was found at a frequency of 36.4% in the full Native American sample, and was absent from the other 49 world populations. Allele 275 is the smallest variant observed at the locus and it is present in each of the 29 Native American populations—at frequencies ranging from 11.1% in Ticuna (Tarapaca) to 97.1% in Surui (Figure 10). This allele has now been observed in every Native American population in which the locus has been investigated [41,42], and it has only been seen elsewhere in two populations at the far eastern edge of Siberia [42].

Figure 10. Allele Frequency Distribution at Tetranucleotide Locus D9S1120

For each population the sizes of the colored bars are proportional to allele frequencies in the population, with alleles color-coded as in the legend. Alleles are ordered from bottom to top by increase in size, with the smallest allele, a Native American private allele of size 275, shown in red, and the largest allele, 315, shown in dark blue.

doi:10.1371/journal.pgen.0030185.g010

Discussion

Because of the likely submergence of key archaeological sites along the Pacific coast, the relative absence of a written record, and the comparatively recent time scale of the initial colonization, population-genetic approaches provide a particularly important source of data for the study of Native American population history [43–52]. In this article, building upon recent investigations that have increased the size of Native American genetic datasets beyond classical marker, Y-chromosomal, mitochondrial, and single-gene studies [7,11,13,16,41,53–65], we have examined genome-wide patterns of variation in a dataset that, in terms of total genotypes, represents the largest continent-wide Native American population-genetic study performed to date. Our results have implications for a variety of topics in the demographic history of Native Americans, including (1) the process by which the American landmass was originally populated, (2) the routes taken by the founders during and subsequent to the migration, and (3) the extent to which genes and languages have traveled together during the diversification of Native American populations. We discuss these issues in sequence.

Genetic Signatures of the Colonization from Siberia

The lower level of genetic diversity observed in the Americas compared to other continental regions is compatible with a reduction in population size associated with a geographically discrete founding, representing one of the most recent in a series of major bottlenecks during human expansions outward from Africa [11]. Gradients of genetic diversity (Figure 3) and decreasing similarity to Siberians (Figure 6) also point to extant Native Americans as the descendants of a colonization process initiated from the northwestern part of the American landmass. An alternative possibility that could produce a genetic diversity gradient (namely, a north-to-south gradient of recent admixture from high-diversity European populations) can be eliminated as a possible explanation given that (1) European admixture is not strongly correlated with distance from the Bering Strait (r = −0.135), (2) inclusion of a European admixture covariate in the regression of heterozygosity on distance from the Bering Strait is not supported (p = 0.37) and only slightly increases the fit of the regression model (R² = 0.208 compared to R² = 0.182), and (3) the regression of heterozygosity on distance from the Bering Strait does not change substantially when the most highly admixed populations are excluded from the analysis (Table S12). The genetic diversity and population structure gradients, which are generally compatible with principal component maps of allele frequencies at small numbers of classical markers [1,66] and with some analyses of mitochondrial, X-chromosomal, and Y-chromosomal data [67,68], are more clearly visible in our study of a larger number of loci.
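
The covariate check in point (2) amounts to comparing the fit of two nested least-squares regressions. The sketch below shows that comparison on made-up numbers; the variable names, the simulated values, and the helper `r_squared` are assumptions for illustration only and do not reproduce the study's data or its software.

```python
import numpy as np


def r_squared(y, *predictors):
    """Coefficient of determination for an ordinary least-squares fit of y
    on an intercept plus the given predictor arrays."""
    y = np.asarray(y, dtype=float)
    columns = [np.ones_like(y)] + [np.asarray(p, dtype=float) for p in predictors]
    X = np.column_stack(columns)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    return 1.0 - (residuals @ residuals) / np.sum((y - y.mean()) ** 2)


# Simulated values for 29 hypothetical populations (NOT data from the study).
rng = np.random.default_rng(1)
distance = rng.uniform(0, 15000, size=29)    # km from the Bering Strait
admixture = rng.uniform(0, 0.3, size=29)     # estimated European admixture
heterozygosity = 0.75 - 1e-5 * distance + rng.normal(0, 0.03, size=29)

r2_distance = r_squared(heterozygosity, distance)
r2_both = r_squared(heterozygosity, distance, admixture)
```

Because the models are nested, the second R² can never be lower than the first, so a small increase such as the one reported above is not by itself evidence that the covariate matters; the significance test on the added term (p = 0.37 in the text) carries that judgment.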

Although gradients of genetic diversity and Siberian similarity constitute major features of the pattern of Native American variation when considering all of the loci together, one important aspect of Native American variation, the distribution of a private allele at locus D9S1120, deviates from the genome-wide pattern and does not show a north-to-south frequency gradient. The geographic distribution of this allele is similar to the distributions of certain mitochondrial and Y-chromosomal variants that are also ubiquitous in the Americas, but that are absent elsewhere or that are found outside the Americas only in extreme northeast Siberia [69–74]. Such distributions are most easily explained by the spatial diffusion of initially rare variants during the colonization of the continent, rather than by continent-wide natural selection or by an origin considerably later than the colonization [42,75,76]. The restricted distribution in Asia of D9S1120 allele 275 and similar Y-chromosomal and mitochondrial variants suggests one of several explanations [42]: the ancestral population that migrated to the Americas may have already acquired a degree of genetic differentiation from other Asian populations [77], descendants of the original Native American founders are no longer present elsewhere in Asia, or these descendants have not yet been genotyped at loci that carry apparently private Native American variants.

The genomic continent-wide patterns observed here can be explained most parsimoniously by a single main colonization event, as proposed by some interpretations of archaeological, mitochondrial, and Y-chromosomal data [67,74,77–83]. In this view, at each step in the migration, a subset of the population splitting off from a parental group moves deeper into the Americas, taking with it a subset of the genetic variation present in the parental population. This scenario would be expected to produce a set of low-diversity populations with distinctive patterns of variation at the far terminus of the migration, such as those we and others [84] observe in the Ache and Surui populations. It can also explain the gradient of Siberian similarity, and the continent-wide distribution of D9S1120 allele 275. Alternatively, similar patterns could result from gene flow across the Bering Strait in the last few thousand years, together with continual interactions between neighbors on both sides of the Bering Strait [47]. It is also possible to envision a series of prehistoric migrations, possibly from the same source population, with the more recent descendants gradually diffusing into pre-existing Native American populations.
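
The stepwise subsampling described above can be illustrated with a toy simulation. This is a sketch under strong simplifying assumptions (a single locus, each new population founded by a fixed number of chromosomes drawn at random from its parent, and no mutation, growth, or later gene flow); it is not the model fitted in this study, and all names and parameter values are hypothetical.

```python
import numpy as np


def serial_founder(freqs, n_founder_chromosomes, n_steps, seed=0):
    """Return allele-frequency vectors along a chain of founding events.
    Each step draws n_founder_chromosomes from the previous population."""
    rng = np.random.default_rng(seed)
    history = [np.asarray(freqs, dtype=float)]
    for _ in range(n_steps):
        counts = rng.multinomial(n_founder_chromosomes, history[-1])
        history.append(counts / n_founder_chromosomes)
    return history


def heterozygosity(freqs):
    """Expected heterozygosity, 1 minus the sum of squared allele frequencies."""
    return 1.0 - float(np.sum(np.square(freqs)))


# Ten founding steps from a source with ten equally frequent alleles
# (hypothetical values chosen only for illustration).
history = serial_founder([0.1] * 10, n_founder_chromosomes=20, n_steps=10)
alleles_remaining = [int(np.count_nonzero(f)) for f in history]
diversity = [heterozygosity(f) for f in history]
```

In a chain like this, an allele lost at one step cannot reappear further along, so the number of alleles never rises with distance from the source, and a variant that happens to drift to high frequency early can be carried into every descendant population. Heterozygosity declines on average but can fluctuate in a single run.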

Routes of Population Dispersal

Largely on the basis of archaeological data, a classical model for the colonization of the Americas posits that humans entered the region towards the end of the Wisconsin glaciation (~11,000 y ago) via a mid-continental ice-free corridor between the Cordilleran and Laurentide glaciers [78,79]. According to this model, migration southwards would have followed a pattern with a front of advance at approximately the same latitude across North America.

It is interesting to consider the patterns of genetic structure observed here within the context of the emphasis placed recently on the Pacific coast as an alternative to the inland ice-free corridor route of population dispersal in the Americas [79,85–87]. The late timing of the rapid inland colonization model has been put into some doubt by the discovery of early archaeological sites that predate by thousands of years the most recent deglaciation of North America [88]. In addition, recent geological evidence indicates that ice-free areas west of the Cordilleran ice sheet may have existed as early as ~14,000 y ago [79], suggesting the possibility of an early coastal migration. Within South America, the coastal colonization model suggests an early southward migration along the western side of the Andes and is consistent with an interpretation that modern speakers of Andean languages may represent descendants of the first occupiers of the region [1]. Recent computer simulations also suggest that a coastal colonization model may more easily explain observed patterns of classical marker and mitochondrial DNA diversity [89].

Several observations from our data are compatible with the proposal of a coastal colonization route. The stronger correlation of genetic diversity with geographic distance when higher coastal mobility is taken into account (Figure 4) supports a possible role for population dispersals along the coast (note, however, that the difference in the tree structure induced by the optimal route in Figure 4 and the tree in Figure 8 suggests that alternative routes might be preferred if more aspects of the genetic data were incorporated into the coastal analysis). Consistent with observations of recent migration paths of certain Amazonian populations [43], we did not find support for migrations along major rivers. Finally, the relative genetic similarity of Andean populations to populations from Mesoamerica (Figure 7) is also compatible with an early Pacific coastal colonization. Under this view, the east-to-west difference in genetic diversity in South America, a pattern also observed with mitochondrial and Y-chromosomal markers [90–92] (including the extremely low diversity in the Ache [93,94] and Surui [94] populations), could reflect an initial colonization of western South America followed by subsampling of western populations to form the eastern populations.

An alternative interpretation of the Mesoamerican and Andean similarity is that this pattern is recent in origin. In this case, the reduced diversity and increased population structure in eastern South America may reflect a deep divergence between western and eastern populations, so that their different levels of differentiation could result from different levels of gene flow and genetic drift in western and eastern South America. The genetic similarity among Andean populations, and their relative similarity to the populations sampled from Mesoamerica, would perhaps then reflect recent gene flow along the coast.

Similar to results seen in some mitochondrial studies [95–97], Central American and South American populations from the Chibchan–Paezan language stock had slightly reduced heterozygosity compared to neighboring populations. Interestingly, the Cabecar and Guaymi populations from lower Central America (Costa Rica and Panama) were robustly placed at the tips of a northwest South American Chibchan–Paezan cluster in the tree of Figure 8. One explanation of this observation is that these populations may be of South American origin, as the ancestral group for the cluster could have been a South American population, most of whose descendants remain in South America. Alternatively, the large cluster containing the Chibchan–Paezan and Equatorial–Tucanoan populations could be the result of a colonization of South America separate from the colonization by the Andean populations, with the founder population possibly speaking a language from which modern Chibchan–Paez