supplemental methods

TEXT S1
COALESCENT SIMULATIONS. We used the computer package Simcoal [v2.1,
1] to simulate the branching and isolation by distance models. For the branching model,
we simulated the population history depicted in Figure 2 (in the main text). Each
population consisted of multiple diploid genomes, and each genome consisted
of multiple unlinked microsatellite loci. The loci accumulated mutations under a stepwise
model [2] at a rate of 0.0075 mutations per generation [3]. Moving backwards in time, the
populations merged in the pattern depicted in Figure 2. The first mergers occurred
between populations within islands beginning 75 generations before the present (bp). The
ancestral island populations merged in a nested fashion between 150 and 2000
generations bp, with the final merger occurring 4000 generations bp, representing the
initial population split in Asia. Other population branching times and branching patterns
were also simulated. Their common thread was the deep split between the Oceanic and
Papuan branches and the absence of migration between populations. The reported allelic
identity pattern was unaffected by these adjustments or by the adjustment of the number
of genomes per population, the number of unlinked loci per genome, or population
growth rates.
For the isolation by distance simulations, populations fused into a single ancestral
population at a single time in the past. Populations were arrayed end to end in a linear
stepping-stone fashion, and, prior to fusion, each population exchanged a constant portion
of its genomes each generation with its immediate neighbors. In different replicates of the
simulations, we varied the number of genomes/population, the number of loci/genome,
the fusion time, and the proportion of migrants/population/generation. Varying these
parameters did not affect the reported allelic identity pattern.
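The linear stepping-stone arrangement described above can be sketched as a migration matrix. This is only an illustration, not the Simcoal input used for the reported simulations: the function name, the parameter `m`, and the even split of migrants between the two neighbors are assumptions.

```python
def linear_stepping_stone(n_pops, m):
    """Build a migration matrix for populations arrayed end to end.

    Entry [i][j] is the fraction of population i's genomes that comes from
    population j each generation. `m` is the total fraction exchanged per
    generation; splitting it evenly among the available neighbors is an
    assumption made for this sketch.
    """
    if n_pops < 2:
        raise ValueError("a stepping stone needs at least two populations")
    matrix = [[0.0] * n_pops for _ in range(n_pops)]
    for i in range(n_pops):
        # End populations have one neighbor, interior populations have two.
        neighbors = [j for j in (i - 1, i + 1) if 0 <= j < n_pops]
        for j in neighbors:
            matrix[i][j] = m / len(neighbors)
        matrix[i][i] = 1.0 - m  # genomes that stay in place
    return matrix


if __name__ == "__main__":
    for row in linear_stepping_stone(5, 0.1):
        print(row)
```

Only immediate neighbors receive non-zero entries, which is what produces the decay of allelic identity with distance along the line.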
We also constructed an isolation by distance model in which populations
exchanged genomes with geographic neighbors in two dimensions. In the 2-dimensional
simulations, we used a 7 x 7 grid of 49 populations. Both the linear and 2-dimensional
models showed a steady decay in allelic identity with increasing geographic distance
between populations. We report only the linear stepping-stone results in the main text.
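As a rough sketch of the 2-dimensional layout, the neighbor relationships on a 7 x 7 grid can be enumerated as follows. The text does not say which neighbors exchanged genomes, so the four-neighbor (up, down, left, right) adjacency used here is an assumption, as are the function names.

```python
import math


def grid_neighbors(rows, cols):
    """Map each population index to its adjacent populations on a grid.

    Populations are numbered row by row starting at 0. Four-neighbor
    adjacency is an assumption of this sketch.
    """
    neighbors = {}
    for r in range(rows):
        for c in range(cols):
            adjacent = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    adjacent.append(rr * cols + cc)
            neighbors[r * cols + c] = adjacent
    return neighbors


def grid_distance(a, b, cols):
    """Straight-line distance between two populations, in grid units."""
    ra, ca = divmod(a, cols)
    rb, cb = divmod(b, cols)
    return math.hypot(ra - rb, ca - cb)


if __name__ == "__main__":
    nb = grid_neighbors(7, 7)
    print(len(nb), "populations; center population 24 borders", sorted(nb[24]))
```

Pairing `grid_distance` with the simulated allelic identities is one way to examine the decay with geographic distance in two dimensions.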
Fifty simulations sufficiently captured the pattern of allelic identity variation
associated with each model. Following each simulation, twenty diploid genomes were
sampled from each population, from which the within- and between-population allelic
identities were estimated. The allelic identities were then averaged for the 50 simulations
and plotted in the color-coded heat plots.
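One common way to estimate within- and between-population allelic identity from sampled allele copies is sketched below. The exact estimator used for the reported plots is not given in the text, so the formulas and function names here are assumptions: identity is taken as the probability that two randomly drawn allele copies have the same repeat size.

```python
from collections import Counter


def within_identity(alleles):
    """Probability that two distinct copies from one population match."""
    n = len(alleles)
    if n < 2:
        raise ValueError("need at least two allele copies")
    counts = Counter(alleles)
    # Pairs drawn without replacement, so a copy is never matched to itself.
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))


def between_identity(alleles_x, alleles_y):
    """Probability that one copy from each of two populations match."""
    if not alleles_x or not alleles_y:
        raise ValueError("both samples must be non-empty")
    cx, cy = Counter(alleles_x), Counter(alleles_y)
    matches = sum(cx[a] * cy[a] for a in cx)
    return matches / (len(alleles_x) * len(alleles_y))


def mean_identity(values):
    """Average identity over loci or over simulation replicates."""
    values = list(values)
    return sum(values) / len(values)


if __name__ == "__main__":
    pop_a = [10, 10, 11, 11]  # toy allele sizes at one locus
    pop_b = [10, 12, 12, 12]
    print(within_identity(pop_a), between_identity(pop_a, pop_b))
```

With twenty diploid genomes sampled per population, each list would hold 40 allele copies per locus, and `mean_identity` would be applied across loci and then across the 50 simulations.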
TREE CONSTRUCTION. The language trees were estimated using two
simultaneous, completely independent analyses starting from different random trees,
using the MrBayes v3.1 [4] computer package. The analyses were continued until the
standard deviation of the split frequencies between the runs was less than 0.01. We used
the program defaults for “standard” data for the prior probability distribution for the
parameters of the likelihood model. The defaults included: 1) all possible trees were
considered a priori to be equally probable; 2) the probability of a change from one binary
linguistic feature state to the other was equal in both directions; 3) branch lengths were
unconstrained, with uniform prior probabilities. After discarding the initial 250 trees generated by the
program, approximately 1000 trees were retained, and the reported bootstrap values for
the language tree and revised language tree represent the percent of time a particular
branch was observed in the 1000 or so trees. All estimated branch lengths were
significantly greater than zero. Language trees were also constructed using distance-based
and parsimony methods, and the tree topologies, model-fitting results, and
conclusions were similar for all construction methods.
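The branch support values and the convergence check described above can be illustrated with a small sketch. It does not reproduce MrBayes internals: each tree is represented here as a set of clades (a hypothetical simplification), and the use of the sample standard deviation across the two runs is an assumption.

```python
from collections import Counter
from statistics import stdev


def split_frequencies(trees, burn_in=0):
    """Fraction of retained trees that contain each clade.

    `trees` is a list of trees, each given as a set of frozensets of taxon
    names (a hypothetical representation chosen for this sketch).
    """
    kept = trees[burn_in:]
    if not kept:
        raise ValueError("burn-in discards every tree")
    counts = Counter(split for tree in kept for split in tree)
    return {split: n / len(kept) for split, n in counts.items()}


def average_sd_of_split_frequencies(run_a, run_b, burn_in=0):
    """Mean, over clades, of the SD of clade frequencies between two runs."""
    fa = split_frequencies(run_a, burn_in)
    fb = split_frequencies(run_b, burn_in)
    splits = set(fa) | set(fb)
    sds = [stdev([fa.get(s, 0.0), fb.get(s, 0.0)]) for s in splits]
    return sum(sds) / len(sds)


if __name__ == "__main__":
    ab, cd = frozenset({"A", "B"}), frozenset({"C", "D"})
    run = [{ab, cd}, {ab, cd}, {ab}, {cd}]
    for clade, freq in split_frequencies(run, burn_in=1).items():
        print(sorted(clade), round(100 * freq), "%")
```

In the analysis described above, `burn_in` would correspond to the 250 discarded trees, and the runs would continue until the averaged value dropped below 0.01.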
The gene tree was constructed from RST genetic distances [2] obtained using
Arlequin v3.1 [5]. Trees were also constructed from FST-based distances [using the
transformation of 6] and from minimum genetic distances [7]. All trees had nearly
identical topologies. The bootstrap values were obtained from microsatellite allele
frequencies estimated using CONVERT [8] and the PHYLIP package [9]. Trees were
drawn using TREEVIEW [10] and modified by hand.
OUTLIER POPULATION IDENTIFICATION IN THE OBSERVED VS.
EXPECTED GENETIC DISTANCE PLOTS. To identify the populations that
contributed most to the lack of fit of the language tree, we removed the linguistic data for
one population, made a new language tree (using the Bayesian approach), and then refit
the new tree. This procedure was implemented 23 times, once for each population. The
procedure indicated that, by a wide margin, the Kol contributed most to the lack of fit of
the language tree. We then repeated the procedure 22 times for the Kol-less tree. The
population that contributed most to the lack of fit was then removed, a new tree was
made, and the model-fitting procedure was applied to the new tree. This procedure was
repeated until the reduction in  leveled off, which occurred after an additional 4
populations were removed (see Figure S1). The 5 removed populations were the Kol,
Ata, Kuot, Saposa and Tigak (see also Table 4 in the main text).
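The stepwise removal procedure can be summarized as a greedy leave-one-out loop. In this sketch, `lack_of_fit` is a hypothetical callable standing in for the full pipeline (rebuild the language tree without a population, refit it, and return the lack-of-fit value); it is not part of any package named here.

```python
def rank_outliers(populations, lack_of_fit, max_removed=None):
    """Repeatedly drop the population whose removal most improves the fit.

    Returns a list of (population, reduction) pairs in removal order, where
    `reduction` is the drop in lack of fit gained by that removal.
    """
    remaining = list(populations)
    limit = len(remaining) - 1 if max_removed is None else max_removed
    current = lack_of_fit(remaining)
    history = []
    while len(history) < limit and len(remaining) > 1:
        # Refit once for every candidate removal (23 times on the first pass).
        scores = {
            p: lack_of_fit([q for q in remaining if q != p]) for p in remaining
        }
        best = min(scores, key=scores.get)
        history.append((best, current - scores[best]))
        remaining.remove(best)
        current = scores[best]
    return history


if __name__ == "__main__":
    # Toy stand-in: each population adds a fixed, made-up amount of misfit.
    toy = {"P1": 100, "P2": 40, "P3": 2, "P4": 1}
    print(rank_outliers(list(toy), lambda pops: sum(toy[p] for p in pops), 2))
```

Plotting the returned reductions against removal order gives the kind of curve used to judge where the improvement levels off.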
FIGURE S1
Reduction in  After Removal of the Nth Outlier Population
[Plot: x-axis, Nth population removed (0 to 17); y-axis, reduction in  after removal (0 to 3000).]
LINGUISTIC DATA AND SUPPLEMENTAL ANALYSES. Table S1 is the
matrix of linguistic distances. Figure S2 shows the distribution of the linguistic distances
in localized geographic and linguistic subsets of the data. The top plot highlights the large
linguistic distances among the Papuan languages compared to Oceanic languages. The
distances between the Oceanic and Papuan languages are about as large as those between
Papuan languages, though some of the distances are more intermediate between the two
groups, which is consistent with previous findings that there has been some structural
linguistic exchange between the language groups [11,12].
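A linguistic distance of the kind tabulated in Table S1 can be sketched as the percentage of structural features that differ between two languages. The binary feature coding follows the text, but the treatment of missing values (skipping features coded `None` in either language) and the function name are assumptions.

```python
def feature_distance(features_a, features_b):
    """Percentage of comparable binary features that differ.

    Features coded as None in either language are skipped, which is an
    assumption of this sketch about how missing data might be handled.
    """
    if len(features_a) != len(features_b):
        raise ValueError("feature lists must have the same length")
    pairs = [
        (a, b)
        for a, b in zip(features_a, features_b)
        if a is not None and b is not None
    ]
    if not pairs:
        raise ValueError("no comparable features")
    differing = sum(1 for a, b in pairs if a != b)
    return 100.0 * differing / len(pairs)


if __name__ == "__main__":
    lang_x = [1, 0, 1, 1, None, 0]  # toy feature vectors, not real data
    lang_y = [1, 1, 0, 1, 0, 0]
    print(feature_distance(lang_x, lang_y))
```

Applying the function to every pair of languages yields a symmetric matrix with zeros on the diagonal, of which only the lower triangle needs to be reported.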
The bottom plot highlights the linguistic distances for the coastal and interior sets.
The similarity of these distributions to those in the top plot reflects the strong association between language
and location, i.e., Oceanic languages are located mainly on the coasts, and Papuan
languages are located mainly in island interiors. The lowest inter-location values (the
three leftmost yellow bars) are between the Oceanic languages located in the interior of
New Britain (Mamusi and Nakanai-S) and the Oceanic languages on the coasts of the
island. The distances are typical of those between other Oceanic languages (top plot). In
contrast to the linguistic pattern, the allelic identity in the Mamusi and Nakanai-S is high
(see main text and Table 1), which is more typical of Papuan-speaking groups. The
Oceanic-typical linguistic distances but high allelic identity for the two populations may
reflect a genetic founder effect that occurred when the Mamusi and Nakanai-S moved
from the New Britain coasts to its interior.
FIGURE S2
Top plot: Linguistic Distances within and between Papuan and Oceanic languages. Series: Oceanic vs. Oceanic, Papuan vs. Papuan, Oceanic vs. Papuan.
Bottom plot: Linguistic Distances within and between Coastally- and Interiorly-located languages. Series: Coast vs. Coast, Interior vs. Interior, Coast vs. Interior.
[Both plots: x-axis, Linguistic Distance (proportion of different features), 0 to 55; y-axis, Count.]
Table S1. Proportion of different features between languages (values in percent)
          Aita   Anêm   Ata    Kaket  Mali   Kol    Kove   Kuot   Tun    Mad    Mam    Mang   Mela   Meng   Muss   Nak    Nalik  Nasioi Notsi  Sap    Sulka  Tigak  Tolai
Aita      0
Anêm      44.80  0
Ata       43.80  28.70  0
Kaket     43.70  39.80  45.30  0
Mali      39.60  33.00  31.90  29.10  0
Kol       47.90  32.40  37.60  34.70  27.00  0
Kove      39.80  23.80  26.70  38.50  33.30  34.60  0
Kuot      43.80  34.80  37.20  36.90  34.80  36.00  36.20  0
Tungag    38.20  34.30  38.50  26.00  31.40  35.30  29.30  34.30  0
Madak     38.80  32.70  42.00  28.10  31.70  38.60  30.50  28.70  17.70  0
Mamusi    39.00  20.00  25.60  36.00  31.60  37.90  11.80  34.70  25.60  21.10  0
Mangseng  40.20  25.50  34.40  34.80  37.30  37.00  17.20  35.30  24.00  22.60  8.60   0
Melamela  37.60  29.60  30.80  34.70  28.70  39.60  26.50  38.90  19.80  20.60  18.10  19.80  0
Mengen    43.00  31.50  35.90  34.00  34.20  38.00  27.20  43.20  24.50  21.20  13.80  16.80  17.00  0
Mussau    40.40  29.10  34.40  28.70  31.10  35.60  25.00  30.10  24.70  22.60  15.10  20.20  20.80  14.90  0
Nakanai   39.40  29.70  37.20  31.40  35.10  38.00  25.20  36.00  20.60  18.20  12.60  13.90  12.10  14.80  18.40  0
Nalik     35.90  36.00  35.90  30.00  32.40  33.60  23.50  30.60  14.70  23.50  25.50  26.00  23.80  27.50  25.70  21.30  0
Nasioi    34.50  44.10  40.00  45.20  44.10  47.00  42.10  44.10  49.50  51.10  39.50  43.00  44.30  43.40  48.90  49.50  43.90  0
Notsi     37.80  29.50  31.80  26.00  31.40  29.10  24.50  31.40  19.80  15.80  21.30  21.10  24.00  24.50  18.60  22.30  24.80  46.40  0
Saposa    34.50  26.50  39.50  30.40  30.60  35.10  26.00  27.60  21.50  24.20  21.80  25.80  16.00  24.50  21.70  14.60  18.60  42.20  26.10  0
Sulka     49.00  28.70  41.50  35.00  37.40  30.60  33.30  44.30  29.50  28.70  27.40  27.50  27.80  22.50  29.10  28.80  35.10  48.00  31.40  33.70  0
Tigak     35.60  24.00  35.20  35.80  28.80  37.60  25.50  35.60  19.40  21.50  20.00  20.60  14.00  21.60  20.60  20.60  24.80  44.70  22.70  19.40  29.80  0
Tolai     39.30  36.20  33.30  39.60  30.50  29.10  20.80  37.10  26.50  23.40  19.10  22.20  21.60  23.50  18.00  20.20  22.30  48.50  19.60  23.70  29.50  22.20  0
HEAT PLOT OF NEW BRITAIN INTERIOR. In the isolation by distance
simulations, the populations are arrayed next to one another in a linear stepping stone, but
the 33 sampled NIM populations are not located next to one another in a simple linear
fashion. Figure S3 shows why the lack of congruence between the observed heat plot and
the simulated isolation by distance heat plot is not the result of this lack of geographic
contiguity of the NIM populations. Figure S3A is the heat plot for 9 contiguous
populations from an isolation by distance simulation of 50 populations, and Figure S3B is
the observed heat plot for the 9 interior New Britain populations. The plots look different
because the 9 simulated populations are geographically contiguous, while the 9 observed
populations are from very different geographic locations in the New Britain interior. Five
of them are located close to one another in central New Britain, another is located in the
central eastern portion of the island and three are located in the north. Figure S3C
contains 9 populations from the isolation by distance simulations sampled in 3
non-contiguous blocks: 5 contiguous populations, a geographic gap of 12 populations, 1
population, a geographic gap of 3 populations, and 3 contiguous populations. The
simulated between-group allelic identity pattern in Figure S3C is remarkably similar to
the observed pattern in Figure S3B. Both plots capture the allelic identity gaps between
the non-contiguous blocks, caused by the gaps in geographic distance, and both plots
show a steady decrease in allelic identity away from the diagonals, representing a steady
decay in genetic similarity with increasing geographic distance.
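The non-contiguous sampling scheme for Figure S3C can be written out explicitly. The sketch below assumes the blocks start at simulated population 9, as the panel's axis labels suggest. The helper names are hypothetical.

```python
def block_indices(start, layout):
    """Expand a take/skip layout into population numbers.

    `layout` is a sequence of ("take", n) and ("skip", n) steps applied in
    order along the linear stepping stone, beginning at `start`.
    """
    indices, position = [], start
    for action, n in layout:
        if action == "take":
            indices.extend(range(position, position + n))
        elif action != "skip":
            raise ValueError(f"unknown action: {action}")
        position += n
    return indices


def submatrix(matrix, labels, wanted):
    """Extract the rows and columns of an identity matrix for `wanted`."""
    positions = [labels.index(w) for w in wanted]
    return [[matrix[i][j] for j in positions] for i in positions]


if __name__ == "__main__":
    layout = [("take", 5), ("skip", 12), ("take", 1), ("skip", 3), ("take", 3)]
    print(block_indices(9, layout))  # [9, 10, 11, 12, 13, 26, 30, 31, 32]
```

Passing these population numbers to `submatrix`, together with the full 50-population identity matrix, would give the 9 x 9 block that is compared with the observed New Britain interior pattern.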
FIGURE S3
A. Simulated Allelic Identities: Contiguous Populations (Pop 1 to Pop 9)
B. Observed Allelic Identities: New Britain Interior (Ata-L, Ata-U, Nakanai-S, Mamusi-L, Mamusi-K, Kol, Mali, Kaket-M, Kaket-R)
C. Simulated Allelic Identities: Non-contiguous Populations (Pop 9 to Pop 13, Pop 26, Pop 30 to Pop 32)
[Heat plots; color scales show gene identity, with tick values between 0.22 and 0.38.]
LITERATURE CITED
1. Laval G, Excoffier L (2004) SIMCOAL 2.0: a program to simulate genomic diversity
over large recombining regions in a subdivided population with a complex
history. Bioinformatics 20: 2485-2487.
2. Slatkin M (1995) A measure of population subdivision based on microsatellite allele
frequencies. Genetics 139: 457-462.
3. Brinkmann B, Klintschar M, Neuhuber F, Huhne J, Rolf B (1998) Mutation rate in
human microsatellites: influences of the structure and length of tandem repeat.
American Journal of Human Genetics 62: 1408-1415.
4. Huelsenbeck JP, Ronquist F (2001) MRBAYES: Bayesian inference of phylogeny.
Bioinformatics 17: 754-755.
5. Excoffier L, Laval G, Schneider S (2005) Arlequin ver. 3.0: An integrated software
package for population genetics data analysis. Evolutionary Bioinformatics
Online. pp. 47-50.
6. Reynolds J, Weir BS, Cockerham CC (1983) Estimation of the Coancestry Coefficient:
Basis for a Short-Term Genetic Distance. Genetics 105: 767-779.
7. Nei M (1987) Molecular Evolutionary Genetics. New York: Columbia University
Press.
8. Glaubitz JC (2004) CONVERT: A user-friendly program to reformat diploid genotypic
data for commonly used population genetic software packages. Molecular
Ecology Notes 4: 309-310.
9. Felsenstein J (2005) PHYLIP (Phylogeny Inference Package) version 3.6. Seattle:
Distributed by the author. Department of Genome Sciences, University of
Washington.
10. Page R (1996) TREEVIEW: An application to display phylogenetic trees on personal
computers. Computer Applications in the Biosciences 12: 357-358.
11. Lindström E, Terrill A, Reesink G, Dunn DM (2007) The Languages of Island
Melanesia. In: Friedlaender JS, editor. Genes, Language, and Culture Change in
the Southwest Pacific. New York: Oxford University Press. pp. 118-140.
12. Reesink G (2005) Sulka of East New Britain: a mixture of Oceanic and Papuan traits.
Oceanic Linguistics 44: 145-193.