Electronic supplement
We applied two different statistical approaches to infer whether kinship clusters mirror the
social structure given by communities and cliques. The general idea behind the approaches is
described in the main document and is explained here in more detail.
2) Relatedness coefficients and simulated Wilcoxon test statistic
The general idea behind this approach is to compare the degree of average pair-wise relatedness
between the different hierarchical components of the network: cliques, communities and the
population background. In the following we illustrate the procedure for the question of whether
relatedness within a community is higher than between communities. To this end, all possible
relationships between the individuals in the network are divided into two groups: pairs of
individuals that share community membership and those that come from different communities.
Pseudoreplication and non-independence of data points (each individual is involved in many
dyadic interactions) make a standard unpaired Wilcoxon rank-sum test unsuitable for testing
relatedness differences between the two groups. The Wilcoxon test statistic as such, however, is
still a valuable tool that can be used in a rank-randomization approach. Ranks are permuted
and randomly assigned to one of the two groups, keeping the number of dyads per group as in the
original sample. The Wilcoxon statistic is then calculated for the randomized data. This process is
repeated 10000 times yielding a null distribution of the test statistic, which can then be compared
to the empirical value (community vs. non-community). An empirical value that is high compared
to the simulated distribution indicates that the two groups differ more in relatedness than is
expected for the population background. The relative frequency of simulated values that are
equal to or larger than the empirical value can be interpreted as the p-value, the probability of
obtaining a test statistic at least as large as the empirical one if the null hypothesis is true
(Manly 1997).
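The rank-randomization procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the analysis: the function names (`midranks`, `rank_randomization_test`), the use of the rank sum of the within-community group as the Wilcoxon statistic, the mid-rank treatment of ties, and the toy relatedness values are all hypothetical choices made for the example.

```python
import random

def midranks(values):
    """Rank values from 1..n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def rank_randomization_test(relatedness, within, n_perm=10000, seed=1):
    """Compare the rank sum of the 'within' dyads to a permutation null."""
    rng = random.Random(seed)
    ranks = midranks(relatedness)
    n_within = sum(within)  # group size is kept as in the original sample
    observed = sum(r for r, w in zip(ranks, within) if w)
    shuffled = ranks[:]
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)                  # permute the ranks
        null.append(sum(shuffled[:n_within]))  # first n_within ranks form the 'within' group
    # one-sided p-value: share of simulated values >= empirical value
    p_value = sum(s >= observed for s in null) / n_perm
    return observed, null, p_value

# Toy data (invented for illustration): 4 within-community dyads, 16 between.
relatedness = [0.50, 0.45, 0.50, 0.40,
               0.00, 0.05, 0.10, 0.02, 0.03, 0.12, 0.08, 0.01,
               0.20, 0.15, 0.07, 0.04, 0.25, 0.06, 0.09, 0.11]
within = [True] * 4 + [False] * 16

observed, null, p_value = rank_randomization_test(relatedness, within)
print(observed, p_value)
```

Because only the ranks are reshuffled between groups of fixed size, the sketch makes no use of the distributional assumptions of the standard rank-sum test. It relies only on the empirical value's position within the simulated distribution.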
3) Pedigrees, simulated contingency tables and χ2 test statistic
This approach follows the same logic as the procedure described above. It also compares the
degree of relatedness within a given structure to the background of the next hierarchical level
(cliques, communities, population). Instead of comparing relatedness coefficients, it uses an
estimated relationship category between individuals. Relationship categories include
parent-offspring (PO), full sib (FS), half sib (HS) and unrelated (U). The analysis was restricted to
individual pairs where the relationship was attributed with at least 95% confidence (based on
5000 iterations; Kalinowski et al. 2006); ambiguous categories such as HSorFS, FSorPO etc.
were excluded. All pairs for which specific estimates were available were then divided into two
groups: the ‘within-community’ and the ‘between-community’ group (and, analogously, ‘within-clique’
and ‘between-clique’). The frequency information of each relationship category is then summarized in
a contingency table for both groups. Here again, due to pseudoreplication and non-independence
of the counts, a standard χ2 test of independence between genetic relationship and
group is not applicable. We therefore simulated 1000 contingency tables in which genetic
relationships were randomly assigned to each of the groups, keeping row and column totals
constant. For each simulated contingency table a χ2 value was calculated. The resulting
distribution of simulated χ2 values serves as the null distribution against which the χ2 value
obtained from the empirical contingency table is compared. As in the approach above, high
empirical values lying in the upper margin of the simulated distribution indicate differences
between groups that are beyond the random expectation of the population background. The
proportion of simulated values equal to or larger than the empirical χ2 value was again used
to infer statistical significance.
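The simulation of contingency tables can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the analysis: tables with constant row and column totals are generated here by shuffling the group labels across pairs, which is one way of satisfying that constraint, and the function names and toy counts are hypothetical.

```python
import random
from collections import Counter

CATEGORIES = ["PO", "FS", "HS", "U"]
GROUPS = ["within", "between"]

def contingency(relationships, groups):
    """Build the group-by-category table of counts (rows: groups)."""
    counts = Counter(zip(groups, relationships))
    return [[counts[(g, c)] for c in CATEGORIES] for g in GROUPS]

def chi_square(table):
    """Pearson chi-square statistic with expected counts from the margins."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            if expected > 0:  # skip categories that never occur
                stat += (obs - expected) ** 2 / expected
    return stat

def simulate_chi_square(relationships, groups, n_sim=1000, seed=1):
    """Null distribution of chi-square under random assignment to groups."""
    rng = random.Random(seed)
    observed = chi_square(contingency(relationships, groups))
    shuffled = list(groups)
    null = []
    for _ in range(n_sim):
        rng.shuffle(shuffled)  # group sizes and category counts stay fixed
        null.append(chi_square(contingency(relationships, shuffled)))
    p_value = sum(s >= observed for s in null) / n_sim
    return observed, null, p_value

# Toy data (invented for illustration): 20 within- and 40 between-community pairs.
relationships = (["PO"] * 6 + ["FS"] * 5 + ["HS"] * 4 + ["U"] * 5 +
                 ["PO"] * 1 + ["FS"] * 1 + ["HS"] * 3 + ["U"] * 35)
groups = ["within"] * 20 + ["between"] * 40

observed, null, p_value = simulate_chi_square(relationships, groups)
print(round(observed, 2), p_value)
```

Shuffling labels rather than drawing cell counts directly keeps every simulated table consistent with the observed margins without any further bookkeeping.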
In order to see which of the categories (PO, FS, HS, U) actually produce the effect, we
need to examine the nature of variation across the contingency table. The null hypothesis implies
that the expected relative numbers in different columns are the same in every row. The χ2
residuals (the signed square root of the contribution of each cell to the χ2 statistic, i.e.
(observed − expected)/√expected) show where there may be departures from this pattern.
In a standard approach residuals behave approximately like random variables with mean zero and
variance 1. Residuals larger than about 2 (or smaller than -2) can be thought of as making a
significant contribution (compare e.g. Maindonald & Braun 2003). The same idea can be applied
to our randomization approach. The only difference is that expected values are not directly
derived from the original contingency tables, but are calculated as the mean of the simulated
tables for each cell. The ‘simulated residuals’ are almost identical to the standard residuals. As in
the standard approach, the residuals tell us something about the degree of deviation for each cell.
To obtain cell-specific p-values, we simply count the fraction of values from the simulated tables
that are higher than the observed value. If more than 95% of the simulated values are higher than
the observed value, the category can be considered significantly underrepresented; if fewer than
5% are equal to or higher than the observed value, it can be considered significantly overrepresented.
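The cell-wise diagnostics can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the analysis: the function name `cell_diagnostics`, the label-shuffling scheme used to generate tables, and the toy counts are hypothetical. Expected values are taken as the mean of the simulated tables for each cell, as described above.

```python
import math
import random
from collections import Counter

CATEGORIES = ["PO", "FS", "HS", "U"]
GROUPS = ["within", "between"]

def contingency(relationships, groups):
    """Build the group-by-category table of counts (rows: groups)."""
    counts = Counter(zip(groups, relationships))
    return [[counts[(g, c)] for c in CATEGORIES] for g in GROUPS]

def cell_diagnostics(relationships, groups, n_sim=1000, seed=1):
    """Simulated residual and tail fractions for every cell of the table."""
    rng = random.Random(seed)
    observed = contingency(relationships, groups)
    shuffled = list(groups)
    sims = []
    for _ in range(n_sim):
        rng.shuffle(shuffled)  # margins stay constant
        sims.append(contingency(relationships, shuffled))
    results = {}
    for i, g in enumerate(GROUPS):
        for j, c in enumerate(CATEGORIES):
            cell = [table[i][j] for table in sims]
            expected = sum(cell) / n_sim  # mean of the simulated tables
            obs = observed[i][j]
            if expected > 0:
                residual = (obs - expected) / math.sqrt(expected)
            else:
                residual = float("nan")
            results[(g, c)] = {
                "residual": residual,
                # > 0.95 suggests the category is underrepresented
                "frac_higher": sum(v > obs for v in cell) / n_sim,
                # < 0.05 suggests the category is overrepresented
                "frac_at_least": sum(v >= obs for v in cell) / n_sim,
            }
    return results

# Toy data (invented for illustration): 20 within- and 40 between-community pairs.
relationships = (["PO"] * 6 + ["FS"] * 5 + ["HS"] * 4 + ["U"] * 5 +
                 ["PO"] * 1 + ["FS"] * 1 + ["HS"] * 3 + ["U"] * 35)
groups = ["within"] * 20 + ["between"] * 40

results = cell_diagnostics(relationships, groups)
for key, value in results.items():
    print(key, round(value["residual"], 2),
          value["frac_higher"], value["frac_at_least"])
```

In this toy table the within-community PO cell is expected to show a clearly positive residual and the within-community U cell a clearly negative one, which mirrors how over- and underrepresented categories would be read off in practice.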
Kalinowski, S. T., Wagner, A. P. & Taper, M. L. 2006 ML-RELATE: a computer
program for maximum likelihood estimation of relatedness and relationship. Mol
Ecol Notes 6, 576-579.
Maindonald, J. & Braun, J. 2003 Data Analysis and Graphics Using R. An
Example-based Approach. Cambridge Series in Statistical and Probabilistic Mathematics.
Cambridge: Cambridge University Press.
Manly, B. F. J. 1997 Randomization, Bootstrap and Monte Carlo Methods in Biology.
London: Chapman & Hall.