Reconstructing Kinship Relationships
in Wild Populations
I do not believe that the accident of birth makes people sisters and
brothers. It makes them siblings. Gives them mutuality of parentage.
Maya Angelou
Tanya Berger-Wolf (UIC)
Mary Ashley (UIC)
W. Art Chaovalitwongse (Rutgers)
Bhaskar DasGupta (UIC)
Ashfaq Khokhar (UIC)
Saad Sheikh (Ecole Polytechnique)
Priya Govindan (Rutgers)
Isabel Caballero (UIC)
Chun-An (Joe) Chou (Rutgers)
Alan Perez-Rathke (UIC)
Microsatellites (STR)
Advantages:
Codominant (easy inference of genotypes and allele frequencies)
Many heterozygous alleles per locus
Possible to estimate other population parameters
Cheaper than SNPs
Alleles:
5’ CACACACA
#1 CACACACA
#2 CACACACACACA
#3 CACACACACACACA
Genotypes: 1/1, 2/2, 3/3, 1/2, 1/3, 2/3
But:
Few loci
And:
Large families
Self-mating
…
Diploid Siblings
Each parenthesized pair is a locus; each entry in the pair is an allele.
father (.../...),(a/b),(.../...),(.../...)
mother (.../...),(c/d),(.../...),(.../...)
child  (.../...),(e/f),(.../...),(.../...)
At each locus the child carries one allele from the father and one from the mother.
Siblings: two children with the same parents
Question: given a set of children, find the sibling groups
Why Reconstruct Sibling Relationships?
Used in: conservation biology, animal
management, molecular ecology,
genetic epidemiology
Necessary for: estimating heritability
of quantitative characters,
characterizing mating systems and
fitness.
But: parent/offspring pairs are hard to sample.
Sampling cohorts of juveniles is easier.
The Problem
Ind   Locus 1   Locus 2   (each entry is allele 1/allele 2)
1     1/2       1/2
2     1/3       3/4
3     1/4       3/5
4     3/3       7/6
5     1/3       3/4
6     1/3       3/7
7     1/5       8/2
8     1/6       2/2
Sibling groups: {2, 4, 5, 6}, {1, 3}, {7, 8}
Existing Methods
Method | Approach | Error detection | Assumptions
Almudevar & Field (1999, 2003) | Minimal sibling groups under likelihood | No | Minimal sibgroups, representative allele frequencies
KinGroup (2004) | Markov Chain Monte Carlo / ML | No | Allele frequencies etc. are representative
Family Finder (2003) | Partition population using likelihood graphs | No | Allele frequencies etc. are representative
Pedigree (2001) | Markov Chain Monte Carlo / ML | No | Allele frequencies etc. are representative
COLONY (2004) | Simulated annealing / ML | Yes | Monogamy for one sex
Fernandez & Toro (2006) | Simulated annealing / ML | No | Co-ancestry matrix is a good measure; parents can be reconstructed or are available
Inheritance Rules
father  (.../...),(a/b),(.../...),(.../...)
mother  (.../...),(c/d),(.../...),(.../...)
child 1 (.../...),(e1/f1),(.../...),(.../...)
child 2 (.../...),(e2/f2),(.../...),(.../...)
child 3 (.../...),(e3/f3),(.../...),(.../...)
…
child n (.../...),(en/fn),(.../...),(.../...)
4-allele rule: siblings have at most 4 distinct alleles in a locus
2-allele rule: in a locus in a sibling group, a + R ≤ 4, where
a = number of distinct alleles
R = number of alleles that appear with 3 others or are homozygous
Our Approach: Mendelian Constraints
4-allele rule:
siblings have at most 4 different alleles in a locus
Yes: 3/3, 1/3, 1/5, 1/6
No: 3/3, 1/3, 1/5, 1/6, 3/2
2-allele rule: in a locus in a sibling group, a + R ≤ 4, where
a = number of distinct alleles
R = number of alleles that appear with 3 others or are homozygous
Yes: 3/3, 1/3, 1/5
No: 3/3, 1/3, 1/5, 1/6
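The two rules can be checked mechanically for a single locus. The following is a minimal sketch, assuming genotypes are given as pairs of allele labels and that "appears with 3 others" means an allele is paired with at least three different alleles across the group. The function names are illustrative, not taken from the authors' software.

```python
from itertools import chain


def four_allele_ok(genotypes):
    """4-allele rule: the group shows at most 4 distinct alleles at this locus."""
    return len(set(chain.from_iterable(genotypes))) <= 4


def two_allele_ok(genotypes):
    """Condition a + R <= 4 as stated on the slide, for one locus."""
    alleles = set(chain.from_iterable(genotypes))
    partners = {x: set() for x in alleles}  # alleles seen alongside x
    homozygous = set()                      # alleles seen as x/x
    for x, y in genotypes:
        if x == y:
            homozygous.add(x)
        else:
            partners[x].add(y)
            partners[y].add(x)
    a = len(alleles)
    r = sum(1 for x in alleles if x in homozygous or len(partners[x]) >= 3)
    return a + r <= 4


# The slide's own examples.
print(four_allele_ok([(3, 3), (1, 3), (1, 5), (1, 6)]))          # Yes case
print(four_allele_ok([(3, 3), (1, 3), (1, 5), (1, 6), (3, 2)]))  # No case
print(two_allele_ok([(3, 3), (1, 3), (1, 5)]))                   # Yes case
print(two_allele_ok([(3, 3), (1, 3), (1, 5), (1, 6)]))           # No case
```

In the last example allele 3 is homozygous and allele 1 appears with 3, 5 and 6, so a = 4 and R = 2, which violates the bound even though the 4-allele rule is satisfied.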
Our Approach: Sibling Reconstruction
Given: n diploid individuals sampled at l loci
Find: the minimum number of 2-allele sets (sets of individuals that satisfy the 2-allele rule) that contain all individuals
NP-complete even when we know sibsets have size at most 3; 1.0065 approximation gap (Ashley et al. ’09)
ILP formulation (Chaovalitwongse et al. ’07, ’10)
Minimum Set Cover based algorithm with optimal solution, using CPLEX (Berger-Wolf et al. ’07)
Parallel implementation (Sheikh, Khokhar, Berger-Wolf ’10)
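For orientation, the covering objective can be written as the textbook set-cover integer program. This is a sketch under assumptions: S_1, ..., S_m stand for candidate sets of individuals that each satisfy the 2-allele rule at every locus, and x_j is a 0/1 variable selecting S_j. The notation is illustrative and is not claimed to be the formulation in the cited papers.

```latex
\begin{aligned}
\min \quad & \sum_{j=1}^{m} x_j \\
\text{subject to} \quad & \sum_{j \,:\, i \in S_j} x_j \ge 1 && \text{for every individual } i = 1, \dots, n, \\
& x_j \in \{0, 1\} && \text{for } j = 1, \dots, m.
\end{aligned}
```

Each constraint forces individual i into at least one selected set, and the objective counts the selected sets.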
Canonical families
[Figure: canonical families, shown as sets of genotypes over alleles 1 to 4]
ID   alleles   canonical alleles
1    55/43     1/2
2    43/114    2/3
3    43/55     2/1
4    55/114    1/3
5    114/43    3/2
6    55/78     1/4
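The relabeling of the six example genotypes on this slide (55 to 1, 43 to 2, 114 to 3, 78 to 4) can be sketched as numbering alleles in order of first appearance. That ordering is an assumption consistent with the example; the function name is hypothetical.

```python
def canonical_relabel(genotypes):
    """Relabel alleles as 1, 2, 3, ... in order of first appearance."""
    label = {}       # observed allele -> canonical label
    relabeled = []
    for pair in genotypes:
        new_pair = []
        for allele in pair:
            if allele not in label:
                label[allele] = len(label) + 1
            new_pair.append(label[allele])
        relabeled.append(tuple(new_pair))
    return relabeled, label


observed = [(55, 43), (43, 114), (43, 55), (55, 114), (114, 43), (55, 78)]
canonical, mapping = canonical_relabel(observed)
print(canonical)
print(mapping)
```

Working with canonical labels means two groups with the same pattern of shared alleles look identical regardless of the observed allele sizes.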
Aside: Minimum Set Cover
Given: universe U = {1, 2, …, n} and a collection of sets S = {S1, S2, …, Sm}, where each Si ⊆ U
Find: the smallest number of sets in S whose union is the universe U
$\min_{I \subseteq [m]} |I| \quad \text{such that} \quad \bigcup_{i \in I} S_i = U$
Minimum Set Cover is NP-hard
(1 + ln n)-approximable (sharp)
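The (1 + ln n) guarantee is achieved by the standard greedy heuristic, which repeatedly picks the set covering the most still-uncovered elements. The sketch below illustrates that heuristic only; it is not the exact CPLEX-based method used in the talk, and the example sets are made up.

```python
def greedy_set_cover(universe, sets):
    """Return indices of the chosen sets, using the standard greedy heuristic."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the set that covers the most still-uncovered elements
        # (ties go to the lowest index).
        best = max(range(len(sets)), key=lambda j: len(uncovered & set(sets[j])))
        gained = uncovered & set(sets[best])
        if not gained:
            raise ValueError("the sets do not cover the universe")
        chosen.append(best)
        uncovered -= gained
    return chosen


# Hypothetical example: 7 elements, 5 candidate sets.
example_sets = [{1, 2, 3}, {4, 5}, {6, 7}, {3, 4}, {5, 6, 7}]
cover = greedy_set_cover(range(1, 8), example_sets)
print(cover)
```

Here the greedy choice happens to find a cover of size 3, which is optimal for this instance; in general it can be off by the logarithmic factor stated above.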
Are we done?
Challenges
No ground truth available
Growing number of methods
Biologists need (one) reliable reconstruction
Genotyping errors
Answer: Consensus
Consensus is what many people say in chorus
but do not believe as individuals
Abba Eban (1915 - 2002), Israeli diplomat
In "The New Yorker," 23 Apr 1990
Consensus Methods
Combine multiple solutions to a problem to generate one unified solution
Consensus function C : S* → S
Based on Social Choice Theory
Commonly used where the real solution is not known
e.g. Phylogenetic Trees
[Diagram: solutions S1, S2, ..., Sk → Consensus → S]
Error-Tolerant Approach
[Diagram: Locus 1, Locus 2, Locus 3, ..., Locus l → Sibling Reconstruction Algorithm → S1, S2, ..., Sk → Consensus → S]
Sheikh et al. ’08
Distance-based Consensus
Algorithm
– Compute a consensus solution S={g1,...,gk }
– Search for a good solution near S
[Diagram: S1, S2, ..., Sk → Consensus → S → Search → Ss, with edges labeled fd and fq]
NP-hard for any fd, fq, or an arbitrary linear combination (Sheikh et al. ’08)
A Greedy Approach - Algorithm
Compute a strict consensus
While total distance is not too large
Merge two sibgroups with minimal (total) distance
Quality: fq=n-|C|
Distance function from solution C to C’
fd(C,C’) =sum of costs of merging groups in C to obtain C’
=sum of costs of assigning individuals to groups
Cost of assigning individual to a group:
Benefit: Alleles and allele pairs shared
Cost: Minimum Edit Distance
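The merge loop above can be sketched as follows. This is only an outline under assumptions: `merge_cost` is a hypothetical callable standing in for the slide's cost of merging two groups, and `max_total` is a hypothetical budget standing in for "total distance is not too large". The toy costs are invented for illustration.

```python
def greedy_merge(groups, merge_cost, max_total):
    """Merge the cheapest pair of groups until the cost budget would be exceeded."""
    groups = [frozenset(g) for g in groups]
    total = 0.0
    while len(groups) > 1:
        # Cost of every candidate merge, tagged with the indices of the pair.
        pairs = [(merge_cost(a, b), i, j)
                 for i, a in enumerate(groups)
                 for j, b in enumerate(groups) if i < j]
        cost, i, j = min(pairs)
        if total + cost > max_total:
            break
        total += cost
        merged = groups[i] | groups[j]
        groups = [g for k, g in enumerate(groups) if k not in (i, j)] + [merged]
    return groups, total


def toy_cost(a, b):
    """Invented costs: merging {3} with {6,7} is cheap, anything else is expensive."""
    return 0.1 if {a, b} == {frozenset({3}), frozenset({6, 7})} else 1.0


start = [{1, 2}, {3}, {4}, {5}, {6, 7}]
result, spent = greedy_merge(start, toy_cost, max_total=0.5)
print(sorted(sorted(g) for g in result), spent)
```

Each accepted merge reduces the number of groups by one, so the quality fq = n - |C| rises by one per merge while the accumulated distance grows by the merge cost.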
Auto Greedy Consensus
Change costs to average per locus costs
Compare max group error on per locus basis
Treat cost and benefit independently
In order to qualify a merge
Cost <= maxcost
Benefit >= minbenefit
Benefit = max benefit among possible merges
A Greedy Approach
S1 = { {1,2,3},{4,5}, {6,7} }
S2 = { {1,2,3},{4}, {5,6,7} }
S3 = { {1,2},{3,4,5}, {6,7} }
Strict consensus:
S = { {1,2}, {3}, {4}, {5}, {6,7} }
After merging {3} and {6,7}:
S = { {1,2}, {3,6,7}, {4}, {5} }
[Figure: pairwise merge-cost matrices over the groups {1,2}, {3}, {4}, {5}, {6,7} before the merge and {1,2}, {3,6,7}, {4}, {5} after it]
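The strict consensus of S1, S2 and S3 on this slide can be reproduced with a short sketch, assuming that two individuals stay together only if every input solution puts them in the same group and that each individual appears exactly once in each solution. The function name is illustrative.

```python
def strict_consensus(solutions):
    """Group individuals that share a group in every input partition."""
    signature = {}  # individual -> list of group indices, one per solution
    for partition in solutions:
        for g_idx, group in enumerate(partition):
            for ind in group:
                signature.setdefault(ind, []).append(g_idx)
    blocks = {}     # identical signature -> individuals that are always together
    for ind, sig in signature.items():
        blocks.setdefault(tuple(sig), set()).add(ind)
    return sorted(blocks.values(), key=min)


S1 = [{1, 2, 3}, {4, 5}, {6, 7}]
S2 = [{1, 2, 3}, {4}, {5, 6, 7}]
S3 = [{1, 2}, {3, 4, 5}, {6, 7}]
consensus = strict_consensus([S1, S2, S3])
print(consensus)
```

Individuals 1 and 2 share a group in all three solutions, as do 6 and 7; every other individual is separated from its neighbours in at least one solution and ends up alone.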
Testing and Validation: Protocol
1. Get a dataset with known sibgroups (real or simulated)
2. Find sibgroups using our algorithm
3. Compare the solutions
Partition distance (Gusfield ’03) = assignment problem
Compare to other sibship methods: Family Finder, COLONY
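Partition distance can be read as the minimum number of individuals that must be removed so that the two partitions become identical, which is the total number of individuals minus the best one-to-one matching of groups by overlap. The sketch below brute-forces that matching over permutations, so it only suits a handful of groups; a real comparison would use an assignment-problem solver. The function name is hypothetical.

```python
from itertools import permutations


def partition_distance(p, q):
    """Individuals to remove so that partitions p and q coincide (brute force)."""
    p = [set(g) for g in p]
    q = [set(g) for g in q]
    n = sum(len(g) for g in p)
    k = max(len(p), len(q))
    # Pad the shorter partition with empty groups so groups can be matched 1:1.
    p += [set()] * (k - len(p))
    q += [set()] * (k - len(q))
    best = max(sum(len(p[i] & q[j]) for i, j in enumerate(perm))
               for perm in permutations(range(k)))
    return n - best


true_groups = [{1, 2, 3}, {4, 5}, {6, 7}]
found_groups = [{1, 2}, {3, 4, 5}, {6, 7}]
print(partition_distance(true_groups, found_groups))
```

In this example only individual 3 is misplaced, so removing it makes the two partitions agree.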
Test Data
Salmon (Salmo salar), Herbinger et al., 1999: 351 individuals, 6 families, 4 loci. No missing alleles.
Shrimp (Penaeus monodon), Jerry et al., 2006: 59 individuals, 13 families, 7 loci. Some missing alleles.
Ants (Leptothorax acervorum), Hammond et al., 2001: ants are a haplodiploid species. The data consist of 377 diploid worker ants.
Simulated populations of juveniles for a range of values of the number of parents, offspring per parent, alleles per locus, and number of loci, and the distributions of those.
Experimental Protocol
Generate F females and M males (F = M = 5, 10, 20)
Each with l loci (l = 2, 4, 6, 8, 10)
Each locus with a alleles (a = 10, 15)
Generate f families (f = 5, 10, 20)
For each family, select a female and a male uniformly at random
For each parent pair, generate o offspring (o = 5, 10)
For each offspring, for each locus, choose the allele outcome uniformly at random
Introduce random errors
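The protocol above can be sketched as a small simulator. Several details are assumptions, not stated on the slide: parental alleles are drawn uniformly from 1..a, the same parent pair may be drawn for more than one family, and the final error-injection step is left out. Parameter names are illustrative.

```python
import random


def simulate(num_females=5, num_males=5, num_loci=4, num_alleles=10,
             num_families=5, offspring_per_family=5, seed=0):
    """Simulate juveniles with known family labels (no genotyping errors)."""
    rng = random.Random(seed)

    def random_parent():
        # One (allele, allele) pair per locus; uniform alleles are an assumption.
        return [(rng.randint(1, num_alleles), rng.randint(1, num_alleles))
                for _ in range(num_loci)]

    females = [random_parent() for _ in range(num_females)]
    males = [random_parent() for _ in range(num_males)]
    families = []   # (mother, father) per family
    children = []   # (family id, genotype) per offspring
    for fam in range(num_families):
        mother = rng.choice(females)
        father = rng.choice(males)
        families.append((mother, father))
        for _ in range(offspring_per_family):
            # At each locus take one paternal and one maternal allele at random.
            child = [(rng.choice(f_locus), rng.choice(m_locus))
                     for f_locus, m_locus in zip(father, mother)]
            children.append((fam, child))
    return families, children


families, children = simulate()
print(len(children), children[0])
```

The family labels returned alongside the genotypes play the role of the known sibgroups against which a reconstruction can be scored.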
Results
Conclusions
Combinatorial algorithms with minimal assumptions
Behaves well on real and simulated data
Better than others with few loci, few large families
Error tolerant
Useful, high demand
New and improved:
Efficient implementation
Other objectives (bio vs math)
Other genealogical relationships
Different combinatorial approach
Pedigree amalgamation
Perez-Rathke et al. (in submission)
Ashley et al. ’10
Sheikh et al. ’09, ’10
Brown & Berger-Wolf ’10