8sibship - Laboratory for Computational Population Biology

Reconstructing Kinship Relationships
in Wild Populations
I do not believe that the accident of birth makes people sisters and
brothers. It makes them siblings. Gives them mutuality of parentage.
Maya Angelou
Tanya Berger-Wolf (UIC)
Mary Ashley (UIC)
W. Art Chaovalitwongse (Rutgers)
Bhaskar DasGupta (UIC)
Ashfaq Khokhar (UIC)
Saad Sheikh (Ecole Polytechnique)
Priya Govindan (Rutgers)
Isabel Caballero (UIC)
Chun-An (Joe) Chou (Rutgers)
Alan Perez-Rathke (UIC)
Microsatellites (STR)
Advantages:
• Codominant (easy inference of genotypes and allele frequencies)
• Many heterozygous alleles per locus
• Possible to estimate other population parameters
• Cheaper than SNPs
But:
• Few loci
And:
• Large families
• Self-mating
• …
Alleles:
5’ CACACACA
#1 CACACACA
#2 CACACACACACA
#3 CACACACACACACA
Genotypes: 1/1, 2/2, 3/3, 1/2, 1/3, 2/3
Diploid Siblings
Each letter is an allele; each parenthesized pair is a locus.
father (.../...),(a /b ),(.../...),(.../...)
mother (.../...),(c /d ),(.../...),(.../...)
child (.../...),(e /f ),(.../...),(.../...)
The child's two alleles at a locus: one from father, one from mother
Siblings: two children with the same parents
Question: given a set of children, find the sibling groups
Why Reconstruct Sibling Relationships?
• Used in: conservation biology, animal management, molecular ecology, genetic epidemiology
• Necessary for: estimating heritability of quantitative characters, characterizing mating systems and fitness
• But: hard to sample parent/offspring pairs. Sampling cohorts of juveniles is easier
The Problem
Genotypes (allele 1/allele 2):
Ind   Locus 1   Locus 2
1     1/2       1/2
2     1/3       3/4
3     1/4       3/5
4     3/3       7/6
5     1/3       3/4
6     1/3       3/7
7     1/5       8/2
8     1/6       2/2
Sibling Groups:
2, 4, 5, 6
1, 3
7, 8
Existing Methods
Method | Approach | Error Detection | Assumptions
Almudevar & Field (1999, 2003) | Minimal sibling groups under likelihood | No | Minimal sibgroups, representative allele frequencies
KinGroup (2004) | Markov Chain Monte Carlo/ML | No | Allele frequencies etc. are representative
Family Finder (2003) | Partition population using likelihood graphs | No | Allele frequencies etc. are representative
Pedigree (2001) | Markov Chain Monte Carlo/ML | No | Allele frequencies etc. are representative
COLONY (2004) | Simulated Annealing/ML | Yes | Monogamy for one sex
Fernandez & Toro (2006) | Simulated Annealing/ML | No | Co-ancestry matrix is a good measure, parents can be reconstructed or are available
Inheritance Rules
father (.../...),(a /b ),(.../...),(.../...)
(.../...),(c /d ),(.../...),(.../...) mother
child 1 (.../...),(e1 /f1 ),(.../...),(.../...)
child 2 (.../...),(e2 /f2 ),(.../...),(.../...)
child 3 (.../...),(e3 /f3 ),(.../...),(.../...)
…
child n (.../...),(en/fn ),(.../...),(.../...)
4-allele rule: siblings have at most 4 distinct alleles in a locus
2-allele rule: in a locus in a sibling group, a + R ≤ 4, where
a = number of distinct alleles
R = number of alleles that appear with 3 others or are homozygous
Our Approach: Mendelian Constraints
4-allele rule:
siblings have at most 4 different alleles in a locus
Yes: 3/3, 1/3, 1/5, 1/6
No: 3/3, 1/3, 1/5, 1/6, 3/2
2-allele rule: in a locus in a sibling group, a + R ≤ 4, where
a = number of distinct alleles
R = number of alleles that appear with 3 others or are homozygous
Yes: 3/3, 1/3, 1/5
No: 3/3, 1/3, 1/5, 1/6
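The two rules can be checked mechanically for one locus at a time. The sketch below is only an illustration: the function names are hypothetical, and it assumes that "appears with 3 others" means an allele is paired with at least three distinct other alleles among the group's genotypes. It tests the a + R ≤ 4 condition exactly as stated on the slide, nothing stronger.

```python
from itertools import chain

def distinct_alleles(genotypes):
    """Set of alleles seen at one locus across a group of genotypes."""
    return set(chain.from_iterable(genotypes))

def four_allele_ok(genotypes):
    """4-allele rule: at most 4 distinct alleles at the locus."""
    return len(distinct_alleles(genotypes)) <= 4

def a_plus_r(genotypes):
    """Return a + R for one locus (assumed reading of the slide's definition)."""
    alleles = distinct_alleles(genotypes)
    partners = {x: set() for x in alleles}
    homozygous = set()
    for x, y in genotypes:
        if x == y:
            homozygous.add(x)
        else:
            partners[x].add(y)
            partners[y].add(x)
    # R counts alleles that are homozygous somewhere or have >= 3 distinct partners
    r = sum(1 for x in alleles if x in homozygous or len(partners[x]) >= 3)
    return len(alleles) + r

def two_allele_ok(genotypes):
    """2-allele condition as stated on the slide: a + R <= 4."""
    return a_plus_r(genotypes) <= 4

yes_group = [(3, 3), (1, 3), (1, 5)]
no_group = [(3, 3), (1, 3), (1, 5), (1, 6)]
print(four_allele_ok(no_group), two_allele_ok(yes_group), two_allele_ok(no_group))
```

On the slide's examples, the group 3/3, 1/3, 1/5, 1/6 passes the 4-allele rule but fails a + R ≤ 4, which is why the second rule is the stricter filter.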
Our Approach: Sibling Reconstruction
Given: n diploid individuals sampled at l loci
Find: Minimum number of 2-allele sets that contain all individuals
• NP-complete even when we know sibsets are of size at most 3; 1.0065 approximation gap (Ashley et al. ’09)
• ILP formulation (Chaovalitwongse et al. ’07, ’10)
• Minimum Set Cover based algorithm with optimal solution (using CPLEX) (Berger-Wolf et al. ’07)
• Parallel implementation (Sheikh, Khokhar, BW ’10)
Canonical families
[Figure: canonical families, shown as sets of genotypes over the canonical alleles 1 to 4]
ID   Alleles   Canonical
1    55/43     1/2
2    43/114    2/3
3    43/55     2/1
4    55/114    1/3
5    114/43    3/2
6    55/78     1/4
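The relabeling in the ID table above (55/43 becoming 1/2, 43/114 becoming 2/3, and so on) is consistent with numbering alleles in order of first appearance. The sketch below assumes that convention; the function name `canonical_labels` is hypothetical and not taken from the authors' software.

```python
def canonical_labels(genotypes):
    """Relabel alleles at one locus as 1, 2, 3, ... in order of first appearance.

    Returns the relabeled genotypes and the allele -> label mapping.
    Assumption: first-appearance order is the intended canonical numbering.
    """
    mapping = {}
    relabeled = []
    for genotype in genotypes:
        pair = []
        for allele in genotype:
            if allele not in mapping:
                mapping[allele] = len(mapping) + 1
            pair.append(mapping[allele])
        relabeled.append(tuple(pair))
    return relabeled, mapping

observed = [(55, 43), (43, 114), (43, 55), (55, 114), (114, 43), (55, 78)]
relabeled, mapping = canonical_labels(observed)
print(relabeled)
print(mapping)
```

Working with canonical labels means that groups which differ only in the names of their alleles can be compared against the same small catalogue of families.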
Aside: Minimum Set Cover
Given: universe U = {1, 2, …, n}
collection of sets S = {S1, S2, …, Sm}, where each Si is a subset of U
Find: the smallest number of sets in S whose union is the universe U:
$\min_{I \subseteq [m]} |I|$ such that $\bigcup_{i \in I} S_i = U$
Minimum Set Cover is NP-hard
(1+ln n)-approximable (sharp)
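The (1+ln n) bound is achieved by the classic greedy heuristic, which repeatedly picks the set covering the most still-uncovered elements. The sketch below shows that heuristic on a toy instance. It is not the optimal CPLEX-based solver mentioned in the slides, and the function name and example data are made up for illustration.

```python
def greedy_set_cover(universe, sets):
    """Greedy set cover: return indices of chosen sets, in order of selection.

    `sets` is a list of Python sets. Raises ValueError if no cover exists.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # pick the set that covers the most uncovered elements (ties: lowest index)
        best = max(range(len(sets)), key=lambda i: len(uncovered & sets[i]))
        if not uncovered & sets[best]:
            raise ValueError("the sets do not cover the universe")
        chosen.append(best)
        uncovered -= sets[best]
    return chosen

toy_sets = [{1, 2, 3}, {2, 4}, {3, 4}, {4, 5}]
print(greedy_set_cover(range(1, 6), toy_sets))
```

In the sibling setting the universe would be the sampled individuals and the candidate sets would be groups satisfying the 2-allele rule; generating those candidate sets is a separate step not shown here.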
Are we done?
Challenges
• No ground truth available
• Growing number of methods
• Biologists need (one) reliable reconstruction
• Genotyping errors
Answer: Consensus
Consensus is what many people say in chorus
but do not believe as individuals
Abba Eban (1915 - 2002), Israeli diplomat
In "The New Yorker," 23 Apr 1990
Consensus Methods
Combine multiple solutions to a problem to generate one unified solution
C : S* → S
• Based on Social Choice Theory
• Commonly used where the real solution is not known, e.g. Phylogenetic Trees
S1, S2, ..., Sk → Consensus → S
Error-Tolerant Approach (Sheikh et al. ’08)
Locus 1, Locus 2, Locus 3, ..., Locus l → Sibling Reconstruction Algorithm → S1, S2, ..., Sk → Consensus → S
Distance-based Consensus
Algorithm
– Compute a consensus solution S={g1,...,gk}
– Search for a good solution near S
[Diagram: S1, S2, ..., Sk → Consensus → S → Search → Ss, with the distance function fd and the quality function fq labelling the steps]
NP-hard for any fd, fq or an arbitrary linear combination (Sheikh et al. ’08)
A Greedy Approach - Algorithm
• Compute a strict consensus
• While total distance is not too large
  • Merge two sibgroups with minimal (total) distance
• Quality: fq = n - |C|
• Distance function from solution C to C’:
  fd(C,C’) = sum of costs of merging groups in C to obtain C’
           = sum of costs of assigning individuals to groups
Cost of assigning individual to a group:
  Benefit: Alleles and allele pairs shared
  Cost: Minimum Edit Distance
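The merge loop above can be sketched generically. In the code below, `merge_cost` and `max_total` are hypothetical stand-ins: the slides define the real cost through allele sharing and edit distance, which are not reproduced here. The toy cost in the usage lines is a placeholder with no biological meaning.

```python
from itertools import combinations

def greedy_merge(groups, merge_cost, max_total):
    """Merge the cheapest pair of groups until the cost budget would be exceeded.

    groups     : iterable of disjoint collections of individuals
    merge_cost : function (group_a, group_b) -> float (assumed, user-supplied)
    max_total  : budget on the summed merge costs (the "not too large" threshold)
    """
    groups = [frozenset(g) for g in groups]
    total = 0.0
    while len(groups) > 1:
        a, b = min(combinations(groups, 2), key=lambda pair: merge_cost(*pair))
        cost = merge_cost(a, b)
        if total + cost > max_total:
            break  # the next cheapest merge would exceed the budget
        total += cost
        groups = [g for g in groups if g not in (a, b)] + [a | b]
    return groups, total

# Placeholder cost: distance between the smallest IDs of the two groups.
toy_cost = lambda a, b: abs(min(a) - min(b))
merged, spent = greedy_merge([{1, 2}, {3}, {4}, {5}, {6, 7}], toy_cost, max_total=1)
print(sorted(sorted(g) for g in merged), spent)
```

Each merge reduces the number of groups |C| by one, so it raises the quality fq = n - |C| by one at the price of the merge cost added to fd.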
Auto Greedy Consensus
• Change costs to average per locus costs
• Compare max group error on per locus basis
• Treat cost and benefit independently
• In order to qualify a merge:
  • Cost <= maxcost
  • Benefit >= minbenefit
  • Benefit = max benefit among possible merges
A Greedy Approach
S1 = { {1,2,3},{4,5}, {6,7} }
S2 = { {1,2,3},{4}, {5,6,7} }
S3 = { {1,2},{3,4,5}, {6,7} }
Strict consensus:
S = { {1,2}, {3}, {4}, {5}, {6,7} }
After merging {3} and {6,7}:
S = { {1,2}, {3,6,7}, {4}, {5} }
[Figure: matrices of numeric values indexed by pairs of sibgroups, shown before and after the merge]
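The strict consensus of S1, S2 and S3 keeps two individuals together only if every input solution puts them in the same group. A minimal sketch, assuming all solutions partition the same set of individuals (the function name is illustrative):

```python
def strict_consensus(solutions):
    """Group individuals that share a group in every input solution.

    Each solution is a list of groups (iterables of individual IDs).
    Assumption: every solution partitions the same individuals.
    """
    signature = {}
    for solution in solutions:
        for group_id, group in enumerate(solution):
            for individual in group:
                # record which group this individual belongs to in each solution
                signature.setdefault(individual, []).append(group_id)
    groups = {}
    for individual, sig in signature.items():
        groups.setdefault(tuple(sig), set()).add(individual)
    return sorted(sorted(g) for g in groups.values())

S1 = [{1, 2, 3}, {4, 5}, {6, 7}]
S2 = [{1, 2, 3}, {4}, {5, 6, 7}]
S3 = [{1, 2}, {3, 4, 5}, {6, 7}]
print(strict_consensus([S1, S2, S3]))
```

Individuals with identical group signatures across all solutions end up together, which reproduces the groups {1,2}, {3}, {4}, {5}, {6,7} from the example.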
Testing and Validation: Protocol
1. Get a dataset with known sibgroups (real or simulated)
2. Find sibgroups using our algorithm
3. Compare the solutions
• Partition distance, Gusfield ’03 = assignment problem
4. Compare to other sibship methods
• Family Finder, COLONY
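Partition distance is the minimum number of individuals that must be removed so that two partitions become identical. It equals n minus the best one-to-one matching of groups by overlap, which is why it reduces to an assignment problem. The sketch below brute-forces the matching over permutations, which is only practical for a handful of groups; a Hungarian-algorithm solver would replace that step for real data. The function name is illustrative.

```python
from itertools import permutations

def partition_distance(p, q):
    """Minimum number of individuals to remove so that partitions p and q agree.

    Brute-force assignment over group permutations (small inputs only).
    Assumption: p and q partition the same set of individuals.
    """
    p = [set(g) for g in p]
    q = [set(g) for g in q]
    n = sum(len(g) for g in p)
    # pad the shorter partition with empty groups so the matching is square
    k = max(len(p), len(q))
    p += [set() for _ in range(k - len(p))]
    q += [set() for _ in range(k - len(q))]
    best_overlap = max(
        sum(len(p[i] & q[j]) for i, j in enumerate(perm))
        for perm in permutations(range(k))
    )
    return n - best_overlap

truth = [{1, 2, 3}, {4, 5}, {6, 7}]
guess = [{1, 2, 3}, {4}, {5, 6, 7}]
print(partition_distance(truth, guess))
```

In this example only individual 5 is placed differently, so removing it makes the two partitions identical.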
Test Data
• Salmon (Salmo salar) - Herbinger et al., 1999
351 individuals, 6 families, 4 loci. No missing alleles
• Shrimp (Penaeus monodon) - Jerry et al., 2006
59 individuals, 13 families, 7 loci. Some missing alleles
• Ants (Leptothorax acervorum) - Hammond et al., 2001
Ants are a haplodiploid species. The data consist of 377 diploid worker ants
• Simulated populations of juveniles for a range of values of the number of parents, offspring per parent, alleles per locus, and number of loci, and the distributions of those.
Experimental Protocol
Generate F females and M males (F=M=5, 10, 20)
Each with l loci (l=2, 4, 6, 8, 10)
Each locus with a alleles (a=10, 15)
Generate f families (f=5, 10, 20)
For each family select female+male uniformly at random
For each parent pair generate o offspring (o=5, 10)
For each offspring for each locus choose allele outcome uniformly at random
Introduce random errors
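The generation steps can be sketched as a small simulator. Several details below are assumptions rather than the authors' protocol: parent alleles are drawn uniformly from 1..a, the function and parameter names are made up, and the final error-introduction step is omitted because the slides do not specify the error model.

```python
import random

def simulate(num_parents, num_loci, num_alleles, num_families, num_offspring, seed=0):
    """Simulate families of diploid offspring under Mendelian inheritance.

    Assumptions: uniform parent allele frequencies; no genotyping errors added.
    """
    rng = random.Random(seed)

    def make_parent():
        # one (allele, allele) pair per locus
        return [(rng.randint(1, num_alleles), rng.randint(1, num_alleles))
                for _ in range(num_loci)]

    females = [make_parent() for _ in range(num_parents)]
    males = [make_parent() for _ in range(num_parents)]
    families = []
    for _ in range(num_families):
        mother = rng.choice(females)
        father = rng.choice(males)
        # each child inherits one allele from each parent at every locus
        children = [
            [(rng.choice(father[k]), rng.choice(mother[k])) for k in range(num_loci)]
            for _ in range(num_offspring)
        ]
        families.append({"father": father, "mother": mother, "children": children})
    return families

families = simulate(num_parents=5, num_loci=4, num_alleles=10,
                    num_families=5, num_offspring=5)
print(len(families), len(families[0]["children"]))
```

Because the parent pair is drawn independently for each family, the same pair can be selected more than once; a stricter simulator might merge such families before scoring a reconstruction.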
Results
Conclusions
• Combinatorial algorithms with minimal assumptions
• Behave well on real and simulated data
• Better than others with few loci, few large families
• Error tolerant
• Useful, high demand
New and improved:
• Efficient implementation
• Other objectives (bio vs math)
• Other genealogical relationships
• Different combinatorial approach
• Pedigree amalgamation
References: Perez-Rathke et al. (in submission); Ashley et al. ’10; Sheikh et al. ’09, ’10; Brown & B-W ’10