Reconstructing Kinship Relationships in Wild Populations

"I do not believe that the accident of birth makes people sisters and brothers. It makes them siblings. Gives them mutuality of parentage."
Maya Angelou

Tanya Berger-Wolf (UIC)
Mary Ashley (UIC)
W. Art Chaovalitwongse (Rutgers)
Bhaskar DasGupta (UIC)
Ashfaq Khokhar (UIC)
Saad Sheikh (Ecole Polytechnique)
Priya Govindan (Rutgers)
Isabel Caballero (UIC)
Chun-An (Joe) Chou (Rutgers)
Alan Perez-Rathke (UIC)

Microsatellites (STR)
Advantages:
• Codominant (easy inference of genotypes and allele frequencies)
• Many heterozygous alleles per locus
• Possible to estimate other population parameters
• Cheaper than SNPs
Alleles:
  5’ CACACACA
  #1 CACACACA
  #2 CACACACACACA
  #3 CACACACACACACA
Genotypes: 1/1 2/2 3/3 1/2 1/3 2/3
But: Few loci
And: Large families, self-mating, …

Diploid Siblings
  father  (.../...),(a/b),(.../...),(.../...)
  mother  (.../...),(c/d),(.../...),(.../...)
  child   (.../...),(e/f),(.../...),(.../...)
Each parenthesized pair is a locus; each entry in a pair is an allele.
The child receives one allele from the father and one from the mother.
Siblings: two children with the same parents
Question: given a set of children, find the sibling groups

Why Reconstruct Sibling Relationships?
• Used in: conservation biology, animal management, molecular ecology, genetic epidemiology
• Necessary for: estimating heritability of quantitative characters, characterizing mating systems and fitness
• But: it is hard to sample parent/offspring pairs. Sampling cohorts of juveniles is easier

The Problem
  Ind   Locus 1   Locus 2   (allele 1/allele 2)
  1     1/2       1/2
  2     1/3       3/4
  3     1/4       3/5
  4     3/3       7/6
  5     1/3       3/4
  6     1/3       3/7
  7     1/5       8/2
  8     1/6       2/2
Sibling Groups: {2, 4, 5, 6}, {1, 3}, {7, 8}

Existing Methods
  Method                          Approach                                      Error detection  Assumptions
  Almudevar & Field (1999, 2003)  Minimal sibling groups under likelihood       No               Minimal sibgroups, representative allele frequencies
  KinGroup (2004)                 Markov Chain Monte Carlo/ML                   No               Allele frequencies etc. are representative
  Family Finder (2003)            Partition population using likelihood graphs  No               Allele frequencies etc. are representative
  Pedigree (2001)                 Markov Chain Monte Carlo/ML                   No               Allele frequencies etc. are representative
  COLONY (2004)                   Simulated Annealing/ML                        Yes              Monogamy for one sex
  Fernandez & Toro (2006)         Simulated Annealing/ML                        No               Co-ancestry matrix is a good measure; parents can be reconstructed or are available

Inheritance Rules
  father   (.../...),(a/b),(.../...),(.../...)
  mother   (.../...),(c/d),(.../...),(.../...)
  child 1  (.../...),(e1/f1),(.../...),(.../...)
  child 2  (.../...),(e2/f2),(.../...),(.../...)
  child 3  (.../...),(e3/f3),(.../...),(.../...)
  …
  child n  (.../...),(en/fn),(.../...),(.../...)
• 4-allele rule: siblings have at most 4 distinct alleles in a locus
• 2-allele rule: in a locus in a sibling group, a + R ≤ 4, where
  a = number of distinct alleles
  R = number of alleles that appear with 3 others or are homozygous

Our Approach: Mendelian Constraints
• 4-allele rule: siblings have at most 4 different alleles in a locus
  Yes: 3/3, 1/3, 1/5, 1/6
  No: 3/3, 1/3, 1/5, 1/6, 3/2
• 2-allele rule: in a locus in a sibling group, a + R ≤ 4, where
  a = number of distinct alleles
  R = number of alleles that appear with 3 others or are homozygous
  Yes: 3/3, 1/3, 1/5
  No: 3/3, 1/3, 1/5, 1/6

Our Approach: Sibling Reconstruction
Given: n diploid individuals sampled at l loci
Find: the minimum number of 2-allele sets that contain all individuals
• NP-complete even when we know sibsets are of size at most 3; 1.0065 approximation gap (Ashley et al. ’09)
• ILP formulation (Chaovalitwongse et al. ’07, ’10)
• Minimum Set Cover based algorithm with optimal solution, using CPLEX (Berger-Wolf et al. ’07)
• Parallel implementation (Sheikh, Khokhar, BW ’10)

Canonical families
[Figure: canonical families, shown as sets of genotypes over alleles 1 to 4]
  ID   alleles   renumbered
  1    55/43     1/2
  2    43/114    2/3
  3    43/55     2/1
  4    55/114    1/3
  5    114/43    3/2
  6    55/78     1/4

Aside: Minimum Set Cover
Given: universe U = {1, 2, …, n} and a collection of sets S = {S1, S2, …, Sm}, where each Si is a subset of U
Find: the smallest number of sets in S whose union is the universe U, that is,
  $\min_{I \subseteq [m]} |I|$ such that $\bigcup_{i \in I} S_i = U$
• Minimum Set Cover is NP-hard
• (1 + ln n)-approximable (sharp)

Are we done?
Challenges:
• No ground truth available
• Growing number of methods
• Biologists need (one) reliable reconstruction
• Genotyping errors
Answer: Consensus

"Consensus is what many people say in chorus but do not believe as individuals"
Abba Eban (1915 - 2002), Israeli diplomat. In "The New Yorker," 23 Apr 1990

Consensus Methods
• Combine multiple solutions to a problem to generate one unified solution: $C: S^* \to S$
• Based on Social Choice Theory
• Commonly used where the real solution is not known, e.g. Phylogenetic Trees
[Diagram: S1, S2, ..., Sk → Consensus → S]

Error-Tolerant Approach (Sheikh et al. ’08)
[Diagram: Locus 1, Locus 2, Locus 3, ..., Locus l → Sibling Reconstruction Algorithm → S1, S2, ..., Sk → Consensus → S]

Distance-based Consensus
Algorithm:
• Compute a consensus solution S = {g1, ..., gk}
• Search for a good solution near S
[Diagram: S1, S2, ..., Sk → Consensus → S → Search → Ss, labeled with distance fd and quality fq]
NP-hard for any fd, fq or an arbitrary linear combination (Sheikh et al. ’08)

A Greedy Approach - Algorithm
• Compute a strict consensus
• While total distance is not too large: merge two sibgroups with minimal (total) distance
Quality: fq = n - |C|
Distance function from solution C to C’:
  fd(C, C’) = sum of costs of merging groups in C to obtain C’
            = sum of costs of assigning individuals to groups
Cost of assigning an individual to a group:
• Benefit: alleles and allele pairs shared
• Cost: minimum edit distance

Auto Greedy Consensus
• Change costs to average per-locus costs
• Compare max group error on a per-locus basis
• Treat cost and benefit independently
• In order to qualify, a merge must satisfy:
  Cost <= maxcost
  Benefit >= minbenefit
  Benefit = max benefit among possible merges

A Greedy Approach
S1 = { {1,2,3}, {4,5}, {6,7} }
S2 = { {1,2,3}, {4}, {5,6,7} }
S3 = { {1,2}, {3,4,5}, {6,7} }
Strict Consensus: S = { {1,2}, {3}, {4}, {5}, {6,7} }
After merging {3} with {6,7}: S = { {1,2}, {3,6,7}, {4}, {5} }
[Tables: pairwise distances between the groups of S, before and after the merge]

Testing and Validation: Protocol
1. Get a dataset with known sibgroups (real or simulated)
2. Find sibgroups using our algorithm
3. Compare the solutions: partition distance (Gusfield ’03) = assignment problem
Compare to other sibship methods: Family Finder, COLONY

Test Data
• Salmon (Salmo salar), Herbinger et al., 1999: 351 individuals, 6 families, 4 loci. No missing alleles
• Shrimp (Penaeus monodon), Jerry et al., 2006: 59 individuals, 13 families, 7 loci. Some missing alleles
• Ants (Leptothorax acervorum), Hammond et al., 2001: ants are a haplodiploid species. The data consists of 377 diploid worker ants
• Simulated populations of juveniles for a range of values of number of parents, offspring per parent, alleles per locus, number of loci, and the distributions of those.
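The strict consensus step of the greedy approach above can be illustrated with a short Python sketch. It assumes each solution is a partition given as a list of sets, that every individual appears in exactly one group per solution, and that two individuals stay together only when every input solution groups them together. The function name `strict_consensus` is hypothetical and is not taken from the authors' software.

```python
from collections import defaultdict


def strict_consensus(solutions):
    """Group individuals that share a group in every input solution.

    solutions: list of partitions, each a list of sets of individual IDs.
    Returns a sorted list of sorted groups.
    """
    # For each individual, record which group it falls into in each solution.
    signature = defaultdict(list)
    for index, partition in enumerate(solutions):
        for label, group in enumerate(partition):
            for individual in group:
                signature[individual].append((index, label))

    # Individuals with identical signatures are never separated.
    groups = defaultdict(set)
    for individual, sig in signature.items():
        groups[tuple(sig)].add(individual)
    return sorted(sorted(group) for group in groups.values())


# The three solutions from the "A Greedy Approach" slide.
s1 = [{1, 2, 3}, {4, 5}, {6, 7}]
s2 = [{1, 2, 3}, {4}, {5, 6, 7}]
s3 = [{1, 2}, {3, 4, 5}, {6, 7}]

consensus = strict_consensus([s1, s2, s3])
print(consensus)           # [[1, 2], [3], [4], [5], [6, 7]]
print(7 - len(consensus))  # quality fq = n - |C| = 2
```

The greedy merging that follows the strict consensus needs the cost and benefit functions from the slides, which are not specified in enough detail here to sketch.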
Experimental Protocol
• Generate F females and M males (F = M = 5, 10, 20)
• Each with l loci (l = 2, 4, 6, 8, 10)
• Each locus with a alleles (a = 10, 15)
• Generate f families (f = 5, 10, 20)
• For each family, select a female and a male uniformly at random
• For each parent pair, generate o offspring (o = 5, 10)
• For each offspring, for each locus, choose the allele outcome uniformly at random
• Introduce random errors

Results

Conclusions
• Combinatorial algorithms with minimal assumptions
• Behave well on real and simulated data
• Better than others with few loci, few large families
• Error tolerant
• Useful, high demand
New and improved:
• Efficient implementation
• Other objectives (bio vs math)
• Other genealogical relationships
• Different combinatorial approach
• Pedigree amalgamation
Perez-Rathke et al. (in submission); Ashley et al. ’10; Sheikh et al. ’09, ’10; Brown & B-W ’10
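The simulation steps of the Experimental Protocol slide, together with the 2-allele rule from the Mendelian constraints, can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' generator: parental alleles are drawn uniformly at random, the error-introduction step is omitted, and "appear with 3 others" is read as "paired with at least three distinct other alleles". The names `simulate_families` and `satisfies_two_allele_rule` are hypothetical.

```python
import random


def simulate_families(num_parents=5, num_loci=4, num_alleles=10,
                      num_families=5, offspring_per_family=5, seed=0):
    """Generate families of offspring genotypes (no genotyping errors added)."""
    rng = random.Random(seed)

    def make_parent():
        # Assumption: each parental allele is drawn uniformly from 1..num_alleles.
        return [(rng.randint(1, num_alleles), rng.randint(1, num_alleles))
                for _ in range(num_loci)]

    females = [make_parent() for _ in range(num_parents)]
    males = [make_parent() for _ in range(num_parents)]
    families = []
    for _ in range(num_families):
        mother, father = rng.choice(females), rng.choice(males)
        # At each locus a child gets one allele from each parent, chosen uniformly.
        children = [[(rng.choice(f), rng.choice(m)) for f, m in zip(father, mother)]
                    for _ in range(offspring_per_family)]
        families.append(children)
    return families


def satisfies_two_allele_rule(genotypes):
    """Check a + R <= 4 for the genotypes of one candidate sibling group at one locus."""
    alleles = {x for genotype in genotypes for x in genotype}
    partners = {x: set() for x in alleles}
    homozygous = set()
    for x, y in genotypes:
        if x == y:
            homozygous.add(x)
        else:
            partners[x].add(y)
            partners[y].add(x)
    # R counts alleles that are homozygous or paired with at least 3 other alleles.
    r = sum(1 for x in alleles if x in homozygous or len(partners[x]) >= 3)
    return len(alleles) + r <= 4


# The Yes/No examples from the "Mendelian Constraints" slide.
print(satisfies_two_allele_rule([(3, 3), (1, 3), (1, 5)]))          # True
print(satisfies_two_allele_rule([(3, 3), (1, 3), (1, 5), (1, 6)]))  # False
```

Because every simulated family comes from a single parent pair, each family should pass `satisfies_two_allele_rule` at every locus until errors are introduced, which gives a quick sanity check on the generator.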