8sibship - Laboratory for Computational Population Biology

Reconstructing Kinship Relationships in Wild Populations

"I do not believe that the accident of birth makes people sisters and brothers. It makes them siblings. Gives them mutuality of parentage."
Maya Angelou

Tanya Berger-Wolf (UIC), Mary Ashley (UIC), W. Art Chaovalitwongse (Rutgers), Bhaskar DasGupta (UIC), Ashfaq Khokhar (UIC), Saad Sheikh (Ecole Polytechnique), Priya Govindan (Rutgers), Isabel Caballero (UIC), Chun-An (Joe) Chou (Rutgers), Alan Perez-Rathke (UIC)

Microsatellites (STR)

Advantages:
• Codominant (easy inference of genotypes and allele frequencies)
• Many heterozygous alleles per locus
• Possible to estimate other population parameters
• Cheaper than SNPs

[Figure: three alleles of a CA-repeat microsatellite that differ in repeat count: #1 CACACACA, #2 CACACACACACA, #3 CACACACACACACA. Possible genotypes: 1/1, 2/2, 3/3, 1/2, 1/3, 2/3]

But:
• Few loci

And:
• Large families
• Self-mating
• …

Diploid Siblings

father  (.../...), (a/b), (.../...), (.../...)
mother  (.../...), (c/d), (.../...), (.../...)
child   (.../...), (e/f), (.../...), (.../...)

Each parenthesized pair is a locus and each entry in a pair is an allele. At every locus the child carries one allele from the father and one from the mother.

Siblings: two children with the same parents
Question: given a set of children, find the sibling groups

Why Reconstruct Sibling Relationships?

• Used in: conservation biology, animal management, molecular ecology, genetic epidemiology
• Necessary for: estimating heritability of quantitative characters, characterizing mating systems and fitness
• But: hard to sample parent/offspring pairs. Sampling cohorts of juveniles is easier

The Problem

Genotypes are written as allele 1/allele 2.

Ind | Locus 1 | Locus 2
1   | 1/2     | 1/2
2   | 1/3     | 3/4
3   | 1/4     | 3/5
4   | 3/3     | 7/6
5   | 1/3     | 3/4
6   | 1/3     | 3/7
7   | 1/5     | 8/2
8   | 1/6     | 2/2

Sibling Groups: {2, 4, 5, 6}, {1, 3}, {7, 8}

Existing Methods

Method | Approach | Error Detection | Assumptions
Almudevar & Field (1999, 2003) | Minimal sibling groups under likelihood | No | Minimal sibgroups, representative allele frequencies
KinGroup (2004) | Markov Chain Monte Carlo/ML | No | Allele frequencies etc. are representative
Family Finder (2003) | Partition population using likelihood graphs | No | Allele frequencies etc. are representative
Pedigree (2001) | Markov Chain Monte Carlo/ML | No | Allele frequencies etc. are representative
COLONY (2004) | Simulated Annealing/ML | Yes | Monogamy for one sex
Fernandez & Toro (2006) | Simulated Annealing/ML | No | Co-ancestry matrix is a good measure, parents can be reconstructed or are available

Inheritance Rules

father   (.../...), (a/b), (.../...), (.../...)
mother   (.../...), (c/d), (.../...), (.../...)
child 1  (.../...), (e1/f1), (.../...), (.../...)
child 2  (.../...), (e2/f2), (.../...), (.../...)
child 3  (.../...), (e3/f3), (.../...), (.../...)
…
child n  (.../...), (en/fn), (.../...), (.../...)

4-allele rule: siblings have at most 4 distinct alleles in a locus
2-allele rule: in a locus in a sibling group, a + R ≤ 4, where
• a = number of distinct alleles
• R = number of alleles that appear with 3 others or are homozygous

Our Approach: Mendelian Constraints

4-allele rule: siblings have at most 4 different alleles in a locus
• Yes: 3/3, 1/3, 1/5, 1/6
• No: 3/3, 1/3, 1/5, 1/6, 3/2

2-allele rule: in a locus in a sibling group, a + R ≤ 4 (a = number of distinct alleles, R = number of alleles that appear with 3 others or are homozygous)
• Yes: 3/3, 1/3, 1/5
• No: 3/3, 1/3, 1/5, 1/6

Our Approach: Sibling Reconstruction

Given: n diploid individuals sampled at l loci
Find: minimum number of 2-allele sets that contain all individuals

• NP-complete even when we know sibsets are at most 3; 1.0065 approximation gap (Ashley et al. ’09)
• ILP formulation (Chaovalitwongse et al. ’07, ’10)
• Minimum Set Cover based algorithm with optimal solution, using CPLEX (Berger-Wolf et al. ’07)
• Parallel implementation (Sheikh, Khokhar, BW ’10)

Canonical families

[Figure: canonical families, drawn as sets of offspring genotypes over alleles 1 to 4]

Alleles are relabeled in order of first appearance (55 → 1, 43 → 2, 114 → 3, 78 → 4):

ID | alleles | relabeled
1  | 55/43   | 1/2
2  | 43/114  | 2/3
3  | 43/55   | 2/1
4  | 55/114  | 1/3
5  | 114/43  | 3/2
6  | 55/78   | 1/4

Aside: Minimum Set Cover

Given: universe $U = \{1, 2, \ldots, n\}$ and a collection of sets $S = \{S_1, S_2, \ldots, S_m\}$ where $S_i \subseteq U$
Find: the smallest number of sets in S whose union is the universe U, that is,

$\min_{I \subseteq [m]} |I|$ such that $\bigcup_{i \in I} S_i = U$

Minimum Set Cover is NP-hard and $(1 + \ln n)$-approximable (sharp)

Are we done?

Challenges:
• No ground truth available
• Growing number of methods
• Biologists need (one) reliable reconstruction
• Genotyping errors

Answer: Consensus

"Consensus is what many people say in chorus but do not believe as individuals"
Abba Eban (1915 - 2002), Israeli diplomat, in "The New Yorker," 23 Apr 1990

Consensus Methods

Combine multiple solutions to a problem to generate one unified solution: $C: S^* \to S$
• Based on Social Choice Theory
• Commonly used where the real solution is not known, e.g. phylogenetic trees

[Figure: solutions S1, S2, ..., Sk feed into Consensus, which outputs S]

Error-Tolerant Approach

[Figure: Locus 1, Locus 2, Locus 3, ..., Locus l each go through the Sibling Reconstruction Algorithm, giving solutions S1, S2, ..., Sk; these feed into Consensus, which outputs S]

Sheikh et al. ’08

Distance-based Consensus

Algorithm:
• Compute a consensus solution S = {g1, ..., gk}
• Search for a good solution near S

[Figure: S1, S2, ..., Sk feed into Consensus, giving S; a search guided by the distance function fd and the quality function fq then yields Ss]

NP-hard for any fd, fq or an arbitrary linear combination (Sheikh et al. ’08)

A Greedy Approach - Algorithm

• Compute a strict consensus
• While total distance is not too large:
  • Merge two sibgroups with minimal (total) distance

Quality: fq = n - |C|
Distance function from solution C to C’:
fd(C, C’) = sum of costs of merging groups in C to obtain C’
          = sum of costs of assigning individuals to groups

Cost of assigning an individual to a group:
• Benefit: alleles and allele pairs shared
• Cost: minimum edit distance

Auto Greedy Consensus

• Change costs to average per-locus costs
• Compare max group error on a per-locus basis
• Treat cost and benefit independently
• In order to qualify a merge:
  • Cost <= maxcost
  • Benefit >= minbenefit
  • Benefit = max benefit among possible merges

A Greedy Approach

S1 = { {1,2,3}, {4,5}, {6,7} }
S2 = { {1,2,3}, {4}, {5,6,7} }
S3 = { {1,2}, {3,4,5}, {6,7} }

Strict Consensus: S = { {1,2}, {3}, {4}, {5}, {6,7} }
After merging {3} and {6,7}: S = { {1,2}, {3,6,7}, {4}, {5} }

[Figure: tables of pairwise merge costs between sibgroups, before and after the merge]

Testing and Validation: Protocol

1. Get a dataset with known sibgroups (real or simulated)
2. Find sibgroups using our algorithm
3. Compare the solutions
   • Partition distance, Gusfield ’03 = assignment problem
   • Compare to other sibship methods: Family Finder, COLONY

Test Data

• Salmon (Salmo salar), Herbinger et al., 1999: 351 individuals, 6 families, 4 loci. No missing alleles
• Shrimp (Penaeus monodon), Jerry et al., 2006: 59 individuals, 13 families, 7 loci. Some missing alleles
• Ants (Leptothorax acervorum), Hammond et al., 2001: ants are a haplodiploid species. The data consist of 377 diploid worker ants
• Simulated populations of juveniles for a range of values of number of parents, offspring per parent, alleles per locus, number of loci, and the distributions of those
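
The validation protocol above compares a reconstructed partition with the known one using the partition distance, which the slide notes reduces to an assignment problem. A minimal sketch of that idea follows, assuming the distance is the fewest individuals that must be deleted for the two partitions to coincide. The function name `partition_distance` is hypothetical, and the brute-force matching over permutations stands in for a proper assignment solver, so it is only practical for a handful of groups.

```python
from itertools import permutations


def partition_distance(p, q):
    """Fewest individuals to delete so that partitions p and q become identical.

    Assumes p and q partition the same set of individuals.
    """
    p = [set(group) for group in p]
    q = [set(group) for group in q]
    n = sum(len(group) for group in p)
    # Pad the shorter partition with empty groups so the overlap matrix is square.
    k = max(len(p), len(q))
    p += [set() for _ in range(k - len(p))]
    q += [set() for _ in range(k - len(q))]
    overlap = [[len(a & b) for b in q] for a in p]
    # Assignment problem: match groups one-to-one to maximize retained overlap.
    # Brute force over all matchings; a real solver would be used for large k.
    best = max(
        sum(overlap[i][perm[i]] for i in range(k))
        for perm in permutations(range(k))
    )
    return n - best


# The three solutions from the greedy consensus example.
S1 = [{1, 2, 3}, {4, 5}, {6, 7}]
S2 = [{1, 2, 3}, {4}, {5, 6, 7}]
S3 = [{1, 2}, {3, 4, 5}, {6, 7}]
print(partition_distance(S1, S2))  # 1 (deleting individual 5 makes them equal)
print(partition_distance(S1, S3))  # 1 (deleting individual 3 makes them equal)
```

On the example solutions, S1 and S2 differ only in where individual 5 sits, so a single deletion reconciles them.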
Experimental Protocol

• Generate F females and M males (F = M = 5, 10, 20)
• Each with l loci (l = 2, 4, 6, 8, 10)
• Each locus with a alleles (a = 10, 15)
• Generate f families (f = 5, 10, 20)
• For each family select female + male uniformly at random
• For each parent pair generate o offspring (o = 5, 10)
• For each offspring, for each locus, choose allele outcome uniformly at random
• Introduce random errors

Results

[Figures: result plots, two slides]

Conclusions

• Combinatorial algorithms with minimal assumptions
• Behaves well on real and simulated data
• Better than others with few loci, few large families
• Error tolerant
• Useful, high demand

New and improved:
• Efficient implementation
• Other objectives (bio vs math)
• Other genealogical relationships
• Different combinatorial approach
• Pedigree amalgamation

Cited on this slide: Perez-Rathke et al. (in submission); Ashley et al. ’10; Sheikh et al. ’09, ’10; Brown & B-W ’10
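
The Experimental Protocol slide above can be sketched in a few lines, together with a check of the 4-allele and 2-allele rules on the simulated families. This is an illustration under stated assumptions, not the authors' generator: all function names are hypothetical, parental alleles are assumed to be drawn uniformly, the error-injection step is left out because the slides do not say how errors are introduced, and `two_allele_ok` is a literal reading of "a + R ≤ 4" as worded on the slide.

```python
import random


def random_parent(num_loci, num_alleles, rng):
    # Assumption: each parental allele is drawn uniformly from 1..num_alleles.
    return [(rng.randint(1, num_alleles), rng.randint(1, num_alleles))
            for _ in range(num_loci)]


def make_offspring(mother, father, rng):
    # At each locus the child gets one allele from each parent, chosen uniformly.
    return [(rng.choice(m), rng.choice(f)) for m, f in zip(mother, father)]


def simulate(num_females=5, num_males=5, num_loci=4, num_alleles=10,
             num_families=5, offspring_per_family=5, seed=0):
    # Parameters correspond to F, M, l, a, f and o on the slide.
    rng = random.Random(seed)
    females = [random_parent(num_loci, num_alleles, rng) for _ in range(num_females)]
    males = [random_parent(num_loci, num_alleles, rng) for _ in range(num_males)]
    families = []
    for _ in range(num_families):
        mother, father = rng.choice(females), rng.choice(males)
        families.append([make_offspring(mother, father, rng)
                         for _ in range(offspring_per_family)])
    return families


def four_allele_ok(genotypes):
    # 4-allele rule at one locus: at most 4 distinct alleles in the group.
    return len({x for g in genotypes for x in g}) <= 4


def two_allele_ok(genotypes):
    # 2-allele rule at one locus: a + R <= 4, where a counts distinct alleles
    # and R counts alleles that are homozygous or appear with 3 other alleles.
    alleles = {x for g in genotypes for x in g}
    partners = {x: set() for x in alleles}
    homozygous = set()
    for x, y in genotypes:
        if x == y:
            homozygous.add(x)
        else:
            partners[x].add(y)
            partners[y].add(x)
    r = sum(1 for x in alleles if x in homozygous or len(partners[x]) >= 3)
    return len(alleles) + r <= 4


families = simulate()
for family in families:
    for locus in range(len(family[0])):
        column = [child[locus] for child in family]
        # True full siblings (no genotyping errors) satisfy both rules.
        assert four_allele_ok(column) and two_allele_ok(column)
```

The rule checks also reproduce the Yes/No examples from the Mendelian constraints slide, for instance `two_allele_ok([(3, 3), (1, 3), (1, 5)])` holds while adding `(1, 6)` violates the rule.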
