Applications in Forensic Genetics Science and Statistics Behind Y STR Systems Wisconsin Department of Justice Madison, WI January 4, 2008 John V. Planz, Ph.D. UNT Center for Human Identification Assessing the Significance of Y STR Data • Characteristics of Y chromosome and loci • Population structure and distribution of Y haplogroups • Forensic applications • Powerplex Y characteristics • Working with haplotype statistics • What about mixtures Classic View of Y-Chromosome • TDF master gene • patrilineal inheritance • no recombination in NRY • recombination in PAR • junk-rich, gene poor Characteristics of the Human Y Chromosome • size: ~ 60 Mb • ~ 35 Mb euchromatic (transcribed) • ~ 25 Mb heterochromatic (non-transcribed) • 95% non-recombining (NRY) • 5% X-recombining (2 pseudoautosomal regions at telomeres) • shape: acrocentric - very short p-arm, long q-arm (“Y” name) • rich in different kinds of repetitive DNA sequences • lack of recombination • relatively poor in gene content Genes on the Human Y Chromosome • 23 Mb of the euchromatic region determined • 156 transcription units • 78 encode proteins (genes) • 27 distinct Y-specific protein-coding genes (gene families) • 16 ubiquitously expressed genes = housekeeping genes – e.g. RPS4Y, ZFY, AMELY, SMCY, DBY • 9 testis-specific genes = male sex determination, spermatogenesis – e.g. SRY, TSPY, CDY, RBMY, DAZ Genes Mapped to Y Chromosome Genes on the Human Y Chromosome • 23 Mb of the euchromatic region determined • 156 transcription units • 78 encode proteins (genes) • 27 distinct Y-specific protein-coding genes (gene families) • 16 ubiquitously expressed genes = housekeeping genes • 9 testis-specific genes = male sex determination, spermatogenesis origin of NRY genes: – derived / preserved from the proto-sex chromosomes (X-homology) – specialization in male-specific function Evolution of Mammalian Sex Chromosomes Lahn, Pearson & Jegalian 2001 Some homology with X – need to consider in validation Polymorphisms of the Human Y Chromosome Mutations create DNA polymorphisms and these may serve as genetic markers • Single-Copy DNA – e.g., SNPs, indels • Repetitive DNA – e.g., STRs Y Chromosome Polymorphisms ~ 200 binary polymorphisms (Y-SNPs) characterized > 300 microsatellites (Y-STRs) characterized 1 minisatellite (MSY1) Not all mutations occur at the same rate ‘hotspots’ ‘coldspots’ Mutation Process for STR loci Y-STR consensus structure and allele ranges Y-STR consensus structure and allele ranges Phylogenetic tree based in binary SNP data Forensic Y STR Systems -From J.M. Butler (2003) Forensic Sci. Rev. 15:91-111 Definitions • Haplotype: combination of allelic states of a set of polymorphic markers lying on the same DNA molecule. • Haplogroup: set of haplotypes defined by slowly mutating markers (mainly SNPs) which have more phylogenetic stability. Unique event polymorphisms (UEP) record history of Y chromosome Why are our Y haplotypes so different? • Many markers to choose from • The selected loci are physically linked • Markers have both SNPs and STRs Infinite Sites Model Stepwise Mutation Model Infinite Alleles Model Population Differentiation • Effective population size of Y chromosome is 1/4 of autosomes or 1/3 of X – lower sequence diversity on Y – more susceptible to genetic drift • random changes in frequency of haplotypes due to sampling bias from one generation to next • accelerates differences between populations • Variance of offspring further reduces Ne (effective population size) Population Differentiation • Geographical clustering due to patrilocal behavior of men – women move closer to man’s birthplace – local geographical differentiation enhanced – Conquest effect From Zerjal et al. Am. J. Hum. Genet. 72:717–721, 2003 Population Differentiation • Geographical clustering due to patrilocal behavior of men – women move closer to man’s birthplace – local geographical differentiation enhanced – Conquest effect You must consider that we are not talking about contemporary populations when discussing this! Converse seen with mtDNA in Native American Populations Forensic Y-STR Applications – Detect male DNA in a sample containing male and female DNA (Huge background of female DNA) – Aspermic males – Fingernail Scrapings – Additional Power of Discrimination – Multiple male donors – Limits of differential extraction/ tissues – Gender clarification (amelogenin) A Forensic Application Finger Nail Scraping Case • Victim was found strangled to death • Suspect had scratches on his face • Based on STR results, suspect could not be excluded; many alleles were below interpretation threshold (inconclusive result) Evidentiary Profile Suspect Profile Identification of Male Contributor DNA in Crime Scene Material Autosomal STR profile Y STR profile Female Victim DNA: Male Suspect DNA: Large Female DNA: Perpetrator Male DNA - See only female DNA profile - Or partial DNA profile - no female DNA - no profile overlap - only male component Investigations regarding Paternal Lineages • Paternity Testing • Kinship Analysis • Deficiency cases • Mass disasters • Missing Persons • Unidentified Remains Deficiency Case Male Lineage ? • Y STR analysis - any male relative in pedigree can be a reference for alleged father **Remember paternal lineage issues for identity testing Y-STR Haplotype Analysis in Deficiency Paternity Case ? DYS19 DYS389I DYS389II DYS390 DYS391 DYS392 DYS393 DYS385 DYS413 YCAII Nephew 14 13 30(16) 25 11 13 12 11-14 22-22 3-7 Son 14 12 29 (16) 24 10 15 12 11-14 22-22 3-7 Exclusion If true biological nephew, then alleged father is excluded as father of child in question Kayser et al. Progress in Forensic Genetics (1998), 7: 494-496 • For effective use, guidelines are needed • ISFG Recommendations • Combine with existing recommendations (NRC II Report) • Nomenclature, Allelic Ladders, Population Genetics, Statistical Issues Basic Interpretation Guidelines • Similar to autosomal STRs • Thresholds for detection and interpretation • Stutter • Mixtures – what constitutes a mixture • Validation studies in concert with guidelines • Interpret evidence before knowns Y STR LOCI • DYS19 • DYS398 I • DYS398 II • DYS390 • DYS391 • DYS392 • DYS393 • DYS385 I/II “Minimal Haplotype” – defined for research only Y STR Loci DYS389 – two loci DYS385 – two loci DYS19 DYS389I DYS389II DYS390 DYS391 DYS392 DYS393 DYS438 DYS439 DYS385a/b SWGDAM Multi-Copy (Duplicated) Marker DYS385 a/b R primer R primer a b F primer F primer Duplicated regions are 40,775 bp apart and facing away from each other DYS389 I/II II F primer I a=b ab Single Region but Two PCR Products (because forward primers bind twice) DYS389I DYS389II F primer R primer Figure 9.5, J.M. Butler (2005) Forensic DNA Typing, 2nd Edition © 2005 Elsevier Academic Press Kits • Commercially available Y-STR multiplex kits --allow for standard markers and QA/QC • Most have EMH and SWGDAM recommended loci • Extra loci added to enhance discrimination PowerPlex® Y System DYS19 DYS389I DYS389II DYS390 DYS391 DYS392 DYS393 DYS437 DYS438 DYS439 DYS385a/b Powerplex® Y Allelic ladder 92 alleles Powerplex® Y Kit 1 ng Male DNA DYS391 DYS389I DYS438 DYS393 DYS439 DYS437 DYS389II DYS19 DYS390 DYS392 DYS385 AmpFlSTR® Yfiler™ Kit DYS19 DYS389I DYS389II DYS390 DYS391 DYS392 DYS393 DYS437 DYS438 DYS439 DYS385a/b DYS448 DYS456 DYS458 DYS635 GATA H4 AmpFlSTR® Yfiler™ Allelic ladder 137 alleles AmpFlSTR® Yfiler™ Kit 1 ng Male Control DNA 007 DYS458 DYS389 I DYS438 DYS393 Y GATA H4 DYS391 DYS456 DYS390 DYS19 DYS439 DYS437 DYS389 II DYS385 a/b DYS635 DYS392 DYS448 What can we expect? Powerplex® Y • Sensitivity • Mixtures • Anomalies Sensitivity 1.0 ng 0.5 ng 0.25 ng 0.125 ng 0.0625 ng 0.0312 ng Sensitivity 1.0 ng 0.5 ng 0.25 ng 0.125 ng 0.0625 ng 0.0312 ng Sensitivity 1.0 ng 0.5 ng 0.25 ng 0.125 ng 0.0625 ng 0.0312 ng Male-Female Mixture Series 1:0 1:1 1:10 1:100 1:1000 Male-Female Mixture Series 1:0 1:1 1:10 1:100 1:1000 Male-Female Mixture Series 1:0 1:1 1:10 1:100 1:1000 Of course, with Male:Male mixtures you will get more peaks at each locus. Sensitivities down to about 5% minor contributor are typical. You cannot bank on peak height differences to remain consistent across the dyes or loci, so be careful when trying to physically deconvolute these mixtures… this may not be a valid practice! i.e. If target input DNA is 0.5 ng… 5% minor contributor is only 0.025 ng These types of issues should raise some operational questions: Input DNA: • Total Genomic? • Y specific? • Increase to bring up minor? • Impact of stutter? Valid lab policies and interpretation guidelines must be based on empirical data! Stutter Issues From Fulmer et al. 2007 Promega Application Notes DYS389II N-1, N-2 stutter is commonly seen at all input template amounts. 1.0 ng 0.5 ng 0.25 ng 1.0 ng 0.5 ng 0.25 ng DYS392 N-1, N+1 stutter is commonly seen at all input template amounts… this is common among trinucleotide repeat loci. As with all typing systems, there are anomalies that you should be aware of ! The majority of female samples will not produce typing results with the Y STR kits… But remember…the Y and X are functional homologues and recombination IS possible. always run a female “victim” known when using Y STR kits in the male – female context! Other observed Powerplex® Y anomalies DYS19 Primer binding mutation This was in an Asian (Hong Kong) Chinese sample Other observed Powerplex® Y anomalies “Gene” duplication at DYS385 DYS385 DYS385 a/b R primer a F primer R primer b F primer Multiple mutation steps in the lineage are needed to explain this one!! Multiband Y Patterns • MN ASIAN PA0077 DYS385 - 3 Bands • MN HISPANIC PH0031 DYS390 - 2 Bands • MN HISPANIC PH0063 Multibands • NYC HISPANIC 26 • NYC CAUCASIAN 4 DYS19 - 2 Bands • CT HISPANIC 00-1851 DYS19 - 2 Bands • CT HISPANIC 99-1695 DYS19 - 2 Bands • CT HISPANIC 99-0362 DYS19 - 2 Bands • CT HISPANIC 98-2136 DYS19 - 2 Bands • CT CAUCASIAN 00-3022 DYS385 - 3 Bands • ASIAN A-FTA-34-F/C DYS385 - 3 Bands • ASIAN A-FTA-36-F/C DYS19 – Primer Binding site? • ASIAN A-FTA-32-F/C DYS385 - 4 Bands DYS19 - 2 Bands Must consider this when considering a mixture Population studies with Powerplex® Y Before we can approach interpretive or statistical understanding of the system we need to understand what we are dealing with as a locus…and yes, the whole set of markers in Powerplex Y are just that…a single locus. Several typical validation issues just don’t matter with a Y haplotype system: • Peak height ratio • Hardy-Weinberg Equilibrium But other things do! Y STR Population Data Promega Study Population N Population N CFS AFR CT AFR MI AFR NYC AFR TX AFR CFS CAU CT CAU MI CAU NYC CAU TX CAU CT HIS 37 182 86 80 192 57 164 97 83 194 160 MI HIS MN HIS NYC HIS TX HIS Apache Navajo CFS ASN MN ASN NYC ASN TX ASN CFS EI 97 101 80 192 138 219 28 101 45 73 37 Total = 2443 DYS19 Allele Frequencies African American 0.5 Sinha (n=543) CFS (n=37) CT (n=182) MI (n=86) NYC (n=80) TX (n=193) 0.45 0.4 Frequency 0.35 0.3 0.25 0.2 0.15 0.1 0.05 0 12 13 14 15 Alleles 16 17 18 Population Parameters Haplotype Diversity h = n(1-fi 2)/ (n-1) Haplotype Random Match Probability P = fi 2 fi = frequency of each haplotype n = # haplotypes Haplotype Diversity Hispanic 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 DYS437 CT HIS DYS19 MI HIS DYS392 MN HIS DYS393 NYC HIS DYS390 TX HIS Y Haplotype Profiles Population N CFS AFR CT AFR MI AFR NYC AFR TX AFR CFS CAU CT CAU MI CAU NYC CAU TX CAU CT HIS MI HIS MN HIS NYC HIS TX HIS 37 182 86 80 193 57 163 97 83 194 158 97 100 80 192 # Haplotypes 36 172 85 80 181 50 153 87 80 170 130 90 95 74 179 % Single Haplotype Diversity 97.3 94.5 98.8 100 93.8 87.7 93.9 89.7 96.4 87.6 82.3 92.8 95.0 92.5 93.2 0.9985 0.9994 0.9997 >0.9999 0.9993 0.9944 0.9991 0.9968 0.9991 0.9981 0.9963 0.9985 0.9988 0.9968 0.9991 Y Haplotype Profiles Population Apache Navajo CFS ASN MN ASN NYC ASN TX ASN CFS EI N 138 219 28 100 45 70 37 # Haplotypes 70 101 28 96 43 69 35 % Single Haplotype Diversity 50.7 46.1 100 96.0 95.6 98.6 94.6 0.9701 0.9806 >0.9999 0.9992 0.9970 0.9996 0.9955 Haplotype Diversity N>80 1.005 1 0.995 0.99 0.985 0.98 0.975 0.97 0.965 0.96 0.955 high haplotype diversity = high intra-individual variation What about Linkage Equilibrium? Intuitively, since these markers are all on a single chromosome we’d predict strong linkage and a lack of independence between the loci. Although this is very different from what we are used to with our beloved CODIS loci…this is what we expect with a haplotype What do we get? Y STR Loci Pairwise Tests 12 Loci – 66 tests Population N # Equilibrium CFS AFR CT AFR MI AFR NYC AFR TX AFR CFS CAU CT CAU MI CAU NYC CAU TX CAU CT HIS MI HIS MN HIS NYC HIS TX HIS 37 182 86 80 193 57 163 97 83 194 158 97 100 80 192 35 23 27 34 21 30 26 30 22 11 26 33 42 35 37 Y STR Loci Pairwise Tests 12 Loci – 66 tests Population N Apache Navajo CFS ASN MN ASN NYC ASN TX ASN CFS EI 138 219 28 100 45 70 37 # Equilibrium 9 12 60 50 47 50 35 Fewest – Native American Most – Asian (sample size; but Minnesota and Texas) Y STR Loci Pairwise Tests 22 populations; ≥ 17 Equilibrium detected Loci 391/389I 391/389II 389I/439 389I/385-2 439/389II 439/393 439/385-1 439/385-2 389II/393 437/393 # Populations 19 18 17 18 17 19 18 20 17 18 What’s going on here??? – No detectable linkage? Y STR Loci Pairwise Tests 22 populations; ≤ 5 Equilibrium Loci 391/438 389I/389II 438/437 438/19 438/392 438/385-1 438/385-2 437/385-1 19/392 19/385-1 392/385-1 390/385-1 385-1/385-2 # Populations 5 1 4 4 3 0 4 5 5 3 5 5 3 This is what we’d expect… Strong linkage Y STR Loci Pairwise Tests 22 Populations – Examples of Population Specific Disequilibrium Loci 389I/392 438/439 437/439 437/385-2 390/385-2 # Populations/ Equilibria Population/ Equilibria 15 11 16 12 12 Caucasian Caucasian Caucasian African American African American Likely due to haplogroup differences among populations What do we see? • There is evidence of “independence” between some of the loci in some of the populations • A combination of mutation rate, subdivision and random drift can cause this. • One of the biggest factors is Haplogroup Diversity • The marker selection for increasing haplotype diversity is not directly correlated to gene diversity. Approaching Analysis • Some may suggest - “Use the set of Core Y STRs and add more as needed to resolve matches” • First question – when do you stop? • If you get a match, you would have to continue on ad infinitum! • Is this a sensible policy? How much power is needed??? Discriminatory Capacity* for three U.S. populations Y-STR marker combination African American (N=786) Caucasians (N=778) Hispanics (N=381) European minimal haplotype (9) 75.8% 61.7% 79.8% Eur. Minimal + SWGDAM (11) 86.8% 74.3% 85.6% PowerPlex® Y (12) 87.7% 76.7% 88.2% 97.6% 95.5% 95.8% AmpFlSTR® Yfiler kit (17) *DC= (# of different haplotypes / pop. size) x 100 Mulero et al., JFS (2006) 51:64-75 Number of Unique Haplotypes Observed for Three U.S. Populations Y-STR marker combination African American (N=786) Caucasians (N=778) Hispanics (N=381) 496 382 266 Eur. Minimal + SWGDAM (11) 618 503 295 PowerPlex® Y (12) 628 524 306 749 714 350 European minimal haplotype (9) AmpFlSTR® Yfiler kit (17) Mulero et al., JFS (2006) 51:64-75 Common haplotype identified by the European Minimal Haplotype markers (20 individuals in Yfiler haplotype database*) Sample Info C37 C330 ATCC C304 C327 C63 C177 C198 C276RL C294 C236 C85RL C12 C194 C197 C318 C345 C66RL C205 C158 DYS19 DYS385 1 DYS389IDYS389II 1 DYS390 1 1 DYS391 1 DYS392 1 DYS393 1 DYS438 1 DYS439 1 DYS437 1 DYS448 1 DYS456 1 DYS458 1 Y 14 11,14 13 29 24 11 13 13 13 12 14 19 15 17 14 11,14 13 29 24 11 13 13 13 11 15 19 15 19 14 11,14 13 29 24 11 13 13 12 13 15 19 15 17 14 11,14 13 29 24 11 13 13 12 13 15 19 17 18 14 11,14 13 29 24 11 13 13 12 13 15 19 16 16 14 11,14 13 29 24 11 13 13 12 13 15 17 16 17 14 11,14 13 29 24 11 13 13 12 12 15 21 17 19 14 11,14 13 29 24 11 13 13 12 12 15 18 16 17 14 11,14 13 29 24 11 13 13 12 12 15 19 15 19 14 11,14 13 29 24 11 13 13 12 12 15 19 15 17 14 11,14 13 29 24 11 13 13 12 12 14 19 15 18 14 11,14 13 29 24 11 13 13 12 12 14 19 16 16 14 11,14 13 29 24 11 13 13 12 11 15 19 17 18 14 11,14 13 29 24 11 13 13 12 11 15 19 17 17 14 11,14 13 29 24 11 13 13 12 11 15 19 15 16 14 11,14 13 29 24 11 13 13 12 11 15 19 16 17 14 11,14 13 29 24 11 13 13 12 11 15 19 15 18 14 11,14 13 29 24 11 13 13 12 11 15 19 16 19 14 11,14 13 29 24 11 13 13 12 11 14 19 15 17 14 11,14 13 29 24 11 13 13 12 10 15 19 16 20 European Minimal Haplotype # of different haplotypes 0 GATA H4 DYS635 1 12 23 11 23 13 23 12 23 12 24 11 23 12 24 12 23 12 23 11 24 12 23 11 24 12 23 12 23 >13 24 12 23 12 23 12 23 12 23 13 23 PP Y YfilerTM 8 20 * http://www.appliedbiosystems.com/yfilerdatabase/ So…the logic does work… More loci… better resolution… But…doesn’t the size of the database matter? # of different haplotypes Individuals sharing haplotypes Point Estimate (N = 3561) N= 1000 European Minimal Haplotype PP® Y YfilerTM 0 6 20 20 4, 4, 2, 6 0 0.0056 0.0011 0.0011 0.0006 0.0017 0.00028 0.02 0.004 0.004 0.002 0.006 0.001 Approaching Analysis • Unlikely approach because information gain is low • Many samples will already be very limited • Community will rely on commercially available kits not in-house designer systems • QC/ Proficiency Testing • Better to increase size of database(s) to gain power • We will re-visit substructure issues later Approaching Analysis • Some may suggest - “A reference database should contain related individuals” – to better define the population • Probability of paternal relative having the same haplotype is usually 1 • Databases are typically comprised of unrelated individuals • Although a small unknown number of related individuals may be in a database • Able to address significance of a very closely related profile Exclusion with 1 mismatch among 12 analyzed YSTRs Evidence 14 12 28 25 11 11 13 14,14 11 11 15 Known 14 12 28 24 11 11 13 14,14 11 11 15 By having a database of unrelated males one can assess weight of relative (with mutation) versus rarity of haplotype in population Qualitative Conclusions of Y-STR Haplotype Comparison Exclusion - The two haplotypes are dissimilar; i.e, the reference person is excluded as the contributor of Y-specific DNA of the evidence sample Inclusion/Match - The Y haplotypes from two samples are sufficiently similar and potentially could have originated from the same source, or from a common paternal lineage Inconclusive - Exclusion/Inclusion cannot be definitively inferred due to insufficient data from one or both of the DNA samples Calculation to Convey to the Court • Frequency estimate not possible • Court desires a frequency estimate • Point Estimate (Counting Method) • Confidence Interval • Approach the same as mtDNA Calculation to Convey to the Court • The vast majority of possible haplotypes will not be observed in any database • The counting method is likely to be conservative • A correction for sampling • A correction for substructure ?? Calculation to Convey to the Court Approaches • It is more likely that the counting method will be employed by the U.S. laboratories and courts because of its operational simplicity Limitations of the Counting Method • Non-matched sites of the haplotype are given weight equal to that of different origin (but may have some extra value for substructure) • Mutations are not weighted • Haplotypes of the same paternal lineage can be excluded, when they are subject to mutations • Does not recognize evolutionary changes, and/or effect of convergent mutations For Y haplotype observed, count the number of times the profile is observed (X) p = X/N 95% Upper bound on frequency CI = p ± 1.96 p(1-p)/N Where • N is the size of the database What about for Y haplotype that is not observed in your database?? The upper bound of the CI is 1-1/N Where • is the confidence level (0.05 for a 95% CI) • N is the size of the database Following: W. E. Ricker, 1937. Journal of the American Statistical Association, Vol. 32, No. 198: 349-356. Maximum haplotype frequency • If a Y-haplotype is not seen in a sample of N males then at the level of significance: • Maximum frequency = 1 - 1/N • Confidence level = 1- • As N becomes larger, maximum frequency becomes closer to point estimate This is why databases will drive our statistical strength Haplotype frequency N frequency • 100 3/100 (0.03) • 500 3/500 (0.006) • 1,000 3/1,000 (0.003) • 10,000 3/10,000 (0.0003) Calculation to Convey to the Court Confidence Interval • In many instances, the evidentiary haplotype may not be observed in the reference database • As a consequence, the usual assumption of a Normal distribution may not apply for Y-STR haplotype frequency estimates • Ricker’s theory (1937) accommodates this requirement • The counts as well as the confidence bounds are divided by the number of haplotypes sampled in the entire database to estimate the probability of a match Online available Y-STR haplotype reference databases How de we actually get our Haplotype frequencies? Applied Biosystems Yfiler http://www.appliedbiosystems.com/yfilerdatabase/ Promega Powerplex Y http://www.promega.com/techserv/tools/pplexy/default.htm AB Yfiler Haplotype data can be input manually or through file upload. Of Course, There are no matches when testing this many loci. A random Y haplotype Haplotype data are input manually Of Course, There are no matches when testing this many loci. So using our formula from before and an = 0.05 1-1/N 1 – (0.05)1/4004 = 0.00075 So, lets try a haplotype that has been seen in the Database… A general search returns 21 matches in the database. CI = p ± 1.96 p(1-p)/N What p…? What N…? We can evaluate these matches by looking at the distribution of matches among the various population groups. We can do like we did before…looking at the frequency in the whole database: CI = p ± 1.96 p(1-p)/N CI = 0.0052 + 1.96√ (0.0052(0.9948))/4004 Upper bound would be: 0.00743 Or we could do specific to the population group in which the match was found: CI = p ± 1.96 p(1-p)/N CI = 0.0076 + 1.96√ (0.0076(0.9924))/1311 Upper bound would be: Caucasians 0.0123 Or we could do specific to the population group in which the match was found: CI = p ± 1.96 p(1-p)/N CI = 0.0067 + 1.96√ (0.0067(0.9933))/894 Upper bound would be: Hispanics 0.01205 Or we could do specific to the population group in which the match was found: CI = p ± 1.96 p(1-p)/N CI = 0.0036 + 1.96√ (0.0036(0.9964))/1108 Upper bound would be: African American 0.00712 Calculation to Convey to the Court Population Substructure • Correction for population structure may be considered • Effective population size ¼ of autosomal loci • May actually be a little lower • Substructure effects less in US than ancestral populations • Use when reference database considered not representative Problems created by population subdivision Haplotype frequencies calculated from population average frequencies could lead to: – Wrong estimates! Employ a Theta (q ) Correction q is used as a measure of the effects of population substructure (inbreeding, coancestry) NRCII q recommendation was pragmatically set Empirical values are much less for autosomal loci National Academy of Sciences May 1996 Still need to calculate substructure effects But likely to be low for most major populations, if evaluated under a forensic model vs that of an evolutionary model U.S. Y-STR Haplotype Reference Database www.ystr.org/usa AA CAU HIS Total Number of populations sampled 10 11 9 30 Number of individuals sampled 599 628 478 1,705 9 9 9 9 Number of different haplotypes 454 76% 437 70% 354 74% 1116 65% Haplotype diversity 99.8% 99.6% 99.5% 99.7% Most frequent haplotype 12 2.0% 25 3.98% 19 3.97% 53 3.1% Number of Y-STR loci typed (EMH) Kayser et al. J. Forensic Sci. (2002), 47(3): 5513-519 Structure of U.S. Populations with Y-STR Haplotypes Indiana EA Missouri EA Oregon EA Virginia EA European-American Texas EA Cajun EA Lousiana EA Maryland EA New York EA Pennsylvania EA Pennsylvania H Florida H Florida EA Connecticut H Hispanic New York H Oregon H Maryland H Lousiana H Texas H Virginia H RST = 0.1 African-American Lousiana AA Indiana AA Oregon AA Missouri AA Pennsylvania AA New York AA Texas AA Maryland AA Florida AA Virginia AA RST: measure for population differentiation Kayser et al. J. Forensic Sci. (2002), 47(3): 5513-519 Partition (%) of genetic variance (AMOVA) Population African American Asian Caucasian Hispanic Native American Afr-Cau-His All 5 A 98.96 98.69 98.45 99.08 96.98 87.19 83.40 B 1.04 1.31 1.55 0.92 3.02 1.02 1.25 C ----------11.79 15.35 A = within sample population B= among sample populations within major population group (or regional variation) C = among major population components for North American populations FST (AMOVA) AMOVA routine (with the option of allele size difference) of Arlequin 2.0 Population FST African American Asian Caucasian Hispanic Native American Afr-Cau-His All 5 0.0104 0.0131 0.0155 0.0092 0.0302 0.1179 0.1535 ST FST (AMOVA) AMOVA routine (without the option of allele size difference) of Arlequin 2.0 Population ФST FST African American Asian Caucasian Hispanic Native American Afr-Cau-His All 5 0.0104 0.0131 0.0155 0.0092 0.0302 0.1179 0.1535 0.0051 0.0148 0.0071 0.0061 0.0188 0.0745 0.1001 Note: Asian is likely inflated and more data are needed to assess FST FST (AMOVA) Population FST FST African American Asian Caucasian Hispanic Native American Afr-Cau-His All 5 0.0051 0.0148 0.0071 0.0061 0.0188 0.0745 0.1001 0.0006 0.0039 -0.0005 0.0021 0.0282 ----------autosomal Formula f (haplotype) = pi + q (1- pi) With q of 0.01 and our p of 0.00028: 0.00028 + (0.01 x (1 – 0.00028)) 0.00028 + 0.0099 ≈ 0.0103 Note: θ is the limiting factor! Impact Pool populations --- q Most frequent • US populations • Intra-individual variation • Most common haplotypes the same • What is the frequency of unknown or uncommon haplotypes in different datasets? • Even if there is substructure Y STR haplotype is one locus with many alleles Population 1 A1 A2 . . . . A100 Databases with reasonable size approximate this model θ is almost 0 A101 A102 . . . . A200 Population 2 Y STR haplotype is one locus with many alleles Population 1 A1 A2 . . . . An θ approaches 0 A1' A2' . . . . An' Population 2 In reality, with large number of loci a few types are shared and most if not all have never been seen in the database Y STR haplotype is one locus with many alleles So the more loci typed, the more haplotypes/alleles will be in the database Thus, multi-locus kits are valuable for this aspect q approaches 0 In the process of calculating q under forensic model*** Forensic Model Population Substructure Which haplotypes might be more closely related? DYS19 DYS389I DYS389II DYS390 DYS391 DYS392 DYS393 DYS385 A --- 14 12 29 24 10 15 12 11-14 B --- 14 13 29 24 10 15 12 11-14 C --- 14 13 29 24 12 15 12 10-14 D --- 18 11 25 24 10 13 15 12-18 E --- 18 12 25 24 10 15 15 11-18 Forensic Model Population Substructure Are such evolutionary differences considered in forensic evaluation? DYS19 A --- 14 C --- 14 DYS389I 12 13 DYS389II 29 29 DYS390 24 24 DYS391 10 12 DYS392 DYS393 DYS385 15 15 12 12 11-14 10-14 15 15 12 15 11-14 11-18 Exclusion A --- 14 E --- 18 12 12 29 25 24 24 10 10 Exclusion Y STR mutations (father:son allele transmission) Locus Caucasian (N = 199) Afr Amer (N = 203) Hispanic (N = 207) DYS391 DYS389I 12:13 DYS389II 29:30 DYS439 13:12 Asian (N = 83) Total (N = 692) 11:10 1 1 30:29 1* 11:12 2 DYS438 DYS437 0 15:16 DYS19 15:14 2 16:17 2 17:16 DYS392 0 DYS393 14:15 1 DYS390 0 DYS385 14:15 14:15 12,14:14 3 Total 2/2388 6/2436 2/2484 3/996 13/8304 0.00084 0.00246 0.00081 0.0031 0.00157 Y STR mutations (father:son allele transmission) • 692 confirmed father-son pairs (probability > 99.9%) • 14 mutation events were observed • Average rate of 1.57 x 10-3/locus /generation (13/8304) • With a 95% confidence bound of 0.83 x 10-3 to 2.69 x 10-3 • This rate is a little smaller than that of the Kayser, et al. • Estimate (2.80 x 10-3/locus) • But the difference is not statistically significant (P > 0.05). one Asian father-son pair at the DYS389I/II loci complex (12,29) (13, 30) appears as a double mutation, but likely is a single original event. Paternal Relatives share the same haplotype ? 5 Are they related? Mutation: µ (DYS393) = 3.2 x 10-3 Mutation?? 7 14, 12, 28, 22, 10, 11, 14,14 13-14, 19-21 f obs= 0.001 14, 12, 28, 22, 10, 11, 13,13 13-14, 19-21 fobs = 0.007 Likelihood calculation L(X) = 0.001 x 5 x µ/2 + 0.007 x 7 x µ/2 (related) L(Y) = 0.001 x 0.007 (non-related) LR (X/Y) ≈ 12 for patrilinear relationship Next Task • Test independence between autosomal loci and Y haplotypes Independence Testing of Y Haplotype and 13 Autosomal CODIS STR Loci (Autosomal Locus/ Y Haplotype Displaying Disequilibria* - 22 populations) 1. Apache FGA, p-value = 0.03760000 D21S11, p-value = 0.03460000 D18S51, p-value = 0.02820000 D5S818, p-value = 0.02660000 2. Minnesota Asian D8S1179, p < 10-3 3. Minnesota Hispanic D16S539, p-value = 0.03340000 D18S51, p-value = 0.02100000 4. Canada African American FGA, p-value = 0.00920000 5. Canada Asian Indian D7S820, p-value = 0.02820000 6. Connecticut African American FGA, p-value = 0.04300000 THO1, p-value = 0.00280000 7. Connecticut Caucasian THO1, p-value = 0.02880000 8. Michigan Caucasian D16S539, p-value = 0.04820000 Independence Testing of Y Haplotype and 13 Autosomal CODIS STR Loci (Autosomal Locus/ Y Haplotype Displaying Disequilibria* - 22 populations) 9. Michigan Hispanic vWA, p-value = 0.03160000 FGA, p-value = 0.02240000 10. Native American Total D3S1358, p-value = 0.02680000 D21S11, p-value = 0.00060000 D18S51, p-value = 0.00840000 11. Navajo D21S11, p-value = 0.02820000 12. New York Asian D16S539, p-value = 0.00740000 13. New York Caucasian D7S820, p-value = 0.01660000 14. New York Hispanic D21S11, p-value = 0.01340000 15. Texas African American D13S317, p-value = 0.01200000 D18S51, p-value = 0.01420000 16. Texas Hispanic D5S818, p-value = 0.01880000 Next Task • Mixtures • Assume 2 alleles for 11 loci • 211 possible haplotypes with PP Y – 2048 • Most haplotypes not observed in database • Assumption of independence not correct • Minimal haplotype frequency (minimum allele frequency) not practical Mixture • Probability of Exclusion • Binomial distribution - haplotypes excluded and haplotypes not excluded • Count number (m) not excluded; (PI = m/n) • Estimate upper CI of PI • PE = 1- PI • Based on same principles used for autosomal loci (but at haplotype level) Mixture Likelihood Ratio • Four scenarios for two contributor sample in example: •Hp --- S1 and S2 are source • Hd --- S1 and unknown are source (same as PI) • Hd --- S2 and unknown are source (same as PI) • Hd --- two unknowns are source Mixture Likelihood Ratio • Assume three loci, two alleles at each locus, two male suspects •Total alleles the same as in evidence • Equal contribution • 8 possible haplotypes PE/PI 13/15 8/10 22/25 --- 3 locus profile All haplotypes included are 13 8 22 --- haplotype 1 15 8 22 --- haplotype 2 13 8 25 --- haplotype 3 15 8 25 --- haplotype 4 13 10 22 --- haplotype 5 15 10 22 --- haplotype 6 13 10 25 --- haplotype 7 15 10 25 --- haplotype 8 PE/PI 13/15, 8/10, 22/25 --- 3 locus profile All possible haplotypes are included 13 8 22 --- haplotype 1 15 8 22 --- haplotype 2 13 8 25 --- haplotype 3 15 8 25 --- haplotype 4 13 10 22 --- haplotype 5 15 10 22 --- haplotype 6 13 10 25 --- haplotype 7 15 10 25 --- haplotype 8 But certain haplotype pairs can not explain evidence haplotype 1 + haplotype 2 haplotype 1 + haplotype 3 haplotype 1 + haplotype 4 haplotype 1 + haplotype 5 and so on PE/PI 13/15, 8/10, 22/25 --- 3 locus profile All possible haplotypes are included 13 8 22 --- haplotype 1 15 8 22 --- haplotype 2 13 8 25 --- haplotype 3 15 8 25 --- haplotype 4 13 10 22 --- haplotype 5 15 10 22 --- haplotype 6 13 10 25 --- haplotype 7 15 10 25 --- haplotype 8 Only certain haplotype pairs can explain evidence haplotype 1 + haplotype 8 haplotype 2 + haplotype 7 haplotype 3 + haplotype 6 haplotype 4 + haplotype 5 Mixture Likelihood Ratio 1 LR = 2[Pr(H1)Pr(H8) + Pr(H2)Pr(H7) + Pr(H3)Pr(H6) + Pr(H4)Pr(H5)] Mixture LR • Technically correct • Can not estimate individual haplotype frequencies • 217 (131,072) possible haplotypes (Yfiler) 211 (2048) possible haplotypes (PP Y) • Not all combinations can explain the evidence • Assuming independence is not correct • Cannot place types in database, most never seen, too many Use same logic as PI for the denominator in the LR Haplotypes fall into either category E = excluded E = not excluded E/E and E/E pairs can not explain the evidence Only E/E can explain the evidence and only a subset of these fit Mixture LR m* - those pairs (of E/E) that explain evidence m*/n(n-1) and take upper CI as denominator Mixture LR 1 LR = m*/n(n-1) • The denominator is the PI with an assumed number of contributors • Makes better use of data Online available Y-STR haplotype reference databases Calculation of reporting statistics is quite straight-forward with the help of the searchable databases We still have the problem that none of these (or PopStats) is designed to enable mixture calculations!! And doing this with 12 or 17 loci by hand… …would be a bear! John V. Planz, Ph.D UNT Center for Human Identification jplanz@hsc.unt.edu