Introduction to Genetic Analysis
Bruce Walsh, jbwalsh@u.arizona.edu
Ecology and Evolutionary Biology, University of Arizona
Adjunct Appointments: Molecular and Cellular Biology, Plant Sciences, Epidemiology & Biostatistics, Animal Sciences

Outline
• Mendelian Genetics
  – Genes, Chromosomes & DNA
  – Mendel’s laws
  – Linkage
  – Linkage disequilibrium
• Quantitative Genetics
  – Fisher’s decomposition of genetic value
  – Fisher’s decomposition of genetic variances
  – Resemblance between relatives
  – Searching for the underlying genes

Mendelian Genetics
Following a single gene (or several genes) that we can directly score.
Phenotype is highly informative as to genotype.

Mendel’s Genes
Genes are discrete particles, with each parent passing one copy to its offspring.
Let an allele be a particular copy of a gene. In diploids, each individual carries two alleles for every gene, one from each parent.
Each parent contributes one of its two alleles (at random) to its offspring.
For example, a parent with genotype Aa (a heterozygote for alleles A and a) has a 50% probability of passing an A allele on to its offspring and a 50% probability of passing along an a allele.

Example: Pea seed color
Mendel found that his pea lines differed in seed color, with a single locus (with alleles Y and g) determining green vs. yellow.
YY (Y homozygote) --> yellow phenotype
Yg (heterozygote) --> yellow phenotype
gg (g homozygote) --> green phenotype
Y is dominant to g; g is recessive to Y.
Note that in this simple case, each genotype maps to a single phenotype.
Likewise, the phenotype can tell us about the underlying genotype: green = gg, yellow = carries a Y allele (Y-).
Cross Yg x Yg. Offspring are 1/4 YY, 1/2 Yg, 1/4 gg: 3/4 yellow peas, 1/4 green peas.
Cross Yg x gg.
Offspring are 1/2 Yg, 1/2 gg: 1/2 yellow, 1/2 green.

Dealing with two (or more) genes
For 7 pea traits, Mendel observed Independent Assortment: the genotype at one locus is independent of the genotype at the second.
RR, Rr - round seeds; rr - wrinkled seeds
YY, Yg - yellow seeds; gg - green seeds
Pure round, green (RRgg) x pure wrinkled, yellow (rrYY)
F1 --> RrYg = all round, yellow (Rg/rY)
What about the F2?
Let R- denote RR and Rr. R- are round. Note in F2, Pr(R-) = 1/2 + 1/4 = 3/4, Pr(rr) = 1/4.
Likewise, Y- are YY or Yg, and are yellow.

Phenotype          Genotype   Frequency
Yellow, round      Y-R-       (3/4)*(3/4) = 9/16
Yellow, wrinkled   Y-rr       (3/4)*(1/4) = 3/16
Green, round       ggR-       (1/4)*(3/4) = 3/16
Green, wrinkled    ggrr       (1/4)*(1/4) = 1/16

Or a 9:3:3:1 ratio.

Mendel was wrong: Linkage
Bateson and Punnett looked at
flower color: P (purple) dominant over p (red)
pollen shape: L (long) dominant over l (round)
PPLL x ppll --> PL/pl F1

Phenotype      Genotype   Observed   Expected (9:3:3:1)
Purple long    P-L-       284        215
Purple round   P-ll       21         71
Red long       ppL-       21         71
Red round      ppll       55         24

Excess of PL, pl gametes over Pl, pL.
Departure from independent assortment -- why?

Chromosomal theory of inheritance
Early light microscope work on dividing cells revealed small (usually) rod-shaped structures that appear to pair during cell division. These are chromosomes.
It was soon postulated that genes are carried on chromosomes, because chromosomes behaved in a fashion that would generate Mendel’s laws.
We now know that each chromosome consists of a single double-stranded DNA molecule (covered with proteins), and it is this DNA that codes for the genes.

Linkage
If genes are located on different chromosomes they (with very few exceptions) show independent assortment.
Indeed, peas have only 7 chromosome pairs, so was Mendel lucky in choosing seven traits at random that happen to all be on different chromosomes? Ans: compute this probability.
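
A minimal sketch of that calculation follows. It rests on one simplifying assumption that is not in the slides: each gene is equally likely to sit on any of the 7 chromosomes, independently of the others (that is, all chromosomes are treated as the same size). The function name is illustrative only.

```python
from fractions import Fraction
from math import factorial


def prob_all_distinct(n_genes: int, n_chromosomes: int) -> Fraction:
    """Probability that n_genes all fall on different chromosomes.

    Assumption (not from the slides): every gene lands on each chromosome
    with equal probability, independently of the other genes.
    """
    if n_genes > n_chromosomes:
        return Fraction(0)  # pigeonhole: at least two genes must share a chromosome
    # Ordered choices of distinct chromosomes, divided by all possible placements.
    favorable = factorial(n_chromosomes) // factorial(n_chromosomes - n_genes)
    return Fraction(favorable, n_chromosomes ** n_genes)


p = prob_all_distinct(7, 7)
print(p, float(p))
```

Under this equal-size assumption the answer is 7!/7^7, well under 1%, so seven unlinked traits would be an unlikely outcome of a purely random choice.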
However, genes on the same chromosome, especially if they are close to each other, tend to be passed on to offspring in the same configuration as on the parental chromosomes.

Consider the Bateson-Punnett pea data
Let PL / pl denote that in the parent, one chromosome carries the P and L alleles (at the flower color and pollen shape loci, respectively), while the other chromosome carries the p and l alleles.
Unless there is a recombination event, one of the two parental chromosome types (PL or pl) is passed on to the offspring. These are called the parental gametes.
However, if a recombination event occurs, a PL/pl parent can generate Pl and pL recombinant gametes to pass on to its offspring.

Linkage --> excess of parental gametes
Let c (or θ) denote the recombination frequency --- the probability that a randomly-chosen gamete from the parent is of the recombinant type (i.e., it is not a parental gamete).
For a PL/pl parent, the gamete frequencies are

Gamete type   Frequency   Expectation under independent assortment
PL            (1-c)/2     1/4
pl            (1-c)/2     1/4
pL            c/2         1/4
Pl            c/2         1/4

Parental gametes in excess, as (1-c)/2 > 1/4 for c < 1/2.
Recombinant gametes in deficiency, as c/2 < 1/4 for c < 1/2.
In the Bateson data, Freq(ppll) = 55/381 = 0.144.
Freq(ppll) = [(1-c)/2]^2. Solving gives c = 0.24.

Linkage is our friend
While linkage (at first blush) may seem a complication, it is actually our friend, allowing us to map genes -- determining which genes are on which chromosomes and also fine-mapping their position on a particular chromosome.
Historically, the genes that have been mapped have direct effects on phenotypes (pea color, fly eye color, any number of simple human diseases, etc.).
In the molecular era, we are often concerned with molecular markers, variations in the DNA sequence that typically have no effect on phenotype.

Genetic Maps and Mapping Functions
The unit of genetic distance between two markers is the recombination frequency, c (also called θ).
If the phase of a parent is AB/ab, then 1-c is the frequency of “parental” gametes (e.g., AB and ab), while c is the frequency of “nonparental” gametes (e.g., Ab and aB).
A parental gamete results from an EVEN number of crossovers, e.g., 0, 2, 4, etc.
For a nonparental (also called a recombinant) gamete, we need an ODD number of crossovers between A & b, e.g., 1, 3, 5, etc.
Hence, simply using the frequency of “recombinant” (i.e., nonparental) gametes UNDERESTIMATES the number m of crossovers, with E[m] > c.
In particular, c = Prob(odd number of crossovers).
Mapping functions attempt to estimate the expected number of crossovers m from observed recombination frequencies c.
When considering two linked loci, the phenomenon of interference must be taken into account: the presence of a crossover in one interval typically decreases the likelihood of a nearby crossover.

Suppose the order of the genes is A-B-C.
If there is no interference (i.e., crossovers occur independently of each other) then
$c_{AC} = c_{AB}(1 - c_{BC}) + (1 - c_{AB})\,c_{BC} = c_{AB} + c_{BC} - 2 c_{AB} c_{BC}$
Here c_AC is the probability of an odd number of crossovers between A and C. The first term is an odd number of crossovers in A-B and an even number in B-C; the second term is an even number between A & B and an odd number in B-C.
We need to assume independence in order to multiply these two probabilities.
When interference is present, we can write this as
$c_{AC} = c_{AB} + c_{BC} - 2(1 - \delta)\,c_{AB} c_{BC}$
where δ is the interference parameter.
δ = 0 --> No interference. Crossovers occur independently of each other.
δ of 1 --> complete interference: the presence of a crossover eliminates nearby crossovers.

Mapping functions.
Moving from c to m
Haldane’s mapping function (gives Haldane map distances)
Assume the number k of crossovers in a region follows a Poisson distribution with parameter m. This makes the assumption of NO INTERFERENCE.
$\Pr(\text{Poisson} = k) = \lambda^k e^{-\lambda} / k!$, where λ = expected number of successes.
Then c = Prob(odd number of crossovers):
$c = \sum_{k=0}^{\infty} p(m, 2k+1) = e^{-m} \sum_{k=0}^{\infty} \frac{m^{2k+1}}{(2k+1)!} = \frac{1 - e^{-2m}}{2}$
This gives the estimated Haldane distance as
$m = -\frac{\ln(1 - 2c)}{2}$
Usually reported in units of Morgans or centiMorgans (cM).
One Morgan --> m = 1.0. One cM --> m = 0.01.

Molecular Markers
You and your neighbor differ at roughly 22,000,000 nucleotides (base pairs) out of the roughly 3 billion bp that comprise the human genome.
Hence, LOTS of molecular variation to exploit.
SNP -- single nucleotide polymorphism. A particular position on the DNA (say base 123,321 on chromosome 1) that has two different nucleotides (say G or A) segregating.
STR -- short tandem repeats. An STR locus consists of a number of short repeats, with alleles defined by the number of repeats.
For example, you might have 6 and 4 copies of the repeat on your two chromosome 7s.

SNPs vs STRs
SNPs
Cons: Less polymorphic (at most 2 alleles)
Pros: Low mutation rates, alleles very stable
Excellent for looking at historical long-term associations (association mapping)
STRs
Cons: High mutation rate
Pros: Very highly polymorphic
Excellent for linkage studies within an extended pedigree (QTL mapping in families or pedigrees)

Linkage disequilibrium
At linkage equilibrium (LE), alleles in gametes are independent of each other:
freq(AB) = freq(A) freq(B)
freq(ABC) = freq(A) freq(B) freq(C)
When linkage disequilibrium (LD) is present, alleles are no longer independent --- knowing that one allele is in the gamete provides information on alleles at other loci:
freq(AB) ≠ freq(A) freq(B)
The disequilibrium between alleles A and B is given by
$D_{AB} = \mathrm{freq}(AB) - \mathrm{freq}(A)\,\mathrm{freq}(B)$

Forces that Generate LD
• Selection
• Drift
• Migration (admixture)
• Mutation
• Population structure (stratification)

The Decay of Linkage Disequilibrium
The frequency of the AB gamete is given by
$\mathrm{freq}(AB) = \mathrm{freq}(A)\,\mathrm{freq}(B) + D_{AB}$
where freq(A) freq(B) is the LE value and D_AB is the departure from LE.
If the recombination frequency between the A and B loci is c, the disequilibrium in generation t is
$D(t) = D(0)\,(1 - c)^t$
where D(0) is the initial LD value.
Note that D(t) -> zero, although the approach can be slow when c is very small.
Not surprising that very tightly-linked markers (c << 0.01) are often in LD.

Key Mendelian Concepts
• Genes, Chromosomes & DNA
• “Classical” vs Molecular markers
• Linkage
  – Parental gametes in excess.
    Alleles at nearby loci tend to segregate together
  – Excess of parental gametes seen in any particular cross
• Linkage disequilibrium (LD)
  – LD implies that in the population there is a nonrandom association of alleles
  – Unlinked alleles can show LD due to population structure

Quantitative Genetics
The analysis of traits whose variation is determined by both a number of genes and environmental factors.
Phenotype is highly uninformative as to underlying genotype.

Complex (or Quantitative) trait
• No (apparent) simple Mendelian basis for variation in the trait
• May be a single gene strongly influenced by environmental factors
• May be the result of a number of genes of equal (or differing) effect
• Most likely, a combination of both multiple genes and environmental factors
• Example: Blood pressure, cholesterol levels
  – Known genetic and environmental risk factors
• Molecular traits can also be quantitative traits
  – mRNA level on a microarray analysis
  – Protein spot volume on a 2-D gel

Phenotypic distribution of a trait
[Figure: phenotypic distribution of a trait, overall and for one genotype at a specific locus]
Consider a specific locus influencing the trait.
For this locus, mean phenotype = 0.15, while overall mean phenotype = 0.

Goals of Quantitative Genetics
• Partition total trait variation into genetic (nature) vs. environmental (nurture) components
• Predict resemblance between relatives
  – If a sib has a disease/trait, what are your odds?
• Find the underlying loci contributing to genetic variation
  – QTL -- quantitative trait loci
• Deduce molecular basis for genetic trait variation
• eQTLs -- expression QTLs, loci with a quantitative influence on gene expression
  – e.g., QTLs influencing mRNA abundance on a microarray

Dichotomous (binary) traits
Presence/absence traits (such as a disease) can (and usually do) have a complex genetic basis.
Consider a disease susceptibility (DS) locus underlying a disease, with alleles D and d, where allele D significantly increases your disease risk.
In particular, Pr(disease | DD) = 0.5, so that the penetrance of genotype DD is 50%.
Suppose Pr(disease | Dd) = 0.2, Pr(disease | dd) = 0.05.
dd individuals can (rarely) display the disease, largely because of exposure to adverse environmental conditions.
dd individuals can give rise to phenocopies 5% of the time, showing the disease but not as a result of carrying the risk allele.
If freq(d) = 0.9, what is Prob(DD | show disease)?
With genotype frequencies in Hardy-Weinberg proportions,
freq(disease) = 0.1^2*0.5 + 2*0.1*0.9*0.2 + 0.9^2*0.05 = 0.0815
From Bayes’ theorem,
Pr(DD | disease) = Pr(disease | DD)*Pr(DD)/Prob(disease) = 0.1^2*0.5 / 0.0815 = 0.06 (6%)
Pr(Dd | disease) = 0.442, Pr(dd | disease) = 0.497
Thus about 50% of the diseased individuals are phenocopies.

Basic model of Quantitative Genetics
Basic model: P = G + E
P = Phenotypic value -- we will occasionally also use z for this value
G = Genotypic value
E = Environmental value
G = average phenotypic value for that genotype if we are able to replicate it over the universe of environmental values, G = E[P]
G x E interaction --- G values are different across environments.
Basic model now becomes P = G + E + GE

Contribution of a locus to a trait
Three equivalent ways of writing the genotypic values:

Genotype   Q1Q1    Q2Q1         Q2Q2
           C       C + a(1+k)   C + 2a
           C       C + a + d    C + 2a
           C - a   C + d        C + a

2a = G(Q2Q2) - G(Q1Q1)
d = ak = G(Q1Q2) - [G(Q2Q2) + G(Q1Q1)]/2
d measures dominance, with d = 0 if the heterozygote is exactly intermediate to the two homozygotes.
k = d/a is a scaled measure of the dominance.

Example: Apolipoprotein E & Alzheimer’s

Genotype               ee     Ee     EE
Average age of onset   68.4   75.5   84.3

2a = G(EE) - G(ee) = 84.3 - 68.4 --> a = 7.95
ak = d = G(Ee) - [G(EE) + G(ee)]/2 = -0.85
k = d/a = -0.11
Only a small amount of dominance.

Covariances
• Cov(x,y) = E[x*y] - E[x]*E[y]
Cov(x,y) > 0, positive (linear) association between x & y
Cov(x,y) < 0, negative (linear) association between x & y
Cov(x,y) = 0, no linear association between x & y
Cov(x,y) = 0 DOES NOT imply no association
[Figure: four scatterplots of Y against X, illustrating cov(X,Y) > 0, cov(X,Y) < 0, and two cases with cov(X,Y) = 0]
Cov(x+y,z) = Cov(x,z) + Cov(y,z)
Cov(x,x) = Var(x)
Var(x+y) = Cov(x+y,x+y)
         = Cov(x,x) + Cov(y,y) + 2Cov(x,y)
         = Var(x) + Var(y) + 2Cov(x,y)

Fisher’s (1918) Decomposition of G
One of Fisher’s key insights was that the genotypic value consists of a fraction that can be passed from parent to offspring and a fraction that cannot.
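
One way to make this split concrete is a small numerical sketch for a single locus with two alleles. It regresses genotypic value on the number of Q2 alleles, weighting genotypes by their frequencies: the fitted line is the transmissible (additive) part and the residuals are the non-transmissible (dominance) part. Random mating (Hardy-Weinberg genotype frequencies), the function name, and the allele frequency used in the example call are assumptions made here for illustration; the genotypic values use the 0, a(1+k), 2a scale.

```python
def fisher_decomposition(a: float, k: float, p1: float) -> dict:
    """Weighted regression of genotypic value G on N = number of Q2 alleles.

    Assumes random mating, so genotype frequencies are p1^2, 2*p1*p2, p2^2.
    """
    if not 0.0 < p1 < 1.0:
        raise ValueError("p1 must lie strictly between 0 and 1")
    p2 = 1.0 - p1
    n_vals = (0, 1, 2)                          # Q1Q1, Q1Q2, Q2Q2
    g_vals = (0.0, a * (1.0 + k), 2.0 * a)      # genotypic values
    freqs = (p1 ** 2, 2.0 * p1 * p2, p2 ** 2)   # genotype frequencies

    mean_n = sum(f * n for f, n in zip(freqs, n_vals))
    mean_g = sum(f * g for f, g in zip(freqs, g_vals))
    var_n = sum(f * (n - mean_n) ** 2 for f, n in zip(freqs, n_vals))
    cov_gn = sum(f * (g - mean_g) * (n - mean_n)
                 for f, g, n in zip(freqs, g_vals, n_vals))

    slope = cov_gn / var_n                      # effect of replacing one Q1 by Q2
    predicted = [mean_g + slope * (n - mean_n) for n in n_vals]
    deviations = [g - ghat for g, ghat in zip(g_vals, predicted)]

    var_additive = sum(f * (ghat - mean_g) ** 2 for f, ghat in zip(freqs, predicted))
    var_dominance = sum(f * d ** 2 for f, d in zip(freqs, deviations))
    return {
        "mean": mean_g,
        "slope": slope,
        "predicted": predicted,
        "dominance_deviations": deviations,
        "var_additive": var_additive,
        "var_dominance": var_dominance,
    }


# Illustrative call: a and k taken from the ApoE example, p1 = 0.5 is hypothetical.
result = fisher_decomposition(a=7.95, k=-0.107, p1=0.5)
for name, value in result.items():
    print(name, value)
```

Because the residuals of a weighted least-squares fit are uncorrelated with the predictor, the two variance pieces returned here add up to the total genotypic variance.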
Consider the genotypic value G_ij resulting from a Q_iQ_j individual:
$G_{ij} = \mu_G + \alpha_i + \alpha_j + \delta_{ij}$
Mean value: $\mu_G = \sum G_{ij} \cdot \mathrm{freq}(Q_i Q_j)$
α_i is the average contribution to genotypic value for allele i.
Since parents pass along single alleles to their offspring, the α_i (the average effect of allele i) represent these contributions.
The genotypic value predicted from the individual allelic effects is thus
$\hat{G}_{ij} = \mu_G + \alpha_i + \alpha_j$
Dominance deviations --- the difference (for genotype Q_iQ_j) between the genotypic value predicted from the two single alleles and the actual genotypic value:
$G_{ij} - \hat{G}_{ij} = \delta_{ij}$

Fisher’s decomposition is a Regression
$G_{ij} = \mu_G + \alpha_i + \alpha_j + \delta_{ij}$
Predicted value: μ_G + α_i + α_j. Residual error: δ_ij.
A notational change clearly shows this is a regression:
$G_{ij} = \mu_G + 2\alpha_1 + (\alpha_2 - \alpha_1) N + \delta_{ij}$
Independent (predictor) variable: N = # of Q2 alleles
Intercept: μ_G + 2α_1
Regression slope: α_2 - α_1
Regression residual: δ_ij
$2\alpha_1 + (\alpha_2 - \alpha_1) N = \begin{cases} 2\alpha_1 & \text{for } N = 0, \text{ e.g., } Q_1Q_1 \\ \alpha_1 + \alpha_2 & \text{for } N = 1, \text{ e.g., } Q_1Q_2 \\ 2\alpha_2 & \text{for } N = 2, \text{ e.g., } Q_2Q_2 \end{cases}$
[Figure: genotypic values G11, G21, G22 plotted against N = 0, 1, 2 with the fitted regression line, slope = α2 - α1. Panels: allele Q1 common, α2 > α1; allele Q2 common, α1 > α2; both Q1 and Q2 frequent, α1 = α2 = 0]

Consider a diallelic locus, where p1 = freq(Q1)

Genotype          Q1Q1   Q2Q1     Q2Q2
Genotypic value   0      a(1+k)   2a

Mean: $\mu_G = 2 p_2 a (1 + p_1 k)$
Allelic effects:
$\alpha_2 = p_1 a\,[1 + k(p_1 - p_2)]$
$\alpha_1 = -p_2 a\,[1 + k(p_1 - p_2)]$
Dominance deviations: $\delta_{ij} = G_{ij} - \mu_G - \alpha_i - \alpha_j$

Average effects and Additive Genetic Values
The α values are the average effects of an allele.
A key concept is the Additive Genetic Value (A) of an individual:
$A(G_{ij}) = \alpha_i + \alpha_j$
and, summing over n loci,
$A = \sum_{k=1}^{n} \left( \alpha_i^{(k)} + \alpha_j^{(k)} \right)$

Why all the fuss over A?
Suppose father has A = 10 and mother has A = -2 for (say) blood pressure.
Expected blood pressure in their offspring is (10-2)/2 = 4 units above the population mean.
Offspring A = average of parental A’s.
KEY: parents only pass single alleles to their offspring. Hence, they only pass along the A part of their genotypic value G.

Genetic Variances
$G_{ij} = \mu_G + (\alpha_i + \alpha_j) + \delta_{ij}$
$\sigma^2(G) = \sigma^2(\mu_G + (\alpha_i + \alpha_j) + \delta_{ij}) = \sigma^2(\alpha_i + \alpha_j) + \sigma^2(\delta_{ij})$
as Cov(α, δ) = 0. Over n loci,
$\sigma^2(G) = \sum_{k=1}^{n} \sigma^2\left(\alpha_i^{(k)} + \alpha_j^{(k)}\right) + \sum_{k=1}^{n} \sigma^2\left(\delta_{ij}^{(k)}\right)$
$\sigma^2_G = \sigma^2_A + \sigma^2_D$
σ²_A: Additive Genetic Variance (or simply additive variance)
σ²_D: Dominance Genetic Variance (or simply dominance variance)

Key concepts (so far)
• α_i = average effect of allele i
  – Property of a single allele in a particular population (depends on genetic background)
• A = Additive Genetic Value
  – A = sum (over all loci) of average effects
  – Fraction of G that parents pass along to their offspring
  – Property of an individual in a particular population
• Var(A) = additive genetic variance
  – Variance in additive genetic values
  – Property of a population
• Can estimate A or Var(A) without knowing any of the underlying genetical detail (forthcoming)

Additive variance
$\sigma^2_A = 2E[\alpha^2] = 2 \sum_{i=1}^{m} \alpha_i^2 p_i$
Since E[α] = 0, Var(α) = E[(α - μ_α)^2] = E[α^2].
One locus, 2 alleles (genotypic values 0, a(1+k), 2a for Q1Q1, Q1Q2, Q2Q2):
$\sigma^2_A = 2 p_1 p_2 a^2 [1 + k(p_1 - p_2)]^2$
Dominance affects the additive variance: when dominance is present, σ²_A is an asymmetric function of allele frequencies.

Dominance variance
$\sigma^2_D = E[\delta^2] = \sum_{i=1}^{m} \sum_{j=1}^{m} \delta_{ij}^2 p_i p_j$
One locus, 2 alleles:
$\sigma^2_D = (2 p_1 p_2 a k)^2$
Equals zero if k = 0. This is a symmetric function of allele frequencies.
[Figure: additive variance VA against allele frequency p with no dominance (k = 0); VA and VD against allele frequency p with complete dominance (k = 1)]

Epistasis
$G_{ijkl} = \mu_G + (\alpha_i + \alpha_j + \alpha_k + \alpha_l) + (\delta_{ij} + \delta_{kl}) + (\alpha\alpha_{ik} + \alpha\alpha_{il} + \alpha\alpha_{jk} + \alpha\alpha_{jl}) + (\alpha\delta_{ikl} + \alpha\delta_{jkl} + \alpha\delta_{kij} + \alpha\delta_{lij}) + (\delta\delta_{ijkl})$
$= \mu_G + A + D + AA + AD + DD$
A: Additive Genetic value
D: Dominance value -- interaction between the two alleles at a locus
AA: Additive x Additive interactions -- interactions between a single allele at one locus and a single allele at another
AD: Additive x Dominance interactions -- interactions between an allele at one locus and the genotype at another, e.g., allele A_i and genotype B_kj
DD: Dominance x Dominance interaction --- the interaction between the dominance deviation at one locus and the dominance deviation at another
These components are defined to be uncorrelated (or orthogonal), so that
$\sigma^2_G = \sigma^2_A + \sigma^2_D + \sigma^2_{AA} + \sigma^2_{AD} + \sigma^2_{DD}$

Heritability
• Central concept in quantitative genetics
• Proportion of variation due to additive genetic values
  – h^2 = VA/VP
  – Phenotypes (and hence VP) can be directly measured
  – Breeding values (and hence VA) must be estimated
• Estimates of VA require known collections of relatives

Key observations
• The amount of phenotypic resemblance among relatives for the trait provides an indication of the amount of genetic variation for the trait.
• If trait variation has a significant genetic basis, the closer the relatives, the more similar their appearance.

Genetic Covariance between relatives
Genetic covariances arise because two related individuals are more likely to share alleles than are two unrelated individuals.
Sharing alleles means having alleles that are identical by descent (IBD): both copies can be traced back to a single copy in a recent common ancestor.
[Figure: pedigrees (father, mother, offspring) illustrating sibs with no alleles IBD, one allele IBD, and both alleles IBD]

Parent-offspring genetic covariance
Cov(Gp, Go) --- Parents and offspring share EXACTLY one allele IBD.
Denote this common allele by A1.
$G_p = A_p + D_p = \alpha_1 + \alpha_x + D_{1x}$
$G_o = A_o + D_o = \alpha_1 + \alpha_y + D_{1y}$
α_1 is the IBD allele; α_x and α_y are the non-IBD alleles.
$\mathrm{Cov}(G_o, G_p) = \mathrm{Cov}(\alpha_1 + \alpha_x + D_{1x},\ \alpha_1 + \alpha_y + D_{1y})$
$= \mathrm{Cov}(\alpha_1, \alpha_1) + \mathrm{Cov}(\alpha_1, \alpha_y) + \mathrm{Cov}(\alpha_1, D_{1y})$
$+ \mathrm{Cov}(\alpha_x, \alpha_1) + \mathrm{Cov}(\alpha_x, \alpha_y) + \mathrm{Cov}(\alpha_x, D_{1y})$
$+ \mathrm{Cov}(D_{1x}, \alpha_1) + \mathrm{Cov}(D_{1x}, \alpha_y) + \mathrm{Cov}(D_{1x}, D_{1y})$
All covariance terms other than Cov(α_1, α_1) are zero.
• By construction, α and D are uncorrelated
• By construction, α from non-IBD alleles are uncorrelated
• By construction, D values are uncorrelated unless both alleles are IBD
$\mathrm{Cov}(\alpha_x, \alpha_y) = \begin{cases} 0 & \text{if } x \neq y, \text{ i.e., not IBD} \\ \mathrm{Var}(A)/2 & \text{if } x = y, \text{ i.e., IBD} \end{cases}$
Var(A) = Var(α_1 + α_2) = 2Var(α_1), so that Var(α_1) = Cov(α_1, α_1) = Var(A)/2.
Hence, relatives sharing one allele IBD have a genetic covariance of Var(A)/2.
The resulting parent-offspring genetic covariance becomes Cov(Gp, Go) = Var(A)/2.

Half-sibs
Each sib gets exactly one allele from the common father, and different alleles from the different mothers.
[Figure: half-sib pedigree with a common father, mothers 1 and 2, and offspring o1 and o2]
The half-sibs share one allele IBD
• occurs with probability 1/2
The half-sibs share no alleles IBD
• occurs with probability 1/2
Hence, the genetic covariance of half-sibs is just (1/2)Var(A)/2 = Var(A)/4.

Full-sibs
Each sib gets exactly one allele from each parent.
[Figure: full-sib pedigree with father and mother]
Paternal allele: IBD [Prob = 1/2], not IBD [Prob = 1/2]
Maternal allele: IBD [Prob = 1/2], not IBD [Prob = 1/2]
-> Prob(both alleles IBD) = 1/2*1/2 = 1/4
   Prob(zero alleles IBD) = 1/2*1/2 = 1/4
   Prob(exactly one allele IBD) = 1/2 = 1 - Prob(0 IBD) - Prob(2 IBD)

Resulting Genetic Covariance between full-sibs

IBD alleles   Probability   Contribution
0             1/4           0
1             1/2           Var(A)/2
2             1/4           Var(A) + Var(D)

Cov(Full-sibs) = Var(A)/2 + Var(D)/4

Genetic Covariances for General Relatives
Let r = (1/2)Prob(1 allele IBD) + Prob(2 alleles IBD)
Let u = Prob(both alleles IBD)
General genetic covariance between relatives:
Cov(G) = rVar(A) + uVar(D)
When epistasis is present, additional terms appear:
r^2 Var(AA) + ru Var(AD) + u^2 Var(DD) + r^3 Var(AAA) + ...

Shared environmental values
Cov(P1, P2) = Cov(G1+E1, G2+E2) = Cov(G1,G2) + Cov(E1,E2)
In humans, relatives (esp.
family members) often share environments as well as sharing genes.
Shared maternal effects are potentially important as well.

Sample Covariances
Cov(monozygotic twins) = VA + VD + Cov(E)
Cov(dizygotic twins) = VA/2 + VD/4 + Cov(E)
Cov(parent, offspring) = VA/2
Hence, we can estimate genetic variance components from phenotypic covariances using known sets of relatives.
More generally, use all comparisons between relatives in a complex pedigree (REML estimate of variances).

Relative risks for binary traits
Let z1 and z2 denote the trait state (0,1) in two relatives.
Recurrence risk, K_R (for relatives of type R) = Prob(z2 = 1 | z1 = 1)
James’ identity: K_R = K + Cov(z1,z2)/K, where K = Prob(z = 1), i.e., the population prevalence
Relative risk, λ_R = K_R/K
Risch’s identity: λ_R = 1 + Cov(z1,z2)/K^2

Searching for QTLs: Marker-Trait Associations
I. Within a pedigree
Key: With linkage = excess of parental gametes
MQ/mq father -- M associated with QTL allele Q (which increases trait value over q).
Comparing mean trait values in offspring for paternal-M vs. paternal-m will show (for a sufficiently large sample) a significant difference.
Since the phase may differ across parents (e.g., mother might be Mq/mQ), it is critical to contrast marker alleles from each parent separately.

Searching for QTLs: Marker-Trait Associations
II. Population-level linkage disequilibrium
Key: With LD, covariance between alleles
For very tightly-linked markers (less than 1 cM), might expect some population-level disequilibrium.
Hence, can contrast (say) M vs. m grouped over all individuals to look for a difference in trait value between the two groups.
If the marker locus is sufficiently close to a QTL, LD might be present and a marker-trait association detected.
Complication: Population structure can generate a covariance between unlinked markers.

Key concepts
• P = G + E = A + D + I + E, where I collects the epistatic interaction terms (AA + AD + DD + ...)
• Var(G) = Var(A) + Var(D) + Var(I)
• Phenotypic covariances can be used to estimate components of Var(G)
• h^2 = Var(A)/Var(P) is the heritability of a trait, a measure of how parents & offspring resemble each other
• Can use linkage (within a pedigree) or linkage disequilibrium (within a population) to search for QTLs via marker-trait associations
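
The covariance results for relatives summarized above can be sketched as a short calculation. The sketch takes the IBD sharing probabilities for each pair of relatives, forms the coefficients r and u, and returns rVar(A) + uVar(D). Epistasis and shared environments are ignored here, which is an assumption of this sketch; the function names and the variance values in the example call are hypothetical.

```python
from fractions import Fraction

# (Prob 0 alleles IBD, Prob 1 allele IBD, Prob 2 alleles IBD) for each pair type.
IBD_PROBS = {
    "parent-offspring": (Fraction(0), Fraction(1), Fraction(0)),
    "half-sibs": (Fraction(1, 2), Fraction(1, 2), Fraction(0)),
    "full-sibs": (Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)),
    "monozygotic twins": (Fraction(0), Fraction(0), Fraction(1)),
}


def covariance_coefficients(p0, p1, p2):
    """Return (r, u) with r = P1/2 + P2 and u = P2."""
    if p0 + p1 + p2 != 1:
        raise ValueError("IBD probabilities must sum to 1")
    return p1 / 2 + p2, p2


def genetic_covariance(relationship, var_a, var_d):
    """Cov(G) = r*Var(A) + u*Var(D); epistatic terms are ignored (assumption)."""
    r, u = covariance_coefficients(*IBD_PROBS[relationship])
    return r * var_a + u * var_d


def h2_from_parent_offspring(cov_po, var_p):
    """Heritability implied by Cov(parent, offspring) = Var(A)/2.

    Assumes parents and offspring share no environmental covariance.
    """
    return 2 * cov_po / var_p


# Hypothetical variance components, for illustration only.
VAR_A, VAR_D = 4, 1
for name in IBD_PROBS:
    print(name, genetic_covariance(name, VAR_A, VAR_D))
```

Exact fractions are used so that the coefficients print as 1/2, 1/4, and so on; floating-point variance components can be passed in the same way.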