BASICS OF GENETICS: WHAT YOU NEED TO KNOW
Ahmed Rebai
Ahmed.rebai@cbs.rnrt.tn

DNA: THE CODE OF LIFE
DNA is a molecule made of four building blocks (the nucleotides A, C, G and T).
All living cells and organisms contain DNA.
DNA contains the ‘text’ of life.

FROM DNA TO PROTEIN
Parts of DNA are CODING (they give proteins): this is only about 1-3% of the human genome but roughly 70% of the yeast genome.
Parts of DNA are NON-CODING:
  Introns
  Regulatory regions of genes
  Other (‘junk’ DNA!)

GENE
Gene: a section of DNA that codes for a protein; the protein contributes to a trait.
A chromosome is a ‘chunk’ of DNA, and genes are parts of chromosomes.

GENES … ALLELES
Because we have a pair of each chromosome, we have two copies of each gene.
These two copies can be identical in sequence or different; the alternative forms are called ALLELES.
Alleles can yield different phenotypes.

ALLELE
Allele: one of the different ‘options’ for a gene.
Example: attached and unattached earlobes correspond to the two alleles of the gene for earlobe shape.

DOMINANT/RECESSIVE
Dominant: an allele that blocks or hides a recessive allele.
Recessive: an allele that is blocked or hidden by a dominant allele.

GENOTYPE
Genotype: a person’s set of alleles (gene options).
Genotypes can be noted by:
  Two letters denoting the alleles: AA, AB, BB or, for single-nucleotide variations, for example AA, AG, GG
  A digit: 1, 2, 3 or 0, 1, 2 (the number of copies of a chosen reference allele)

HOMOZYGOUS/HETEROZYGOUS
Homozygous: when a person’s two alleles for a gene are the same.
Heterozygous: when a person’s two alleles for a gene are different.
You get one allele from your mom and one from your dad. If you get the same allele from both, you are homozygous for that gene. If your mom gave you a different allele than your dad, you are heterozygous for that gene.

PHENOTYPE
Phenotype: a person’s physical features resulting from their genotype.
What you look like (your phenotype) depends on your genotype (your genes).

SEGREGATION: LESSONS FROM PEAS
Mendel (1822-1884) worked in the monastery of St. Thomas in the town of Brno (Brünn), in what is now the Czech Republic.
Through a series of experiments on garden peas in 1856-1863, he discovered the laws of inheritance.

SEXUAL REPRODUCTION

MENDELIAN GENETICS: THE LAWS

SEGREGATION

SEGREGATION RULES
1. Genes come in pairs, which means that a cell or individual has two copies (alleles) of each gene.
2. For each pair of genes, the alleles may be identical (homozygous WW or homozygous ww), or they may be different (heterozygous Ww).
3. Each reproductive cell (gamete) produced by an individual contains only one allele of each gene (that is, either W or w).
4. In the formation of gametes, any particular gamete is equally likely to include either allele (hence, from a heterozygous Ww genotype, half the gametes contain W and the other half contain w).
5. The union of male and female reproductive cells is a random process that reunites the alleles in pairs.

MENDEL’S FIRST LAW
The Principle of Segregation: in the formation of gametes, the paired hereditary determinants separate (segregate) in such a way that each gamete is equally likely to contain either member of the pair.

RECOMBINATION
Mendel studied the co-segregation of two genes by crossing: Wrinkled and Green x Round and Yellow

MENDEL’S SECOND LAW
The Principle of Independent Assortment: segregation of the members of any pair of alleles is independent of the segregation of other pairs in the formation of reproductive cells.
This is of course valid only for unlinked genes.

RECOMBINATION
When two genes are linked (close together on the same chromosome) they do not segregate independently; the frequencies of genotypes in the progeny depend on the distance between the genes.

MULTIPLE GENES FOR A PHENOTYPE: POLYGENIC TRAITS

CONTINUOUS SCALE FOR A PHENOTYPE

LET US PRACTICE
What genotypes are produced by the following matings, and with what frequencies?
  AA x AA
  AA x Aa
  AA x aa
  Aa x Aa
  Aa x aa
  aa x aa
What are the frequencies of the two-gene genotypes from this mating: AABb x AaBB?
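The matings in the exercise above can be enumerated mechanically from the segregation rules. The following is a minimal sketch, not a reference implementation: the function names `gametes` and `cross` are hypothetical, genotypes are assumed to be written with two letters per gene (for example 'Aa' or 'AaBb'), and the genes are assumed to be unlinked so that they assort independently.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def gametes(genotype):
    """Return {gamete: probability} for a genotype such as 'Aa' or 'AaBb'.

    Assumption: each consecutive pair of letters is one gene, and genes
    are unlinked (independent assortment).
    """
    genes = [genotype[i:i + 2] for i in range(0, len(genotype), 2)]
    dist = Counter()
    # One allele is drawn from each gene, each choice with probability 1/2.
    for alleles in product(*genes):
        dist["".join(alleles)] += Fraction(1, 2 ** len(genes))
    return dist


def cross(parent1, parent2):
    """Return {offspring genotype: probability} for the mating parent1 x parent2."""
    offspring = Counter()
    for (g1, p1), (g2, p2) in product(gametes(parent1).items(),
                                      gametes(parent2).items()):
        # Pair the alleles gene by gene; sorting writes 'aA' as 'Aa'.
        genotype = "".join("".join(sorted(a + b)) for a, b in zip(g1, g2))
        offspring[genotype] += p1 * p2
    return dict(offspring)


if __name__ == "__main__":
    matings = [("AA", "AA"), ("AA", "Aa"), ("AA", "aa"),
               ("Aa", "Aa"), ("Aa", "aa"), ("aa", "aa"),
               ("AABb", "AaBB")]
    for mother, father in matings:
        result = cross(mother, father)
        summary = ", ".join(f"{g}: {p}" for g, p in result.items())
        print(f"{mother} x {father} -> {summary}")
```

Exact fractions are used instead of floats so that the results read like the 1/4, 1/2, 1/4 ratios of a Punnett square.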
POPULATION GENETICS
Basic concepts and theories

PROBABILITY IN POPULATION GENETICS
Consider the offspring of the mating Aa x Aa.
The addition rule: Pr(an offspring has at least one A allele) = Pr(A-) = Pr(AA or Aa) = Pr(AA) + Pr(Aa) = 1/4 + 1/2 = 3/4
For any two mutually exclusive events A and B: Pr(A or B) = Pr(A) + Pr(B)
The multiplication rule: Pr(two offspring each have at least one A allele) = Pr(A- and A-) = Pr(A-) x Pr(A-) = 3/4 x 3/4 = 9/16
For any two independent events A and B: Pr(A and B) = Pr(A) x Pr(B)

EXERCISE
Two individuals with genotypes Aa and Aa married and had three children; what is the probability that exactly one of their children has the genotype aa?
For a given birth order: Pr(aa and (AA or Aa) and (AA or Aa)) = Pr(aa) x Pr(A-) x Pr(A-) = 1/4 x 3/4 x 3/4 = 9/64
Since the aa child can occupy any of three positions in the birth order, we multiply by 3, which gives 27/64.
Compute the probability for the case of two children. (Answer: 6/16; for four children it is again 27/64.)

ORGANIZATION OF GENETIC VARIATION
A population is a group of organisms of the same species living within a sufficiently restricted geographical area that any member can potentially mate with any other member (of the opposite sex).
Population subdivision can be due to geographic constraints as well as to social behaviour.
Local populations (by country, town, ...): groups of individuals that can interbreed, also called subpopulations or Mendelian populations.

GENETIC VARIATION
Phenotypic diversity in natural populations is impressive and is due to genetic variation: multiple alleles for many genes affecting the phenotype.
Population genetics is concerned with describing how alleles are organized into genotypes and with determining whether alleles of the same or different genes are associated at random.

ALLELE FREQUENCIES IN POPULATIONS
The frequency of an allele is the proportion, among all alleles of the gene in the population, that are of the specified type.
Since populations are large, allele frequencies are estimated from a population sample.
Consider a gene with genotypes AA, Aa and aa and a sample of N
individuals. We count the number of individuals with genotypes AA, Aa and aa (denoted $N_{AA}$, $N_{Aa}$ and $N_{aa}$, respectively) and we estimate the frequency of allele A as the number of A alleles among all alleles sampled, that is: $p_A = (2N_{AA} + N_{Aa})/(2N)$, and then $p_a = 1 - p_A$.

EXAMPLE
In a sample of 1000 individuals, 298 were of genotype MM, 489 MN and 213 NN, so the frequency of allele M is $p_M = (2 \times 298 + 489)/(2 \times 1000) = 0.54$.
We can compute a 95% confidence interval for the frequency based on the binomial law and its normal approximation: $p \pm 1.96\sqrt{p(1-p)/(2N)}$
This approximation is only valid for frequencies that are neither small (>0.1) nor high (<0.9).
In the example we get [0.52 ; 0.56].

FOR RARE ALLELES
For rare alleles (frequency below 1%) there is a chance that the sample contains no carrier of the allele, so the estimated frequency will be 0.
An alternative is to use an empirical Bayes estimate: with a Beta(2,2) prior the posterior mean is $p = (k+2)/(n+4)$, where k is the observed number of copies of the allele in the sample and n the total number of alleles (a uniform prior would give $(k+1)/(n+2)$).

RANDOM MATING
Random mating means that any two individuals (of opposite sex) have the same probability of mating.
Genotypes therefore pair with the same probability as if mating pairs were formed by random collision of genotypes.
Random mating may hold for some genes, such as those controlling blood groups or neutral polymorphisms, but not for others, such as those controlling skin color or height.

NON-OVERLAPPING GENERATIONS
Formally, this means that the cycle of birth, maturation and death includes the death of all individuals present in each generation before the next generation matures.
This is only an approximation (simplistic in humans) but it works well as far as genotype frequencies are concerned.

THE HARDY-WEINBERG PRINCIPLE
If we assume that:
  The organism is diploid
  Reproduction is sexual
  Generations are non-overlapping
  Allele frequencies are identical in males and females
  The population is large
  Mating is random
  Migration and mutation are negligible
  Natural selection does not affect the alleles
THEN..
Genotype frequencies can be deduced from allele frequencies (p is the frequency of allele A, q=1-p that of allele a):
  AA: p²
  Aa: 2pq
  aa: q²
These frequencies (allelic and genotypic) remain the same over generations: we say that the population is in Hardy-Weinberg Equilibrium (HWE).

WHY?

IMPLICATION OF HWE
Despite its very restrictive and often unrealistic assumptions, HWE offers a reference model in which there are no evolutionary forces at work other than those imposed by the process of reproduction itself (like a mechanical model of a falling object with no force in action other than gravity).
The HW model separates the life cycle into two phases: gametes->zygote and zygote->adult.
Even if the assumption of non-overlapping generations does not hold, HWE will be attained gradually.
It also applies to multiallelic genes.

APPLICATION OF HWE
We can calculate the frequency of carriers of a rare mutation in the population.
Example: the frequency of cystic fibrosis patients in the European population is known to be 1 in 1700 (q² = 1/1700, so q=0.024), so the frequency of heterozygotes (2pq under HWE) is about 5%.
So when an allele is very rare, most genotypes containing it are heterozygous. Show that for a rare allele of frequency 1/1000 there are about 2000 times more heterozygotes than recessive homozygotes.

HWE DEVIATION
Deviation from HWE can be due to inbreeding, population stratification, selection, sex-dependent allele frequencies, or non-random (assortative) mating.
The principle does not apply directly to X-linked or Y-linked genes.

TESTS OF HWE
Compare observed to expected genotype counts using the Pearson chi-square goodness-of-fit test: with 3 genotypes and 1 estimated parameter (p) we have a test with 1 df.
This is inappropriate for rare variants (low genotype counts): use the Fisher Exact Test (FET) instead.
Other exact tests are available in the R language (e.g.
Genetics package, …).

PEARSON CHI-SQUARE THROUGH D
Let $D_A = P_{AA} - p^2$. Testing HWE amounts to testing $D_A = 0$.
Compute $\chi^2 = N D_A^2 / (p(1-p))^2$
p-value = $\Pr(\chi^2_{1\,df} > \chi^2_{obs})$
If p-value < 0.05 (or 0.0001), we conclude that there is deviation from HWE.

TESTS OF HWE: LET’S DO IT!
Example: in a sample of 1000 individuals, 298 were of genotype MM, 489 MN and 213 NN, so the frequency of allele M is $p_M = 0.54$.
Genotypes:        MM     MN     NN
Observed counts:  298    489    213
Expected counts:  294.3  496.4  209.3
The observed frequency is $P_{MM} = 0.298$ and the expected one is $p_M^2 = 0.294$, so $D = 0.298 - 0.294 = 0.004$.
$\chi^2 = N D^2/(p(1-p))^2 = 1000 \times (0.004/(0.54 \times 0.46))^2 \approx 0.26 < 3.84$; p-value = 0.61 (with unrounded values, $\chi^2 \approx 0.22$ and p-value $\approx 0.64$), so there is no significant deviation from HWE.

HAPLOTYPES FROM GENOTYPES
If we study many genes, some of them can be linked, and one can then use haplotypes.
A haplotype (haploid genotype) is the set of alleles carried by one chromosome at several genes.
Consider two genes (A,a) and (B,b) with allele frequencies ($p_A$, $p_a$) and ($p_B$, $p_b$).
If the gametic frequencies are the products of the allele frequencies (AB: $p_A p_B$, Ab: $p_A p_b$, aB: $p_a p_B$, ab: $p_a p_b$), we say that the genes are in random association, or in linkage equilibrium.

LINKAGE DISEQUILIBRIUM
If the observed frequency of gametes (e.g.
$P_{AB}$) differs from that expected under linkage equilibrium ($p_A p_B$), we say that the genes are in Linkage Disequilibrium (LD).
To measure and test LD we need to know the haplotype frequencies.

LINKAGE DISEQUILIBRIUM
[Figure: two SNPs with allele frequencies 30%/70% (SNP1) and 40%/60% (SNP2); haplotype frequencies without LD (42%, 28%, 18%, 12%) and with LD (60%, 30%, 10%).]

LD MEASURES: D
The difference between the observed and expected haplotype frequencies: $D = P_{AB} - p_A p_B$
It is also equal to $D = P_{AB} P_{ab} - P_{Ab} P_{aB}$
D is bounded between $D_{min}$ and $D_{max}$, which depend on the allele frequencies.

D’: STANDARDIZED D
In practice, choose alleles A and B such that D>0 and $p_A > p_B$. A standardized measure of LD is then:
$D' = D/D_{max} = D/((1-p_A)\,p_B)$
D’=1 denotes complete LD.

THE R² MEASURE: MORE PRACTICAL
This is the squared correlation from the 2x2 contingency table of haplotype counts:
$r^2 = D^2/(p_A\,p_a\,p_B\,p_b)$
or, equivalently, $r^2 = (D')^2\,\dfrac{p_B\,p_a}{p_A\,p_b}$

TESTING LD
We can show that $N r^2$ is a chi-square test statistic for LD (1 df), where N is the number of haplotypes sampled.
Exercise: two blood group systems, M/N and S/s, gave the following haplotype counts (1000 individuals, 2000 haplotypes):
  MS: 474
  Ms: 611
  NS: 142
  Ns: 773
Allele frequencies are M: 0.54, S: 0.31.
Compute D, D’ and r², then test LD.
Solution: D=0.07, D’=0.50, r²=0.09, $\chi^2 = N r^2 \approx 185$, $p < 10^{-40}$

CAUSES OF LD
LD is ‘created by linkage’.
If r is the recombination rate between two genes, we can show that LD at generation t is given by $D_t = (1-r)^t D_0$
If r is small (genes very close on the chromosome) the decay is very slow and LD can persist for hundreds of generations.

RECOMBINATION AND LD
[Figure: a double heterozygote produces each parental gamete with frequency (1-r)/2 and each recombinant gamete with frequency r/2.]

DECAY OF LD OVER GENERATIONS

ADMIXTURE OF POPULATIONS
LD can be created by the merging of populations that have different gametic frequencies.
Consider two populations and two genes in linkage equilibrium in both, where alleles A and B each have frequency 0.05 in the first population and 0.95 in the second.
A new population is formed by an equal mixture of the two populations. Show that LD is high in that population (D=0.2 and D’=0.81).
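The admixture exercise above can be checked numerically, together with the decay formula for LD. The sketch below is only illustrative: the function names (`ld_measures`, `equilibrium_haplotypes`, `ld_after`) are hypothetical, D’ is computed with the usual sign-dependent bound (which reduces to the slide's $(1-p_A)p_B$ under its convention D>0 and $p_A>p_B$), and both loci are assumed to be polymorphic so that no division by zero occurs.

```python
def ld_measures(p_AB, p_Ab, p_aB, p_ab):
    """Return (D, D', r^2) from the four haplotype frequencies (summing to 1).

    Assumption: both loci are polymorphic (0 < p_A < 1 and 0 < p_B < 1).
    """
    p_A = p_AB + p_Ab
    p_B = p_AB + p_aB
    D = p_AB - p_A * p_B
    # The bound on |D| depends on the sign of D.
    if D >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(D) / d_max
    r2 = D ** 2 / (p_A * (1 - p_A) * p_B * (1 - p_B))
    return D, d_prime, r2


def equilibrium_haplotypes(p_A, p_B):
    """Haplotype frequencies (AB, Ab, aB, ab) under linkage equilibrium."""
    return (p_A * p_B, p_A * (1 - p_B), (1 - p_A) * p_B, (1 - p_A) * (1 - p_B))


def ld_after(D0, r, t):
    """LD after t generations of random mating: D_t = (1 - r)^t * D_0."""
    return (1 - r) ** t * D0


# Two source populations, each in linkage equilibrium.
pop1 = equilibrium_haplotypes(0.05, 0.05)
pop2 = equilibrium_haplotypes(0.95, 0.95)

# Equal mixture: average the haplotype frequencies of the two populations.
mixed = tuple((x + y) / 2 for x, y in zip(pop1, pop2))
D, d_prime, r2 = ld_measures(*mixed)

if __name__ == "__main__":
    print(f"D = {D:.4f}, D' = {d_prime:.2f}, r^2 = {r2:.4f}")
    # Decay of this LD for unlinked genes (r = 0.5) and tightly linked ones.
    for r in (0.5, 0.01):
        print(f"r = {r}: D after 10 generations = {ld_after(D, r, 10):.4f}")
```

Although neither source population shows any LD, the mixture does, because haplotype frequencies average across populations while the products of allele frequencies do not.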
ADMIXTURE

NATURAL (DARWINIAN) SELECTION
Individuals differ in their ability to survive and reproduce, owing in part to their genotype.
The selective advantage or disadvantage is measured by fitness.
Selection results in a change of allele frequencies over generations and in deviation from HWE.

EFFECT OF SELECTION

RANDOM GENETIC DRIFT
In each generation, chance plays a role in the drawing of the gametes that will unite to form the next generation.
This chance can result in a random change in allele frequency and may ultimately lead to the fixation or elimination of some alleles.

SIMPLY PUT

MATHEMATICAL MODELS OF DRIFT
Wright-Fisher model (1930): in a diploid population of size N, the probability of obtaining k copies of an allele that had frequency p in the last generation is:
$\Pr(k) = \binom{2N}{k}\,p^k\,(1-p)^{2N-k}$
The expected time (in generations) before a neutral allele becomes fixed through genetic drift is given by:
$\bar{T}_{fixed} = -\dfrac{4N_e\,(1-p)\ln(1-p)}{p}$, where $N_e$ is the effective population size

POPULATION BOTTLENECK

FOUNDER EFFECT

POPULATION SUBSTRUCTURE
A population is substructured when it is organized in several subpopulations having different genetic compositions (allele frequencies).
Substructure generally results in a reduction of the frequency of heterozygotes relative to that expected with random mating (Wahlund principle).
Several measures assess population substructure: F-statistics.

F-STATISTICS
Defined by Wright (1921)
(1-FIT)=(1-FIS)(1-FST)

ANOTHER FORMULATION
The most useful statistic for testing substructure is FST, an index that measures the level of genetic divergence among subpopulations:
FST=(HT-HS)/HT
HS: average heterozygosity among individuals within subpopulations
HT: average heterozygosity among individuals within the total population
In terms of the variance of allele frequencies among subpopulations: $F_{ST} = \sigma_p^2/(\bar{p}\,(1-\bar{p}))$

HOW TO USE IT?
FST=1 means total divergence, by fixation of alternative alleles in the subpopulations.
  FST<0.05: little differentiation
  0.05<FST<0.15: moderate
  0.15<FST<0.25: high
  FST>0.25: very high
Chi-square test with 1 df: X²= (k-1) N FST
Examples: between Europeans and sub-Saharan Africans: 0.15; Japanese-African: 0.19; Europeans: 0.11

EXAMPLE
Two populations where the allele frequency is 0.5 and 0.3. Assuming equal subpopulation sizes: HS=(0.50+0.42)/2=0.46 and HT=2x0.4x0.6=0.48, so FST=(0.48-0.46)/0.48≈0.04 (little differentiation).

ADMIXTURE
Genetic admixture occurs when individuals from two or more previously separated populations begin interbreeding.
Admixture results in the introduction of new genetic lineages into a population.
Most human populations are the product of mixture between genetically distinct groups that intermixed within the last 4,000 years.

ADMIXTURE DETECTION
By testing HWE
Standard statistical methods applied to genotype, allele or haplotype frequency data:
  Principal Component Analysis (PCA)
  Clustering: K-means, hierarchical, ...
Advanced methods:
  Maximum likelihood (psmix R package)
  Bayesian methods (STRUCTURE)
  Wavelet analysis (adwave R package)

PRINCIPAL COMPONENT ANALYSIS

CLUSTERING

STRUCTURE
Its uses include inferring the presence of distinct populations, assigning individuals to populations, studying hybrid zones, identifying migrants and admixed individuals, and estimating population allele frequencies in situations where many individuals are migrants or admixed.
http://pritchardlab.stanford.edu/structure.html

ADMIXTURE
https://www.genetics.ucla.edu/software/admixture/

R PACKAGES
Genetics: classes and methods for handling genetic data. Includes classes to represent genotypes and haplotypes at single markers up to multiple markers on multiple chromosomes. Functions include allele frequencies, flagging homo/heterozygotes, flagging carriers of certain alleles, estimating and testing for Hardy-Weinberg disequilibrium, estimating and testing for linkage disequilibrium, ...
Adegenet: classes and functions for genetic data analysis within the multivariate framework.
Hierfstat: estimation of hierarchical F-statistics from haploid or diploid genetic data with any number of levels in the hierarchy, following the algorithm
Functions are also given to test, via randomisation, the significance of each F and of the variance components.

RECOMMENDED READINGS