Course1-Basic_Genetics - H3ABioNet training course material

BASICS OF GENETICS
WHAT YOU NEED TO KNOW
Ahmed Rebai
Ahmed.rebai@cbs.rnrt.tn
DNA.. THE CODE OF LIFE
DNA is a molecule made of four building blocks (the nucleotides A, C, G and T)
Living cells/organisms have DNA within them
DNA contains the ‘text’ of life

DNA
FROM DNA TO PROTEIN
DNA
Parts of DNA are CODING (they give proteins): this is only 3% of the human genome but 95% of the yeast genome
Parts of DNA are NON-CODING:
Introns
Regulatory regions of genes
Other (junk DNA!)

GENE
Gene: a section of DNA
that codes for a protein
and protein contributes
to a trait
 A chromosome is a
‘chunk’ of DNA and
genes are parts of
chromosomes

GENES
… ALLELES
Because we have a pair of each chromosome, we have two copies of each gene
These two copies can be identical in sequence or different: the alternative forms are called ALLELES
Alleles can yield different phenotypes

ALLELE
Allele: the different ‘options’ for a gene
 Example: attached or unattached earlobes are
the alleles for the gene for earlobe shape

DOMINANT/RECESSIVE
Dominant: an allele that blocks or hides a
recessive allele
 Recessive: an allele that is blocked by or hidden
by a dominant allele

GENOTYPE
Genotype: A person’s set of alleles (gene options)
Genotypes can be noted by:
Two letters denoting alleles: AA, AB, BB, or for single-nucleotide variations, for example, AA, AG, GG
A digit 1, 2, 3 or 0, 1, 2 (the number of copies of a chosen reference allele)

HOMOZYGOUS/HETEROZYGOUS
Homozygous: When a person’s two alleles for a
gene are the same
 Heterozygous: When a person’s two alleles for a
gene are different
 You get one allele from your mom and one from
your dad.
 If you get the same alleles from your mom and
dad, you are homozygous for that gene.
 If your mom gave you a different allele than your
dad, you are heterozygous for that gene

PHENOTYPE
Phenotype: A person’s physical features because
of their genotype
 What you look like (your phenotype) is based on
what your genotype is (your genes)

SEGREGATION: LESSONS FROM PEAS

Mendel (1822-1884) worked in the monastery of St. Thomas in the town of Brno (Brünn), in the Czech Republic. Through a series of experiments on garden peas in 1856-1863, he discovered the laws of inheritance
SEXUAL REPRODUCTION
MENDELIAN GENETICS: THE LAWS
SEGREGATION
SEGREGATION RULES
1. Genes come in pairs, which means that a cell or
individual has two copies (alleles) of each gene.
 2. For each pair of genes, the alleles may be
identical (homozygous WW or homozygous ww), or
they may be different (heterozygous Ww).
 3. Each reproductive cell (gamete) produced by
an individual contains only one allele of each gene
(that is, either W or w).
 4. In the formation of gametes, any particular
gamete is equally likely to include either allele
(hence, from a heterozygous Ww genotype, half the
gametes contain W and the other half contain w).
 5. The union of male and female reproductive
cells is a random process that reunites the alleles in
pairs.

MENDEL’S FIRST LAW
The Principle of Segregation: In the formation of gametes, the paired hereditary determinants separate (segregate) in such a way that each gamete is equally likely to contain either member of the pair.
RECOMBINATION
Mendel studied the co-segregation of two genes by crossing:
Wrinkled and Green x Round and Yellow
MENDEL’S SECOND LAW


The Principle of Independent Assortment:
Segregation of the members of any pair of alleles
is independent of the segregation of other pairs in
the formation of reproductive cells.
This is of course valid only for unlinked genes
RECOMBINATION

When two genes are linked (close on the same
chromosome) they do not segregate
independently; frequencies of genotypes in
progeny depend on the distance between genes
MULTIPLE GENES FOR A PHENOTYPE:
POLYGENIC TRAITS
CONTINUOUS SCALE FOR A PHENOTYPE
LET US PRACTICE
What are the genotypes produced by the
following matings and their frequencies:
 AA x AA
 AA x Aa
 AA x aa
 Aa x Aa
 Aa x aa
 aa x aa
 What are the frequencies of two-gene genotypes
from this mating: AABb x AaBB?
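
The matings above can be checked by enumerating the gametes of each parent. The following is a minimal sketch, assuming each parent transmits either of its two alleles with probability 1/2 and that the two genes in the last question are unlinked; the function names `cross` and `cross_two_genes` are illustrative only.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def cross(parent1, parent2):
    """Offspring genotype frequencies for a one-gene mating, e.g. cross("Aa", "aa")."""
    counts = Counter()
    # Each of the 2 x 2 gamete combinations has probability 1/4.
    for g1, g2 in product(parent1, parent2):
        # Sort the two alleles so that "aA" and "Aa" count as the same genotype.
        genotype = "".join(sorted(g1 + g2))
        counts[genotype] += Fraction(1, 4)
    return dict(counts)


def cross_two_genes(parent1, parent2):
    """Two unlinked genes, genotypes written like "AABb" (first gene, then second gene)."""
    first = cross(parent1[:2], parent2[:2])
    second = cross(parent1[2:], parent2[2:])
    # Independent assortment: multiply the single-gene frequencies.
    return {ga + gb: fa * fb for ga, fa in first.items() for gb, fb in second.items()}


for mating in [("AA", "AA"), ("AA", "Aa"), ("AA", "aa"),
               ("Aa", "Aa"), ("Aa", "aa"), ("aa", "aa")]:
    print(mating, cross(*mating))
print(cross_two_genes("AABb", "AaBB"))
```

Exact fractions are used so that the frequencies can be compared directly with the hand-computed answers.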

POPULATION GENETICS
Basic concepts and theories
PROBABILITY IN POPULATION GENETICS
Consider the offspring of the mating Aa x Aa
The addition rule:
Pr(an offspring has at least one A allele) = Pr(A-) = Pr(AA or Aa) = Pr(AA) + Pr(Aa) = 1/4 + 1/2 = 3/4
For any two mutually exclusive events A and B: Pr(A or B) = Pr(A) + Pr(B)
The multiplication rule:
Pr(two offspring having at least one A allele each) = Pr(A- and A-) = Pr(A-) x Pr(A-) = 3/4 x 3/4 = 9/16
For any two independent events A and B: Pr(A and B) = Pr(A) x Pr(B)

EXERCISE
Two individuals with genotypes Aa and Aa married and had three children; what is the probability that exactly one of their children has the genotype aa?
Pr(aa and (AA or Aa) and (AA or Aa)) = Pr(aa) x Pr(A-) x Pr(A-) = 1/4 x 3/4 x 3/4 = 9/64
But since the aa child has three possible birth orders, we should multiply by 3, which gives 27/64.
Compute the same probability for the case of two children.
(answer: 6/16; for 4 children it is also 27/64)
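
The reasoning in this exercise (one probability per birth order, multiplied by the number of possible orders) is the binomial distribution. A small sketch, assuming each child is aa independently with probability 1/4; the function name `prob_exactly_k` is hypothetical.

```python
from fractions import Fraction
from math import comb


def prob_exactly_k(n_children, k, p=Fraction(1, 4)):
    """Probability that exactly k of n_children have a genotype of probability p."""
    # comb(n, k) counts the possible birth orders of the k affected children.
    return comb(n_children, k) * p**k * (1 - p) ** (n_children - k)


for n in (2, 3, 4):
    print(n, "children:", prob_exactly_k(n, 1))
```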
ORGANIZATION OF GENETIC VARIATION
A population is a group of organisms of the same species living within a sufficiently restricted geographical area that any member can potentially mate with any other member (of the opposite sex)
Population subdivision can be due to geographic constraints as well as to social behaviour
Local populations (by country, town, ...): groups of individuals that can interbreed, also called subpopulations or Mendelian populations

GENETIC VARIATION
Phenotypic diversity in natural populations is impressive and is due to genetic variation: multiple alleles for many genes affecting the phenotype
Population genetics is concerned with describing how alleles are organized into genotypes and with determining whether alleles of the same or different genes are associated at random
ALLELE FREQUENCIES IN POPULATIONS
Allele frequency is the proportion in the population of all alleles of the gene that are of the specified type
Since populations are large, allele frequencies are estimated from a population sample
Consider a gene with genotypes AA, Aa and aa and a sample of N individuals
We count the number of individuals that have the AA, Aa and aa genotypes (denoted $N_{AA}$, $N_{Aa}$ and $N_{aa}$, respectively) and we estimate the frequency of allele A by the number of A alleles among all alleles in the sample, that is:
$p_A = (2N_{AA} + N_{Aa}) / (2N)$
and then $p_a = 1 - p_A$
EXAMPLE

In a sample of 1000 individuals, 298 were of genotype MM, 489 MN and 213 NN, so the frequency of allele M is
$p_M = (2 \times 298 + 489) / (2 \times 1000) = 0.54$
We can compute a 95% confidence interval for the frequency based on the binomial law and its normal approximation:
$p \pm 1.96 \sqrt{p(1-p)/(2N)}$
This approximation is only valid for frequencies that are neither small (>0.1) nor high (<0.9)
In the example we get [0.52 ; 0.56]
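
The estimate and its interval can be reproduced with a few lines. This is only a sketch: it assumes the usual normal-approximation interval p ± 1.96·sqrt(p(1-p)/(2N)) computed over the 2N sampled alleles, and the function name `allele_frequency_ci` is illustrative.

```python
from math import sqrt


def allele_frequency_ci(n_AA, n_Aa, n_aa, z=1.96):
    """Frequency of allele A and its normal-approximation confidence interval."""
    n_alleles = 2 * (n_AA + n_Aa + n_aa)      # each individual carries two alleles
    p = (2 * n_AA + n_Aa) / n_alleles         # homozygotes carry two copies of A
    half_width = z * sqrt(p * (1 - p) / n_alleles)
    return p, p - half_width, p + half_width


# M/N example: 298 MM, 489 MN, 213 NN
p_M, low, high = allele_frequency_ci(298, 489, 213)
print(round(p_M, 4), round(low, 2), round(high, 2))
```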
FOR RARE ALLELES
For rare alleles (frequency below 1%) there is a chance that the sample does not contain any carrier of the allele, so the estimated frequency will be 0
An alternative is to use Empirical Bayes estimation:
With a Beta(2,2) prior this gives $p = (k+2)/(n+4)$, where k is the observed number of copies of the allele in the sample and n the total number of alleles (a uniform prior would give $(k+1)/(n+2)$)
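
An estimate of the form (k+2)/(n+4) can be read as the posterior mean of the allele frequency under a Beta prior, where the prior adds pseudo-counts to the observed alleles. A minimal sketch under that assumption; the name `shrunk_frequency` and the parameters `a` and `b` are hypothetical.

```python
def shrunk_frequency(k, n, a=2, b=2):
    """Posterior mean of an allele frequency under a Beta(a, b) prior.

    k: observed copies of the allele, n: total number of alleles sampled.
    """
    return (k + a) / (n + a + b)


# 100 individuals (200 alleles) and no copy of the rare allele observed:
print(0 / 200)                          # naive estimate: 0
print(shrunk_frequency(0, 200))         # (0 + 2) / (200 + 4)
print(shrunk_frequency(0, 200, 1, 1))   # uniform prior: (0 + 1) / (200 + 2)
```

The estimate is never exactly 0, which is the point of using it for rare alleles.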
RANDOM MATING
Means that any two individuals (of opposite sex) have the same probability of mating
This means that genotypes pair with each other with the same probability as if they were formed by random collision of genotypes
Random mating can apply to some genes, like those controlling blood groups or neutral polymorphisms, but not to others, like those controlling skin color or height
NON-OVERLAPPING GENERATIONS
Formally this means that the cycle of birth, maturation and death includes the death of all individuals present in each generation before the next generation matures
This is only an approximation (simplistic in humans) but works well as far as genotype frequencies are concerned
THE HARDY-WEINBERG PRINCIPLE

If we assume that:
The organism is diploid
Reproduction is sexual
Generations are non-overlapping
Allele frequencies are identical in males and females
The population is of large size
Mating is random
Migration and mutation are negligible
Natural selection does not affect the alleles
THEN..
Genotype frequencies can be deduced from allele frequencies (p is the frequency of allele A, q = 1 - p that of allele a):
AA: p²
Aa: 2pq
aa: q²
These frequencies (allelic and genotypic) remain the same over generations: we say that the population is in Hardy-Weinberg Equilibrium (HWE)
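
A short numerical check of this statement: starting from arbitrary genotype frequencies, one generation of random mating gives p², 2pq, q², and a further generation changes nothing. This is a sketch under the assumptions listed above (random union of gametes, no selection, mutation or migration); `next_generation` is an illustrative name.

```python
from fractions import Fraction


def next_generation(genotype_freqs):
    """Genotype frequencies (AA, Aa, aa) after one round of random mating."""
    f_AA, f_Aa, f_aa = genotype_freqs
    p = f_AA + f_Aa / 2        # frequency of allele A among the gametes
    q = 1 - p
    # Random union of gametes: A with A, A with a (two ways), a with a.
    return (p * p, 2 * p * q, q * q)


# A population with only homozygotes is far from equilibrium.
start = (Fraction(1, 2), Fraction(0), Fraction(1, 2))
gen1 = next_generation(start)
gen2 = next_generation(gen1)
print(gen1, gen2)
```

The allele frequency p is the same in `start`, `gen1` and `gen2`; only the genotype frequencies change, and only in the first step.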
WHY?
IMPLICATION OF HWE
Despite its very restrictive and often unrealistic assumptions, HWE offers a reference model in which there are no evolutionary forces at work other than those imposed by the process of reproduction itself (like a mechanical model of a falling object without any force in action other than gravity)
The HW model separates the life cycle into two phases: gametes->zygote and zygote->adult
Even if the assumption of non-overlapping generations is not true, HWE will be attained gradually
Applies also to multiallelic genes

APPLICATION OF HWE
We can calculate the frequency of carriers of a rare mutation in the population
Ex: the incidence of cystic fibrosis in the European population is known to be about 1 in 1700, so q² = 1/1700 and q = 0.024; under HWE the frequency of heterozygotes is 2pq, about 5%
So when an allele is very rare, most genotypes containing this allele are heterozygous
Show that for a rare allele of frequency 1/1000 there are about 2000 times more heterozygotes than recessive homozygotes
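
The carrier calculation can be sketched as follows, assuming HWE and a recessive disease whose incidence equals q²; the function names are illustrative.

```python
from math import sqrt


def carrier_frequency(incidence):
    """Allele frequency q and heterozygote frequency 2pq for a recessive disease."""
    q = sqrt(incidence)       # affected individuals are homozygous: incidence = q^2
    p = 1 - q
    return q, 2 * p * q


def het_to_hom_ratio(q):
    """How many heterozygotes there are per recessive homozygote: 2pq / q^2."""
    return 2 * (1 - q) / q


q_cf, carriers_cf = carrier_frequency(1 / 1700)
print(round(q_cf, 3), round(carriers_cf, 3))
print(het_to_hom_ratio(1 / 1000))
```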

HWE DEVIATION
Deviation from HWE can be due to:
inbreeding,
population stratification,
selection,
gender-dependent allele frequencies,
non-random (assortative) mating
The principle does not apply directly to X-linked or Y-linked genes
TESTS OF HWE
Compare observed to expected genotype counts using the Pearson chi-square test of goodness of fit: with 3 genotypes and 1 parameter estimated (p) we have a test with 1 df
Inappropriate for rare variants (low genotype counts): use the Fisher Exact Test (FET)
Other exact tests are available in the R language (e.g. Genetics package,…)
PEARSON CHI-SQUARE THROUGH D
Let $D_A = P_{AA} - p^2$
Testing HWE is testing $D_A = 0$
Compute
$\chi^2 = N D_A^2 / (p(1-p))^2$
p-value $= \Pr(\chi^2_{1df} > \chi^2_{obs})$
If p-value < 0.05 (or 0.0001) then there is deviation from HWE
TESTS OF HWE: LET’S DO IT!
Example: In a sample of 1000 individuals, 298 were of genotype MM, 489 MN and 213 NN
Genotypes:        MM     MN     NN
Observed counts:  298    489    213
Expected counts:  294.3  496.4  209.3
$p_M = 0.54$, $p_M^2 = 0.294$ so $D = 0.298 - 0.294 = 0.004$
$\chi^2 = N D^2 / (p(1-p))^2 = 1000 \times (0.004 / (0.54 \times 0.46))^2$
$\chi^2 = 0.25 < 3.84$; p-value = 0.61
HAPLOTYPES FROM GENOTYPES
If we study many genes, they can be linked and one can use haplotypes
A haplotype (haploid genotype) is the set of alleles carried by one chromosome for several genes
Consider two genes (A,a) and (B,b) with allele frequencies (pA, pa) and (pB, pb)
If gametic frequencies are the product of allele frequencies:
AB: pA x pB, Ab: pA x pb, aB: pa x pB, ab: pa x pb
we say that the genes are in random association or in linkage equilibrium
LINKAGE DISEQUILIBRIUM
If the observed frequency of gametes (e.g. PAB) differs from that expected under linkage equilibrium (pA x pB), we say that the genes are in Linkage Disequilibrium (LD)
To measure and test LD we need to know the haplotype frequencies
LINKAGE DISEQUILIBRIUM
[Figure: two SNPs with allele frequencies 30%/70% (SNP1) and 40%/60% (SNP2). Haplotype frequencies with no LD: 42%, 28%, 18%, 12%; with LD: 60%, 30%, 10%]
LD MEASURES: D

The difference between observed and expected haplotype frequency:
$D = P_{AB} - p_A p_B$
It is also equal to
$D = P_{AB} P_{ab} - P_{Ab} P_{aB}$
D is bounded between $D_{min}$ and $D_{max}$
D’: STANDARDIZED D
In practice, choose alleles A and B such that D > 0 and $p_A > p_B$
A standardized measure of LD is thus:
$D' = D / D_{max} = D / ((1 - p_A) p_B)$
D’ = 1 denotes complete LD
THE R² MEASURE: MORE PRACTICAL
This is the squared correlation from the 2x2 contingency table of haplotype counts:
$r^2 = D^2 / (p_A p_a p_B p_b)$
Or
$r^2 = (D')^2 \, (p_B p_a) / (p_A p_b)$
TESTING LD

We can show that $N r^2$, where N is the number of haplotypes, is a chi-square test of LD (1 df)
Exercise: two blood group systems, M/N and S/s, gave the following haplotype counts (1000 individuals):
MS: 474 Ms: 611 NS: 142 Ns: 773
Allele frequencies are M: 0.54, S: 0.31
Compute D, D’ and r²
Test LD
Solution: D=0.07, D’=0.50, r²=0.09, X²=185, p<10^-40
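
The three LD measures and the chi-square statistic can be computed from the four haplotype counts. This is a sketch only: it takes Ns = 773 as an assumption, so that the four counts sum to the 2000 haplotypes carried by 1000 individuals, and the function name `ld_measures` is illustrative.

```python
def ld_measures(n_AB, n_Ab, n_aB, n_ab):
    """D, D', r^2 and the chi-square statistic N * r^2 from haplotype counts."""
    n = n_AB + n_Ab + n_aB + n_ab
    p_A = (n_AB + n_Ab) / n
    p_B = (n_AB + n_aB) / n
    d = n_AB / n - p_A * p_B                  # observed minus expected AB frequency
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(d) / d_max
    r2 = d * d / (p_A * (1 - p_A) * p_B * (1 - p_B))
    return d, d_prime, r2, n * r2


# Haplotype counts MS, Ms, NS, Ns (Ns = 773 is the assumption stated above)
d, d_prime, r2, chi2 = ld_measures(474, 611, 142, 773)
print(round(d, 3), round(d_prime, 2), round(r2, 3), round(chi2, 1))
```

The branch for negative D is not needed in the exercise but keeps the function usable when the alleles are not relabelled to make D positive.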
CAUSES OF LD
LD is ‘created by linkage’
If r is the recombination rate between two genes, then we can show that LD at generation t is given by
$D_t = (1 - r)^t D_0$
If r is small (genes very close on the chromosome) the decay is very slow and LD can persist for hundreds of generations
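
To get a feeling for how slow the decay is, the formula can be evaluated for a few recombination rates. A minimal sketch, assuming the geometric decay D_t = (1-r)^t D_0; the function names are illustrative.

```python
from math import log


def ld_after(d0, r, t):
    """LD remaining after t generations with recombination rate r."""
    return (1 - r) ** t * d0


def generations_to_halve(r):
    """Number of generations needed for D to drop to half its initial value."""
    return log(0.5) / log(1 - r)


for r in (0.5, 0.1, 0.01, 0.001):
    print(r, round(ld_after(0.25, r, 10), 4), round(generations_to_halve(r), 1))
```

For unlinked genes (r = 0.5) LD is halved every generation, whereas for r = 0.001 it takes several hundred generations.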
RECOMBINATION AND LD
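[Figure: gametes produced by a double heterozygote: non-recombinant gametes with frequency (1-r)/2 each, recombinant gametes with frequency r/2 each]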
DECAY OF LD OVER GENERATIONS
ADMIXTURE OF POPULATIONS
LD can be created by the merging of populations having different gametic frequencies
Consider two populations and two genes in linkage equilibrium in both, where alleles A and B have frequencies 0.05 in the first population and 0.95 in the second population
A new population is formed by an equal mixture of the two populations; show that LD is high in that population (D=0.2 and D’=0.81)
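
This exercise can be checked numerically. The sketch below assumes linkage equilibrium inside each source population and a mixing proportion `m` for the first population; the function name `admixture_ld` is illustrative.

```python
def admixture_ld(p_A1, p_B1, p_A2, p_B2, m=0.5):
    """D and D' in a mixture of two populations that are each in linkage equilibrium."""
    # Within each population the AB haplotype frequency is the product pA * pB.
    hap_AB = m * p_A1 * p_B1 + (1 - m) * p_A2 * p_B2
    p_A = m * p_A1 + (1 - m) * p_A2
    p_B = m * p_B1 + (1 - m) * p_B2
    d = hap_AB - p_A * p_B
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    return d, abs(d) / d_max


d_mix, d_prime_mix = admixture_ld(0.05, 0.05, 0.95, 0.95)
print(round(d_mix, 4), round(d_prime_mix, 2))
```

No LD exists in either source population; it appears only because the mixture pools haplotypes drawn from different allele frequencies.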

ADMIXTURE
NATURAL (DARWINIAN) SELECTION
Individuals differ in their ability to survive and reproduce, owing in part to their genotype
The selective advantage/disadvantage is measured by fitness
Selection results in a change of allele frequencies over generations and in deviation from HWE
EFFECT OF SELECTION
RANDOM GENETIC DRIFT
In each generation there is an element of chance in the drawing of gametes that will unite to form the next generation
This chance can result in a random change in allele frequency and may ultimately lead to the fixation or elimination of some alleles
SIMPLY SAYING
MATHEMATICAL MODELS OF DRIFT


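Wright-Fisher model (1930): the probability of obtaining k copies of an allele that had frequency p in the previous generation is:
$\Pr(k) = \binom{2N}{k} p^k (1-p)^{2N-k}$
where N is the number of diploid individuals
The expected time before a neutral allele becomes fixed through genetic drift is given by:
$\bar{T}_{fixed} = -4 N_e (1-p) \ln(1-p) / p$
where $N_e$ is the effective population size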
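
Drift under this model is easy to simulate. The following is a sketch, assuming the standard Wright-Fisher sampling scheme in which the 2N alleles of each generation are drawn independently from the previous generation's allele frequency; the function names and the `seed` parameter are illustrative.

```python
import random
from math import comb


def transition_probability(k, n_individuals, p):
    """Binomial probability of k copies among 2N alleles, given frequency p."""
    n_alleles = 2 * n_individuals
    return comb(n_alleles, k) * p**k * (1 - p) ** (n_alleles - k)


def wright_fisher(p0, n_individuals, generations, seed=0):
    """Simulate one trajectory of allele frequency under pure drift."""
    rng = random.Random(seed)       # seeded so that a run can be repeated
    n_alleles = 2 * n_individuals
    p = p0
    trajectory = [p]
    for _ in range(generations):
        # Draw each of the 2N alleles of the next generation at random.
        k = sum(rng.random() < p for _ in range(n_alleles))
        p = k / n_alleles
        trajectory.append(p)
    return trajectory


trajectory = wright_fisher(0.5, n_individuals=50, generations=200, seed=1)
print(trajectory[:5], trajectory[-1])
```

Running the simulation with several seeds and population sizes shows that small populations reach fixation or loss much faster than large ones.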
POPULATION BOTTLENECK
FOUNDER EFFECT
POPULATION SUBSTRUCTURE
Occurs when a population is organized in several subpopulations having different genetic compositions (allele frequencies)
Substructure generally results in a reduction of heterozygote frequency relative to that expected with random mating (Wahlund principle)
Several measures are available to assess population substructure: F-statistics
F-STATISTICS

Defined by Wright (1921)
(1-FIT)=(1-FIS)(1-FST)
ANOTHER FORMULATION

The most useful statistic to test substructure is FST, an index that measures the level of genetic divergence among subpopulations
FST = (HT - HS)/HT
HS: average heterozygosity among individuals within subpopulations
HT: average heterozygosity among individuals within the total population


According to variance of allele frequencies
HOW TO USE IT?
FST=1 means total divergence by fixation of alternative alleles in subpopulations
FST<0.05: little differentiation
0.05<FST<0.15: moderate
0.15<FST<0.25: high
FST>0.25: very high
Test chi-square with 1 df: X² = (k-1) N FST
Examples:
between Europeans and sub-Saharan Africans: 0.15
Japanese-african: 0.19 europeans: 0.11
EXAMPLE

Two populations where the allele frequency is 0.5 and 0.3
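
This example can be worked through with FST = (HT - HS)/HT. The sketch assumes a gene with two alleles, subpopulations of equal size, and HWE within each subpopulation, so that heterozygosity is 2p(1-p); the function name `fst_two_alleles` is illustrative.

```python
def fst_two_alleles(freqs):
    """FST for one biallelic gene from the allele frequency in each subpopulation."""
    n_pops = len(freqs)
    # HS: average expected heterozygosity within subpopulations.
    h_s = sum(2 * p * (1 - p) for p in freqs) / n_pops
    # HT: expected heterozygosity in the pooled population.
    p_bar = sum(freqs) / n_pops
    h_t = 2 * p_bar * (1 - p_bar)
    return (h_t - h_s) / h_t


fst = fst_two_alleles([0.5, 0.3])
print(round(fst, 3))
```

Here HS = 0.46 and HT = 0.48, so FST is about 0.04, which falls in the "little differentiation" range.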
ADMIXTURE
Genetic admixture occurs when individuals from
two or more previously separated populations
begin interbreeding.
 Admixture results in the introduction of new
genetic lineages into a population.
 Most human populations are a product of
mixture of genetically distinct groups that
intermixed within the last 4,000 years.

ADMIXTURE DETECTION
By testing HWE
 Standard statistical methods applied to data on
genotype, alleles/haplotype frequencies:

Principal component Analysis (PCA),
 Clustering: K-means, hierarchical,..


Advanced methods:
Maximum likelihood (psmix R package)
 Bayesian methods
 Wavelet analysis (adwave R package)


STRUCTURE
PRINCIPAL COMPONENT ANALYSIS
CLUSTERING
STRUCTURE
Software for inferring the presence of distinct populations, assigning individuals to populations, studying hybrid zones, identifying migrants and admixed individuals, and estimating population allele frequencies in situations where many individuals are migrants or admixed.
http://pritchardlab.stanford.edu/structure.html
ADMIXTURE
https://www.genetics.ucla.edu/software/admixture/
R PACKAGES

Genetics: Classes and methods for handling genetic
data. Includes classes to represent genotypes and
haplotypes at single markers up to multiple markers on
multiple chromosomes. Functions include allele frequencies,
flagging homo/heterozygotes, flagging carriers of certain
alleles, estimating and testing for Hardy-Weinberg
disequilibrium, estimating and testing for linkage
disequilibrium, ...

Adegenet: Classes and functions for genetic data
analysis within the multivariate framework

Hierfstat: estimation of hierarchical F-statistics from
haploid or diploid genetic data with any numbers of levels
in the hierarchy, following the algorithm Functions are also
given to test via randomisation the significance of each F
and variance components
RECOMMENDED READINGS