SoV

advertisement
CSE280a: Algorithmic topics in
bioinformatics
Vineet Bafna
CSE280
Vineet Bafna
The scope/syllabus
•
We will cover topics from the following areas:
–
–
–
–
•
Population genetics
Computational Mass Spectrometry
Biological networks (emphasis on comparative analysis)
ncRNA
The focus will be on the use of algorithms for
analyzing data in these areas
–
Some background in algorithms (mathematical maturity) is
helpful.
–
Relevant biology will be discussed on a ‘need to know’ basis
CSE280
Vineet Bafna
Logisitics
•
•
•
•
This class has little required homework.
All reading is optional, but recommended.
1 Final Exam (20%), and 1 research project (70%)
Students help edit class notes in latex (10%)
–
–
•
Project Goal:
–
•
•
At most one topic per student
Groups of <= 2 per topic
Address research problems with minimum preparation
Lectures will be given by instructors and by students.
Most communication is electronic.
–
CSE280
Check http://www.cse.ucsd.edu/classes/wi08/cse280a
Vineet Bafna
From an individual to a population
•
•
Individual genomes vary by about 1 in 1000bp.
These small variations account for significant
phenotype differences.
–
–
•
•
•
Disease susceptibility.
Response to drugs
How can we understand genetic variation in a
population, and its consequences?
It took a long time (10-15 yrs) to produce the draft
sequence of the human genome.
Soon (within 10-15 years), entire populations can
have their DNA sequenced. Why do we care?
CSE280
Vineet Bafna
Population Genetics
•
•
•
•
Individuals in a species
(population) are phenotypically
different.
Often these differences are
inherited (genetic).
Understanding the genetic
basis of these differences is a
key challenge of biology!
The analysis of these
differences involves many
interesting algorithmic
questions.
We will use these questions to
illustrate algorithmic principles,
and use algorithms to interpret
genetic data.
CSE280
Vineet Bafna
Population genetics
•
We are all similar, yet we are different. How substantial
are the differences?
–
–
–
–
–
–
What are the sources of variation?
As mutations arise, they are either neutral and subject to
evolutionary drift, or they are (dis-)advantageous and under
selective pressure. Can we tell?
If you had DNA from many sub-populations, Asian, European,
African, can you separate them?
How can we detect recombination?
Why are some people more likely to get a disease then others?
How is disease gene mapping done?
Phasing of chromosomes
CSE280
Vineet Bafna
Computational mass spectrometry
•
•
Mass Spectrometry is a key technology for
measuring active proteins, their interactions,
and their post-translational modifications
Computation plays a key role in interpreting
mass spectrometry data.
CSE280
Vineet Bafna
Back to the genome
•
•
•
Recall that the genome contains protein-coding genes (1% of
the genome)
Another 5% might encode RNA, and regulatory sites
How do we find regions of functional interest?
CSE280
Vineet Bafna
Population genetics basics
CSE280
Vineet Bafna
Scope of genetics lectures
•
•
Basic terminology
Key principles
–
–
–
–
–
–
–
–
–
CSE280
Sources of variation
HW equilibrium
Linkage
Coalescent theory
Recombination/Ancestral Recombination Graph
Haplotypes/Haplotype phasing
Population sub-structure
Structural polymorphisms
Medical genetics basis: Association mapping/pedigree analysis
Vineet Bafna
Terminology: allele
•
Allele: A specific variant at a location
– The notion of alleles predates the concept of
gene, and DNA.
– Initially, alleles referred to variants that described a
measurable trait (round/wrinkled seed)
– Now, an allele might be a nucleotide on a
chromosome, with no measurable phenotype.
– As we discuss source of variation, we will have
different kinds of alleles.
CSE280
Vineet Bafna
Terminology
•
Locus: The location of the allele
– A nucleotide position.
– A genetic marker
– A gene
– A chromosomal segment
CSE280
Vineet Bafna
Terminology
•
•
•
•
Genotype: genetic makeup of (part of) an individual
Phenotype: A measurable trait in an organism, often
the consequence of a genetic variation
Humans are diploid, they have 2 copies of each
chromosome.
– They may have heterozygosity/homozygosity at a
location
– Other organisms (plants) have higher forms of
ploidy.
– Additionally, some sites might have 2 allelic forms,
or even many allelic forms.
Haplotype: genetic makeup of (part of) a single
chromosome
CSE280
Vineet Bafna
What causes variation in a population?
•
•
•
•
Mutations (may lead to SNPs)
Recombinations
Other crossover events (gene conversion)
Structural Polymorphisms
CSE280
Vineet Bafna
Single Nucleotide Polymorphisms
•
•
•
Small mutations that are
sustained in a population
are called SNPs
SNPs are the most common
source of variation studied
The data is a matrix (rows
are individuals, columns are
loci). Only the variant
positions are kept.
CSE280
Vineet Bafna
A->G
Single Nucleotide Polymorphisms
Infinite Sites Assumption:
Each site mutates at most
once
CSE280
Vineet Bafna
00000101011
10001101001
01000101010
01000000011
00011110000
00101100110
Short Tandem Repeats
GCTAGATCATCATCATCATTGCTAG
GCTAGATCATCATCATTGCTAGTTA
GCTAGATCATCATCATCATCATTGC
GCTAGATCATCATCATTGCTAGTTA
GCTAGATCATCATCATTGCTAGTTA
GCTAGATCATCATCATCATCATTGC
CSE280
Vineet Bafna
4
3
5
3
3
5
STR can be used as a DNA fingerprint
•
•
•
Consider a collection of
regions with variable length
repeats.
Variable length repeats will
lead to variable length DNA
Vector of lengths is a fingerprint
4
3
5
3
3
5
2
3
1
2
1
3
loci
CSE280
Vineet Bafna
Structural polymorphisms
•
•
•
Large scale structural changes
(deletions/insertions/inversions) may occur in a
population.
Copy Number variation
Certain diseases (cancers) are marked by an
abundance of these events
CSE280
Vineet Bafna
Personalized genome sequencing
•
•
•
•
•
•
•
•
These variants (of which 1,288,319 were novel)
included
3,213,401 single nucleotide polymorphisms (SNPs),
53,823 block substitutions (2–206 bp),
292,102 heterozygous insertion/deletion events
(indels)(1–571 bp),
559,473 homozygous indels (1–82,711 bp),
90 inversions, as well as numerous segmental
duplications and copy number variation regions.
Non-SNP DNA variation accounts for 22% of all
events identified in the donor, however they involve
74% of all variant bases. This suggests an
important role for non-SNP genetic alterations in
defining the diploid genome structure.
Moreover, 44% of genes were heterozygous for one
or more variants.
CSE280
Vineet Bafna
PLoS Biology, 2007
Recombination
00000000
11111111
00011111
•
CSE280
Vineet Bafna
Not all DNA
recombines!
Human DNA
•
•
•
Not all DNA recombines.
mtDNA is inherited from the
mother, and
y-chromosome from the father
CSE280
Vineet Bafna
http://upload.wikimedia.org/wikipedia/commons/b/b2/Karyotype.png
Gene Conversion
•
Gene Conversion
versus single
crossover
–
CSE280
Hard to distinguish in a
population
Vineet Bafna
Topic 1: Basic Principles
•
In a ‘stable’ population, the distribution of
alleles obeys certain laws
–
•
HW Equilibrium
–
•
Not really, and the deviations are interesting
(due to mixing in a population)
Linkage (dis)-equilibrium
–
CSE280
Due to recombination
Vineet Bafna
Hardy Weinberg equilibrium
•
•
•
•
Consider a locus with 2 alleles, A, a
p (respectively, q) is the frequency of A (resp.
a) in the population
3 Genotypes: AA, Aa, aa
Q: What is the frequency of each genotype
If various assumptions are satisfied, (such as random
mating, no natural selection), Then
• PAA=p2
• PAa=2pq
• Paa=q2
CSE280
Vineet Bafna
Hardy Weinberg: why?
•
Assumptions:
–
–
–
–
–
•
Diploid
Sexual reproduction
Random mating
Bi-allelic sites
Large population size, …
Why? Each individual randomly picks his two
chromosomes. Therefore, Prob. (Aa) = pq+qp
= 2pq, and so on.
CSE280
Vineet Bafna
Hardy Weinberg: Generalizations
•
Multiple alleles with frequencies
–
By HW,
1,2, , H
Pr[homozygous genotype i] =  i2
 Pr[heterozygous genotype i, j] = 2 
i j
•
Multiple loci?

CSE280
Vineet Bafna
Hardy Weinberg: Implications
•
•
•
•
The allele frequency does not change from generation to
generation. Why?
It is observed that 1 in 10,000 caucasians have the disease
phenylketonuria. The disease mutation(s) are all recessive.
What fraction of the population carries the mutation?
Males are 100 times more likely to have the “red’ type of color
blindness than females. Why?
Conclusion: While the HW assumptions are rarely satisfied, the
principle is still important as a baseline assumption, and
significant deviations are interesting.
CSE280
Vineet Bafna
Recombination
00000000
11111111
00011111
CSE280
Vineet Bafna
What if there were no recombinations?
•
•
•
Life would be simpler
Each individual sequence would have a
single parent (even for higher ploidy)
The genealogical relationship is expressed as
a tree. This principle is used to track ancestry
of an individual
CSE280
Vineet Bafna
The Infinite Sites Assumption
00000000
3
00100000
5
8
00100001
•
•
•
CSE280
00101000
The different sites are linked. A 1 in position 8 implies 0 in
position 5, and vice versa.
Some phenotypes could be linked to the polymorphisms
Some of the linkage is Vineet
“destroyed”
by recombination
Bafna
Infinite sites assumption and Perfect Phylogeny
•
•
Each site is mutated at
most once in the history.
All descendants must carry
the mutated value, and all
others must carry the
ancestral value
i
1 in position i
0 in position i
CSE280
Vineet Bafna
Perfect Phylogeny
•
•
Assume an evolutionary model in which no
recombination takes place, only mutation.
The evolutionary history is explained by a
tree in which every mutation is on an edge of
the tree. All the species in one sub-tree
contain a 0, and all species in the other
contain a 1. Such a tree is called a perfect
phylogeny.
CSE280
Vineet Bafna
Handling recombination
•
•
A tree is not sufficient as a sequence may
have 2 parents
Recombination leads to loss of correlation
between columns
CSE280
Vineet Bafna
Quiz 1
•
•
•
Allele, locus, genotype, haplotype
Hardy Weinberg equilibrium?
Today: Linkage (dis)-equilibrium
CSE280
Vineet Bafna
Quiz 2
•
Recall that a SNP data-set is a ‘binary’ matrix.
–
–
•
Rows are individual (chromosomes)
Columns are alleles at a specific locus
Suppose you have 2 SNP datasets of a
contiguous genomic region.
–
–
–
CSE280
One from an African population, and one from a
European Population.
Can you tell which is which?
How long does the genomic region have to be?
Vineet Bafna
Recombination, and populations
•
•
•
•
•
Think of a population of N individual chromosomes.
The population remains stable from generation to
generation.
Without recombination, each individual has exactly
one parent chromosome from the previous
generation.
With recombinations, each individual is derived from
one or two parents.
We will formalize this notion later in the context of
coalescent theory.
CSE280
Vineet Bafna
Linkage (Dis)-equilibrium (LD)
•
•
•
Consider sites A &B
Case 1: No recombination
Each new individual
chromosome chooses a
parent from the existing
‘haplotype’
CSE280
Vineet Bafna
A
0
0
0
0
1
1
1
1
B
1
1
0
0
0
0
0
0
1
0
Linkage (Dis)-equilibrium (LD)
•
•
•
Consider sites A &B
Case 2: diploidy and
recombination
Each new individual
chooses a parent from the
existing alleles
CSE280
Vineet Bafna
A
0
0
0
0
1
1
1
1
B
1
1
0
0
0
0
0
0
1
1
Linkage (Dis)-equilibrium (LD)
•
Consider sites A &B
•
Case 1: No recombination
Each new individual chooses a parent
from the existing ‘haplotype’
– Pr[A,B=0,1] = 0.25
• Linkage disequilibrium
Case 2: Extensive recombination
Each new individual simply chooses
and allele from either site
– Pr[A,B=(0,1)]=0.125
• Linkage equilibrium
•
•
•
CSE280
Vineet Bafna
A
0
0
0
0
1
1
1
1
B
1
1
0
0
0
0
0
0
LD
•
In the absence of recombination,
–
–
•
Correlation between columns
The joint probability Pr[A=a,B=b] is different from
P(a)P(b)
With extensive recombination
–
CSE280
Pr(a,b)=P(a)P(b)
Vineet Bafna
Measures of LD
•
•
Consider two bi-allelic sites with alleles
marked with 0 and 1
Define
–
–
•
•
P00 = Pr[Allele 0 in locus 1, and 0 in locus 2]
P0* = Pr[Allele 0 in locus 1]
Linkage equilibrium if P00 = P0* P*0
D = abs(P00 - P0* P*0) = abs(P01 - P0* P*1) = …
CSE280
Vineet Bafna
LD over time
•
With random mating, and fixed recombination
rate r between the sites, Linkage
Disequilibrium will disappear
–
–
–
–
CSE280
Let D(t) = LD at time t
P(t)00 = (1-r) P(t-1)00 + r P(t-1)0* P(t-1)*0
D(t) = P(t)00 - P(t)0* P(t)*0 = P(t)00 - P(t-1)0* P(t-1)*0 (Why?)
D(t) =(1-r) D(t-1) =(1-r)t D(0)
Vineet Bafna
Other measures of LD
•
D’ is obtained by dividing D by the largest possible value
–
–
•
•
•
Dmax = max {P1*P*1, P0*P*1, P1*P*0, P0*P*0 }
Ex: D’ = abs(P11- P1* P*1)/ Dmax
 = D/(P1* P0* P*1 P*0)1/2
Let N be the number of individuals
Show that 2N is the 2 statistic between the two sites
0
Site 1
0
1
P00N
P0*N
1
CSE280
Site 2
Vineet Bafna
LD over distance
•
Assumption
–
–
•
•
Recombination rate increases linearly with
distance
LD decays exponentially with distance.
The assumption is reasonable, but
recombination rates vary from region to
region, adding to complexity
This simple fact is the basis of disease
association mapping.
CSE280
Vineet Bafna
LD and disease mapping
•
•
•
Consider a mutation that is causal for a disease.
The goal of disease gene mapping is to discover
which gene (locus) carries the mutation.
Consider every polymorphism, and check:
–
–
•
There might be too many polymorphisms
Multiple mutations (even at a single locus) that lead to the
same disease
Instead, consider a dense sample of polymorphisms
that span the genome
CSE280
Vineet Bafna
LD can be used to map disease genes
LD
0
1
1
0
0
1
D
N
N
D
D
N
•
•
LD decays with distance from the disease allele.
By plotting LD, one can short list the region
containing the disease gene.
CSE280
Vineet Bafna
CSE280
Vineet Bafna
QuickTime™ and a
TIFF (LZW) decompressor
QuickTime™
are needed to see
this picture.and a
TIFF (LZW) decompressor
are needed to see this picture.
•
269 individuals
–
–
–
–
•
90 Yorubans
90 Europeans (CEPH)
44 Japanese
45 Chinese
~1M SNPs
CSE280
Vineet Bafna
LD and disease gene mapping problems
•
•
•
Marker density?
Complex diseases
Population sub-structure
CSE280
Vineet Bafna
Topic 2: Simulating population data
•
•
We described various population genetic concepts
(HW, LD), and their applicability
The values of these parameters depend critically upon
the population assumptions.
–
–
–
–
–
•
What if we do not have infinite populations
No random mating (Ex: geographic isolation)
Sudden growth
Bottlenecks
Ad-mixture
It would be nice to have a simulation of such a
population to test various ideas. How would you do
this simulation?
CSE280
Vineet Bafna
Download