M. Dlakic lecture 11/14/08

What is Comparative Genomics?
Insights gained through comparison of
genomes from different species
How did it all start?
• We needed some genomes to start comparing
• Many Bacteria sequenced first
• Model organisms
• Yeast
• Worm
• Fruit fly
• Thale cress
• Finally, Human
• Comparative genomics did not just happen
• Enough data had to be accumulated
• Development of new computational methods to meet the challenges of
processing large amounts of data
• “Informatics” techniques from applied math, computer science and
statistics were adapted for biological sequences
Comparing sequenced genomes
• Comparison of genomic sequences from
different species can help identify the
following:
• Gene structure
• Gene function
• Interaction between gene products
• Non-coding RNAs
• Regulatory sequences
Evolution and sequence conservation
• Genome comparisons are based on a simple premise:
conservation = functional importance
• If there are no constraints on DNA sequence, random
mutations will occur
• Over long evolutionary times (millions of years), these
random mutations make two related sequences diverge
• Sequences from different genomes will be conserved if:
• They code for proteins
• They are important for regulation (protein binding)
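The premise above can be illustrated with a minimal sketch that scores each column of an alignment by how often the most common base occurs, then reports runs of fully conserved columns. The function names, the threshold, and the toy sequences are assumptions made for illustration; they are not from the lecture, and real analyses use proper alignment and scoring tools.

```python
from collections import Counter


def column_conservation(aligned_seqs):
    """Per-column fraction of sequences that share the most common character.

    Assumes the sequences are already aligned (equal length). Gap characters
    are counted like any other character, which is a simplification.
    """
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")
    scores = []
    for column in zip(*aligned_seqs):
        _, count = Counter(column).most_common(1)[0]
        scores.append(count / len(aligned_seqs))
    return scores


def conserved_blocks(scores, threshold=1.0, min_len=4):
    """Half-open (start, end) intervals of consecutive columns scoring >= threshold."""
    blocks, start = [], None
    # A trailing sentinel below any threshold closes a block that reaches the end.
    for i, score in enumerate(list(scores) + [float("-inf")]):
        if score >= threshold and start is None:
            start = i
        elif score < threshold and start is not None:
            if i - start >= min_len:
                blocks.append((start, i))
            start = None
    return blocks


# Hypothetical toy alignment: columns 1-7 are identical in all three sequences.
toy_alignment = ["ACGTTGACA", "ACGTTGACT", "TCGTTGACG"]
print(conserved_blocks(column_conservation(toy_alignment)))  # [(1, 8)]
```

A conserved block found this way is only a candidate for functional importance; closely related species share sequence simply because too little time has passed for mutations to accumulate.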
No-hypothesis-driven approach
• Hypothesis-driven approaches
• Develop goals based on available hypothesis
• Design initial experiments (and backups if those fail)
• When they yield results, go to NIH, NSF, DOE or ONR for funding
• No-hypothesis-driven approaches
• Start with a general knowledge of the biological system
• Collect large amount of data (usually high-throughput methods) and try
extracting and/or amplifying signal from noisy data
• Sometimes it works for reasons that are obvious
• Sometimes it works for reasons that are NOT obvious
• Sometimes it doesn’t work because the data is too noisy
• Funding agencies are not likely to fund this kind of research
Finding DNA regulatory motifs (protein binding sites)
• Experimental approaches
• Promoter Trapping
• DNA Footprinting
• In-vitro binding site selection (SELEX)
• Computational approaches
• Searching databases of known sites
• Finding over-represented motifs in a group of sequences
(Gibbs sampling, Expectation Maximization)
• In promoters of homologous genes
• In promoters of functionally linked genes
• In promoters of interacting proteins
• Ab initio methods
• Positional conservation of (pseudo)palindromic DNA motifs
Finding motifs in promoters of homologous genes
• Perform an all-versus-all BLAST search of the proteomes
• Pool together promoters of related genes
• Find conserved motifs (Gibbs sampling, Expectation
Maximization)
• Only DNA motifs in related genes can be identified
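As a rough illustration of the "pool promoters, then look for what they share" step, the sketch below lists exact k-mers present in every pooled promoter. This is a much simpler stand-in for Gibbs sampling or Expectation Maximization, which tolerate mismatches and model background composition. The function name, parameters, and toy promoter sequences are hypothetical.

```python
from collections import Counter


def shared_kmers(promoters, k=6, min_fraction=1.0):
    """Return k-mers found in at least min_fraction of the promoters.

    Only exact matches on the given strand are counted. Each promoter
    contributes a k-mer at most once, however many times it occurs.
    """
    counts = Counter()
    for seq in promoters:
        seq = seq.upper()
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(kmers)
    needed = min_fraction * len(promoters)
    return sorted(kmer for kmer, n in counts.items() if n >= needed)


# Hypothetical promoters of three homologous genes sharing one 6-mer.
toy_promoters = ["AAACTTGACATTT", "CCGTTGACAGGC", "GGTTGACACCAT"]
print(shared_kmers(toy_promoters, k=6))  # ['TTGACA']
```

Counting each promoter once keeps a single repeat-rich sequence from dominating the result.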
Finding DNA motifs by positional
conservation of palindromes
• The approach targets sites for dimeric proteins and is particularly
suited for helix-turn-helix (HTH) proteins of Bacteria and Archaea
• HTH proteins bind as dimers, usually with variable sequence spacing
• Binding sites are palindromic with a poorly conserved middle
GGATTnnnAATCC GGATTnnAATCC GGATTnnAAGCC
• Starting from a complete set of promoter sequences, we find
imperfect palindromes of variable length
• Remove sequence bias (A/T or G/C content > 80%)
• Search all-versus-all and identify similar motifs
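The first two steps (finding imperfect palindromes with a variable spacer, then removing compositionally biased ones) might look like the sketch below. The arm length, spacer lengths, and mismatch limit are assumed values chosen to match the example sites on the slide; only the 80% composition cutoff comes from the lecture. The all-versus-all comparison of motifs is not shown.

```python
def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string (A, C, G, T)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_palindromes(seq, arm=5, spacers=(2, 3, 4), max_mismatch=1, max_bias=0.8):
    """Find sites of the form arm + spacer + reverse-complement(arm).

    Returns (start, left_arm, spacer_length, right_arm, mismatches) tuples.
    Sites whose arms are more than max_bias A/T or G/C are discarded.
    """
    seq = seq.upper()
    hits = []
    for gap in spacers:
        site_len = 2 * arm + gap
        for start in range(len(seq) - site_len + 1):
            left = seq[start:start + arm]
            right = seq[start + arm + gap:start + site_len]
            # A perfect palindrome has right == reverse_complement(left).
            mismatches = sum(a != b for a, b in zip(reverse_complement(left), right))
            if mismatches > max_mismatch:
                continue
            arms = left + right
            at_fraction = sum(base in "AT" for base in arms) / len(arms)
            if at_fraction > max_bias or (1 - at_fraction) > max_bias:
                continue  # remove sequence bias
            hits.append((start, left, gap, right, mismatches))
    return hits


# The slide's example sites with the "n" positions filled in arbitrarily.
print(find_palindromes("GGATTACGAATCC"))  # [(0, 'GGATT', 3, 'AATCC', 0)]
print(find_palindromes("GGATTACAAGCC"))   # [(0, 'GGATT', 2, 'AAGCC', 1)]
print(find_palindromes("AAAAACGTTTTT"))   # [] (arms are 100% A/T)
```

Only the arms are compared, so the poorly conserved middle of the site does not affect the score.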
Many potential binding sites are found ...
[Figure: motif groups labeled Sulfate metabolism, Transposons, GTP-binding, ATPase, RNA Pol K, Ribosomal proteins, Short hypothetical proteins]
• The role of found motifs is difficult to predict
Finding DNA motifs - the summary
• In promoters of homologous genes
• Easy to perform and interpret results
• Works only for proteins with sequence homology
• In promoters of interacting proteins
• General approach, works even in the absence of sequence homology
• Needs better coverage of interactions; high-throughput studies of
species other than yeast will enable comparative analysis
• Ab initio methods
• General approach, requires no prior knowledge
• Complementary approaches (experimental or computational) are
needed to link the found sites to their DNA-binding proteins
Evolution and sequence conservation
• Comparative genomics is needed to identify conservation
Comparative genomics helps
genome annotations
• In prokaryotes, finding genes is relatively
easy based on open reading frames (ORFs)
• In eukaryotes, we have to look for ORFs,
exons, introns, splice sites, polyA sites
• Bad news: Predicted exons sometimes do not exist
• More bad news: Pseudogenes
• Bad news keeps coming: Alternative splicing
• Good news: In different species, the genes
normally have similar exon-intron structure
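To show why gene finding by ORFs is comparatively easy in prokaryotes, here is a minimal sketch of an ORF scan: look for an ATG followed in the same frame by a stop codon. It scans the forward strand only and ignores alternative start codons, both of which are simplifications; the function name, the minimum-length parameter, and the toy sequence are assumptions for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for forward-strand ORFs.

    Coordinates are 0-based and half-open; the end includes the stop codon.
    min_codons counts codons from ATG up to, but not including, the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


# Hypothetical sequence: ATG AAA CCC GGG TAA begins at position 2.
print(find_orfs("CCATGAAACCCGGGTAACC"))  # [(2, 17, 2)]
```

A scan like this cannot work on eukaryotic genomic DNA as it stands, because introns interrupt the reading frame; that is where the conserved exon-intron structure across species becomes useful.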
Case 1:
Cellular concentration of metabolite is too low to occupy the
riboswitch binding site.
Transcription and intramolecular RNA folding continue.
Translation is initiated.
Typically the new mRNA codes for a
biosynthetic or transport protein that raises
the intracellular level of the metabolite.
Gene regulation (next case) is accomplished by variations in
the interactions of the regions highlighted in orange.
[Figure: RNA polymerase transcribing the leader; RNA segments 1-4, followed by a UUUUU tract and the AUG start codon]
Courtesy of R. Breaker, Yale U.
Case 2:
Cellular concentration of metabolite (X) is high.
RNA polymerase produces the long untranslated leader region.
Intramolecular folding can lead to an alternate conformation.
The alternate riboswitch conformation is stable when
metabolite is bound.
Transcription continues.
Now, RNA folding leads to formation of an intrinsic terminator.
The transcript is never completed and the metabolite
biosynthetic or transport protein is not produced.
[Figure: DNA template, RNA polymerase and nascent RNA with segments 1-4 and a UUUUU tract; metabolite X bound to the riboswitch]
Courtesy of R. Breaker, Yale U.
What does this ncRNA bind?
Can we predict functions without
strict measure of significance
(no sequence or structural similarity)?
This is done by a machine-trained (objective),
jury-like system using inference
Comparative genomics predicts
protein interactions (Rosetta Stone)
• In yeast, topoisomerase II has
two domains that correspond
to gyrases A and B
• Sequence comparisons show
that these two domains are
individual proteins in E. coli
• The implication is that these
two proteins interact, and
that their fusion was favored
during evolution
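The Rosetta Stone logic can be sketched as a search for pairs of separate proteins in one genome whose domains appear together in a single protein of another genome. The input format (protein name mapped to a set of domain family labels) and the labels themselves are hypothetical; in practice the domain assignments would come from sequence comparisons.

```python
from itertools import combinations


def rosetta_stone_pairs(split_genome, fused_genome):
    """Predict interacting pairs in split_genome from fusions in fused_genome.

    Both arguments map protein name -> set of domain family labels.
    Returns sorted (protein_1, protein_2, fused_protein) tuples.
    """
    predictions = set()
    names = sorted(split_genome)
    for fused_name, fused_domains in fused_genome.items():
        if len(fused_domains) < 2:
            continue  # a single-domain protein cannot be a fusion
        for p1, p2 in combinations(names, 2):
            hit1 = split_genome[p1] & fused_domains
            hit2 = split_genome[p2] & fused_domains
            # Each protein must match a different part of the fused protein.
            if hit1 and hit2 and hit1.isdisjoint(hit2):
                predictions.add((p1, p2, fused_name))
    return sorted(predictions)


# Toy version of the slide's example, with made-up domain labels.
e_coli = {"GyrA": {"gyrase_A"}, "GyrB": {"gyrase_B"}, "RecA": {"recA"}}
yeast = {"Top2": {"gyrase_A", "gyrase_B"}}
print(rosetta_stone_pairs(e_coli, yeast))  # [('GyrA', 'GyrB', 'Top2')]
```

Very common, promiscuous domains would generate many spurious pairs in a scheme like this, so predictions are usually filtered before being trusted.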
Predicting protein function
by genome context
What does gene colinearity mean?
[Figure labels: Krr1/Rrp20, Rio1/Rio2, Spo11, Tif11]
Not much, unless supported
by phylogeny and function
The case of Fibrillarin/Nop56 colinearity
Fibrillarin and Nop56 DO interact
Functional clues
for hypothetical
proteins based on
genomic context
analysis
High-throughput approaches
• Had to be developed quickly to
match the speed of genome
sequencing
• As a general rule, most experimental
approaches can be adapted for high-throughput
– Protein interactions (two-hybrid, TAP)
– Protein localization
– Gene regulation (microarrays)
– Structure determination (more recent,
still gaining speed)
What is a high-throughput experiment?
• Usually done at the level of whole
organism (whole genome) under
different conditions
• HT experiments are aided by:
– Equipment miniaturization
– Robotics
– Other automated procedures
• In almost all instances, heavy data
analysis and processing is required
General properties of HT experiments
• Collect large amounts of data under many
different conditions
– Err on the side of collecting too much data,
disk storage is cheap
• Process raw data (computers)
• Analyze data (computers)
• Integrate data from various sources
(computers)
• Identify patterns and cluster the results
based on similarity (computers)
Integrating heterogeneous data
to predict protein interactions
Analysis of different data
types is usually based on
Bayesian inference
Example protein interactions:
● Proteins are more likely to interact if
they are co-expressed
● Proteins are more likely to interact
if they are co-localized in the cell
● Proteins are more likely to interact
if they are co-localized in the genome
● Proteins are more likely to interact
if they are parts of the same
cellular process
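One common way to combine such evidence is a naive Bayes calculation: each data type contributes a likelihood ratio, and the ratios multiply the prior odds of interaction. The sketch below assumes the evidence types are independent, and every number in it is invented for illustration; none comes from the lecture or from real data.

```python
import math


def posterior_probability(prior, likelihood_ratios):
    """Combine a prior probability with independent likelihood ratios.

    posterior odds = prior odds * product of likelihood ratios, where each
    ratio is P(evidence | interacting) / P(evidence | not interacting).
    Sums of logs are used to avoid overflow with many evidence types.
    """
    log_odds = math.log(prior / (1 - prior))
    log_odds += sum(math.log(lr) for lr in likelihood_ratios)
    odds = math.exp(log_odds)
    return odds / (1 + odds)


# Hypothetical likelihood ratios for one protein pair.
evidence = {
    "co-expressed": 5.0,
    "co-localized in the cell": 4.0,
    "neighbors in the genome": 3.0,
    "same cellular process": 6.0,
}
p = posterior_probability(prior=0.01, likelihood_ratios=evidence.values())
print(round(p, 3))  # 0.784
```

No single line of evidence is convincing on its own here (the largest ratio alone raises a 1% prior to under 6%), but together they push the pair well above even odds.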
Predicting large protein
complexes from individual parts
Beware of erroneous annotations