Whole-proteome biophysics, mutations and evolution

Whole-genome biophysics,
mutations, evolution, chromatin, …
Konstantin Zeldovich
x62354, LRB 1004
In the previous lecture:
Protein structures and sequences are largely determined by the physical chemistry
Ab initio paradigm: sequence + physics = structure (+function, hopefully)
Today:
Are physical and chemical constraints discernible at the
whole-proteome / whole-genome level?
-Constraints on amino acid usage in prokaryotes
-Thermostability
-Metabolic cost of protein synthesis
-Mutational robustness of proteins
-Evolution of protein stability
-The genetic code is nonrandom
-Large-scale structure of chromatin, 3C-like methods (J. Dekker lab).
Temperature ranges of modern life
Psychro-, meso-, thermo-, hyperthermophilic bacteria/archaea
-10°C (Antarctic ice, permafrost in Siberia and Canada)
Colwellia spp, Psychrobacter spp
+110°C (deep sea hydrothermal vents, hot springs)
Pyrococcus spp, Methanococcus spp
>250 sequenced genomes
Simplest eukaryotes: up to ~60°C (nematode from hot springs)
Cold-blooded animals:
Notothenia spp. Antarctic fish: -1.8°C habitat,
dies of overheating at +6°C ≈ 43°F
Desert iguana: up to +60°C
Very few complete genomes!
Is habitat temperature reflected in the genomes?
Existing knowledge
• What is presumably related to thermostability?
– G+C in DNA increases with temperature (wrong)
• DNA stabilization by pairing
– Fraction of charges (DEKR) in proteins increases
• Hydrophobic interactions weaken with temperature
– Fraction of polar residues decreases
• ?
Limitations of the previous work:
based on a few (dozen) individual proteins, or
a limited number (~20) of completely sequenced genomes
Here: high-throughput analysis, 204 genomes
Zeldovich, Berezovsky, Shakhnovich, PLOS CB 2007
IVYWREL, or LIVEWYR
86 genomes
$T_{opt} = 937\,F_{IVYWREL} - 335$ (°C), R = 0.93, rmsd of $T_{opt}$ = 8.9°C
Zeldovich, Berezovsky, Shakhnovich, PLOS CB 2007
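The linear fit above can be applied to any proteome once the IVYWREL fraction is known. The sketch below is a minimal illustration, assuming the proteome is given as a list of one-letter amino acid strings; the function names `ivywrel_fraction` and `predicted_topt` are hypothetical, and only the coefficients 937 and 335 come from the slide.

```python
IVYWREL = set("IVYWREL")

def ivywrel_fraction(proteins):
    """Fraction of I, V, Y, W, R, E, L residues pooled over all proteins."""
    total = 0
    hits = 0
    for seq in proteins:
        seq = seq.upper()
        total += len(seq)
        hits += sum(1 for aa in seq if aa in IVYWREL)
    if total == 0:
        raise ValueError("empty proteome")
    return hits / total

def predicted_topt(fraction):
    """Linear fit quoted on the slide: Topt = 937*F - 335, in degrees C."""
    return 937.0 * fraction - 335.0

# Hypothetical toy input; a real proteome has thousands of proteins.
toy_proteome = ["IVYWREL", "GGGGGGG"]
f = ivywrel_fraction(toy_proteome)   # half of the residues are in the set
topt = predicted_topt(0.40)          # about 40 degrees C for F = 0.40
```

Short toy sequences give meaningless temperatures; the fit is only sensible for fractions computed over whole proteomes.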
Genomic DNA: any temperature, any GC content
Base pairing is not the bottleneck of thermal adaptation.
204 genomes
DNA adaptation via codon bias
Fraction of A+G
Autocorr. function of A,G
Fractions of A, G nucleotides are changing with temperature
Thermal adaptation of proteins and DNA are independent processes.
Metabolic cost of protein synthesis
Starting from the same basic precursors,
some amino acids are easy to synthesize,
some are hard, and require more energy
Hypothesis:
Energy (ATP) is the limiting factor in a.a.
synthesis and thus survival.
Thus, highly expressed proteins must be made
of “cheap” amino acids
A.a. cost can be deduced from pathway maps
Protein expression can be either measured, or
inferred from codon usage (codon adaptation
index)
Akashi and Gojobori, PNAS 99:3695 (2002)
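The hypothesis can be checked protein by protein by averaging a per-residue synthesis cost over the sequence. The sketch below assumes a cost table is supplied by the user; the numbers in `toy_cost` are hypothetical placeholders in arbitrary units, not the published values, and `mean_cost_per_residue` is an illustrative name.

```python
def mean_cost_per_residue(sequence, cost):
    """Average synthesis cost per residue.

    `cost` maps one-letter amino acid codes to a cost value.
    A residue missing from `cost` raises KeyError.
    """
    if not sequence:
        raise ValueError("empty sequence")
    return sum(cost[aa] for aa in sequence) / len(sequence)

# Hypothetical costs in arbitrary units (NOT the values from the paper):
# small residues are assumed cheap, aromatic ones expensive.
toy_cost = {"G": 1.0, "A": 1.0, "S": 1.0, "L": 3.0, "F": 5.0, "W": 7.0}

cheap = mean_cost_per_residue("GASGAS", toy_cost)      # 1.0
expensive = mean_cost_per_residue("WFLWFL", toy_cost)  # 5.0
```

With real cost values and expression data, one would correlate this average against expression level across all genes of a genome.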
Highly expressed proteins are “cheaper”
Akashi and Gojobori, PNAS 99:3695 (2002)
MCU (major codon usage) rationale:
Synonymous codons are used with different frequencies (codon bias)
For some reason (translation efficiency?), codon bias is correlated with expression
MCU can be calibrated using a few genes with known expression levels
Kanaya et al, Gene 238:143 (1999)
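Codon bias can be quantified directly from a coding sequence. The sketch below is a simplified illustration, not the calibrated procedure of the cited papers: it tabulates how often each codon is used among its synonymous alternatives, and the fraction of a gene's codons that fall in a user-supplied set of "major" codons. The function names and the choice of major codons are assumptions made for the example.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def codons(cds):
    """Split a coding sequence into codons, dropping an incomplete tail."""
    cds = cds.upper()
    return [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]

def synonymous_fractions(cds):
    """Frequency of each codon relative to all codons for the same amino acid."""
    counts = Counter(codons(cds))
    per_aa = Counter()
    for codon, n in counts.items():
        per_aa[CODON_TABLE[codon]] += n
    return {codon: n / per_aa[CODON_TABLE[codon]] for codon, n in counts.items()}

def major_codon_usage(cds, major_codons):
    """Fraction of codons in the gene that belong to `major_codons`."""
    cs = codons(cds)
    if not cs:
        raise ValueError("sequence shorter than one codon")
    return sum(1 for c in cs if c in major_codons) / len(cs)

# Toy gene with four leucine codons; CTG is treated as the major codon here.
fractions = synonymous_fractions("CTGCTGCTGTTA")
mcu = major_codon_usage("CTGCTGCTGTTA", {"CTG"})
```

Sequences containing characters other than A, C, G, T raise a KeyError in this sketch; real data would need ambiguity codes handled.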
Nowadays, direct measurements of expression are available (PROJECT!)
Possible effects of mutations
DNA
-Exon, nonsynonymous -> see “protein”
-Exon, synonymous -> normally neutral
-Introns, regulatory sequences ->???
-Altered protein expression, localization, alternative splicing, …
-RNA coding regions -> changes in RNA structure/function
-Chromatin structure?
Protein (non-synonymous)
-Change of stability
-Possible misfolding or aggregation (-> neurodegenerative diseases)
-Altered interaction(s) with other protein(s) or small molecule(s)
-Altered function
Change of thermodynamic stability is among the easiest to comprehend.
Mutational robustness of proteins
ProTherm database
http://gibk26.bse.kyutech.ac.jp/jouhou/protherm/protherm.html
~2000 mutations, thermal & chemical unfolding
Average ΔΔG = 1 kcal/mol (destabilizing), variance = 3 (kcal/mol)²
Kumar et al, NAR 2006
Zeldovich et al, PNAS 2007
ΔΔG prediction servers and tools
• FoldX
• PoPMuSiC
• MUPro
• CUPSAT
• Eris
(they are all trained on highly overlapping datasets, including ProTherm)
More servers listed at
http://www.gen2phen.org/wiki/protein-level-predictions-4-stability-changes-prediction
Can we translate this to the organism level?
-Mutations in a protein change its stability ΔG; they occur at cell replication
-Magnitudes of ΔΔG can be measured or modeled
-Proteins must be stable for the function to exist and evolve
-Essential proteins must be stable (ΔG<0) in a viable organism
-Γ essential proteins per genome (~300 in bacteria)
-For simplicity:
-No epistasis, all proteins equally essential
-Locally flat, two-level fitness landscape (life or death)
-Asexual replication
Mutations shuffle stability back and forth
(Protein) evolution is a diffusion process
in the Γ-dimensional space of stabilities of the cell’s Γ essential proteins
Zeldovich, Chen, Shakhnovich PNAS 2007
… back in 1930
R.A. Fisher 1930
Diffusion in the space of “characters”
Single fitness peak at origin
Fitness w=w(r)
n-dimensional hyperspheres of constant fitness
Compensatory mutations and epistasis
Soft selection
Axes poorly quantified!!!
[Figure: 2D example; fitness is high at the origin and low at large distance r]
Hartl, Taubes 1998
Poon, Otto 2000
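“Characters” are protein stabilities, Γ=2
[Figure: plane of stabilities (ΔG1, ΔG2). Viable phenotypes: ΔG1<0 and ΔG2<0. Lethal phenotypes: unstable proteins, ΔG>0, beyond the boundaries ΔG1=0 and ΔG2=0. Impossible genotypes: too stable proteins. Arrows: replication, mutation.]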
Replication of the viable organisms must compensate
for death due to the flux across the ΔG=0 absorbing boundary
(… skipping the math – analytic solution exists – please ask if interested)
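The diffusion picture can also be explored numerically without the analytic solution. The sketch below is a toy simulation under stated assumptions, not the authors' calculation: each organism carries `n_prot` essential proteins, receives one mutation per generation with a Gaussian stability change (mean 1 kcal/mol, variance 3, as quoted for ProTherm), dies if any protein reaches ΔG ≥ 0, and survivors are resampled to keep the population size constant. The lower limit `e_min`, the one-mutation-per-generation rate, and all names are assumptions introduced for illustration.

```python
import random

def simulate(n_orgs=200, n_prot=2, generations=300,
             h=1.0, var=3.0, e_min=-20.0, seed=0):
    """Return the final population: a list of per-organism stability lists."""
    rng = random.Random(seed)
    sd = var ** 0.5
    pop = [[rng.uniform(e_min, 0.0) for _ in range(n_prot)]
           for _ in range(n_orgs)]
    for _ in range(generations):
        survivors = []
        for org in pop:
            k = rng.randrange(n_prot)            # protein hit by the mutation
            new = org[k] + rng.gauss(h, sd)      # stability after mutation
            if new < e_min:
                new = org[k]                     # assumption: too-stable states are unreachable
            if new < 0.0:                        # still folded: organism is viable
                child = org[:]
                child[k] = new
                survivors.append(child)
            # otherwise the organism crosses the absorbing boundary and dies
        if not survivors:
            raise RuntimeError("population went extinct")
        # replication of survivors compensates for the deaths
        pop = [rng.choice(survivors)[:] for _ in range(n_orgs)]
    return pop

population = simulate()
stabilities = [e for org in population for e in org]
```

A histogram of `stabilities` gives the simulated steady-state distribution of ΔG, which can be compared by eye with the database histogram.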
Prediction: universal distribution of ΔG of all proteins
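$$P(E) \propto \exp\left(\frac{hE}{h^2 + D}\right) \sin\left(\pi\,\frac{E - E_{\min}}{E_{\max} - E_{\min}}\right)$$
(E ≡ ΔG, with $E_{\min} \le E \le E_{\max} = 0$)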
Line: theory; histogram: ProTherm database, ~200 proteins
Zeldovich, Chen, Shakhnovich PNAS 2007
Genetic code links genomes and proteomes
Information-theoretical viewpoint: is this 64->20 mapping in any way optimal?
Hypothesis: the genetic code minimizes the effect of DNA mutations on protein structure
1,000,000 realizations of the code (64->20)
mean-square change of a.a. polarity upon point mutation
Freeland & Hurst, J. Mol. Evol. 47:238 (1998)
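The comparison of the real code with random codes can be sketched in a few lines. This is an illustration of the idea rather than a reproduction of the cited study: it uses the Kyte-Doolittle hydropathy scale as a stand-in for the polarity measure, weights all point mutations equally, and builds random codes by permuting the 20 amino acids among the existing synonymous codon blocks while leaving stop codons in place. All function names are hypothetical.

```python
import random
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# Kyte-Doolittle hydropathy, used here as a proxy for polarity (assumption).
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}

def code_cost(code, scale):
    """Mean squared change of `scale` over all single-base changes
    between sense codons (synonymous changes count as zero)."""
    total, n = 0.0, 0
    for codon, aa in code.items():
        if aa == "*":
            continue
        for pos in range(3):
            for b in BASES:
                if b == codon[pos]:
                    continue
                aa2 = code[codon[:pos] + b + codon[pos + 1:]]
                if aa2 == "*":
                    continue
                total += (scale[aa] - scale[aa2]) ** 2
                n += 1
    return total / n

def shuffled_code(code, rng):
    """Permute amino acids among codon blocks; stop codons stay fixed."""
    aas = sorted(set(code.values()) - {"*"})
    perm = aas[:]
    rng.shuffle(perm)
    mapping = dict(zip(aas, perm))
    mapping["*"] = "*"
    return {codon: mapping[aa] for codon, aa in code.items()}

def fraction_better(n_random=1000, seed=0):
    """Fraction of random codes with a lower cost than the standard code."""
    rng = random.Random(seed)
    ref = code_cost(STANDARD, KD)
    better = sum(code_cost(shuffled_code(STANDARD, rng), KD) < ref
                 for _ in range(n_random))
    return better / n_random
```

A small value of `fraction_better()` would indicate that few random codes buffer point mutations better than the standard one under this particular scale.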
Large-scale structure of chromatin
On a small scale, chromatin is tightly packed (nucleosomes, 10- and 30-nm fibers)
Large-scale structure?
Chromosome Conformation Capture (3C, 5C, HiC, …)
Steps: formaldehyde crosslink -> digest -> ligate -> uncrosslink
fragments can be counted by qPCR or deep sequencing
Result: “contact map” of the chromosome: which part is spatially close to which
3D structure can then be inferred, a la NMR structures of proteins (distance constraints)
Dekker et al, Science 2002
Lieberman-Aiden et al, Science 2009
DNA looping and long-range
transcriptional control
Murine beta-globin locus, 130kb
Dekker, TiBS 28:277 (2003)
Sequence determinants of the contacts?? (PROJECT!)
Whole chromosome as a polymer?
Two loci separated by s bp
Probability of contact?
$p \sim s^{-\alpha}$
Theory:
$\alpha = 3/2$, random polymer coil
$\alpha = 1$, compact fractal without knots, "crumpled globule"
Lieberman-Aiden et al, Science 2009
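Given contact probabilities measured at several genomic separations, the scaling exponent can be estimated as the negated slope of a straight-line fit in log-log coordinates. The sketch below assumes a pure power law over the fitted range; `fit_contact_exponent` and the synthetic data are hypothetical, introduced only to show the procedure.

```python
import math

def fit_contact_exponent(separations, probabilities):
    """Least-squares slope of log(p) versus log(s), returned with the sign
    flipped so that p ~ s**(-exponent)."""
    if len(separations) != len(probabilities) or len(separations) < 2:
        raise ValueError("need at least two (s, p) pairs of equal length")
    xs = [math.log(s) for s in separations]
    ys = [math.log(p) for p in probabilities]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return -sxy / sxx

# Synthetic data: separations from 100 kb upward, doubling each step.
s_values = [1e5 * 2 ** k for k in range(8)]
coil = fit_contact_exponent(s_values, [s ** -1.5 for s in s_values])
globule = fit_contact_exponent(s_values, [s ** -1.0 for s in s_values])
```

On real contact maps the fit is only meaningful over the range of separations where the log-log plot is actually linear.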