Whole-genome biophysics, mutations, evolution, chromatin, …
Konstantin Zeldovich x62354, LRB 1004

In the previous lecture:
Protein structures and sequences are largely determined by physical chemistry
Ab initio paradigm: sequence + physics = structure (+ function, hopefully)

Today:
Are physical and chemical constraints discernible at the whole-proteome / whole-genome level?
-Constraints on amino acid usage in prokaryotes
-Thermostability
-Metabolic cost of protein synthesis
-Mutational robustness of proteins
-Evolution of protein stability
-The genetic code is nonrandom
-Large-scale structure of chromatin, 3C-like methods (J. Dekker lab)

Temperature ranges of modern life
Psychro-, meso-, thermo-, hyperthermophilic bacteria/archaea
−10°C (Antarctic ice, permafrost in Siberia and Canada): Colwellia spp, Psychrobacter spp
+110°C (deep sea hydrothermal vents, hot springs): Pyrococcus spp, Methanococcus spp
>250 sequenced genomes
Simplest eukaryotes: up to ~60°C (nematode from hot springs)
Cold-blooded animals:
Notothenia spp. (Antarctic fish): −1.8°C habitat, dies of overheating at +6°C (43°F)
Desert iguana: up to +60°C
Very few complete genomes!
Is habitat temperature reflected in the genomes?

Existing knowledge
• What is presumably related to thermostability?
  – G+C in DNA increases with temperature (wrong)
    • DNA stabilization by pairing
  – Fraction of charged residues (DEKR) in proteins increases
    • Hydrophobic interactions weaken with temperature
  – Fraction of polar residues decreases
    • ?
Limitations of the previous work: based on a few (dozen) individual proteins, or a limited number (~20) of completely sequenced genomes
Here: high-throughput analysis, 204 genomes
Zeldovich, Berezovsky, Shakhnovich, PLOS CB 2007

IVYWREL, or LIVEWYR
86 genomes
$T_{opt} = 937\,F_{IVYWREL} - 335$ ($T_{opt}$ in °C, $F_{IVYWREL}$ = fraction of I, V, Y, W, R, E, L residues in the proteome), R = 0.93, rmsd of $T_{opt}$ = 8.9°C
Zeldovich, Berezovsky, Shakhnovich, PLOS CB 2007

Genomic DNA: any temperature, any GC content
Base pairing is not the bottleneck of thermal adaptation.
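
The linear IVYWREL fit quoted above (Topt = 937 F − 335) can be tried out in a few lines of Python. This is only a minimal sketch, not the authors' pipeline: the function names and the toy sequences are hypothetical, and F is assumed to be the fraction (between 0 and 1) of I, V, Y, W, R, E, L residues pooled over all proteins of a proteome.

```python
# Residues entering the IVYWREL predictor.
IVYWREL = set("IVYWREL")


def ivywrel_fraction(proteins):
    """Fraction of I, V, Y, W, R, E, L residues pooled over all sequences."""
    total = sum(len(p) for p in proteins)
    if total == 0:
        raise ValueError("empty proteome")
    hits = sum(1 for p in proteins for aa in p.upper() if aa in IVYWREL)
    return hits / total


def predicted_topt(proteins):
    """Predicted optimal growth temperature in deg C from the quoted linear fit."""
    return 937.0 * ivywrel_fraction(proteins) - 335.0


# Hypothetical toy "proteome": 8 of 20 residues belong to IVYWREL.
toy_proteome = ["IVYWAAAAAA", "RELIGGGGGG"]
print(ivywrel_fraction(toy_proteome), predicted_topt(toy_proteome))
```

With a real proteome, the input would be the full list of protein sequences of one organism. The quoted rmsd of 8.9°C gives a sense of how far a single prediction can be off.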
204 genomes

DNA adaptation via codon bias
Plots: fraction of A+G; autocorrelation function of A, G
Fractions of A, G nucleotides change with temperature
Thermal adaptation of proteins and DNA are independent processes.

Metabolic cost of protein synthesis
Starting from the same basic precursors, some amino acids are easy to synthesize; others are hard and require more energy
Hypothesis: energy (ATP) is the limiting factor in a.a. synthesis and thus in survival. Hence, highly expressed proteins must be made of “cheap” amino acids
A.a. cost can be deduced from pathway maps
Protein expression can be either measured or inferred from codon usage (codon adaptation index)
Akashi and Gojobori, PNAS 99:3695 (2002)

Highly expressed proteins are “cheaper”
Akashi and Gojobori, PNAS 99:3695 (2002)

MCU rationale:
-Synonymous codons are used with different frequencies (codon bias)
-For some reason (translation efficiency?), codon bias is correlated with expression
-MCU can be calibrated using a few genes with known expression levels
Kanaya et al, Gene 238:143 (1999)
Nowadays, direct measurements of expression are available (PROJECT!)

Possible effects of mutations
DNA
-Exon, nonsynonymous -> see “protein”
-Exon, synonymous -> normally neutral
-Introns, regulatory sequences -> ???
-Altered protein expression, localization, alternative splicing, …
-RNA coding regions -> changes in RNA structure/function
-Chromatin structure?
Protein (non-synonymous)
-Change of stability
-Possible misfolding or aggregation (-> neurodegenerative diseases)
-Altered interaction(s) with other protein(s) or small molecule(s)
-Altered function
Change of thermodynamic stability is among the easiest to comprehend.
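
The synonymous / nonsynonymous distinction for exon mutations listed above can be made concrete with the standard genetic code. The sketch below is illustrative only: the function name and the three-way labelling (including a separate "nonsense" label for new stop codons) are assumptions made here, not something taken from the slides.

```python
from itertools import product

# Standard genetic code in T, C, A, G order (first base varies slowest);
# "*" marks stop codons.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_point_mutation(codon, position, new_base):
    """Label a single-base change in a codon (position is 0, 1 or 2)."""
    codon, new_base = codon.upper(), new_base.upper()
    if codon not in CODON_TABLE or new_base not in BASES or position not in (0, 1, 2):
        raise ValueError("expected a DNA codon, a position 0-2 and a base in TCAG")
    if codon[position] == new_base:
        raise ValueError("new base equals the original base")
    mutant = codon[:position] + new_base + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutant]
    if before == after:
        return "synonymous"       # same amino acid, normally neutral
    if after == "*":
        return "nonsense"         # premature stop codon
    return "nonsynonymous"        # amino acid changes (or a stop codon is lost)


print(classify_point_mutation("GAA", 2, "G"))  # GAA -> GAG, both encode Glu
print(classify_point_mutation("GAG", 1, "T"))  # GAG -> GTG, Glu -> Val
```

Only the nonsynonymous class feeds into the protein-level effects (stability, misfolding, interactions, function) discussed in the rest of the lecture.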
Mutational robustness of proteins
ProTherm database http://gibk26.bse.kyutech.ac.jp/jouhou/protherm/protherm.html
~2000 mutations, thermal & chemical unfolding
Average ΔΔG = 1 kcal/mol (destabilizing), variance = 3 (kcal/mol)²
Kumar et al, NAR 2006
Zeldovich et al, PNAS 2007

ΔΔG prediction servers and tools
• FoldX
• PoPMuSiC
• MUPro
• CUPSAT
• Eris
(they are all trained on highly overlapping datasets, including ProTherm)
More servers listed at http://www.gen2phen.org/wiki/protein-level-predictions-4-stability-changes-prediction

Can we translate this to the organism level?
-Mutations in a protein change its stability ΔG; they occur at cell replication
-Magnitudes of ΔΔG can be measured or modeled
-Proteins must be stable for the function to exist and evolve
-Essential proteins must be stable (ΔG<0) in a viable organism
-Γ essential proteins per genome (~300 in bacteria)
-For simplicity:
  -No epistasis, all proteins equally essential
  -Locally flat, two-level fitness landscape (life or death)
  -Asexual replication

Mutations shuffle stability back and forth
(Protein) evolution is a diffusion process in the Γ-dimensional space of stabilities of the cell’s essential proteins
Zeldovich, Chen, Shakhnovich PNAS 2007

… back in 1930
R.A. Fisher 1930
Diffusion in the space of “characters”
Single fitness peak at origin
Fitness w=w(r)
n-dimensional hyperspheres of constant fitness, radius r (figure: 2D example)
Compensatory mutations and epistasis
Soft selection
Axes poorly quantified!!!
(Figure color scale: fitness, low to high)
Hartl, Taubes 1998
Poon, Otto 2000

“Characters” are protein stabilities, Γ=2
Figure labels: viable phenotypes inside the boundaries ΔG1=0 and ΔG2=0; lethal phenotypes beyond them (unstable proteins, ΔG>0); impossible genotypes (too stable proteins); arrows for replication and mutation
Replication of the viable organisms must compensate for death due to the flux across the ΔG=0 absorbing boundary
(… skipping the math; an analytic solution exists, please ask if interested)

Prediction: universal distribution of ΔG of all proteins
$P(E) \propto \exp\left(\frac{hE}{h^2 + D}\right) \sin\left(\frac{\pi E}{E_{min}}\right)$
where E = ΔG and the sine vanishes at both boundaries, E = 0 and E = $E_{min}$
Line: theory; histogram: ProTherm database, ~200 proteins
Zeldovich, Chen, Shakhnovich PNAS 2007

Genetic code links genomes and proteomes
Information-theoretical viewpoint: is this 64->20 mapping in any way optimal?
Hypothesis: the genetic code minimizes the effect of DNA mutations on protein structure
Test: 1,000,000 random realizations of the code (64->20), compared by the mean-square change of a.a. polarity upon point mutation
Freeland & Hurst, J. Mol. Evol. 47:238 (1998)

Large-scale structure of chromatin
On a small scale, chromatin is tightly packed (nucleosomes, 10- and 30-nm fibers)
Large-scale structure?
Chromosome Conformation Capture (3C, 5C, HiC, …)
Steps: formaldehyde crosslink -> digest -> ligate -> uncrosslink
Ligated fragments can be counted by qPCR or deep sequencing
Result: “contact map” of the chromosome: which part is spatially close to which
3D structure can then be inferred, a la NMR structures of proteins (distance constraints)
Dekker et al, Science 2002
Lieberman-Aiden et al, Science 2009

DNA looping and long-range transcriptional control
Murine beta-globin locus, 130kb
Dekker, TiBS 28:277 (2003)
Sequence determinants of the contacts?? (PROJECT!)

Whole chromosome as a polymer?
Probability of contact p between two loci separated by s bp?
$p \sim s^{\alpha}$
Theory: α = −3/2, random polymer coil; α = −1, compact fractal without knots (“crumpled globule”)
Lieberman-Aiden et al, Science 2009
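
The two polymer models above differ only in the exponent of the power law relating contact probability p to genomic separation s, so a contact map can be compared against them by fitting a straight line in log-log coordinates. The following is a minimal sketch under stated assumptions: the function name is hypothetical, the input data are synthetic, and a plain least-squares fit stands in for whatever binning and normalization a real Hi-C analysis would use.

```python
import math


def fit_contact_exponent(separations, contact_probs):
    """Least-squares slope of log(p) versus log(s), i.e. alpha in p ~ s**alpha."""
    if len(separations) != len(contact_probs) or len(separations) < 2:
        raise ValueError("need at least two (s, p) pairs of equal length")
    xs = [math.log(s) for s in separations]
    ys = [math.log(p) for p in contact_probs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic, noise-free data for the two theoretical cases (arbitrary units).
separations = [10 ** k for k in range(4, 8)]      # hypothetical separations in bp
random_coil = [s ** -1.5 for s in separations]    # random polymer coil
crumpled = [s ** -1.0 for s in separations]       # crumpled (fractal) globule

print(fit_contact_exponent(separations, random_coil))
print(fit_contact_exponent(separations, crumpled))
```

On real data the fitted exponent depends on the range of s chosen for the fit, so the fitting window should be reported along with the slope.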