Codon usage bias
Ref: Chapter 9
Xuhua Xia
xxia@uottawa.ca
http://dambe.bio.uottawa.ca

Objectives
• Understand how codon usage bias affects translation efficiency and gene expression
• Biomedical relevance
  – Protein drugs in the pharmaceutical industry
  – Transgenic experiments in agriculture
• Factors affecting codon usage bias
• Indices measuring codon usage bias
• Develop bioinformatic skills to study genomic codon usage
Slide 2

Codon Usage Bias
• Observation: Strongly biased codon usage in a variety of species, ranging from viruses, mitochondria and plastids to prokaryotes and eukaryotes.
• Hypotheses:
  – Differential mutation hypothesis, e.g., the transcriptional hypothesis of codon usage (Xia 1996 Genetics 144:1309-1320)
  – Differential selection hypothesis, e.g., (Xia 1998 Genetics 149:37-44)
• Predictions:
  – From the mutation hypothesis: concordance between codon usage and mutation pressure
  – From the selection hypothesis:
    • Concordance between differential availability of tRNA and differential codon usage.
    • The concordance is stronger in highly expressed genes than in lowly expressed genes (CAI is positively correlated with gene expression).
[Diagram: RNA polymerase transcribing Gene 1, Gene 2 and Gene 3 into a polycistronic mRNA; a ribosome translating the mRNA into protein; tRNA labels GCC~tRNA~Gly and UCC~tRNA~Gly]
Slide 3

Table 9-2, yeast
AA(1)   Codon(2)   T(3)   w(4)    F(5)
Arg     AGA        11     1       314
Arg     AGG        1      0.091   1
Asn     AAC        10     1       208
Asn     AAU        0      0       11
Asp     GAC        16     1       202
Asp     GAU        0      0       112
Cys     UGC        4      1       3
Cys     UGU        0      0       39
Gln     CAA        9      1       153
Gln     CAG        1      0.111   1
Glu     GAA        14     1       305
Glu     GAG        2      0.143   5
His     CAC        7      1       102
His     CAU        0      0       25
Leu     UUA        7      0.7     42
Leu     UUG        10     1       359
Lys     AAA        7      0.5     65
Lys     AAG        14     1       483
Phe     UUC        10     1       168
Phe     UUU        0      0       19
Ser     AGC        2      1       6
Ser     AGU        0      0       4
Tyr     UAC        8      1       141
Tyr     UAU        0      0       10
Xia 2007. Bioinformatics and the cell.
Slide 4

Conflict: Initiation and Elongation
• Met codon usage from the 12 CDSs: AUA 214, AUG 37
• Possible tRNAs: tRNAMet/CAU, tRNAMet/UAU
• The vertebrate mitochondrial genome has only one tRNAMet. Which one to have?
  – tRNAMet/CAU: Good for initiation, but not efficient for AUA codons, even with the C modified to 5-formylcytidine
  – tRNAMet/UAU: Good for AUA codons, but not good for initiation.
  – Anticodon CAU favoring the AUG codon
• Nature has chosen CAU: All mitochondrial genomes with a single tRNAMet have a CAU anticodon.
• Problem with AUA codons in translation?
Xia et al. 2007. PLoS One
Slide 5

Hypothesis and Predictions
Amino acid:   Met   Leu   Glu   Lys   Gln   Arg   Trp
Codon:        AUA   UUA   GAA   AAA   CAA   AGA   UGA
• AUA (Met): Favoured by mutation, but not by tRNA-mediated selection, because the first (wobble) position of the tRNA anticodon is C.
• UUA, GAA, AAA, CAA, AGA, UGA: Favoured by mutation. Also favoured by tRNA-mediated selection: the first (wobble) position of the tRNA anticodon is U.
Predictions:
1. The proportion of A-ending codons (or RSCU) should be smaller in the Met codon family than in other R-ending codon families: $P_{NNA} = N_{NNA}/N_{NNG}$
2. Availability of tRNAMet/UAU should increase $P_{AUA}$.
Slide 6

Selection against AUA codons
                    Met     Leu     Glu     Lys     Gln     Arg     Trp
Species             AUA     UUA     GAA     AAA     CAA     AGA     UGA
A. gossypii         1.473   1.993   1.826   1.852   1.917   2       2
C. glabrata         1.043   1.995   2.000   1.938   1.889   2       2
K. thermotolerans   0.556   1.973   1.910   1.948   1.945   2       1.967
S. cerevisiae       1.140   1.969   1.800   1.883   1.794   1.947   1.908
S. castelli         1.299   1.994   1.891   1.981   1.969
S. servazzii        1.321   1.931   1.702   1.824   1.841   1.959
Y. lipolytica       1.440   1.968   1.536   1.859   1.963   1.922   1.882
Values for the remaining cells of the S. castelli and S. servazzii rows, in the order listed on the slide: 2, 1.918, 2
Carullo, M. and Xia, X. 2008 J Mol Evol 66:484–493.
Slide 7

[Figure: scatter plots of $P_{AUA}$ (y-axis) against $P_{UUA}$ (x-axis); panel (a) axes from 30 to 80, panel (b) axes from 0.25 to 0.95]
Fig. 5. Relationship between $P_{AUA}$ and $P_{UUA}$, highlighting the observation that $P_{AUA}$ is greater when both a tRNAMet/CAU and a tRNAMet/UAU are present than when only tRNAMet/CAU is present in the mtDNA, for bivalve species (a) and chordate species (b). The filled squares are for mtDNA containing both tRNAMet/CAU and tRNAMet/UAU genes, and the open triangles are for mtDNA without a tRNAMet/UAU gene.
Xia, X. 2012. In: RS Singh et al. Evolution in the fast lane: Rapidly evolving genes and genetic systems. Oxford University Press.

Calculation of RSCU
\[ \mathrm{RSCU}_{ij} = \frac{\mathrm{CodFreq}_{ij}}{\dfrac{1}{\mathrm{NumCodon}_i}\sum_{j=1}^{\mathrm{NumCodon}_i} \mathrm{CodFreq}_{ij}} \]
Example (Ala, GCU): RSCU = 52 / [(52 + 91 + 103 + 2)/4] = 0.84

Codon  AA   N    RSCU    Codon  AA   N    RSCU    Codon  AA   N    RSCU
GCU    Ala  52   0.84    CCU    Pro  42   0.87    UAA    *    8    3.2
GCC    Ala  91   1.47    CCC    Pro  63   1.31    UAG    *    1    0.4
GCA    Ala  103  1.66    CCA    Pro  85   1.76    AGA    *    1    0.4
GCG    Ala  2    0.03    CCG    Pro  3    0.06    AGG    *    0    0
GAA    Glu  78   1.64    CAA    Gln  79   1.82    AAA    Lys  90   1.78
GAG    Glu  17   0.36    CAG    Gln  8    0.18    AAG    Lys  11   0.22
GGU    Gly  29   0.53    CGU    Arg  7    0.44    ACU    Thr  44   0.57
GGC    Gly  62   1.13    CGC    Arg  11   0.7     ACC    Thr  96   1.25
GGA    Gly  97   1.77    CGA    Arg  42   2.67    ACA    Thr  153  1.99
GGG    Gly  31   0.57    CGG    Arg  3    0.19    ACG    Thr  15   0.19
UUA    Leu  110  1.11    AUA    Met  218  1.66    UGA    Trp  92   1.77
UUG    Leu  16   0.16    AUG    Met  44   0.34    UGG    Trp  12   0.23
CUU    Leu  62   0.62    UCU    Ser  51   1.11    GUU    Val  40   0.84
CUC    Leu  95   0.95    UCC    Ser  65   1.42    GUC    Val  48   1.01
CUA    Leu  285  2.86    UCA    Ser  99   2.16    GUA    Val  87   1.83
CUG    Leu  29   0.29    UCG    Ser  5    0.11    GUG    Val  15   0.32

• RSCU is codon-specific
• RSCU and proportion: Different scaling.
Slide 9

Calculation of CAI
\[ w_{ij} = \frac{\mathrm{RefCodFreq}_{ij}}{\mathrm{RefCodFreq}_{i.\max}} \]
\[ \mathrm{CAI} = \exp\left( \frac{\sum_{i=1}^{N_{2,3,4}} \left[ \mathrm{CodFreq}_i \ln(w_i) \right]}{\sum_{i=1}^{N_{2,3,4}} \mathrm{CodFreq}_i} \right) \]
$N_{2,3,4}$: Number of 2-, 3-, 4-fold codon families

Codon  AA  ObsFreq  RefCodFreq  w
UGA    *   0        6           0.375
UAG    *   0        4           0.250
UAA    *   0        16          1.000
GCA    A   1        195         0.606
GCU    A   15       322         1.000
GCG    A   0        81          0.252
GCC    A   8        242         0.752
UGC    C   3        123         1.000
UGU    C   3        112         0.911
GAU    D   9        69          1.000
GAC    D   11       40          0.580
GAG    E   11       289         0.863
GAA    E   14       335         1.000
UUU    F   3        118         0.554
UUC    F   9        213         1.000
…      …   …        …           …

\[ \mathrm{CAI} = \exp\left( \frac{1\cdot\ln(0.606) + 15\cdot\ln(1) + 8\cdot\ln(0.752) + \dots}{1 + 15 + 8 + \dots} \right) \]

• Compound 6- or 8-fold codon families should be broken into two codon families
• CAI is gene-specific.
• 0 ≤ CAI ≤ 1
• CAI values computed with different reference sets are not comparable.
• Problem with computing w as $F_i/F_{i.\max}$: Suppose an amino acid is rarely used in highly expressed genes; then there is little selection on it, and the codon usage might be close to even, with $w_i \approx 1$.
  Now if we have a lowly expressed gene that happens to be made entirely of this amino acid, then the CAI for this lowly expressed gene would be 1, which is misleading. There has been no good alternative. Further research is needed.
Slide 10

Weak mRNA predictive power
[Scatter plot: protein abundance (y-axis) against mRNA abundance (x-axis); regression line y = 5.6507x + 4.1367, R² = 0.1936; labelled points: ENO1, FRS2]
Slide 11

Effect of Codon Usage Bias
[Scatter plot: protein abundance (y-axis) against codon usage bias (x-axis); regression line y = 70.398x - 11.739, R² = 0.5668; labelled points: ENO1, FRS2]
Slide 12

Problems with CAI
• Formulation
  – Reference set
  – w = 0
• Implementation
  – AUG
  – UGG
  – Multiple codon families for one amino acid
• Dependence on AT%
• Solutions (Xia, X. 2007. Evolutionary Bioinformatics)
\[ w_{ij} = \frac{\mathrm{RefCodFreq}_{ij}}{\mathrm{RefCodFreq}_{i.\max}} \]
\[ \mathrm{CAI} = \exp\left( \frac{\sum_{i=1}^{N_{2,3,4}} \left[ \mathrm{CodFreq}_i \ln(w_i) \right]}{\sum_{i=1}^{N_{2,3,4}} \mathrm{CodFreq}_i} \right) = \exp\left( \frac{1\cdot\ln(0.606) + 15\cdot\ln(1) + 8\cdot\ln(0.752) + \dots}{1 + 15 + 8 + \dots} \right) \]
Slide 13

RSCU (HIV-1 vs Human)
[Scatter plot, panel (a): RSCU (HIV-1) (y-axis) against RSCU (Human) (x-axis), both from 0 to 2.5; legend: A-ending, C-ending, G-ending, U-ending; labelled points: V, R, S, A, I, L, E, K, L, G, P, T, R, Q]
Fig. 1. Relative synonymous codon usage (RSCU) of HIV-1 compared to RSCU of highly expressed human genes. Data points for codons ending with A, C, G or U are annotated with different combinations of colors and symbols. A-ending codons exhibit strong discordance in their usage between HIV-1 and human and are annotated with their coded amino acids.
van Weringh et al. 2011. MBE.
Slide 14

Research
• Observation on HIV-1:
  – Strong surplus of A-ending codons
  – High mutation rate
• Hypothesis: Strong A-biased mutation disrupting codon adaptation.
  – Prediction:
    • Strong A-biased mutation (confirmed)
    • If the mutation rate is lower, then there will be better codon adaptation (the related HTLV-1 parasitizes the same cell type as HIV-1 but has a lower mutation rate, so HTLV-1 genes should exhibit better codon adaptation)
Slide 15

RSCU (HTLV-1 vs Human)
[Scatter plot: RSCU (HTLV-1) (y-axis, 0 to 3) against RSCU (Human) (x-axis, 0 to 2.5); legend: A-ending, C-ending, G-ending, U-ending]
Relative synonymous codon usage (RSCU) of HTLV-1 compared to RSCU of highly expressed human genes. Data points for codons ending with A, C, G or U are annotated with different combinations of colors and symbols. A-ending codons exhibit strong discordance in their usage between HIV-1 and human and are annotated with their coded amino acids.
Slide 16

Any problem with the mutation hypothesis?
Table 2. Frequency of A residues, length and codon adaptation index (CAI) for the three HIV-1 early (tat, rev and nef) and five late (gag-pol, vif, vpu, vpr, and env) coding sequences (CDS).
Gene   CDS (bp)   CAI
tat    261        0.66875
rev    351        0.66211
nef    621        0.67523
gag    1503       0.62784
pol    3012       0.58139
vif    579        0.61941
vpr    291        0.64272
vpu    249        0.49068
env    2571       0.61924
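
Worked sketch: RSCU and CAI
The RSCU and CAI calculations described on the earlier slides can be sketched in a few lines of Python. This is only a minimal illustration, not the implementation used in DAMBE or in the cited papers. The function names (rscu, relative_adaptiveness, cai) and the dictionary layout are assumptions made for this example. The codon counts are copied from the partial tables on the RSCU and CAI slides, so the CAI value covers only the five codon families shown there. The Cys-only gene is hypothetical and is included to illustrate the problem with computing w as a ratio to the family maximum. Raising an error when w = 0 is also an assumption, since the slides list w = 0 as an open formulation problem.

```python
import math


def rscu(counts, families):
    """RSCU: a codon's count divided by the mean count in its synonymous family."""
    values = {}
    for codons in families.values():
        total = sum(counts.get(c, 0) for c in codons)
        for c in codons:
            values[c] = counts.get(c, 0) * len(codons) / total if total else 0.0
    return values


def relative_adaptiveness(ref_counts, families):
    """w: a codon's reference count divided by the largest count in its family."""
    w = {}
    for codons in families.values():
        top = max(ref_counts.get(c, 0) for c in codons)
        for c in codons:
            w[c] = ref_counts.get(c, 0) / top if top else 0.0
    return w


def cai(gene_counts, w):
    """CAI: exp of the count-weighted mean of ln(w) over the codons of one gene."""
    log_sum = 0.0
    n_codons = 0
    for codon, n in gene_counts.items():
        if n == 0 or codon not in w:
            continue  # unused codons and codons outside the given families are ignored
        if w[codon] == 0:
            # Assumption for this sketch: refuse to compute rather than patch w = 0.
            raise ValueError(f"w = 0 for codon {codon}; ln(w) is undefined")
        log_sum += n * math.log(w[codon])
        n_codons += n
    if n_codons == 0:
        raise ValueError("no codons with a defined w")
    return math.exp(log_sum / n_codons)


# Codon families limited to the rows visible on the slides.
families = {
    "Ala": ["GCU", "GCC", "GCA", "GCG"],
    "Cys": ["UGC", "UGU"],
    "Asp": ["GAU", "GAC"],
    "Glu": ["GAG", "GAA"],
    "Phe": ["UUU", "UUC"],
}

# Ala counts from the RSCU slide.
ala_counts = {"GCU": 52, "GCC": 91, "GCA": 103, "GCG": 2}

# RefCodFreq and ObsFreq columns from the CAI slide (partial table).
ref_counts = {
    "GCA": 195, "GCU": 322, "GCG": 81, "GCC": 242, "UGC": 123, "UGU": 112,
    "GAU": 69, "GAC": 40, "GAG": 289, "GAA": 335, "UUU": 118, "UUC": 213,
}
obs_counts = {
    "GCA": 1, "GCU": 15, "GCG": 0, "GCC": 8, "UGC": 3, "UGU": 3,
    "GAU": 9, "GAC": 11, "GAG": 11, "GAA": 14, "UUU": 3, "UUC": 9,
}

ala_rscu = rscu(ala_counts, {"Ala": families["Ala"]})
w = relative_adaptiveness(ref_counts, families)
partial_cai = cai(obs_counts, w)

# Hypothetical gene made only of Cys codons, used evenly.
cys_only_cai = cai({"UGC": 10, "UGU": 10}, w)

print({codon: round(value, 2) for codon, value in ala_rscu.items()})
print({codon: round(value, 3) for codon, value in w.items()})
print(round(partial_cai, 3), round(cys_only_cai, 3))
```

Because the two Cys codons are used almost evenly in the reference set, both have w close to 1, so the hypothetical Cys-only gene scores higher than the gene on the slide even though nothing is known about its expression. That is the misleading behaviour described on the CAI slide.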