Sequence analysis of nucleic acids and proteins: part 2 Prediction of structure and function Based on Chapter 3 of Post-genome bioinformatics by Minoru Kanehisa Oxford University Press, 2000 Search and learning problems in sequence analysis Sim ila rit y search Structure/fun cti on prediction Proble ms in Biological Science Pair wise sequence align ment Database search for simil ar sequence s Multi ple sequence align me nt Phylogene tic tree recons truction Protein 3D structure ali gnment ab initio prediction Knowledge based prediction Molecular classification RNA secondary structure prediction RNA 3D structure prediction Prot ein 3D structure prediction Motif extrac tion Functional site prediction Cellul ar localization prediction Codin g region prediction Transmembrane doma in prediction Prot ein secondary structure prediction Prot ein 3D structure prediction Superfa mil y classification Ortholog/para log gro uping of genes 3D fold classification Math/Stat/CompSci method Optimi zation algo rit hms Dyna mi c programmi ng (DP) Sim ulated ann eali ng (SA) Gene tic algorithms (GA) Markov Cha in Monte Carlo (MCMC: Metropoli s and Gibbs samplers) Hopfi eld neu ral network Pattern recognition and learnin g algorithms Discriminant analysis Neural netwo rks Support vector machin es Hidden Markov models (HMM) Forma l gra mmar CAR T Clustering algorithms Hierarchical, k-means, etc PCA, MDS, etc Self-orga nizing maps, etc Thermodynamic principle The amino acid sequence contains all the information necessary to fold a protein molecule into its native 3D state under physiological conditions: fold, denature, spontaneously refold, called Anfinsen’s thermodynamic principle Thus it should be possible to predict 3D structure computationally by minimizing a suitable conformational energy function, but difficult to define, difficult to minimize (globally), called ab initio In practice, structures determined by X-ray crystallography and nuclear magnetic resonance (NMR) are used to give empirical structure-function relationships. RNA secondary structure can be predicted ab initio using an energy function and DP to minimize it, in a process similar to alignment A schematic illustration of RNA secondary structure elements. Stem Hairpin loop Pseudo knot Bulge loop Internal loop Branch loop A C C A G.C C.G G.C G.U A.U U.A U.A G ACAC CU A G C Yeast alanyl transfer RNA Prediction of protein secondary structure: many methods The definition of a dihedral angle and the three backbone dihedral angles, f, y, w, in a protein. Because w is around 180O, the backbone configuration can be specified by f and y, for each peptide unit. f C’ Ca C’ H N R Ca N C’ H O H O N C’ f R y Ca H Peptide unit H R Ca w N C’ H O Prediction of protein secondary structure The options are a-helix, -strand and coil. Many 2º structure prediction methods exist, with ones by Chou-Fasman and another due to Garnier,Osguthorpe and Robson being widely used. These are position&structurespecific scoring matrices based on modest or large numbers of proteins. On the next page we display the GOR PSSM for a-helices. These days one can choose from methods based on almost every major machine learning approach: ANN, HMM, etc. a Helix State Cter G A V L I S T D E N Q K H R F Y W C M P X -8 -16 18 1 17 -21 -23 -13 16 19 2 7 25 14 1 0 -8 8 -77 2 0 0 -7 -18 20 -1 19 -19 -16 -21 20 24 3 9 24 0 -5 7 -9 18 -71 -12 -6 0 -6 -18 23 -5 22 -15 -18 -16 18 31 -2 6 22 -7 -19 17 -10 11 -74 -9 -7 0 -5 -29 25 0 28 -5 -13 -16 14 35 -6 0 18 -6 -25 23 -18 9 -74 -1 -6 0 -4 -41 32 -2 23 0 -20 -14 23 39 -6 7 14 -14 -16 23 -13 2 -67 0 -15 0 -3 -51 40 -9 29 2 -25 -11 22 36 -9 0 16 -6 -16 18 -13 26 -60 21 -22 0 -2 -67 45 -10 37 10 -27 -7 19 36 -16 -3 16 -2 -7 29 -31 37 -71 33 -35 0 -1 -85 45 -5 37 9 -31 -14 26 45 -22 10 25 1 -4 26 -26 29 -61 25 -47 0 0 -105 62 4 51 17 -51 -28 -1 52 -44 23 28 2 -1 32 -15 30 -47 34 -68 0 1 -64 58 -5 48 12 -41 -30 -5 40 -29 35 37 21 -1 40 -24 17 -46 41 -179 0 2 -42 51 -3 54 8 -47 -33 -26 14 -24 29 44 24 3 34 -18 -1 -56 39 -95 0 Nter 3 -37 45 -8 59 12 -43 -30 -35 -17 -13 23 54 25 6 28 -23 12 -58 44 -72 0 4 -30 48 -11 41 6 -35 -20 -21 -13 0 16 49 27 0 12 -28 13 -67 29 -53 0 5 -33 43 -1 36 6 -34 -17 -6 -14 -2 10 44 25 0 3 -19 11 -70 15 -37 0 6 -26 37 0 34 16 -38 -18 -3 -10 -4 0 39 19 -6 15 -16 31 -71 4 -28 0 7 -21 30 -7 28 18 -34 -12 -1 -7 -5 0 44 25 8 6 -18 13 -80 -2 -22 0 8 -17 32 -7 15 9 -36 -8 1 -2 3 1 47 31 0 4 -23 2 -81 -11 -11 0 Two architectures of the hierarchical neural network: (a) the perceptron and (b) the back-propagation neural network. Input layer Output layer Input Layer Hidden Layer Output Layer Prediction of transmembrane domains Membrane proteins are very common, perhaps 25% of all. Membranes are hydrophobic and so a transmembrane domain typically has hydrophobic residues, about 20 to span the membrane. There are a number of rules for detecting them: KyteDoolittle hydropathy scores work fairly well, and the Klein-Kanehisa-DeLisi discriminant function does even better. Three-dimensional structures of two membrane proteins Photosynthetic reaction centre (PDB:1PRC) Outer membrane protein: porin (PDB: 1OMF) Hidden Markov Models ( HMMs) S = States {s0,s1,…..,sn} V = Output alphabet {v0,v1,…..,vm} A = { aij} = transition probability from si sj B = {bi(j)} = probability outputting vj in state si • What is the probability of a sequence of observations? • What are the maximum likelihood estimates of parameters in an HMM? • What is the most likely sequence of states that produced a given sequence of observations? A hidden Markov model for sequence analysis d1 d2 d3 d4 I0 I1 I2 I3 I4 m0 m1 m2 m3 m4 Start m5 End m=match state (output), I=insert state (output), d=delete state (no output) Prediction of protein 3D structures Knowledge based prediction of protein 3D or 3º structure can be classified into two categories: comparative modelling and fold recognition. The first can work well when there is significant sequence similarity to a protein with known 3D structure. By contrast, fold recognition is used when no significant sequence similarity exists, and makes use of the knowledge and analysis of all protein structures. One such method due to Eisenberg and colleagues, involves 3D-1D alignment. Another such is threading. The 3D-1D method for prediction of protein 3D structures involves the construction of a library of 3D profiles for the known protein structures. Side chain Inside or outside a E P2 P1 Environmental class Amino acids B1a A -0.66 R -1.67 . . . . . . . . . . Y 0.18 W 1.00 B1 B1 . . . . -0.79 -0.91 . . . . -1.16 -2.16 . . . . . . . . . . . . . . 0.07 0.17 . . . . 1.17 1.05 . . . . 3D-1D score B3 B2 B1 Polar or apolar Main chain Residue number 1 A 12 R -32 . . . . . . . . . . Y -94 W -214 2 3 .......... -66 46 . . . . . . . . . . -80 -34 . . . . . . . . . . . . . . . . . . . . 112 -210 . . . . . . . . . . 102 -135 . . . . . . . . . . 3D profile N Gene Structure I DNA - - - - agacgagataaatcgattacagtca - - - - Transcription RNA - - - - agacgagauaaaucgauuacaguca - - - Splicing Translation Protein - - - - - DEI - - - Protein Folding Problem Exon Intron Exon Intron Exon Protein Gene Structure II Exon 1 5’ Exon 2 Intron 1 Exon 3 Intron 2 Intron 3 Exon 4 3’ DNA TRANSCRIPTION pre-mRNA SPLICING mRNA TRANSLATION AUG - X1…Xn - STOP protein sequence protein 3D structure Gene Structure III Exon 1 Intron 1 Exon 2 Exon 3 Intron 2 Intron 3 Exon 4 5’ Promoter TATA DNA 3’ Splice site GGTGAG Translation Initiation ATG Splice site CAG Pyrimidine tract Branchpoint CTGAC polyA signal Stop codon TAG/TGA/TAA Additional Difficulties • Alternative splicing SPLICING pre-mRNA ALTERNATIVE SPLICING mRNA TRANSLATION Protein I TRANSLATION Protein II • Pseudo genes DNA Approaches to Gene Recognition • Homology BLASTN, TBLASTX, Procrustes • Statistical de novo GRAIL, FGENEH, Genscan, Genie, Glimmer • Hybrid GenomeScan, Genie F(*,*,*,…) Example: Glimmer Gene Finding in Microbial DNA • • • • No introns 90% coding Shorter genomes (less than 10 million bp) Lots of data Gene Structure in Prokaryotes ORF Translation Initiation ATG Stop codon TAG/TGA/TAA Simplest Hidden Markov Gene Model Coding A C G T 1 0.9 0.03 0.04 0.03 0.1 0.9 ATG TAA 1 0.1 Intergene 0.9 A C G T 0.25 0.25 0.25 0.25 The Viterbi Algorithm A A C A G T G A C T C T Example: Genscan Gene Finding in Human DNA • • • • Introns 5% coding Large genome (3 billion bp) Alternative splicing The Genscan HMM Examples of functional sites. Mole cule DNA Processing Replic ation Transcription Func tiona l sit es Replic ation origin Promotor Enhance r Operator and other prokaryot ic regu lators Interacting molecules Origin recogn iti on co mplex RNA polymerase Transcription factor Repressor, etc RNA Post-transc riptiona l processing Translation Spli ce sit e Spli ceosome Translation initi ation site Ribosome Post-translationa l processing Cleavage s it e Phospho rylation and other modific ation sit es ATP binding sites Signa l sequenc e, lo cali zation signals DNA binding sites Ligand binding sit es Catalytic sit es Protease Protein kinas e, etc. Protein Protein sorting Protein func tion Signa l recogn iti on p article DNA Ligands Many dif ferent molecules Protein sorting prediction The final step in informational expression of proteins involves their sorting to the appropriate location within or outside the cell. The information for correct localization is usually located within the protein itself. Sequence Alignment Problem • Task: find common patterns shared by multiple Protein sequences • Importance: understanding function and structures; revealing evolutionary relationship, data organizing … • Types: Pairwise vs. Multiple; Global vs. Local. • Approaches: criteria-based (extension of pairwise methods) versus model-based (EM, Gibbs, HMM) Outline of Liu-Lawrence approach • Local alignment --- Examples, the Gibbs sampling algorithm • A simple multinomial model for block-motifs and the Bayesian missing-data formulation. Possible but not covered here: • Motif sampler: repeated motifs. • The hidden Markov model (its decoupling) • The propagation model and beyond Example: search for regulatory binding sites • Gene Transcription and Regulation – Transcription initiated by RNA polymerase binding at the so-called promoter region (TATA-box; or -10, -35) – Regulated by some (regulatory) proteins on DNA “near” the promoter region. – These binding sites on DNA are often “similar” in composition. Enhancers and repressors 5’ RNA polymerase Starting codon AUG Promoter region Translation start 3’ The particular dataset • 18 DNA segments, each of length 105 bps. • There are at least one CRP binding sites, known experimentally, in each sequence. • The binding sites are about 16-19 base pairs long, with considerable variability in their contents. • Interested in seeing if we can find these sites computationally. The Data Set Truth? Example: H-T-H proteins HTH: sequence-specific DNA binding, gene regulation. Motifs occur as local isolated structures. The whole 3-D structures are known and very different. 30 sequences with known HTH positions chosen. The set represents a typically diverse cross section of HTH seq. Width of the motif pattern is assumed to be in the range from 17 to 22. The criterion “information per parameter” is used to determine the optimal width, 21. Heuristic convergence developed (multiple restarts with IPP monitored) Finding Local Alignment of Multiple Sequences a1 Local Motif a2 width = w ak Alignment variable: A={a1, a2, …, ak} Objective: find the “best” common patterns. length nk Motif Alignment Model a1 Motif a2 width = w ak length nk The missing data: Alignment variable: A={a1, a2, …, ak} • Every non-site positions follows a common multinomial with p0=(p0,1 ,…, p0,20) • Every position i in the motif element follows probability distribution pi=(pi,1 ,…, pi,20) The Tricky Part: The alignment variable A={a1, a2, …, ak} is not observable • General Missing Data problem: – – – – Unobserved data in each datum Object of the DP optimization (path) Potentially observable Examples • Alignment • RNA structure • Protein secondary structure Statistical Models • How do we describe patterns? – frequencies of amino acid types. – multinomial distribution --- more generally a “model” A typical aligned motif Multinomial Distribution Motif Positions Seq Seq Seq Seq Seq 1 2 3 4 5 1 2 3 4 5 6 I V V I L G G G G S K P I E D P G E D D A D Q H P E G P E E A total of k sequences Model Mi for i-th column: (ki,1, ki,2, …, ki,20) ~ Multinom (k, pi ) where pi=(pi,1 ,…, pi,20) Estimation for the “pattern” • The maximum likelihood: • Bayesian estimate: ij p kij k ki ,1 + + ki , 20 k – Prior: pi ~ Dirichlet (ai,1, ..., ai,20), “pseudo-counts” – Posterior: [pi | obs ]~ Dirichlet (ai,1,+ki,1,…, ai,20 +ki,20) – Posterior Mean: ˆ ij p – Posterior Distribution: k ij + a ij k + l a il Dealing with the missing data • Let Q=(p0 , p1 , … , pw ), “parameter”, A={a1, a2, …, aK} • Iterative sampling: P(Q | A, Data); P(A | Q, Data) Draw from [Q | A, Data], then draw from [A | Q, Data] • Predictive Updating: pretend that K-1 sequences have been aligned. We stochastically predict for the K-th sequence!! a1 a3 a2 ak ? The Algorithm • Initialized by choosing random starting ( 0) ( 0) ( 0) a , a ,......, a positions 1 2 K • Iterate the following steps many times: – Randomly or systematically choose a sequence, say, sequence k, to exclude. – Carry out the predictive-updating step to update ak • Stop when not much change observed, or some criterion met. The PU-Step a1 a2 a3 ak ? 1. Compute predictive frequencies of each position i in motif cij= count of amino acid type j at position i. c0j = count of amino acid type j in all non-site positions. qij= (cij+bj)/(K-1+B), B=b1+ + bK “pseudo-counts” 2. Sample from the predictive distriubtion of ak . w qi , Rk ( l + i ) i 1 q0, Rk ( l + i ) P (a k l + 1) Phase-shift and Fragmentation • Sometimes get stuck in a local shift optimum : True motif locations ak ? • How to “escape” from this local optimum? – Simultaneous move: A A+d, A+d{a1+d, … , aK+d} – Use a Metropolis step: accept the move with prob=p, ( A + d | R) p min{1, } ( A| R) Compare entropies between new columns and left-out ones. Acknowledgements for slides used PDB: protein figures Lior Pachter: gene finding Jun Liu: Gibbs sampler