Forward algorithm. We want to find the probability of an observed sequence given a hidden Markov model (HMM). In an HMM the underlying sequence of states is not observed directly (the states are "hidden"); only the symbols emitted by the states are observed. An exhaustive way to find this probability would be to enumerate every possible sequence of hidden states, compute the joint probability of the observations with each state sequence, and sum these probabilities; for N states and T observations there are N^T such sequences. The forward algorithm uses recursion to calculate the same probability more efficiently.

Define our HMM with N states as (π, A, B), where π is the vector of initial state probabilities, A is the N×N transition matrix with entries a_{ij} (the probability of moving from state i to state j), and B is the matrix of emission probabilities, whose entry b_j(x) is the probability of emitting symbol x while in state j. Let our sequence of observations be X = (x_1, x_2, …, x_T). The partial probability f_t(j), the probability of observing x_1, …, x_t and being in state j at step t, can be calculated recursively from the values for the previous step t-1:

$$f_t(j) = b_j(x_t) \sum_{i=1}^{N} f_{t-1}(i)\, a_{ij},$$

and for the initial step t = 1:

$$f_1(j) = b_j(x_1)\, \pi(j).$$

The sum of all partial probabilities at the very last step gives the probability of the observation sequence given the HMM:

$$P(X \mid \mathrm{HMM}) = \sum_{i=1}^{N} f_T(i).$$

Pfam. Pfam [1] is a database of protein domain families, that is, groups of evolutionarily related protein regions (domains). Pfam contains multiple sequence alignments for each family, as well as profile hidden Markov models (profile HMMs) for finding these domains in new sequences. It also contains functional annotation, literature references and database links for each family.

There are two multiple alignments for each Pfam family: the seed alignment, which contains a relatively small number of representative members of the family, and the full alignment, which contains all members in the database that can be detected. The seed alignment is meant to change infrequently, either to improve the alignment or to extend a family with new members. If structural information is available for any member of a family, it is used to improve the domain boundaries, the seed alignment itself and the annotation. The profile HMM is built from the seed alignment using the HMMER software package. This profile HMM is then used to search a protein sequence database for matches. All matches that score above the family-specific threshold are aligned using the profile HMM to make the full alignment.

Pfam is a web-based resource available at http://www.sanger.ac.uk/Software/Pfam/ with several mirrors around the world. In addition, the Pfam software and the database itself can be downloaded and used locally if needed. As of release 17.0, made available in March 2005, Pfam contains 7868 families. Pfam families match 75% of the protein sequences in the Swiss-Prot and TrEMBL databases [2]. For protein sequences that do not belong to any curated (Pfam-A) family, automatically generated Pfam-B families are created. Although of lower quality, Pfam-B families can be useful when no Pfam-A families are found, and the combination of Pfam-A and Pfam-B covers 82% of protein sequences.

The Pfam web site offers several ways to query its data. First, if the query is a known sequence from the UniProt [3] database (UniProt merges data from Swiss-Prot, TrEMBL and PIR [4]), its domain structure is pre-computed and the results are presented to the user, as in the example search for FGFR1_HUMAN:

[Figure: SwissPfam entry for FGFR1_HUMAN (basic fibroblast growth factor receptor 1 precursor, 822 residues), showing the Pfam-A domains ig (48-103), I-set (159-248), ig (270-343) and Pkinase_Tyr (478-754), together with Pfam-B, SMART, signal peptide, transmembrane and low-complexity regions, annotated disulphide bridges (55-101, 178-230, 277-341) and the active site (623).]

The results contain a graphical representation of the domain structure of the protein. High-quality Pfam-A domains are shown as large, one-colour boxes and automatically generated Pfam-B domains are shown as smaller, three-coloured boxes. Although Pfam-B families are not guaranteed by Pfam to be correct, they do give an idea of the other sequences in UniProt that may share some features with this protein.
For new protein sequences, the search results report whether the sequence contains any Pfam domains and, if so, how many there are, where they lie in the sequence, the annotation of each domain, and the alignment of the user's sequence to each domain.

Pfam also provides tools for analysing the domain architecture of proteins. In particular, the Swedish web server at http://www.cgr.ki.se/Pfam/ makes it possible to look for proteins that share the same overall domain organisation. These are not necessarily the most sequence-similar proteins, and there is no obviously correct way to assign a score, so results are ranked by the number of domains in common, from identical domain architectures down to smaller numbers of shared domains. It is also possible to perform a general query consisting of a set of Pfam domains, with or without ordering or gap constraints, similar to regular expressions. The user can search for proteins with a certain domain combination motif, e.g. all proteins with one or more immunoglobulin-like domains and a tyrosine protein kinase domain.

Another way to use Pfam is the so-called ‘taxonomy search’ tool (UK web site), which allows the user to find Pfam entries specific to a group of organisms using a taxonomy query language. One use of this tool is to aid the identification of putative drug targets. For example, as part of a screen for possible drug targets unique to the malaria parasite, one might want to identify all Pfam domains present in Plasmodium falciparum but not in the vertebrate host. The taxonomic query ‘Plasmodium falciparum AND NOT Vertebrata’ returns 77 Pfam domains, 10 of which have already been postulated as drug targets against P. falciparum.

Gene prediction. One of the first steps after obtaining the full DNA sequence of an organism is to compile a catalog of the genes in its genome.
The difficulty of identifying genes is illustrated by the yeast genome: it has about 6,000 genes, yet their boundaries are still subject to major corrections more than 6 years after the genome was sequenced. The human genome was completed a few years ago and is estimated to contain between 20,000 and 25,000 genes. Ensembl [5], an automatic gene prediction and annotation pipeline, annotates about 22,000 human genes, mostly using homology-based evidence such as cDNA, EST or protein sequence matches against various databases.

One component of Ensembl is the HMM-based gene prediction program GENSCAN [6]. Historically, GENSCAN was one of the first systems to perform well on typical genomic sequences containing multiple genes in both orientations. GENSCAN uses a fifth-order generalized hidden Markov model (GHMM) to predict genes in a given target sequence, using only that sequence as input. Until the advent of dual-genome gene predictors, GENSCAN remained one of the most accurate and widely used systems. The states and transitions of its GHMM are shown in Figure 1.

Figure 1. States and transitions in the GENSCAN HMM. Each circle or square is a functional unit of a gene on its forward strand; for example, Einit is the 5’ coding sequence and Eterm is the 3’ coding sequence. The model for the reverse strand is the mirror image of the model shown with respect to the horizontal axis. E, exon; I, intron; pro, promoter. Figure taken from [7].

The initial sequencing of the mouse genome made it possible for the first time to incorporate whole-genome comparison into human gene prediction. This led to a new generation of gene predictors, such as SLAM [8], SGP2 [9] and TWINSCAN [10], which improved on the performance of GENSCAN by using patterns of conservation between the human and mouse genomes to help discriminate between coding and noncoding regions. These programs are the best-performing gene predictors for mammalian genomes currently available.
TWINSCAN, SLAM and SGP2 all use HMMs to predict genes. In particular, TWINSCAN modifies the GENSCAN model by incorporating sequence similarity into the GENSCAN scoring scheme. SGP2 takes a similar approach, but instead of GENSCAN it uses GENEID, another HMM-based gene prediction program, to produce scores. Unlike TWINSCAN and SGP2, which take a sequence alignment as input, SLAM combines a sequence-alignment pair HMM with a gene-prediction GHMM into a so-called generalized pair HMM, and thus produces both the sequence alignment and the gene predictions.

A recent experimental evaluation [11] by RT-PCR of the gene predictors SGP2, TWINSCAN and Ensembl, applied to the chicken genome, showed that approximately 50% of the predictions made by TWINSCAN and SGP2 but not by Ensembl could be experimentally verified. These experiments demonstrate that comparative prediction methods are effective at complementing homology-based methods and confirm that a combination of methods can improve prediction accuracy.

With genomes of other species, such as rat and chicken, recently becoming available, there is an effort to use multiple alignments of several genomes to improve gene prediction. One recent development [12], TWINSCAN 3.0 (also called N-SCAN), changes the GHMM so that it emits a multiple alignment instead of a single DNA sequence, and incorporates a phylogenetic tree into the model as a Bayesian network. N-SCAN also extends the state diagram to allow explicit modelling of 5’ UTRs, which is important for understanding post-transcriptional regulation of new and existing genes. The authors' computational results show an improvement over dual-genome systems, but their predictions have not yet been evaluated experimentally.

References
1 Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer ELL, Studholme DJ, Yeats C, Eddy SR (2004) The Pfam Protein Families Database. Nucleic Acids Research Database Issue 32:D138-D141.
2 Boeckmann B, Bairoch A, Apweiler R, Blatter M, Estreicher A, Gasteiger E, Martin MJ, Michoud K, O'Donovan C, Phan I, et al. (2003) The Swiss-Prot protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 31:365-370.
3 Apweiler R, Bairoch A, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Natale DA, O'Donovan C, Redaschi N, Yeh LS (2004) UniProt: the Universal Protein knowledgebase. Nucleic Acids Res. 32:D115-D119.
4 Wu CH, Yeh L-SL, Huang H, Arminski L, Castro-Alvear J, Chen Y, Hu Z, Kourtesis P, Ledley RS, Suzek BE, et al. (2003) The Protein Information Resource. Nucleic Acids Res. 31:345-347.
5 Hubbard T, Andrews D, Caccamo M, Cameron G, Chen Y, Clamp M, Clarke L, Coates G, Cox T, Cunningham F, Curwen V, Cutts T, Down T, Durbin R, Fernandez-Suarez XM, Gilbert J, Hammond M, Herrero J, Hotz H, Howe K, Iyer V, Jekosch K, Kahari A, Kasprzyk A, Keefe D, Keenan S, Kokocinski F, London D, Longden I, McVicker G, Melsopp C, Meidl P, Potter S, Proctor G, Rae M, Rios D, Schuster M, Searle S, Severin J, Slater G, Smedley D, Smith J, Spooner W, Stabenau A, Stalker J, Storey R, Trevanion S, Ureta-Vidal A, Vogel J, White S, Woodwark C, Birney E (2005) Ensembl 2005. Nucleic Acids Res. 33 Database issue:D447-D453.
6 Burge CB, Karlin S (1997) Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268:78-94.
7 Zhang MQ (2002) Computational prediction of eukaryotic protein-coding genes. Nat Rev Genet. 3:698-709.
8 Pachter L, Alexandersson M, Cawley S (2002) Applications of generalized pair hidden Markov models to alignment and gene finding problems. J. Comp. Biol. 9:389-400.
9 Parra G, Agarwal P, Abril JF, Wiehe T, Fickett JW, Guigo R (2003) Comparative gene prediction in human and mouse. Genome Res. 13(1):108-117.
10 Korf I, Flicek P, Duan D, Brent MR (2001) Integrating genomic homology into gene structure prediction. Bioinformatics 17 Suppl 1:140-148.
11 Eyras E, Reymond A, Castelo R, Bye JM, Camara F, Flicek P, Huckle EJ, Parra G, Shteynberg DD, Wyss C, Rogers J, Antonarakis SE, Birney E, Guigo R, Brent MR (2005) Gene finding in the chicken genome. BMC Bioinformatics 6(1):131.
12 Gross S, Brent M (2005) Using multiple alignments to improve gene prediction. The Ninth International Conference on Research in Computational Molecular Biology (RECOMB). (in press).
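
Forward algorithm: a worked sketch. As a supplement to the Forward algorithm section, the recursion for the partial probabilities can be written out in a few lines of Python. This is only a minimal sketch under stated assumptions: observations are encoded as integer symbol indices, the sequence is non-empty, and the two-state model at the bottom is hypothetical and chosen purely for illustration. The function names forward and brute_force are ours, not taken from any HMM package. The brute-force function implements the exhaustive sum over all hidden state sequences, so the two results can be compared on small inputs.

```python
from itertools import product


def forward(pi, A, B, X):
    """Return P(X | HMM) computed with the forward recursion.

    pi[j]   : initial probability of state j
    A[i][j] : transition probability from state i to state j
    B[j][x] : probability that state j emits symbol x
    X       : non-empty observation sequence of symbol indices
    """
    N = len(pi)
    # Initial step: f_1(j) = b_j(x_1) * pi(j)
    f = [B[j][X[0]] * pi[j] for j in range(N)]
    # Recursion: f_t(j) = b_j(x_t) * sum_i f_{t-1}(i) * a_ij
    for x in X[1:]:
        f = [B[j][x] * sum(f[i] * A[i][j] for i in range(N)) for j in range(N)]
    # Termination: P(X | HMM) = sum_i f_T(i)
    return sum(f)


def brute_force(pi, A, B, X):
    """Sum the joint probability of X over all N**T hidden state paths."""
    N, T = len(pi), len(X)
    total = 0.0
    for path in product(range(N), repeat=T):
        p = pi[path[0]] * B[path[0]][X[0]]
        for t in range(1, T):
            p *= A[path[t - 1]][path[t]] * B[path[t]][X[t]]
        total += p
    return total


# Hypothetical model: 2 hidden states, 2 observable symbols (0 and 1).
pi = [0.6, 0.4]
A = [[0.7, 0.3],
     [0.4, 0.6]]
B = [[0.9, 0.1],
     [0.2, 0.8]]
X = [0, 1, 0]

if __name__ == "__main__":
    print("forward    :", forward(pi, A, B, X))
    print("brute force:", brute_force(pi, A, B, X))
```

The forward version needs on the order of N²·T multiplications, whereas the exhaustive sum visits N^T paths. For long sequences the partial probabilities become very small, so practical implementations usually work with scaled values or logarithms; that refinement is left out here to keep the correspondence with the equations direct.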