hmm_project_SashaEd

advertisement
Forward algorithm
We want to find the probability of an observed sequence given an HMM (please spell it
out). An exhaustive way to find this probability would be to find each possible sequence
of the hidden states (who is going to explain what the hidden states are) , and sum these
probabilities. Forward algorithm uses recursion to calculate the probability more
efficiently.
Define our HMM model with N states as (π, A, B), where π is the vector of the initial state
probabilities, A is the transition matrix, and B is a matrix of emission probabilities given
a state (spell it out here even if it is planned to be introduced by Ed) . Let our sequence of
observations be X=(x1, x2,…, xT). Then probability of reaching an intermediate state at
step t (partial probabilities), can be calculated exploiting recursion by knowing these
values for the previous step t-1:
N
f t ( j )  b j ( xt ) f t 1 (i ) aij , and for initial state at t=1:
i 1
f1 ( j )  b j ( x1 )   ( j ) .
b_j is apparently a component of mathrix B. Why does it look like a vector?
The sum of all partial probabilities at the very last step gives the probability of the
observation, given the HMM:
N
P( X | HMM )   f T (i )
i 1
Pfam.
Pfam [1] is a database of protein domain families (spell it out) . Pfam contains multiple
sequence alignments for each family, as well as profile hidden Markov models (profile
HMMs) for finding these domains in new sequences. Pfam contains functional
annotation, literature references and database links for each family. There are two
multiple alignments for each Pfam family, the seed alignment that contains a relatively
small number of representative members of the family and the full alignment that
contains all members in the database that can be detected. The seed alignment contains a
relatively small number of representative members of the family (why to repeat? Say
something new and important. Explain what sa essentially means. Give the
examples ) and is meant to change infrequently, either to improve the alignment or to
extend a family with new members. If there is structural information available for any of
the members of a family, it is used to improve domain boundaries, seed alignment itself
and annotation. The profile HMM is built from the seed alignment using the HMMER.
This profile HMM is then used to search protein sequence database for matches against
this profile. All the matches found above the family specific threshold are aligned using
the profile HMM to make the full alignment.
Pfam is web-based resource available at http://www.sanger.ac.uk/Software/Pfam/ with
several mirrors around the world. In addition, Pfam software and database itself can be
downloaded and used locally if needed. As of release 17.0, made available in March
2005, Pfam contains 7868 Pfam families. Pfam families match 75% of protein sequences
in Swiss-Prot and TrEMBL databases [2]. For those protein sequences that do not belong
to any Pfam family, automatically generated Pfam-B families are created. Although of
lower quality Pfam-B families can be useful when no Pfam-A families are found, and the
combination of Pfam and Pfam-B covers 82% of protein sequences.
Pfam web site allows a few ways to query its data. First, if it is a known sequence from
UniProt [3] database (UniProt merged data from SwissProt, TrEMBL, and PIR [4]), then
its domain structure is pre-computed and results are presented to a user as shown in the
example search below for FGFR1_HUMAN:
(I appreciate these pictures but a link to the site would be sufficient. The pictures
are taking space, and I would like to see more essentials about the HMM. This is the
goal of the project, not the web tools )
Protein families database of alignments and
HMMs
SwissPfam entry for fgfr1_human
Description from UniProt for
FGFR1_HUMAN :
basic fibroblast growth factor receptor 1 precursor (ec 2.7.1.112)(fgfr-1) (bfgf-r) (fms-like tyrosine kinase-2) (c-fgr)
[822 residues]
ig 48-103
I-set 159-248
ig 270-343
Pkinase_Tyr 478-754
Key
signal peptide:
pfamA:
>
Source
Domain
Pfam
Pfam-B_4731
Pfam-B_4731
ig
Pfam
Pfam
Context:
>
smart:
>
transmembrane:
>
low complexity:
>
coiled coil:
>
pfamB:
>
Start End
1
33
1
30
48
103
Overlapping Domains: Change the domain order using
the ^ and v buttons. View the changes by clicking the
'Change order' button.
Hide key
Pfam
Pfam
Pfam
Pfam
Pfam
Pfam
Smart
Smart
Smart
Smart
signalp
seg
seg
tmhmm
seg
I-set
ig
Pfam-B_958
Pfam-B_907
Pkinase_Tyr
Pfam-B_1516
IGc2
IGc2
IGc2
TyrKc
signal peptide
159
248
270
343
377
428
432
477
478
754
763
807
46
108
169
237
268
348
478
754
1
23
low complexity 49
60
low complexity 124
138
transmembrane 375
397
low complexity 439
453
Sequence markup
high priority
signal peptide
pfamA
smart
transmembrane
low complexity
pfamB
Increase priority
Decrease priority
low priority
Change order
Start End
Disulphide-bridge
55
101
Disulphide-bridge
178
230
Disulphide-bridge
277
341
Active-site
623
Comments or questions on the site? Send a mail to pfam@sanger.ac.uk
Results contain graphical representation of domain structure of a protein. High quality
Pfam-A domains are shown in the large, one-colour boxes and automatically generated
Pfam-B domains are shown in the smaller, three coloured boxes. Although Pfam-B
families are not guaranteed by Pfam to be correct, they do give an idea of the other
sequences in UniProt, which perhaps share some features with this protein.
For new protein sequences, search results will contain if there are any of the domains in
Pfam and if so, how many domains, where they lie in the sequence, annotation about the
domain and alignment of a user sequence to the domains.
(an example would be appreciated. KVAP is a “fancy” one, with Nobel price on its
chest )
Pfam also provide tools for analysis of domain architecture in proteins. In particular,
using the Sweden web server at http://www.cgr.ki.se/Pfam/, it is possible to look for
proteins that share the same overall domain organisation. These are not necessary the
most sequence-similar proteins, and there is no obviously correct way to assign a score.
So, results are ranked by the number of domains in common, from identical domain
architectures to smaller numbers of common domains. It is also possible to perform a
general query consisting of a set of Pfam domains, with or without ordering or gap
constraints, similar to regular expressions. The user can search for proteins with a certain
domain combination motif, e.g. all proteins with one or more immunoglobulin-like
domains and a tyrosine protein kinase domain.
Another way to use Pfam is via so-called ‘taxonomy search’ tool (UK web site), which
allows the user to find Pfam entries specific to a group of organisms using a taxonomy
query language. One use of this tool is to aid identification of putative drug targets. For
example, as part of a screen for possible drug targets unique to the malaria parasite, one
might want to identify all Pfam domains present in Plasmodium falciparum but not in the
vertebrate host. The taxonomic query ‘Plasmodium falciparum AND NOT Vertebrata’
returns 77 Pfam domains, 10 of which have already been postulated as drug targets
against P.falciparum.
Gene prediction.
(again, pay more attention to the essentials of HMM )
One of the first steps after obtaining full DNA sequence of an organism is to compile a
catalog of genes in this genome. To underline the difficulty in identifying genes, gene
boundaries of yeast genome, which has about 6,000 genes, are still subject to major
corrections more than 6 years after the genome was sequenced. Human genome was
completed a few years ago, and it is estimated to contain between 20,000 and 25,000
genes. An automatic gene prediction and annotation pipeline, Ensembl [5], annotates
about 22,000 human genes mostly using homology-based evidences like cDNA, EST or
protein sequence matches against various databases.
One of the parts of Ensembl is HMM-based gene prediction program, GENSCAN [6].
Historically, GENSCAN was one of the first systems to perform well on typical genomic
sequences containing multiple genes in both orientations. GENSCAN uses a fifth order
generalized hidden Markov model (GHMM) to predict genes in a given target sequence,
using only that sequence as input. Prior to the advent of dual-genome gene predictors,
GENSCAN remained one of the most accurate and widely used systems. Different states
and transitions of GHMM are shown below.
Figure 1.Different states and transitions in GENSCAN HMM. Each circle or square is a functional unit of
a gene on its forward strand, for example Einit is 5’ coding sequence and Eterm is 3’ coding sequence. The
model for the reverse strand is in mirror symmetry to the model shown with respect to horizontal axis. E,
exon; I, intron; pro, promoter, figure taken from [7].
The initial sequencing of the mouse genome made it possible for the first time to
incorporate whole-genome comparison into human gene prediction. This led to the
creation of a new generation of gene predictors, such as SLAM [8], SGP2 [9], and
TWINSCAN [10], which were able to improve on the performance of GENSCAN by
using patterns of conservation between the human and mouse genomes to help
discriminate between coding and noncoding regions. These programs are the bestperforming gene predictors for mammalian genomes currently available.
TWINSCAN, SLAM and SGP2 programs use HMMs to predict genes. In particular,
TWINSCAN modifies GENSCAN model by incorporating sequence similarity into
GENSCAN scoring scheme. SGP2 takes a similar approach but instead of GENSCAN, it
uses GENEID, another HMM-based gene prediction program, to produce scores. Unlike
both TWINSCAN and SGP2, which use sequence alignment as input, SLAM tool
combine sequence alignment pair HMM with gene prediction GHMM into so-called
generalized pair HMM, thus obtaining both sequence alignment and gene predictions.
Recent experimental evaluation [11] by RT-PCR of gene predictors SGP2, TWINSCAN
and Ensembl, applied to the chicken genome showed that approximately 50% of
predictions that were in TWINSCAN and SGP2 but not in Ensembl could be
experimentally verified. These experiments demonstrate that comparative prediction
methods are effective at complementing homology-based methods and confirm that a
combination of methods can improve the prediction accuracy.
With recent genome availability for other species, like rat and chicken, there is an effort
to use multiple alignment of several genomes to improve gene prediction. One of the
recent developments [12], TWINSCAN 3.0 (also called N-SCAN) changes its GHMM so
that it emits multiple alignment instead of single DNA sequence, and also phylogenetic
tree is incorporated into the model as Bayesian network. N-SCAN also extends its new
state diagram to allow explicit modeling of 5’ UTR, which is important for understanding
post-transcriptional regulation of new and existing genes. Their computation results show
improvement as compared to dual-genome systems, but there was no experimental
evaluation of their predictions yet.
References
1
Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S,
Sonnhammer ELL, Studholme DJ, Yeats C, Eddy SR (2004) The Pfam Protein Families Database. Nucleic
Acids Research Database Issue 32:D138-D141
2 Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M., Estreicher, A., Gasteiger, E., Martin, M.J.,
Michoud, K., O’Donovan, C., Phan, I. et al. (2003) The Swiss-Prot protein knowledgebase and its
supplement TrEMBL in 2003. Nucleic Acids Res., 31, 365–370.
3 Apweiler R., Bairoch A., Wu C.H., Barker W.C., Boeckmann B., Ferro S., Gasteiger E., Huang H., Lopez
R., Magrane M., Martin M.J., Natale D.A., O'Donovan C., Redaschi N., Yeh L.S. UniProt: the Universal
Protein knowledgebase. Nucleic Acids Res. 32:D115-119(2004).
4 Wu,C.H., Yeh,L.-S.L., Huang,H., Arminski,L., Castro-Alvear,J., Chen,Y., Hu,Z., Kourtesis,P.,
Ledley,R.S., Suzek,B.E. et al. (2003) The Protein Information Resource. Nucleic Acids Res., 31, 345–347.
5 T. Hubbard, D. Andrews, M. Caccamo, G. Cameron, Y. Chen, M. Clamp, L. Clarke, G. Coates, T. Cox,
F. Cunningham, V. Curwen, T. Cutts, T. Down, R. Durbin, X. M. Fernandez-Suarez, J. Gilbert, M.
Hammond, J. Herrero, H. Hotz, K. Howe, V. Iyer, K. Jekosch, A. Kahari, A. Kasprzyk, D. Keefe, S.
Keenan, F. Kokocinsci, D. London, I. Longden, G. McVicker, C. Melsopp, P. Meidl, S. Potter, G. Proctor,
M. Rae, D. Rios, M. Schuster, S. Searle, J. Severin, G. Slater, D. Smedley, J. Smith, W. Spooner, A.
Stabenau, J. Stalker, R. Storey, S. Trevanion, A. Ureta-Vidal, J. Vogel, S. White, C. Woodwark and E.
Birney. Ensembl 2005 Nucleic Acids Res. 2005 Jan 1;33 Database issue:D447-D453.
6 Burge, C.B. and Karlin, S. 1997. Prediction of complete gene structures in human genomic DNA. J. Mol.
Biol. 268: 78-94.
7 Zhang MQ. Computational prediction of eukaryotic protein-coding genes. Nat Rev Genet. 2002;3:698–
709.
8 Pachter, L., Alexandersson, M., and Cawley, S. 2002. Applications of generalized pair hidden Markov
models to alignment and gene finding problems. J. Comp. Biol. 9: 389-400.
9 Parra G, Agarwal P, Abril JF, Wiehe T, Fickett JW, Guigo R. Comparative gene prediction in human and
mouse.Genome Res. 2003 Jan;13(1):108-17.
10 Korf, I., Flicek, P., Duan, D., and Brent, M.R. 2001. Integrating genomic homology into gene structure
prediction. Bioinformatics 17 Suppl 1: 140-148.
11 Eyras E, Reymond A, Castelo R, Bye JM, Camara F, Flicek P, Huckle EJ, Parra G, Shteynberg DD,
Wyss C, Rogers J, Antonarakis SE, Birney E, Guigo R, Brent MR. Gene finding in the chicken genome.
BMC Bioinformatics. 2005 May 30;6(1):131.
12 Gross, S. and Brent, M. 2005. Using multiple alignments to improve gene prediction. The Ninth
International Conference on Research in Computational Molecular Biology (RECOMB). (in press).
Download