Center for Biological Sequence Analysis

Prokaryotic gene finding
Marie Skovgaard, Ph.D. student
marie@cbs.dtu.dk

Prokarya

Can you spot the gene?

>AE006641
GTATACTCTTCTTCCCTATACATTGTCGCAGCAAGCTTAGTTTCTTTAGCCTCTCTGCTTTCATTATTAC
TTATAATCTTAATAGCAAGGAGACATATGATAGAGTATTTCTATATGATTCCTTCGTTCGTTTATATGAA
CTTTATTGTCGCACTAAACTTCACTGCAATATTTTTAGAGTTAATAAGAGCACCTAGAGTGTGGGTAAAA
ACTGAAAGAAGTGCCAAGGTTACGGGGGAGGTCATGGGATGATAACTGAATTTTTACTTAAAAAGAAATT
AGAAGAACATTTAAGCCATGTAAAGGAAGAGAATACGATATATGTAACAGATTTAGTAAGATGCCCCAGA
AGAGTAAGATATGAGAGTGAATACAAGGAGCTTGCAATCTCTCAGGTTTACGCGCCTTCAGCTATTTTAG
GGGACATATTGCATCTCGGTCTTGAAAGCGTATTAAAAGGGAACTTTAATGCAGAAACTGAAGTTGAAAC
TCTGAGAGAAATTAACGTCGGAGGTAAAGTTTATAAAATTAAAGGAAGAGCCGATGCAATAATTAGAAAT
GACAACGGGAAGAGTATTGTAATTGAGATAAAAACTTCTAGAAGTGATAAAGGATTACCTCTAATTCATC
ATAAAATGCAGCTACAGATATATTTATGGTTATTTAGTGCAGAAAAAGGTATACTAGTTTACATAACTCC
AGATAGGATAGCTGAGTATGAAATAAACGAACCTTTAGATGAAGCAACAATAGTAAGACTTGCAGAGGAT
ACAATAATGTTACAAAACTCACCTAGATTCAACTGGGAATGTAAATATTGCATATTTTCCGTCATTTGCC
CAGCTAAACTAACCTAAAATTAAAATCTCTCATCGATATAATTAAATTGTGCACACTAGACCAGTAGTTG
CCACAATAGCTGGGAGTGACAGTGGAGGAGGTGCTGGATTACAGGCTGATCTAAAGACGTTTAGCGCATT
AGGAGTTTTTGGTACAACAATAATAACCGGTTTAACAGCACAGAATACAAGAACAGTTACAAAAGTATTA
GAGATACCATTAGATTTCATTGAAGCTCAGTTTGATGCGGTTTGCCTAGATTTACATCCAACTCACGCCA
AAACTGGAATGTTAGCTTCTGGTAAAGTGGTAGAACTTGTACTGAGAAAAATTAGAGAGTATAACATAAA
ACTAGTTTTAGATCCAGTGATGGTTGCGAAATCTGGATCATTATTGGTAACAGAGGATATCTCGGAGCAA
ATAAAAAAGGCGATGAAGGAGGCCATAATATCTACTCCAAACAGATATGAAGCTGAGATAATAAATAAGA
CAAAGATTAATAGTCAAGATGATGTTATAAAAGCGGCAAGGGAAATTTATTCTAAGTATGGGAATGTTGT
AGTTAAAGGATTTAATGGAGTAGATTACGCCATAATTGACGGAGAAGAAATAGAGTTAAAAGGTGATTAC
ATCAGTACTAAAAATACACATGGTAGTGGAGACGTATTTTCTGCCTCCATAACTGCATATCTTGCCTTGG
GATACAAACTTAAAGATGCATTAATAAGAGCTAAAAAATTCGCTACAATGACAGTCAAATACGGTTTGGA
CTTAGGAGGAGGATATGGACCAGTAGATCCCTTTGCCCCTATAGAGTCCATAGTGAAGAGAGAAGAAGGA
AGAAATCAGCTAGAAAACTTACTTTGGTACTTAGAGTCTAATCTTAACGTTATACTTAAACTAATTAACG

/(ATG|TTG|GTG)((...)*?)(TGA|TAG|TAA)/

Identifying open reading frames

/(ATG|TTG|GTG)((...)*?)(TGA|TAG|TAA)/

A. pernix (43% AT)

Why care about over-annotated genes?

Genome comparison:
• Fraction of known proteins
• Average gene length
• Amino acid composition
The quality of our databases
To gain biological knowledge

Regular expression

Regular expression: /[AT][CG][AC][ACGT]*A[TG][CG]/

ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATG
ACCG--ATC

The regular expression is able to find all possible sequences, but it does not distinguish between the consensus sequence (ACAC--ATC) and a highly unlikely sequence (TGCT--AGG). Weight matrices can be used to score the sequence, but they do not deal with insertions and deletions.

Markov model

ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATG
ACCG--ATC

[Figure: Markov model built from the alignment, showing the probabilities of A, C, G and T in each state and the transition probabilities between the states.]

Profile HMM

Profile HMMs have a predefined architecture, and their parameters are estimated from multiple sequence alignments. Profile HMMs are not useful for gene finding, since all the genes in an organism cannot be aligned in a meaningful way.

[Figure: profile HMM architecture running from Begin to End.]

Markov Model for gene finding

Define a simple architecture:

/(ATG|TTG|GTG)((...)*?)(TGA|TAG|TAA)/

S1: ATG, GTG, TTG
S2: A, T, G, C
S3: A, T, G, C
S4: A, T, G, C
S5: TAG, TAA, TGA

Markov models

Knowledge of the structure of genes is used to define the architecture of the model. Sequences (x) from known genes are used to estimate the parameters of the model; this is the training of the model.
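
A minimal sketch of such training, assuming the simple five-state architecture (S1 start codon, S2 to S4 the three positions of each internal codon, S5 stop codon). Each count is divided by the total number of symbols observed in that state. The function name `train_simple_model` and the toy sequences are made up for illustration and do not come from the slides.

```python
from collections import Counter

def train_simple_model(genes):
    """Estimate emission frequencies for the 5-state model by counting.

    S1 emits the start codon, S2/S3/S4 emit the 1st/2nd/3rd position of
    every internal codon, and S5 emits the stop codon.
    """
    counts = {state: Counter() for state in ("S1", "S2", "S3", "S4", "S5")}
    for gene in genes:
        if len(gene) < 6 or len(gene) % 3:
            raise ValueError("expected start codon + whole codons + stop codon")
        counts["S1"][gene[:3]] += 1      # start codon
        counts["S5"][gene[-3:]] += 1     # stop codon
        for i in range(3, len(gene) - 3, 3):
            codon = gene[i:i + 3]
            counts["S2"][codon[0]] += 1
            counts["S3"][codon[1]] += 1
            counts["S4"][codon[2]] += 1
    # Turn the counts of each state into relative frequencies.
    model = {}
    for state, counter in counts.items():
        total = sum(counter.values())
        model[state] = {sym: n / total for sym, n in counter.items()} if total else {}
    return model

# Hypothetical toy training set (not real genes).
toy_genes = ["ATGAAATAA", "ATGGCATAG", "GTGAAATAA", "ATGAACTGA"]
model = train_simple_model(toy_genes)
print(model["S1"])  # start codon frequencies
print(model["S2"])  # first codon position frequencies
```

With real data the training set would be a collection of trusted genes from the organism, and the resulting frequencies would play the role of the numbers shown for a trained model.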
The training is done by counting the number of times a nucleotide occurs in a given state and dividing this count by the total number of observations in that state (for the start and stop codon states, this is the number of sequences used in training), giving the frequencies.

Training

[Figure: the symbols x1, x2, ..., xn of a training sequence are assigned to the states S1 to S5 of the model. S1: ATG, GTG, TTG. S2, S3, S4: A, T, G, C. S5: TAG, TAA, TGA.]

Model after training

S1: ATG: 0.77  TTG: 0.11  GTG: 0.12  CTG: 0.00
S2: A: 0.22  T: 0.24  G: 0.27  C: 0.27
S3: A: 0.25  T: 0.23  G: 0.27  C: 0.25
S4: A: 0.26  T: 0.24  G: 0.25  C: 0.25
S5: TAG: 0.6  TAA: 0.3  TGA: 0.1
Transition probability shown in the figure: 0.98

The trained model can be used to search for genes in DNA sequences.

Searching with the HMM

Sequence: ATG A T T T C G C G C G A T ... T A G

States   ATG    A             T                  T
S1       0.77   0.00          0.00               0.00
S2       0.00   (0.22*0.77)   0.00               0.00
S3       0.00   0.00          (0.23*0.22*0.77)   0.00
S4       0.00   0.00          0.00               (0.24*0.23*0.22*0.77)
S5       0.00   0.00          0.00               0.00

Continuing the product to the end of the sequence gives P(x|M).

Log-Odds score

The probability of a sequence becomes vanishingly small as the sequence x becomes longer. This is solved by defining a background (NULL) model, for example a random distribution: A=T=C=G=0.25. From this the Log-Odds score can be calculated:

log(P(x|M)/P(x|NULL))

A high Log-Odds score corresponds to a sequence that looks more like the gene model than the background model.

Is the model too simple?

[Figure: the simple five-state architecture. S1: ATG, GTG, TTG. S2, S3, S4: A, T, G, C. S5: TAG, TAA, TGA.]

Codon usage

Synonymous codons encode the same amino acid. At random, synonymous codons would be expected to be used with equal frequencies. In real life, synonymous codons have different frequencies. Different species have consistent and characteristic codon biases. Laterally transferred genes and genes from plasmids and phages will have atypical codon usage. Variations in codon usage within an organism can be modelled with different coding models in the HMM.
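
As a rough illustration of how codon bias can be measured, the sketch below counts codons in a set of coding sequences and reports, for each codon, its share among the codons that encode the same amino acid. The standard genetic code is assumed; the function names and the toy sequence are hypothetical and only serve the example.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ("*" = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def codon_counts(genes):
    """Count in-frame codons; codons with non-ACGT letters are skipped."""
    counts = Counter()
    for gene in genes:
        for i in range(0, len(gene) - len(gene) % 3, 3):
            codon = gene[i:i + 3]
            if codon in CODON_TABLE:
                counts[codon] += 1
    return counts

def synonymous_fractions(counts):
    """Fraction of each codon among all codons for the same amino acid."""
    per_amino_acid = defaultdict(int)
    for codon, n in counts.items():
        per_amino_acid[CODON_TABLE[codon]] += n
    return {c: n / per_amino_acid[CODON_TABLE[c]] for c, n in counts.items()}

# Hypothetical toy coding sequence: Met, Lys, Lys, Lys, stop.
fractions = synonymous_fractions(codon_counts(["ATGAAAAAAAAGTAA"]))
for codon in sorted(fractions):
    print(codon, CODON_TABLE[codon], round(fractions[codon], 2))
```

With unbiased usage the two lysine codons would each be close to 0.5; a strong deviation from equal shares across many genes is the kind of codon bias described above.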

1st        2nd position                                                 3rd
position   U              C              A              G               position
U          30,407 Phe     11,523 Ser     22,048 Tyr      7,062 Cys      U
           22,581 Phe     11,766 Ser     16,669 Tyr      8,846 Cys      C
           18,943 Leu      9,793 Ser      2,706 Stop     1,260 Stop     A
           18,629 Leu     12,195 Ser        326 Stop    20,756 Trp      G
C          15,018 Leu      9,569 Pro     17,631 His     28,458 Arg      U
           15,104 Leu      7,491 Pro     13,272 His     29,968 Arg      C
            5,316 Leu     11,496 Pro     20,912 Gln      4,860 Arg      A
           71,710 Leu     31,614 Pro     39,285 Gln      7,404 Arg      G
A          41,375 Ile     12,223 Thr     24,189 Asn     11,982 Ser      U
           34,261 Ile     31,889 Thr     29,529 Asn     21,907 Ser      C
            5,967 Ile      9,683 Thr     45,812 Lys      2,899 Arg      A
           37,994 Met     19,682 Thr     14,076 Lys      1,694 Arg      G
G          24,910 Val     20,808 Ala     43,817 Asp     33,731 Gly      U
           20,800 Val     34,770 Ala     25,996 Asp     40,396 Gly      C
           14,850 Val     27,468 Ala     53,780 Glu     10,902 Gly      A
           35,979 Val     45,862 Ala     24,312 Glu     15,118 Gly      G

Fields: [number] [amino acid]

Is the model too simple?

[Figure: model with a start codon state S1 (ATG, GTG, TTG), a single codon state S2 that emits any of the 64 codons, and a stop codon state S3 (TAG, TAA, TGA).]

HMM for gene finding

[Figure: model with a start codon state S1 (ATG, GTG, TTG), two codon states S2 and S3 that each emit any of the 64 codons, and a stop codon state S4 (TAG, TAA, TGA).]

Multiple coding models

[Figure: HMM with a start state S (ATG, GTG, TTG), a set of codon states that each emit any of the 64 codons, grouped into multiple coding models, and an end state E (TAG, TAA, TGA).]

Order of the model

A zero-order Markov model (state) assigns a probability to each letter in the state; these probabilities are independent of the preceding sequence. The NULL model is a zero-order Markov model (A=T=G=C=0.25). The probability of a letter in a first-order Markov model depends on the previous letter (dinucleotide distributions). In a second-order model it depends on the two previous letters (corresponding to a codon).

Order of the coding model

Inter-codon dependencies are correlations between amino acids that are typically found in proteins. They reflect typical features of proteins and can be used to improve the performance of the gene finder. Using higher-order coding models in gene finding is a way to capture these inter-codon dependencies. Higher-order models require more training data and more computational time when searching.
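
To make the notion of model order concrete, here is a minimal sketch of a k-th order Markov chain trained by counting and scored as a log-odds against the uniform NULL model (A=T=G=C=0.25). The function names, the pseudocount and the toy training sequence are assumptions made for this illustration; it is not the implementation of any particular gene finder.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def train_markov(seqs, order, pseudocount=1.0):
    """Estimate P(letter | previous `order` letters) by counting.

    Sequences are assumed to contain only A, C, G and T.
    """
    counts = defaultdict(lambda: dict.fromkeys(ALPHABET, pseudocount))
    for seq in seqs:
        for i in range(order, len(seq)):
            context = seq[i - order:i]   # empty string when order == 0
            counts[context][seq[i]] += 1
    return {
        ctx: {b: c / sum(row.values()) for b, c in row.items()}
        for ctx, row in counts.items()
    }

def log_odds(seq, model, order):
    """log2(P(seq | model) / P(seq | NULL)) with a uniform NULL model.

    The first `order` letters lack a full context and are skipped;
    contexts never seen in training fall back to the uniform distribution.
    """
    score = 0.0
    for i in range(order, len(seq)):
        probs = model.get(seq[i - order:i])
        p = probs[seq[i]] if probs else 0.25
        score += math.log2(p / 0.25)
    return score

# Hypothetical toy example: a first-order model trained on an AT repeat.
first_order = train_markov(["ATATATATAT"], order=1)
print(log_odds("ATAT", first_order, 1))  # positive: resembles the training data
print(log_odds("AAAA", first_order, 1))  # negative: resembles the background more
```

A model of order k has 4^k contexts with four probabilities each, which is one way to see why higher orders need more training data.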

The Shine-Dalgarno sequence

The ribosome binds to the messenger RNA through base pairing between the mRNA and the 16S rRNA of the 30S ribosomal subunit. The binding site on the mRNA is the Shine-Dalgarno sequence (SD). The SD is a purine-rich sequence (consensus sequence: AGGAG) at the 5' end of most prokaryotic mRNAs. The SD is found 5-10 base pairs upstream of the start codon.

EasyGene

R. prowazekii

GeneMark.hmm
http://opal.biology.gatech.edu/GeneMark/gmhmm2_prok.cgi
Lukashin A. and Borodovsky M., “GeneMark.hmm: new solutions for gene finding”, NAR, 1998, Vol. 26, No. 4, pp. 1107-1115.

EasyGene
http://cbs.dtu.dk/services/EasyGene
Schou Larsen T. and Krogh A., “EasyGene – A prokaryotic gene finder that ranks ORFs by statistical significance”, BMC Bioinformatics, 2003, 4:21.