S1 - Center for Biological Sequence Analysis

Center for Biological Sequence Analysis
Prokaryotic gene finding
Marie Skovgaard
Ph.D. student
marie@cbs.dtu.dk
Prokarya
Can you spot the gene?
>AE006641
GTATACTCTTCTTCCCTATACATTGTCGCAGCAAGCTTAGTTTCTTTAGCCTCTCTGCTTTCATTATTAC
TTATAATCTTAATAGCAAGGAGACATATGATAGAGTATTTCTATATGATTCCTTCGTTCGTTTATATGAA
CTTTATTGTCGCACTAAACTTCACTGCAATATTTTTAGAGTTAATAAGAGCACCTAGAGTGTGGGTAAAA
ACTGAAAGAAGTGCCAAGGTTACGGGGGAGGTCATGGGATGATAACTGAATTTTTACTTAAAAAGAAATT
AGAAGAACATTTAAGCCATGTAAAGGAAGAGAATACGATATATGTAACAGATTTAGTAAGATGCCCCAGA
AGAGTAAGATATGAGAGTGAATACAAGGAGCTTGCAATCTCTCAGGTTTACGCGCCTTCAGCTATTTTAG
GGGACATATTGCATCTCGGTCTTGAAAGCGTATTAAAAGGGAACTTTAATGCAGAAACTGAAGTTGAAAC
TCTGAGAGAAATTAACGTCGGAGGTAAAGTTTATAAAATTAAAGGAAGAGCCGATGCAATAATTAGAAAT
GACAACGGGAAGAGTATTGTAATTGAGATAAAAACTTCTAGAAGTGATAAAGGATTACCTCTAATTCATC
ATAAAATGCAGCTACAGATATATTTATGGTTATTTAGTGCAGAAAAAGGTATACTAGTTTACATAACTCC
AGATAGGATAGCTGAGTATGAAATAAACGAACCTTTAGATGAAGCAACAATAGTAAGACTTGCAGAGGAT
ACAATAATGTTACAAAACTCACCTAGATTCAACTGGGAATGTAAATATTGCATATTTTCCGTCATTTGCC
CAGCTAAACTAACCTAAAATTAAAATCTCTCATCGATATAATTAAATTGTGCACACTAGACCAGTAGTTG
CCACAATAGCTGGGAGTGACAGTGGAGGAGGTGCTGGATTACAGGCTGATCTAAAGACGTTTAGCGCATT
AGGAGTTTTTGGTACAACAATAATAACCGGTTTAACAGCACAGAATACAAGAACAGTTACAAAAGTATTA
GAGATACCATTAGATTTCATTGAAGCTCAGTTTGATGCGGTTTGCCTAGATTTACATCCAACTCACGCCA
AAACTGGAATGTTAGCTTCTGGTAAAGTGGTAGAACTTGTACTGAGAAAAATTAGAGAGTATAACATAAA
ACTAGTTTTAGATCCAGTGATGGTTGCGAAATCTGGATCATTATTGGTAACAGAGGATATCTCGGAGCAA
ATAAAAAAGGCGATGAAGGAGGCCATAATATCTACTCCAAACAGATATGAAGCTGAGATAATAAATAAGA
CAAAGATTAATAGTCAAGATGATGTTATAAAAGCGGCAAGGGAAATTTATTCTAAGTATGGGAATGTTGT
AGTTAAAGGATTTAATGGAGTAGATTACGCCATAATTGACGGAGAAGAAATAGAGTTAAAAGGTGATTAC
ATCAGTACTAAAAATACACATGGTAGTGGAGACGTATTTTCTGCCTCCATAACTGCATATCTTGCCTTGG
GATACAAACTTAAAGATGCATTAATAAGAGCTAAAAAATTCGCTACAATGACAGTCAAATACGGTTTGGA
CTTAGGAGGAGGATATGGACCAGTAGATCCCTTTGCCCCTATAGAGTCCATAGTGAAGAGAGAAGAAGGA
AGAAATCAGCTAGAAAACTTACTTTGGTACTTAGAGTCTAATCTTAACGTTATACTTAAACTAATTAACG
/(ATG|TTG|GTG)((...)*?)(TGA|TAG|TAA)/
Identifying open reading frames
/(ATG|TTG|GTG)((...)*?)(TGA|TAG|TAA)/
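The pattern above can be tried directly. The following is a minimal sketch in Python, assuming the three dots stand for "any codon" and that the input is a single newline-free DNA string; the function name `find_orfs` and the `min_codons` parameter are illustrative and not part of the slides. It scans the forward strand only and reports non-overlapping matches.

```python
import re

# Start codon, a lazy run of whole codons, then the first in-frame stop codon.
ORF_PATTERN = re.compile(r"(ATG|TTG|GTG)((?:...)*?)(TGA|TAG|TAA)")

def find_orfs(dna, min_codons=0):
    """Return (start, end, sequence) for each non-overlapping match.

    min_codons is a hypothetical filter on the number of codons between
    the start and stop codon; it is not part of the original pattern.
    """
    dna = dna.upper()
    orfs = []
    for match in ORF_PATTERN.finditer(dna):
        inner_codons = len(match.group(2)) // 3
        if inner_codons >= min_codons:
            orfs.append((match.start(), match.end(), match.group(0)))
    return orfs

print(find_orfs("CCATGAAATTTTAGCC"))  # [(2, 14, 'ATGAAATTTTAG')]
```

Because the codon group is lazy, each match ends at the first in-frame stop codon after the start codon. A complete scan would also need the reverse strand and overlapping candidates, which this sketch leaves out.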
A. pernix
(43% AT)
Why care about over-annotated genes?
Genome comparison:
• Fraction of known proteins
• Average gene length
• Amino acid composition
The quality of our databases
To gain biological knowledge
Regular expression
Regular expression:
/[AT][CG][AC][ACGT]*A[TG][CG]/
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATG
ACCG--ATC
The regular expression is able to find all possible sequences, but it does not distinguish between the consensus sequence and a highly unlikely sequence:
ACAC--ATC or TGCT--AGG
Weight matrices can be used to score the sequence, but they do not deal with insertions and deletions.
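A short Python check illustrates the point. It is only a sketch: the gap characters are removed before matching, and the name `MOTIF` is illustrative. Both the consensus-like sequence and the unlikely one are accepted, and the regular expression gives no score to tell them apart.

```python
import re

# The regular expression from the slide, applied to ungapped sequences.
MOTIF = re.compile(r"[AT][CG][AC][ACGT]*A[TG][CG]")

for candidate in ("ACACATC", "TGCTAGG"):
    # fullmatch only answers yes or no; it says nothing about how typical
    # the sequence is compared with the aligned examples.
    print(candidate, bool(MOTIF.fullmatch(candidate)))
```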
Markov model
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATG
ACCG--ATC
[Figure: Markov model derived from the alignment. Each state lists emission probabilities for A, C, G and T, and the arrows between states carry transition probabilities.]
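One plausible way to obtain per-state emission probabilities is to count the letters in each column of the alignment. The sketch below, in Python, assumes exactly that and treats every alignment column separately, ignoring gaps; it does not pool the inserted region into a single state and does not estimate transition probabilities, so it is not a reconstruction of the figure. The names `ALIGNMENT` and `column_frequencies` are illustrative.

```python
from collections import Counter

# The five aligned sequences shown on the slide.
ALIGNMENT = [
    "ACA---ATG",
    "TCAACTATC",
    "ACAC--AGC",
    "AGA---ATG",
    "ACCG--ATC",
]

def column_frequencies(alignment):
    """Relative letter frequencies per alignment column, ignoring gaps."""
    freqs = []
    for column in zip(*alignment):
        letters = [c for c in column if c != "-"]
        counts = Counter(letters)
        total = len(letters)
        freqs.append({nt: counts[nt] / total for nt in "ACGT" if counts[nt]})
    return freqs

for index, freq in enumerate(column_frequencies(ALIGNMENT), start=1):
    print(index, freq)
```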
Profile HMM
Profile HMMs have a predefined architecture, and the parameters are estimated from multiple sequence alignments.
Profile HMMs are not useful for gene finding, since all the genes in an organism cannot be aligned in a meaningful way.
[Figure: profile HMM architecture running from a Begin state to an End state.]
Markov Model for gene finding
Define a simple architecture:
/(ATG|TTG|GTG)((...)*?)(TGA|TAG|TAA)/
[Figure: five-state model. S1 emits a start codon (ATG, GTG, TTG); S2, S3 and S4 each emit one nucleotide (A, T, G, C); S5 emits a stop codon (TAG, TAA, TGA).]
Markov models
Knowledge of the structure of genes is used to define the architecture of the model.
Sequences (x) from known genes are used to estimate the parameters of the model; this is the training of the model.
The training is done by counting the number of times a nucleotide occurs in a given state and dividing this number by the number of sequences used in training, which gives the frequencies.
[Figure: a sequence aligned to the states of the model, beginning with the start codons (ATG, GTG, TTG) in S1.]
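Training by counting can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure used for any particular gene finder: each training gene is taken to be a start codon, a whole number of codons, and a stop codon, and the counts in each state are divided by the total count in that state (for S1 and S5 this equals the number of training sequences). The function name `train` and the toy genes are hypothetical.

```python
from collections import Counter

STATES = ("S1", "S2", "S3", "S4", "S5")

def train(genes):
    """Estimate emission frequencies for the five-state model by counting."""
    counts = {state: Counter() for state in STATES}
    for gene in genes:
        if len(gene) < 6 or len(gene) % 3 != 0:
            raise ValueError("expected start codon + codons + stop codon")
        counts["S1"][gene[:3]] += 1    # start codon
        counts["S5"][gene[-3:]] += 1   # stop codon
        for i, nucleotide in enumerate(gene[3:-3]):
            # Codon positions 1, 2, 3 are emitted by S2, S3, S4 in turn.
            counts[STATES[1 + i % 3]][nucleotide] += 1
    model = {}
    for state, counter in counts.items():
        total = sum(counter.values())
        model[state] = {symbol: n / total for symbol, n in counter.items()}
    return model

# Toy training set (made up for illustration).
toy_genes = ["ATGAAATAG", "ATGCCCTAA", "GTGACGTAG", "ATGTTTTGA"]
print(train(toy_genes)["S1"])
```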
Training
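[Figure: the training sequence x1, x2, x3, ..., xn aligned to the states of the model. S1 emits a start codon (ATG, GTG, TTG); S2, S3 and S4 each emit one nucleotide (A, T, G, C); S5 emits a stop codon (TAG, TAA, TGA).]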
Model after training
Emission probabilities after training (the figure also carries the label 0.98):
S1: ATG: 0.77, TTG: 0.11, GTG: 0.12, CTG: 0.00
S2: A: 0.22, T: 0.24, G: 0.27, C: 0.27
S3: A: 0.25, T: 0.23, G: 0.27, C: 0.25
S4: A: 0.26, T: 0.24, G: 0.25, C: 0.25
S5: TAG: 0.6, TAA: 0.3, TGA: 0.1
The trained model can be used to search for
genes in DNA sequences.
Searching with the HMM
Sequence: ATG A T T T C G C G C G A T ... T A G
[Figure: table of states (S1-S5) against sequence positions. The probability is accumulated along the path through the states; all other cells are 0.00.]
S1 (ATG): 0.77
S2 (A): 0.22*0.77
S3 (T): 0.23*0.22*0.77
S4 (T): 0.24*0.23*0.22*0.77
Continuing the product along the sequence gives P(x|M).
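The running product can be reproduced with a small Python sketch. It assumes the emission probabilities of the trained model as read from the slide, follows the single path start codon, codon positions, stop codon, and, like the arithmetic on the slide, ignores transition probabilities. The name `gene_probability` is illustrative.

```python
# Emission probabilities read from the "Model after training" slide.
EMISSIONS = {
    "S1": {"ATG": 0.77, "TTG": 0.11, "GTG": 0.12},
    "S2": {"A": 0.22, "T": 0.24, "G": 0.27, "C": 0.27},
    "S3": {"A": 0.25, "T": 0.23, "G": 0.27, "C": 0.25},
    "S4": {"A": 0.26, "T": 0.24, "G": 0.25, "C": 0.25},
    "S5": {"TAG": 0.6, "TAA": 0.3, "TGA": 0.1},
}

def gene_probability(seq):
    """P(x|M) as a product of emissions (transitions are ignored here)."""
    if len(seq) < 6 or len(seq) % 3 != 0:
        raise ValueError("expected start codon + codons + stop codon")
    p = EMISSIONS["S1"].get(seq[:3], 0.0)
    for i, nucleotide in enumerate(seq[3:-3]):
        state = ("S2", "S3", "S4")[i % 3]
        p *= EMISSIONS[state][nucleotide]
    p *= EMISSIONS["S5"].get(seq[-3:], 0.0)
    return p

# Start codon ATG, one codon ATT, stop codon TAG.
print(gene_probability("ATGATTTAG"))
```

For long sequences the product underflows quickly, so an implementation would normally sum logarithms instead of multiplying probabilities.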
Log-Odds score
The probability of a sequence becomes vanishingly small as the sequence x becomes longer.
This is solved by defining a background (NULL) model, for example a random distribution:
A=T=C=G=0.25
From this the Log-Odds score can be calculated:
log(P(x|M)/P(x|NULL))
A high Log-Odds score corresponds to a sequence that looks more like the gene model than the background model.
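As a small worked example in Python, the sketch below computes the log-odds score as log(P(x|M)/P(x|NULL)) with the uniform background A=T=C=G=0.25, so that a positive score favours the gene model. The function name `log_odds` is illustrative, and the example value is the four-state product 0.24*0.23*0.22*0.77 for the first six nucleotides of the searched sequence.

```python
import math

def log_odds(p_model, length, background=0.25):
    """log(P(x|M) / P(x|NULL)) for a sequence of `length` nucleotides.

    The NULL model emits every nucleotide with probability `background`,
    so P(x|NULL) = background ** length.
    """
    if p_model <= 0.0:
        return float("-inf")
    return math.log(p_model) - length * math.log(background)

# Probability of ATG A T T under the gene model, six nucleotides in total.
p_prefix = 0.24 * 0.23 * 0.22 * 0.77
score = log_odds(p_prefix, length=6)
print(score)  # positive: the prefix fits the gene model better than NULL
```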
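Is the model too simple?
[Figure: the five-state model. S1 emits a start codon (ATG, GTG, TTG); S2, S3 and S4 each emit one nucleotide (A, T, G, C); S5 emits a stop codon (TAG, TAA, TGA).]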
Codon usage
Synonymous codons encode the same amino acid. At random, synonymous codons would be expected to be used with equal frequencies. In real life, synonymous codons have different frequencies.
Different species have consistent and characteristic codon biases. Laterally transferred genes and genes from plasmids and phages will have atypical codon usage.
Variations in codon usage within an organism can be modelled in different coding models in the HMM.
Codon usage table. Fields: [number] [amino acid]

1st position  3rd position  2nd position U  2nd position C  2nd position A  2nd position G
U             U             30,407 Phe      11,523 Ser      22,048 Tyr      7,062 Cys
U             C             22,581 Phe      11,766 Ser      16,669 Tyr      8,846 Cys
U             A             18,943 Leu      9,793 Ser       2,706 Stop      1,260 Stop
U             G             18,629 Leu      12,195 Ser      326 Stop        20,756 Trp
C             U             15,018 Leu      9,569 Pro       17,631 His      28,458 Arg
C             C             15,104 Leu      7,491 Pro       13,272 His      29,968 Arg
C             A             5,316 Leu       11,496 Pro      20,912 Gln      4,860 Arg
C             G             71,710 Leu      31,614 Pro      39,285 Gln      7,404 Arg
A             U             41,375 Ile      12,223 Thr      24,189 Asn      11,982 Ser
A             C             34,261 Ile      31,889 Thr      29,529 Asn      21,907 Ser
A             A             5,967 Ile       9,683 Thr       45,812 Lys      2,899 Arg
A             G             37,994 Met      19,682 Thr      14,076 Lys      1,694 Arg
G             U             24,910 Val      20,808 Ala      43,817 Asp      33,731 Gly
G             C             20,800 Val      34,770 Ala      25,996 Asp      40,396 Gly
G             A             14,850 Val      27,468 Ala      53,780 Glu      10,902 Gly
G             G             35,979 Val      45,862 Ala      24,312 Glu      15,118 Gly
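The bias among synonymous codons can be made concrete with the leucine counts from the table. The Python sketch below assumes the usual table layout, in which the four counts within each cell are ordered by third position U, C, A, G; the codon labels in `LEU_COUNTS` follow from that assumption, and the function name `relative_usage` is illustrative.

```python
# Leucine codon counts taken from the table (codon labels assume the
# conventional U, C, A, G ordering of the third position).
LEU_COUNTS = {
    "UUA": 18943, "UUG": 18629,
    "CUU": 15018, "CUC": 15104, "CUA": 5316, "CUG": 71710,
}

def relative_usage(counts):
    """Fraction of each synonymous codon among all codons for one amino acid."""
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

usage = relative_usage(LEU_COUNTS)
uniform = 1 / len(LEU_COUNTS)  # expected fraction if all six were used equally
for codon, fraction in sorted(usage.items(), key=lambda item: -item[1]):
    print(f"{codon}: {fraction:.3f} (uniform expectation {uniform:.3f})")
```

With equal usage each of the six leucine codons would account for about one sixth of the total; in these counts CUG alone accounts for close to half.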
Is the model too simple?
[Figure: three-state model. S1 emits a start codon (ATG, GTG, TTG), S2 emits one of the 64 codons listed in the figure, and S3 emits a stop codon (TAG, TAA, TGA).]
HMM for gene finding
[Figure: four-state model. S1 emits a start codon (ATG, GTG, TTG), S2 and S3 each emit one of the 64 codons listed in the figure, and S4 emits a stop codon (TAG, TAA, TGA).]
Multiple coding models
[Figure: model with multiple coding models. A start state S (ATG, GTG, TTG) is followed by twelve codon states, each listing all 64 codons, and an end state E (TAG, TAA, TGA).]
Order of the model
A zero order Markov model (state) assigns a probability to each letter in the state; the probabilities are independent of the previous sequence. The NULL model is a zero order Markov model (A=T=G=C=0.25).
The probability of a letter in a first order Markov model depends on the previous letter (dinucleotide distributions).
In a second order model it depends on the two previous letters (corresponding to a codon).
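The difference between the orders can be sketched in Python by estimating the conditional letter probabilities from counts. This is a minimal illustration, assuming plain maximum-likelihood counting without pseudocounts; the function name `train_markov` and the toy sequence are hypothetical.

```python
from collections import Counter, defaultdict

def train_markov(sequences, order):
    """Estimate P(letter | previous `order` letters) by counting.

    order=0 gives plain letter frequencies (the context is the empty string),
    order=1 conditions on one previous letter, order=2 on two.
    """
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(order, len(seq)):
            context = seq[i - order:i]
            counts[context][seq[i]] += 1
    return {
        context: {letter: n / sum(counter.values())
                  for letter, n in counter.items()}
        for context, counter in counts.items()
    }

toy = ["ACACAG"]
print(train_markov(toy, 0))  # one distribution, context ""
print(train_markov(toy, 1))  # one distribution per previous letter
```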
Order of the coding model
Inter-codon dependencies are correlations between amino acids typically found in proteins. They reflect typical features of proteins and can be used to improve the performance of the gene finder.
The use of higher order coding models in gene finding is a way to capture these inter-codon dependencies.
Higher order models require more training data and more computational time when searching.
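The cost of a higher order can be illustrated by counting parameters. The Python sketch below assumes a nucleotide-level model over the four letters A, C, G, T, where a model of order k has 4^k contexts and each context needs three free probabilities (the fourth follows because they sum to one). The function names are illustrative.

```python
def context_count(order, alphabet_size=4):
    """Number of distinct contexts a model of the given order conditions on."""
    return alphabet_size ** order

def free_parameters(order, alphabet_size=4):
    """Free emission probabilities: one distribution per context."""
    return context_count(order, alphabet_size) * (alphabet_size - 1)

for k in range(6):
    print(f"order {k}: {context_count(k)} contexts, "
          f"{free_parameters(k)} free parameters")
```

Each added order multiplies the number of parameters by four, which is why the amount of training data needed grows quickly with the order.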
The Shine-Dalgarno sequence
The ribosome binds to the messenger RNA through base pairing with the 16S rRNA of the 30S ribosomal subunit.
The binding site is the Shine-Dalgarno sequence (SD).
The SD is a purine-rich sequence (consensus sequence: AGGAG) at the 5' end of most prokaryotic mRNAs.
The SD is found 5-10 base pairs upstream from the start codon.
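A simple check for an SD-like site can be sketched in Python. The sketch assumes an exact match to the consensus AGGAG and measures the 5-10 base pair distance as the number of nucleotides between the end of the motif and the first base of the start codon; both are simplifying assumptions, and the function name and parameters are illustrative. Real gene finders typically score such sites rather than requiring an exact match.

```python
def has_shine_dalgarno(seq, start, motif="AGGAG", min_gap=5, max_gap=10):
    """True if `motif` ends min_gap..max_gap nucleotides upstream of `start`.

    `start` is the 0-based index of the first base of the start codon.
    """
    for gap in range(min_gap, max_gap + 1):
        end = start - gap
        begin = end - len(motif)
        if begin >= 0 and seq[begin:end] == motif:
            return True
    return False

# Made-up example: AGGAG, a 7-nucleotide spacer, then an ATG start codon.
SEQ = "TTAGGAG" + "AAAAAAA" + "ATGAAATAG"
print(has_shine_dalgarno(SEQ, SEQ.find("ATG")))
```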
EasyGene
R. prowazekii
GeneMark.hmm
http://opal.biology.gatech.edu/GeneMark/gmhmm2_prok.cgi
Lukashin A. and Borodovsky M., “GeneMark.hmm: new solutions for gene finding”, NAR, 1998, Vol. 26, No. 4, pp. 1107-1115.
EasyGene
http://cbs.dtu.dk/services/EasyGene
Schou Larsen T. and Krogh A., “EasyGene – A prokaryotic gene finder that ranks ORFs by statistical significance”, BMC Bioinformatics, 2003, 4:21.