Diapositiva 1

advertisement
Estructura de gene procariota
Promoter
CDS
UTR
Terminator
UTR
Genomic DNA
transcription
mRNA
translation
protein
Operons:
Prokaryotic
Gene Organisation
DNA
Promoter
Promoter
Leader
Repressor
or
Activator
5’
+1
Spacer
mRNA
3’
RNA
Polymerase
s
Ribosome
Transcription: 2 consensus sequences and the startpoint
- 10: TAATA
T80A95t45A60a50T96
- 35: TTGACA
T82T84G78A65C54a45
Translation: rbs (ribosomal binding site)
Shine Delgarno AGGAGG
Tailer
promotores and reguladores
en Procariotas
•
Promoter determines:
1. Which strand will serve as a template.
2. Transcription starting point.
3. Strength of polymerase binding.
•
RNA polymerase subunit for promoter recognition
is called sigma-factor
–
–
•
•
Different variations (7 for E. coli)
Consensus binding sequences (Table 6.2 in textbook)
Operons for co-transcription
Regulators affect the binding of RNA polymerase
to DNA (positive and negative)
Ejemplo de promotor procariota
• Pribnow box located at –10 (6-7bp)
• Promoter sequence located at -35
(6bp)
Secuencias Consenso
• Promoters sequences can vary
tremendously.
• RNA polymerase recognizes hundreds of
different promoters
Terminadores
• The terminator region pauses the
polymerase and causes disassociation.
Producción de un ARN maduro en eucariotas
•
The final mRNA may represent less than 5% of the transcribed
DNA sequence
Modelo simplificado de un gen humano
PROMOTOR
Secuencia que no se traduce
Intrón 1
Intrón 2
Secuencia que no se traduce
Intrón 3
5`
Región reguladora
3`
EXON 1
EXON 2
EXON 3
EXON 4
EXON n
Región reguladora
Unidad de transcripción
Después del procesamiento postranscripcional del ARN
transcrito primario, la secuencia de ARNm corresponde a las
secuencias de los exones y las no codificantes (intrones y UTRs).
Los genes eucariotas contienen
normalmente intrones
Tipos de genes en eucariotas
Protein encoding genes
• Transcription
• RNA Polymerase II dependent promoters
• Type II splicing
• Polyadenylation (exception histone mRNAs)
• Translation
RNA coding genes
• Transcription
• RNA Polymerase I and III dependent promoters
• Type I and III splicing
• No polyadenylation
• No translation
Estructura de un gen eucariota
• TATA box located at –25
– TATA(A/T)A(A/T)
– Recognized by TATA-binding protein
• Initiator sequence at +1
– YYCARR; Y is C/T, R is G/A
– +1 is usually the A
• Transcription factors bind to promoters
– Position specific scoring matrix (PSSM)
• Possible distant regions acting as enhancers or silencers
(even more than 50 kb).
– More complex mechanism than prokaryotes
La transcriptción puede ser modificada por
factores que actuan en trans:
activadores (enhancers) y silenciadores
El splicing alternativo puede
producir diferentes proteinas
con diferentes funciones
Contains domains
that adhere to cell
surfaces
Lacks domains
that adhere to cell
surfaces
Eukaryotic
Promoter
GC
CAAT
Proximal
Promoter
proximal
Gene Organisation
TSS
TATAPromoter
Inr
Core
core
Transcription:
core promoter: loosely conserved initiator region (Inr) around
TSS
~ - 25: TATA-box
proximal promoter: ~ - 75: CAT (CCAAT)
~ - 170: GC-box
enhancer/silencer: upstream or downstream to promoter
Translation:
• 5‘ Kozak sequence:
GCCACCATG
• 3‘ polyadenylation site: AATAAA
Eukaryote gene structure
vs. prokaryote gene structure
• No operons
• Capping at 5’ end and polyadenylation
at 3’ end
– Transport of mRNA out of nucleus
– Effects stability and efficiency of translation
• Introns
• Alternative splicing
Resumen
• Prokaryotic genes
promoter
gene
start
gene
gene
terminator
stop
• Eukaryotic genes
intron
intron
promoter
exon
start
exon
donor acceptor
exon
stop
Gene prediction:
Prokaryotes vs. Eukaryotes
Prokaryotes
• Conserved promoter region (-10, -35; fixed spacing)
• Contiguous open reading frames (ORF)
• Polycistronic mRNAs
• Short intergenic sequences
Good method: detecting large ORFs
• Complications:
• Sequencing errors
• very small genes will be missed
• Overlapping genes on both strands
Promoter and Gene prediction:
Prokaryotes vs. Eukaryotes
•Promoter elements
•core promoter
•initiator region (Inr)
•TATA box
•Downstream promoter element (DPE)
•proximal promoter: transcription factor (“TF”) binding sites
•CAAT box,
•GC box
•SP-1 sites
•GAGA boxes
•Enhancers/silencers sites (less useful)
•Coding sequence
•signal sensors (start and stop signals (Kozak sequence, stop codons), Polyadenylation signals, Splicing signals (3‘, 5‘ splice sites, splice junction, branchpoint)
•content sensors (base composition, codon usage, hexamer usage)
El reto
• The speed with which new data are collected increases
and exceedes the rate with which they could be analysed.
• Whole-genome sequences for more than 800 organisms
(bacteria, archaea, and eukaryota as well as many
viruses and organells) are either complete or being
determined.
• Across all sequenced species, nearly half of the potential
genes can not be assigned a specific role.
Los programas para la predición de genes
deberían ser capáces de identificar
automáticamente y anotar todos los genes
Three Basic Strategies for
Promoter and Gene Prediction
• Búsqueda por homología
• Análisis de señales en las secuencias
• Análisis estadísticos
¿Porqué homología?
Evolutionary relationships
Paralogues: homologous
proteins that perform different
but related functions within
one organism.
ancestor
Orthologues: homologous
proteins that perform the
same function in different
species.
species 1
species 2
species 3
Homology Searching
• Investigate sequence databases such as
EMBL or Swissprot with programs such
as BLAST or TFASTA.
• Orthologs / homologs / paralogs may have
been described. Sequence identity may
be low; several approaches should be tried.
• As more sequence data is collected, this initial
step becomes more important.
Low coverage, high accuracy
Three Basic Strategies for
Gene Prediction
• Homology searching
• Analysis of sequence signals
• Statistical analysis
¿Que señales se pueden emplear en
bioinformática para la predicción de
genes?
¿Que diferencia a los genes de otras
secuencias genómicas ?
Genomic sequences tend towards randomness;
Genes are non-random.
Base composition
Translated DNA sequences are restricted in the choice for
nucleotides in the first, second (and to a lesser extend) third
position of the codons.
Occurrence of a certain base in first, second and third position
of the potential codons will not be random.
123123123123123123123123123123123123123123123123123123
ATGATAGCTATACGGATCCGTAGCTAGATCAGTAGCGTGACTGCTGTCGTCATT
A(1,4,7...)=10 of 18 (Random sequence Exp=25%)
A(2,5,8...)=1 of 18 (Random sequence Exp=25%)
Confidence levels can be calculated because large sets of coding
and non-coding sequences have been analyzed.
Base composition bias
Frequency of the four different nucleotides at
the different codon positions in human coding
regions.
Our model gene
1011 tata
1066
1345
2427
3058
Growth Factor Mouse
Weakly expressed, tissue specific
GC-rich (57%; cds 66%)
’TATA’ promoter (1011-1017)
2 exons
Not an easily predictable gene !
Bottner M, Laaff M, Schechinger B, Rappold G, Unsicker K, Suter-Crazzolara C. Gene. 1999 (237):105-11.
Testcode
coding
non-coding
‘Period three constraint’ [J. Fickett, Nucl. Acids Res. 10(17); 5303-5318 (1982)].
The top and bottom regions predict coding and non-coding regions to a 95%
confidence level. Start and stop codons (dashes and diamonds) are indicated.
Base composition bias
Advantages:
• Input: the crude DNA sequence
• No information on reading frames is necessary.
• No information on organism specific codon usage
is needed.
Disadvantages:
• Short exons (<200bp) are ignored.
• Frameshift errors reduce the prediction success.
Codon usage bias
The frequency of usage of
each codon (per thousand)
in human coding regions.
The relative frequency of
each codon among
synonymous codons.
The human codon usage table
(http://www.kazusa.or.jp/codon/)
Codon usage bias
Frequency of usage
Relative Frequency
Leucine : Alanine : Tryptophan
Protein encoding DNA
= 6.9 : 6.5 : 1
Random DNA
= 6.0 : 4.0 : 1
(Species specific, example rat)
Most amino acids are encoded by more than one codon.
Leucine
TTG
TTA
CTG
CTA
CTT
human
12.5
7.2
40.2
6.9 12.7
rat
12.4
5.0
40.8
7.0 11.2
xenopus
14.4
9.1
26.1
8.4 15.9
yeast
27.1 26.4
10.4 13.4 12.2
Frequency dependent on species, level of gene expression.
CTC
19.4
20.4
12.6
5.4
Codon usage bias
Advantages:
• Input: the crude DNA sequence AND a codon frequency table
• No information on reading frame needed
Disadvantages:
• Weakly expressed genes have little bias
• Frameshift errors reduce the prediction success
Analysis of Sequence Signals
Content Sensors (Large sequence motifs):
• base composition
• codon usage
• hexamer usage
Signal Sensors (Short sequence motifs):
• Start/stop codons
• Splicing signals (3‘, 5‘ signals, branchpoint, splice junctions)
• Polyadenylation signals
• Transcription regulation signals (TF binding sites, promoters)
String matching
Input: A text string t of length n.
A patterns string p of length m.
Output: All instances of the
pattern in the text.
Patterns
• Use consensus sequence (pattern) for splice site,
Kozak sequence or transcription factor binding site.
• Disadvantage: many false positives.
TATA
...ATGATAGATATACAGATTATATAGATCGAT...
TATA
TATA-box
Startcodon
GCCACCAUGG
Kozak sequence
Polyadenylation
signals
YGUGUUYY (N)20-30 AAUAAA
Stop codons
UGA, UAA, UAG
Termination
sequences
(not well defined
in eukaryotes)
Splice Sites
5
3 5
B
B
5
5’ splice site
CAG/GTAAGTAG
3
B
3
3’ splice site
(T)10NCAG/G(C)
B
branchpoint
CT(G/A)A(C/T)
B
J J
+
3
3
J
splicejunction
MAG/G
9
Profile or Position Weigth Matrix
• Replace the pattern by a profile
• Employ training sets to build profile and to optimize
the algorithm.
Alignment
1234567...
ACATTAA...
TCAGAAT...
ACAGAAC...
AGATTAC...
ACCGAAC...
Profile
A
C
G
T
consensus
1234567...
4040351...
0410003...
0103000...
1002201...
ACAGAAC...
Three Basic Strategies for
Gene Prediction
• Homology searching
• Analysis of sequence signals
• Statistical analysis
Gribskov Profiles
What is a Gribskov Profile?
A Gribskov profile is a weight matrix of the
probabilities of appearance of amino acids in a
certain position in a multiple alignment.
Score for finding each aa at a certain position
POS
1
2
3
4
5
6
A
-2
-2
-2
-2
18
-42
C
115
895
-65
-223
-104
-64
D
-82
-302
-142
-62
-121
-221
E
F
-121
-401
-241
-81
-101
-161
-103
-203
-283
-223
-163
-23
G
56
-304
416
196
-163
-223
H ... L
-101 ... -103
-302 ... -103
-221 ... -343
38 ... -302
-181 ... -43
-181 ... 176
... S
T ... Y
... -21 -61 ... -101
... -101 -102 ... -202
... -21 -181 ... -282
... 139 -81 ... -162
... -159 218 ... -182
... -121 -42 ... -62
Gap
30
100
100
100
100
30
Gribskov Profiles
Differences between Gribskov Profiles and common
sequence comparison methods
A group of related sequences can be used to
build the profile
The profile includes position-specific penalties
for insertion and deletion
Gribskov Profiles
What is needed to create a Gribskov Profile?
A group of functionally related proteins
Globins
Immunoglobulins
Aligned by
seq1.pep
seq2.pep
seq3.pep
seq4.pep
seq5.pep
1
~CCGTL
GCGSL~
~CGHSV
~CGGTL
CCGSS~
Similarity
Three dimensional structure
A mutational distance matrix
Blosum62
PAM250 Dayhoff
Gribskov Profiles
seq1.pep
seq2.pep
seq3.pep
seq4.pep
seq5.pep
Sequence position-specific scoring
matrix M(p,a)
21 Columns
20 of them specify
1 specifies
Score of each aa at a certain position
Aligned positions
A
1
2
3
4
.
.
.
N
C
D
E
................
W
Y
Penalty for deletion or insertion
in that position
Number of positions in
the alignment
Gap
1
~CCGTL
GCGSL~
~CGHSV
~CGGTL
CCGSS~
Creating a Gribskov Profile
The profile is filled using the
Multiple alignment
Mutational distance matrix
20
M(p,a)=  b=1
W(p,b) * Y(a,b)
W(p,b) = n(b,p)/ NR
Weight of appearance of
aa b at position p
n(b,p) is the number of times that
aa b appears in position p
NR number of rows in the alignment
Y(a,b)
Value in the mutational
distance matrix
M(p,C)= W(p,W) * Y(C,W)
Mutational Distance Matrix
Blosum62 matrix
A
B
C
D
E
F
G
H
I
K
L
M
N
P
Q
R
S
T
V
W
X
Y
Z
W
A
4
-2
0
-2
-1
-2
0
-2
-1
-1
-1
-1
-2
-1
-1
-1
1
0
0
-3
-1
-2
-1
B
CC
D
E
F
G
H
I
K
L
M
N
P
Q
R
S
T
V
W
X
Y
Z
6
-3
6
2
-3
-1
-1
-3
-1
-4
-3
1
-1
0
-2
0
-1
-3
-4
-1
-3
2
9
-3
-4
-2
-3
-3
-1
-3
-1
-1
-3
-3
-3
-3
-1
-1
-1
-2
-1
-2
-4
6
2
-3
-1
-1
-3
-1
-4
-3
1
-1
0
-2
0
-1
-3
-4
-1
-3
2
5
-3
-2
0
-3
1
-3
-2
0
-1
2
0
0
-1
-2
-3
-1
-2
5
6
-3
-1
0
-3
0
0
-3
-4
-3
-3
-2
-2
-1
1
-1
3
-3
6
-2
-4
-2
-4
-3
0
-2
-2
-2
0
-2
-3
-2
-1
-3
-2
8
-3
-1
-3
-2
1
-2
0
0
-1
-2
-3
-2
-1
2
0
4
-3
2
1
-3
-3
-3
-3
-2
-1
3
-3
-1
-1
-3
5
-2
-1
0
-1
1
2
0
-1
-2
-3
-1
-2
1
4
2
-3
-3
-2
-2
-2
-1
1
-2
-1
-1
-3
5
-2
-2
0
-1
-1
-1
1
-1
-1
-1
-2
6
-2
0
0
1
0
-3
-4
-1
-2
0
7
-1
-2
-1
-1
-2
-4
-1
-3
-1
5
1
0
-1
-2
-2
-1
-1
2
5
-1
-1
-3
-3
-1
-2
0
4
1
-2
-3
-1
-2
0
5
0
-2
-1
-2
-1
4
-3
-1
-1
-2
11
-1
2
-3
-1
-1
-1
7
-2
5
-2
Creating a Gribskov Profile
The profile is filled using the
Multiple alignment
Mutational distance matrix
20
M(p,a)=  b=1
W(p,b) * Y(a,b)
W(p,b) = n(b,p)/ NR
Weight of appearance of
aa b at position p
n(b,p) is the number of times that
aa b appears in position p
NR number of rows in the alignment
Y(a,b)
Value in the mutational
distance matrix
M(p,C)= W(p,W) * Y(C,W)
Y(C,W) = -2
Creating a Gribskov Profile
M(p,a)=
20
 b=1
Alignment
seq1.pep
seq2.pep
seq3.pep
seq4.pep
seq5.pep
W(p,b) * Y(a,b)
M(1,A)=  b=1 W(1,b) * Y(A,b)
1
~CCGTL
GCGSL~
~CGHSV
~CGGTL
CCGSS~
M(1,A)= ( W(1,A) * Y(A,A) ) + (W(1,C) * Y(A,C) ) +......+ ( W(1, Y) *Y(A,Y) )
M(1,A)= ( 0.025/6 * 4) + ( 1/6 * 0 ) +......+ ( 0.025/6 * -1)
aa not present in a position
get a very small weight 0,025/NR
M(1,C)=  b=1 W(1,b) * Y(C,b)
Consensus
sequence
symbol with largest value in each position
(CCGGTL)
Score for finding each aa at a certain position
POS
1
2
3
4
5
6
A
-2
-2
-2
-2
-2
18
-42
C
115
895
-65
-223
-104
-64
D
-82
-302
-142
-62
-121
-221
E
F
-121
-401
-241
-81
-101
-161
-103
-203
-283
-223
-163
-23
G
56
-304
416
196
-163
-223
H ... L
-101 ... -103
-302 ... -103
-221 ... -343
38 ... -302
-181 ... -43
-181 ... 176
... S
T ... Y
... -21 -61 ... -101
... -101 -102 ... -202
... -21 -181 ... -282
... 139 -81 ... -162
... -159 218 ... -182
... -121 -42 ... -62
Gap
30
100
100
100
100
30
Scoring with a Gribskov Profile
Alignment
seq1.pep
seq2.pep
seq3.pep
seq4.pep
seq5.pep
Consensus sequence
symbol with largest value in each position
(CCGGTL)
1
~CCGTL
GCGSL~
~CGHSV
~CGGTL
CCGSS~
P(CCGGTL)= Pp1(C)* Pp2(C)* Pp3(G)* Pp4(G)* Pp5(T) * Pp6(L)
P(CCGGTL)= log Pp1(C)+ log Pp2(C)+ log Pp3(G)+ log Pp4(G)+ log
Pp5(T)+ log Pp6(L)
Probability of any sequence is calculated in the same way
Score for finding each aa at a certain position
POS
1
2
3
4
5
6
A
-2
-2
-2
-2
-2
18
-42
C
115
895
-65
-223
-104
-64
D
-82
-302
-142
-62
-121
-221
E
F
-121
-401
-241
-81
-101
-161
-103
-203
-283
-223
-163
-23
G
56
-304
416
196
-163
-223
H ... L
-101 ... -103
-302 ... -103
-221 ... -343
38 ... -302
-181 ... -43
-181 ... 176
... S
T ... Y
... -21 -61 ... -101
... -101 -102 ... -202
... -21 -181 ... -282
... 139 -81 ... -162
... -159 218 ... -182
... -121 -42 ... -62
Gap
30
100
100
100
100
30
Introduction
Gribskov Profile
Definition
Creating a Gribskov Profile
Scoring a sequence with a Profile
Hidden Markov Models
Definition
State order of an HMM
Basic Architecture
Scoring a sequence with an HMM
Building a Hidden Markov Model
Estimation of the model
Problems building an HMM
Biological application of HMMs
HMM programs in HUSAR
Advantages of using Markov Models
P=0.6
Markov Models are probabilistic,
models, with a solid statistical
foundation
In contrast to patterns and profiles,
HMMs allow consistent treatment of
insertions and deletions
A
P=0.1
P=0.2
C
P=0.09
P=0.01
T
G
C
-
In contrast to patterns and profiles, Markov Models take into
account the information about neighboring residues.
Hidden Markov Model
Domain 1 (active binding site)
Domain 2 (never found, inactive)
Domain 3 (never found, inactive)
Domain 4 (active)
123456…
ATGTCGTCGTCG
ATGTGGTCGTCG
ATGTCATCGTCG
ATGTGATCGTCG
Markov Model is based on active domains only !!
If a G is found at position 3,
P(T)4=1.0
If a T is found at position 4,
P(C)5=0.5, P(G)5=0.5
If a C is found at position 5,
P(A)6=0.0, P(G)6=1.0
If a G is found at position 5,
P(A)6=1.0, P(G)6=0.0
If a G is found at position 6,
P(T)7=1.0
If an A is found at position 6,
P(T)7=1.0
Order state of HMMs
Markov Models take into account additional information
about neighboring residues.
First order Markov Model
Captures the first order correlation
between neighboring nucleotides
HMM models can use preceding,
succeeding or surrounding residues
Fifth order Markov Model
There is no real limit in the number
of preceding residues that can be
used for an HMM (computing time!)
Biological applications of HMMs
Gene finding
Protein secondary structure
prediction
Protein homology recognition
Phylogenetic analysis
Radiation hybrid mapping
Profile HMM libraries
Genetic linkage mapping
(Birney & Durbin, 1997;
Henderson, 1997; Krogh, 1997;
Lukashin & Borodovsky, 1998)
(Goldman et al., 1996)
(Karplus et al., 1999)
(Felsenstein & Churchill, 1996)
(Sloniw et al., 1997)
(PROSITE; Pfam database)
(Krushyak et al., 1996)
Hidden Markov Model
transitions
states
t 1,1
t
t 1,2
A
2,2
t
B
P1(a)
P2(a)
P1(b)
P2(b)
a
b
a
2,end
End
HMM
Observed symbol sequence
P(aba|HMM) =P1(a) t 1,1 P1(b) t 1,2 P2(a) t
2,end
Hidden Markov Model
transitions
states
0.9
0.9
1.0
Start
E
PA=(0.25)
PC=(0.25)
PG=(0.25)
PT=(0.25)
0.1
1.0
5
I
PA=(0.05)
PC=(0)
PG=(0.95)
PT=(0)
PA=(0.4)
PC=(0.1)
PG=(0.1)
PT=(0.4)
0.1
End
Hidden Markov Model
Markov Models assume that sequences are
generated independently of the model
Applied to time series or to linear sequences
Basic Architecture of a profile HMM
0.3
d1
Start
d2
d3
End
0.06
i1
i0
Probabilities
m1
C
from information
contained in
Alignment
A
C
D
E
F
G
H
I
.
.
.
Y
0.01
0.015
i3
i2
m2
C
m3
Y
A
C
D
E
F
G
H
I
.
.
.
Y
A
C
D
E
F
G
H
I
.
.
.
Y
0.5
Match states
Model the distribution
of symbols in the
corresponding
column of an alignment
0.01
Methods in Gene Prediction
Ab initio analysis of genomic sequences:
Genscan (Burge and Karlin 1997)
HMMer (Haussler et al. 1993, Krogh et al. 1994)
FGenesH (Solovyev and Salamov 1994)
Comparison of protein and genomic sequences:
Procrustes (Gelfand et al. 1996)
Genewise (Birney and Durbin)
Cross-species genomic sequence comparisons:
CEM (Bafna and Huson 2000)
TWINSCAN (Korf et al. 2001)
Doublescan Meyer and Durbin 2002)
SLAM (Alexandersson et al. 2003)
Gene prediction programs (many with homology searching capabilities)
GeneMachine
http://genome.nhgri.nih.gov/genemachine
Genscan
http://genome.dkfz-heidelberg.de
GenomeScan
http://genes.mit.edu/genomescan
Fgenesh, Fgenes-M, TSSW, TSSG, Polyah, SPL and RNAS http://genomic.sanger.ac.uk/gf/gf.shtml
Fgenesh, Fgenes-M, SPL and RNASPL http://www.softberry.com/berry.phtml
HMMgene
http://www.cbs.dtu.dk/services/HMMgene
Genie
http://www.fruitfly.org/seq_tools/genie.html
GeneMark
http://www.ebi.ac.uk/genemark
GeneID
http://www1.imim.es/software/geneid/geneid.html#top
GeneParser
http://beagle.colorado.edu/~eesnyder/GeneParser.html
MZEF and POMBE
http://argon.cshl.org/genefinder/
AAT, MZEF with homology
http://genome.cs.mtu.edu/aat.html
MZEF with SpliceProximalCheck
http://industry.ebi.ac.uk/~thanaraj/MZEF-SPC.html
Genesplicer, Glimmer and GlimmerM
http://www.tigr.org/~salzberg
WebGene
http://www.itba.mi.cnr.it/webgene
GenLang
http://www.cbil.upenn.edu/genlang/genlang_home.html
Xpound
ftp://igs-server.cnrs-mrs.fr/pub/Banbury/xpound
Gene-prediction programs: alignment based
Procrustes
http://www-hto.usc.edu/software/procrustes/index.hl
GeneWise2
http://www.sanger.ac.uk/Software/Wise2
SplicePredictor
http://bioinformatics.iastate.edu/cgi-bin/sp.cgi
PredictGenes
http://cbrg.inf.ethz.ch/Server/subsection3_1_8.html
Gene-prediction programs: comparative genomics
Doublescan
http://www.sanger.ac.uk/Software/analysis/doublescan
SLAM
http://bio.math.berkeley.edu/slam
Twinscan
http:/ genes.cs.wustl.edu
Finding ORFs and splice sites
DioGenes
http://www.cbc.umn.edu/diogenes/index.html
OrfFinder
http://www.ncbi.nlm.nih.gov/gorf/gorf.html
YeastGene
http://tubic.tju.edu.cn/cgi-bin/Yeastgene.cgi
CDS: search coding regions
http://bioweb.pasteur.fr/seqanal/interfaces/cds-simple.html
Neural network splice site prediction
http://www.fruitfly.org/seq_tools/splice.html
NetGene2 http://www.cbs.dtu.dk/services/NetGene2
RNA gene prediction
tRNAScan http://www.genetics.wustl.edu/eddy/tRNAscan-SE/
FGENES, FGENEH, FGENESH(+)
• Victor Solovyev and coleagues
• FGENE applications are based on HMMs
• They form a complete, partially automated, modular package
• Dynamic modelling with various features of coding sequences
• Precise determination of exon borders with homology search
1011 tata
1066 1345
2427
3058
GENIE (UCLA)
• Combination of statistical methods (HMM) and neural networks
• A candidate sequence is "threaded" through the HMM using a min-cost
path search algorithm and the system reports this "optimal" path as the
predicted gene structure.
1011 tata
1066 1345
2427
3058
GrailEXP
• Widely used for genbank annotations
• GrailEXP predicts exons, genes, promoters, polyAs,
CpG islands, EST similarities, and repetitive elements
1011 tata
1066 1345
2427
3058
Genscan (Chris Burge and Samuel
Karlin)
• Genescan employs a dynamic programming strategy.
• General three-periodic (inhomogeneous) fifth order
Markov Model.
• Transcription-, translation- and splicing signals.
• Length distributions and compositional features of introns,
exons and intergenic regions.
• Exceptional: It was developed to recognize partial and
multiple genes on both strands.
• Independent of databases.
1011 tata
1066 1345
2427
3058
TWINSCAN (I. Korf et al.,
2001)
• TWINSCAN models both gene structure and evolutionary
conservation
• Scores of features (e.g. splice sites, coding regions)
are modified using the patterns of divergence between
the target genome and a closely related genome.
Prediction of a subsequence of the mouse
genome
alignments to human genomic sequences
repeat sequences reported
TWINSCAN
GENSCAN
actual gene structure
1011 tata
1066 1345
2427
3058
cDNA
protein
GLIMMER (Salzberg and colleagues,
JHU)
• finding genes in microbial DNA.
• combination of Markov models from first through
eighth order, weighting each model according to
its predictive power.
• Widely used for genbank annotations.
How(not) to use bioinformatics tools
• No single bioinformatics tool is 100 % accurate (colleagues and
developers may tell you the opposite).
• Common pitfall: for which organism was the application developed?
• Repetitive elements (such as the mouse L1 element) can be
wrongly recognized as genes.
• Bioinformatics rule: try several approaches, try to understand
why they may give apparently contradicting results.
Evaluation of Gene Prediction
Tools
The ideal testset is a segment of DNA for
which all genes have been described experimentally.
Gene
prediction
tool
Specificity = true predicted
/ all predicted
Measure for false positives: 9 / 11 = 81.8%
Sensitivity = true predicted / true genes
Measure for false negatives: 9 / 10 = 90%
Accuracy versus G+C content
Accuracy versus G+C content
1,00
0,90
0,80
0,70
Accuracy
0,60
0 - 40%
40 - 50%
50 - 60%
60 - 100%
0,50
0,40
0,30
0,20
0,10
0,00
FGENES
GeneMark
Genie
Genscan
Morgan
http://www.cse.ucsc.edu/~rogic/evaluation/tablesgen.html
MZEF
Exon accuracy
Exon accuracy
1,00
0,90
0,80
Exon accuracy
0,70
0,60
Sensitivity
(false negatives)
0,50
Specificity
(false positives)
0,40
Partially correct
predicted
0,30
0,20
0,10
0,00
FGENES
GeneMark
Genie
Genscan
HMMgene
Morgan
http://www.cse.ucsc.edu/~rogic/evaluation/tablesgen.html
MZEF
Accuracy versus exon length
Accuracy versus exon length
1,00
0,90
0,80
0,70
0 - 24
25 - 49
50 - 74
75 - 99
100 - 199
200 - 299
300 +
Accuracy
0,60
0,50
0,40
0,30
0,20
0,10
0,00
FGENES
GeneMark
Genie
Genscan
HMMgene
Morgan
http://www.cse.ucsc.edu/~rogic/evaluation/tablesgen.html
MZEF
Accuracy versus exon type
Accuracy
versus exon type
1,00
0,90
0,80
0,70
Accuracy
0,60
Initial
Internal
Terminal
Single
0,50
0,40
0,30
0,20
0,10
0,00
FGENES
GeneMark
Genie
Genscan
HMMgene
http://www.cse.ucsc.edu/~rogic/evaluation/tablesgen.html
Morgan
Open Problems and Future
Directions
•Near the 90% of the nucleotides can be identified correctly, but
exact boundaries of the exons and their assemblies into
complete coding sequences are much more difficult to predict.
Less than the half of the genes are predicted exactly correct.
•Multiple protein products correspond to a single gene through
alternative splicing, alternative transcription or alternative
translation has not been dealt with effectively.
•Promoter recognition
Download