#27 - Gene Prediction II 10/24/07 BCB 444/544 Gene Prediction II

BCB 444/544
Lecture 27
Gene Prediction II
#27_Oct24
BCB 444/544 F07 ISU Dobbs #27 - Gene Prediction II

Required Reading (before lecture)
Mon Oct 22 - Lecture 26
Gene Prediction
• Chp 8 - pp 97 - 112
Wed Oct 24 - Lecture 27 (will not be covered on Exam 2)
Promoter & Regulatory Element Prediction
• Chp 9 - pp 113 - 126
Thurs Oct 25 - Review Session & Project Planning
Fri Oct 26 - EXAM 2

Assignments & Announcements
Mon Oct 22 - Study Guide for Exam 2 was posted, finally…
Mon Oct 22 - HW#4 Due (no "correct" answer to post)
Thu Oct 25 - no Lab => Optional Review Session for Exam
544 Project Planning/Consult with DD & MT
Fri Oct 26 - Exam 2 - Will cover:
• Lectures 13-26 (thru Mon Oct 22)
• Labs 5-8
• HW# 3 & 4
• All assigned reading:
  Chps 6 (beginning with HMMs), 7-8, 12-16
  Eddy: What is an HMM
  Ginalski: Practical Lessons…

BCB 544 "Team" Projects
• 544 Extra HW#2 is next step in Team Projects
  • Write ~ 1 page outline
  • Schedule meeting with Michael & Drena to discuss topic
  • Read a few papers
  • Write a more detailed plan
• You may work alone if you prefer
• Last week of classes will be devoted to Projects
• Written reports due: Mon Dec 3 (no class that day)
• Oral presentations (15-20') will be: Wed-Fri Dec 5,6,7
• 1 or 2 teams will present during each class period
=> See Guidelines for Projects posted online

BCB 544 Only: New Homework Assignment
544 Extra#2 (posted online Thurs?)
No - sorry! sent by email on Sat…
Due:
PART 1 - ASAP
PART 2 - Fri Nov 2 by 5 PM
Part 1 - Brief outline of Project, email to Drena & Michael
after response/approval, then:
Part 2 - More detailed outline of project
Read a few papers and summarize status of problem
Schedule meeting with Drena & Michael to discuss ideas

Seminars this Week
BCB List of URLs for Seminars related to Bioinformatics:
http://www.bcb.iastate.edu/seminars/index.html
• Oct 25 Thur - BBMB Seminar 4:10 in 1414 MBB
  • Dave Segal, UC Davis: Zinc Finger Protein Design
• Oct 19 Fri - BCB Faculty Seminar 2:10 in 102 ScI
  • Guang Song, ComS, ISU: Probing functional mechanisms by structure-based modeling and simulations

Chp 8 - Gene Prediction
SECTION III GENE AND PROMOTER PREDICTION
Xiong: Chp 8 Gene Prediction
• Categories of Gene Prediction Programs
• Gene Prediction in Prokaryotes
• Gene Prediction in Eukaryotes

What is a Gene?
What is a gene? A segment of DNA, some of which is "structural," i.e., transcribed to give a functional RNA product, & some of which is "regulatory"
• Genes can encode:
  • mRNA (for protein)
  • other types of RNA (tRNA, rRNA, miRNA, etc.)
• Genes differ in eukaryotes vs prokaryotes (& archaea), in both structure & regulation

Synthesis & Processing of Eukaryotic mRNA
[Figure: Gene in DNA (5' exon 1, intron, exon 2, intron, exon 3 3') -> Transcription -> primary (1°) transcript (RNA) -> Capping & polyadenylation -> Splicing (remove introns) -> Mature mRNA (5' 7MeG cap ... AAAAA 3') -> Export to cytoplasm]

What are cDNAs & ESTs?
cDNA libraries are important for determining gene structure & studying regulation of gene expression
• Isolate RNA (always from a specific organism, region, and time point)
• Convert RNA to complementary DNA (with reverse transcriptase)
• Clone into cDNA vector
• Sequence the cDNA inserts
• Short cDNAs are called ESTs or Expressed Sequence Tags
• ESTs are strong evidence for genes
• Full-length cDNAs can be difficult to obtain
[Figure labels: vector, insert]

UniGene: Unique genes via ESTs
• Find UniGene at NCBI:
  www.ncbi.nlm.nih.gov/UniGene
• UniGene clusters contain many ESTs
• UniGene data come from many cDNA libraries.
  When you look up a gene in UniGene, you can obtain information re: level & tissue distribution of expression

Gene Prediction in Prokaryotes vs Eukaryotes
Prokaryotes
• Small genomes, 0.5 to 10 × 10^6 bp
• About 90% of genome is coding
• Simple gene structure
• Prediction success ~99%
[Figure: Promoter, Start codon (ATG), Open reading frame (ORF), Stop codon (TAA)]
Eukaryotes
• Large genomes, 10^7 to 10^10 bp
• Often less than 2% coding
• Complicated gene structure (splicing, long introns)
• Prediction success 50-95%
[Figure: Promoter, 5' UTR, ATG, Exons, Introns, Splice sites, TAA, 3' UTR]

Prediction is Easier in Microbial Genomes
Why?
• Smaller genomes
• Simpler gene structures
• Many more sequenced genomes! (for comparative approaches)
Many microbial genomes have been fully sequenced & whole-genome "gene structure" and "gene function" annotations are available
e.g., GeneMark.hmm, Glimmer
TIGR Comprehensive Microbial Resource (CMR)
NCBI Microbial Genomes

Gene Prediction - The Problem
Problem:
Given a new genomic DNA sequence, identify coding regions and their predicted RNA and protein sequences
ATTACCATGGGGCAGGGTCAGATATAATGCCCTCATTTT
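
As a minimal sketch of the simplest version of this problem, the following scans the forward strand of the example sequence above for open reading frames (ATG through an in-frame stop codon). The function name `find_orfs` and the `min_codons` parameter are illustrative assumptions, not part of any tool named in these slides; real prokaryotic gene finders such as GeneMark.hmm or Glimmer use statistical models rather than a plain ORF scan.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) for each ATG...stop ORF on the forward strand.

    start is the index of the A in ATG; end is one past the stop codon.
    (Hypothetical helper for illustration only.)
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # first ATG seen in this frame since the last stop
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if codon == "ATG" and start is None:
                start = i
            elif codon in STOP_CODONS and start is not None:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs

seq = "ATTACCATGGGGCAGGGTCAGATATAATGCCCTCATTTT"
for start, end, frame in find_orfs(seq):
    print(frame, start, end, seq[start:end])
```

On the slide's sequence this reports a single ORF, ATG GGG CAG GGT CAG ATA TAA, in frame 0. A second ATG near the end has no in-frame stop codon, so it is not reported.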

Computational Gene Prediction: Approaches
• Ab initio methods
  • Search by signal: find DNA sequences involved in gene expression.
  • Search by content: Test statistical properties distinguishing coding from non-coding DNA
• Similarity-based methods
  • Database search: exploit similarity to proteins, ESTs, cDNAs
  • Comparative genomics: exploit aligned genomes
    • Do other organisms have similar sequence?
• Hybrid methods - best

Computational Gene Prediction: Algorithms
1. Neural Networks (NNs) (more on these later…)
   e.g., GRAIL
2. Linear discriminant analysis (LDA) (see text)
   e.g., FGENES, MZEF
3. Markov Models (MMs) & Hidden Markov Models (HMMs)
   e.g., GeneSeqer - uses MMs
   GENSCAN - uses 5th order HMMs - (see text)
   HMMgene - uses conditional maximum likelihood (see text)

Gene Prediction Strategies
What sequence signals can be used?
• Transcription: TF binding sites, promoter, initiation site, terminator, GC islands, etc.
• Processing signals: Splice donor/acceptors, polyA signal
• Translation: Start (AUG = Met) & stop (UGA, UAA, UAG), ORFs, codon usage
What other types of information can be used?
• Homology (sequence comparison, BLAST)
• cDNAs & ESTs (experimental data, pairwise alignment)

Signals Search
Approach: Build models (PSSMs, profiles, HMMs, …) and search against DNA. Detected instances provide evidence for genes
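
The signal-search approach can be sketched with a small position-specific scoring matrix (PSSM). The example donor sites, the uniform background of 0.25, the pseudocount, and the score threshold below are all hypothetical values chosen for illustration; they are not taken from any program mentioned in the lecture.

```python
import math

def build_pssm(sites, background=0.25, pseudocount=1.0):
    """Column-wise log2-odds scores from equal-length example sites."""
    pssm = []
    for pos in range(len(sites[0])):
        column = [s[pos] for s in sites]
        scores = {}
        for base in "ACGT":
            freq = (column.count(base) + pseudocount) / (len(sites) + 4 * pseudocount)
            scores[base] = math.log2(freq / background)
        pssm.append(scores)
    return pssm

def scan(dna, pssm, threshold):
    """Slide the PSSM along dna; return (position, score) for windows at or above threshold."""
    width = len(pssm)
    hits = []
    for i in range(len(dna) - width + 1):
        score = sum(pssm[j][dna[i + j]] for j in range(width))
        if score >= threshold:
            hits.append((i, round(score, 2)))
    return hits

# Hypothetical donor-site examples: two exon bases, GT, four intron bases.
donor_sites = ["AGGTAAGT", "AGGTGAGT", "CGGTAAGT", "AGGTAAGA", "TGGTGAGT"]
pssm = build_pssm(donor_sites)
dna = "CCATCAGGTAAGTCTTTACGATCG"
hits = scan(dna, pssm, threshold=5.0)
print(hits)
```

Each reported hit is only evidence for a signal, not a gene call; in practice such scores are combined with content statistics and similarity data.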

DNA Signals Used in Gene Prediction
1. Exploit the regular gene structure
   ATG - Exon1 - Intron1 - Exon2 - … - ExonN - STOP
2. Recognize "coding bias"
   CAG-CGA-GAC-TAT-TTA-GAT-AAC-ACA-CAT-GAA-…
3. Recognize splice sites
   Intron - cAGt - Exon - gGTgag - Intron
4. Model the duration of regions
   Introns tend to be much longer than exons, in mammals
   Exons are biased to have a given minimum length
5. Use cross-species comparison
   Gene structure is conserved in mammals
   Exons are more similar (~85%) than introns

Content Search
Observation: Encoding a protein affects statistical properties of DNA sequence:
• Nucleotide composition
• Hexamer frequency
• GC content (CpG islands, exon/intron)
• Uneven usage of synonymous codons (codon bias)
Method: Evaluate these differences (coding statistics) to differentiate between coding and non-coding regions

Predicting Genes based on Codon Usage Differences
Algorithm:
• Process sliding window
• Use codon frequencies to compute probability of coding versus non-coding
• Plot log-likelihood ratio:
  $\log\left(\frac{P(S \mid \text{coding})}{P(S \mid \text{non-coding})}\right)$
[Figure: Human Codon Usage table; Coding Profile of ß-globin gene, with exons marked]
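
A minimal sketch of the sliding-window log-likelihood ratio follows. The "coding" codon frequencies are estimated from a tiny made-up training sequence with pseudocounts, and the non-coding model is assumed to be uniform over the 64 codons; both are illustrative assumptions rather than real human codon usage. The window length is assumed to be a multiple of 3.

```python
import math
from itertools import product

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]

def codon_frequencies(seqs, pseudocount=1.0):
    """Estimate codon frequencies from in-frame sequences (toy training step)."""
    counts = {c: pseudocount for c in CODONS}
    for s in seqs:
        for i in range(0, len(s) - 2, 3):
            counts[s[i:i + 3]] += 1
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

def window_log_odds(dna, coding, noncoding, window=30, step=3):
    """log(P(S | coding) / P(S | non-coding)) for each window, read in frame 0."""
    scores = []
    for start in range(0, len(dna) - window + 1, step):
        total = 0.0
        for i in range(start, start + window, 3):
            codon = dna[i:i + 3]
            total += math.log(coding[codon] / noncoding[codon])
        scores.append((start, total))
    return scores

# Hypothetical training data: favors GCC, AAG, CTG, GAG.
coding = codon_frequencies(["GCCAAGCTGGAGGCCAAGCTGGAG"])
noncoding = {c: 1 / 64 for c in CODONS}

test_seq = "GCCAAGCTGGAG" + "TTTTATTTATAT"
profile = window_log_odds(test_seq, coding, noncoding, window=12, step=12)
print(profile)
```

Windows made of the favored codons score above zero and the other window scores below zero, which is the behavior the coding profile plot relies on.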

Similarity-Based Methods: Database Search
In different genomes: Translate DNA into all 6 reading frames and search against proteins (TBLASTX, BLASTX, etc.)
ATTGCGTAGGGCGCT
TAACGCATCCCGCGA
Within same genome: Search with EST/cDNA database (EST2genome, BLAT, etc.).
Problems:
• Will not find "new" or RNA genes (non-coding genes).
• Limits of similarity are hard to define
• Small exons might be overlooked

Similarity-Based Methods: Comparative Genomics
Idea: Functional regions are more conserved than non-functional ones; high similarity in alignment indicates gene
human GGTTTT--ATGAGTAAAGTAGACACTCCAGTAACGCGGTGAGTAC----ATTAA
        |     ||||| ||||| |||   ||||||||||||| |||||    |  |
mouse C-TCAGGAATGAGCAAAGTCGAC---CCAGTAACGCGGTAAGTACATTAACGA-
Advantages:
• May find uncharacterized or RNA genes
Problems:
• Finding suitable evolutionary distance
• Finding limits of high similarity (functional regions)

Human-Mouse Homology
Comparison of 1196 orthologous genes
• Sequence identity between genes in human vs mouse
  Exons: 84.6%
  Protein: 85.4%
  Introns: 35%
  5' UTRs: 67%
  3' UTRs: 69%
[Figure labels: Human, Mouse]

Gene Prediction Flowchart
[Figure: Fig 5.15, Baxevanis & Ouellette 2005]

Predicting Genes - Basic steps:
• Obtain genomic sequence
• BLAST it!
• Perform database similarity search (with EST & cDNA databases, if available)
• Translate in all 6 reading frames (i.e., "6-frame translation")
• Compare with protein sequence databases
• Use Gene Prediction software to locate genes
• Compare results obtained using different programs
• Analyze regulatory sequences, too
• Refine gene prediction

Predicting Genes - a few Details:
1. 1st, mask to "remove" repetitive elements (ALUs, etc.)
2. Perform database search on translated DNA (BlastX, TFasta)
3. Use several programs to predict genes & find ORFs (GENSCAN, GeneSeqer, GeneMark.hmm, GRAIL)
4. Search for functional motifs in translated ORFs & in neighboring DNA sequences (InterPro, Transfac)
5. Repeat
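
The "6-frame translation" step can be sketched as follows. This is a minimal illustration using the standard genetic code; the function names are assumptions made here, and programs such as BlastX perform the translation internally before searching protein databases.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]

def translate(dna):
    """Translate in frame 0; unknown codons become 'X', '*' marks a stop."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frame_translation(dna):
    """Return translations keyed '+1', '+2', '+3', '-1', '-2', '-3'."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

frames = six_frame_translation("ATTACCATGGGGCAGGGTCAGATATAATGCCCTCATTTT")
for name, protein in frames.items():
    print(name, protein)
```

Each of the six protein strings would then be compared against a protein database; a long stretch without `*` that matches a known protein is evidence for a coding region in that frame.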

Thanks to Volker Brendel, ISU for the following Figs & Slides
Slightly modified from:
BBSI Genome Informatics Module
http://www.bioinformatics.iastate.edu/BBSI/course_desc_2005.html#moduleB
V Brendel vbrendel@iastate.edu

GeneSeqer
[Figure: GeneSeqer workflow, with boxes: Genomic Sequence; EST or protein database; Fast Search (Suffix Array/Suffix Tree); Spliced Alignment; Assembly; Output. Brendel 2005]
Brendel et al (2004) Bioinformatics 20: 1157

GeneSeqer - Brendel et al.- ISU
http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi
Brendel et al (2004) Bioinformatics 20: 1157
http://bioinformatics.oxfordjournals.org/cgi/content/abstract/20/7/1157

Spliced Alignment Algorithm
• Perform pairwise alignment with large gaps in one sequence (due to introns)
  • Align genomic DNA with cDNA, ESTs, protein sequences
• Score semi-conserved sequences at splice junctions
  • Using Bayesian probability model & 1st order MM
• Score coding constraints in translated exons
  • Using Bayesian model

Signals: Pre-mRNA Splicing
[Figure: Genomic DNA (start codon, stop codon) -> Transcription -> pre-mRNA (Cap-, -Poly(A)) -> Splicing -> mRNA (Cap-, -Poly(A)) -> Translation -> Protein. Inset: EXON / INTRON boundary with splice sites, donor (GT) and acceptor (AG). Brendel 2005]

Brendel - Spliced Alignment I & II:
• Compare with cDNA or EST probes
• Compare with protein probes
[Figures: Genomic DNA (start codon, stop codon; donor site, intron, acceptor site AG) aligned with mRNA (Cap, 5'-UTR, 3'-UTR, Poly(A)) and with Protein. Brendel 2005]
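
To make the idea of "pairwise alignment with large gaps in one sequence" concrete, here is a deliberately simplified toy, not the GeneSeqer algorithm: it places a cDNA on genomic DNA by exact matching, allowing at most one gap that must look like an intron (starts with GT, ends with AG). The function name, the `min_intron` parameter, and the example sequences are hypothetical. GeneSeqer itself uses dynamic programming with probabilistic splice-site and mismatch scoring, which this sketch does not attempt.

```python
def spliced_match(genomic, cdna, min_intron=4):
    """Exact-match placements of cdna on genomic with one GT...AG intron.

    Returns a list of dicts with half-open (start, end) coordinates.
    (Toy illustration only: no mismatches, exactly one intron.)
    """
    results = []
    for split in range(1, len(cdna)):
        left, right = cdna[:split], cdna[split:]
        a = genomic.find(left)
        while a != -1:
            donor = a + split  # index of the first intron base
            c = genomic.find(right, donor + min_intron)
            while c != -1:
                intron = genomic[donor:c]
                if intron.startswith("GT") and intron.endswith("AG"):
                    results.append({"exon1": (a, donor),
                                    "intron": (donor, c),
                                    "exon2": (c, c + len(right))})
                c = genomic.find(right, c + 1)
            a = genomic.find(left, a + 1)
    return results

# Hypothetical example: exon1 + intron (GT...AG) + exon2, flanked by CC.
genomic = "CC" + "ATGGCA" + "GTAAGTTTTCAG" + "GATTGA" + "CC"
cdna = "ATGGCA" + "GATTGA"
placements = spliced_match(genomic, cdna)
print(placements)
```

The GT/AG requirement is what keeps the intron boundaries from sliding; without a splice-site model, several equally good gap placements are often possible, which is why the real algorithm scores splice junctions explicitly.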

Splice Site Detection
Do DNA sequences surrounding splice "consensus" sequences contribute to splicing signal? YES
• Information Content $I_i$:
  $I_i = 2 + \sum_{B \in \{U, C, A, G\}} f_{iB} \log_2 (f_{iB})$
• Extent of Splice Signal Window:
  $I_i \ge \bar{I} + 1.96\,\sigma_{\bar{I}}$
i: ith position in sequence
Ī: avg information content over all positions >20 nt from splice site
σĪ: avg sample standard deviation of Ī
Brendel 2005

Information Content vs Position
[Figure: information content (y-axis, 0.0 to 0.8) vs position (x-axis, -50 to 50); two panels: Human T2_GT and Human T2_AG]
Which sequences are exons & which are introns?
How can you tell?
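
The information content defined above can be computed directly from a set of aligned splice-site sequences. The sketch below is a minimal illustration: the tiny aligned "sites" are made up, T is treated as U, and the cutoff uses the mean plus 1.96 sample standard deviations of a supplied list of background values, which is one plausible reading of the definitions on the slide rather than a confirmed detail of Brendel's method.

```python
import math
import statistics

def information_content(sites):
    """I_i = 2 + sum over B in {U, C, A, G} of f_iB * log2(f_iB), per column i."""
    sites = [s.upper().replace("T", "U") for s in sites]
    n = len(sites)
    result = []
    for i in range(len(sites[0])):
        column = [s[i] for s in sites]
        total = 2.0
        for base in "UCAG":
            f = column.count(base) / n
            if f > 0:  # the limit of f * log2(f) as f -> 0 is 0
                total += f * math.log2(f)
        result.append(total)
    return result

def signal_positions(ic, background):
    """Positions whose information content is >= mean + 1.96 * sd of background."""
    cutoff = statistics.mean(background) + 1.96 * statistics.stdev(background)
    return [i for i, value in enumerate(ic) if value >= cutoff]

# Hypothetical aligned sites: variable first column, invariant GGU afterwards.
ic = information_content(["AGGU", "CGGU", "AGGU", "UGGU"])
print(ic)
print(signal_positions([0.1, 1.8, 2.0, 0.05], [0.05, 0.1, 0.0, 0.08, 0.02]))
```

An invariant column carries the full 2 bits, and a column with uniform base usage carries 0 bits, so peaks in the plot mark positions that contribute to the splice signal.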
Brendel 2005

Donor (GT) & Acceptor (AG) Sites Used for Model Training
Number of True Splice Sites / Phase

Species               Type   Phase 1   Phase 2   Phase 3
Homo sapiens          GT        6586      5277      3037
                      AG        6555      5194      2979
Mus musculus          GT        1212      1185       521
                      AG        1194      1139       504
Rattus norvegicus     GT         450       408       147
                      AG         442       386       140
Gallus gallus         GT         288       238       107
                      AG         284       228       103
Drosophila            GT         989       670       524
                      AG        1001       671       536
C. elegans            GT       37029     20500     20789
                      AG       36864     20325     20626
S. pombe              GT         170       118       119
                      AG         179       122       118
Aspergillus           GT         221       176       157
                      AG         217       172       163
Arabidopsis thaliana  GT       23019      9297      8653
                      AG       22929      9247      8611
Zea mays              GT         316       107        88
                      AG         311       104        83
Brendel 2005

Markov Model for Spliced Alignment
[Figure: Markov model for spliced alignment, with exon states e_n, e_n+1 and intron states i_n, i_n+1; transition probabilities shown: PΔG, (1-PΔG)(1-PD(n+1)), (1-PΔG)PD(n+1), 1-PA(n), PA(n)PΔG. Brendel 2005]

Evaluation of Predictions

                    Actual True    Actual False
Predicted True      TP             FP              PP=TP+FP
Predicted False     FN             TN              PN=FN+TN
                    AP=TP+FN       AN=FP+TN

• Misclassification rates: $\alpha = FN / AP$, $\beta = FP / AN$
• Sensitivity: $S_n = TP / AP = 1 - \alpha$
• Specificity: $S_p = TP / PP = \frac{1 - \alpha}{1 - \alpha + r\beta}$, where $r = AN / AP$
• Normalized specificity: $\sigma = \frac{1 - \alpha}{1 - \alpha + \beta}$
Do not memorize this!

Evaluation of Predictions - in English
• Sensitivity: $S_n = TP / AP$ (also called coverage or recall)
  In English? Sensitivity is the fraction of all positive instances having a true positive prediction.
  IMPORTANT: Sensitivity alone does not tell us much about performance because a 100% sensitivity can be trivially achieved by labeling all test cases positive!
• Specificity: $S_p = TP / PP$
  In English? Specificity is the fraction of all predicted positives that are, in fact, true positives.
  IMPORTANT: in medical jargon, Specificity is sometimes defined differently (what we define here as "Specificity" is sometimes referred to as "Positive predictive value")

Best Measures for Comparison?
• ROC curves (Receiver Operating Characteristic (?!!))
  http://en.wikipedia.org/wiki/Roc_curve
  In signal detection theory, a receiver operating characteristic (ROC), or ROC curve, is a plot of sensitivity vs (1 - specificity) for a binary classifier system as its discrimination threshold is varied.
  The ROC can also be represented equivalently by plotting fraction of true positives (TPR = true positive rate) vs fraction of false positives (FPR = false positive rate, FP / AN, i.e., β above)
• Correlation Coefficient
  Matthews correlation coefficient (MCC)
  MCC = 1 for a perfect prediction
        0 for a completely random assignment
       -1 for a "perfectly incorrect" prediction
  Do not memorize this!

GeneSeqer Performance?
[Figure: Sn and σ (y-axis, 0.00 to 1.00) plotted over an x-axis running from -10 to 20; four panels: Human GT site, Human AG site, A. thaliana GT site, A. thaliana AG site]
Plots such as these (& ROCs) are much better than using a "single
number" to compare different methods
Such plots illustrate trade-off: Sn vs Sp
Note: the above are not ROC curves (plots of Sn vs 1-Sp)
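
The evaluation measures used in these slides can be computed from the four confusion-matrix counts. The sketch below follows the definitions given in the lecture (Sn = TP/AP, Sp = TP/PP, α = FN/AP, β = FP/AN, r = AN/AP). The MCC formula is the standard textbook one and does not appear on the slides, and the counts in the example are made up.

```python
import math

def evaluate(tp, fp, fn, tn):
    """Return the lecture's evaluation measures for one set of predictions."""
    ap, an = tp + fn, fp + tn          # actual positives / negatives
    pp = tp + fp                       # predicted positives
    alpha, beta = fn / ap, fp / an     # misclassification rates
    r = an / ap
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Sn": tp / ap,                               # sensitivity = 1 - alpha
        "Sp": tp / pp,                               # specificity as defined here (PPV)
        "Sp_from_rates": (1 - alpha) / (1 - alpha + r * beta),
        "sigma": (1 - alpha) / (1 - alpha + beta),   # normalized specificity
        "MCC": (tp * tn - fp * fn) / denom if denom else 0.0,
    }

# Hypothetical counts: 100 true sites, 900 false sites.
m = evaluate(tp=90, fp=30, fn=10, tn=870)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Because Sp depends on r, the ratio of false to true sites in the test set, it drops sharply when negatives greatly outnumber positives, whereas σ does not. This is why the GeneSeqer results report σ alongside Sp.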
Brendel 2005

GeneSeqer Results on Different Genomes

                               Test Site Set    Bayes
Species       Model   Site    True     False   Factor   Sn (%)   σ (%)   Sp (%)
Homo sapiens  2C      GT       921     44411     0       98.5    90.5    16.4
                                                 3       91.7    96.3    34.8
                                                 6       66.3    98.5    57.6
                      AG       920     65103     0       96.3    88.4     9.7
                                                 3       90.3    92.9    15.7
                                                 6       76.1    96.1    25.6
Drosophila    2C      GT       329     11501     0       95.4    94.8    34.1
                                                 3       90.0    97.6    53.6
                                                 6       83.9    99.1    75.0
                      AG       329     14920     0       95.7    94.8    28.7
                                                 3       92.1    97.0    41.4
                                                 6       85.1    98.5    59.4
C. elegans    7C      GT       400      7460     0       97.8    92.7    40.4
                                                 3       94.2    97.1    64.3
                                                 6       84.8    99.1    85.4
                      AG       400     10132     0       98.8    97.2    58.2
                                                 3       96.2    98.8    76.9
                                                 6       90.2    99.5    88.5
A. thaliana   7C      GT       613      9027     0       99.5    93.2    48.1
                                                 3       95.6    97.6    73.2
                                                 6       87.1    99.3    91.0
                      AG       614     10196     0       99.2    92.3    41.9
                                                 3       96.4    96.4    62.0
                                                 6       87.1    98.6    81.2
Brendel 2005

Performance of GeneSeqer vs Others?
• Comparison with ab initio gene prediction:
  vs GENSCAN, an HMM-based ab initio method
• "Winner" depends on:
  • Availability of ESTs
  • Level of similarity to protein homologs

GeneSeqer vs GENSCAN (Exon prediction)
[Figure: Exon (Sn + Sp) / 2 (y-axis, 0.00 to 1.00) vs target protein alignment score (x-axis, 0 to 100) for GeneSeqer, NAP, and GENSCAN]
GENSCAN - Burge, MIT
Brendel 2005

GeneSeqer vs GENSCAN (Intron prediction)
[Figure: Intron (Sn + Sp) / 2 (y-axis, 0.00 to 1.00) vs target protein alignment score (x-axis, 0 to 100) for GeneSeqer, NAP, and GENSCAN]
GENSCAN - Burge, MIT
Brendel 2005

GeneSeqer: Input
GeneSeqer: Output
http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi
Brendel 2005
GeneSeqer: Gene Evidence Summary
Gene Prediction - Problems & Status?
Common errors?
• False positive intergenic regions:
• 2 annotated genes actually correspond to a single gene
• False negative intergenic region:
• One annotated gene structure actually contains 2 genes
• False negative gene prediction:
• Missing gene (no annotation)
• Other:
• Partially incorrect gene annotation
• Missing annotation of alternative transcripts
Current status?
• For ab initio prediction in eukaryotes: HMMs have better overall
performance for detecting intron/exon boundaries
• Limitation? Training data: predictions are organism specific
• Combined ab initio/homology based predictions: Improved accuracy
• Limitation? Availability of identifiable sequence homologs in databases
Brendel 2005

Other Gene Prediction Resources: at ISU
http://www.bioinformatics.iastate.edu/bioinformatics2go/

Recommended Gene Prediction Software
• Ab initio
  • GENSCAN: http://genes.mit.edu/GENSCAN.html
  • GeneMark.hmm: http://exon.gatech.edu/GeneMark/
  • others: GRAIL, FGENES, MZEF, HMMgene
• Similarity-based
  • BLAST, GenomeScan, EST2Genome, Twinscan
• Combined:
  • GeneSeqer, ROSETTA
=> Consensus: because results depend on organisms & specific task, always use more than one program!
• Two servers that report consensus predictions
  • GeneComber
  • DIGIT

Other Gene Prediction Resources:
GaTech, MIT, Stanford, etc.
Lists of Gene Prediction Software
http://www.bioinformaticsonline.org/links/ch_09_t_1.html
http://cmgm.stanford.edu/classes/genefind/
Current Protocols in Bioinformatics (BCB/ISU owns a copy - currently in my lab!)
Chapter 4 Finding Genes
4.1 An Overview of Gene Identification: Approaches, Strategies, and Considerations
4.2 Using MZEF To Find Internal Coding Exons
4.3 Using GENEID to Identify Genes
4.4 Using GlimmerM to Find Genes in Eukaryotic Genomes
4.5 Prokaryotic Gene Prediction Using GeneMark and GeneMark.hmm
4.6 Eukaryotic Gene Prediction Using GeneMark.hmm
4.7 Application of FirstEF to Find Promoters and First Exons in the Human Genome
4.8 Using TWINSCAN to Predict Gene Structures in Genomic DNA Sequences
4.9 GrailEXP and Genome Analysis Pipeline for Genome Annotation
4.10 Using RepeatMasker to Identify Repetitive Elements in Genomic Sequences