BCB 444/544 Gene Prediction Lecture 26 #26_Oct22

BCB 444/544
Lecture 26
Gene Prediction
#26_Oct22
BCB 444/544 F07 ISU Dobbs #26 - Gene Prediction
10/22/07
Required Reading
(before lecture)
Mon Oct 22 - Lecture 26
Gene Prediction
• Chp 8 - pp 97 - 112
Wed Oct 24 - Lecture 27
(will not be covered on Exam 2)
Regulatory Element Prediction
• Chp 9 - pp 113 - 126
Thurs Oct 25 - Review Session & Project Planning
Fri Oct 26 - EXAM 2
Assignments & Announcements
Sun Oct 21 - Study Guide for Exam 2 was posted
Mon Oct 22 - HW#4 Due
(no "correct" answer to post)
Thu Oct 25 - Lab = Optional Review Session for Exam
544 Project Planning/Consult with DD & MT
Fri Oct 26 - Exam 2 - Will cover:
• Lectures 13-26 (thru Mon Oct 22)
• Labs 5-8
• HW# 3 & 4
• All assigned reading:
Chps 6 (beginning with HMMs), 7-8, 12-16
Eddy: What is an HMM
Ginalski: Practical Lessons…
BCB 544 "Team" Projects
• 544 Extra HW#2 is next step in Team Projects
• Write ~ 1 page outline
• Schedule meeting with Michael & Drena to discuss topic
• Read a few papers
• Write a more detailed plan
• You may work alone if you prefer
• Last week of classes will be devoted to Projects
• Written reports due: Mon Dec 3 (no class that day)
• Oral presentations (15-20') will be: Wed-Fri Dec 5,6,7
• 1 or 2 teams will present during each class period
See Guidelines for Projects posted online
BCB 544 Only:
New Homework Assignment
544 Extra#2 (posted online Thurs?)
No - sorry! sent by email on Sat…
Due:
PART 1 - ASAP
PART 2 - Fri Nov 2 by 5 PM
Part 1 - Brief outline of Project, email to Drena & Michael
after response/approval, then:
Part 2 - More detailed outline of project
Read a few papers and summarize status of problem
Schedule meeting with Drena & Michael to discuss ideas
Seminars this Week
BCB List of URLs for Seminars related to Bioinformatics:
http://www.bcb.iastate.edu/seminars/index.html
• Oct 25 Thur - BBMB Seminar 4:10 in 1414 MBB
• Dave Segal, UC Davis: Zinc Finger Protein Design
• Oct 19 Fri - BCB Faculty Seminar 2:10 in 102 ScI
• Guang Song, ComS, ISU: Probing functional mechanisms
by structure-based modeling and simulations
Chp 16 - RNA Structure Prediction
SECTION V
STRUCTURAL BIOINFORMATICS
Xiong: Chp 16 RNA Structure Prediction (Terribilini)
• RNA Function
• Types of RNA Structures
• RNA Secondary Structure Prediction Methods
• Ab Initio Approach
• Comparative Approach
• Performance Evaluation
This is a new slide
Covalent & non-covalent bonds in RNA
Primary:
Covalent bonds
Secondary/Tertiary
Non-covalent bonds
• H-bonds
(base-pairing)
• Base stacking
Fig 6.2
Baxevanis & Ouellette 2005
RNA Pseudoknots & Tetraloops
This is a new slide
• Often have important regulatory or catalytic functions
Pseudoknot
http://www.lbl.gov/Science-Articles/ResearchReview/Annual-Reports/1995/images/rna.gif
Tetraloop
http://academic.brooklyn.cuny.edu/chem/zhuang/QD/mckay_hr.gif
Base Pairing in RNA
This slide has been changed
G-C, A-U, G-U ("wobble") & many variants
See: IMB Image Library of Biological Molecules
http://www.fli-leibniz.de/ImgLibDoc/nana/IMAGE_NANA.html#basepairs
This slide has been changed
RNA Secondary Structure Prediction
Methods
Two (three, recently) main types of methods:
1. Ab initio - based on calculating most energetically
favorable secondary structure(s)
Energy minimization (thermodynamics)
2. Comparative approach - based on comparisons of
multiple evolutionarily-related RNA sequences
Sequence comparison (co-variation)
3. Combined computational & experimental
Use experimental constraints when available
This is a new slide
RNA Secondary structure prediction - 3
3) Combined experimental & computational
• Experiments:
Map single-stranded vs double-stranded regions in folded RNA
• How?
Enzymes: S1 nuclease, T1 RNase
Chemicals: kethoxal, DMS, OH
• Software:
Mfold
Sfold
RNAStructure
RNAFold
RNAlifold
[Figure: gel with a G lane and lanes for kethoxal modification (mild, strong) and DMS modification (mild, strong); positions 200, 220, 240 marked]
This slide has been changed
Ab Initio Prediction: Clarifications
• Free energy is calculated based on parameters
determined in the wet lab
• Correction: Use known energy associated with
each type of nearest-neighbor pair (base-stacking)
(not base-pair)
• Base-pair formation is not independent: multiple
base-pairs adjacent to each other are more
favorable than individual base-pairs - cooperative because of base-stacking interactions
• Bulges and loops adjacent to base-pairs have a free
energy penalty
Energy minimization:
What are the rules?
This is a new slide
A A stacked over U U (base pairs A=U, A=U): ΔG = -1.2 kcal/mole
A U stacked over U A (base pairs A=U, U=A): ΔG = -1.6 kcal/mole
What gives here?
C Staben 2005
Energy minimization calculations:
Base-stacking is critical
This is a new slide
Stacked pairs (top strand / bottom strand)     ΔG (kcal/mole)
AA / UU                                        -1.2
AU / UA (or UA / AU)                           -1.6
AG, AC, CA, GA / UC, UG, GU, CU                -2.1
CC / GG                                        -4.8
CG / GC                                        -3.0
GC / CG                                        -4.3
GU / UG                                        -0.3
XG, GX / YU, UY                                 0
- Tinoco et al.
C Staben 2005
This slide has been changed
Ab Initio Energy Calculation
• Search for all possible base-pairing
patterns
• Calculate total energy of each
structure based on all stabilizing and
destabilizing forces
Total free energy for a specific
RNA conformation = Sum of
incremental energy terms for:
• helical stacking
(sequence dependent)
• loop initiation
• unpaired stacking
(favorable "increments" are < 0)
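The "sum of incremental energy terms" idea can be sketched in a few lines of Python. This is only an illustration, assuming the stacking values listed in this lecture's base-stacking table; the function names and the loop penalty passed in the usage line are hypothetical placeholders, not parameters from Mfold or any other program.

```python
# Stacking increments (kcal/mol) taken from the lecture's base-stacking table.
# Keys are (top-strand dinucleotide, bottom-strand dinucleotide), both read
# left to right as drawn on the slide.
STACK = {
    ("AA", "UU"): -1.2,
    ("AU", "UA"): -1.6,
    ("AG", "UC"): -2.1, ("AC", "UG"): -2.1,
    ("CA", "GU"): -2.1, ("GA", "CU"): -2.1,
    ("CC", "GG"): -4.8,
    ("CG", "GC"): -3.0,
    ("GC", "CG"): -4.3,
    ("GU", "UG"): -0.3,
}

def stacking_energy(top, bottom):
    """Sum nearest-neighbor stacking increments along a fully paired helix."""
    if len(top) != len(bottom):
        raise ValueError("strands must have equal length")
    total = 0.0
    for i in range(len(top) - 1):
        key = (top[i:i + 2], bottom[i:i + 2])
        # The same stack viewed from the other strand (helix rotated 180 degrees).
        flipped = (key[1][::-1], key[0][::-1])
        if key in STACK:
            total += STACK[key]
        elif flipped in STACK:
            total += STACK[flipped]
        else:
            raise KeyError(f"no stacking value listed for {key}")
    return total

def total_free_energy(top, bottom, loop_penalty):
    """Total = favorable stacking terms (< 0) + destabilizing loop term (> 0)."""
    return stacking_energy(top, bottom) + loop_penalty

# Hypothetical 4-bp helix closed by a loop with a made-up +4.0 kcal/mol penalty.
print(round(total_free_energy("AAGC", "UUCG", loop_penalty=4.0), 1))
```

Note that each increment depends on two adjacent base pairs, which is why the energy is looked up per stack rather than per base pair.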
Fig 6.3
Baxevanis & Ouellette 2005
Dynamic Programming
This slide has been changed
• Finding optimal secondary structure is difficult: lots of possibilities
• Compare RNA sequence with itself
• Apply scoring scheme based on energy parameters
for base stacking, cooperativity, and penalties for
destabilizing forces (loops, bulges)
• Find path that represents most energetically
favorable secondary structure
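A minimal sketch of the dynamic programming idea follows. To keep it short it swaps in a simpler scoring scheme: it maximizes the number of base pairs (Nussinov-style) instead of minimizing free energy with stacking and loop terms, so it is not what Mfold or RNAfold compute. The function names and the `min_loop` default are assumptions for illustration.

```python
def can_pair(a, b):
    """Watson-Crick pairs plus the G-U wobble pair."""
    return {a, b} in ({"G", "C"}, {"A", "U"}, {"G", "U"})

def max_base_pairs(seq, min_loop=3):
    """Compare the sequence with itself; dp[i][j] = max pairs within seq[i..j]."""
    n = len(seq)
    if n == 0:
        return 0
    dp = [[0] * n for _ in range(n)]
    # Fill by increasing span so shorter subsequences are solved first.
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = max(dp[i + 1][j], dp[i][j - 1])       # i or j unpaired
            if can_pair(seq[i], seq[j]):
                best = max(best, dp[i + 1][j - 1] + 1)   # i pairs with j
            for k in range(i + 1, j):                    # two independent substructures
                best = max(best, dp[i][k] + dp[k + 1][j])
            dp[i][j] = best
    return dp[0][n - 1]

print(max_base_pairs("GGGAAAUCC"))  # prints 3
```

An energy-based program fills a similar table but scores each cell with stacking increments and loop penalties, then traces back the path with the lowest total free energy.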
3 - Popular Programs that use Combined
Computational & Experimental Approaches
• Mfold
• Sfold
• RNAStructure
• RNAFold
• RNAlifold
Comparison of Predictions for Single RNA
using Different Methods
[Figure: four predicted structures for the same RNA, each with stem-loops SL X, SL Y, SL Z labeled]
Sfold -51.14 kcal/mol
Mfold -54.84 kcal/mol
RNAstructure -71.3 kcal/mol
RNAfold -80.16 kcal/mol
JH Lee 2007
Comparison of Mfold Predictions:
-/+ Constraints
Mfold: -126.05 kcal/mol
Mfold plus constraints: -54.84 kcal/mol
JH Lee 2007
Performance Evaluation
This slide has been changed
• Ab initio methods? correlation coefficient = 20-60%
• Comparative approaches? correlation coefficient = 20-80%
• Programs that require user to supply MSA are more
accurate
• Comparative programs are consistently more accurate than
ab initio
• Base-pairs predicted by comparative sequence analysis for large
& small subunit rRNAs are 97% accurate when compared with
high resolution crystal structures!
- Gutell, Pace
• BEST APPROACH? Methods that combine computational
prediction (ab initio & comparative) with experimental
constraints (from chemical/enzymatic modification
studies)
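One common way to score a predicted secondary structure against a reference is the Matthews correlation coefficient computed over base pairs. Whether the percentages quoted above were obtained with exactly this definition is an assumption; the sketch below only shows how such a coefficient can be calculated. The function name and the example pair lists are hypothetical.

```python
import math

def base_pair_mcc(predicted, reference, seq_len):
    """Matthews correlation coefficient over all possible (i, j) pairs, i < j."""
    predicted, reference = set(predicted), set(reference)
    n_possible = seq_len * (seq_len - 1) // 2
    tp = len(predicted & reference)     # pairs predicted and present
    fp = len(predicted - reference)     # pairs predicted but absent
    fn = len(reference - predicted)     # pairs present but missed
    tn = n_possible - tp - fp - fn      # pairs correctly left unpaired
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical 9-nt hairpin: the reference has 3 pairs, the prediction gets 2 right.
reference = [(0, 8), (1, 7), (2, 6)]
predicted = [(0, 8), (1, 7), (3, 6)]
print(round(base_pair_mcc(predicted, reference, seq_len=9), 2))  # prints 0.64
```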
Chp 8 - Gene Prediction
SECTION III GENE AND PROMOTER PREDICTION
Xiong: Chp 8 Gene Prediction
• Categories of Gene Prediction Programs
• Gene Prediction in Prokaryotes
• Gene Prediction in Eukaryotes
What is a Gene?
What is a gene? segment of DNA, some of which is
"structural," i.e., transcribed to give a functional RNA
product, & some of which is "regulatory"
• Genes can encode:
• mRNA (for protein)
• other types of RNA (tRNA, rRNA, miRNA, etc.)
• Genes differ in eukaryotes vs prokaryotes (& archaea),
in both structure & regulation
Gene Finding
Problem: Given a new genomic DNA sequence, identify
coding regions and their predicted RNA and protein
sequences
ATTACCATGGGGCAGGGTCAGATATAATGCCCTCATTTT
Steps:
1. Search against protein / EST database
2. Apply gene prediction programs (many programs available)
3. Analyze regulatory regions
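As a toy illustration of the problem, the example sequence on this slide contains a short open reading frame. The sketch below scans the three forward reading frames for ATG...stop stretches; it is a naive ORF scan, not one of the gene prediction programs, and the function name and `min_codons` parameter are assumptions made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=2):
    """Return (start, end, sequence) for ATG...stop ORFs on the forward strand."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                       # first in-frame start codon
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None                    # look for the next ORF in this frame
    return orfs

seq = "ATTACCATGGGGCAGGGTCAGATATAATGCCCTCATTTT"
print(find_orfs(seq))  # prints [(6, 27, 'ATGGGGCAGGGTCAGATATAA')]
```

A real gene finder would also scan the reverse strand and, in eukaryotes, allow for introns interrupting the coding region.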
Gene Prediction in Prokaryotes vs Eukaryotes
Eukaryotes
• Large genomes 10^7 – 10^10 bp
• Often less than 2% coding
• Complicated gene structure
(splicing, long introns)
• Prediction success 50-95%
Prokaryotes
• Small genomes 0.5 - 10 × 10^6 bp
• About 90% of genome is coding
• Simple gene structure
• Prediction success ~99%
[Figure: gene structure diagrams with start and stop codons labeled. Eukaryote: Promoter, 5’ UTR, ATG, Exons, Introns, Splice sites, TAA, 3’ UTR. Prokaryote: Promoter, ATG, Open reading frame (ORF), TAA]
DNA "Signals" Used by Gene Finding
Algorithms
1. Exploit the regular gene structure
ATG—Exon1—Intron1—Exon2—…—ExonN—STOP
2. Recognize “coding bias”
CAG-CGA-GAC-TAT-TTA-GAT-AAC-ACA-CAT-GAA-…
3. Recognize splice sites
Intron—cAGt—Exon—gGTgag—Intron
4. Model the duration of regions
Introns tend to be much longer than exons, in mammals
Exons are biased to have a given minimum length
5. Use cross-species comparison
Gene structure is conserved in mammals
Exons are more similar (~85%) than introns
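The splice-site signal (item 3) can be illustrated with a very small scan for candidate introns that begin with GT and end with AG. This is only a sketch under simplifying assumptions: real gene finders score the surrounding sequence context and intron length rather than accepting every GT...AG span. The function name and the `min_len` parameter are hypothetical.

```python
def candidate_introns(dna, min_len=4):
    """List (start, end) spans that begin with GT (donor) and end with AG (acceptor)."""
    dna = dna.upper()
    donors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    acceptors = [j for j in range(len(dna) - 1) if dna[j:j + 2] == "AG"]
    # end is exclusive, so dna[start:end] is the candidate intron itself
    return [(d, a + 2) for d in donors for a in acceptors if a + 2 - d >= min_len]

dna = "CCGTAAAGCC"
for start, end in candidate_introns(dna):
    print(start, end, dna[start:end])  # prints 2 8 GTAAAG
```

Because GT and AG dinucleotides are very common, a scan like this produces many false candidates on real sequence, which is why the other signals in the list are needed.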
Computational Gene Finding Approaches
• Ab initio methods
• Search by signal: find DNA sequences involved in gene
expression.
• Search by content: Test statistical properties distinguishing
coding from non-coding DNA
• Similarity based methods
• Database search: exploit similarity to proteins, ESTs, and
cDNAs
• Comparative genomics: exploit aligned genomes
• Do other organisms have similar sequence?
• Hybrid methods - best
Examples of Gene Prediction Software
• Ab initio
Genscan, GeneMark.hmm, Genie, GeneID…
• Similarity-based
BLAST, Procrustes…
• Hybrids
GeneSeqer, GenomeScan, GenieEST, Twinscan, SGP, ROSETTA,
CEM, TBLASTX, SLAM.
BEST? Ab initio - Genscan (according to some assessments)
Hybrid - GeneSeqer
But depends on organism & specific task
Lists of Gene Prediction Software
http://www.bioinformaticsonline.org/links/ch_09_t_1.html
http://cmgm.stanford.edu/classes/genefind/
Synthesis & Processing of Eukaryotic mRNA
[Figure: gene in DNA with exon 1, intron, exon 2, intron, exon 3, processed in the following steps]
Transcription → 1° transcript (RNA)
Splicing (remove introns)
Capping & polyadenylation → Mature mRNA (5’ 7MeG cap, AAAAA 3’ tail)
Export to cytoplasm
What are cDNAs & ESTs?
cDNA libraries are important for determining gene
structure & studying regulation of gene expression
• Isolate RNA (always from a specific
organism, region, and time point)
• Convert RNA to complementary DNA
(with reverse transcriptase)
• Clone into cDNA vector
• Sequence the cDNA inserts
• Short cDNAs are called ESTs or
Expressed Sequence Tags
ESTs are strong evidence for genes
• Full-length cDNAs can be difficult to obtain
[Figure: cDNA insert cloned into a vector]
UniGene: Unique genes via ESTs
• Find UniGene at NCBI:
www.ncbi.nlm.nih.gov/UniGene
• UniGene clusters contain many ESTs
• UniGene data come from many cDNA libraries.
When you look up a gene in UniGene, you can
obtain information re: level & tissue
distribution of expression
Gene Prediction
• Overview of steps & strategies
• What sequence signals can be used?
• What other types of information can be used?
• Algorithms
• HMMs, Bayesian models, neural nets
• Gene prediction software
• 3 major types
• many, many programs!
Overview of Gene Prediction Strategies
What sequence signals can be used?
• Transcription: TF binding sites, promoter, initiation site, terminator,
GC islands, etc.
• Processing signals: Splice donor/acceptors, polyA signal
• Translation: Start (AUG = Met) & stop (UGA, UAA, UAG)
ORFs, codon usage
What other types of information can be used?
• Homology (sequence comparison, BLAST)
• cDNAs & ESTs (experimental data, pairwise alignment)
Gene prediction: Eukaryotes vs prokaryotes
Gene prediction is easier in microbial genomes
Why?
Smaller genomes
Simpler gene structures
Many more sequenced genomes!
(for comparative approaches)
Many microbial genomes have been fully sequenced &
whole-genome "gene structure" and "gene function"
annotations are available
e.g., GeneMark.hmm
TIGR Comprehensive Microbial Resource (CMR)
NCBI Microbial Genomes
Predicting Genes - Basic steps:
• Obtain genomic sequence
• BLAST it!
• Perform database similarity search
(with EST & cDNA databases, if available)
• Translate in all 6 reading frames
(i.e., "6-frame translation")
• Compare with protein sequence databases
• Use Gene Prediction software to locate genes
• Analyze regulatory sequences
• Refine gene prediction
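The "6-frame translation" step can be sketched as follows, assuming the standard genetic code and a DNA sequence containing only A, C, G, T. The function names are made up for this example; tools such as BLASTX perform this translation internally.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; "*" marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]

def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frame_translation(dna):
    """Return translations for frames +1, +2, +3 and -1, -2, -3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

frames = six_frame_translation("ATTACCATGGGGCAGGGTCAGATATAATGCCCTCATTTT")
print(frames["+1"])  # prints ITMGQGQI*CPHF
```

Each of the six protein strings can then be compared with protein sequence databases, as in the next step of the list.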
Predicting Genes - Details:
1. 1st, mask to "remove" repetitive elements (ALUs, etc.)
2. Perform database search on translated DNA
(BlastX,TFasta)
3. Use several programs to predict genes
(GENSCAN, GeneMark.hmm, GeneSeqer)
4. Search for functional motifs in translated ORFs
(Blocks, Motifs, etc.) & in neighboring DNA sequences
5. Repeat
GeneSeqer - Brendel et al.- ISU
http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi
Spliced Alignment Algorithm
Brendel et al (2004) Bioinformatics 20: 1157
• Perform pairwise alignment with large gaps in one sequence
(due to introns)
• Align genomic DNA with cDNA, ESTs, protein sequences
• Score semi-conserved sequences at splice junctions
• Using Bayesian model or MM
• Score coding constraints in translated exons
• Using a Bayesian model or MM
[Figure: intron flanked by splice sites: GT (donor) at the 5' end, AG (acceptor) at the 3' end]
Brendel 2005
Brendel - Spliced Alignment II:
Compare with protein probes
[Figure: protein probe aligned to genomic DNA; start codon and stop codon marked]
Brendel 2005
Splice Site Detection
Do DNA sequences surrounding splice "consensus"
sequences contribute to splicing signal? YES
• Information Content I_i:
\[ I_i = 2 + \sum_{B \in \{U, C, A, G\}} f_{iB} \log_2 (f_{iB}) \]
• Extent of Splice Signal Window:
\[ I_i \ge \bar{I} + 1.96\, \sigma_{\bar{I}} \]
i: ith position in sequence
f_{iB}: frequency of base B at position i
\bar{I}: avg information content over all positions >20 nt from splice site
\sigma_{\bar{I}}: avg sample standard deviation of \bar{I}
Brendel 2005
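The information content calculation can be sketched in Python, assuming a set of equal-length sequences aligned at the splice site. The function names, the toy alignment, and the toy background values are all hypothetical; the 1.96 threshold is the one given on this slide.

```python
import math
import statistics
from collections import Counter

def information_content(aligned_seqs):
    """I_i = 2 + sum over bases of f_iB * log2(f_iB), for each column i."""
    result = []
    for i in range(len(aligned_seqs[0])):
        counts = Counter(seq[i] for seq in aligned_seqs)
        total = sum(counts.values())
        # Bases with zero frequency contribute nothing to the sum.
        result.append(2.0 + sum((c / total) * math.log2(c / total)
                                for c in counts.values()))
    return result

def signal_positions(ic, background):
    """Positions whose information content is at least mean + 1.96 * sd of background."""
    threshold = statistics.mean(background) + 1.96 * statistics.stdev(background)
    return [i for i, value in enumerate(ic) if value >= threshold]

# Toy alignment: column 0 is invariant (2 bits), column 1 is uniform (0 bits).
print(information_content(["GU", "GC", "GA", "GG"]))  # prints [2.0, 0.0]
```

An invariant position such as the G of the GT donor carries the maximum of 2 bits, while a position where all four bases are equally frequent carries none.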
Information content vs position
[Figure: two plots of information content (y-axis, 0.0 to 0.8) vs position (x-axis, -50 to 50). Panels: Human T2_GT and Human T2_AG]
Which sequences are exons & which are introns?
How can you tell?
Brendel et al (2004) Bioinformatics 20: 1157
Brendel 2005
Markov Model for Spliced Alignment
PG
PG
(1-PG)(1-PD(n+1))
en
en+1
(1-PG)PD(n+1)
PA(n)PG
(1-PG)PD(n+1)
in
in+1
1-PA(n)
Brendel 2005