#26 - Gene Prediction 10/22/07 BCB 444/544 Gene Prediction

BCB 444/544
Lecture 26
Gene Prediction
#26_Oct22
BCB 444/544 F07 ISU Dobbs #26 - Gene Prediction
Required Reading
(before lecture)
Mon Oct 22 - Lecture 26
Gene Prediction
• Chp 8 - pp 97 - 112
Wed Oct 24 - Lecture 27
(will not be covered on Exam 2)
Regulatory Element Prediction
• Chp 9 - pp 113 - 126
Thurs Oct 25 - Review Session & Project Planning
Fri Oct 26 - EXAM 2
Assignments & Announcements
Sun Oct 21 - Study Guide for Exam 2 was posted
Mon Oct 22 - HW#4 Due
(no "correct" answer to post)
Thu Oct 25 - Lab = Optional Review Session for Exam
544 Project Planning/Consult with DD & MT
Fri Oct 26 - Exam 2 - Will cover:
• Lectures 13-26 (thru Mon Oct 22)
• Labs 5-8
• HW# 3 & 4
• All assigned reading:
Chps 6 (beginning with HMMs), 7-8, 12-16
Eddy: What is an HMM
Ginalski: Practical Lessons…
BCB 544 "Team" Projects
• 544 Extra HW#2 is next step in Team Projects
• Write ~ 1 page outline
• Schedule meeting with Michael & Drena to discuss topic
• Read a few papers
• Write a more detailed plan
• You may work alone if you prefer
• Last week of classes will be devoted to Projects
• Written reports due: Mon Dec 3 (no class that day)
• Oral presentations (15-20') will be: Wed-Fri Dec 5,6,7
• 1 or 2 teams will present during each class period
See Guidelines for Projects posted online
BCB 544 Only:
New Homework Assignment
544 Extra#2 (posted online Thurs?)
No - sorry! sent by email on Sat…
Due:
PART 1 - ASAP
PART 2 - Fri Nov 2 by 5 PM
Part 1 - Brief outline of Project, email to Drena & Michael
after response/approval, then:
Part 2 - More detailed outline of project
Read a few papers and summarize status of problem
Schedule meeting with Drena & Michael to discuss ideas
Seminars this Week
BCB List of URLs for Seminars related to Bioinformatics:
http://www.bcb.iastate.edu/seminars/index.html
• Oct 25 Thur - BBMB Seminar 4:10 in 1414 MBB
• Dave Segal, UC Davis: Zinc Finger Protein Design
• Oct 19 Fri - BCB Faculty Seminar 2:10 in 102 ScI
• Guang Song, ComS, ISU: Probing functional mechanisms
by structure-based modeling and simulations
Chp 16 - RNA Structure Prediction
SECTION V STRUCTURAL BIOINFORMATICS
Xiong: Chp 16 RNA Structure Prediction (Terribilini)
• RNA Function
• Types of RNA Structures
• RNA Secondary Structure Prediction Methods
• Ab Initio Approach
• Comparative Approach
• Performance Evaluation
Covalent & non-covalent bonds in RNA
Primary:
Covalent bonds
Secondary/Tertiary
Non-covalent bonds
• H-bonds
(base-pairing)
• Base stacking
http://www.lbl.gov/Science-Articles/ResearchReview/Annual-Reports/1995/images/rna.gif
Base Pairing in RNA
G-C, A-U, G-U ("wobble") & many variants
See: IMB Image Library of Biological Molecules
http://www.fli-leibniz.de/ImgLibDoc/nana/IMAGE_NANA.html#basepairs
RNA Pseudoknots & Tetraloops
• Often have important regulatory or catalytic functions
Pseudoknot
Tetraloop
Fig 6.2
Baxevanis & Ouellette 2005
http://academic.brooklyn.cuny.edu/chem/zhuang/QD/mckay_hr.gif
RNA Secondary Structure Prediction
Methods
Two (three, recently) main types of methods:
1. Ab initio - based on calculating most energetically
favorable secondary structure(s)
Energy minimization (thermodynamics)
2. Comparative approach - based on comparisons of
multiple evolutionarily-related RNA sequences
Sequence comparison (co-variation)
3. Combined computational & experimental
Use experimental constraints when available
RNA Secondary structure prediction - 3
3) Combined experimental & computational
• Experiments:
Map single-stranded vs double-stranded regions in folded RNA
• How?
Enzymes: S1 nuclease, T1 RNase
Chemicals: kethoxal, DMS, OH•
• Software:
Mfold
Sfold
RNAStructure
RNAFold
RNAlifold
[Figure: kethoxal modification (mild, strong) and DMS modification (mild, strong); labels G and DMS; position markers 200, 220, 240]
This slide has been changed
• Free energy is calculated based on parameters
determined in the wet lab
• Correction: Use known energy associated with
each type of nearest-neighbor pair (base-stacking)
(not base-pair)
• Base-pair formation is not independent: multiple
base-pairs adjacent to each other are more
favorable than individual base-pairs - cooperative
- because of base-stacking interactions
• Bulges and loops adjacent to base-pairs have a free
energy penalty
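The nearest-neighbor idea in the bullets above can be sketched in a few lines of Python. This is a minimal illustration, not the scoring used by any particular program: the function names are hypothetical, the energies are the stacking values quoted in these notes (treated here as assumptions), and loop and bulge penalties are left out entirely.

```python
# Stacking energies (kcal/mole) for two adjacent base-pairs, keyed "top/bottom".
# The top strand is read 5'->3' and the bottom strand 3'->5'.
# Values are taken from the stacking figures quoted in these notes and are
# assumptions for illustration only; mixed and G-U stacks are omitted.
STACK_ENERGY = {
    "AA/UU": -1.2,
    "AU/UA": -1.6,
    "UA/AU": -1.6,
    "CG/GC": -3.0,
    "GC/CG": -4.3,
    "CC/GG": -4.8,
}


def stack_energy(top, bottom):
    """Energy of one stack of two adjacent base-pairs (raises KeyError if unknown)."""
    key = f"{top}/{bottom}"
    if key in STACK_ENERGY:
        return STACK_ENERGY[key]
    # The same stack described from the opposite strand.
    flipped = f"{bottom[::-1]}/{top[::-1]}"
    return STACK_ENERGY[flipped]


def helix_energy(top, bottom):
    """Sum stacking energies along a fully paired helix.

    A helix of n base-pairs has n - 1 stacks, so a single isolated
    base-pair contributes nothing in this simplified model.
    """
    if len(top) != len(bottom):
        raise ValueError("strands must have the same length")
    return sum(
        stack_energy(top[i:i + 2], bottom[i:i + 2])
        for i in range(len(top) - 1)
    )


print(helix_energy("AAU", "UUA"))  # one AA/UU stack plus one AU/UA stack
print(helix_energy("A", "U"))      # no stack, so no stacking energy
```

Because the energy is attached to each pair of neighboring base-pairs rather than to each base-pair, longer uninterrupted helices accumulate more favorable energy, which is the cooperativity described above.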
Energy minimization calculations:
Base-stacking is critical
Stacking energy (kcal/mole) for adjacent base-pairs (top strand / bottom strand):
AA / UU                            -1.2
AU / UA or UA / AU                 -1.6
AG, AC, CA, GA / UC, UG, GU, CU    -2.1
CG / GC                            -3.0
GC / CG                            -4.3
CC / GG                            -4.8
GU / UG                            -0.3
XG, GX / YU, UY                     0
[Diagram: single A=U base-pair vs stacked A=U base-pairs]
C Staben 2005
Ab Initio Energy Calculation
Total free energy for a specific
RNA conformation = Sum of
incremental energy terms for:
• helical stacking
(sequence dependent)
• loop initiation
• unpaired stacking
(favorable "increments" are < 0)
Fig 6.3
Baxevanis & Ouellette 2005
Dynamic Programming
• Finding optimal secondary structure is difficult - lots of possibilities
• Compare RNA sequence with itself
• Apply scoring scheme based on energy parameters
for base stacking, cooperativity, and penalties for
destabilizing forces (loops, bulges)
• Find path that represents most energetically
favorable secondary structure
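As a rough illustration of "comparing the sequence with itself" by dynamic programming, here is a sketch of the Nussinov base-pair maximization algorithm. It is a deliberate simplification: it maximizes the number of base-pairs instead of minimizing free energy, so it is not the scoring scheme of Mfold or the other programs named in these notes. The function name and the `min_loop` parameter are assumptions made for the example.

```python
# Allowed pairs: Watson-Crick plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(seq, min_loop=3):
    """Nussinov-style DP: maximum number of nested base-pairs in seq.

    dp[i][j] holds the best score for the subsequence seq[i..j].
    min_loop is the minimum number of unpaired bases in a hairpin loop.
    """
    n = len(seq)
    if n == 0:
        return 0
    dp = [[0] * n for _ in range(n)]
    # Fill the table by increasing subsequence span.
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = max(dp[i + 1][j], dp[i][j - 1])       # i or j left unpaired
            if (seq[i], seq[j]) in PAIRS:
                best = max(best, dp[i + 1][j - 1] + 1)   # i pairs with j
            for k in range(i + 1, j):                    # split into two substructures
                best = max(best, dp[i][k] + dp[k + 1][j])
            dp[i][j] = best
    return dp[0][n - 1]


print(max_base_pairs("GGGAAAUCC"))  # a small hairpin: three pairs around an AAA loop
```

Energy-based programs keep the same table-filling structure but replace the "+1 per pair" score with stacking energies and loop penalties.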
• Search for all possible base-pairing
patterns
• Calculate total energy of each
structure based on all stabilizing and
destabilizing forces
Energy minimization:
What are the rules?
A=U
U=A
ΔG = -1.6 kcal/mole
- Tinoco et al.
What gives here?
AA
UU
ΔG = -1.2 kcal/mole
C Staben 2005
Ab Initio Prediction: Clarifications
3 - Popular Programs that use Combined
Computational Experimental Approaches
• Mfold
• Sfold
• RNAStructure
• RNAFold
• RNAlifold
Comparison of Predictions for Single RNA
using Different Methods
[Figure: predicted structures with stem-loops SL X, SL Y, SL Z labeled]
Sfold -51.14 kcal/mol
Mfold -54.84 kcal/mol
RNAstructure -71.3 kcal/mol
RNAfold -80.16 kcal/mol
JH Lee 2007
Comparison of Mfold Predictions:
-/+ Constraints
[Figure: predicted structures with stem-loops SL X, SL Y, SL Z labeled]
Mfold
-126.05 kcal/mol
Mfold plus constraints
-54.84 kcal/mol
JH Lee 2007
Performance Evaluation
This slide has been changed
• Ab initio methods? correlation coefficient = 20-60%
• Comparative approaches? correlation coefficient = 20-80%
• Programs that require user to supply MSA are more
accurate
• Comparative programs are consistently more accurate than
ab initio
• Base-pairs predicted by comparative sequence analysis for large
& small subunit rRNAs are 97% accurate when compared with
high resolution crystal structures!
- Gutell, Pace
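The notes do not say which correlation coefficient is meant. One common choice for comparing predicted and known base-pairs is the Matthews correlation coefficient, so the sketch below assumes that definition. The function name is hypothetical, and counting every possible position pair as a candidate is a simplifying assumption.

```python
import math


def pair_mcc(predicted, reference, seq_len):
    """Matthews correlation coefficient between two sets of base-pairs.

    predicted, reference: iterables of (i, j) index pairs with i < j.
    seq_len: sequence length; all seq_len * (seq_len - 1) / 2 position
    pairs are treated as candidates (an assumption of this sketch).
    """
    predicted, reference = set(predicted), set(reference)
    total = seq_len * (seq_len - 1) // 2
    tp = len(predicted & reference)   # pairs predicted and real
    fp = len(predicted - reference)   # pairs predicted but not real
    fn = len(reference - predicted)   # real pairs that were missed
    tn = total - tp - fp - fn         # everything else
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


known = {(0, 8), (1, 7), (2, 6)}
print(pair_mcc(known, known, 9))                # perfect prediction
print(pair_mcc({(0, 8), (1, 7)}, {(0, 8), (2, 6)}, 9))  # one right, one wrong, one missed
```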
Chp 8 - Gene Prediction
SECTION III GENE AND PROMOTER PREDICTION
Xiong: Chp 8 Gene Prediction
• Categories of Gene Prediction Programs
• Gene Prediction in Prokaryotes
• Gene Prediction in Eukaryotes
• BEST APPROACH? Methods that combine computational
prediction (ab initio & comparative) with experimental
constraints (from chemical/enzymatic modification
studies)
What is a Gene?
What is a gene? segment of DNA, some of which is
"structural," i.e., transcribed to give a functional RNA
product, & some of which is "regulatory"
• Genes can encode:
• mRNA (for protein)
• other types of RNA (tRNA, rRNA, miRNA, etc.)
• Genes differ in eukaryotes vs prokaryotes (& archaea),
both structure & regulation
Gene Finding
Problem: Given a new genomic DNA sequence, identify
coding regions and their predicted RNA and protein
sequences
ATTACCATGGGGCAGGGTCAGATATAATGCCCTCATTTT
Steps:
1. Search against protein / EST database
2. Apply gene prediction programs (many programs available)
3. Analyze regulatory regions
Gene Prediction in Prokaryotes vs Eukaryotes
Prokaryotes
• Small genomes 0.5 - 10·10^6 bp
• About 90% of genome is coding
• Simple gene structure
• Prediction success ~99%
[Diagram: Promoter, Start codon (ATG), Open reading frame (ORF), Stop codon (TAA)]
Eukaryotes
• Large genomes 10^7 - 10^10 bp
• Often less than 2% coding
• Complicated gene structure
(splicing, long exons)
• Prediction success 50-95%
[Diagram: Promoter, 5’ UTR, ATG, Exons, Introns, Splice sites, TAA, 3’ UTR]
DNA "Signals" Used by Gene Finding
Algorithms
1. Exploit the regular gene structure
ATG - Exon1 - Intron1 - Exon2 - … - ExonN - STOP
2. Recognize "coding bias"
CAG-CGA-GAC-TAT-TTA-GAT-AAC-ACA-CAT-GAA-…
3. Recognize splice sites
Intron - cAGt - Exon - gGTgag - Intron
4. Model the duration of regions
Introns tend to be much longer than exons, in mammals
Exons are biased to have a given minimum length
5. Use cross-species comparison
Gene structure is conserved in mammals
Exons are more similar (~85%) than introns
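The "recognize splice sites" signal from the list of DNA signals can be illustrated with a naive scan for GT...AG candidate introns. This is only a sketch: real gene finders score the extended context around each site, whereas this function (a hypothetical name, with an assumed `min_len` cutoff) reports every donor/acceptor pairing that is long enough. The toy sequence is invented for the example.

```python
def candidate_introns(dna, min_len=20):
    """List (start, end) spans that begin with GT and end with AG.

    start is the index of the donor G; end is one past the acceptor G,
    so dna[start:end] is the candidate intron. min_len is an assumed
    minimum intron length used to discard implausibly short spans.
    """
    dna = dna.upper()
    donors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    acceptors = [j for j in range(len(dna) - 1) if dna[j:j + 2] == "AG"]
    return [
        (d, a + 2)
        for d in donors
        for a in acceptors
        if a + 2 - d >= min_len
    ]


# Toy sequence (invented): exon, then a 22-nt intron, then exon.
toy = "ATGGCC" + "GTAAGT" + "T" * 10 + "TTTCAG" + "GCCTAA"
for start, end in candidate_introns(toy):
    print(start, end, toy[start:end])
```

Even on this tiny example there are extra GT and AG dinucleotides inside the intron, which is why the dinucleotides alone are weak signals and are combined with the other evidence listed above.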
Computational Gene Finding Approaches
• Ab initio methods
• Search by signal: find DNA sequences involved in gene
expression.
• Search by content: Test statistical properties distinguishing
coding from non-coding DNA
• Similarity based methods
• Database search: exploit similarity to proteins, ESTs, and
cDNAs
• Comparative genomics: exploit aligned genomes
• Do other organisms have similar sequence?
• Hybrid methods - best
Examples of Gene Prediction Software
Ab initio
Genscan, GeneMark.hmm, Genie, GeneID…
Similarity-based
BLAST, Procrustes…
Hybrids
GeneSeqer, GenomeScan, GenieEST, Twinscan, SGP, ROSETTA,
CEM, TBLASTX, SLAM.
BEST? Ab initio - Genscan (according to some assessments)
Hybrid - GeneSeqer
But depends on organism & specific task
Lists of Gene Prediction Software
http://www.bioinformaticsonline.org/links/ch_09_t_1.html
http://cmgm.stanford.edu/classes/genefind/
Synthesis & Processing of Eukaryotic mRNA
Gene in DNA: 5’ exon 1, intron, exon 2, intron, exon 3 3’
Transcription
1° transcript (RNA), 5’ to 3’
Capping & polyadenylation
5’ 7MeG cap, AAAAA 3’ tail
Splicing (remove introns)
Mature mRNA
Export to cytoplasm
What are cDNAs & ESTs?
cDNA libraries are important for determining gene
structure & studying regulation of gene expression
• Isolate RNA (always from a specific
organism, region, and time point)
• Convert RNA to complementary DNA
• (with reverse transcriptase)
• Clone into cDNA vector
• Sequence the cDNA inserts
• Short cDNAs are called ESTs or
Expressed Sequence Tags
• ESTs are strong evidence for genes
• Full-length cDNAs can be difficult to obtain
[Diagram: cDNA insert in vector]
UniGene: Unique genes via ESTs
• Find UniGene at NCBI:
www.ncbi.nlm.nih.gov/UniGene
• UniGene clusters contain many ESTs
• UniGene data come from many cDNA libraries.
When you look up a gene in UniGene, you can
obtain information re: level & tissue
distribution of expression
Gene Prediction
• Overview of steps & strategies
• What sequence signals can be used?
• What other types of information can be used?
• Algorithms
• HMMs, Bayesian models, neural nets
• Gene prediction software
• 3 major types
• many, many programs!
Overview of Gene Prediction Strategies
What sequence signals can be used?
• Transcription: TF binding sites, promoter, initiation site, terminator,
GC islands, etc.
• Processing signals: Splice donor/acceptors, polyA signal
• Translation: Start (AUG = Met) & stop (UGA, UAA, UAG)
ORFs, codon usage
What other types of information can be used?
• Homology (sequence comparison, BLAST)
• cDNAs & ESTs (experimental data, pairwise alignment)
Gene prediction: Eukaryotes vs prokaryotes
Gene prediction is easier in microbial genomes
Why?
Smaller genomes
Simpler gene structures
Many more sequenced genomes!
(for comparative approaches)
Many microbial genomes have been fully sequenced &
whole-genome "gene structure" and "gene function"
annotations are available
e.g., GeneMark.hmm
TIGR Comprehensive Microbial Resource (CMR)
NCBI Microbial Genomes
Predicting Genes - Basic steps:
• Obtain genomic sequence
• BLAST it!
• Perform database similarity search
(with EST & cDNA databases, if available)
• Translate in all 6 reading frames
(i.e., "6-frame translation")
• Compare with protein sequence databases
• Use Gene Prediction software to locate genes
• Analyze regulatory sequences
• Refine gene prediction
Predicting Genes - Details:
1. 1st, mask to "remove" repetitive elements (ALUs, etc.)
2. Perform database search on translated DNA
(BlastX,TFasta)
3. Use several programs to predict genes
(GENSCAN, GeneMark.hmm, GeneSeqer)
4. Search for functional motifs in translated ORFs
(Blocks, Motifs, etc.) & in neighboring DNA sequences
5. Repeat
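The "all 6 reading frames" step from the basic gene prediction steps can be made concrete with a small ORF scan over both strands. This is a sketch under simple assumptions: an ORF is taken to run from the first ATG in a frame to the next in-frame stop codon, and the function names and the `min_codons` cutoff are hypothetical. The test sequence is the example sequence shown in these notes.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def find_orfs(dna, min_codons=2):
    """Scan all six reading frames for ATG...stop open reading frames.

    Returns a list of (strand, frame, orf_sequence) tuples. min_codons is
    the assumed minimum number of codons before the stop codon.
    """
    orfs = []
    forward = dna.upper()
    for strand, seq in (("+", forward), ("-", reverse_complement(forward))):
        for frame in range(3):
            start = None
            for pos in range(frame, len(seq) - 2, 3):
                codon = seq[pos:pos + 3]
                if start is None and codon == "ATG":
                    start = pos                      # open an ORF at the first ATG
                elif start is not None and codon in STOP_CODONS:
                    if (pos - start) // 3 >= min_codons:
                        orfs.append((strand, frame, seq[start:pos + 3]))
                    start = None                     # close it at the stop codon
    return orfs


example = "ATTACCATGGGGCAGGGTCAGATATAATGCCCTCATTTT"
for strand, frame, orf in find_orfs(example):
    print(strand, frame, orf)
```

In a prokaryotic genome a scan like this already finds most genes; in eukaryotes the ORFs are interrupted by introns, which is why the later steps (database search, gene prediction software) are needed.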
GeneSeqer - Brendel et al.- ISU
http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi
Spliced Alignment Algorithm
• Perform pairwise alignment with large gaps in one sequence
(due to introns)
• Align genomic DNA with cDNA, ESTs, protein sequences
• Score semi-conserved sequences at splice junctions
• Using Bayesian model or MM
• Score coding constraints in translated exons
• Using a Bayesian model or MM
[Diagram: intron bounded by splice sites, GT (donor) and AG (acceptor)]
Brendel et al (2004) Bioinformatics 20: 1157
Brendel 2005
Brendel - Spliced Alignment II:
Compare with protein probes
[Diagram: genomic DNA with start codon and stop codon, aligned to protein]
Brendel 2005
Information content vs position
[Figure: information content (y-axis, 0.0 to 0.8) vs position relative to splice site (x-axis, -50 to 50); panels: Human T2_GT and Human T2_AG]
Which sequences are exons & which are introns?
How can you tell?
Brendel 2005
Splice Site Detection
Do DNA sequences surrounding splice "consensus"
sequences contribute to splicing signal? YES
• Information Content $I_i$:
$I_i = 2 + \sum_{B \in \{U,C,A,G\}} f_{iB} \log_2(f_{iB})$
• Extent of Splice Signal Window:
$I_i \ge \bar{I} + 1.96\,\sigma_{\bar{I}}$
i: ith position in sequence
$\bar{I}$: avg information content over all positions >20 nt from splice site
$\sigma_{\bar{I}}$: avg sample standard deviation of $\bar{I}$
Brendel et al (2004) Bioinformatics 20: 1157
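The information content measure and the splice signal window threshold can be computed directly from a set of aligned splice-site sequences. The sketch below follows the formula in these notes, with f_iB taken as the frequency of base B at position i. The function names are hypothetical, and using the sample standard deviation of the background I_i values for the sigma term is an assumption about how that term is estimated.

```python
import math
import statistics
from collections import Counter


def information_content(aligned_sites):
    """Per-position information content I_i = 2 + sum_B f_iB * log2(f_iB).

    aligned_sites: equal-length RNA or DNA strings aligned on the splice site.
    T is treated as U so that DNA input also works.
    """
    sites = [s.upper().replace("T", "U") for s in aligned_sites]
    n = len(sites)
    profile = []
    for i in range(len(sites[0])):
        counts = Counter(site[i] for site in sites)
        info = 2.0
        for base in "UCAG":
            f = counts.get(base, 0) / n
            if f > 0:                      # 0 * log2(0) is taken as 0
                info += f * math.log2(f)
        profile.append(info)
    return profile


def signal_window(profile, background):
    """Indices where I_i >= mean + 1.96 * stdev of the background values.

    background: I_i values from positions far from the splice site
    (the notes use positions more than 20 nt away).
    """
    threshold = statistics.mean(background) + 1.96 * statistics.stdev(background)
    return [i for i, value in enumerate(profile) if value >= threshold]


print(information_content(["GU", "GU", "GU", "GU"]))  # fully conserved: 2 bits each
print(information_content(["A", "C", "G", "U"]))      # uniform: 0 bits
```

A fully conserved position carries 2 bits and a position with all four bases equally frequent carries 0 bits, which matches the range of the y-axis in the information content plots.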
Markov Model for Spliced Alignment
[State diagram: exon states e_n, e_n+1 and intron states i_n, i_n+1; transition probabilities are built from PΔG, PD(n+1) and PA(n), e.g. (1-PΔG)(1-PD(n+1)), (1-PΔG)PD(n+1), 1-PA(n)]
Brendel 2005