#28 - Promoter Prediction 10/29/07 BCB 444/544

BCB 444/544
Lecture 28
Gene Prediction - finish it
Promoter Prediction
#28_Oct29
BCB 444/544 F07 ISU Dobbs #28- Promoter Prediction

Required Reading
(before lecture)
Mon Oct 29 - Lecture 28
Promoter & Regulatory Element Prediction
• Chp 9 - pp 113 - 126
Wed Oct 31 - Lecture 29
Phylogenetics Basics
• Chp 10 - pp 127 - 141
Thurs Nov 1 - Lab 9
Gene & Regulatory Element Prediction
Fri Nov 2 - Lecture 30
Phylogenetic Tree Construction Methods & Programs
• Chp 11 - pp 142 - 169

Assignments & Announcements
Mon Oct 29 - HW#5 - will be posted today
HW#5 = Hands-on exercises with phylogenetics and tree-building software
Due: Mon Nov 5 (not Fri Nov 1 as previously posted)

BCB 544 "Team" Projects
Last week of classes will be devoted to Projects
• Written reports due:
  • Mon Dec 3 (no class that day)
• Oral presentations (20-30') will be:
  • Wed-Fri Dec 5,6,7
  • 1 or 2 teams will present during each class period
→ See Guidelines for Projects posted online

BCB 544 Only: New Homework Assignment
544 Extra#2
Due:
√ PART 1 - ASAP
PART 2 - meeting prior to 5 PM Fri Nov 2
Part 1 - Brief outline of Project, email to Drena & Michael
after response/approval, then:
Part 2 - More detailed outline of project
Read a few papers and summarize status of problem
Schedule meeting with Drena & Michael to discuss ideas

Seminars this Week
BCB List of URLs for Seminars related to Bioinformatics:
http://www.bcb.iastate.edu/seminars/index.html
• Nov 1 Thurs - BBMB Seminar 4:10 in 1414 MBB
  • Todd Yeates UCLA TBA - something cool about structure and evolution?
• Nov 2 Fri - BCB Faculty Seminar 2:10 in 102 ScI
  • Bob Jernigan BBMB, ISU
  • Control of Protein Motions by Structure

Chp 8 - Gene Prediction
SECTION III GENE AND PROMOTER PREDICTION
Xiong: Chp 8 Gene Prediction
• Categories of Gene Prediction Programs
• Gene Prediction in Prokaryotes
• Gene Prediction in Eukaryotes

Computational Gene Prediction: Approaches
• Ab initio methods
  • Search by signal: find DNA sequences involved in gene expression
  • Search by content: Test statistical properties distinguishing coding from non-coding DNA
• Similarity-based methods
  • Database search: exploit similarity to proteins, ESTs, cDNAs
  • Comparative genomics: exploit aligned genomes
    • Do other organisms have similar sequence?
• Hybrid methods - best

Computational Gene Prediction: Algorithms
1. Neural Networks (NNs) (more on these later…)
   e.g., GRAIL
2. Linear discriminant analysis (LDA) (see text)
   e.g., FGENES, MZEF
3. Markov Models (MMs) & Hidden Markov Models (HMMs)
   e.g., GeneSeqer - uses MMs
   GENSCAN - uses 5th order HMMs - (see text)
   HMMgene - uses conditional maximum likelihood (see text)

Signals Search
Approach: Build models (PSSMs, profiles, HMMs, …) and search against DNA. Detected instances provide evidence for genes

Content Search
Observation: Encoding a protein affects statistical properties of DNA sequence:
• Nucleotide/amino acid distribution
• GC content (CpG islands, exon/intron)
• Uneven usage of synonymous codons (codon bias)
• Hexamer frequency - most discriminative of these for identifying coding potential
Method: Evaluate these differences (coding statistics) to differentiate between coding and non-coding regions

Human Codon Usage
[Figure: human codon usage]
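
The content-search idea (score a stretch of DNA by how "coding-like" its codon usage is) can be sketched in a few lines of Python. This is only an illustration, not what any of the named programs actually do: the training sequence, the uniform non-coding model, the pseudocount and the function names (`codon_freqs`, `coding_llr`, `coding_profile`) are all assumptions made up for this example.

```python
import math
from collections import Counter

CODONS = [a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"]

def codon_freqs(seqs, pseudocount=1.0):
    """Estimate codon frequencies from in-frame training sequences."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(0, len(s) - 2, 3):
            counts[s[i:i + 3]] += 1
    # Pseudocounts keep unseen codons from getting probability zero.
    total = sum(counts[c] for c in CODONS) + pseudocount * len(CODONS)
    return {c: (counts[c] + pseudocount) / total for c in CODONS}

def coding_llr(window, coding, noncoding):
    """log2 [ P(window | coding) / P(window | non-coding) ], read in frame 0."""
    window = window.upper()
    score = 0.0
    for i in range(0, len(window) - 2, 3):
        codon = window[i:i + 3]
        if codon in coding and codon in noncoding:  # skip codons with N etc.
            score += math.log2(coding[codon] / noncoding[codon])
    return score

def coding_profile(seq, coding, noncoding, window=30, step=3):
    """Slide a window along seq and return (start, log-likelihood ratio) pairs."""
    return [(start, coding_llr(seq[start:start + window], coding, noncoding))
            for start in range(0, len(seq) - window + 1, step)]

# Toy data (hypothetical): one short "coding" training sequence and a
# uniform non-coding model in which every codon has probability 1/64.
coding = codon_freqs(["ATGGCTGCTGCTGCTAAAGCTGCT"])
noncoding = {c: 1 / 64 for c in CODONS}
profile = coding_profile("GCTGCTGCTGCTTTTTTTTTTTTT", coding, noncoding,
                         window=9, step=3)
for start, score in profile:
    print(start, round(score, 2))
```

Windows rich in codons that are frequent in the training set score above zero, and windows made of rare codons score below zero. A real coding profile would be trained on many known genes and would check all reading frames.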

Predicting Genes based on Codon Usage Differences
Algorithm: Process sliding window
• Use codon frequencies to compute probability of coding versus non-coding
• Plot log-likelihood ratio:
  $\log\left(\frac{P(S \mid \text{coding})}{P(S \mid \text{non-coding})}\right)$
[Figure: Coding Profile of ß-globin gene, with exons marked]

Similarity-Based Methods: Database Search
• In different genomes: Translate DNA into all 6 reading frames and search against proteins (TBLASTX, BLASTX, etc.)
  ATTGCGTAGGGCGCT
  TAACGCATCCCGCGA
• Within same genome: Search with EST/cDNA database (EST2genome, BLAT, etc.).
Problems:
• Will not find “new” or RNA genes (non-coding genes).
• Limits of similarity are hard to define
• Small exons might be overlooked

Similarity-Based Methods: Comparative Genomics
Idea: Functional regions are more conserved than non-functional ones; high similarity in alignment indicates gene
human  GGTTTT--ATGAGTAAAGTAGACACTCCAGTAACGCGGTGAGTAC----ATTAA
mouse  C-TCAGGAATGAGCAAAGTCGAC---CCAGTAACGCGGTAAGTACATTAACGA-
Advantages:
• May find uncharacterized or RNA genes
Problems:
• Finding suitable evolutionary distance
• Finding limits of high similarity (functional regions)

Human-Mouse Homology
[Figure labels: Human, Mouse]
Comparison of 1196 orthologous genes
• Sequence identity between genes in human vs mouse
  Exons:    84.6%
  Protein:  85.4%
  Introns:  35%
  5’ UTRs:  67%
  3’ UTRs:  69%
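
Percent-identity figures like the exon/intron/UTR numbers in the human-mouse comparison come from counting matching columns in an alignment. A minimal Python sketch, assuming one common convention (columns with a gap in either sequence are left out of the denominator); the function name `percent_identity` is made up for this example, and the published figures may have used a different convention:

```python
def percent_identity(a, b, gap="-"):
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only columns where both sequences have a residue.
    columns = [(x, y) for x, y in zip(a.upper(), b.upper())
               if x != gap and y != gap]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)

# The short human/mouse alignment shown on the Comparative Genomics slide.
human = "GGTTTT--ATGAGTAAAGTAGACACTCCAGTAACGCGGTGAGTAC----ATTAA"
mouse = "C-TCAGGAATGAGCAAAGTCGAC---CCAGTAACGCGGTAAGTACATTAACGA-"
print(round(percent_identity(human, mouse), 1))
```

Counting gap columns as mismatches instead would lower the number, so it is worth checking which definition a paper or program uses before comparing values.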

Thanks to Volker Brendel, ISU for the following Figs & Slides
Slightly modified from:
BBSI Genome Informatics Module
http://www.bioinformatics.iastate.edu/BBSI/course_desc_2005.html#moduleB
V Brendel vbrendel@iastate.edu
Brendel et al (2004) Bioinformatics 20: 1157

GeneSeqer - Brendel et al.- ISU
http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi
Spliced Alignment Algorithm
Brendel et al (2004) Bioinformatics 20: 1157
http://bioinformatics.oxfordjournals.org/cgi/content/abstract/20/7/1157
• Perform pairwise alignment with large gaps in one sequence (due to introns)
  • Align genomic DNA with cDNA, ESTs, protein sequences
• Score semi-conserved sequences at splice junctions
  • Using Bayesian probability model & 1st order MM
• Score coding constraints in translated exons
  • Using Bayesian model
[Figure: intron flanked by splice sites, GT at the donor site and AG at the acceptor site]
Brendel 2005

Splice Site Detection
Do DNA sequences surrounding splice "consensus" sequences contribute to splicing signal? YES
• Information Content $I_i$:
  $I_i = 2 + \sum_{B \in \{U,C,A,G\}} f_{iB} \log_2 (f_{iB})$
• Extent of Splice Signal Window:
  $I_i \ge \bar{I} + 1.96\,\sigma_{\bar{I}}$
i: ith position in sequence
Ī: avg information content over all positions >20 nt from splice site
σĪ: avg sample standard deviation of Ī
Brendel 2005

Information Content vs Position
[Figure: information content (y-axis, 0.0 to 0.8) vs position relative to the splice site (x-axis, -50 to 50); panels Human T2_GT and Human T2_AG]
Which sequences are exons & which are introns?
How can you tell?
Brendel 2005

Markov Model for Spliced Alignment
[Figure: Markov model with exon states e_n, e_{n+1} and intron states i_n, i_{n+1}; transition labels: PΔG, (1-PΔG)(1-PD(n+1)), (1-PΔG)PD(n+1), PA(n)PΔG, 1-PA(n)]
Brendel 2005

Evaluation of Splice Site Prediction
TP = positive instance correctly predicted as positive
FP = negative instance incorrectly predicted as positive
TN = negative instance correctly predicted as negative
FN = positive instance incorrectly predicted as negative
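
The information-content formula from the Splice Site Detection slide, I_i = 2 + Σ f_iB log2(f_iB), is easy to compute from a set of aligned sites. The Python sketch below is a minimal illustration under stated assumptions: the function names are invented here, T is treated as U, and `signal_positions` simplifies the slide's threshold by using the plain mean and sample standard deviation of a set of background positions (the slide's σĪ is defined slightly differently).

```python
import math
import statistics

def information_content(sites, alphabet="ACGU"):
    """Per-position information content I_i = 2 + sum_B f_iB * log2(f_iB)."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all aligned sites must have the same length")
    profile = []
    for i in range(length):
        column = [s[i].upper().replace("T", "U") for s in sites]
        info = 2.0
        for base in alphabet:
            f = column.count(base) / len(column)
            if f > 0:               # 0 * log2(0) is taken as 0
                info += f * math.log2(f)
        profile.append(info)
    return profile

def signal_positions(profile, background_idx, z=1.96):
    """Positions whose I_i is at least mean + z * sd of the background positions."""
    background = [profile[i] for i in background_idx]
    threshold = statistics.mean(background) + z * statistics.stdev(background)
    return [i for i, value in enumerate(profile) if value >= threshold]

# Toy donor-site alignment (hypothetical): invariant GU, variable flanks.
sites = ["AGGUAA", "CGGUGA", "UGGUAC", "GAGUCG"]
print([round(v, 2) for v in information_content(sites)])
```

A fully conserved column gives 2 bits and a column with all four bases equally frequent gives 0 bits, which is why the GT/AG dinucleotides stand out in the plot while the surrounding positions carry much weaker signal.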

Evaluation of Predictions
                     Actual True    Actual False
Predicted True       TP             FP             PP=TP+FP
Predicted False      FN             TN             PN=FN+TN
                     AP=TP+FN       AN=FP+TN
• Sensitivity: $S_n = TP / AP = 1 - \alpha$ (Coverage, Recall)
• Specificity: $S_p = TP / PP = \frac{1 - \alpha}{1 - \alpha + r\beta}$, where $r = AN / AP$
• Misclassification rates: $\alpha = FN / AP$, $\beta = FP / AN$
• Normalized specificity: $\sigma = \frac{1 - \alpha}{1 - \alpha + \beta}$
Do not memorize this!

Evaluation of Predictions - in English
[Figure: True Positives, False Positives, Predicted Positives]
Fig 5.11 Baxevanis & Ouellette 2005
• Sensitivity: $S_n = TP / AP = 1 - \alpha$ (Coverage, Recall)
In English? Sensitivity is the fraction of all positive instances having a true positive prediction.
IMPORTANT: Sensitivity alone does not tell us much about performance because a 100% sensitivity can be achieved trivially by labeling all test cases positive!
• Specificity: $S_p = TP / PP = \frac{1 - \alpha}{1 - \alpha + r\beta}$
In English? Specificity is the fraction of all predicted positives that are, in fact, true positives.
IMPORTANT: in medical jargon, Specificity is sometimes defined differently (what we define here as "Specificity" is sometimes referred to as "Positive predictive value")
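
The measures defined on the Evaluation of Predictions slides can be checked with a few lines of Python. This is a small sketch using the lecture's definitions (Sp = TP/PP, i.e. positive predictive value); the function name `evaluate` and the example counts are made up, and the code assumes AP, AN and PP are all non-zero.

```python
def evaluate(tp, fp, fn, tn):
    """Sensitivity, specificity (as TP/PP) and related rates from raw counts."""
    ap, an = tp + fn, fp + tn          # actual positives / actual negatives
    pp = tp + fp                       # predicted positives
    alpha = fn / ap                    # fraction of true sites missed
    beta = fp / an                     # fraction of non-sites called positive
    r = an / ap
    return {
        "Sn": tp / ap,
        "Sp": tp / pp,
        "alpha": alpha,
        "beta": beta,
        "r": r,
        # Same Sp, written in terms of alpha, beta and r as on the slide.
        "Sp_from_rates": (1 - alpha) / (1 - alpha + r * beta),
        # Normalized specificity: Sp as it would be if AN == AP (r = 1).
        "sigma": (1 - alpha) / (1 - alpha + beta),
    }

# Hypothetical test set: 100 true sites, 900 non-sites.
print(evaluate(tp=80, fp=20, fn=20, tn=880))

# Calling everything positive gives Sn = 1.0 but very poor Sp.
print(evaluate(tp=100, fp=900, fn=0, tn=0))
```

The second call illustrates the IMPORTANT note above: labeling every test case positive reaches 100% sensitivity while specificity falls to the fraction of true sites in the test set.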

GeneSeqer: Input
http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi
[Figure: GeneSeqer input form]
Brendel 2005

Best Measures for Comparison?
• ROC curves (Receiver Operating Characteristic (?!!))
  http://en.wikipedia.org/wiki/Roc_curve
  In signal detection theory, a receiver operating characteristic (ROC), or ROC curve, is a plot of sensitivity vs (1 - specificity) for a binary classifier system as its discrimination threshold is varied.
  (Here "specificity" is used in the medical sense, TN / AN, so 1 - specificity = FP / AN.)
  The ROC can also be represented equivalently by plotting fraction of true positives (TPR = true positive rate) vs fraction of false positives (FPR = false positive rate)
• Correlation Coefficient
  Matthews correlation coefficient (MCC)
  MCC =  1 for a perfect prediction
         0 for a completely random assignment
        -1 for a "perfectly incorrect" prediction
Do not memorize this!
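
For reference, both comparison measures can be computed directly from predictions. The Python sketch below uses the standard MCC formula, (TP·TN - FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)), which is not given on the slide, and builds ROC points by sweeping a score threshold. The function names and the toy scores are invented for illustration.

```python
import math

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention assumed here: return 0 when any marginal total is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0

def roc_points(scores, labels):
    """(FPR, TPR) pairs obtained by lowering the decision threshold step by step."""
    ap = sum(1 for y in labels if y)
    an = len(labels) - ap
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        points.append((fp / an, tp / ap))
    return points

# Toy predictor output (hypothetical): higher score = more likely a true site.
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 1, 0, 0]
print(roc_points(scores, labels))
print(mcc(tp=2, fp=0, fn=0, tn=2))
```

Each ROC point corresponds to one choice of threshold, so the curve summarizes the whole sensitivity / false-positive trade-off rather than a single operating point.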

GeneSeqer: Output
GeneSeqer: Gene Evidence Summary
Brendel 2005

Gene Prediction - Problems & Status?
Common errors?
• False positive intergenic regions:
  • 2 annotated genes actually correspond to a single gene
• False negative intergenic region:
  • One annotated gene structure actually contains 2 genes
• False negative gene prediction:
  • Missing gene (no annotation)
• Other:
  • Partially incorrect gene annotation
  • Missing annotation of alternative transcripts
Current status?
• For ab initio prediction in eukaryotes: HMMs have better overall performance for detecting intron/exon boundaries
  • Limitation? Training data: predictions are organism specific
• Combined ab initio/homology based predictions: Improved accuracy
  • Limitation? Availability of identifiable sequence homologs in databases

Recommended Gene Prediction Software
Ab initio
• GENSCAN: http://genes.mit.edu/GENSCAN.html
• GeneMark.hmm: http://exon.gatech.edu/GeneMark/
• others: GRAIL, FGENES, MZEF, HMMgene
Similarity-based
• BLAST, GenomeScan, EST2Genome, Twinscan
Combined:
• GeneSeqer, http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi
• ROSETTA
→ Consensus: because results depend on organisms & specific task, Always use more than one program!
• Two servers that report consensus predictions
  • GeneComber
  • DIGIT

Other Gene Prediction Resources: at ISU
http://www.bioinformatics.iastate.edu/bioinformatics2go/

Other Gene Prediction Resources: GaTech, MIT, Stanford, etc.
Lists of Gene Prediction Software
http://www.bioinformaticsonline.org/links/ch_09_t_1.html
http://cmgm.stanford.edu/classes/genefind/
Current Protocols in Bioinformatics (BCB/ISU owns a copy - currently in my lab!)
Chapter 4 Finding Genes
4.1 An Overview of Gene Identification: Approaches, Strategies, and Considerations
4.2 Using MZEF To Find Internal Coding Exons
4.3 Using GENEID to Identify Genes
4.4 Using GlimmerM to Find Genes in Eukaryotic Genomes
4.5 Prokaryotic Gene Prediction Using GeneMark and GeneMark.hmm
4.6 Eukaryotic Gene Prediction Using GeneMark.hmm
4.7 Application of FirstEF to Find Promoters and First Exons in the Human Genome
4.8 Using TWINSCAN to Predict Gene Structures in Genomic DNA Sequences
4.9 GrailEXP and Genome Analysis Pipeline for Genome Annotation
4.10 Using RepeatMasker to Identify Repetitive Elements in Genomic Sequences

Chp 9 - Promoter & Regulatory Element Prediction
SECTION III GENE AND PROMOTER PREDICTION
Xiong: Chp 9 Promoter & Regulatory Element Prediction
• Promoter & Regulatory Elements in Prokaryotes
• Promoter & Regulatory Elements in Eukaryotes
• Prediction Algorithms

Eukaryotes vs Prokaryotes: Genomes
Eukaryotic genomes
• Are packaged in chromatin & sequestered in a nucleus
• Are larger and have multiple linear chromosomes
• Contain mostly non-protein coding DNA (98-99%)
Prokaryotic genomes
• DNA is associated with a nucleoid, but no nucleus
• Much smaller, usually single, circular chromosome
• Contain mostly protein encoding DNA

Eukaryotes vs Prokaryotes: Gene Structure
[Figure: gene structure in eukaryotes vs prokaryotes]

Eukaryotes vs Prokaryotes: Genes
Eukaryotic genes
• Are larger and more complex than in prokaryotes
• Contain introns that are “spliced” out to generate mature mRNAs*
• Often undergo alternative splicing, giving rise to multiple RNAs*
• Are transcribed by 3 different RNA polymerases (instead of 1, as in prokaryotes)
* In biology, statements such as this include an implicit “usually” or “often”

Eukaryotes vs Prokaryotes: Regulatory Elements
• Prokaryotes:
  • Promoters & operators (for operons) - cis-acting DNA signals
  • Activators & repressors - trans-acting proteins (we won't discuss these…)
• Eukaryotes:
  • Promoters & enhancers (for single genes) - cis-acting
  • Transcription factors - trans-acting
Important difference?
• What the RNA polymerase actually binds

Eukaryotes vs Prokaryotes: Levels of Gene Regulation
Primary level of control?
• Prokaryotes: Transcription initiation
• Eukaryotes: Transcription is also very important, but
  • Expression is regulated at multiple levels, many of which are post-transcriptional:
    • RNA processing, transport, stability
    • Translation initiation
    • Protein processing, transport, stability
    • Post-translational modification (PTM)
    • Subcellular localization
• Recent important discoveries: small regulatory RNAs (miRNA, siRNA) are abundant and play very important roles in controlling gene expression in eukaryotes, often at post-transcriptional levels

Prokaryotic Promoters
• RNA polymerase complex recognizes promoter sequences located very close to and on 5’ side (“upstream”) of transcription initiation site
• Prokaryotic RNA polymerase complex binds directly to promoter, by virtue of its sigma subunit - no requirement for “transcription factors” binding first
• Prokaryotic promoter sequences are highly conserved:
  • -10 region
  • -35 region

Eukaryotic Promoters
• Eukaryotic RNA polymerase complexes do not bind directly to promoter sequences
• Transcription factors must bind first and serve as landmarks recognized by RNA polymerase complexes
• Eukaryotic promoter sequences are less highly conserved, but many promoters (for RNA polymerase II) contain:
  • -30 region "TATA" box
  • -100 region "CCAAT" box

Eukaryotic genes are transcribed by 3 different RNA polymerases
(Location of promoter regions, TFBSs & TFs differ, too)
[Figure: promoters for genes encoding rRNA; mRNA; tRNA, 5S RNA]
Brown Fig 9.18
BIOS Scientific Publishers Ltd, 1999

Eukaryotic Promoters vs Enhancers
Both promoters & enhancers are binding sites for transcription factors (TFs)
Promoters
• essential for initiation of transcription
• located “relatively” close to start site (usually <200 bp upstream, but can be located within gene, rather than upstream!)
Enhancers
• needed for regulated transcription (differential expression in specific cell types, developmental stages, in response to environment, etc.)
• can be very far from start site (sometimes > 100 kb)

Prokaryotic Genes & Operons
• Genes with related functions are often clustered within operons (e.g., lac operon)
• Operons = genes with related functions that are transcribed and regulated as a single unit; one promoter controls expression of several proteins
• mRNAs produced from operons are “polycistronic” - a single mRNA encodes several proteins; i.e., there are multiple ORFs, each with its own AUG (START) & STOP codons, linked within one mRNA molecule

Promoter of lac operon in E. coli
(Transcribed by prokaryotic RNA polymerase)
[Figure: lac operon promoter]
Brown Fig 9.17
BIOS Scientific Publishers Ltd, 1999

Eukaryotic genes
• Genes with related functions are occasionally, but not usually clustered; instead, they share common regulatory regions (promoters, enhancers, etc.)
• Chromatin structure must also be “active” for transcription to occur
• Cis-acting regulatory elements include:
  Promoters, enhancers, silencers
• Trans-acting regulatory factors include:
  Transcription factors (TFs), chromatin remodeling complexes, small RNAs

Eukaryotic genes have large & complex regulatory regions
[Figure: regulatory region of a eukaryotic gene]
Brown Fig 9.17
BIOS Scientific Publishers Ltd, 1999

Eukaryotic Promoters: DNA sequences required for initiation, usually <200 bp from start site
Eukaryotic RNA polymerases bind by recognizing a complex of TFs bound at promoter
First, TFs must bind short motifs (TFBSs) within promoters; then RNA polymerase can bind and initiate transcription of RNA
[Figure labels: ~250 bp; Pre-mRNA]

Eukaryotic promoters & enhancer regions often contain many different TFBS motifs
Fig 9.13
Mount 2004

Simplified View of Promoters in Eukaryotes
Fig 5.12
Baxevanis & Ouellette 2005

Eukaryotic Activators vs Repressors
Regions far from the promoter can act as "enhancers" or "repressors" of transcription by serving as binding sites for activator or repressor proteins (TFs)
Activator proteins (TFs) bind to enhancers & interact with RNAP to stimulate transcription
Repressors block the action of activators
[Figure labels: enhancer, promoter, RNAP, Gene, transcription; enhancer to promoter distance 100 - 50,000 bp; enhancer proteins interact with RNAP; repressor prevents binding of activator]

Eukaryotic Transcription Factors (TFs)
• Transcription factors = proteins that interact with the RNA polymerase complex to activate or repress transcription
• TFs often contain both:
  • a trans-activating domain
  • a DNA binding domain or motif
  Here motif = amino acid sequence in protein
• TFs recognize and bind specific short DNA sequence motifs called “transcription factor binding sites” (TFBSs)
  Here motif = nucleotide sequence in DNA
• Databases for TFs & TFBSs include:
  • TRANSFAC, http://www.generegulation.com/cgibin/pub/databases/transfac
  • JASPAR

Zinc Finger Proteins - Transcription Factors
• Common in eukaryotic proteins
• ~ 1% of mammalian genes encode zinc-finger proteins (ZFPs)
• In C. elegans, there are > 500 !
• Can be used as highly specific DNA binding modules
• Potentially valuable tools for directed genome modification (esp. in plants) & human gene therapy - one clinical trial will begin soon!
• Did you go to Dave Segal's seminar?
• Your TAs Pete & Jeff work on designing better ZFPs!
Brown Fig 9.12
BIOS Scientific Publishers Ltd, 1999

Promoter Prediction Algorithms & Software
Xiong -

Eukaryotes vs Prokaryotes: Promoter Prediction
Promoter prediction is much easier in prokaryotes
Why?
• Highly conserved
• Simpler gene structures
• More sequenced genomes! (for comparative approaches)
Methods?
• Previously: mostly HMM-based
• Now: similarity-based comparative methods because so many genomes available
Xiong textbook:
1) "Manual method" = rules of Wang et al (see text)
2) BPROM - uses linear discriminant function
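
Because prokaryotic promoters are built from two conserved boxes (-35 and -10), even a naive consensus search gives a feel for how they can be found. The Python sketch below is an illustration only: it is neither the "manual method" of Wang et al. nor BPROM. The consensus sequences TTGACA and TATAAT and the 16-18 bp spacer are the textbook E. coli σ70 values (not given on the slides), and the function names and mismatch cutoff are arbitrary choices for this example.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(1 for x, y in zip(a, b) if x != y)

def find_promoter_like(seq, box35="TTGACA", box10="TATAAT",
                       spacers=range(16, 19), max_mismatches=2):
    """Return (pos35, pos10, spacer, mismatches) for -35/-10 box pairs
    separated by an allowed spacer, with few mismatches to the consensus."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - len(box35) + 1):
        m35 = hamming(seq[i:i + len(box35)], box35)
        if m35 > max_mismatches:
            continue
        for gap in spacers:
            j = i + len(box35) + gap          # start of candidate -10 box
            if j + len(box10) > len(seq):
                break
            total = m35 + hamming(seq[j:j + len(box10)], box10)
            if total <= max_mismatches:
                hits.append((i, j, gap, total))
    return hits

# Toy sequence (hypothetical) with perfect boxes and a 17 bp spacer.
toy = "GG" + "TTGACA" + "A" * 17 + "TATAAT" + "GGGG"
print(find_promoter_like(toy))
```

Loosening `max_mismatches` quickly increases the number of hits in random sequence, which is one reason real tools score positions with weight matrices or discriminant functions instead of exact consensus matching.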

Predicting Promoters in Eukaryotes
Closely related to gene prediction!
• Obtain genomic sequence
• Use sequence-similarity based comparison (BLAST, MSA) to find related genes
  But: "regulatory" regions are much less well-conserved than coding regions
• Locate ORFs
• Identify Transcription Start Site (TSS) (if possible!)
• Use Promoter Prediction Programs
• Analyze motifs, etc. in DNA sequence (TRANSFAC, JASPAR)

Predicting promoters: Steps & Strategies
Identify TSS -- if possible?
• One of biggest problems is determining exact TSS!
  Not very many full-length cDNAs!
• Good starting point? (human & vertebrate genes)
  Use FirstEF
  found within UCSC Genome Browser
  or submit to FirstEF web server
Fig 5.10
Baxevanis & Ouellette 2005

Automated Promoter Prediction Strategies
1) Pattern-driven algorithms (ab initio)
2) Sequence-driven algorithms (homology based)
3) Combined "evidence-based"
BEST RESULTS? Combined, sequential

1) Pattern-driven Algorithms
• Success depends on availability of collections of annotated transcription factor binding sites (TFBSs)
• Tend to produce very large numbers of false positives (FPs)
• Why?
  • Binding sites for specific TFs are often variable
  • Binding sites are short (typically 6-10 bp)
  • Interactions between TFs (& other proteins) influence both affinity & specificity of TF binding
  • One binding site often recognized by multiple TFs
  • Biology is complex: gene activation is often specific to organism/cell/stage/environmental condition; promoter and enhancer elements must mediate this

Ways to Reduce FPs in ab initio Prediction
• Take sequence context/biology into account
  • Eukaryotes: clusters of TFBSs are common
  • Prokaryotes: knowledge of σ (sigma) factors helps
• Probability of "real" binding site higher if annotated transcription start site (TSS) is nearby
  • But: What about enhancers? (no TSS nearby!) & only a small fraction of TSSs have been experimentally determined
• Do the wet lab experiments!
  • But: Promoter-bashing can be tedious…
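
A quick calculation shows why short binding sites produce so many false positives. The Python sketch below estimates how many matches to a k bp site are expected purely by chance, assuming a random sequence with equal base frequencies and independent positions (a simplifying assumption; real genomes are not uniform). The function name `expected_random_hits` is made up for this example.

```python
from math import comb

def expected_random_hits(motif_len, seq_len, both_strands=True, mismatches=0):
    """Expected number of chance matches to one specific motif of length
    motif_len in a random sequence of length seq_len (uniform base usage)."""
    # Probability that a random k-mer matches with at most `mismatches` errors.
    p = sum(comb(motif_len, m) * 3 ** m
            for m in range(mismatches + 1)) / 4 ** motif_len
    positions = seq_len - motif_len + 1
    strands = 2 if both_strands else 1
    return positions * strands * p

# A 6 bp site matches by chance about once every 4**6 = 4096 bp per strand.
print(expected_random_hits(6, 4101, both_strands=False))

# Genome-scale search (length is a rough, assumed human genome size).
print(expected_random_hits(6, 3_000_000_000))
print(expected_random_hits(10, 3_000_000_000, mismatches=1))
```

Since TF binding sites are typically 6-10 bp and often tolerate mismatches, chance matches vastly outnumber functional sites, which is why the context-based filters listed above (TFBS clusters, nearby TSS) are needed.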

2) Sequence-driven Algorithms
• Assumption: Common functionality can be deduced from sequence conservation (Homology)
• Alignments of co-regulated genes should highlight elements involved in regulation
Careful: How determine co-regulation?
1. Orthologous genes from different species
2. Genes experimentally shown to be co-regulated (using microarrays??)
Comparative promoter prediction:
1. Phylogenetic footprinting
2. Expression Profiling

Phylogenetic Footprinting
• Based on increasing availability of whole genome DNA sequences from many different species
• Selection of organisms for comparison is important
  • not too close, not too far: good = human vs mouse
• To reduce FPs, must extract non-coding sequences and then align them; prediction depends on good alignment
  • use MSA algorithms (e.g., CLUSTAL)
  • more sensitive methods
    • Gibbs sampling
    • Expectation Maximization (EM) methods
• Examples of programs:
  • Consite, rVISTA, PromH(W), Bayes aligner, Footprinter

Expression Profiling
• Based on increasing availability of whole genome mRNA expression data, esp., microarray data
• High-throughput simultaneous monitoring of expression levels of thousands of genes
• Need sets of co-regulated genes
• Assumptions: (sometimes valid, sometimes NOT)
  1. Co-expression implies co-regulation
  2. Co-regulated genes share common regulatory elements
• Drawbacks:
  1. Signals are short & weak!
     Requires Gibbs sampling or EM: e.g., MEME, AlignACE, Melina
  2. Prediction depends on determining which genes are co-expressed - usually by clustering - which can be error prone
• Examples of programs:
  • INCLUSive - combined microarray analysis & motif detection
  • PhyloCon - combined phylo footprinting & expression profiling

Problems with Sequence-driven Algorithms
For comparative (phylogenetic) methods
• Must choose appropriate species
• Different genomes evolve at different rates
• Classical alignment methods have trouble with translocations or inversions that change order of functional elements
• If background conservation of entire region is high, comparison is useless
• Not enough data (but Prokaryotes >>> Eukaryotes)
Complexity: many regulatory elements are not conserved across species!
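
The core step of phylogenetic footprinting, finding stretches of an alignment that are more conserved than their surroundings, can be sketched as a sliding-window identity scan. This Python example is a bare-bones illustration, not how Consite, rVISTA or Footprinter work internally; the function name, window size and identity cutoff are arbitrary assumptions, and gap columns are simply counted as mismatches.

```python
def conserved_windows(aln_a, aln_b, window=10, min_identity=0.9):
    """Return (start_column, identity) for every alignment window whose
    fraction of identical, ungapped columns is at least min_identity."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    # 1 for an identical non-gap column, 0 otherwise.
    match = [int(x == y and x != "-")
             for x, y in zip(aln_a.upper(), aln_b.upper())]
    hits = []
    running = sum(match[:window])
    for start in range(len(match) - window + 1):
        if start > 0:
            # Slide the window one column to the right.
            running += match[start + window - 1] - match[start - 1]
        identity = running / window
        if identity >= min_identity:
            hits.append((start, identity))
    return hits

# Toy pairwise alignment (hypothetical): a conserved block, then divergence.
species_a = "ACGTACGTAC" + "TTTTTTTTTT"
species_b = "ACGTACGTAC" + "GGGGGGGGGG"
print(conserved_windows(species_a, species_b, window=10, min_identity=0.9))
```

If the two species are too closely related, almost every window passes the cutoff and nothing stands out, which is the "background conservation is high" problem listed above.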

Global Alignment of Human & Mouse Obese Gene Promoters (200 bp upstream from TSS)
[Figure: global alignment of the two promoter regions]
Fig 5.14
Baxevanis & Ouellette 2005

TRANSFAC Matrix Entry: for TATA box
Fields:
• Accession & ID
• Brief description
• TFs associated with this entry
• Weight matrix
• Number of sites used to build
• Other info
Fig 5.13
Baxevanis & Ouellette 2005
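
A weight matrix like the one stored in a TRANSFAC entry is typically used by converting the per-position base counts into log-odds scores and sliding the matrix along a sequence. The Python sketch below shows that idea under stated assumptions: the counts are invented toy numbers for a TATA-like site (not the real TRANSFAC values), the background is uniform, and the function names, pseudocount and threshold are arbitrary choices for this example.

```python
import math

def pssm_from_counts(counts, pseudocount=0.5, background=0.25):
    """Turn per-position base counts into log2-odds scores vs. background."""
    pssm = []
    for column in counts:
        total = sum(column.get(b, 0) for b in "ACGT") + 4 * pseudocount
        pssm.append({b: math.log2(((column.get(b, 0) + pseudocount) / total)
                                  / background)
                     for b in "ACGT"})
    return pssm

def scan(seq, pssm, threshold):
    """Return (position, score) for windows scoring at or above threshold."""
    seq = seq.upper()
    width = len(pssm)
    hits = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        if any(b not in "ACGT" for b in window):
            continue                      # skip windows containing N etc.
        score = sum(pssm[k][b] for k, b in enumerate(window))
        if score >= threshold:
            hits.append((i, round(score, 2)))
    return hits

# Invented counts from 10 hypothetical sites resembling TATAAA.
toy_counts = [
    {"T": 9, "A": 1}, {"A": 9, "T": 1}, {"T": 9, "A": 1},
    {"A": 10},        {"A": 7, "T": 3}, {"A": 9, "G": 1},
]
tata = pssm_from_counts(toy_counts)
print(scan("GGGCTATAAAGGC", tata, threshold=5.0))
```

Lowering the threshold finds weaker sites but also more chance matches, so the cutoff is usually tuned against known sites and background sequence.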

Annotated Lists of Promoter Databases & Promoter Prediction Software
• URLs from Mount textbook:
  Table 9.12 http://www.bioinformaticsonline.org/links/ch_09_t_2.html
• Table in Wasserman & Sandelin Nat Rev Genet article
  http://proxy.lib.iastate.edu:2103/nrg/journal/v5/n4/full/nrg1315_fs.htm
• URLs from Baxevanis & Ouellette textbook:
  http://www.wiley.com/legacy/products/subject/life/bioinformatics/ch05.htm#links
More lists:
• http://www.softberry.com/berry.phtml?topic=index&group=programs&subgroup=promoter
• http://bioinformatics.ubc.ca/resources/links_directory/?subcategory_id=104
• http://www3.oup.co.uk/nar/database/subcat/1/4/

Check out Optional Review & Try Associated Tutorial:
Wasserman WW & Sandelin A (2004) Applied bioinformatics for identification of regulatory elements. Nat Rev Genet 5:276-287
http://proxy.lib.iastate.edu:2103/nrg/journal/v5/n4/full/nrg1315_fs.html
Check this out: http://www.phylofoot.org/NRG_testcases/
Bottom line: this is a very "hot" area - new software for computational prediction of gene regulatory elements published every day!