Gene Prediction

Gene Finding
Copyright notice
• Many of the images in this PowerPoint presentation are from Bioinformatics and Functional Genomics by Jonathan Pevsner (ISBN 0-471-21004-8). Copyright © 2003 by John Wiley & Sons, Inc.
• Many slides in this presentation are from slides by Dr. Jonathan Pevsner and others. The copyright belongs to the original authors. Thanks!
Gene Finding
Why do it?
• Find and annotate all the genes within the large volume of DNA sequence data
– Human DNA length ≈ 3.4 × 10^9 bp
– Number of genes = 30,000-100,000
– Gene percentage ≈ 1%
• Gain understanding of problems in basic biology
– e.g. gene regulation: what are the mechanisms involved in transcription, splicing, etc.?
• The different emphases among these goals have some effect on the design of computational approaches for gene finding.
Gene Finding
• Cells recognize genes from the DNA sequence
– they find genes via their biological processes
• Not so easy for us…
Where is the Gene?
CTAGCAGGGACCCCAGCGCCCGAGAGACCATGCAGAGGTCGCCT
CTGGAAAAGGCCAGCGTTGTCTCCAAACTTTTTTTCAGGTGAGA
AGGTGGCCAACCGAGCTTCGGAAAGACACGTGCCCACGAAAGAG
GAGGGCGTGTGTATGGGTTGGGTTGGGGTAAAGGAATAAGCAGT
TTTTAAAAAGATGCGCTATCATTCATTGTTTTGAAAGAAAATGT
GGGTATTGTAGAATAAAACAGAAAGCATTAAGAAGAGATGGAAG
AATGAACTGAAGCTGATTGAATAGAGAGCCACATCTACTTGCAA
CTGAAAAGTTAGAATCTCAAGACTCAAGTACGCTACTATGCACT
TGTTTTATTTCATTTTTCTAAGAAACTAAAAATACTTGTTAATA
AGTACCTANGTATGGTTTATTGGTTTTCCCCCTTCATGCCTTGG
ACACTTGATTGTCTTCTTGGCACATACAGGTGCCATGCCTGCAT
ATAGTAAGTGCTCAGAAAACATTTCTTGACTGAATTCAGCCAAC
AAAAATTTTGGGGTAGGTAGAAAATATATGCTTAAAGTATTTAT
TGTTATGAGACTGGATATAT...
Types of Genes
• Protein coding
– most genes
• RNA genes
– rRNA
– tRNA
– snRNA (small nuclear RNA)
– snoRNA (small nucleolar RNA)
3 Major Categories of Information used in
Gene Finding Programs
• Signals/features
– a sequence pattern with functional significance e.g. splice donor
& acceptor sites, start and stop codons, promoter features such
as TATA boxes, TF binding sites, CpG islands
• Content/composition
– statistical properties of coding vs. non-coding regions.
• e.g. codon bias; length of ORFs in prokaryotes; GC content
• Similarity
– compare DNA sequence to known sequences in database
– Not only known proteins but also ESTs, cDNAs
Gene Structure
Prokaryotic Gene Structure
Reading 5’ to 3’, a prokaryotic gene consists of:
• Promoter region (maybe)
• Ribosome binding site (maybe)
• Start codon
• Open Reading Frame
• Stop codon
• Termination sequence (maybe)
In Prokaryotic Genomes
• We usually start by looking for an ORF
– A start codon, followed by (usually) at least 60 amino acid
codons before a stop codon occurs
– Or by searching for similarity to a known ORF
• Look for basal signals
– Transcription (the promoter consensus and the termination
consensus)
– Translation (ribosome binding site: the Shine-Dalgarno
sequence)
• Look for differences in sequence content between
coding and non-coding DNA
– GC content and codon bias
Gene Finding in Bacterial Genomes
• Advantages
– Simple gene structure
• Small genomes (0.5 to 10 million bp)
• No introns
– Dense Genomes
• High coding density (>90%)
• Short intergenic regions
– Conserved signals
– Abundant comparative information
• Complete Genomes available for many
– Uninterrupted ORFs
• Disadvantages
– Some genes overlap (nested)
– Some genes are quite short (<60 bp)
Open Reading Frame (ORF)
• Any stretch of DNA that potentially
encodes a protein
• The identification of an ORF is the first
indication that a segment of DNA may be
part of a functional gene
Open Reading Frames
A C G T A A C T G A C T A G G T G A A T
Reading frame 1: ACG TAA CTG ACT AGG TGA
Reading frame 2: CGT AAC TGA CTA GGT GAA
Reading frame 3: GTA ACT GAC TAG GTG AAT
Each grouping of the nucleotides into
consecutive triplets constitutes a reading
frame.
A sequence of triplets that contains no stop
codon is an Open Reading Frame (ORF)
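As a small illustration (a minimal Python sketch, not part of the original slides), the three forward reading frames of the example sequence can be enumerated like this:

    def reading_frames(seq):
        # Split a DNA sequence into consecutive codons for each of the
        # three forward reading frames (offsets 0, 1, 2).
        return {offset + 1: [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]
                for offset in range(3)}

    for frame, codons in reading_frames("ACGTAACTGACTAGGTGAAT").items():
        print(frame, " ".join(codons))

(The reverse-complement strand contributes three more frames, for six in total.)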
ORFs as gene candidates
• An open reading frame that begins with a start codon (usually ATG, GTG or TTG, but this is species-dependent)
• Most prokaryotic genes code for proteins that are 60 or more amino acids in length
• The probability that a random sequence of n codons contains no stop codons (UAA, UAG, UGA) is (61/64)^n
– When n is 50, there is a probability of about 91% that the random sequence contains a stop codon
– When n is 100, this probability exceeds 99%
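A quick check of these numbers (a Python sketch, not from the original slides):

    def p_contains_stop(n):
        # Probability that a run of n random codons contains at least one
        # of the three stop codons, i.e. 1 - (61/64)^n.
        return 1 - (61 / 64) ** n

    print(f"n=50:  {p_contains_stop(50):.1%}")   # about 91%
    print(f"n=100: {p_contains_stop(100):.1%}")  # above 99%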
Codon Bias
• The genetic code is degenerate
– Equivalent triplet codons code for the same amino acid
• Codon usage varies
– organism to organism
– gene to gene
• Biological basis
– Avoidance of codons similar to stop
– Preference for codons that correspond to abundant
tRNAs within the organism
Codon Bias
Gene Differences

Relative usage of the four glycine codons in two yeast genes:

Gly codon   GAL4   ADH1
GGG         0.21   0.00
GGA         0.17   0.00
GGT         0.38   0.93
GGC         0.24   0.07
Codon Bias
Organism differences
• Arginine : CGT,CGC,CGA,CGG,AGA,AGG
• Yeast Genome: arg specified by AGA 48% of
time (other five equivalent codons ~10% each)
• Fruitfly Genome: arg specified by CGC 33% of
time (other five ~13% each)
• Complete set of codon usage biases can be
found at:
http://www.kazusa.or.jp/codon/
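Codon usage tables like those at the site above can be computed directly from coding sequences; a minimal Python sketch (the example CDS is hypothetical):

    from collections import Counter

    def codon_usage(cds):
        # Relative frequency of each codon in an in-frame coding sequence.
        codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
        counts = Counter(codons)
        total = sum(counts.values())
        return {codon: count / total for codon, count in sorted(counts.items())}

    print(codon_usage("ATGGGTGGTGGCGGAGGTTAA"))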
GC content
• GC relative to AT is a distinguishing factor of
bacterial genomes
• Varies dramatically across species
– Serves as a means to identify bacterial species
• For various biological reasons
– Mutational bias of particular DNA polymerases
– DNA repair mechanisms
– horizontal gene transfer (transformation, transduction,
conjugation)
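GC content itself is a one-liner to compute; a small Python sketch:

    def gc_content(seq):
        # Fraction of G and C bases in a DNA sequence.
        seq = seq.upper()
        return (seq.count("G") + seq.count("C")) / len(seq)

    print(f"{gc_content('ATGGCGCGCATTA'):.2f}")  # 0.54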
GC Content
• GC content may be different in recently
acquired genes than elsewhere
• This can lead to variations in the
frequency of codon usage within coding
regions
– There may be significant differences in codon
bias within different genes of a single
bacterium’s genome
Ribosome Binding Sites
• RBS is also known as a Shine-Dalgarno
sequence (species-dependent) that should
bind well with the 3’ end of 16S rRNA (part
of the ribosome)
• Usually found within 4-18 nucleotides of
the start codon of a true gene
Shine-Dalgarno Sequence
• Shine-Dalgarno sequence is a nucleotide
sequence (consensus = AGGAGG) that is
present in the 5'-untranslated region of
prokaryotic mRNAs.
• This sequence serves as a binding site for
ribosomes and is thought to influence the
reading frame.
• If a subsequence aligning well with the Shine-Dalgarno sequence is found within 4-18 nucleotides of an ORF’s start codon, that improves the ORF’s candidacy.
Bacterial Promoter
Consensus, with subscripts giving the percentage occurrence of each base:
-35 element: T(82) T(84) G(78) A(65) C(54) A(45)
… (16-18 bp) …
-10 element: T(80) A(95) T(45) A(60) A(50) T(96)
+1: (A, G)
Not so simple: remember, these are consensus sequences.
Eukaryotic Gene Structure
Genes and Signals
The Complicating Factors in Eukaryotes
• Interrupted genes (split genes)
– introns and exons
• Large genomes
• Most DNA is non-coding
– introns, regulatory regions, “junk” DNA (unknown function)
– About 3% coding
• Complex regulation of gene expression
– Regulatory sequences may be far away from the start codon
Some numbers to consider:
• Vertebrate genes average about 30 kb long
– this varies a lot
• The coding region is only about 1-2 kb
• Exon sizes and numbers vary a lot
– the average is 6 exons, each about 150 bp long
• An average 5’ UTR is about 750 bp
• An average 3’ UTR is about 450 bp
– (both can be much longer)
• There are huge deviations from all of these numbers
– e.g. dystrophin is 2.4 Mb long; the factor VIII gene has 26 exons, with introns up to 32 kb (one intron produces 2 transcripts unrelated to the gene!)
– There are genes without introns: called single-exon or intronless genes
Given a long eukaryotic DNA
sequence:
• How would you determine if it had a gene?
• How would you determine which
substrings of the sequence contained
protein-coding regions?
So, what’s the problem with
looking for ORFs?
“split” genes make it difficult to define ORFs
• Where are the starts and stops?
• What problems do introns introduce?
• What would you predict for the size of
ORFs?
Most Programs Concentrate on
Finding Exons
• Exon: the region of DNA within a gene
that codes for a polypeptide chain or
domain
• Intron: non-coding sequences found in the
structural genes
Splice Sites used to Define Exons
• Splice donor (exon-intron boundary)
and splice acceptor (intron-exon
boundary)
• Common sequence motifs
– (C or A)AG / GT(A or G)AGT: the “donor” splice site
– (T or C)nN(C or T)AG / G: the “acceptor” splice site
(the slash marks the exon-intron boundary)
Gene finding programs look for
different types of exons
• single exon genes: begin with start codon & end
with stop codon
• initial exons: begin with start codon & end with
donor site
• internal exons: begin with acceptor & end with
donor
• terminal exons: begin with acceptor & end with
stop codon
How are correct splice sites
identified?
• There are many occurrences of GT or AG
within introns that are not splice sites
• Statistical profiles of splice sites are used
http://www.lclark.edu/~lycan/Bio490/pptpresentations/mutation/sld016.htm
Other Biologically Important Signals
Used in Gene Finding Programs
• Transcriptional signals
– Transcription start: characterized by the cap signal, a single purine (A/G)
– TATA box (promoter) at –25 relative to the start
– Polyadenylation signal: AATAAA (3’ end)
• Major caveat: not all genes have these signals
– This makes it difficult to define the beginning and end of a gene
Upstream Promoter Sites
• Transcription Factor (TF) sites
– Transcription factors are sequence-specific DNA-binding proteins
– Bind to consensus DNA sequences
– e.g. CAAT transcription factor and CAAT box
• Many of these
– Vary in sequence, location, interaction with other sites
– Further complicates the problem of delineating a
“gene”
Translation Signals
• Kozak sequence
– The signal for initiation of translation in
vertebrates
– Consensus is GCCACCatgG
• And of course..
– Translation stop codons
GC Content in Eukaryotes
• Overall GC content does not vary between
species as it does in prokaryotes
• GC content is still important in gene finding
algorithms
– CpG Islands
CpG Islands
• CpG stands for cytosine and guanine
separated by a phosphate, which links the
two nucleosides together in DNA.
– CG dinucleotides are often written CpG to
avoid confusion with the base pair C-G
CpG Islands
• In the eukaryotic genome, CpG occurs at a lower frequency than would be expected in purely random sequences (1/16).
– Its occurrence is related to methylation
– Methylation of the C in CpG turns it into 5-methylcytosine. Following spontaneous deamination, 5-methylcytosine converts into thymine.
– Methylation of C thus makes CpG prone to mutation (e.g. to TpG or CpA). CpG sites therefore tend to be eliminated from the genomes of eukaryotes.
CpG Islands
• However, the start regions of many genes have a high concentration of CpG sites: these are CpG islands.
– Found at the promoters of eukaryotic genes.
– These CpG sites are unmethylated, so any spontaneous deamination of cytosine to uracil is recognized by the repair machinery and the CpG site is restored.
– A high occurrence of CpGs in many cases marks the existence of downstream genes and is frequently used in genome annotation as an indicator of gene density.
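A common way to quantify this is the observed/expected CpG ratio over a window; a minimal Python sketch (the >0.6 threshold mentioned in the comment is a widely used rule of thumb, not from the slides):

    def cpg_obs_exp(seq):
        # Observed CpG dinucleotides divided by the count expected from the
        # C and G frequencies alone; CpG islands typically score above ~0.6.
        seq = seq.upper()
        c, g, cpg = seq.count("C"), seq.count("G"), seq.count("CG")
        expected = c * g / len(seq)
        return cpg / expected if expected else 0.0

    print(round(cpg_obs_exp("GCGCGGCGATCGCCGC"), 2))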
Gene Finding by
Computational Methods
• Dependent on good experimental data to
build reliable predictive models
• Various aspects of gene structure/function
provide information used in gene finding
programs
Computational Gene finding
approaches
1) Rule-based (e.g., start & stop codons)
2) Content-based (e.g., codon bias,
promoter sites)
3) Similarity-based (e.g., orthologs)
4) Pattern-based (e.g., machine-learning:
neural network, HMM)
Simple rule-based gene finding in
prokaryotes, based on ORFs
• Look for putative start codon (ATG)
• Staying in same frame, scan in groups of
three until a stop codon is found
• If # of codons >=50, assume it’s a gene
• If # of codons <50, go back to last start
codon, increment by 1 & start again
• At end of chromosome, repeat process for
reverse complement
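A Python sketch of this procedure (illustrative; min_codons and the exact scanning policy are the slide's stated rules, not a production gene finder):

    STOPS = {"TAA", "TAG", "TGA"}

    def reverse_complement(seq):
        return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

    def find_orfs(seq, min_codons=50):
        # Scan both strands; from each putative ATG, read in-frame triplets
        # until a stop codon and keep the ORF if it is long enough.
        orfs = []
        for strand, s in (("+", seq), ("-", reverse_complement(seq))):
            for i in range(len(s) - 2):                 # "increment by 1 & start again"
                if s[i:i + 3] != "ATG":
                    continue
                for j in range(i, len(s) - 2, 3):       # stay in the same frame
                    if s[j:j + 3] in STOPS:
                        if (j - i) // 3 >= min_codons:  # >= 50 codons: assume a gene
                            orfs.append((strand, i, j + 3))
                        break
        return orfs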
Example ORF
Problems with rule-based
approaches
• Advantages
– Simple and fairly sensitive (>50%)
• Disadvantages
– Prokaryotic genes are not always so simple to find
– ATG is not the only possible start codon (e.g. CTG, TTG – class I alternates)
– Small genes tend to be overlooked and long ones
over-predicted
• Solution? Use additional information to
increase confidence in predictions
Content based approaches
• Key prokaryotic gene features
– RNA polymerase promoter site (the -10 and -35 sites; the -10 element is the TATA-like box)
– Shine-Dalgarno sequence (+10, Ribosome
Binding Site) to initiate protein translation
– Codon biases
– High GC content
– Stem-loop (rho-independent) terminators
Content based approaches
• Key eukaryotic gene features
– CpG islands
• More abundant near gene start site
• High GC content in 5’ ends of genes
– Codon Bias
• Some codons are strongly preferred in coding regions, others
are not
– Hexamers
• Dicodon frequencies informative – physical constraints prefer
certain adjacent amino acids over others
– Positional Bias
• 3rd base tends to be G/C rich in coding regions
Content-based recognition
• Advantages:
– Increases accuracy over rule-based
• Disadvantages:
– Features are degenerate
– Features are not always present
Homology-Based Approaches in
Eukaryotic Genomes
• More complicated than in prokaryotes due to split genes
• Genome sequence -> first identify all candidate exons
• Use a spliced alignment algorithm to explore all possible
exon assemblies & compare to known
– e.g. Procrustes
• Limitations:
– must have similar sequence in the database with
known exon structure
– Sensitive to frame shift errors
Gene Finding using
Comparative Genomics
• Purifying selection: regions conserved between two genomes are likely functional; otherwise they would have diverged.
• If genomes are too close in the
phylogenetic tree, there may be too much
noise.
• If genomes are too far, then regions can
be missed.
UCSC Browser
Gene Prediction using
sequence similarities
• GenomeScan incorporates a similarity-based method by adding a BLASTX component to its prediction algorithm, using the translated sequence to search a protein database.
• http://genes.mit.edu/genomescan/
• “TWINSCAN is a gene prediction system that models
both gene structure and evolutionary conservation. The
scores of features like splice sites and coding regions
are modified using the patterns of divergence between
the target genome and a closely related genome.”
• http://genes.cs.wustl.edu/
Neural Networks - Grail
• Sensors are trained using a set of known
genes in the organism.
• GrailExp incorporates a similarity-based method by adding a BLASTN component to its prediction algorithm. It runs reliably on unmasked sequences.
• The sensors are:
– Frame Bias Matrix: uses the codon bias to determine the correct frame.
– Fickett: named after Fickett, who originally used properties such as 3-periodicity and overall base composition to predict genes.
Neural Networks - Grail
– Coding 6-tuple word preference: the frequency of 6-tuple words in the coding region.
– Coding 6-tuple in-frame preference: the 6-tuple composition is evaluated for the 3 frames, and the one with the best score is used.
– Repetitive 6-tuple word preference: 6-tuple statistics in repetitive elements. This identifies regions where coding is not expected.
Neural Network
Training Set

Training sequences: ACGAAG, AGGAAG, AGCAAG, ACGAAA, AGCAAC
Desired output (per position; E = exon, N = non-exon): EEEENN

Definitions (encoding):
A = [001], C = [010], G = [100]
E = [01], N = [00]

A sliding window moves along each sequence; each window of three encoded bases forms an input vector, e.g. [010100001] (C, G, A), with a corresponding output vector, e.g. [01] (E).
Neural Network Training

The input vector [010100001] is multiplied by weight matrix 1 (9 inputs × 3 hidden units):

.2 .4 .1
.1 .0 .4
.7 .1 .1
.0 .1 .1
.0 .0 .0
.2 .4 .1
.0 .3 .5
.1 .1 .0
.5 .3 .1

giving the hidden layer [.6 .4 .6]. The hidden layer is multiplied by weight matrix 2 (3 × 2):

.1 .8
.0 .2
.3 .3

giving the output vector [.24 .74], which is compared with the desired output [0 1]. (The slide also shows the logistic activation function 1/(1 + e^-x) between layers.)
Back Propagation

The difference between the output [.24 .74] and the desired output [0 1] is propagated backwards through the network, adjusting the entries of weight matrix 2 and weight matrix 1 (the slide shows updated entries such as .02, .83, .23, .22, .33 appearing in weight matrix 2).
Calculate New Output

With the updated weights (weight matrix 2 now .02 .83 / .00 .23 / .22 .33, and weight matrix 1 slightly adjusted), the same input [010100001] gives the hidden layer [.7 .4 .7] and the output vector [.16 .91], close to the desired [0 1]: converged!
Train on Second Input Vector

The next input vector [100001001] is fed through the same network, giving the hidden layer [.8 .6 .5] and the output vector [.12 .95], which is again compared with the desired output [0 1].
Back Propagation

The error for the second input vector is propagated backwards in the same way, further adjusting both weight matrices (the slide shows updated entries such as .01, .84, .24, .21, .34).
After Many Iterations…

Two “generalized” weight matrices result:

Weight matrix 1 (9 × 3):
.13 .08 .12
.24 .01 .45
.76 .01 .31
.06 .32 .14
.03 .11 .23
.21 .21 .51
.10 .33 .85
.12 .34 .09
.51 .31 .33

Weight matrix 2 (3 × 2):
.03 .93
.01 .24
.12 .23
Neural Networks

To classify a new pattern (e.g. ACGAGG), the input is fed through layer 1 (Matrix 1), the hidden layer and Matrix 2 to the output, which is decoded as the prediction (e.g. EEEENN).
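A minimal sketch of this forward pass in Python/NumPy. The base encodings follow the slides; the T encoding and the weight values are assumptions (the slides never show T, and the "trained" weights here are random placeholders). The slides tabulate raw matrix products; following the 1/(1 + e^-x) label, the logistic activation is applied at each layer here:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # Base encodings from the slides; [000] for T is an assumption.
    ENC = {"A": [0, 0, 1], "C": [0, 1, 0], "G": [1, 0, 0], "T": [0, 0, 0]}

    def encode(window):
        # Concatenate the 3-bit codes of a 3-base window into a 9-bit vector.
        return np.array([bit for base in window for bit in ENC[base]], dtype=float)

    def predict(window, W1, W2):
        # Two-layer feed-forward pass: 9 inputs -> 3 hidden units -> 2 outputs.
        hidden = sigmoid(encode(window) @ W1)
        return sigmoid(hidden @ W2)   # e.g. close to [0, 1] for an exon window

    # Placeholder "trained" weights of the right shapes (9x3 and 3x2):
    rng = np.random.default_rng(0)
    W1, W2 = rng.random((9, 3)), rng.random((3, 2))
    print(predict("CGA", W1, W2))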
Hidden Markov Models
• In general, sequences are not monolithic, but
can be made up of discrete segments
• Hidden Markov Models (HMMs) allow us to
model complex sequences, in which the
character emission probabilities depend upon
the state
• Think of an HMM as a probabilistic or
stochastic sequence generator, and what is
hidden is the current state of the model
MM
A Markov process is a process which moves from state to state depending (only) on the previous n states.

Weather example with three states (Sunny, Cloudy, Rainy):

Initial distribution:
Π = (0.6, 0.3, 0.1) for (Sunny, Cloudy, Rainy)

Transition matrix A (rows: weather yesterday; columns: weather today):

         Sunny   Cloudy   Rainy
Sunny    0.5     0.25     0.25
Cloudy   0.375   0.125    0.375
Rainy    0.125   0.625    0.375
Example:
P(Sunny, Sunny, Cloudy, Rainy | Model)
= Π(Sunny) * P(Sunny | Sunny) * P(Cloudy | Sunny) * P(Rainy | Cloudy)
= 0.6 * 0.5 * 0.25 * 0.375 = 0.0281
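The same computation as a Python sketch, using the matrix exactly as printed on the slide:

    import numpy as np

    STATES = ["Sunny", "Cloudy", "Rainy"]
    PI = np.array([0.6, 0.3, 0.1])
    A = np.array([[0.5,   0.25,  0.25],
                  [0.375, 0.125, 0.375],
                  [0.125, 0.625, 0.375]])

    def chain_prob(sequence):
        # P(state sequence | model) for a first-order Markov chain:
        # initial probability times one transition probability per step.
        idx = [STATES.index(s) for s in sequence]
        p = PI[idx[0]]
        for i, j in zip(idx, idx[1:]):
            p *= A[i, j]
        return p

    print(round(chain_prob(["Sunny", "Sunny", "Cloudy", "Rainy"]), 4))  # 0.0281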
HMM

Emission probabilities (per state, over ball colours):

B1 = (Yellow 0.25, Red 0.25, Green 0.25, Blue 0.25)
B2 = (Yellow 0.35, Red 0.10, Green 0.35, Blue 0.10)
B3 = (Yellow 0.10, Red 0.65, Green 0, Blue 0.25)

State transition probabilities (rows: state at turn i; columns: state at turn i+1):

      #1    #2    #3
#1    0.1   0.7   0.2
#2    0.4   0.2   0.4
#3    0.2   0.3   0.5

Initial state distribution:
Π = (0.6, 0.3, 0.1) for (#1, #2, #3)
Elements of an HMM
• An HMM is characterized by the following:
1. N, the number of states in the model
2. M, the number of distinct observation symbols per state
3. The state transition probability distribution A = {aij}, where aij = P[qt+1 = j | qt = i], 1 ≤ i, j ≤ N
4. The observation symbol probability distribution in state j, B = {bj(vk)}, where bj(vk) = P[ot = vk | qt = j], 1 ≤ j ≤ N, 1 ≤ k ≤ M
5. The initial state distribution Π = {πi}, where πi = P[q1 = i], 1 ≤ i ≤ N
• For convenience, we usually use the compact notation λ = (A, B, Π) to indicate the complete parameter set of an HMM
– This still requires specification of the two model sizes N and M
Two Major Assumptions for HMMs
• First-order Markov assumption
– The state transition depends only on the origin and destination states:
P(Q | λ) = P(q1, …, qT | λ) = P(q1) ∏(t=2..T) P(qt | qt-1, λ)
– The state transition probability is time-invariant:
aij = P(qt+1 = j | qt = i), 1 ≤ i, j ≤ N
• Output-independence assumption
– Each observation depends only on the state that generates it, not on its neighbour observations:
P(O | Q, λ) = P(o1, …, oT | q1, …, qT, λ) = ∏(t=1..T) P(ot | qt, λ) = ∏(t=1..T) b_qt(ot)
The Three Basic Problems of HMMs
Problem 1:
Given an observation sequence O = O1 O2 … OT and a model M = (Π, A, B), compute P(O | M).
For example: P(a particular sequence of ball colours | M).
Problem 1:
Given an observation sequence O = O1 O2 … OT and a model M = (Π, A, B), compute P(O | M).
We define a sequence of states Q = q1 q2 … qT. Then:
P(Q | M) = π(q1) * a(q1→q2) * a(q2→q3) * … * a(qT-1→qT)
P(O | Q, M) = ∏(t=1..T) P(Ot | qt, M)
P(O | M) = Σ(all Q) P(O | Q, M) * P(Q | M)
(Example on the slide: a sequence of four ball colours, with all state sequences over #1, #2, #3 enumerated.)
Problem 1 (cont.):
Summing over all state sequences Q in this way costs O(N^T * T) operations, where N is the number of states and T the number of observations.
Problem 1:
Given an observation sequence O = O1 O2 … OT and a model M = (Π, A, B), compute P(O | M).
Solution: the Forward algorithm, which is much better: O(N^2 * T).
For N = 5 and T = 100:
– naive solution: on the order of 10^72 operations
– Forward algorithm: about 3000 operations
The Three Basic Problems of HMMs
Problem 2:
Given an observation sequence O = O1 O2 … OT and a model M = (Π, A, B), how do we choose a corresponding state sequence q = q1 q2 … qT which best “explains” the observations?
For example: what is the most probable q1 q2 q3 q4 given an observed sequence of four ball colours?
The Three Basic Problems of HMMs
Problem 3:
How do we adjust the model parameters Π, A, B to maximize P(O | {Π, A, B})?
Solutions to the three problems:
• Given an observation sequence O = (o1, o2, …, oT) and an HMM λ = (A, B, Π):
– Problem 1: how to efficiently compute P(O | λ)?  (Evaluation problem)
• Solution: Forward algorithm, O(N^2 T)
– Problem 2: how to choose an optimal state sequence Q = (q1, q2, …, qT) which best explains the observations?  (Decoding problem)
Q* = argmax(Q) P(Q, O | λ)
• Solution: Viterbi algorithm, O(N^2 T)
– Problem 3: how to adjust the model parameters λ = (A, B, Π) to maximize P(O | λ)?  (Learning/Training problem)
• Solution: Baum-Welch re-estimation formulas
Solution to Problem 1 - The Forward Procedure
• Based on the HMM assumptions, the calculation of P(qt | qt-1, λ) and P(ot | qt, λ) involves only qt-1, qt and ot, so it is possible to compute the likelihood P(O | λ) with a recursion on t
• Forward variable:
αt(i) = P(o1, o2, …, ot, qt = i | λ)
– The probability of the joint event that o1, o2, …, ot are observed and the state at time t is i, given the model λ
• Recursion:
αt+1(j) = P(o1, o2, …, ot, ot+1, qt+1 = j | λ) = [Σ(i=1..N) αt(i) aij] bj(ot+1)
Solution to Problem 1 - The Forward Procedure (cont.)

αt+1(j) = P(o1, …, ot, ot+1, qt+1 = j | λ)
= P(o1, …, ot, ot+1 | qt+1 = j, λ) P(qt+1 = j | λ)
= P(o1, …, ot | qt+1 = j, λ) P(ot+1 | qt+1 = j, λ) P(qt+1 = j | λ)   [output-independence assumption]
= P(o1, …, ot, qt+1 = j | λ) bj(ot+1)
= [Σ(i=1..N) P(o1, …, ot, qt = i, qt+1 = j | λ)] bj(ot+1)
= [Σ(i=1..N) P(o1, …, ot, qt = i | λ) P(qt+1 = j | o1, …, ot, qt = i, λ)] bj(ot+1)
= [Σ(i=1..N) P(o1, …, ot, qt = i | λ) P(qt+1 = j | qt = i, λ)] bj(ot+1)   [first-order Markov assumption]
= [Σ(i=1..N) αt(i) aij] bj(ot+1)

(The derivation repeatedly uses P(A, B | λ) = P(A | B, λ) P(B | λ) and the marginalization P(A) = Σ over B of P(A, B).)
Solution to Problem 1 - The Forward Procedure (cont.)
• Example:
α3(2) = P(o1, o2, o3, q3 = 2 | λ)
= [α2(1) a12 + α2(2) a22 + α2(3) a32] b2(o3)
(Trellis diagram: states S1, S2, S3 on the vertical axis against time 1 … T; each node at time t+1 sums the α values of all its predecessors at time t.)
Solution to Problem 1 - The Forward Procedure (cont.)

αt(i) = P(o1 o2 … ot, qt = i | λ)

• Algorithm
1. Initialization: α1(i) = πi bi(o1), 1 ≤ i ≤ N
2. Induction: αt+1(j) = [Σ(i=1..N) αt(i) aij] bj(ot+1), 1 ≤ t ≤ T-1, 1 ≤ j ≤ N
3. Termination: P(O | λ) = Σ(i=1..N) αT(i)

Complexity: O(N^2 T)
(multiplications: N(N+1)(T-1) + N ≈ N^2 T; additions: (N-1)N(T-1) ≈ N^2 T)

• Based on the lattice (trellis) structure
– Computed in a time-synchronous fashion from left to right, where each cell for time t is completely computed before proceeding to time t+1
• All state sequences, regardless of how long previously, merge to N nodes (states) at each time instance t
Solution to Problem 1 - The Forward Procedure (cont.)
• A three-state Hidden Markov Model for the Dow Jones Industrial Average (Huang et al., 2001), with π1 = 0.5, π2 = 0.2, π3 = 0.3, b1(up) = 0.7, b2(up) = 0.1, b3(up) = 0.3, and a11 = 0.6, a21 = 0.5, a31 = 0.4:
α1(1) = 0.5 * 0.7 = 0.35
α1(2) = 0.2 * 0.1 = 0.02
α1(3) = 0.3 * 0.3 = 0.09
α2(1) = (0.35 * 0.6 + 0.02 * 0.5 + 0.09 * 0.4) * 0.7
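The forward recursion in Python/NumPy (a sketch; without scaling it will underflow on long sequences). The check below reproduces the slide's α2(1) using only the parameters the slide gives:

    import numpy as np

    def forward(pi, A, B, obs):
        # pi: (N,) initial probabilities; A: (N,N) transitions;
        # B: (N,M) emissions; obs: observation symbol indices o_1..o_T.
        alpha = pi * B[:, obs[0]]              # initialization: alpha_1(i)
        for o in obs[1:]:                      # induction over t
            alpha = (alpha @ A) * B[:, o]      # [sum_i alpha_t(i) a_ij] * b_j(o)
        return alpha.sum()                     # termination: sum_i alpha_T(i)

    # Reproduce alpha_2(1) from the Dow Jones example:
    pi = np.array([0.5, 0.2, 0.3])
    b_up = np.array([0.7, 0.1, 0.3])
    a_into_state1 = np.array([0.6, 0.5, 0.4])  # a11, a21, a31 from the slide
    alpha1 = pi * b_up                         # [0.35, 0.02, 0.09]
    print(round((alpha1 @ a_into_state1) * b_up[0], 4))  # alpha_2(1) = 0.1792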
Solution to Problem 2 - The Viterbi Algorithm
• The Viterbi algorithm can be regarded as the dynamic
programming algorithm applied to the HMM or as a
modified forward algorithm
– Instead of summing up probabilities from different paths coming
to the same destination state, the Viterbi algorithm picks and
remembers the best path
• Find a single optimal state sequence Q=(q1,q2,……, qT)
– The Viterbi algorithm also can be illustrated in a trellis framework
similar to the one for the forward algorithm
Solution to Problem 2 - The Viterbi Algorithm (cont.)
(Trellis diagram: states S1, S2, S3 against time 1 … T; unlike the forward pass, each node keeps only the single best incoming path.)
Solution to Problem 2 - The Viterbi Algorithm (cont.)
1. Initialization:
δ1(i) = πi bi(o1), 1 ≤ i ≤ N
ψ1(i) = 0, 1 ≤ i ≤ N
2. Induction:
δt+1(j) = max(1≤i≤N) [δt(i) aij] bj(ot+1), 1 ≤ t ≤ T-1, 1 ≤ j ≤ N
ψt+1(j) = argmax(1≤i≤N) [δt(i) aij], 1 ≤ t ≤ T-1, 1 ≤ j ≤ N
3. Termination:
P*(O | λ) = max(1≤i≤N) δT(i)
qT* = argmax(1≤i≤N) δT(i)
4. Backtracking:
qt* = ψt+1(qt+1*), t = T-1, T-2, …, 1
Q* = (q1*, q2*, …, qT*) is the best state sequence
Complexity: O(N^2 T)
Solution to Problem 2 - The Viterbi Algorithm (cont.)
• The same three-state Dow Jones HMM (Huang et al., 2001):
δ1(1) = 0.5 * 0.7 = 0.35
δ1(2) = 0.2 * 0.1 = 0.02
δ1(3) = 0.3 * 0.3 = 0.09
δ2(1) = max(0.35 * 0.6, 0.02 * 0.5, 0.09 * 0.4) * 0.7 = 0.35 * 0.6 * 0.7 = 0.147
Ψ2(1) = 1
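A Viterbi sketch in Python/NumPy. The full transition matrix and the second emission column below are hypothetical completions (the slide only gives a11, a21, a31 and the "up" emissions); δ2(1) = 0.147 is reproduced either way:

    import numpy as np

    def viterbi(pi, A, B, obs):
        # Most likely state path for the observations, plus its probability.
        N, T = len(pi), len(obs)
        delta = pi * B[:, obs[0]]              # delta_1(i)
        psi = np.zeros((T, N), dtype=int)      # best-predecessor pointers
        for t in range(1, T):
            scores = delta[:, None] * A        # delta_t(i) * a_ij for every (i, j)
            psi[t] = scores.argmax(axis=0)
            delta = scores.max(axis=0) * B[:, obs[t]]
        path = [int(delta.argmax())]           # q_T*
        for t in range(T - 1, 0, -1):          # backtracking
            path.append(int(psi[t, path[-1]]))
        return delta.max(), path[::-1]

    pi = np.array([0.5, 0.2, 0.3])
    A = np.array([[0.6, 0.2, 0.2],             # first column (0.6, 0.5, 0.4) is
                  [0.5, 0.3, 0.2],             # from the slide; the rest is assumed
                  [0.4, 0.2, 0.4]])
    B = np.array([[0.7, 0.3],
                  [0.1, 0.9],                  # column 0: "up" (from the slide),
                  [0.3, 0.7]])                 # column 1: "down" (assumed)
    prob, path = viterbi(pi, A, B, [0, 0])
    print(round(prob, 3), path)                # 0.147 [0, 0]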
Solution to Problem 3 – The Baum-Welch Algorithm
• How to adjust (re-estimate) the model parameters λ = (A, B, Π) to maximize P(O | λ)?
– The most difficult of the three problems: there is no known analytical method that maximizes the joint probability of the training data in closed form
– The data is incomplete because of the hidden state sequence
– The problem can be solved by the iterative Baum-Welch algorithm, also known as the forward-backward algorithm
– The EM (Expectation-Maximization) algorithm is perfectly suited to this problem
Baum-Welch Local Maximization
• 1st step: you determine
– N, the number of hidden states
– the emission (observation) alphabet
• 2nd step: randomly assign values to
– A, the transition probabilities
– B, the observation (emission) probabilities
– Π, the starting state probabilities
• 3rd step: let the machine re-estimate A, B, Π
Solution to Problem 3 – The Backward Procedure
• Backward variable:
βt(i) = P(ot+1, ot+2, …, oT | qt = i, λ)
– The probability of the partial observation sequence ot+1, ot+2, …, oT, given state i at time t and the model λ
– Example:
β2(3) = P(o3, o4, …, oT | q2 = 3, λ)
= a31 b1(o3) β3(1) + a32 b2(o3) β3(2) + a33 b3(o3) β3(3)
(Trellis diagram: the backward pass runs right to left, from time T back to time 1.)
Solution to Problem 3 – The Backward Procedure (cont.)

βt(i) = P(ot+1, ot+2, …, oT | qt = i, λ)

• Algorithm
1. Initialization: βT(i) = 1, 1 ≤ i ≤ N
2. Induction: βt(i) = Σ(j=1..N) aij bj(ot+1) βt+1(j), 1 ≤ t ≤ T-1, 1 ≤ i ≤ N
Complexity: O(N^2 T)
(multiplications: 2N^2(T-1) ≈ N^2 T; additions: (N-1)N(T-1) ≈ N^2 T)

P(O | λ) = Σ(i=1..N) P(o1, o2, …, oT, q1 = i | λ)
= Σ(i=1..N) P(o1, o2, …, oT | q1 = i, λ) P(q1 = i | λ)
= Σ(i=1..N) P(o2, o3, …, oT | q1 = i, λ) P(o1 | q1 = i, λ) P(q1 = i | λ)
= Σ(i=1..N) β1(i) bi(o1) πi
(cf. P(O | λ) = Σ(i=1..N) αT(i))
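The backward recursion as a Python/NumPy sketch, with the same interface as the forward function above; both return the same P(O | λ):

    import numpy as np

    def backward(pi, A, B, obs):
        # Backward recursion; returns P(O | lambda) via beta_1.
        beta = np.ones(len(pi))                # initialization: beta_T(i) = 1
        for o in reversed(obs[1:]):            # induction, right to left
            beta = A @ (B[:, o] * beta)        # sum_j a_ij b_j(o_t+1) beta_t+1(j)
        return float((pi * B[:, obs[0]] * beta).sum())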
Solution to Problem 3 – The Forward-Backward Algorithm
• Relation between the forward and backward variables:
αt(i) = P(o1 o2 … ot, qt = i | λ) = [Σ(j=1..N) αt-1(j) aji] bi(ot)
βt(i) = P(ot+1 ot+2 … oT | qt = i, λ) = Σ(j=1..N) aij bj(ot+1) βt+1(j)
αt(i) βt(i) = P(O, qt = i | λ)
P(O | λ) = Σ(i=1..N) αt(i) βt(i)   (for any t)
(Huang et al., 2001)
Solution to Problem 3 – The Forward-Backward Algorithm (cont.)

αt(i) βt(i)
= P(o1, o2, …, ot, qt = i | λ) P(ot+1, ot+2, …, oT | qt = i, λ)
= P(o1, o2, …, ot | qt = i, λ) P(qt = i | λ) P(ot+1, ot+2, …, oT | qt = i, λ)
= P(o1, o2, …, oT | qt = i, λ) P(qt = i | λ)
= P(o1, o2, …, oT, qt = i | λ)
= P(O, qt = i | λ)

P(O | λ) = Σ(i=1..N) P(O, qt = i | λ) = Σ(i=1..N) αt(i) βt(i)
Solution to Problem 3 – The Intuitive View
• Define two new variables:
γt(i) = P(qt = i | O, λ)
– the probability of being in state i at time t, given O and λ:
γt(i) = P(O, qt = i | λ) / P(O | λ) = αt(i) βt(i) / Σ(i=1..N) αt(i) βt(i)
ξt(i, j) = P(qt = i, qt+1 = j | O, λ)
– the probability of being in state i at time t and state j at time t+1, given O and λ:
ξt(i, j) = P(qt = i, qt+1 = j, O | λ) / P(O | λ)
= αt(i) aij bj(ot+1) βt+1(j) / Σ(m=1..N) Σ(n=1..N) αt(m) amn bn(ot+1) βt+1(n)
• They are related by γt(i) = Σ(j=1..N) ξt(i, j)
Solution to Problem 3 – The Intuitive View (cont.)
• P(q3 = 3, O | λ) = α3(3) β3(3)
(Trellis diagram: the forward variable α3(3) covers o1 … o3 up to state S3 at time 3; the backward variable β3(3) covers o4 … oT from the same node.)
Solution to Problem 3 – The Intuitive View (cont.)
• P(q3 = 3, q4 = 1, O | λ) = α3(3) a31 b1(o4) β4(1)
(Trellis diagram: a single transition from S3 at time 3 to S1 at time 4 links the forward and backward passes.)
Solution to Problem 3 – The Intuitive View (cont.)
• ξt(i, j) = P(qt = i, qt+1 = j | O, λ)
Σ(t=1..T-1) ξt(i, j) = expected number of transitions from state i to state j in O
• γt(i) = P(qt = i | O, λ)
Σ(t=1..T-1) γt(i) = expected number of transitions from state i in O
Solution to Problem 3 – The Intuitive View (cont.)
• Re-estimation formulas for Π, A and B:
new πi = expected frequency (number of times) in state i at time t = 1 = γ1(i)
new aij = (expected number of transitions from state i to state j) / (expected number of transitions from state i)
= Σ(t=1..T-1) ξt(i, j) / Σ(t=1..T-1) γt(i)
new bj(vk) = (expected number of times in state j observing symbol vk) / (expected number of times in state j)
= Σ(t=1..T, s.t. ot = vk) γt(j) / Σ(t=1..T) γt(j)
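A sketch of the quantities behind these formulas in Python/NumPy: full forward and backward passes, then γ and ξ (no scaling, so it is only suitable for short sequences):

    import numpy as np

    def e_step(pi, A, B, obs):
        # Compute gamma_t(i) and xi_t(i, j) for one observation sequence.
        N, T = len(pi), len(obs)
        alpha = np.zeros((T, N))
        beta = np.zeros((T, N))
        alpha[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        beta[-1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        p_obs = alpha[-1].sum()                       # P(O | lambda)
        gamma = alpha * beta / p_obs                  # gamma_t(i)
        xi = (alpha[:-1, :, None] * A[None, :, :] *
              (B[:, obs[1:]].T * beta[1:])[:, None, :]) / p_obs   # xi_t(i, j)
        return gamma, xi

    # M-step re-estimation from these quantities, as in the formulas above
    # (with obs as a NumPy array for the B update):
    #   new pi      = gamma[0]
    #   new A[i, j] = xi[:, i, j].sum() / gamma[:-1, i].sum()
    #   new B[j, k] = gamma[obs == k, j].sum() / gamma[:, j].sum()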
How is it connected to Gene prediction?
(The slide repeats the three-state ball-colour HMM from above: emission distributions B1, B2, B3 over Yellow/Red/Green/Blue, the 3 × 3 transition matrix A, and the initial distribution Π = (0.6, 0.3, 0.1).)
How is it connected to Gene prediction?
Relabel the same HMM: the hidden states become Exon, Intron and UTR, and the emitted symbols become the nucleotides A, G, C, T:

B(Exon)   = (A 0.25, G 0.25, C 0.25, T 0.25)
B(Intron) = (A 0.35, G 0.10, C 0.35, T 0.10)
B(UTR)    = (A 0.10, G 0.65, C 0, T 0.25)

State transition probabilities (rows: state at position i; columns: state at position i+1):

        Exon   Intron   UTR
Exon    0.1    0.7      0.2
Intron  0.4    0.2      0.4
UTR     0.2    0.3      0.5

Initial state distribution:
Π = (Exon 0.6, Intron 0.3, UTR 0.1)
GENSCAN (Chris Burge, 1997)
(State diagram: phase-specific exon states E0, E1, E2 and intron states I0, I1, I2; Einit and Eterm for initial and terminal exons; a single-exon gene state; 5’ UTR and 3’ UTR; promoter, poly-A signal and intergenic region. GT and AG mark the donor and acceptor splice sites between exon and intron states.)
GENSCAN components
• Hidden states for the gene-structure elements: E0, E1, E2, I0, I1, I2, Einit, Eterm, single-exon gene, 5’ UTR, 3’ UTR, promoter, poly-A signal, intergenic region
• A state transition matrix, e.g. the fragment shown on the slide:

1     0     0     0
0     1     0     0
0.28  0.33  0     0.39
0.28  0.41  0.31  0

• An initial state distribution, e.g. Π = (0.06, 0.04, 0.60, 0.12, …)
• Sequence-generating models, one per state: P1, P2, P3, P4, … (e.g. a composition model of introns)
• A set of length distributions: f1, f2, f3, f4, …
e.g. fintron(10) = 0, fintron(350) = 0.03
How do we use all that for gene
prediction?
Definitions:
For a fixed sequence length L we define:
ФL: the set of all possible parses of length L
SL: the set of all possible DNA sequences of length L
ΩL = ФL × SL
Our model M is a probability measure on this space: it assigns a probability density to each parse/sequence pair.
Or in other words…
Given a sequence S:
A C G C G A C T A G G C G C A G G T C T A … G A T
and a parse Фi:
Exon0 | Intron0 | Exon0 | Intron1 | Exon1 | 3’UTR
we can calculate P(S, Фi):
(Putting the GENSCAN components together: the states, the transition matrix A, the initial distribution Π, the sequence-generating models Pq and the length distributions fq, as above.)

For a parse Фi consisting of states q1 … qk, with segment lengths d1 … dk and corresponding sequence segments s1 … sk:

P(S, Фi) = πq1 fq1(d1) Pq1(s1) * Aq1→q2 fq2(d2) Pq2(s2) * … * Aqk-1→qk fqk(dk) Pqk(sk)
The conditional probability of a parse Фi given the sequence S is:

P(Фi | S) = P(Фi, S) / P(S) = P(Фi, S) / Σ(j) P(Фj, S)

Prediction:
In order to parse a given sequence S (i.e. predict genes in S), we find the parse with maximum likelihood, i.e. the Фi maximizing P(Фi | S).
Splice site sequence generator

Aligned donor-site examples (positions -5 … +6):

C A C C G G T A A G T A
C A C C T G T G A G T A
C A C A G G T A A G T A
C A C C G G T A A G T A

• What is the probability of generating the signal O-5 O-4 … O6?

WMM – Weight Matrix Method: per-position nucleotide percentages (fragment as recovered from the slide; the 100% columns are the invariant G and T of the GT donor dinucleotide):

Pos   -4   -3   -2   -1    0   …    6
A%    33   60    8    0    0   …   49
C%    37   13    4    0    0   …    3
G%    18   14   81  100    0   …   45
T%    12   13    7    0  100   …    3

• What about adjacent-nucleotide dependencies?
WAM – Weight Array Model: the conditional probability of generating nucleotide Xk at position i, given nucleotide Xj at position i-1.
• What about non-adjacent nucleotide dependencies?
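Scoring a candidate signal against a WMM multiplies one probability per position (positions are assumed independent); a Python sketch with a toy 4-position matrix (illustrative values, not the slide's full table):

    import numpy as np

    # Rows: A, C, G, T; columns: signal positions. Each column sums to 1.
    WMM = np.array([[0.33, 0.60, 0.00, 0.00],   # A
                    [0.37, 0.13, 0.00, 0.00],   # C
                    [0.18, 0.14, 1.00, 0.00],   # G
                    [0.12, 0.13, 0.00, 1.00]])  # T
    ROW = {"A": 0, "C": 1, "G": 2, "T": 3}

    def wmm_prob(signal):
        # P(signal | WMM): product of the per-position probabilities.
        return float(np.prod([WMM[ROW[base], i] for i, base in enumerate(signal)]))

    print(wmm_prob("AAGT"))   # 0.33 * 0.60 * 1.00 * 1.00 = 0.198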
What about non-adjacent nucleotide dependencies?
Procedure: MDD – Maximal Dependency Decomposition
What about non-adjacent nucleotide dependencies?
MDD – Maximal Dependency Decomposition

Given a data set D consisting of N sequences of length k:
1. Align the sequences (e.g. the donor-site alignment above).
2. Find Ci, the consensus nucleotide at position i, and NCi, the number of sequences containing Ci.
3. For each pair of positions (i, j) with i ≠ j, calculate the χ² statistic for Ci vs. the nucleotide indicator Xj:
χ² = Σ (O - E)² / E
where, for each nucleotide at position j, O is its observed count among the NCi sequences and E is its expected count, e.g. E = (%A in D) · NCi for nucleotide A. Do the same for all (i, j).
4. Calculate Si, the sum of row i of the resulting table (a measure of the dependency between Ci and the nucleotides at the remaining positions).
5. If the stop condition does not hold, choose the Ci with maximal Si, partition D on it, and recurse. Stop when:
1. the k-1 level of the tree is reached,
2. no significant dependencies are found, or
3. the number of remaining sequences is too small.
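The χ² statistic in step 3 as a Python sketch (the counts below are hypothetical):

    def chi_square(observed, expected):
        # Pearson chi-square: sum over categories of (O - E)^2 / E.
        return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

    # Observed A/C/G/T counts at position j among the consensus-matching
    # sequences, vs. counts expected from the overall composition of D:
    print(chi_square([30, 5, 10, 5], [12.5, 12.5, 12.5, 12.5]))  # 34.0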
Not mentioned (further GENSCAN details):
• Reverse-strand states
• C+G%
• Coding / non-coding detection
• Branch point detection
• Expected vs. observed AG composition
• And more…

(Plot: expected and observed percentage of AG near the acceptor site in coding regions, over positions -100 to +20.)
Evaluating prediction programs

(Diagram: predicted intervals vs. actual intervals along a sequence, with positions classified as TP, FP, TN and FN.)

Sensitivity / Recall: how many of the known genes were found?
Specificity / Precision: how many of the predicted genes were real?
Correlation / F-measure: how good is it overall?
Evaluating prediction programs

Sensitivity: Sn = TP / (TP + FN)
Specificity: Sp = TP / (TP + FP)
F-measure: F = (Sn + Sp) / 2
Correlation coefficient:
CC = (TP·TN - FP·FN) / [(TP+FP)(TN+FN)(TP+FN)(TN+FP)]^0.5
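These per-nucleotide measures in Python (the counts in the example are hypothetical):

    from math import sqrt

    def evaluate(tp, fp, tn, fn):
        # Accuracy measures as defined on the slide; Sp here is precision,
        # as is conventional in the gene-finding literature.
        sn = tp / (tp + fn)
        sp = tp / (tp + fp)
        f = (sn + sp) / 2
        cc = (tp * tn - fp * fn) / sqrt((tp + fp) * (tn + fn) * (tp + fn) * (tn + fp))
        return sn, sp, f, cc

    print(evaluate(tp=90, fp=10, tn=880, fn=20))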
Gene Prediction Accuracy at the Exon Level

(Diagram: predicted exons vs. actual exons, classified as CORRECT EXON, WRONG EXON and MISSING EXON.)

Sensitivity: Sn = (number of correct exons) / (number of actual exons)
Specificity: Sp = (number of correct exons) / (number of predicted exons)
Gene finders - a comparison

              Accuracy per nucleotide   Accuracy per exon
Method        Sn     Sp     AC          Sn     Sp     (Sn+Sp)/2   ME     WE
GENSCAN       0.93   0.93   0.91        0.78   0.81   0.80        0.09   0.05
FGENEH        0.77   0.85   0.78        0.61   0.61   0.61        0.15   0.11
GeneID        0.63   0.81   0.67        0.44   0.45   0.45        0.28   0.24
GeneParser2   0.66   0.79   0.66        0.35   0.39   0.37        0.29   0.17
GenLang       0.72   0.75   0.69        0.50   0.49   0.50        0.21   0.21
GRAILII       0.72   0.84   0.75        0.36   0.41   0.38        0.25   0.10
SORFIND       0.71   0.85   0.73        0.42   0.47   0.45        0.24   0.14
Xpound        0.61   0.82   0.68        0.15   0.17   0.16        0.32   0.13

Sn = Sensitivity, Sp = Specificity, AC = Approximate Correlation, ME = Missing Exons, WE = Wrong Exons

GENSCAN Performance Data, http://genes.mit.edu/Accuracy.html
Gene finder comparison (cont.)
"Evaluation of gene finding programs" S. Rogic, A. K. Mackworth
and B. F. F. Ouellette. Genome Research, 11: 817-832 (2001).
After putative genes are found, they’re annotated

Annotation categories:
1. Matches known protein sequence
2. Strong similarity to protein sequence
3. Similar to known protein
4. Similar to unknown protein
5. Similar to EST (i.e., putative protein)
6. No EST or protein matches (i.e., hypothetical protein)
Pitfalls and Issues
Several issues make the problem of eukaryotic
gene finding extremely difficult.
1) Very long genes: for example, the largest
human gene, the dystrophin gene, is
composed of 79 exons spanning nearly 2.3
Mb.
2) Very long introns: again, in the human
dystrophin gene, some introns are >100 kb
long and >99% of the gene is composed of
introns.
Pitfalls and Issues
3) Highly conserved introns (conserved non-coding sequences): this is particularly a problem when gene prediction is bolstered through similarity searches.
Pitfalls and Issues
4) Very short exons: Some exons are only 3 bp long
in Arabidopsis genes. Such small exons are easily
missed by all content sensors, especially if
bordered by large introns. The more difficult cases
are those where the length of a coding exon is a
multiple of three (typically 3, 6 or 9 bp long),
because missing such exons will not cause a
problem in the exon assembly as they do not
introduce any change in the frame.
Pitfalls and Issues
5) Overlapping genes: though very rare in eukaryotic genomes, there are some documented cases in animals as well as in plants.
6) Polycistronic gene arrangement: also rare. One gene and one mRNA, but two or more proteins.
Pitfalls and Issues
7) Frameshifts: Some sequences stored in
databases may contain errors (either
sequencing errors or simply errors made
when editing the sequence) resulting in the
introduction of artificial frameshifts (deletion
or insertion of one base). Such frameshifts
greatly increase the difficulty of the
computational gene finding problem by
producing erroneous statistics and masking
true solutions.
Pitfalls and Issues
8) Introns in UTRs: There are genes for which the
genomic region corresponding to the 5’- and/or
3’-UTR in the mature mRNA is interrupted by
one or more intron(s).
9) Alternative transcription starts: e.g. three alternative promoters regulate the transcription of the 14 kb full-length dystrophin mRNAs, and four ‘intragenic’ promoters control that of smaller isoforms.
Pitfalls and Issues
10) Alternative splicing.
11) Alternative polyadenylation: 20% of human transcripts show evidence of alternative polyadenylation, which affects where the 3’ end is cleaved.
Pitfalls and Issues
12) Alternative initiation of translation: finding the right AUG initiator is still a major concern for gene prediction methods. The rule stating that the first AUG in the mRNA is the initiator codon can be escaped through three mechanisms: context-dependent leaky scanning, re-initiation and direct internal initiation. A non-AUG triplet can sometimes act as the functional codon for translation initiation, such as ACG in Arabidopsis or CUG in human sequences.