Lecture 11

Gene Finding
Split Genes
• Discovered by Phillip Sharp and Richard Roberts in 1977 while experimenting with the mRNA of hexon, an adenovirus coat protein.
Figures from [13]
Gene Finding
central dogma:
DNA -> (transcription) -> RNA -> (translation) -> protein
Eukaryotes: only ~2% coding … find these regions

Prokaryotes                    Eukaryotes
no nucleus                     nucleus
most of genome is coding       part of genome is coding
H. influenzae: 70% coding      Human: ~2% coding
continuous genes               introns & exons
fast replication               slow transcription and splicing
Main Gene-Finding Strategies
See [12] and [13]
Gene Finding in Prokaryotes
Codon Statistics
[Figure: codon table; labeled codons include UUU and UUC]
start codon: AUG
stop codons: UAA, UAG, UGA
Reading Frames
DNA strand (downstream is towards the 3'-end):
5'-end  AGGCATGCGATCCAAGTTCCACCATGATGACATGATGACTA  3'-end
3'-end  TCCGTACGCTAGGTTCAAGGTGGTACTACTGTACTACTGAT  5'-end
Complementary DNA strand (downstream is again towards its 3'-end, i.e. to the left)
Base pairing: C pairs with G, A pairs with T
Reading Frames
DNA strand, written 5'-end to 3'-end (downstream to the right), three reading frames:
AGG CAT GCG ATC CAA GTT CCA CCA TGA TGA CAT GAT GAC TAA
GGC ATG CGA TCC AAG TTC CAC CAT GAT GAC ATG ATG ACT AA.
GCA TGC GAT CCA AGT TCC ACC ATG ATG ACA TGA TGA CTA A..

Complementary DNA strand, written 3'-end to 5'-end (read right to left, downstream to the left), three reading frames:
..T CCG TAC GCT AGG TTC AAG GTG GTA CTA CTG TAC TAC TGA
.TC CGT ACG CTA GGT TCA AGG TGG TAC TAC TGT ACT ACT GAT
TCC GTA CGC TAG GTT CAA GGT GGT ACT ACT GTA CTA CTG ATT
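The six reading frames above can be enumerated mechanically. The following is a minimal sketch and not part of the lecture: the function names are illustrative, and the input is the 42-base example sequence from the slide.

```python
def reverse_complement(dna: str) -> str:
    """Return the complementary strand, read in its own 5'->3' direction."""
    pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(pairs[base] for base in reversed(dna))


def codons(dna: str, offset: int) -> list[str]:
    """Split dna into complete codons, starting at offset 0, 1 or 2."""
    return [dna[i:i + 3] for i in range(offset, len(dna) - 2, 3)]


def six_frames(dna: str) -> dict[str, list[str]]:
    """Map frame labels (+1..+3 forward, -1..-3 reverse strand) to codon lists."""
    reverse = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = codons(dna, offset)
        frames[f"-{offset + 1}"] = codons(reverse, offset)
    return frames


seq = "AGGCATGCGATCCAAGTTCCACCATGATGACATGATGACTAA"
for label, frame in six_frames(seq).items():
    print(label, " ".join(frame))
```

Note that the reverse-strand frames are printed 5' to 3', whereas the slide writes the complementary strand 3' to 5' so that it lines up with the forward strand.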
Open Reading Frames (ORFs)
An ORF is a part of a reading frame that does not contain any stop codons.
Each strand has 3 reading frames (6 over both strands); within a frame, a candidate gene runs from a start codon (AUG) to a stop codon (UAA, UAG, UGA).
[Figure: the sequence ACTGACTGACTG… split into codons in each of the three reading frames, with start and stop positions marked; 5'-end upstream, 3'-end downstream]
• A coding region ends with a stop codon.
• 3 out of the 64 codons are stop codons, i.e., in a random sequence the probability that a codon is a stop codon is 3/64
  => about one in every 21 codons is expected to be a stop (64/3 ≈ 21.3).
• But: the average protein-coding region is about 1000 bp (roughly 330 codons), which is much longer.
• Thus: search for long ORFs in all reading frames => candidate coding regions!
Drawbacks:
- misses short genes
- misses overlapping long ORFs on opposite strands
- finds too many candidates (6500 ORFs in the E. coli genome, but only 1100 genes)
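The long-ORF search described above can be sketched as follows. This is an illustrative sketch, not the lecture's implementation: the function names and the `min_codons` threshold are assumptions, the sequence is taken as DNA (ATG, TAA/TAG/TGA), and only the given strand is scanned. The opposite strand can be handled by passing in the reverse complement.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def expected_codons_per_stop() -> float:
    """In random sequence 3 of 64 codons are stops, so one stop per 64/3 codons."""
    return 64 / 3


def find_orfs(dna: str, min_codons: int = 100) -> list[tuple[int, int, int]]:
    """Return (frame, start, end) for ATG...stop stretches of at least min_codons.

    start is the index of the A of ATG, end is one past the stop codon.
    min_codons counts the codons from ATG up to, but excluding, the stop.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START_CODON:
                start = i  # first start codon since the previous stop
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((frame, start, i + 3))
                start = None  # look for the next ORF in this frame
    return orfs
```

Raising `min_codons` well above 21 codons is what filters out most ORFs that arise by chance, at the cost of missing short genes.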
Genes are not Random
Codon frequencies (see the table on the next slide):

Amino acid         Codons ('random' ratio)   Observed ratio in coding regions
Leucine (Leu)      6                         6.9
Alanine (Ala)      4                         6.5
Tryptophan (Trp)   1                         1

Also: in some cases the third position of a codon is A or T in 90% of the codons.
=> Genes are not random!
Codons
[Figure: codon table; labeled codons include UUU and UUC]
start codon: AUG
stop codons: UAA, UAG, UGA
Nucleotide Based Markov Chains
• For CpG islands: √
• For non-coding ORFs (NORFs):
  – On average shorter than 100 codons
  – G: gene sequence Markov model
  – R: NORF sequence Markov model
  – Null model (frequency in all data)
  – Different probabilities
  – Big variance => not useful
  – 2nd-order Markov models give similar results
[Figure labels: NORFs, Genes]
Codon Based Markov Chains
Assume the ORFs in a set of sequences are known
=> codon frequencies for coding ORFs (C-ORFs) and non-coding ORFs (NORFs)
• Make an ORF Markov model with a state for each codon => 64-state Markov model
• Use the model to calculate the probability that a given ORF is a coding region.
[Figure: coding region recognition, NORF versus C-ORF]
Codon Based Markov Chains
Codon Frequencies
• Given a set of coding ORFs, we can determine the frequency f_abc with which a codon abc occurs in a coding region.
• The probability p_1 that a coding sequence appears in the 1st reading frame can be determined from the given f_abc's (the product of the frequencies of the codons read in that frame).
• Let p_2 and p_3 be the probabilities for the 2nd and 3rd reading frames, respectively.
• Now P_i = p_i / (p_1 + p_2 + p_3) is the probability that reading frame i is the coding reading frame.
• Slide a window of size n along the sequence, and compute P_i for each start position of the window.
• Plot for each reading frame i: log(P_i / (1 – P_i)), using a 25-codon window.
• The quality of the result depends on the quality of the frequency counts!
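The window computation can be sketched as follows. This is a hedged illustration rather than the Codon Preference program itself: the function names, the `floor` value for unseen codons, and the toy frequency table in the usage note are all assumptions. Each frame reads the same number of codons so that the three products are comparable, and the products are accumulated in log space to avoid underflow.

```python
import math


def frame_log_likelihood(seq, start, n_codons, codon_freq, floor=1e-6):
    """log p for the n_codons codons read from position start onwards."""
    total = 0.0
    for j in range(n_codons):
        codon = seq[start + 3 * j:start + 3 * j + 3]
        total += math.log(codon_freq.get(codon, floor))
    return total


def frame_probabilities(seq, start, n_codons, codon_freq):
    """Return [P_1, P_2, P_3] with P_i = p_i / (p_1 + p_2 + p_3).

    The caller should ensure start + 3 * n_codons + 2 <= len(seq).
    """
    logs = [frame_log_likelihood(seq, start + k, n_codons, codon_freq)
            for k in range(3)]
    top = max(logs)  # subtract the maximum before exponentiating
    weights = [math.exp(value - top) for value in logs]
    norm = sum(weights)
    return [w / norm for w in weights]


def log_odds(prob, eps=1e-12):
    """log(P / (1 - P)), with a small eps to keep the logarithm finite."""
    return math.log((prob + eps) / (1.0 - prob + eps))
```

For example, with a hypothetical table `{"GCT": 0.5, "CTG": 0.01, "TGC": 0.01}` and the sequence `"GCT" * 10 + "GC"`, calling `frame_probabilities(seq, 0, 10, table)` puts almost all of the probability on the first frame. Plotting `log_odds` of each P_i for successive values of `start` gives the three curves described above.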
Coding Codon Frequencies
[Figure: output of the Codon Preference program: log(P_i / (1 – P_i)) plotted for each reading frame i, using a 25-codon window.]
Coding Codon Frequencies
The following bias is used: the bias for each reading frame is the fraction of codons in that frame whose third position is either G or C.
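The third-position bias is simple to compute. A minimal sketch follows; the function name `gc3_bias` is illustrative and not from the lecture.

```python
def gc3_bias(dna: str, frame: int) -> float:
    """Fraction of complete codons in the given frame (0, 1 or 2) ending in G or C."""
    thirds = [dna[i + 2] for i in range(frame, len(dna) - 2, 3)]
    if not thirds:
        return 0.0
    return sum(base in "GC" for base in thirds) / len(thirds)


for frame in range(3):
    print(frame, gc3_bias("ATGGCCAAC", frame))
```

Computed in a sliding window, the frame with a markedly different bias from the other two is a candidate coding frame.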
Promoter Region Detection
'Consensus' sequence around the RNA transcription start point, i.e. not exact:
… n TTGAC n18 TATAAT n6 N n …
(TATAAT: TATA box; N: transcription start; followed by the coding region)
• The promoter region is an 'anchor' point for the polymerase, i.e., a regulatory region that controls the transcription rate.
• TATAAT is called the TATA box or Pribnow box.
• Use the frequency of occurrence of these sequences.
• Variability of binding sites => no exact method for TATA-box identification.
Promoter Region Detection
… n TTGAC n18 TATAAT n6 N n …
(TATAAT: TATA box; N: transcription start; followed by the coding region)
Construct statistics f_{b,i}: the frequency (in %) of base b at position i of the TATA box in known promoter regions
=> position weight matrix (cf. 'profile')

pos    A    C    G    T
1      2    9   10   79
2     95    2    1    3
3     26   14   16   44
4     59   13   15   13
5     51   20   13   17
6      1    3    0   96

Note: there is an 80% correlation between the weight matrix score of a region and its binding energy.
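Promoter Region Detection
• Given a sequence S = B_1 B_2 … B_6
• Likelihood of S, given it is a TATA box: P(S | TATA box) = f_{B_1,1} · f_{B_2,2} · … · f_{B_6,6}
• Likelihood of S, given it is a non-promoter: P(S | non-promoter) = f_{B_1} · f_{B_2} · … · f_{B_6}, with f_b the background frequency of base b
• Log-likelihood ratio: log( P(S | TATA box) / P(S | non-promoter) ) = sum over i = 1..6 of log( f_{B_i,i} / f_{B_i} )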
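A log-likelihood ratio score against the position weight matrix can be sketched as follows. The matrix values are the percentages from the table in this lecture; the uniform background of 0.25 per base, the pseudocount of 0.5 (needed because G never occurs at position 6), and all function names are assumptions made for this illustration.

```python
import math

BASES = "ACGT"
WIDTH = 6
# Percentages per position (1..6) from the position weight matrix above.
TATA_PERCENT = {
    "A": [2, 95, 26, 59, 51, 1],
    "C": [9, 2, 14, 13, 20, 3],
    "G": [10, 1, 16, 15, 13, 0],
    "T": [79, 3, 44, 13, 17, 96],
}


def column_frequency(base: str, i: int, pseudo: float = 0.5) -> float:
    """Smoothed estimate of f_{b,i}; the pseudocount avoids log(0)."""
    total = sum(TATA_PERCENT[b][i] for b in BASES) + 4 * pseudo
    return (TATA_PERCENT[base][i] + pseudo) / total


def log_likelihood_ratio(site: str, background: float = 0.25) -> float:
    """Sum over positions of log2(f_{b,i} / background) for a 6-base site."""
    return sum(math.log2(column_frequency(b, i) / background)
               for i, b in enumerate(site))


def best_site(dna: str) -> tuple[float, int]:
    """Return (score, start) of the best-scoring window; dna needs >= 6 bases."""
    return max((log_likelihood_ratio(dna[i:i + WIDTH]), i)
               for i in range(len(dna) - WIDTH + 1))
```

A positive score means the window looks more like the TATA box than like background; in practice a threshold on the score trades missed promoters against false hits.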
Promoter Detection
• HMM-based: in GenScan
• Neural network: in Grail
HMM Gene Finding
A. Krogh, I. Saira Mian, and D. Haussler. A hidden Markov model that finds genes in E. coli DNA. Nucleic Acids Research, Vol. 22, pp. 4768–4778, 1994.
HMM Model
61 Codon Models
Stop Codon Models
• TAA and TGA
• TAG
Start Codon Model
• ATG
• GTG
• TTG (rare)
More Advanced Intergenic Model
28
14
29
HMM Results
Data set:
• EcoSeq6 contained about one third of the complete E. coli genome (total 5.44 × 10^6 nucleotides, 5416 genes), and was not fully annotated at that time.
HMM training:
• On ~10^6 nucleotides from the EcoSeq6 database of labeled genes (K. Rudd, 1991)
HMM testing:
• On the remainder of ~325,000 nucleotides
Method:
• For each contig in the test set, the Viterbi algorithm was used to find the most likely path through the hidden states of the HMM.
• This path was then used to define a parse of the contig into genes separated by intergenic regions.
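The Viterbi step can be illustrated on a toy model. The sketch below is not the E. coli HMM from the paper: the two states ("C" for coding-like, "N" for intergenic-like) and all probabilities are hypothetical values chosen only to show how the most likely path parses a sequence into regions.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state path for a non-empty observation sequence (log space)."""
    score = {s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}
    backpointers = []
    for symbol in obs[1:]:
        new_score, pointer = {}, {}
        for s in states:
            # Best predecessor of state s at this position.
            prev = max(states, key=lambda r: score[r] + math.log(trans_p[r][s]))
            new_score[s] = (score[prev] + math.log(trans_p[prev][s])
                            + math.log(emit_p[s][symbol]))
            pointer[s] = prev
        score = new_score
        backpointers.append(pointer)
    # Trace back from the best final state.
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    return path[::-1]


# Hypothetical toy parameters: "C" prefers G/C, "N" prefers A/T.
STATES = ("N", "C")
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT = {
    "N": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
    "C": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
}

print("".join(viterbi("ATATATGCGCGCGCATATAT", STATES, START, TRANS, EMIT)))
```

Consecutive runs of the same state in the returned path are what define the parse into genes and intergenic regions.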
HMM Results
Post-processing consists of 3 rules to handle:
• Overlapping genes, which will look like frame-shifts
• Short genes overlapping with long genes on the opposite strand, as a result of self-complementary type codons.
HMM Results
• 80% of the labeled protein-coding genes were found exactly (i.e. with precisely the same start and end codon)
• 5% were found within 10 codons of the start codon
• 5% overlap by at least 60 bases or 50%
• 5% were missed completely
• Several new genes were indicated
• Several insertion and deletion errors were labeled in the contig parse
Gene Finding in Eukaryotes
central dogma:
DNA -> (transcription) -> RNA -> (translation) -> protein
Eukaryotes: only ~2% coding … find these regions?

Prokaryotes                    Eukaryotes
no nucleus                     nucleus
most of genome is coding       part of genome is coding
H. influenzae: 70% coding      Human: ~2% coding
continuous genes               introns & exons
fast replication               slow transcription and splicing
Eukaryotes Gene Structure
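• exons: expressed
• introns: noncoding
• (alternative) splicing
• donor: exon/intron boundary at the 5' end of an intron
• acceptor: intron/exon boundary at the 3' end of an intron
• tss: transcription start site
• polyA: polyadenylation
• utr: untranslated region
  – 5' utr: between tss and start codon
  – 3' utr: between stop codon and polyA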
Splicing Consensus Sequence
5 important bases
© 2000 by Geoffrey M. Cooper
37
Spliceosome at work
38
19
Typical Distributions: Vertebrates
Some typical data:
• Average gene length: 30 kb, with a coding region ~1–2 kb long.
• The average coding region has 6 exons, each ~150 bp long.
• The promoter is about 6 bp long and appears about 30 bp upstream of the transcription start site (TSS).
• Transcription rate of less than 50 bases/s
• The splicing process takes several minutes
But huge deviations exist:
• Dystrophin is 2.4 Mb long.
• Blood coagulation factor VIII has 26 exons with sizes from 69 bp to 3106 bp.
Typical Distributions: Vertebrates
[Figures: length distributions of introns (geometric distribution), initial exons, internal exons, and terminal exons]
Markov Sequence Models
Models for coding and non-coding regions:
• Use windows of 6 bases => 5th-order Markov model
• Two probability tables, each of size 4^6.
• No reading frame information => homogeneous model.
• A non-homogeneous model can be built using different tables for each reading frame.
Problems:
• Exons are too short
• Splice junctions (donor and acceptor sites) are difficult to detect
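A homogeneous 5th-order model of this kind can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names and the add-one pseudocount are not from the lecture, and sequences are assumed to contain only A, C, G and T. Each table has 4^6 = 4096 entries, one conditional probability P(base | previous 5 bases) per hexamer.

```python
import math
from collections import Counter
from itertools import product

BASES = "ACGT"


def train_hexamer_table(sequences, pseudo=1.0):
    """Estimate P(6th base | 5-base context) for all 4**6 hexamers."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - 5):
            counts[seq[i:i + 6]] += 1
    table = {}
    for context in map("".join, product(BASES, repeat=5)):
        total = sum(counts[context + b] for b in BASES) + 4 * pseudo
        for b in BASES:
            table[context + b] = (counts[context + b] + pseudo) / total
    return table


def log_odds_score(seq, coding_table, noncoding_table):
    """Sum of log(P_coding / P_noncoding) over all hexamers in seq."""
    return sum(math.log(coding_table[seq[i:i + 6]] / noncoding_table[seq[i:i + 6]])
               for i in range(len(seq) - 5))
```

A positive score favours the coding table. The non-homogeneous variant mentioned above would keep three coding tables, one per codon position, and select the table by `i % 3`.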
Introns: splicing
• Splice sites must be located precisely, as a miss would lead to a nonsense interpretation of the rest of the sequence.
• A so-called anchor point inside the intron, the branch point, appears frequently.
• Also, pyrimidine-rich (bases C, T) areas appear between the branch point and the acceptor site.
=> Algorithms are based on position-specific weight matrices.
=> This does not exploit all the information (reading frames, intron/exon states, etc.) and is not suitable for short genes.
=> Recent studies: sequence characteristics of (multiple) branch points per intron
Picture from: http://en.wikipedia.org/wiki/RNA_splicing
Consensus Sequence Intron
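Consensus sequences / position-specific weight matrices for an intron:

donor site:      A    G    G    U    A    A    G    U
Freq %:          62   77   100  100  60   74   84   50

branchpoint:     CTGAC
pyrimidine-rich (C, T) region of about 15 bp, between branchpoint and acceptor site
Freq % (listed between the donor and acceptor values): 63-91

acceptor site:   N    C    A    G    G
Freq %:               78   100  100  55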
HMM for Gene Finding
Geometric distribution of the intron state length k:
P(intron of length k) = q^k (1 – q)
Geometric distribution of the exon state length k:
P(exon of length k) = p^k (1 – p)
An HMM is memoryless => the modeled length distribution is geometric.
But exon length does not have a geometric distribution! => a plain HMM models exon lengths poorly.
Exon lengths
An HMM cannot model arbitrary length distributions.
[Figure: a single state with self-transition probability p and exit probability 1 – p]
P(len = k) = p^k (1 – p)
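The consequence of the self-loop can be checked numerically. The sketch below follows the slide's convention P(len = k) = p^k (1 – p), where k counts the self-transitions taken before leaving the state; the function names and the target mean of 150 (the typical exon length quoted earlier) are illustrative choices.

```python
def length_probability(p: float, k: int) -> float:
    """P(len = k) for a state with self-transition probability p."""
    return p ** k * (1 - p)


def mean_length(p: float) -> float:
    """Expected value of k under the geometric distribution above."""
    return p / (1 - p)


def self_loop_for_mean(mean: float) -> float:
    """Self-transition probability that gives the requested mean length."""
    return mean / (mean + 1)


p = self_loop_for_mean(150)
for k in (0, 50, 150, 300):
    print(k, length_probability(p, k))
```

Even when p is tuned to give the right mean, the probabilities decrease monotonically in k, so the most likely length is always the shortest one. A length distribution with a peak away from zero cannot be represented this way.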
Exon Length
46
23
Generalized HMM
• states + length distribution
• states emit strings of symbols: state S_i chooses a length D_i from a given probability distribution and emits a string X_i X_i … X_i of that length
• a parse of the observation assigns subsequences to states
• decoding is Viterbi-like
• time consuming, hard to train
GenScan, GeneZilla, Genie:
• models for subsequences in the genome
• transitions biologically consistent
• statistics depending on C+G content
GenScan
By Chris Burge (1997), Stanford University.
Prediction of complete gene structure:
• Introns and exons
• Promoter sites
• Polyadenylation signals
Takes into account the length distributions.
48
24
GenScan Model
[Figure: GenScan model with states for exons, introns, initial/terminal exons, 5'/3' UTR, and promoter/polyA]
Figure from: M.Q. Zhang, Computational prediction of eukaryotic protein-coding genes, 2002.
GenScan Model
From: L. Cerutti
50
25
GenScan
On a human gene test set:
• Sensitivity: 86%
• Specificity: 81%
From: Srabanti Maji and Deepak Garg, Progress in Gene Prediction: Principles and Challenges, Current Bioinformatics, 2013, 8.
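The lecture does not define the two measures. The sketch below assumes the convention commonly used in the gene-finding literature at the nucleotide level: sensitivity = TP / (TP + FN) and specificity = TP / (TP + FP), where the latter is what other fields call precision. The function name and the toy coordinates are illustrative.

```python
def nucleotide_accuracy(true_coding: set[int], predicted_coding: set[int]):
    """Return (sensitivity, specificity) over sets of coding positions."""
    tp = len(true_coding & predicted_coding)   # coding and predicted coding
    fn = len(true_coding - predicted_coding)   # coding but missed
    fp = len(predicted_coding - true_coding)   # predicted but not coding
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, specificity


annotated = set(range(100, 200))   # hypothetical true exon
predicted = set(range(120, 220))   # hypothetical predicted exon
print(nucleotide_accuracy(annotated, predicted))
```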
Bibliography
[1] H. Carrillo and D. Lipman. The multiple sequence alignment problem in biology. SIAM J. Appl. Math., 48:1073–1082, 1988.
[2] D. Feng and R. F. Doolittle. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol., 25:351–360, 1987.
[3] W. M. Fitch and E. Margoliash. Construction of phylogenetic trees. Science, 155:279–284, 1967.
[4] D. Gusfield. Algorithms on Strings, Trees and Sequences. Cambridge University Press, New York, 1997.
[5] T. Jiang, L. Wang, and E. L. Lawler. Approximation algorithms for tree alignment with a given phylogeny. Algorithmica, 16:302–315, 1996.
[6] D. J. Lipman, S. Altschul, and J. Kececioglu. A tool for multiple sequence alignment. Proc. Natl. Acad. Sci., 86:4412–4415, 1989.
[7] M. Murata, J.S. Richardson, and J.L. Sussman. Three protein alignment. Medical Information Sciences, 231:9, 1999.
[8] J. D. Thompson, D. G. Higgins, and T. J. Gibson. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22:4673–4680, 1994.
[9] L. Wang and T. Jiang. On the complexity of multiple sequence alignment. J. Computational Biology, 1:337–348, 1994.
[10] http://www.uib.no/aasland/chromo/chromoCC.html.
[11] http://www.uib.no/aasland/chromo/chromo-tree.gif.
[12] L. Sterck, ProCoGen Training Workshop, Umea, 31-1 2013.
[13] D. Frishman, Gene prediction in Eukaryotes, Technische Universität München.
[14] G. Gremme. Computational Gene Structure Prediction. Ph.D. thesis, University of Hamburg, 2012.
Appendix
Further slides (55 – 93)
are for information only.
54
27
Gene Finding and
Gene Structure Prediction
Overview of Recent Methods
55
Gene Finding
• 'Ad Hoc'
  – Scoring methods for ORFs, etc.
  – Glimmer, GlimmerM (Ab Initio)
• Homology
  – Based on similarities with previously found genes and gene structures
  – Families of sequences
  – Ad hoc: Grail; Probabilistic: TwinScan, Slam, Twain
• Ab Initio
  – Based on composition and signal
  – Probabilistic modelling
    • HMMs, GHMMs, etc. probabilistic models
    • GenScan, Tigrscan
• Integrative
  – EuGene, Combiner
Gene Prediction Challenges
In general:
– Low information content of sequence signals
– Real signals difficult to detect
– Handling of sequencing errors
– New genomes
Prokaryotes:
– Short genes can easily be missed
– Overlapping genes (transcript, CDS)
Eukaryotes:
– Complex gene structure: nested in introns, merged on opposite strands
– Splicing
– Pseudogenes
– Characterization of mRNAs, lncRNAs, vlincRNAs (very long intergenic non-coding RNAs), anti-sense RNAs, etc.
Ab Initio Gene Prediction
• Ab initio
– Signals: start (TSS), stop (TTS), splice, polyA sites, promoter sites, TATA-box, etc.
– Content based: GC, coding, non-coding, base
frequencies and periodicity (FFT-analysis)
– Statistical models: MM, HMM, WMM (Weight
Matrix Models), SVM, NN, etc.
– Organism dependent
– Many challenges remain
58
29
Ab Initio Gene Prediction
• Genscan (Burge, Karlin 1997)
• Fgenesh (Solovyev, Salamov, 1997)
• GeneMark (Lukashin, Borodovsky, 1998)
• GRAIL (NN) (Xu et al., 1994)
• GlimmerM (Pertea et al. 2002)
• …
• W. Zhu, A. Lomsadze and M.K. Borodovsky, Ab initio gene identification in metagenomic sequences. Nucleic Acids Research, 2010, Vol. 38, No. 12
Comparative Gene Prediction
Comparative:
– Homology between sequences
– DNA sequences, protein sequences
– Human genome – mouse genome
– Alignments of transcripts
Challenges/problems:
- Organism-specific genes
- Exon-intron boundaries
- Completeness of homologous sequences
- Availability of homologous sequences
- DNA: quality and completeness of cDNAs and EST sequences
- DNA: expression levels => coverage differences
- Still only about 70% of all transcripts are available
See also: www.genprediction.org
GHMM
• mGene (+SVM)
• SNAP
• GeneZilla
• GlimmerHMM
• ChemGenome
• TWAIN
• GenScan
• TigrScan
• Genie
• Exonomy
• Phat
HMM
• Unveil
• Veil
• HMMGene
Homology
• TWINSCAN (GHMM)
• TWAIN
• SLAM
• SGP-1/-2
• GenomeScan
• DoubleScan
‘Ad Hoc’
• GlimmerM
• Grail (NN)
• GrailEXP
• MORGAN
• GeneMark
• FGenesH
Integrated
• Combiner
• EuGene
• GAZE
• JigSaw
61
Integrated Approaches: EuGene
EuGene (Schiex et al. 2001, 2008)
In same group as:
• Twinscan (Flicek et al., 2003)
• Augustus (Stanke et al., 2006)
Approach
• First homology alignment information
• Alignment is used for gene structure modeling
62
31
EuGene
Interpolated MM
GenScan, GenID, …
Ab Initio Probabilistic modeling
Homology based
From [12]: L. Sterck, ProCoGen Training Workshop, Umea, 31-1 2013.
Prediction Graph
From [12].
EuGene Output
From [12].
A Genome Annotation Pipeline
From [12].
mGene.Web
Gabriele Schweikert, et al.
mGene.web: a web service for accurate computational
gene finding.
Nucleic Acids Research, Vol. 37, 2009.
[Figure labels: Transcription Start Site, Translation Initiation Site, Donor Site, Acceptor Site]
Results: nGASP Challenge; ab initio; nematode genomes
Ab Initio Gene Identification
W. Zhu, A. Lomsadze and M.K. Borodovsky
Ab initio gene identification in metagenomic
sequences.
Nucleic Acids Research, 2010, Vol. 38, No. 12
69
Ab Initio Gene Identification
Goal: Gene identification in DNA sequences derived from
shotgun sequencing of microbial communities.
Characteristics:
• short nucleotide sequence of anonymous origin
• uncertainty in model parameters
Approach:
• estimate parameters from evolutionary dependencies
between frequencies of oligonucleotides in protein-coding
regions and genome nucleotide composition.
70
35
Ab Initio Gene Prediction
Original version (1999) used for:
(i) reconstructing codon frequency vector needed for gene finding in viral
genomes
(ii) initializing parameters of self-training gene finding algorithms.
Improved version:
• Using new prokaryotic genomes to enhance the original approach by
using direct polynomial and logistic approximations of oligonucleotide
frequencies. (Non-linear polynomial regression is used.)
• Separate models for bacteria and archaea.
Evaluation and Application:
• Assess accuracy on known prokaryotic genomes split into short
sequences.
• Several thousands of new genes added to existing annotations of
several human and mouse gut metagenomes.
71
Improved Regression Methods
Observed codon (CGT) frequencies in 319 bacterial genomes.
• Codon frequency dependencies on GC content
• Linear regression (1999)
• Logistic regression
• Order-3 polynomial regression
GHMM
Coding/Non-Coding Length distributions
GHMM GeneMark.hmm
73
Results
BlastN
74
37
Other Methods: Gene prediction
M. Roy and S. Barman
Effective gene prediction by high
resolution frequency estimator based on
least-norm solution technique.
EURASIP Journal on Bioinformatics and
Systems Biology 2014 2014:2
75
Digital Signal Processing for Gene prediction
The noise subspace concept for finding hidden periodicities in DNA sequences.
• Coding segments have a 3-base periodicity, in contrast with noncoding regions.
• The novel least-norm estimator shows sharp period-3 peaks in coding regions, completely eliminating background noise.
• Resolution, quality factor, sensitivity, specificity, miss rate, and wrong rate are used to evaluate the least-norm gene prediction method against existing methods.
• Comparison with the existing sliding discrete Fourier transform (SDFT) on several genes from various organisms shows that the least-norm estimator has better performance on gene prediction.
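The 3-base periodicity can be made visible with the baseline sliding DFT rather than the least-norm estimator of the paper. The sketch below is an illustration under stated assumptions: it uses the four binary indicator sequences (one per base) and sums their spectral power at frequency 1/3; the function names, window size and step are illustrative.

```python
import cmath


def period3_power(dna: str) -> float:
    """Summed power of the four base-indicator sequences at frequency 1/3."""
    total = 0.0
    for base in "ACGT":
        # Fourier coefficient of the indicator sequence of this base.
        coeff = sum(cmath.exp(-2j * cmath.pi * pos / 3)
                    for pos, b in enumerate(dna) if b == base)
        total += abs(coeff) ** 2
    return total


def sliding_period3(dna: str, window: int = 351, step: int = 3):
    """Return (start, power) pairs for windows slid along the sequence."""
    return [(start, period3_power(dna[start:start + window]))
            for start in range(0, len(dna) - window + 1, step)]
```

A perfectly periodic sequence such as `"GCA" * 20` gives a large value, while a homopolymer gives a value near zero. Peaks of `sliding_period3` along a genome are candidate coding regions, although this plain spectrum is noisier than the estimator discussed above.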
Gene prediction
Coding tables define a function from {A, C, T, G} to complex values, in contrast with the binary indicator functions, where:
1000 = A, 0100 = C, 0010 = T, 0001 = G
Example: x[n] = [ATGCCTTAGGAT] -> [-1 j 1 1 -j -j j j -1 1 1 -1]
Principal Eigenvalues
• x[n] is modeled as the sum of p complex exponentials and white noise w(n).
• This allows a decomposition of the signal as a sum of signal eigenvectors and a sum of noise eigenvectors.
Least-Norm Estimator
• The least-norm allows a periodogram
modification that increases the quality
factor for various genes.
79
Power Spectrum Densities (PSD)
• Plots of power spectrum density (PSD) for F56F11.4a
gene.
80
40
Power Spectrum Densities (PSD)
81
Eigenvalue-ratio Plots
82
41
Eigenvalue-ratio Plots
83
Performance
84
42
Some Notes on Intron Length Distributions
S. William Roy and D. Penny. Intron length distributions and gene prediction. Nucleic Acids Research, 2007, Vol. 35, No. 14, pp. 4737–4742.
Intron Length Distributions
Accurate gene prediction in eukaryotes:
• introns are not translated => intron lengths are not
expected to respect a coding frame
• number of genomic 3n introns ≈ the number of
3n+1 ≈ 3n+2 genomic introns
• a genome-wide excess of 3n introns suggests:
– many internal exonic sequences have been incorrectly
called introns
• a deficit of 3n introns suggests:
– many 3n introns that lack stop codons have been
incorrectly called exons
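The check on 3n, 3n+1 and 3n+2 introns can be sketched as a goodness-of-fit test. This is an illustrative sketch, not the procedure of the paper: the function names are hypothetical, and the chi-square test against equal thirds (critical value 5.991 for 2 degrees of freedom at the 5% level) is an assumption about how one might quantify the skew.

```python
from collections import Counter

CHI2_CRITICAL_DF2_5PCT = 5.991  # chi-square critical value, 2 degrees of freedom


def intron_phase_counts(lengths):
    """Count introns of length 3n, 3n+1 and 3n+2."""
    counts = Counter(length % 3 for length in lengths)
    return [counts[0], counts[1], counts[2]]


def chi_square_uniform(counts):
    """Chi-square statistic against equal expected counts in the three classes."""
    expected = sum(counts) / 3
    return sum((c - expected) ** 2 / expected for c in counts)


def is_skewed(lengths):
    """True if the 3n / 3n+1 / 3n+2 split deviates significantly from equal thirds."""
    return chi_square_uniform(intron_phase_counts(lengths)) > CHI2_CRITICAL_DF2_5PCT
```

The sign of the deviation then indicates the likely problem: too many 3n introns points to exonic sequence called as introns, and too few points to 3n introns called as exons.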
86
43
Intron Length Distributions
A survey of genomic annotations for 29 diverse
eukaryotic species:
• showed a skew in intron length distributions
• Indication of systematic problems with gene
prediction.
• Evaluation of length distributions of predicted
introns can be a very useful quality feature in
genome annotation protocols.
87
29 Eukaryotic Species
88
44
3n-, 3n+1-, 3n+2-Intron Percentages
• An excess of 3n introns => many internal exonic sequences may have been incorrectly called introns
• A deficit of 3n introns => many 3n introns may have been incorrectly called exons
Excess of 3n introns in T. pseudonana
[Figure labels: stop codon, frame shift, exon]
To a large extent, genes in T. pseudonana were predicted using homology searches.
Many 3n introns are not true introns: most 3n introns lack in-frame stop codons.
Deficit of 3n introns in Bigelowiella natans
• Here many 3n introns seem to have been missed.
• This is due to the short intron length (less than 36 bp).
• Short introns may lack stop codons
• => introns without stops may have escaped correct prediction and were misclassified as exons
Assembly indels in E. histolytica => artificial introns
Assembly indels (gaps) lead to an excess of 3n+2 introns.
92
46
Conclusion
• Genome annotation requires balancing of false
negatives and positives, and accuracy
• Current genome annotations (2007) still need
improvement
• Systematic biases in gene prediction and genome
assembly problems can be detected by evaluating
the distributions of predicted intron lengths
93