Gene Finding

Gene Split
• Split genes were discovered by Phillip Sharp and Richard Roberts in 1977, in experiments with the mRNA of hexon, a viral protein.
[Figures from [13]]

Gene Finding
Central dogma: DNA -> (transcription) -> RNA -> (translation) -> protein
Eukaryotes: only ~2% coding … find these regions

                  Prokaryotes                    Eukaryotes
Nucleus           no nucleus                     nucleus
Coding fraction   most of genome is coding       part of genome is coding
                  (H. influenzae: 70% coding)    (human: ~2% coding)
Gene structure    continuous genes               introns & exons
Speed             fast replication               slow transcription and splicing

Main Gene-Finding Strategies
See [12] and [13].

Gene Finding in Prokaryotes

Codon Statistics
[Figure: codon table (UUU, UUC, …)]
• start codon: AUG
• stop codons: UAA, UAG, UGA

Reading Frames
DNA strand (5’-end to 3’-end, downstream):
  AGGCATGCGATCCAAGTTCCACCATGATGACATGATGACTA
Complementary DNA strand (3’-end to 5’-end; C pairs with G, A pairs with T):
  TCCGTACGCTAGGTTCAAGGTGGTACTACTGTACTACTGAT

Reading Frames
DNA strand (5’-end to 3’-end), three reading frames:
  AGG CAT GCG ATC CAA GTT CCA CCA TGA TGA CAT GAT GAC TAA
  GGC ATG CGA TCC AAG TTC CAC CAT GAT GAC ATG ATG ACT AA.
  GCA TGC GAT CCA AGT TCC ACC ATG ATG ACA TGA TGA CTA A..
Complementary DNA strand (3’-end to 5’-end), three reading frames:
  ..T CCG TAC GCT AGG TTC AAG GTG GTA CTA CTG TAC TAC TGA
  .TC CGT ACG CTA GGT TCA AGG TGG TAC TAC TGT ACT ACT GAT
  TCC GTA CGC TAG GTT CAA GGT GGT ACT ACT GTA CTA CTG ATT

Open Reading Frames (ORFs)
An ORF is the part of a reading frame that does not contain any stop codons.
[Figure: the sequence ACTGACTGACTG… split into codons in each reading frame, 3 (or 6) ORFs, with start (AUG) and stop (UAA, UAG, UGA) codons marked]
• A coding region ends with a stop codon.
• 3 out of the 64 codons are stop codons, i.e., in random sequence the probability of a stop codon is 3/64 => one in every 21 codons is expected to be a stop.
• But: the average protein-coding region is about 1000 bp, i.e. much longer.
• Thus search for long ORFs in all 3 reading frames => coding regions!
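The long-ORF search described above can be sketched in a few lines of Python. This is a minimal illustration, not a production gene finder: it assumes that an ORF runs from an ATG to the first in-frame stop codon, and the function names and the `min_codons` threshold are hypothetical choices made for this example.

```python
from typing import Dict, List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement, i.e. the opposite strand read 5' to 3'."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 100) -> List[Tuple[int, int, int]]:
    """Scan the three forward reading frames for ATG...stop stretches.

    Returns (frame, start, end) tuples; start is the index of the A of ATG,
    end is one past the last base of the stop codon.
    Assumption for this sketch: only ATG is treated as a start codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None:
                if codon == "ATG":
                    start = i
            elif codon in STOP_CODONS:
                # Number of codons before the stop, including the ATG.
                if (i - start) // 3 >= min_codons:
                    orfs.append((frame, start, i + 3))
                start = None
    return orfs


def find_orfs_six_frames(seq: str, min_codons: int = 100) -> Dict[str, List[Tuple[int, int, int]]]:
    """Run the scan on both strands; '-' coordinates refer to the reverse complement."""
    return {
        "+": find_orfs(seq, min_codons),
        "-": find_orfs(reverse_complement(seq), min_codons),
    }
```

Raising `min_codons` trades sensitivity for specificity: a low threshold reports many spurious ORFs, while a high one drops short genes.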
Drawbacks of searching for long ORFs:
- misses short genes
- misses overlapping long ORFs on opposite strands
- finds too many candidates (6500 ORFs in the E. coli genome, but only 1100 genes)

Genes are not Random
Codon frequencies (see the codon table on the next slide):

Amino acid          Codons   ‘Random’ ratio   Coding ratio
Leucine (Leu)       6        6                6.9
Alanine (Ala)       4        4                6.5
Tryptophan (Trp)    1        1                1

Also: in some cases the third position of a codon is A or T 90% of the time.
=> Genes are not random!

Codons
[Figure: codon table (UUU, UUC, …)]
• start codon: AUG
• stop codons: UAA, UAG, UGA

Nucleotide Based Markov Chains
• For CpG islands: works (√)
• For non-coding ORFs (NORFs):
  - On average shorter than 100 codons
  - G: gene sequence Markov model
  - R: NORF sequence Markov model
  - Null model (frequency in all data)
  - Different probabilities
  - Big variance => not useful
  - 2nd order Markov models give similar results
[Figure: score distributions for genes and NORFs]

Codon Based Markov Chains
Assume the ORFs in a set of sequences are known => ORF and NORF codon frequencies.
• Make an ORF Markov model with a state for each codon => 64-state Markov model.
• Use the model to calculate the probability that a given ORF is a coding region.
• The figure shows coding region recognition (NORF versus coding ORF, C-ORF).

Codon Based Markov Chains: Codon Frequencies
• Given a set of coding ORFs, we can determine the frequency f_abc with which a codon abc occurs in a coding region.
• The probability p_1 that a coding sequence appears in the 1st reading frame can be determined from the given f_abc’s, as the product of the f_abc of the codons read in that frame.
• Let p_2 and p_3 be the corresponding probabilities for the 2nd and 3rd reading frames.
• Now P_i = p_i / (p_1 + p_2 + p_3) is the probability that reading frame i is the coding reading frame.
• Slide a window of size n along the sequence, and compute P_i for each start position of the window.
• Plot for each reading frame i: log(P_i / (1 - P_i)), using a 25-codon window.
• The quality of the result depends on the quality of the frequency counts!

Coding Codon Frequencies
Codon Preference Program: plot for each reading frame i: log(P_i / (1 - P_i)), using a 25-codon window.
[Figure: codon preference plot]
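The codon preference statistic log(P_i / (1 - P_i)) can be computed for one window as sketched below. This is a minimal sketch under stated assumptions: `codon_freq` is a hypothetical dictionary of coding codon frequencies f_abc estimated elsewhere, `floor` is an assumed small value for codons missing from that table, and p_i is taken to be the product of the f_abc of the codons read in frame i. The computation is done in log space to avoid underflow on long windows.

```python
import math
from typing import Dict, List


def _log_sum_exp(values: List[float]) -> float:
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def frame_log_odds(seq: str, codon_freq: Dict[str, float], start: int = 0,
                   window: int = 25, floor: float = 1e-6) -> List[float]:
    """Return log(P_i / (1 - P_i)) for the three reading frames of one window.

    P_i = p_i / (p_1 + p_2 + p_3), so P_i / (1 - P_i) = p_i / (sum of the other two).
    """
    seq = seq.upper()
    if start + 3 * window + 2 > len(seq):
        raise ValueError("window does not fit in the sequence for all three frames")
    log_p = []
    for frame in range(3):
        total = 0.0
        for k in range(window):
            i = start + frame + 3 * k
            total += math.log(codon_freq.get(seq[i:i + 3], floor))
        log_p.append(total)
    return [log_p[i] - _log_sum_exp(log_p[:i] + log_p[i + 1:]) for i in range(3)]
```

Calling `frame_log_odds` for successive values of `start` and plotting the three returned values gives one curve per reading frame; the frame whose curve stays positive is the likely coding frame.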
Coding Codon Frequencies
The following bias is used: the bias for each reading frame is the fraction of third codon positions in that frame that are either G or C.
[Figures]

Promoter Region Detection
‘Consensus’ sequence around the RNA transcription start point, i.e. not exact:
  … n TTGAC n18 TATAAT n6 N n …
  (labels: TATA box, start coding)
• The promoter region is an ‘anchor’ point for polymerase, i.e., a regulatory region that controls the transcription rate.
• TATAAT is called the TATA box or Pribnow box.
• Use the frequency of occurrence of these sequences.
• Variability of binding sites => no exact method for TATA box identification.

Promoter Region Detection
  … n TTGAC n18 TATAAT n6 N n …
  (labels: TATA box, start coding)
Construct statistics f_{b,i}: the frequency of base b in position i of known promoter region suffixes => position weight matrix (cf. ‘profile’):

pos    A    C    G    T
1      2    9   10   79
2     95    2    1    3
3     26   14   16   44
4     59   13   15   13
5     51   20   13   17
6      1    3    0   96

Note: there is an 80% correlation between the weight matrix score of the region and the binding energy.

Promoter Region Detection
• Given sequence $S = B_1 B_2 \ldots B_6$
• Likelihood of S being a TATA box, in the usual weight-matrix form: $P(S \mid \text{promoter}) = \prod_{i=1}^{6} f_{B_i,i}$
• Likelihood of S given that it is a non-promoter, with $f_b$ assumed here to be the overall frequency of base b: $P(S \mid \text{non-promoter}) = \prod_{i=1}^{6} f_{B_i}$
• Log-likelihood ratio: $\log \frac{P(S \mid \text{promoter})}{P(S \mid \text{non-promoter})} = \sum_{i=1}^{6} \log \frac{f_{B_i,i}}{f_{B_i}}$

Promoter Detection
• HMM-based in GenScan
• Neural network in Grail

HMM Gene Finding
A. Krogh, I. Saira Mian, D. Haussler, A Hidden Markov Model that finds genes in E. coli DNA, Nucleic Acids Research, Vol. 22, pp 4768-4778, 1994

HMM Model
• 61 codon models
• Stop codon models
  - TAA and TGA
  - TAG
• Start codon model
  - ATG
  - GTG
  - TTG (rare)

More Advanced Intergenic Model
[Figure]

HMM Results
Data set:
• EcoSeq6 contained about one third of the complete E. coli genome (total 5.44x10^6 nucleotides, 5416 genes), and was not fully annotated at that time
HMM training:
• on ~10^6 nucleotides from the EcoSeq6 database of labeled genes (K.
Rudd, 1991)
HMM testing:
• on the remainder of ~325,000 nucleotides
Method:
• For each contig in the test set, the Viterbi algorithm was used to find the most likely path through the hidden states of the HMM
• This path was then used to define a parse of the contig into genes separated by intergenic regions

HMM Results
Post-processing consists of 3 rules to handle:
• Overlapping genes, which will look like frame-shifts
• Short genes overlapping with long genes on the opposite strand, as a result of self-complementary type codons

HMM Results
• 80% of the labeled protein coding genes were found exactly (i.e. with precisely the same start and end codon)
• 5% were found within 10 codons of the start codon
• 5% overlap by at least 60 bases or 50%
• 5% were missed completely
• Several new genes were indicated
• Several insertion and deletion errors were labeled in the contig parse

Gene Finding in Eukaryotes
Central dogma: DNA -> (transcription) -> RNA -> (translation) -> protein
Eukaryotes: only ~2% coding … find these regions?

Eukaryotes Gene Structure
• exons: expressed
• introns: noncoding
• (alternative) splicing
• donor: exon-intron boundary
• acceptor: intron-exon boundary
• tss: transcription start site
• polyA: polyadenylation
• utr: untranslated region
• 5’: tss and start codon
• 3’: stop codon and polyA
[Figures: eukaryotic gene structure]

Splicing Consensus Sequence
[Figure: splicing consensus sequence, 5 important bases. © 2000 by Geoffrey M. Cooper]

Spliceosome at work
[Figure]

Typical Distributions: Vertebrates
Some typical data:
• Average gene length: 30 kb, with a coding region ~1-2 kb long.
• The average coding region has 6 exons, each ~150 bp long.
• The promoter is about 6 bp long and appears about 30 bp upstream of the transcription start site (TSS).
• Transcription rate of less than 50 bases/sec
• The splicing process takes several minutes
But huge deviations exist:
• Dystrophin is 2.4 Mb long.
• Blood coagulation factor VIII has 26 exons with sizes from 69 bp to 3106 bp.

Typical Distributions: Vertebrates
[Figure: length distributions. Panels: introns (geometric distribution), initial exons, internal exons, terminal exons]

Markov Sequence Models
Models for coding and non-coding regions:
• Use windows of 6 bases => 5th order Markov model
• Two probability tables, each of size 4^6.
• No reading frame information => homogeneous model.
• A non-homogeneous model can be built using different tables for each reading frame.
Problems:
• Exons are too short
• Splice junctions (donor and acceptor sites) are difficult to detect

Introns: splicing
• Splice sites must be located precisely, as a miss would lead to a nonsense interpretation of the rest of the sequence.
• A so-called anchor point in the intron, the branch point, appears frequently.
• Pyrimidine-rich (bases C, T) areas also appear between the branch point and the acceptor site.
Algorithms are based on position-specific weight matrices. This approach does not exploit all the information (reading frames, intron/exon states, etc.) and is not suitable for short genes.
Recent studies: sequence characteristics of (multiple) branch points per intron.
Picture from: http://en.wikipedia.org/wiki/RNA_splicing

Consensus Sequence
Intron consensus sequences / position-specific weight matrices:

  donor site       branchpoint     pyrimidine rich (CT)    acceptor site
  AGGUAAGU … …     CTGAC … …       < 15 bp >               NCAGG …

Freq% (in slide order): 62 77 100 100 60 74 84 50 | 63-91 78 100 100 55

HMM for Gene Finding
Geometric distribution of the intron state length k: P(intron of length k) = q^k (1-q)
Geometric distribution of the exon state length k: P(exon of length k) = p^k (1-p)
An HMM is memoryless => the modeled length distribution is geometric.
But exon length does not have a geometric distribution!
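The geometric length distribution implied by a single HMM state with a self-loop can be made concrete with a short sketch. The function names are hypothetical, and the value of 150 used in the usage note is taken from the typical exon length quoted on the earlier slide; matching the mean says nothing about matching the shape of the real distribution.

```python
def geometric_length_pmf(k: int, p: float) -> float:
    """P(length = k) = p**k * (1 - p) for a state with self-loop probability p."""
    if k < 0:
        raise ValueError("length must be non-negative")
    return (p ** k) * (1.0 - p)


def self_loop_for_mean(mean_length: float) -> float:
    """Self-loop probability p whose geometric distribution has the given mean.

    For P(k) = p**k * (1 - p), k = 0, 1, 2, ..., the mean is p / (1 - p),
    hence p = mean / (mean + 1).
    """
    return mean_length / (mean_length + 1.0)
```

With `p = self_loop_for_mean(150)`, the most probable length is still the shortest one, since `geometric_length_pmf(k, p)` decreases monotonically in `k`. A distribution that peaks away from zero cannot be reproduced by one self-looping state, whatever the value of `p`.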
=> An HMM cannot model arbitrary length distributions, such as that of exon lengths.
[Figure: a state with self-transition probability p and exit probability 1-p gives P(len = k) = p^k (1-p)]

Exon Length
[Figure: exon length distribution]

Generalized HMM
• States X_i have a length distribution and emit strings of symbols
• Choose the length D_i from a given probability distribution; state S_i then emits a subsequence of that length
• A parse of the observation assigns subsequences to states
• Viterbi-like decoding: time consuming, hard to train
• GenScan, GeneZilla, Genie:
  - models for subsequences in the genome
  - transitions are biologically consistent
  - statistics depend on C+G content

GenScan
By Chris Burge (1997), Stanford University.
Prediction of complete gene structure:
• Introns and exons
• Promoter sites
• Polyadenylation signals
Takes the length distributions into account.

GenScan Model
[Figure: states for exon, intron, initial/terminal exon, 5’/3’ UTR, promoter/polyA. Figure from: M.Q. Zhang, Computational prediction of eukaryotic protein-coding genes, 2002]

GenScan Model
[Figure from: L. Cerutti]

GenScan On Human Genes
Test set:
• Sensitivity: 86%
• Specificity: 81%
From: Srabanti Maji and Deepak Garg, Progress in Gene Prediction: Principles and Challenges, Current Bioinformatics, 2013, 8

Bibliography
[1] H. Carrillo and D. Lipman. The multiple sequence alignment problem in biology. SIAM J. Appl. Math, 48:1073–1082, 1988.
[2] D. Feng and R. F. Doolittle. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol., 25:351–360, 1987.
[3] W. M. Fitch and E. Margoliash. Construction of phylogenetic trees. Science, 15:279–284, 1967.
[4] D. Gusfield. Algorithms on Strings, Trees and Sequences. Cambridge University Press, New York, 1997.
[5] T. Jiang, L. Wang, and E. L. Lawler. Approximation algorithms for tree alignment with a given phylogeny. Algorithmica, 16:302–315, 1996.
[6] D. J. Lipman, S. Altschul, and J. Kececioglu. A tool for multiple sequence alignment. Proc. Natl. Academy Science, 86:4412–4415, 1989.
[7] M. Murata, J.S. Richardson, and J.L. Sussman. Three protein alignment. Medical Information Sciences, 231:9, 1999.
[8] J. D. Thompson, D.
G. Higgins, and T. J. Gibson. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res, 22:4673–80, 1994.
[9] L. Wang and T. Jiang. On the complexity of multiple sequence alignment. Computational Biology, 1:337–348, 1994.
[10] http://www.uib.no/aasland/chromo/chromoCC.html.
[11] http://www.uib.no/aasland/chromo/chromo-tree.gif.
[12] L. Sterck, ProCoGen Training Workshop, Umea, 31-1 2013.
[13] D. Frishman, Gene prediction in Eukaryotes, Technische Universität München.
[14] G. Gremme. Computational Gene Structure Prediction. Ph.D. thesis, University of Hamburg, 2012.

Appendix
Further slides (55 – 93) are for information only.

Gene Finding and Gene Structure Prediction
Overview of Recent Methods

Gene Finding
• ‘Ad Hoc’
  - Scoring methods for ORFs, etc.
  - Glimmer, GlimmerM (Ab Initio)
• Homology
  - Based on similarities with previously found genes and gene structures
  - Families of sequences
  - Ad hoc: Grail; Probabilistic: TwinScan, Slam, Twain
• Ab Initio
  - Based on composition and signal
  - Probabilistic modelling: HMMs, GHMMs, etc.
  - GenScan, Tigrscan
• Integrative
  - EuGene, Combiner

Gene Prediction Challenges
In general:
  - Low information content of sequence signals
  - Real signals are difficult to detect
  - Handling of sequencing errors
  - New genomes
Prokaryotes:
  - Short genes can easily be missed
  - Overlapping genes (transcript, CDS)
Eukaryotes:
  - Complex gene structure: nested in introns, merged on opposite strands
  - Splicing
  - Pseudogenes
  - Characterization of mRNAs, lncRNAs, vlincRNAs (very long intergenic non-coding RNAs), anti-sense RNAs, etc.

Ab Initio Gene Prediction
• Ab initio
  - Signals: start (TSS), stop (TTS), splice, polyA sites, promoter sites, TATA-box, etc.
  - Content based: GC, coding, non-coding, base frequencies and periodicity (FFT analysis)
  - Statistical models: MM, HMM, WMM (Weight Matrix Models), SVM, NN, etc.
  - Organism dependent
  - Many challenges remain

Ab Initio Gene Prediction
• Genscan (Burge, Karlin 1997)
• Fgenesh (Solovyev, Salamov, 1997)
• GeneMark (Lukashin, Borodovsky, 1998)
• GRAIL (NN) (Xu et al., 1994)
• GlimmerM (Pertea et al. 2002)
• …
• W. Zhu, A. Lomsadze and M.K. Borodovsky, Ab initio gene identification in metagenomic sequences. Nucleic Acids Research, 2010, Vol. 38, No. 12

Comparative Gene Prediction
Comparative:
  - Homology between sequences
  - DNA sequences, protein sequences
  - Human genome and mouse genome
  - Alignments of transcripts
Challenges/problems:
  - Organism-specific genes
  - Exon-intron boundaries
  - Completeness of homologous sequences
  - Availability of homologous sequences
  - DNA: quality and completeness of cDNAs and EST sequences
  - DNA: expression levels => coverage differences
  - Still only about 70% of all transcripts are available

See also: www.genprediction.org
• GHMM: mGene (+SVM), SNAP, GeneZilla, GlimmerHMM, ChemGenome, TWAIN, GenScan, TigrScan, Genie, Exonomy, Phat
• HMM: Unveil, Veil, HMMGene
• Homology: TWINSCAN (GHMM), TWAIN, SLAM, SGP-1/-2, GenomeScan, DoubleScan
• ‘Ad Hoc’: GlimmerM, Grail (NN), GrailEXP, MORGAN, GeneMark, FGenesH
• Integrated: Combiner, EuGene, GAZE, JigSaw

Integrated Approaches: EuGene
EuGene (Schiex et al. 2001, 2008)
In the same group as:
• Twinscan (Flicek et al., 2003)
• Augustus (Stanke et al., 2006)
Approach:
• First, homology alignment information
• The alignment is used for gene structure modeling

EuGene
[Figure labels: Interpolated MM; GenScan, GenID, …; Ab Initio; Probabilistic modeling; Homology based. From [12]: L. Sterck, ProCoGen Training Workshop, Umea, 31-1 2013.]

Prediction Graph
[Figure from [12].]

EuGene Output
Figure from [12]: L.
Sterck, ProCoGen Training Workshop, Umea, 31-1 2013.

A Genome Annotation Pipeline
[Figure from [12].]

mGene.Web
Gabriele Schweikert, et al., mGene.web: a web service for accurate computational gene finding, Nucleic Acids Research, Vol. 37, 2009.
[Figure labels: Translation Start Site, Donor Site, Translation Initiation Site, Acceptor Site]

mGene.Web
[Figure: results of the nGASP challenge; ab initio; nematode genomes]

Ab Initio Gene Identification
W. Zhu, A. Lomsadze and M.K. Borodovsky, Ab initio gene identification in metagenomic sequences. Nucleic Acids Research, 2010, Vol. 38, No. 12

Ab Initio Gene Identification
Goal: gene identification in DNA sequences derived from shotgun sequencing of microbial communities.
Characteristics:
• short nucleotide sequences of anonymous origin
• uncertainty in model parameters
Approach:
• estimate parameters from evolutionary dependencies between the frequencies of oligonucleotides in protein-coding regions and the genome nucleotide composition.

Ab Initio Gene Prediction
The original version (1999) was used for:
(i) reconstructing the codon frequency vector needed for gene finding in viral genomes
(ii) initializing parameters of self-training gene finding algorithms.
Improved version:
• Uses new prokaryotic genomes to enhance the original approach with direct polynomial and logistic approximations of oligonucleotide frequencies. (Non-linear polynomial regression is used.)
• Separate models for bacteria and archaea.
Evaluation and application:
• Assess accuracy on known prokaryotic genomes split into short sequences.
• Several thousand new genes were added to existing annotations of several human and mouse gut metagenomes.

Improved Regression Methods
Observed codon (CGT) frequencies in 319 bacterial genomes.
• Codon frequency dependencies on GC content
• Linear regression (1999)
• Logistic regression
• Order-3 polynomial regression

GHMM
[Figure: GHMM of GeneMark.hmm with coding/non-coding length distributions]

Results
[Figure: results, including BlastN]

Other Methods: Gene prediction
M. Roy and S. Barman, Effective gene prediction by high resolution frequency estimator based on least-norm solution technique. EURASIP Journal on Bioinformatics and Systems Biology 2014, 2014:2

Digital Signal Processing for Gene prediction
The noise subspace concept for finding hidden periodicities in DNA sequences.
• Coding segments have a 3-base periodicity, in contrast with noncoding regions.
• The novel least-norm estimator shows sharp period-3 peaks in coding regions, completely eliminating background noise.
• Resolution, quality factor, sensitivity, specificity, miss rate, and wrong rate are used to evaluate the least-norm gene prediction method against existing methods.
• Comparison with the existing sliding discrete Fourier transform (SDFT) on several genes from various organisms shows that the least-norm estimator performs better on gene prediction.

Gene prediction
Coding tables define a function from A, C, T, G to a complex value.
This is in contrast with the binary indicator function, where: 1000 = A, 0100 = C, 0010 = T, 0001 = G
Example: x[n] = [ATGCCTTAGGAT] -> [-1 j 1 1 -j -j j j -1 1 1 -1]

Principal Eigenvalues
• x[n] is modeled as the sum of p complex exponentials and white noise w(n)
• This allows a decomposition of the signal into a sum of signal eigenvectors and a sum of noise eigenvectors.

Least-Norm Estimator
• The least-norm solution allows a periodogram modification that increases the quality factor for various genes.

Power Spectrum Densities (PSD)
• Plots of the power spectrum density (PSD) for the F56F11.4a gene.
[Figures]

Eigenvalue-ratio Plots
[Figures]

Performance
[Figure]

Some Notes on Intron Length Distributions
S. William Roy and D.
Penny, Intron length distributions and gene prediction. Nucleic Acids Research, 2007, Vol. 35, No. 14, 4737–4742.

Intron Length Distributions
Accurate gene prediction in eukaryotes:
• introns are not translated => intron lengths are not expected to respect a coding frame
• the number of genomic 3n introns ≈ the number of 3n+1 introns ≈ the number of 3n+2 introns
• a genome-wide excess of 3n introns suggests:
  - many internal exonic sequences have been incorrectly called introns
• a deficit of 3n introns suggests:
  - many 3n introns that lack stop codons have been incorrectly called exons

Intron Length Distributions
A survey of genomic annotations for 29 diverse eukaryotic species:
• showed a skew in intron length distributions
• this indicates systematic problems with gene prediction
• evaluating the length distributions of predicted introns can be a very useful quality feature in genome annotation protocols

29 Eukaryotic Species
[Figure]

3n-, 3n+1-, 3n+2-Intron Percentages
• an excess of 3n introns => many internal exonic sequences may have been incorrectly called introns
• a deficit of 3n introns => many 3n introns may have been incorrectly called exons

Excess 3n introns in T. pseudonana
[Figure labels: stop codon, frame shift, exon]
To a large extent, genes in T. pseudonana were predicted using homology searches.
Many 3n introns are not true introns: most 3n introns lack in-frame stop codons.

Deficit of 3n introns in Bigelowiella natans
• Here many 3n introns seem to have been missed.
• This is due to the short intron length (less than 36 bp).
• Short introns may lack stop codons
• => introns without stops may have escaped correct prediction and were misclassified as exons

Assembly indels in E. histolytica => artificial introns
Assembly indels (gaps) lead to an excess of 3n+2 introns.
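The 3n / 3n+1 / 3n+2 check described on these slides amounts to counting predicted intron lengths modulo 3. The sketch below is a minimal illustration; the function names are hypothetical, and the input is assumed to be a plain list of predicted intron lengths in bp taken from an annotation.

```python
from collections import Counter
from typing import Sequence, Tuple


def intron_phase_fractions(lengths: Sequence[int]) -> Tuple[float, float, float]:
    """Fractions of introns whose length is 3n, 3n+1 and 3n+2."""
    if not lengths:
        raise ValueError("need at least one intron length")
    counts = Counter(length % 3 for length in lengths)
    total = len(lengths)
    return (counts.get(0, 0) / total,
            counts.get(1, 0) / total,
            counts.get(2, 0) / total)


def three_n_skew(lengths: Sequence[int]) -> float:
    """Deviation of the 3n fraction from the expected 1/3.

    Positive values indicate an excess of 3n introns, negative values a deficit.
    """
    return intron_phase_fractions(lengths)[0] - 1.0 / 3.0
```

A value of `three_n_skew` near zero is what the slides describe as the expected case; whether a given deviation is significant would still need a statistical test, which is outside this sketch.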

Conclusion
• Genome annotation requires balancing of false negatives and positives, and accuracy
• Current genome annotations (2007) still need improvement
• Systematic biases in gene prediction and genome assembly problems can be detected by evaluating the distributions of predicted intron lengths