Gene Finding

Copyright notice
• Many of the images in this presentation are from Bioinformatics and Functional Genomics by Jonathan Pevsner (ISBN 0-471-21004-8). Copyright © 2003 by John Wiley & Sons, Inc.
• Many slides in this presentation are from slides by Dr. Jonathan Pevsner and other people. The copyright belongs to the original authors. Thanks!

Gene Finding: Why do it?
• Find and annotate all the genes within the large volume of DNA sequence data
 – Human DNA length ≈ 3.4 × 10^9 bp
 – Number of genes: 30,000–100,000
 – Coding percentage ≈ 1%
• Gain understanding of problems in basic biology
 – e.g., gene regulation: what are the mechanisms involved in transcription, splicing, etc.?
• Different emphasis between these goals has some effect on the design of computational approaches to gene finding.

Gene Finding
• Cells recognize genes from DNA sequence – they find genes through their biological processes
• Not so easy for us...

Where is the gene?
CTAGCAGGGACCCCAGCGCCCGAGAGACCATGCAGAGGTCGCCT
CTGGAAAAGGCCAGCGTTGTCTCCAAACTTTTTTTCAGGTGAGA
AGGTGGCCAACCGAGCTTCGGAAAGACACGTGCCCACGAAAGAG
GAGGGCGTGTGTATGGGTTGGGTTGGGGTAAAGGAATAAGCAGT
TTTTAAAAAGATGCGCTATCATTCATTGTTTTGAAAGAAAATGT
GGGTATTGTAGAATAAAACAGAAAGCATTAAGAAGAGATGGAAG
AATGAACTGAAGCTGATTGAATAGAGAGCCACATCTACTTGCAA
CTGAAAAGTTAGAATCTCAAGACTCAAGTACGCTACTATGCACT
TGTTTTATTTCATTTTTCTAAGAAACTAAAAATACTTGTTAATA
AGTACCTANGTATGGTTTATTGGTTTTCCCCCTTCATGCCTTGG
ACACTTGATTGTCTTCTTGGCACATACAGGTGCCATGCCTGCAT
ATAGTAAGTGCTCAGAAAACATTTCTTGACTGAATTCAGCCAAC
AAAAATTTTGGGGTAGGTAGAAAATATATGCTTAAAGTATTTAT
TGTTATGAGACTGGATATAT...

Types of Genes
• Protein coding – most genes
• RNA genes
 – rRNA
 – tRNA
 – snRNA (small nuclear RNA)
 – snoRNA (small nucleolar RNA)

3 Major Categories of Information Used in Gene Finding Programs
• Signals/features – a sequence pattern with functional significance, e.g., splice donor and acceptor sites, start and stop codons, promoter features such as TATA boxes, TF binding sites, CpG islands
• Content/composition – statistical properties of coding vs. non-coding regions, e.g., codon bias; length of ORFs in prokaryotes; GC content
• Similarity – compare the DNA sequence to known sequences in databases
 – Not only known proteins but also ESTs, cDNAs
Gene Structure
[Figure: overview of gene structure]

Prokaryotic Gene Structure
• 5' → 3': promoter region (maybe), ribosome binding site (maybe), open reading frame from start codon to stop codon, termination sequence (maybe)

In Prokaryotic Genomes
• We usually start by looking for an ORF
 – A start codon, followed by (usually) at least 60 amino acid codons before a stop codon occurs
 – Or by searching for similarity to a known ORF
• Look for basal signals
 – Transcription (the promoter consensus and the termination consensus)
 – Translation (ribosome binding site: the Shine-Dalgarno sequence)
• Look for differences in sequence content between coding and non-coding DNA
 – GC content and codon bias

Gene Finding in Bacterial Genomes
• Advantages
 – Simple gene structure
  • Small genomes (0.5 to 10 million bp)
  • No introns
 – Dense genomes
  • High coding density (>90%)
  • Short intergenic regions
 – Conserved signals
 – Abundant comparative information
  • Complete genomes available for many species
 – Uninterrupted ORFs
• Disadvantages
 – Some genes overlap (nested)
 – Some genes are quite short (<60 bp)

Open Reading Frame (ORF)
• Any stretch of DNA that potentially encodes a protein
• The identification of an ORF is the first indication that a segment of DNA may be part of a functional gene

Open Reading Frames
A C G T A A C T G A C T A G G T G A A T
 Frame 1: ACG TAA CTG ACT AGG TGA AT
 Frame 2: A CGT AAC TGA CTA GGT GAA T
 Frame 3: AC GTA ACT GAC TAG GTG AAT
Each grouping of the nucleotides into consecutive triplets constitutes a reading frame. A sequence of triplets that contains no stop codon is an Open Reading Frame (ORF).

ORFs as gene candidates
• An open reading frame that begins with a start codon (usually ATG, GTG or TTG, but this is species-dependent)
• Most prokaryotic genes code for proteins that are 60 or more amino acids in length
• The probability that a random sequence of n codons has no stop codons (UAA, UAG, UGA) is (61/64)^n
 – When n is 50, there is a probability of about 92% that the random sequence contains a stop codon
 – When n is 100, this probability exceeds 99%

Codon Bias
• The genetic code is degenerate
 – Equivalent triplet codons code for the same amino acid
• Codon usage varies
 – organism to organism
 – gene to gene
• Biological basis
 – Avoidance of codons similar to stop codons
 – Preference for codons that correspond to abundant tRNAs within the organism

Codon Bias: Gene Differences (glycine codon usage in two yeast genes)
 Codon     GAL4   ADH1
 Gly GGG   0.21   0
 Gly GGA   0.17   0
 Gly GGT   0.38   0.93
 Gly GGC   0.24   0.07

Codon Bias: Organism Differences
• Arginine: CGT, CGC, CGA, CGG, AGA, AGG
• Yeast genome: Arg specified by AGA 48% of the time (the other five equivalent codons ~10% each)
• Fruit fly genome: Arg specified by CGC 33% of the time (the other five ~13% each)
• The complete set of codon usage biases can be found at: http://www.kazusa.or.jp/codon/

GC Content
• GC content relative to AT is a distinguishing factor of bacterial genomes
• Varies dramatically across species
 – Serves as a means to identify bacterial species
• For various biological reasons
 – Mutational bias of particular DNA polymerases
 – DNA repair mechanisms
 – Horizontal gene transfer (transformation, transduction, conjugation)

GC Content
• GC content may be different in recently acquired genes than elsewhere
• This can lead to variations in the frequency of codon usage within coding regions
 – There may be significant differences in codon bias within different genes of a single bacterium's genome (these statistics are computed in the sketch below)
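The composition statistics above are straightforward to compute directly. A minimal Python sketch, assuming a toy in-frame coding sequence (the sequence below is illustrative, not a real gene); it also checks the (61/64)^n stop-codon probability from the ORF slide:

```python
from collections import Counter

def gc_content(seq):
    """Fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def codon_usage(cds, codons):
    """Relative usage of a set of synonymous codons in an in-frame CDS."""
    counts = Counter(cds[i:i+3] for i in range(0, len(cds) - 2, 3))
    total = sum(counts[c] for c in codons)
    return {c: counts[c] / total if total else 0.0 for c in codons}

cds = "ATGGGTGGTGGCGGTGGAGGTTAA"  # illustrative coding sequence
print(gc_content(cds))
print(codon_usage(cds, ["GGG", "GGA", "GGT", "GGC"]))  # glycine codons

# Probability that a random run of n codons contains a stop codon: 1 - (61/64)**n
for n in (50, 100):
    print(n, 1 - (61/64) ** n)  # ~0.91 for n=50, >0.99 for n=100
```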
Ribosome Binding Sites
• The RBS is also known as a Shine-Dalgarno sequence (species-dependent); it should bind well with the 3' end of the 16S rRNA (part of the ribosome)
• Usually found within 4–18 nucleotides of the start codon of a true gene

Shine-Dalgarno Sequence
• The Shine-Dalgarno sequence is a nucleotide sequence (consensus = AGGAGG) that is present in the 5'-untranslated region of prokaryotic mRNAs.
• This sequence serves as a binding site for ribosomes and is thought to influence the reading frame.
• If a subsequence aligning well with the Shine-Dalgarno sequence is found within 4–18 nucleotides of an ORF's start codon, that improves the ORF's candidacy.

Bacterial Promoter
• -35 box: T(82) T(84) G(78) A(65) C(54) A(45) ... 16–18 bp spacer ... -10 box: T(80) A(95) T(45) A(60) A(50) T(96) ... +1 transcription start (A or G)
 – the subscripts give the percentage occurrence of each base in the consensus
• Not so simple: remember, these are consensus sequences

Eukaryotic Gene Structure
[Figure: eukaryotic gene structure]

Genes and Signals
[Figure: genes and their associated signals]

The Complicating Factors in Eukaryotes
• Interrupted genes (split genes): introns and exons
• Large genomes
• Most DNA is non-coding: introns, regulatory regions, "junk" DNA (unknown function); only about 3% coding
• Complex regulation of gene expression: regulatory sequences may be far away from the start codon

Some numbers to consider
• Vertebrate genes average about 30 kb long – this varies a lot
• The coding region is only about 1–2 kb
• Exon sizes and numbers vary a lot
 – The average is 6 exons, each about 150 bp long
• An average 5' UTR is about 750 bp
• An average 3' UTR is about 450 bp (both can be much longer)
• There are huge deviations from all of these numbers, e.g., dystrophin is 2.4 Mb long; the factor VIII gene has 26 exons, with introns up to 32 kb (one intron produces 2 transcripts unrelated to the gene!)
 – There are genes without introns: called single-exon or intronless genes

Given a long eukaryotic DNA sequence:
• How would you determine if it had a gene?
• How would you determine which substrings of the sequence contained protein-coding regions?

So, what's the problem with looking for ORFs?
"Split" genes make it difficult to define ORFs
• Where are the starts and stops?
• What problems do introns introduce?
• What would you predict for the size of ORFs?

Most Programs Concentrate on Finding Exons
• Exon: the region of DNA within a gene that codes for a polypeptide chain or domain
• Intron: non-coding sequences found within structural genes

Splice Sites Used to Define Exons
• Splice donor (exon-intron boundary) and splice acceptor (intron-exon boundary)
• Common sequence motifs (see the naive scan sketched below)
 – (C or A)AG / GT(A or G)AGT — the "donor" splice site (/ marks the exon/intron boundary)
 – (T or C)n N (C or T)AG / G — the "acceptor" splice site

Gene finding programs look for different types of exon
• Single-exon genes: begin with a start codon and end with a stop codon
• Initial exons: begin with a start codon and end with a donor site
• Internal exons: begin with an acceptor and end with a donor
• Terminal exons: begin with an acceptor and end with a stop codon
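A naive scan for the invariant GT/AG dinucleotides is easy to write, though, as the next slide stresses, almost all such occurrences are not real splice sites; real gene finders score candidates with statistical profiles. A minimal sketch (the toy sequence and the 60 bp minimum intron length are assumptions for illustration):

```python
import re

def candidate_introns(seq, min_len=60):
    """Naive GT...AG intron candidates; almost all are false positives."""
    donors = [m.start() for m in re.finditer("GT", seq)]
    acceptors = [m.start() for m in re.finditer("AG", seq)]
    return [(d, a + 2) for d in donors for a in acceptors
            if a + 2 - d >= min_len]

seq = "CAGGTAAGT" + "T" * 80 + "TTTCAGGCC"  # toy sequence with one planted intron
print(candidate_introns(seq))  # [(3, 95), (7, 95)]
```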
How are correct splice sites identified?
• There are many occurrences of GT or AG within introns that are not splice sites
• Statistical profiles of splice sites are used
http://www.lclark.edu/~lycan/Bio490/pptpresentations/mutation/sld016.htm

Other Biologically Important Signals Used in Gene Finding Programs
• Transcriptional signals
 – Transcription start: characterized by the cap signal
  • A single purine (A/G)
 – TATA box (promoter) at -25 relative to the start
 – Polyadenylation signal: AATAAA (3' end)
• Major caveat: not all genes have these signals
 – This makes it difficult to define the beginning and end of a gene

Upstream Promoter Sites
• Transcription factor (TF) sites
 – Transcription factors are sequence-specific DNA-binding proteins
 – They bind to consensus DNA sequences
 – e.g., the CAAT transcription factor and the CAAT box
• There are many of these
 – They vary in sequence, location and interaction with other sites
 – This further complicates the problem of delineating a "gene"

Translation Signals
• Kozak sequence
 – The signal for initiation of translation in vertebrates
 – Consensus is GCCACCatgG
• And of course...
 – Translation stop codons

GC Content in Eukaryotes
• Overall GC content does not vary between species as it does in prokaryotes
• GC content is still important in gene finding algorithms
 – CpG islands

CpG Islands
• CpG stands for cytosine and guanine separated by a phosphate, which links the two nucleosides together in DNA.
 – CG dinucleotides are often written CpG to avoid confusion with the C-G base pair

CpG Islands
• In the eukaryotic genome, CpG occurs at a lower frequency than would be expected in purely random sequence (1/16).
 – The occurrence is related to methylation
 – Methylation of the C in CG turns it into 5-methylcytosine; following spontaneous deamination, 5-methylcytosine converts into thymine.
 – Methylation of C therefore makes CpG prone to mutation (e.g., to TpG or CpA). CpG sites thus tend to be eliminated from the genomes of eukaryotes

CpG Islands
• However, the start regions of many genes have a high concentration of CpG sites: CpG islands
 – Found at the promoters of eukaryotic genes
 – These CpG sites are unmethylated, so any spontaneous deamination of cytosine to uracil is recognized by the repair machinery and the CpG site is restored
 – A high occurrence of CpGs in many cases marks the existence of downstream genes and is frequently used in genome annotation as an indicator of gene density

Gene Finding by Computational Methods
• Dependent on good experimental data to build reliable predictive models
• Various aspects of gene structure/function provide the information used in gene finding programs

Computational Gene Finding Approaches
1) Rule-based (e.g., start and stop codons)
2) Content-based (e.g., codon bias, promoter sites)
3) Similarity-based (e.g., orthologs)
4) Pattern-based (e.g., machine learning: neural networks, HMMs)

Simple rule-based gene finding in prokaryotes, based on ORFs (implemented in the sketch below)
• Look for a putative start codon (ATG)
• Staying in the same frame, scan in groups of three until a stop codon is found
• If the number of codons >= 50, assume it's a gene
• If the number of codons < 50, go back to the last start codon, increment by 1 and start again
• At the end of the chromosome, repeat the process for the reverse complement
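A direct transcription of this rule set into Python might look like the following sketch (the 50-codon threshold and ATG-only starts follow the slide; nested ATGs produce overlapping candidates, which is fine for a sketch):

```python
def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_orfs(seq, min_codons=50):
    """Rule-based ORF scan: ATG start, in-frame stop, >= min_codons codons."""
    stops = {"TAA", "TAG", "TGA"}
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for i in range(len(s) - 2):           # slide forward one base at a time
            if s[i:i+3] != "ATG":
                continue
            for j in range(i + 3, len(s) - 2, 3):  # stay in the same frame
                if s[j:j+3] in stops:
                    if (j - i) // 3 >= min_codons:
                        orfs.append((strand, i, j + 3))
                    break
    return orfs

seq = "ATG" + "GCT" * 60 + "TAA"  # toy 60-codon ORF
print(find_orfs(seq))             # [('+', 0, 186)]
```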
Example ORF
[Figure: example ORF]

Problems with rule-based approaches
• Advantages
 – Simple and fairly sensitive (>50%)
• Disadvantages
 – Prokaryotic genes are not always so simple to find
 – ATG is not the only possible start site (e.g., CTG, TTG – class I alternates)
 – Small genes tend to be overlooked and long ones over-predicted
• Solution? Use additional information to increase confidence in the predictions

Content-based approaches
• Key prokaryotic gene features
 – RNA polymerase promoter site (-10, -35 site or TATA box)
 – Shine-Dalgarno sequence (+10; ribosome binding site) to initiate protein translation
 – Codon biases
 – High GC content
 – Stem-loop (rho-independent) terminators

Content-based approaches
• Key eukaryotic gene features
 – CpG islands
  • More abundant near the gene start site
  • High GC content in the 5' ends of genes
 – Codon bias
  • Some codons are strongly preferred in coding regions, others are not
 – Hexamers
  • Dicodon frequencies are informative – physical constraints favor certain adjacent amino acids over others
 – Positional bias
  • The 3rd base tends to be G/C rich in coding regions

Content-based recognition
• Advantages:
 – Increases accuracy over rule-based methods
• Disadvantages:
 – The features are degenerate
 – The features are not always present

Homology-Based Approaches in Eukaryotic Genomes
• More complicated than prokaryotes due to split genes
• Genome sequence → first identify all candidate exons
• Use a spliced alignment algorithm to explore all possible exon assemblies and compare them to known sequences
 – e.g., Procrustes
• Limitations:
 – There must be a similar sequence in the database with known exon structure
 – Sensitive to frameshift errors

Gene Finding using Comparative Genomics
• Purifying selection
 – Conserved regions between two genomes are presumably useful, or else they would have diverged.
• If the genomes are too close in the phylogenetic tree, there may be too much noise.
• If the genomes are too far apart, regions can be missed.

UCSC Browser
[Figure: UCSC Genome Browser screenshot]

Gene Prediction using sequence similarities
• Genomescan incorporates a similarity-based method by adding a blastX component to its prediction algorithm, using the translated sequence to search a protein database.
 http://genes.mit.edu/genomescan/
• "TWINSCAN is a gene prediction system that models both gene structure and evolutionary conservation. The scores of features like splice sites and coding regions are modified using the patterns of divergence between the target genome and a closely related genome."
 http://genes.cs.wustl.edu/

Neural Networks – Grail
• Sensors are trained using a set of known genes in the organism.
• GrailExp incorporates a similarity-based method by adding a blastn component to its prediction algorithm. It runs reliably on unmasked sequences.
• The sensors are:
 – Frame bias matrix – uses the codon bias to determine the correct frame.
 – Fickett – named after Fickett, who originally used properties such as 3-periodicity and overall base composition to predict genes.
 – Coding 6-tuple word preference – frequency of 6-tuple words in coding regions (see the sketch below).
 – Coding 6-tuple in-frame preference – the 6-tuple composition is evaluated for the 3 frames and the one with the best score is used.
 – Repetitive 6-tuple word preference – 6-tuple statistics in repetitive elements; this identifies regions where coding is not expected.
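A content sensor of the "6-tuple word preference" kind boils down to a log-likelihood ratio of hexamer frequencies estimated from coding and non-coding training sequence. A minimal sketch, assuming two placeholder training strings where a real system would use large curated training sets:

```python
from collections import Counter
from math import log

def hexamer_model(training_seq):
    """Hexamer counts plus total, for add-one-smoothed frequencies."""
    counts = Counter(training_seq[i:i+6] for i in range(len(training_seq) - 5))
    return counts, sum(counts.values())

def coding_score(window, coding, noncoding):
    """Sum of hexamer log-likelihood ratios; > 0 suggests coding."""
    (cc, ct), (nc, nt) = coding, noncoding
    score = 0.0
    for i in range(len(window) - 5):
        w = window[i:i+6]  # add-one smoothing over the 4^6 = 4096 possible words
        score += log(((cc[w] + 1) / (ct + 4096)) / ((nc[w] + 1) / (nt + 4096)))
    return score

coding = hexamer_model("ATGGCTGCTAAAGCTGGTGAAATCGCT" * 20)     # placeholder CDS
noncoding = hexamer_model("TTTTATATATTTAAATATTTATTTATAT" * 20)  # placeholder intergenic
print(coding_score("GCTGCTAAAGCTGGT", coding, noncoding))
```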
Neural Network Training Set
• Training examples are short sequence windows (e.g., from ACGAAG, AGGAAG, AGCAAG, ACGAAA, AGCAAC), each paired with a desired output label: E = exon position, N = non-exon, e.g., EEEENN
• Each nucleotide in the sliding window is one-hot encoded (A = [0 0 1], C = [0 1 0], G = [1 0 0]), and the labels are encoded as E = [0 1], N = [0 0]
• Concatenating the window's encodings gives an input vector such as [0 1 0 1 0 0 0 0 1] with output vector [0 1]

Neural Network Training
• The input vector is multiplied by weight matrix 1 to give the hidden layer, which is passed through the sigmoid 1/(1 + e^-x); the hidden layer is multiplied by weight matrix 2 to give the output vector
• The computed output (e.g., [0.24 0.74]) is compared with the desired output (e.g., [0 1])
[Figure: forward pass through a 9-input, 3-hidden, 2-output network with example weight matrices]

Back Propagation
• The error between the computed and desired outputs is propagated backwards and the entries of both weight matrices are adjusted
[Figure: weight updates on the same example network]

Calculate New Output
• With the updated weights, the same input now produces an output closer to the target (e.g., [0.16 0.91] vs. target [0 1]): converged for this example
[Figure: recomputed forward pass]

Train on Second Input Vector
• The next training window (input vector [1 0 0 0 0 1 0 0 1]) is presented and the forward-pass/backpropagation cycle repeats
[Figure: forward pass and comparison for the second example]

After Many Iterations...
• Training ends with two "generalized" weight matrices that encode what the network has learned about exon-like windows
[Figure: the two final weight matrices]

Neural Networks
• Prediction: a new pattern (e.g., ACGAGG) is encoded, run through layer 1, the hidden layer and the output layer using the trained matrices, and the output is decoded into a label string such as EEEENN
[Figure: prediction pipeline through the trained network]
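The walkthrough above corresponds to a tiny 9-input, 3-hidden, 2-output network. A minimal NumPy sketch of the same forward-pass/backpropagation loop (the weights here are random, not the slides' example matrices; note that encoding the 3-nt window CGA reproduces the slides' input vector [0 1 0 1 0 0 0 0 1]):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(3, 9))   # weight matrix 1: 9 inputs -> 3 hidden
W2 = rng.normal(scale=0.5, size=(2, 3))   # weight matrix 2: 3 hidden -> 2 outputs

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

onehot = {"A": [0, 0, 1], "C": [0, 1, 0], "G": [1, 0, 0]}  # encoding from the slide

def encode(window):
    return np.array([bit for nt in window for bit in onehot[nt]], dtype=float)

x = encode("CGA")              # -> [0 1 0 1 0 0 0 0 1]
target = np.array([0.0, 1.0])  # desired output, the label E

for step in range(1000):       # gradient descent on squared error
    h = sigmoid(W1 @ x)
    y = sigmoid(W2 @ h)
    delta_out = (y - target) * y * (1 - y)        # output-layer error term
    delta_hid = (W2.T @ delta_out) * h * (1 - h)  # backpropagated hidden error
    W2 -= 0.5 * np.outer(delta_out, h)
    W1 -= 0.5 * np.outer(delta_hid, x)

print(sigmoid(W2 @ sigmoid(W1 @ x)))  # converges toward [0, 1]
```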
Hidden Markov Models
• In general, sequences are not monolithic, but can be made up of discrete segments
• Hidden Markov Models (HMMs) allow us to model complex sequences in which the character emission probabilities depend upon the state
• Think of an HMM as a probabilistic or stochastic sequence generator; what is hidden is the current state of the model

Markov Models
• A Markov process is a process that moves from state to state depending (only) on the previous n states.
• Weather example with states Sunny, Cloudy, Rainy:
 – Initial probabilities: π = (Sunny 0.6, Cloudy 0.3, Rainy 0.1)
 – A 3×3 transition matrix A gives P(weather today | weather yesterday); e.g., from Sunny: P(Sunny) = 0.5, P(Cloudy) = 0.25, P(Rainy) = 0.25, and from Cloudy: P(Rainy) = 0.375
[Figure: weather state diagram with the full transition matrix]

Example (implemented in the sketch below):
 P(Sunny, Sunny, Cloudy, Rainy | Model)
  = π(Sunny) · P(Sunny|Sunny) · P(Cloudy|Sunny) · P(Rainy|Cloudy)
  = 0.6 · 0.5 · 0.25 · 0.375
  = 0.0281
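The chain rule behind this example is a one-liner to implement. A minimal sketch using only the transition probabilities that appear in the slide's worked example (entries not shown on the slide are omitted):

```python
pi = {"Sunny": 0.6, "Cloudy": 0.3, "Rainy": 0.1}
# P(today | yesterday); only the entries used in the worked example
A = {("Sunny", "Sunny"): 0.5,
     ("Sunny", "Cloudy"): 0.25,
     ("Cloudy", "Rainy"): 0.375}

def chain_probability(states):
    """P(q1..qT) = pi(q1) * prod of A(q_{t-1} -> q_t), first-order Markov chain."""
    p = pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= A[(prev, cur)]
    return p

print(chain_probability(["Sunny", "Sunny", "Cloudy", "Rainy"]))  # 0.028125
```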
HMM emission probabilities
• In the urn-and-ball example, each hidden state (#1, #2, #3) has its own emission distribution B_i over the observable colors: e.g., state #1 emits Yellow/Red/Green/Blue with probability 0.25 each, while state #3 emits Red with probability 0.65 and never emits Green
• The state transition probabilities (row = state at turn i, column = state at turn i+1) are
  #1: 0.1 0.7 0.2
  #2: 0.4 0.2 0.4
  #3: 0.2 0.3 0.5
 with initial state probabilities (0.6, 0.3, 0.1)
[Figure: urn-and-ball HMM with emission and transition probabilities]

Elements of an HMM
• An HMM is characterized by the following:
 1. N, the number of states in the model
 2. M, the number of distinct observation symbols per state
 3. The state transition probability distribution A = {a_ij}, where a_ij = P[q_{t+1} = j | q_t = i], 1 ≤ i, j ≤ N
 4. The observation symbol probability distribution in state j, B = {b_j(v_k)}, where b_j(v_k) = P[o_t = v_k | q_t = j], 1 ≤ j ≤ N, 1 ≤ k ≤ M
 5. The initial state distribution π = {π_i}, where π_i = P[q_1 = i], 1 ≤ i ≤ N
• For convenience, we usually use the compact notation λ = (A, B, π) to indicate the complete parameter set of an HMM
 – This also requires specification of the two model sizes N and M

Two Major Assumptions for HMMs
• First-order Markov assumption
 – The state transition depends only on the origin and destination:
  P(Q | λ) = P(q_1, ..., q_T | λ) = P(q_1) ∏_{t=2..T} P(q_t | q_{t-1})
 – The state transition probability is time invariant: a_ij = P(q_{t+1} = j | q_t = i), 1 ≤ i, j ≤ N
• Output-independence assumption
 – Each observation depends only on the state that generates it, not on its neighbor observations:
  P(O | Q, λ) = P(o_1, ..., o_T | q_1, ..., q_T, λ) = ∏_{t=1..T} P(o_t | q_t, λ) = ∏_{t=1..T} b_{q_t}(o_t)

The Three Basic Problems of HMMs
• Problem 1 (evaluation): given an observation sequence O = O_1 O_2 ... O_T and a model M = (π, A, B), compute P(O | M) – for example, the probability of an observed sequence of ball colors.
 – A naive solution sums over all state sequences Q = q_1 q_2 ... q_T:
  P(Q | M) = π_{q_1} a_{q_1 q_2} a_{q_2 q_3} ... a_{q_{T-1} q_T}
  P(O | Q, M) = ∏_{t=1..T} P(O_t | q_t, M)
  P(O | M) = Σ_{all Q} P(O | Q, M) P(Q | M)
 – This costs O(N^T · T), where N is the number of states and T the number of observations
 – The forward algorithm computes the same quantity in O(N^2 T): for N = 5 and T = 100, roughly 10^72 operations for the naive solution vs. about 3000 for the forward algorithm
• Problem 2 (decoding): given O and the model, choose a corresponding state sequence q = q_1 q_2 ... q_T that best "explains" the observations (e.g., which urn was most probably used at each turn?)
• Problem 3 (learning): how do we adjust the model parameters π, A, B to maximize P(O | π, A, B)?

Solutions to the three problems
• Problem 1: how to efficiently compute P(O | λ)? (Evaluation problem)
 – Solution: forward algorithm, O(N^2 T)
• Problem 2: how to choose an optimal state sequence Q = (q_1, ..., q_T) that best explains the observations, Q* = argmax_Q P(Q, O | λ)? (Decoding problem)
 – Solution: Viterbi algorithm, O(N^2 T)
• Problem 3: how to adjust the model parameters λ = (A, B, π) to maximize P(O | λ)? (Learning/training problem)
 – Solution: Baum-Welch re-estimation formulas

Solution to Problem 1 – The Forward Procedure
• Based on the HMM assumptions, the calculation of P(q_t | q_{t-1}, λ) and P(o_t | q_t, λ) involves only q_{t-1}, q_t and o_t, so P(O | λ) can be computed by recursion on t
• Forward variable: α_t(i) = P(o_1, o_2, ..., o_t, q_t = i | λ)
 – the probability of the joint event that o_1, ..., o_t are observed and the state at time t is i, given the model λ
• Recursion: α_{t+1}(j) = P(o_1, ..., o_t, o_{t+1}, q_{t+1} = j | λ) = [Σ_{i=1..N} α_t(i) a_ij] b_j(o_{t+1})
 – Derivation sketch: by output independence, P(o_1..o_{t+1}, q_{t+1} = j | λ) = P(o_1..o_t, q_{t+1} = j | λ) · b_j(o_{t+1}); by the first-order Markov assumption, P(o_1..o_t, q_{t+1} = j | λ) = Σ_i P(o_1..o_t, q_t = i | λ) P(q_{t+1} = j | q_t = i) = Σ_i α_t(i) a_ij
• Trellis view: e.g., α_3(2) = P(o_1, o_2, o_3, q_3 = 2 | λ) = [α_2(1) a_12 + α_2(2) a_22 + α_2(3) a_32] b_2(o_3)
[Figure: trellis (lattice) of states S1..S3 over times 1..T showing the forward recursion]

• Algorithm
 1. Initialization: α_1(i) = π_i b_i(o_1), 1 ≤ i ≤ N
 2. Induction: α_{t+1}(j) = [Σ_{i=1..N} α_t(i) a_ij] b_j(o_{t+1}), 1 ≤ t ≤ T-1, 1 ≤ j ≤ N
 3. Termination: P(O | λ) = Σ_{i=1..N} α_T(i)
• Complexity: O(N^2 T) – about N(N+1)(T-1)+N ≈ N^2 T multiplications and (N-1)N(T-1) ≈ N^2 T additions
• Based on the lattice (trellis) structure: computed in a time-synchronous fashion from left to right, where each cell for time t is completely computed before proceeding to time t+1; all state sequences, regardless of how long previously, merge into the N states at each time instance t

Example: a three-state HMM for the Dow Jones Industrial Average (Huang et al., 2001), implemented below
• With π = (0.5, 0.2, 0.3) and b_1(up) = 0.7, b_2(up) = 0.1, b_3(up) = 0.3:
 α_1(1) = 0.5 × 0.7, α_1(2) = 0.2 × 0.1, α_1(3) = 0.3 × 0.3
 α_2(1) = (0.35 × 0.6 + 0.02 × 0.5 + 0.09 × 0.4) × 0.7, using a_11 = 0.6, a_21 = 0.5, a_31 = 0.4
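A direct implementation of the three steps, using the Dow Jones numbers from the slide. Only the first transition column (a11 = 0.6, a21 = 0.5, a31 = 0.4) appears on the slide, so the remaining entries below are assumptions chosen to make the rows sum to 1:

```python
def forward(pi, A, B, obs):
    """Forward algorithm: returns P(obs | model) in O(N^2 T)."""
    N = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]           # initialization
    for o in obs[1:]:                                          # induction
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][o]
                 for j in range(N)]
    return sum(alpha)                                          # termination

pi = [0.5, 0.2, 0.3]
A = [[0.6, 0.2, 0.2],   # a11 = 0.6 from the slide; other entries assumed
     [0.5, 0.3, 0.2],   # a21 = 0.5 from the slide
     [0.4, 0.1, 0.5]]   # a31 = 0.4 from the slide
B = [{"up": 0.7}, {"up": 0.1}, {"up": 0.3}]  # emission probabilities for "up"

print(forward(pi, A, B, ["up", "up"]))  # alpha_1 = [0.35, 0.02, 0.09] as on the slide
```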
Solution to Problem 2 – The Viterbi Algorithm
• The Viterbi algorithm can be regarded as the dynamic programming algorithm applied to the HMM, or as a modified forward algorithm
 – Instead of summing up probabilities from different paths coming to the same destination state, the Viterbi algorithm picks and remembers the best path
• It finds a single optimal state sequence Q = (q_1, q_2, ..., q_T)
• The Viterbi algorithm can also be illustrated in a trellis framework similar to the one for the forward algorithm
[Figure: Viterbi trellis over states S1..S3 and times 1..T]

• Algorithm (implemented in the sketch below)
 1. Initialization: δ_1(i) = π_i b_i(o_1), ψ_1(i) = 0, 1 ≤ i ≤ N
 2. Induction: δ_{t+1}(j) = max_{1≤i≤N} [δ_t(i) a_ij] · b_j(o_{t+1}); ψ_{t+1}(j) = argmax_{1≤i≤N} [δ_t(i) a_ij], 1 ≤ t ≤ T-1, 1 ≤ j ≤ N
 3. Termination: P*(O | λ) = max_{1≤i≤N} δ_T(i); q*_T = argmax_{1≤i≤N} δ_T(i)
 4. Backtracking: q*_t = ψ_{t+1}(q*_{t+1}), t = T-1, T-2, ..., 1; Q* = (q*_1, q*_2, ..., q*_T) is the best state sequence
• Complexity: O(N^2 T)

Example: the three-state Dow Jones HMM again (Huang et al., 2001)
• δ_1(1) = 0.5 × 0.7, δ_1(2) = 0.2 × 0.1, δ_1(3) = 0.3 × 0.3
• δ_2(1) = max(0.35 × 0.6, 0.02 × 0.5, 0.09 × 0.4) × 0.7 = 0.35 × 0.6 × 0.7 = 0.147; ψ_2(1) = 1
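The same model parameters as in the forward sketch (including the assumed transition rows) can be decoded with a direct implementation of the four Viterbi steps:

```python
def viterbi(pi, A, B, obs):
    """Viterbi decoding: most probable state path and its probability."""
    N = len(pi)
    delta = [pi[i] * B[i][obs[0]] for i in range(N)]    # initialization
    back = []                                           # backpointers psi
    for o in obs[1:]:                                   # induction
        prev = delta
        psi = [max(range(N), key=lambda i: prev[i] * A[i][j]) for j in range(N)]
        delta = [prev[psi[j]] * A[psi[j]][j] * B[j][o] for j in range(N)]
        back.append(psi)
    q = max(range(N), key=lambda i: delta[i])           # termination
    path = [q]
    for psi in reversed(back):                          # backtracking
        q = psi[q]
        path.append(q)
    return list(reversed(path)), max(delta)

pi = [0.5, 0.2, 0.3]
A = [[0.6, 0.2, 0.2], [0.5, 0.3, 0.2], [0.4, 0.1, 0.5]]  # same assumed matrix
B = [{"up": 0.7}, {"up": 0.1}, {"up": 0.3}]
print(viterbi(pi, A, B, ["up", "up"]))  # delta_2(1) = 0.147 as on the slide
```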
Solution to Problem 3 – The Baum-Welch Algorithm
• How do we adjust (re-estimate) the model parameters λ = (A, B, π) to maximize P(O | λ)?
 – This is the most difficult of the three problems, because there is no known analytical method that maximizes the joint probability of the training data in closed form
 – The data is incomplete because of the hidden state sequence
 – The problem can be solved by the iterative Baum-Welch algorithm, also known as the forward-backward algorithm
 – The EM (Expectation-Maximization) algorithm is perfectly suitable for this problem

Baum-Welch Local Maximization
• 1st step: you determine
 – the number of hidden states, N
 – the emission (observation) alphabet
• 2nd step: randomly assign values to
 – A, the transition probabilities
 – B, the observation (emission) probabilities
 – π, the starting state probabilities
• 3rd step: let the machine re-estimate A, B, π

Solution to Problem 3 – The Backward Procedure
• Backward variable: β_t(i) = P(o_{t+1}, o_{t+2}, ..., o_T | q_t = i, λ)
 – the probability of the partial observation sequence o_{t+1}, ..., o_T, given state i at time t and the model
 – e.g., β_2(3) = P(o_3, o_4, ..., o_T | q_2 = 3, λ) = a_31 b_1(o_3) β_3(1) + a_32 b_2(o_3) β_3(2) + a_33 b_3(o_3) β_3(3)
[Figure: trellis showing the backward recursion]
• Algorithm
 1. Initialization: β_T(i) = 1, 1 ≤ i ≤ N
 2. Induction: β_t(i) = Σ_{j=1..N} a_ij b_j(o_{t+1}) β_{t+1}(j), 1 ≤ t ≤ T-1, 1 ≤ i ≤ N
• Complexity: about 2N^2(T-1) ≈ N^2 T multiplications and (N-1)N(T-1) ≈ N^2 T additions
• Note: P(O | λ) = Σ_{i=1..N} π_i b_i(o_1) β_1(i), cf. P(O | λ) = Σ_{i=1..N} α_T(i)

Solution to Problem 3 – The Forward-Backward Algorithm
• Relation between the forward and backward variables (Huang et al., 2001):
 α_t(i) = P(o_1 o_2 ... o_t, q_t = i | λ), with α_t(i) = [Σ_j α_{t-1}(j) a_ji] b_i(o_t)
 β_t(i) = P(o_{t+1} o_{t+2} ... o_T | q_t = i, λ), with β_t(i) = Σ_j a_ij b_j(o_{t+1}) β_{t+1}(j)
 α_t(i) β_t(i) = P(O, q_t = i | λ), and P(O | λ) = Σ_{i=1..N} α_t(i) β_t(i) for any t
 – Derivation: α_t(i) β_t(i) = P(o_1..o_t, q_t = i | λ) · P(o_{t+1}..o_T | q_t = i, λ) = P(o_1, ..., o_T, q_t = i | λ) = P(O, q_t = i | λ), using the output-independence assumption

Solution to Problem 3 – The Intuitive View
• Define two new variables:
 – γ_t(i) = P(q_t = i | O, λ): the probability of being in state i at time t, given O and λ
  γ_t(i) = P(O, q_t = i | λ) / P(O | λ) = α_t(i) β_t(i) / Σ_{i=1..N} α_t(i) β_t(i)
 – ξ_t(i, j) = P(q_t = i, q_{t+1} = j | O, λ): the probability of being in state i at time t and state j at time t+1, given O and λ
  ξ_t(i, j) = P(q_t = i, q_{t+1} = j, O | λ) / P(O | λ) = α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) / Σ_{m=1..N} Σ_{n=1..N} α_t(m) a_mn b_n(o_{t+1}) β_{t+1}(n)
 – and γ_t(i) = Σ_{j=1..N} ξ_t(i, j)
• Trellis view: P(q_3 = 3, O | λ) = α_3(3) β_3(3); P(q_3 = 3, q_4 = 1, O | λ) = α_3(3) a_31 b_1(o_4) β_4(1)
[Figure: trellis illustrations of these two products]
• Summing over time:
 – Σ_{t=1..T-1} ξ_t(i, j) = expected number of transitions from state i to state j in O
 – Σ_{t=1..T-1} γ_t(i) = expected number of transitions from state i in O

• Re-estimation formulas for π, A and B (one EM iteration is sketched below):
 – π̄_i = expected frequency (number of times) in state i at time t = 1: π̄_i = γ_1(i)
 – ā_ij = (expected number of transitions from state i to state j) / (expected number of transitions from state i) = Σ_{t=1..T-1} ξ_t(i, j) / Σ_{t=1..T-1} γ_t(i)
 – b̄_j(v_k) = (expected number of times in state j observing symbol v_k) / (expected number of times in state j) = Σ_{t: o_t = v_k} γ_t(j) / Σ_{t=1..T} γ_t(j)
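Putting the γ/ξ definitions and the re-estimation formulas together gives one EM iteration. A minimal sketch for a discrete-emission HMM (unscaled probabilities, so it is only suitable for short toy sequences; pi0/A0/B0 are arbitrary starting values, as in the "random initialization" step above):

```python
def forward_all(pi, A, B, obs):
    N = len(pi)
    alphas = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for o in obs[1:]:
        a = alphas[-1]
        alphas.append([sum(a[i] * A[i][j] for i in range(N)) * B[j][o]
                       for j in range(N)])
    return alphas

def backward_all(pi, A, B, obs):
    N, T = len(pi), len(obs)
    betas = [[1.0] * N]
    for t in range(T - 2, -1, -1):
        b = betas[0]
        betas.insert(0, [sum(A[i][j] * B[j][obs[t + 1]] * b[j] for j in range(N))
                         for i in range(N)])
    return betas

def baum_welch_step(pi, A, B, obs):
    """One EM iteration of the Baum-Welch re-estimation formulas."""
    N, T = len(pi), len(obs)
    al, be = forward_all(pi, A, B, obs), backward_all(pi, A, B, obs)
    PO = sum(al[-1])                                   # P(O | lambda)
    gamma = [[al[t][i] * be[t][i] / PO for i in range(N)] for t in range(T)]
    xi = [[[al[t][i] * A[i][j] * B[j][obs[t + 1]] * be[t + 1][j] / PO
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    new_pi = gamma[0]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(N)] for i in range(N)]
    new_B = [{v: sum(g[j] for g, o in zip(gamma, obs) if o == v) /
                 sum(g[j] for g in gamma) for v in set(obs)}
             for j in range(N)]
    return new_pi, new_A, new_B

pi0 = [0.5, 0.5]
A0 = [[0.7, 0.3], [0.4, 0.6]]
B0 = [{"up": 0.6, "down": 0.4}, {"up": 0.2, "down": 0.8}]
print(baum_welch_step(pi0, A0, B0, ["up", "up", "down"]))
```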
How is it connected to gene prediction?
• Replace the urns with biological states: the same HMM machinery applies with hidden states Exon, Intron and UTR, each with its own emission distribution over the DNA alphabet {A, C, G, T}
• The state transition matrix and initial probabilities keep the same form, e.g., initial probabilities (Exon 0.6, Intron 0.3, UTR 0.1)
[Figure: the urn HMM redrawn with Exon/Intron/UTR states emitting nucleotides]

GENSCAN (Chris Burge, 1997)
• Hidden states model the components of a eukaryotic gene: promoter, 5' UTR, initial exon (Einit), internal exons E0, E1, E2 and introns I0, I1, I2 (indexed by codon phase), terminal exon (Eterm), single-exon genes, 3' UTR, poly-A signal and the intergenic region; GT and AG mark the intron boundaries
[Figure: GENSCAN state diagram]

GENSCAN components
• Sequence-generating models for each state type (signal models such as P1 P2 P3 P4, composition models for exons, introns and UTRs)
• A set of length distributions f1, f2, f3, f4 for the different state types, e.g., f_intron(10) = 0, f_intron(350) = 0.03
[Figure: GENSCAN submodels and length distributions]

How do we use all that for gene prediction?
Definitions: for a fixed sequence length L we define:
 – Φ_L: the set of all possible parses of length L
 – S_L: the set of all possible DNA sequences of length L
 – Ω_L = Φ_L × S_L
Our model M is a probability measure on this space: it assigns a probability density to each parse/sequence pair.

Or in other words...
Given a sequence S = ACGCGACTAGGCGCAGGTCTA...GAT and a parse Φ_i (e.g., Exon0, Intron0, Exon0, Intron1, Exon1, 3'UTR), we can calculate P(S, Φ_i):
 P(S, Φ_i) = π_{q1} f_{q1}(d_1) P_{q1}(s_1) · A_{q1→q2} f_{q2}(d_2) P_{q2}(s_2) · ... · A_{q(k-1)→qk} f_{qk}(d_k) P_{qk}(s_k)
where q_1...q_k are the states of the parse, d_i their durations (lengths), and s_i the sequence segments they generate.

The conditional probability of parse Φ_i given sequence S is:
 P(Φ_i | S) = P(Φ_i, S) / P(S) = P(Φ_i, S) / Σ_{j∈Φ_L} P(Φ_j, S)
Prediction: in order to parse a given sequence S (i.e., predict the genes in S), we find the parse with maximum likelihood, i.e., max P(Φ_i | S).

Splice site sequence generator
• Aligned donor-site examples (positions -5 to +6):
 C A C C G G T A A G T A
 C A C C T G T G A G T A
 C A C A G G T A A G T A
 C A C C G G T A A G T A
• What is the probability of generating the signal O_-5 O_-4 ... O_6?
• WMM – Weight Matrix Method: a position-specific table of nucleotide percentages, estimated from many aligned true sites (partial table as recovered from the slide; a scoring sketch follows below):
  position:  -4   -3   -2   -1    0  ...
  A%:        33   60    8    0    0  ...
  C%:        37   13    4    0    0  ...
  G%:        18   14   81  100    0  ...
  T%:        12   13    7    0  100  ...
• What about adjacent-nucleotide dependencies? WAM – Weight Array Model: the conditional probability of generating nucleotide X_k at position i given nucleotide X_j at position i-1
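A WMM scores a candidate signal by multiplying the position-specific probabilities; in practice one sums log-odds against a background. A minimal sketch covering only four positions of the table above (the 100% entry is smoothed to 0.997 to avoid log(0); real GENSCAN matrices span positions -5..6 and are estimated from thousands of true donor sites):

```python
from math import log

# Position-specific probabilities P(base | position), from the partial table
wmm = [
    {"A": 0.33, "C": 0.37, "G": 0.18, "T": 0.12},
    {"A": 0.60, "C": 0.13, "G": 0.14, "T": 0.13},
    {"A": 0.08, "C": 0.04, "G": 0.81, "T": 0.07},
    {"A": 0.001, "C": 0.001, "G": 0.997, "T": 0.001},  # near-invariant G
]

def wmm_score(site, background=0.25):
    """Log-odds of generating `site` under the WMM vs. a uniform background."""
    return sum(log(col[b] / background) for col, b in zip(wmm, site))

print(wmm_score("AAGG"))  # high-scoring candidate
print(wmm_score("TTTT"))  # low-scoring candidate
```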
What about non-adjacent nucleotide dependencies?
Procedure: MDD – Maximal Dependency Decomposition

MDD – Maximal Dependency Decomposition
Given a data set D consisting of N aligned sequences of length k (e.g., the donor-site examples above):
1. Align the sequences.
2. Find C_i, the consensus nucleotide at position i; let N_Ci be the number of sequences containing C_i.
3. For each pair of positions (i, j), i ≠ j, calculate the χ² statistic for C_i vs. the nucleotide indicator X_j:
 χ² = Σ (O − E)² / E
 where, for each nucleotide at position j, O is the observed count among sequences matching C_i and E is the expected count (e.g., (%A in D) × N_Ci).
4. Calculate S_i, the sum of row i of the (i, j) table – a measure of the dependence between C_i and the nucleotides at the remaining positions.
5. If no stop condition holds, choose the position i with maximal S_i, partition D on it, and recurse. Stop conditions:
 1. level k−1 of the tree is reached
 2. no significant dependencies are found
 3. the number of remaining sequences is too small

Not mentioned here (but also part of GENSCAN and similar programs)
• Reverse-strand states
• C+G% effects
• Coding/non-coding detection
• Branch point detection
• Expected vs. observed AG composition (AG is depleted just upstream of the acceptor site in coding regions)
• And more...
[Figure: expected vs. observed percentage of AG near the acceptor site, positions -100 to +20]

Evaluating prediction programs
• Compare predicted genes against actual (known) genes; classify positions as TP, FP, TN or FN
• Sensitivity / recall: how many of the known genes were found?
• Specificity / precision: how many of the predicted genes were real?
• Correlation / F-measure: how good is it overall?

Evaluating prediction programs (the formulas are implemented in the sketch below)
• Sensitivity: Sn = TP / (TP + FN)
• Specificity: Sp = TP / (TP + FP)
• F-measure: F = (Sn + Sp) / 2
• Correlation coefficient:
 CC = (TP·TN − FP·FN) / [(TP+FP)(TN+FN)(TP+FN)(TN+FP)]^0.5
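A minimal implementation of these nucleotide-level measures as defined on the slide (the counts in the example call are illustrative):

```python
from math import sqrt

def evaluation_metrics(tp, fp, tn, fn):
    """Nucleotide-level accuracy measures as defined on the slide."""
    sn = tp / (tp + fn)              # sensitivity / recall
    sp = tp / (tp + fp)              # specificity / precision
    f = (sn + sp) / 2                # the slide's F-measure (simple average)
    cc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tn + fn) * (tp + fn) * (tn + fp))
    return {"Sn": sn, "Sp": sp, "F": f, "CC": cc}

print(evaluation_metrics(tp=80, fp=20, tn=880, fn=20))  # illustrative counts
```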
Gene Prediction Accuracy at the Exon Level
• Compare predicted exons with actual exons: each exon is correct, wrong or missing
• Sensitivity: Sn = (number of correct exons) / (number of actual exons)
• Specificity: Sp = (number of correct exons) / (number of predicted exons)

Gene finders – a comparison (GENSCAN performance data, http://genes.mit.edu/Accuracy.html)

              Accuracy per nucleotide | Accuracy per exon
 Method        Sn    Sp    AC         | Sn    Sp    (Sn+Sp)/2  ME    WE
 GENSCAN       0.93  0.93  0.91       | 0.78  0.81  0.80       0.09  0.05
 FGENEH        0.77  0.85  0.78       | 0.61  0.61  0.61       0.15  0.11
 GeneID        0.63  0.81  0.67       | 0.44  0.45  0.45       0.28  0.24
 GeneParser2   0.66  0.79  0.66       | 0.35  0.39  0.37       0.29  0.17
 GenLang       0.72  0.75  0.69       | 0.50  0.49  0.50       0.21  0.21
 GRAILII       0.72  0.84  0.75       | 0.36  0.41  0.38       0.25  0.10
 SORFIND       0.71  0.85  0.73       | 0.42  0.47  0.45       0.24  0.14
 Xpound        0.61  0.82  0.68       | 0.15  0.17  0.16       0.32  0.13

 Sn = Sensitivity; Sp = Specificity; AC = Approximate Correlation; ME = Missing Exons; WE = Wrong Exons

Gene finder comparison (cont.)
"Evaluation of gene finding programs", S. Rogic, A. K. Mackworth and B. F. F. Ouellette, Genome Research, 11: 817-832 (2001).

After putative genes are found, they're annotated
Annotation categories:
1. Matches a known protein sequence
2. Strong similarity to a protein sequence
3. Similar to a known protein
4. Similar to an unknown protein
5. Similar to an EST (i.e., putative protein)
6. No EST or protein matches (i.e., hypothetical protein)

Pitfalls and Issues
Several issues make the problem of eukaryotic gene finding extremely difficult.
1) Very long genes: for example, the largest human gene, the dystrophin gene, is composed of 79 exons spanning nearly 2.3 Mb.
2) Very long introns: again, in the human dystrophin gene, some introns are >100 kb long and >99% of the gene is composed of introns.
3) Very conserved introns (conserved non-coding sequences). This is particularly a problem when gene prediction is bolstered through similarity searches.
4) Very short exons: some exons are only 3 bp long in Arabidopsis genes. Such small exons are easily missed by all content sensors, especially if bordered by large introns. The more difficult cases are those where the length of a coding exon is a multiple of three (typically 3, 6 or 9 bp long), because missing such exons does not cause a problem in the exon assembly: they do not introduce any change in the frame.
5) Overlapping genes: though very rare in eukaryotic genomes, there are some documented cases in animals as well as in plants.
6) Polycistronic gene arrangement: also rare. One gene and one mRNA, but two or more proteins.
7) Frameshifts: some sequences stored in databases contain errors (either sequencing errors or simply errors made when editing the sequence) that introduce artificial frameshifts (deletion or insertion of one base). Such frameshifts greatly increase the difficulty of the computational gene finding problem by producing erroneous statistics and masking true solutions.
8) Introns in UTRs: there are genes for which the genomic region corresponding to the 5'- and/or 3'-UTR of the mature mRNA is interrupted by one or more introns.
9) Alternative transcription starts: e.g., three alternative promoters regulate the transcription of the 14 kb full-length dystrophin mRNAs, and four "intragenic" promoters control that of smaller isoforms.
10) Alternative splicing.
11) Alternative polyadenylation: about 20% of human transcripts show evidence of alternative polyadenylation, affecting where the 3' end is cleaved.
12) Alternative initiation of translation: finding the right AUG initiator is still a major concern for gene prediction methods. The rule stating that the first AUG in the mRNA is the initiator codon can be escaped through three mechanisms: context-dependent leaky scanning, re-initiation and direct internal initiation. A non-AUG triplet can sometimes act as the functional codon for translation initiation, such as ACG in Arabidopsis or CUG in human sequences.