Stochastic Models for Sequence Pattern Discovery

Mayetri Gupta
Department of Statistics
Harvard University

Richard Bayes (1596-1675), a great-grandfather of Thomas Bayes, was a successful cutler in Sheffield. In 1643 Richard Bayes served in the rotating position of Master of the Company of Cutlers of Hallamshire. Richard was sufficiently well off that he sent one of his sons, Samuel Bayes (1635-1681) to Trinity College Cambridge during the Commonwealth period; Samuel obtained his degree in 1656. Another son, Joshua Bayes (1638-1703) followed in his father’s footsteps in the cutlery industry, also serving as Master of the Company in 1679. Evidence of Joshua Bayes’s wealth comes from the size of his house, the fact that he employed a servant and the size of the taxes that he paid. His influence may be taken from his activities in the town government. Following the 1662 Act of Uniformity, Samuel Bayes was ejected from his parish, eventually living in Manchester (Matthews, 1934). Joshua Bayes was closely involved in the erection of one Nonconformist chapel in Sheffield and had two sons-in-law involved in another Sheffield chapel. The second son of Joshua Bayes (1638-1703) was another Joshua Bayes (1671-1746). In 1686 the younger Joshua Bayes entered a dissenting academy where he studied philosophy and divinity. Joshua Bayes and his wife Anne née Carpenter were married some time, probably within days, after their marriage license was issued on October 23, 1700. Joshua and Anne Bayes had seven children and Thomas was the eldest.

Ref: Thomas Bayes - a Biography to Celebrate the Tercentenary of his Birth, by Dr. D. R. Bellhouse

after removing punctuation....

richardbayesagreatgrandfatherofthomasbayeswasasuccessfulcutlerin
sheffieldinrichardbayesservedintherotatingpositionofmasterofthecompa
nyofcutlersofhallamshirerichardwassufficientlywelloffthathesentoneofhi
ssonssamuelbayestotrinitycollegecambridgeduringthecommonwealthp
eriodsamuelobtainedhisdegreeinanothersonjoshuabayesfollowedinhisf
athersfootstepsinthecutleryindustryalsoservingasmasterofthecompanyi
nevidenceofjoshuabayesswealthcomesfromthesizeofhishousethefactth
atheemployedaservantandthesizeofthetaxesthathepaidhisinfluencema
ybetakenfromhisactivitiesinthetowngovernmentfollowingtheactofunifor
mitysamuelbayeswasejectedfromhisparisheventuallylivinginmancheste
rmatthewsjoshuabayeswascloselyinvolvedintheerectionofonenonconfo
rmistchapelinsheffieldandhadtwosonsinlawinvolvedinanothersheffieldc
hapelthesecondsonofjoshuabayeswasanotherjoshuabayesintheyounge
rjoshuabayesenteredadissentingacademywherehestudiedphilosophya
nddivinityjoshuabayesandhiswifeanneneecarpenterweremarriedsometi
meprobablywithindaysaftertheirmarriagelicensewasissuedonoctober
joshuaandannebayeshadsevenchildrenandthomaswastheeldest
abcdefg → a    hijklmn → c    opqrstu → g    vwxyz → t

with an alphabet of 4 letters....
gcacagaaatagaagaagagacaaagcaggagcgcagaatagtagaggaaaggagcaggcagcc
gcaaacacaccgcacagaaataggagtaaccgcagggagccagggcgcgcgacaggaggagcaagcga
ctgaaggcagggacaccacgccgagcacagatagggaacacacgcttaccgaagcagcagacggcagacc
gggcggacgacaatagggggcccgtagccaaaaacagcaaaaggccagcaagccgctaacgcg
agcgagacgacgagaccaaccgaaagaaccacggcagggccggcgaaatagagccgtaaccccga
agcaggagggggaggccgcaaggcagtccaggggtacgggagtccaagcaggaggagcaagcgactc
catcaacaagacggcgaaataggtaacgcagcagaggcgcagctagaccgcgggagcaaaaggc
agcaacgcgtaaagagtacgacagcagctagagcagataggcagcagacaccgccacgacaaca
taagacacaggcccgaagctcgcagccgcaggtcagtagccacgagccgtccagcaaaggagccagg
ccgtgacgacaatagtagacaagaaaggcccggagcgcatacggacctcctccacccacacagga
gcaggcatgcggcgaaatagtagacggactcctgctaaccgcaagaagcgcgagcacgcagcag
gccggacagacccgcaaacacaacacaagtgggcgcccatcctgctaaccacggcaggcaaacacaa
cagacgcagaagcaggcgacggcgaaatagtagacggcagcggcgaaatagccgcatggcaa
gcggcgaaatagacgagaaaacggacgccaaaaaacttcagacagggacaagcccggggcta
caactcccgtcggcgaaatagacaccgtcaaaccacaaaaggacgagtagacaggcaaggcagc
cagggaaacttcgcccaatgaagaggcacgcaggcaaaccaacgatagcgggaagcgaggaag
cggcgaacaaccaaatagcaagatacacccagacacagcgcagtaggcaacaagg
joshua → cggcga    bayes → aatag
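The recoding used in the slides above (strip the text down to letters, then collapse the 26 letters into 4 groups) can be sketched in a few lines. This is a minimal illustration; the function names `strip_text` and `encode` are hypothetical, and treating "é" as "e" is an assumption made so that "née" comes out as "nee", as it does in the stripped text.

```python
import unicodedata

# Letter groups from the slide: a-g -> a, h-n -> c, o-u -> g, v-z -> t.
GROUPS = {"abcdefg": "a", "hijklmn": "c", "opqrstu": "g", "vwxyz": "t"}
LETTER_MAP = {ch: code for group, code in GROUPS.items() for ch in group}


def strip_text(text: str) -> str:
    """Lowercase, split accents off their base letters, keep only a-z."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if "a" <= ch <= "z")


def encode(text: str) -> str:
    """Recode text into the 4-letter alphabet {a, c, g, t}."""
    return "".join(LETTER_MAP[ch] for ch in strip_text(text))


if __name__ == "__main__":
    print(encode("Richard Bayes"))             # gcacagaaatag
    print(encode("joshua"), encode("bayes"))   # cggcga aatag
```

Because several letters share one code, distinct words can collide or nearly collide after recoding, which is what makes the hidden words hard to pick out by eye.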
In the recoded sequence, "rdwas", "taxes" and "sewas" all map to gatag, one letter away from aatag (bayes).

Upstream sequences in E. coli

taatgtttgtgctggtttttgtggcatcgggcgagaatagcgcgtggtgtgaaagactgttttt
ttgatcgttttcacaaaaatggaagtccacagtcttgacaggacaaaaacgcgtaacaaaagtg
tctataatcacggcagaaaagtccacattgattatttgcacggcgtcacactttgctatgccat
agcatttttatccataagacaaatcccaataacttaattattgggatttgttatatataacttt
ataaattcctaaaattacacaaagttaataactgtgagcatggtcatatttttatcaatcacaa
agcgaaagctatgctaaaacagtcaggatgctacagtaatacattgatgtactgcatgtatgca
aaggacgtcacattaccgtgcagtacagttgatagcacggtgctacacttgtatgtagcgcatc
tttctttacggtcaatcagcatggtgttaaattgatcacgttttagaccattttttcgtcgtga
aactaaaaaaaccagtgaattatttgaaccagatcgcattacagtgatgcaaacttgtaagtag
atttccttaattgtgatgtgtatcgaagtgtgttgcggagtagatgttagaatagcgcataaaa
aacggctaaattcttgtgtaaacgattccactaatttattccatgtcacacttttcgcatcttt
gttatgctatggttatttcataccataagccgctccggcggggttttttgttatctgcaattca
gtacaaaacgtgatcaacccctcaattttccctttgctgaaaaattttccattgtctcccctgt
aaagctgtaacgcaattaatgtgagttagctcactcattaggcaccccaggctttacactttat
gcttccggctcgtatgttgtgtggaattgtgagcggataacaatttcacacattaccgccaatt
ctgtaacagagatcacacaaagcgacggtggggcgtaggggcaaggaggatggaaagaggttgc
cgtataaagaaactagagtccgtttaggaggaggcgggaggatgagaacacggcttctgtgaac
taaaccgaggtcatgtaaggaatttcgtgatgttgcttgcaaaaatcgtggcgattttatgtgc
gcagatcagcgtcgttttaggtgagttgttaataaagatttggaattgtgacacagtgcaaatt
cagacacataaaaaaacgtcatcgcttgcattagaaaggtttctgctgacaaaaaagattaaac
ataccttatacaagacttttttttcatatgcctgacggagttcacacttgtaagttttcaacta
cgttgtagactttacatcgccttttttaaacattaaaattcttacgtaatttataatctttaaa
aaaagcatttaatattgctccccgaacgattgtgattcgattcacatttaaacaatttcagacc
catgagagtgaaattgttgtgatgtggttaacccaattagaattcgggattgacatgtctta

Biological Motivation
Transcription regulation:
DNA strand: ..ATTGATAATCCCTAGTCGATATTAAATCGGCTAACTTAAAGCGC..
Alphabet of 4 letters: A, C, G, T
Transcription factor recognizes binding site (motif)
TF attracts RNA polymerase to promoter
Transcription: mRNA copied from DNA
mRNA → PROTEINS
[Figure: TRANSCRIPTION FACTOR and RNA POLYMERASE bound to the DNA strand; TRANSCRIPTION → mRNA → PROTEINS]

Transcription Regulation
Discovery of transcription factor binding sites
Motif Representation
Position-specific weight matrix Θ

       1     2     3     4     5     6
A    0.0   1.0   0.0   0.9   0.8   0.0
C    0.0   0.0   0.0   0.0   0.15  0.0
G    0.0   0.0   0.4   0.0   0.05  0.0
T    1.0   0.0   0.6   0.1   0.0   1.0

Aligned sites:
TAGAAT TATACT TATTAT TAGAAT TATAAT
TATAAT TAGACT TATAAT TATACT TATAAT
TATAAT TATAAT TAGAGT TAGAAT TATAAT
TAGAAT TAGTAT TATAAT TATAAT TAGAAT

Sequence logo: Kullback-Leibler "information content"
[Figure: sequence logo for TATAAT; x-axis: positions 1-6, y-axis: bits]

Computational Approaches to Motif Discovery
Pattern Search Methods
  word frequency based: enumerate, check significance
    Consensus, Dictionary, MDscan
  weight matrix based: initialize, refine using data
    EM (MEME, ...), Gibbs sampling (Motif Sampler, AlignAce, BioProspector)

A Word Frequency Based Approach
Consensus (Stormo and Hartzell, 1989)
Idea: maximize entropy distance between motif and background
Width = 5; sequences: tgctaatct, ctaattagc, ttagcagtt
Find all matrices from 1st sequence
Update each matrix by "best" match in 2nd sequence, keep all matrices that score above a cut-off level.
Repeat for rest of the sequences
[Figure: 0/1 count matrices for the width-5 words of the 1st sequence, and the updated count matrices after the 2nd sequence]
Initial sequences must contain motif
Repeat, randomizing sequence order?

Weight Matrix-based Approaches
Motif width w assumed known.
EM for finite mixture models (Lawrence and Reilly, 1990)
  1 motif per sequence
MEME (Bailey and Elkan, 1994)
  Data set broken into overlapping subsequences of length w
  Fit a two-component (motif / background) mixture model iteratively by EM algorithm.
  Motifs must be same width; number of different motifs (components) may be unknown

Monte Carlo-based Approaches
Gibbs Motif Sampler (Liu et al, JASA, 95)
  Weight matrix Θ (parameter), start positions A ("missing" data)
  Predictive updating: assume N - 1 sequences aligned; stochastically predict the N th one
  1 motif per sequence (originally)
  Randomly initialize $A_i$'s
  Iteratively, choose a sequence k to exclude, update $A_k$
  Repeat until convergence
  [Figure: sequences with aligned sites $A_1, A_2, A_3, \ldots$; $A_N$ = ??]

Extensions of Gibbs Sampling
AlignAce (Roth et al, 1998)
  Iterative masking to find multiple motifs
  Variable widths of motifs
BioProspector (X. Liu et al, 2001)
  3rd order Markovian background assumption
  A modified "scoring" function
  Allows 2-block (gapped) motifs

"Dictionary" Model (Bussemaker et al. 2000)
[Figure: a long string of the letters A, C, G, T with embedded words such as ACTTA, TAT and TGC]
A, C, G, T and longer words are part of a dictionary.
Words drawn independently according to certain probabilities and concatenated together to form the sequence.

"Dictionary" Model (Bussemaker et al. 2000)
Algorithm:
  Starting dictionary: $D_0$ = {A, C, G, T}
  Find over-represented pairs (adjacent words)
  Concatenate and add them to the dictionary
  Estimate word probabilities
  e.g. $D_4$ = {A, C, G, T, AC, GC, TA, ATA, GCTA, TAGCTA}
Problems: upper limit to length of patterns; fuzzy patterns; patterns with under-represented substrings

Words → Stochastic Word Matrices
Word: T T G A C A

     T  T  G  A  C  A
A    0  0  0  1  0  1
C    0  0  0  0  1  0
G    0  0  1  0  0  0
T    1  1  0  0  0  0
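A word such as TTGACA is a 0/1 matrix with one column per position. Relaxing the columns to probability vectors gives a stochastic word matrix, under which the probability of a word is the product of the matching column entries. The sketch below illustrates this. The entries of `THETA` are an assumed reading of the example matrix in this deck; they reproduce the three word probabilities quoted for TTGACA, ATGACT and TTTAGT. The function names are hypothetical.

```python
ALPHABET = "ACGT"

# One probability vector over {A, C, G, T} per motif position.
# Assumed values: chosen to match the example probabilities in the slides.
THETA = [
    {"A": 0.2, "C": 0.0, "G": 0.0, "T": 0.8},
    {"A": 0.0, "C": 0.0, "G": 0.0, "T": 1.0},
    {"A": 0.0, "C": 0.1, "G": 0.8, "T": 0.1},
    {"A": 0.95, "C": 0.05, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 0.99, "G": 0.01, "T": 0.0},
    {"A": 0.9, "C": 0.0, "G": 0.0, "T": 0.1},
]


def indicator_columns(word):
    """Deterministic (0/1) word matrix for a fixed word."""
    return [{b: 1.0 if b == ch else 0.0 for b in ALPHABET} for ch in word]


def word_probability(word, theta):
    """Probability of `word` under a matrix with independent columns."""
    if len(word) != len(theta):
        raise ValueError("word length must equal matrix width")
    p = 1.0
    for ch, column in zip(word, theta):
        p *= column[ch]
    return p


if __name__ == "__main__":
    for word in ("TTGACA", "ATGACT", "TTTAGT"):
        print(word, f"{word_probability(word, THETA):.4g}")
```

Under the 0/1 matrix only the exact word has non-zero probability; the stochastic version spreads some mass onto near variants, which is how "fuzzy" patterns enter the dictionary.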
Words → Stochastic Word Matrices

       1      2      3      4      5      6
A    0.2    0      0      0.95   0      0.9
C    0      0      0.1    0.05   0.99   0
G    0      0      0.8    0      0.01   0
T    0.8    1      0.1    0      0      0.1

TTGACA [PROB = 0.5417]
ATGACT [PROB = 0.0150]
TTTAGT [PROB = 7.6e-05]

Stochastic Dictionary Framework
Sequence $S = (x_1, x_2, \ldots, x_N)$
Stochastic dictionary $\mathcal{D}$ of D word matrices: $M_1, M_2, \ldots, M_D$
Word usage probabilities: $\rho(M_1), \ldots, \rho(M_D)$

Under independence, observed likelihood of sequence S is

$$P(S \mid \rho, \mathcal{D}) = \sum_{\Pi} \prod_{j=1}^{D} [\rho(M_j)]^{N_j(\Pi)}$$

summed over all partitions Π, where $N_j(\Pi)$ = frequency for word type j in partition Π.
PROBLEM: partitions, words, number of words unknown!

Missing Data: Sequence Partitions
Sequence S: AAATATTCGACGCTATTTCCCGTTGACAT
Site indicators (unknown): $A_{ik} = 1$ if motif type k starts at i, 0 otherwise.
$w_1 = 4$: 2 words ($A_{4,1} = 1$, $A_{14,1} = 1$)
$w_2 = 6$: 1 word ($A_{23,2} = 1$)

Model Parameters
Dictionary of size D contains 4 single letters and d stochastic word matrices
$\Theta^{(D)} = \{\Theta_1, \ldots, \Theta_d\}$, with $\Theta_i = (\theta_{i1}, \theta_{i2}, \ldots, \theta_{i w_i})$ of width $w_i$ ($\theta_{ij}$ is a vector of length 4)
Priors: columns are independent Dirichlet
[Figure: sequence logo for TATAAT; x-axis: positions 1-6, y-axis: bits]
Word usage probabilities $\rho = (\rho_1, \rho_2, \ldots, \rho_D)$
Prior: Dirichlet
For conciseness, $O = (\Theta, \rho)$

Stochastic Dictionary-based Data Augmentation (SDDA)
Easy: sample $(\Theta, \rho) \mid A, S$
  - Word counts multinomial, Dirichlet prior on ρ: $\rho \mid A, S \sim$ Dirichlet
  - Motif column counts independent multinomial, prior on columns of Θ independent Dirichlet: $\Theta \mid A, S \sim$ Product Dirichlet
Difficult! Sample $A \mid \Theta, \rho, S$
  Need dynamic programming (forward summation-backward sampling)
Details: Discovery of conserved sequence patterns using a stochastic dictionary model.
Gupta, M and Liu, JS (2003). JASA 98, 55-66.

Forward Summation: Observed Data Likelihood
Currently D words in the dictionary
$g_k(O) = P(x_1, \ldots, x_k \mid O)$: partial likelihood
$P(S \mid O) = g_N(O)$
Given $g_1(O), \ldots, g_{k-1}(O)$, then $g_k(O)$ easy to calculate!

richardbayesagreatgrandfatherofthomasbayeswas...   (k = 45)
Currently 3 words in the dictionary:

$$g_{45}(O) = P(\text{WAS} \mid \Theta_1)\,\rho_1\, g_{42}(O) + P(\text{ESWAS} \mid \Theta_2)\,\rho_2\, g_{40}(O) + P(\text{ASBAYESWAS} \mid \Theta_3)\,\rho_3\, g_{35}(O)$$

Backward Sampling: Partitions
Currently 3 words in the dictionary
richardbayesagreatgrandfatherofthomasbayeswas...

$$P(\text{"ASBAYESWAS"} \mid A_{46}, S, O) = \frac{P(\text{ASBAYESWAS} \mid \Theta_3)\,\rho_3\, g_{35}(O)}{g_{45}(O)}$$

Backward Sampling for Motif Sites
Faster Convergence than the Gibbs Sampler:
Gibbs: $A_i \mid A_1, \ldots, A_{i-1}, A_{i+1}, \ldots, A_N, S$   ("local" moves)
SDDA: sample A from joint distribution, $A_i \mid A_{i+1}, \ldots, A_N, \Theta, S$   ("global" moves)
[Figure: candidate sites $A_k = 1$ and $A_j = 1$ along a sequence]
For i < k, a motif can be at i only if it does not overlap the site at j (width w).
Faster convergence than the ordinary Gibbs sampler using conditionals (Liu et al. Biometrika, 1994)

Extensions: Insertions and Deletions in Motifs

GACAC   TATTCC   TATGC    GGC   CTACGACCT
CACA    TATTCC   ACCA     GGC   TGATCTCAGACAC
TCG     TATTCC   GTGGTA   GGC   AGTTTCTCAAAC
TCTGT   TATTCC   ACCCA    GGC   GTAATTCGAGA
GCT     TATTCC   AAA      GGC   TTCTCGAGAAATC

Not all w columns of the motif may be contiguous
W = 6 + 3 = 9, varying gap size

Discovery of Gapped Motifs
Hidden Markov model for gapped motifs
[Figure: HMM with states $M_j$, $I_j$, $D_j$ between BEGIN and END]
Each position of a segment can be generated from 1 of 3 possible states: MATCH, INSERTION, DELETION
$\Theta = (\theta_1, \ldots, \theta_9)$; TATTCCAAAGGC ??
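Returning to the forward summation and backward sampling steps described above for the dictionary model (before the gapped-motif extension), the recursion can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation from the cited paper: each word is a list of column dictionaries, the usage probability `rho[j]` multiplies every term of the recursion (as the partition likelihood suggests), and the toy alphabet {a, b} and all function names are hypothetical.

```python
import random


def one_hot_word(word):
    """Deterministic word matrix: each column puts all mass on one letter."""
    return [{ch: 1.0} for ch in word]


def segment_prob(segment, theta):
    """P(segment | theta) with independent columns."""
    p = 1.0
    for ch, column in zip(segment, theta):
        p *= column.get(ch, 0.0)
    return p


def forward(seq, words, rho):
    """g[k] = sum_j rho[j] * P(seq[k-w_j:k] | words[j]) * g[k-w_j], g[0] = 1."""
    g = [0.0] * (len(seq) + 1)
    g[0] = 1.0
    for k in range(1, len(seq) + 1):
        for theta, r in zip(words, rho):
            w = len(theta)
            if w <= k:
                g[k] += r * segment_prob(seq[k - w:k], theta) * g[k - w]
    return g


def backward_sample(seq, words, rho, g, rng):
    """Sample a partition from its conditional distribution, right to left."""
    k, partition = len(seq), []
    while k > 0:
        weights = []
        for theta, r in zip(words, rho):
            w = len(theta)
            ok = w <= k
            weights.append(r * segment_prob(seq[k - w:k], theta) * g[k - w] if ok else 0.0)
        j = rng.choices(range(len(words)), weights=weights)[0]
        k -= len(words[j])
        partition.append((k, j))  # (start position, word index)
    partition.reverse()
    return partition


if __name__ == "__main__":
    words = [one_hot_word("a"), one_hot_word("b"), one_hot_word("ab")]
    rho = [0.5, 0.3, 0.2]
    g = forward("abab", words, rho)
    print(g[-1])  # likelihood summed over all partitions of "abab"
    print(backward_sample("abab", words, rho, g, random.Random(1)))
```

For long sequences the products underflow, so a practical version would rescale `g` or work in log space. The sampler also assumes `g[N] > 0`, meaning at least one partition of the sequence has non-zero probability.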
Discovery of Gapped Motifs

$$P(\text{TATTCCAAAGGC} \mid \Theta) = \sum_{\text{all paths through HMM}} P(\text{TATTCCAAAGGC}, \text{path} \mid \Theta)$$

Multiple paths through HMM to "align" a gapped segment of length l with matrix Θ (width w)
[Figure: alignment grid of $\theta_1, \ldots, \theta_9$ against T A T T C C A A A G G C, with match (M), insertion (I) and deletion (D) moves]
Approx. order of computation: $2^{2w} / \sqrt{w\pi}$

Discovery of Gapped Motifs
Use recursive forward summation to compute likelihood and backward sampling to sample path through the HMM.
Order of computation: $O(lw)$

Motifs in "Low-complexity" Regions
Genomes of higher organisms often characterized by polynucleotide repeats: local "traps"
Example:
> gi|200712|gb|M97810.1|MUSREX01 Mus musculus ZFP 42 (Rex-1) gene
TCAGGCAACTAGTGTACTTTGTAGCGGGGTCCGGGAGAGGCTGGGGTCTAGAGTGGCGATGGGACGAAAGGGTAAA
AGTTTTCGGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTAGGTAGGTAGGTAGGTAGAAAC
ATCCTCTGCTTGTGTAAATCCGGTTACTGTGTAACAGAGGTACTGAGATGTGACTGAGTCTCAAGGCCAGGCGATC
GGATTCAGAAGAGGCATTTGCATAACTGAGCAAGAGCCTTTGCCCCACCCTTCCACGCGGACCCAAGACAAGCGGG
TCTGGGTGGGTCACCTTGAAGCCAGGGGCCCGCCCACATCCCCGCCCACACCCACCTTGAGCGCTTCTCATTGGTT
CAGCCTACCCAGCTCGGAGTTAGTTACTCCGTAAGTGTGGCCGGAACAGAGTTCGTCCATCTA...
[mouse skeletal muscle regulatory region]
Compared performance of Stochastic Dictionary (SDDA) to BioProspector and AlignAce

Simulation Study under Sequence Background Correlation
Markov transition matrices used:

(i) 1-letter repeats:
$$\begin{pmatrix} 1-3\alpha & \alpha & \alpha & \alpha \\ \alpha & 1-3\alpha & \alpha & \alpha \\ \alpha & \alpha & 1-3\alpha & \alpha \\ \alpha & \alpha & \alpha & 1-3\alpha \end{pmatrix}$$

(ii) 2-letter repeats:
[Matrix with entries α, β and 1 - 2α - β]

Degree of dependence represented by 2nd highest eigenvalue (EV2)
Simulated data sets of 2000 nucleotides, each containing 1 true motif (w = 16), with correlated background for 4 choices each of matrix (i) and (ii).
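A correlated background of the kind used in this simulation study can be generated with a first-order Markov chain. The sketch below assumes matrix (i) has 1 - 3α on the diagonal and α elsewhere; for a matrix of that form the eigenvalues are 1 and 1 - 4α (three times), so α can be set from a target EV2. The function names, the uniform start state and the use of Python's `random` module are illustrative choices, not details taken from the study.

```python
import random

ALPHABET = "ACGT"


def transition_matrix(alpha):
    """Assumed form of matrix (i): 1 - 3*alpha on the diagonal, alpha elsewhere."""
    if not 0.0 <= alpha <= 1.0 / 3.0:
        raise ValueError("alpha must lie in [0, 1/3]")
    return [[1.0 - 3.0 * alpha if i == j else alpha for j in range(4)]
            for i in range(4)]


def alpha_from_ev2(ev2):
    """Second-highest eigenvalue of the matrix above is 1 - 4*alpha."""
    return (1.0 - ev2) / 4.0


def simulate(n, alpha, rng):
    """Draw n nucleotides from the chain, starting from a uniform state."""
    matrix = transition_matrix(alpha)
    state = rng.randrange(4)
    states = [state]
    for _ in range(n - 1):
        state = rng.choices(range(4), weights=matrix[state])[0]
        states.append(state)
    return "".join(ALPHABET[s] for s in states)


def self_transition_fraction(seq):
    """Fraction of adjacent pairs that repeat the same letter."""
    return sum(a == b for a, b in zip(seq, seq[1:])) / (len(seq) - 1)


if __name__ == "__main__":
    rng = random.Random(0)
    for ev2 in (0.24, 0.48, 0.72, 0.96):
        seq = simulate(2000, alpha_from_ev2(ev2), rng)
        print(ev2, round(self_transition_fraction(seq), 3), seq[:40])
```

As EV2 approaches 1 the chain stays on the same letter for long runs, producing the 1-letter repeats that act as local traps for motif finders.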
Performance Comparison

Background          SDDA            BioProspector   AlignAce
type         EV2    Success  FP     Success  FP     Success  FP
1st order    0.24   1.0      0.07   1.0      0.02   0.3      0.43
             0.48   0.6      0.06   0.7      0.00   0.0      -
             0.72   0.5      0.12   0.1      0.00   0.0      -
             0.96   0.7      0.02   0.0      -      0.0      -
2nd order    0.24   1.0      0.03   1.0      0.09   0.1      0.52
             0.48   0.9      0.12   0.7      0.01   0.1      0.62
             0.72   0.9      0.05   0.6      0.00   0.1      0.36
             0.96   1.0      0.03   0.0      -      0        -

Success: true motif among top 5 found
Criterion: ≥ 60% of true positives, ≥ 80% site overlap
Progressive updating in stochastic dictionary provides a control: treats repeats as adjacent words of the same pattern

Model Selection
Evaluating Motif Significance
back to the problem ... How many patterns to include in the model?
Bayes Factor for comparing model $M_1$ (1 pattern and background) to $M_0$ (background only) is

$$\frac{p(S \mid M_1)}{p(S \mid M_0)} = \frac{\sum_A \int p(A, S, O \mid M_1)\, dO}{\int p(S, O \mid M_0)\, dO}, \qquad O = (\rho, \Theta)$$

Integration can be done, but sum involves summing over all possible partitions of sequences!

Computational Approaches to BF
Bayes Factor, with q unnormalized:

$$\frac{c_1}{c_0} = \frac{\sum_A q(A \mid S, M_1)}{p(S \mid M_0)} \equiv \frac{\sum_A p(A, S \mid M_1)}{p(S \mid M_0)}$$

Approaches: Importance sampling, bridge sampling, marginalization ...
Can get draws from $p(A \mid S, O^{*})$ and estimate

$$\hat{c}_1 = \frac{\frac{1}{N_2} \sum_{i=1}^{N_2} q(A_{2i} \mid S, M_1)}{\frac{1}{N_1} \sum_{i=1}^{N_1} p(A_{1i} \mid S, O^{*}, M_1)} \qquad \text{(biased estimate)}$$

Heuristics: use mixture of densities as trial distribution
But ... difficult to get "correct" samples from $q(A \mid S, M_1)$

Analytical Approximation: MAP Score
Maximum a posteriori (MAP) score: $\max_A P(S, A \mid M_1)$, evaluated at "optimal" alignment $A^{*}$

$$\mathrm{MAP}(A^{*}) = \frac{P(S, A^{*} \mid M_1)}{P(S \mid M_0)}$$

Note: $P(S \mid M_1) = \sum_A P(S, A \mid M_1) \geq P(S, A^{*} \mid M_1)$. So $\mathrm{BF} \geq \mathrm{MAP}(A^{*})$.

log(MAP) in Random Sequences for Varying w
[Figure: log(MAP) against w for w = 6 to 16. Horizontal line corresponds to alignment with no motifs]

Divergence of the MAP Score
Result 1: As sequence length $N \to \infty$, $\mathrm{MAP}(A^{*}) \to \infty$ at an exponential rate in N if model $M_1$ is true (under certain simple conditions)
Result 2: MAP monotonically increases with increasing number of motifs if the larger model is true
MAP as stopping criterion for number of words to include in the dictionary: more useful as word length increases

Future Directions
Refine clustering model
  More appropriate inter-motif distance models
  Model selection criteria
Incorporate more biological knowledge
  Correlated sequences (phylogenetic information)
  Intra-motif correlations?
Broaden scope of the problem
  Biological databases, knowledge growing at a rapid pace
  Discovery of regulatory networks, relationships among genes through study of regulatory motifs

References
Previous Work
  Gibbs Motif Sampler: Liu, JS, Neuwald, AF and Lawrence, CE (1995). JASA 90, 1156-70.
  BioProspector: Liu, XS, Brutlag, D and Liu, JS (2001). P. Symp. Biocomp. 6, 127-38.
  Dictionary: Bussemaker, HJ, Li, H and Siggia, ED (2000). PNAS 97, 10096-10100.
For this work
  Discovery of conserved sequence patterns using a stochastic dictionary model. Gupta, M and Liu, JS (2003). JASA 98, 55-66.
  Statistical models for biological sequence motif discovery. Liu, JS, Gupta, M, Liu, X, Mayerhofer, L and Lawrence, CE (2002). Case Studies in Bayesian Statistics 6.
  Motif cluster prediction using a regulatory module framework. In preparation.