Motif discovery Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University of Washington thabangh@gmail.com Outline • One-minute response • Revision • Motifs – Gibbs sampling – Expectation maximization • Python One-minute responses • Are we able to read from the smooth curve that we learned to create today what is a “good enough p-value”? – No, the curve gives you the p-value, but deciding what is “good enough” requires that you know the costs associated with false positives and false negatives. • The concepts were clear (for today). • Keep doing more explanation on the board. • Can you please explain the last part of converting scores to p-values? • Continue with revision every day. Other questions and comments • We would prefer to know how we did on the first assessment before you give us the second assessment. Converting scores to p-values 0 1 2 3 4 … A 10 67 59 44 C 60 39 49 29 G 0 71 50 54 400 T 100 43 13 64 • Say that your motif has N rows. Create a matrix that has N rows and 100N columns. • The entry in row i, column j is the number of different sequences of length i that can have a score of j. Converting scores to p-values A 10 67 59 44 C 60 39 49 29 G 0 71 50 54 0 1 2 3 4 … 10 60 100 1 1 1 1 400 T 100 43 13 64 • For each value in the first column of your motif, put a 1 in the corresponding entry in the first row of the matrix. • There are only 4 possible sequences of length 1. Converting scores to p-values A 10 67 59 44 C 60 39 49 29 G 0 71 50 54 0 1 2 3 4 … 10 1 1 60 77 1 100 400 1 1 T 100 43 13 64 • For each value x in the second column of your motif, consider each value y in the zth column of the first row of the matrix. • Add y to the x+zth column of the matrix. Converting scores to p-values A 10 67 59 44 C 60 39 49 29 G 0 71 50 54 0 1 2 3 4 … 10 1 1 60 77 1 100 400 1 1 T 100 43 13 64 • For each value x in the second column of your motif, consider each value y in the zth column of the first row of the matrix. • Add y to the x+zth column of the matrix. • What values will go in row 2? – 10+67, 10+39, 10+71, 10+43, 60+67, …, 100+43 • These 16 values correspond to all 16 strings of length 2. Converting scores to p-values A 10 67 59 44 C 60 39 49 29 G 0 71 50 54 0 1 2 3 4 … 10 1 1 60 77 1 100 400 1 1 T 100 43 13 64 • In the end, the bottom row contains the scores for all possible sequences of length N. • Use these scores to compute a p-value. Sample problem • List the scores of all 4 length-1 DNA sequences relative to this motif: A 10 15 100 80 C 100 30 7 21 G 0 30 22 14 T 10 35 9 51 – A=10, C=100, G=0, T=10 • List the scores of all 16 length-2 DNA sequences relative to the same motif. – AA=15, AC=40, AG=40, AT=45, CA=115, CC=130, CG=130, CT=135, GA=15, GC=30, GG=30, GT=35, TA=25, TC=40, TG=40, TT=45 • Draw the dynamic programming matrix for this motif and indicate where the 20 scores you computed would go. • 15x2, 25, 30x2, 40x4 Sample problem 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 115 120 125 130 135 1 2 1 2 1 2 1 4 1 1 2 1 • Draw the dynamic programming matrix for this motif and indicate where the 20 scores you computed would go. – AA=15, GA=15, TA=25, GC=30, GG=30, GT=35, AC=40, AG=40, TC=40, TG=40, AT=45, TT=45, CA=115, CC=130, CG=130, CT=135 • How many distinct scores do you observe for sequences of length 2? – 9 • How many calculations will you need to perform to compute scores for sequences of length 3? – 9 x 4 = 36 Revision • How many distinct amino acid sequences of length 10 exist? – 2010=1.024 x 1012 • Say that you use dynamic programming to compute the distribution of scores for a motif of width 10. You then observe a sequence with a score of 28. Describe how you would compute the p-value of 28 from the output of the dynamic programming. – Compute the sum S of the counts for scores ≥28. – The p-value is S/2010. 1 ¥ p(x = 28) = 20 å si 10 i=28 Motif discovery problem • Given sequences seq. 1 seq. 2 seq. 3 • Find motif IGRGGFGEVY at position 515 LGEGCFGQVV at position 430 VGSGGFGQVY at position 682 seq. 1 seq. 2 seq. 3 Motif discovery problem • Given: – a sequence or family of sequences. • Find: – the number of motifs – the width of each motif – the locations of motif occurrences Why is this hard? • Input sequences are long (thousands or millions of residues). • Motif may be subtle – Instances are short. – Instances are only slightly similar. ? ? HAHU HAOR HADK HBHU HBOR HBDK MYHU MYOR IGLOB GPUGNI GPYL GGZLB xxxxxxxxxxx.xxxxxxxxx.xxxxx..........xxxxxx.xxxxxxx.xxxxxxxxxx.xxxxxxxxx V.LSPADKTN..VKAAWGKVG.AHAGE..........YGAEAL.ERMFLSF..PTTKTYFPH.FDLS.HGSA M.LTDAEKKE..VTALWGKAA.GHGEE..........YGAEAL.ERLFQAF..PTTKTYFSH.FDLS.HGSA V.LSAADKTN..VKGVFSKIG.GHAEE..........YGAETL.ERMFIAY..PQTKTYFPH.FDLS.HGSA VHLTPEEKSA..VTALWGKVN.VDEVG...........G.EAL.GRLLVVY..PWTQRFFES.FGDL.STPD VHLSGGEKSA..VTNLWGKVN.INELG...........G.EAL.GRLLVVY..PWTQRFFEA.FGDL.SSAG VHWTAEEKQL..ITGLWGKVNvAD.CG...........A.EAL.ARLLIVY..PWTQRFFAS.FGNL.SSPT G.LSDGEWQL..VLNVWGKVE.ADIPG..........HGQEVL.IRLFKGH..PETLEKFDK.FKHL.KSED G.LSDGEWQL..VLKVWGKVE.GDLPG..........HGQEVL.IRLFKTH..PETLEKFDK.FKGL.KTED M.KFFAVLALCiVGAIASPLT.ADEASlvqsswkavsHNEVEIlAAVFAAY.PDIQNKFSQFaGKDLASIKD A.LTEKQEAL..LKQSWEVLK.QNIPA..........HS.LRL.FALIIEA.APESKYVFSF.LKDSNEIPE GVLTDVQVAL..VKSSFEEFN.ANIPK...........N.THR.FFTLVLEiAPGAKDLFSF.LKGSSEVPQ M.L.DQQTIN..IIKATVPVLkEHGVT...........ITTTF.YKNLFAK.HPEVRPLFDM.GRQ..ESLE HAHU HAOR HADK HBHU HBOR HBDK MYHU MYOR IGLOB GPUGNI GPYL GGZLB xxxxx.xxxxxxxxxxxxx..xxxxxxxxxxxxxxx..xxxxxxx.xxxxxxx...xxxxxxxxxxxxxxxx QVKGH.GKKVADA.LTN......AVA.HVDDMPNA...LSALS.D.LHAHKL....RVDPVNF.KLLSHCLL QIKAH.GKKVADA.L.S......TAAGHFDDMDSA...LSALS.D.LHAHKL....RVDPVNF.KLLAHCIL QIKAH.GKKVAAA.LVE......AVN.HVDDIAGA...LSKLS.D.LHAQKL....RVDPVNF.KFLGHCFL AVMGNpKVKAHGK.KVLGA..FSDGLAHLDNLKGT...FATLS.E.LHCDKL....HVDPENF.RL.LGNVL AVMGNpKVKAHGA.KVLTS..FGDALKNLDDLKGT...FAKLS.E.LHCDKL....HVDPENFNRL..GNVL AILGNpMVRAHGK.KVLTS..FGDAVKNLDNIKNT...FAQLS.E.LHCDKL....HVDPENF.RL.LGDIL EMKASeDLKKHGA.TVL......TALGGILKKKGHH..EAEIKPL.AQSHATK...HKIPVKYLEFISECII EMKASaDLKKHGG.TVL......TALGNILKKKGQH..EAELKPL.AQSHATK...HKISIKFLEYISEAII T.GA...FATHATRIVSFLseVIALSGNTSNAAAV...NSLVSKL.GDDHKA....R.GVSAA.QF..GEFR NNPK...LKAHAAVIFKTI...CESATELRQKGHAVwdNNTLKRL.GSIHLK....N.KITDP.HF.EVMKG NNPD...LQAHAG.KVFKL..TYEAAIQLEVNGAVAs.DATLKSL.GSVHVS....K.GVVDA.HF.PVVKE Q......PKALAM.TVL......AAAQNIENLPAIL..PAVKKIAvKHCQAGVaaaH.YPIVGQEL.LGAIK HAHU HAOR HADK HBHU HBOR HBDK MYHU MYOR IGLOB GPUGNI GPYL GGZLB xxxxxxxxx.xxxxxxxxx.xxxxxxxxxxxxxxxxxxxxxxx..x VT.LAA.H..LPAEFTPA..VHASLDKFLASV.STVLTS..KY..R VV.LAR.H..CPGEFTPS..AHAAMDKFLSKV.ATVLTS..KY..R VV.VAI.H..HPAALTPE..VHASLDKFMCAV.GAVLTA..KY..R VCVLAH.H..FGKEFTPP..VQAAYQKVVAGV.ANALAH..KY..H IVVLAR.H..FSKDFSPE..VQAAWQKLVSGV.AHALGH..KY..H IIVLAA.H..FTKDFTPE..CQAAWQKLVRVV.AHALAR..KY..H QV.LQSKHPgDFGADAQGA.MNKALELFRKDM.ASNYKELGFQ..G HV.LQSKHSaDFGADAQAA.MGKALELFRNDM.AAKYKEFGFQ..G TA.LVA.Y..LQANVSWGDnVAAAWNKA.LDN.TFAIVV..PR..L ALLGTIKEA.IKENWSDE..MGQAWTEAYNQLVATIKAE..MK..E AILKTIKEV.VGDKWSEE..LNTAWTIAYDELAIIIKKE..MKdaA EVLGDAAT..DDILDAWGK.AYGVIADVFIQVEADLYAQ..AV..E Globin motifs A Concrete Example: Transcription Factor Binding Sites • Transcription factor proteins bind to DNA and regulate gene expression. • The promoter is a region near the start of the gene where transcription factors bind. A Concrete Example: Transcription Factor Binding Sites We are given a set of promoters from co-regulated genes. TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7 ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4 CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6 TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4 ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1 ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2 GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3 A Concrete Example: Transcription Factor Binding Sites An unknown transcription factor binds to positions unknown to us, on either DNA strand. 5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7 5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4 5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6 5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4 5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1 5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2 5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3 A Concrete Example: Transcription Factor Binding Sites The DNA binding motif of the transcription factor can be described by a position-specific scoring matrix (PSSM). 5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7 5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4 5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6 5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4 5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1 5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2 5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3 A Concrete Example: Transcription Factor Binding Sites The sequence motif discovery problem is to discover the sites (or the motif) given just the sequences. 5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7 5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4 5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6 5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4 5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1 5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2 5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3 Gibbs sampling Alternating approach 1. Guess an initial weight matrix 2. Use weight matrix to predict instances in the input sequences 3. Use instances to predict a weight matrix 4. Repeat 2 & 3 until satisfied. Initialization • Randomly guess an instance si from each of t input sequences {S1, ..., St}. sequence 1 sequence 2 sequence 3 sequence 4 sequence 5 ACAGTGT TTAGACC GTGACCA ACCCAGG CAGGTTT Gibbs sampler • Initially: randomly guess an instance si from each of t input sequences {S1, ..., St}. • Steps 2 & 3 (search): – Throw away an instance si: remaining (t - 1) instances define weight matrix. – Weight matrix defines instance probability at each position of input string Si – Pick new si according to probability distribution • Return highest-scoring motif seen Sampler step illustration: ACAGTGT TAGGCGT ACACCGT ??????? CAGGTTT ACAGTGT TAGGCGT ACACCGT ACGCCGT CAGGTTT A C G T .45 .45 .45 .05 .05 .05 .05 .25 .45 .05 .25 .45 .05 .05 .05 .05 .45 .65 .05 .65 .05 .25 .05 .05 .05 .45 .25 .85 sequence 4 ACGCCGT:20% 11% ACGGCGT:52% The MEME Algorithm MEME uses expectation maximization (EM) to discover sequence motifs. 5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7 5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4 5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6 5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4 5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1 5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2 5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3 Slides courtesy of Tim Bailey The MEME Algorithm The positions (and strands) of the sites are the missing information variables, Z = {Zi,j}. 5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7 5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4 5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6 5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4 5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1 5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2 5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3 The MEME Algorithm If we knew the Z = {Zi,j}, we could estimate the motif PSSM using maximum likelihood. 5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7 5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4 5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6 5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4 5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1 5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2 5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3 The MEME Algorithm Alignment i 1 AAAAGAGTCA 2 AAATGACTCA AAGTGAGTCA AAAAGAGTCA GGATGAGTCA N AAATGAGTCA 12 … w j Count Matrix A C G T Frequency Matrix A C G T 12 … w 12 … w 5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7 5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4 5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6 5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4 5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1 5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2 5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3 The MEME Algorithm MEME starts Expectation Maximization from an initial estimate of θ based on a subsequence in the data. θ(1) 5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7 5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4 5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6 5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4 5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1 5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2 5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3 The MEME Algorithm EM maximizes the expected joint likelihood of the sequences (X) and missing information (Z) under a probabilistic model. θ(1) 5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7 5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4 5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6 5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4 5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1 5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2 5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3 The MEME Algorithm The current estimates of the parameters of the data model are φ(t), and include θ(t). θ(1) 5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7 5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4 5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6 5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4 5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1 5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2 5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3 The MEME Algorithm EM iteratively refines φ(t). The MEME Algorithm E-step: Use φ(t) to estimate the probabilities of each position in the data being a site. The MEME Algorithm M-step: Use Z(t) to find the value of φ that maximizes the expectation of the joint conditional probability. Alternating approach 1. Guess an initial weight matrix 2. Use weight matrix to predict instances in the input sequences 3. Use instances to predict a weight matrix 4. Repeat 2 & 3 until satisfied. Expectation-maximization foreach subsequence of width W convert subsequence to a matrix do { re-estimate motif occurrences from matrix EM re-estimate matrix model from motif occurrences } until (matrix model stops changing) end select matrix with highest score Comparison • Both EM and Gibbs sampling involve iterating over two steps • Convergence: – EM converges when the PSSM stops changing. – Gibbs sampling runs until you ask it to stop. • Solution: – EM may not find the motif with the highest score. – Gibbs sampling will provably find the motif with the highest score, if you let it run long enough. Sample problem #1 • You are implementing a Gibbs sampler for motif discovery. Your code has just scanned a sequence S with a motif PSSM to produce a list of scores x1, …, xn. You will write the code to randomly select the new motif occurrence on the basis of the score list. • Given: – a list of scores x1, …, xn • Return: – Print to standard output an integer i in the range 1, …, n and the corresponding score xi, where i is selected with probability proportional to xi. > ./randomly-select-4.py scores.txt Read 10000 scores from scores.txt. Sum of scores = 20042.7 Random value = 3298.62 1655 1.41589 Sample problem #2 • Given: – a list of DNA sequences, – the motif width, – the integer index of the (currently estimated) motif occurrence within each sequence, – the index j of the sequence for which to update the motif occurrence, and – the list of scores for sequence j. • Return: – a new alignment with the motif occurrence for sequence j randomly replaced. >./do-gibbs-step.py crp0.txt 10 crp0-offsets.txt 1 crp0-scores.txt > foo Read 18 sequences from crp0.txt. Read 95 values from crp0-scores.txt. Read 18 values from crp0-indices.txt. Sum of scores = 189.341 Random value = 69.6017 Selected 36 with score 2.81402. > cat foo TTTGTGGCAT CAGAAAAGTC ACTGTGAGCA TATGCAAAGG GGTGTTAAAT ATTGTGATGT AATTTATTCC AACGTGATCA AATGTGAGTT TCTGTAACAG TCTGTGAACT ATTGTGACAC TGCCTGACGG ATTGTGATTC GTTGTGATGT GGTGTGAAAT AATGAGACGT TTTGTGAGTG