Algorithms in Bioinformatics: A Practical Introduction Project: Motif finding using ChIP-seq peak data Transcriptional Control (I) Transcriptional Control (II) TATAAT is the motif! Motif model TTGACA TCGACA TTGACA TTGAAA ATGACA TTGACA GTGACA TTGACT TTGACC TTGACA Consensus Pattern Positional Weight Matrix (PWM) TTGACA nucleotide A C G T alignment position 1 2 3 4 5 6 0.1 0 0 1 0.1 0.8 0 0.1 0 0 0.9 0.1 0.1 0 1 0 0 0 0.8 0.9 0 0 0 0.1 Motif can be described in two ways based on the binding sites discovered ChIP experiment Chromatin immunoprecipitation experiment Detect the interaction between protein (transcription factor) and DNA. Peak data Peak data represents the locations where a particular TF binding. The data tells us the locations and intensities. (Note that due to experimental error, peaks of low intensity may be noise.) ChIP-seq data for Human (MCF7) E2 treatment at 45min chr1:883,686-958,485 Our aim Given the DNA sequences of those peaks, find motifs which occur in those peak regions. For the example below, we have two motifs: TTGACA and GCATC. Note that each instance has at most 1 mutation. GCACGCGGTATCGTTAGCTTGACAATGAAGAATCCCCCCGCTCGACAGT GCATACTTTGACACTGACTTCGCTTCTTTAATGTTTAATGAAACATGCG CCCTCTGGAAATTAGTGCGGCATCTCACAACCCGAGGAATGACCAAATG GTATTGAAAGTAAGGCAACGGTGATCCCCATGACACCAAAGATGCTAAG CAACGCTCAGGCAACGTTGACAGGTGACACGTTGACTGCGGCCTCCTGC GTCTCTTGACCGCTTAATCCTAAAGGCCTCCTATTAGTATCCGCAATGT GAACAGGAGCGCGAGCCATCAATTGAAGCGAAGTTGACACCTAATAACT Input (I) From every peak, we get approximately +/-200 DNA sequence >cmyc_1_chr1_4842133_4842148_range_chr1_4841934_4842348_intensity_20 CCTCCATACCAGCCCCAATGTTCTGCGTTCCCGAATGAAAGACACACAACACAGCCTTTATATTTTGATATGCCTAAAACTGCT CAATGGCTGGGCCACTTCCTAGCTAGTATCCACGTGGCTATCCCACCTCTCTCTGATATTCCCAAGTCATTACTTACTAA AATCTGTAATTACATCTTTGCTGCCCTAGGCCCAATCTGGCAGCCCTCCTGTGGCCCCTCAGGCTACTACATGGCAGCT AAGCTCTCTGACCCACATCTTCTCAGGCACCGTGCCTCCTCTTCTCCACCTTATTCAAACATGGTGGCTCTCCTTCCTC CTTCTTCCTGTCTGTCCCCAGCCTGGGAATTCTAAAAGTCCCACCTCTGTCTGCCCTGTTCAGCCATTGGCTGTCGGCA TCTTTATTTACGAG >cmyc_2_chr1_5073201_5073215_range_chr1_5073002_5073415_intensity_15 GGTCATAAACCAAGCTTCTTCAAAGATTTTTGGCTTTTTGGCACCAGTGGCCTGCAGGGTGGCGAGCTCTGCCAGTTTGAAG TGACCAAGTTAAGTGGCCTGGGAAAGGCCATTTGGTGCGCGGTCCAGCAGTTTTGGGCGCTCTCGGCTTCCGCCCTC AGCTGCGGTCACGTGCGGCTGCTCACGTGCCAGACGCTGCTGTCACTTCGTAGCTGTTCCGGCTTCCTCTGAGTGAG GCTCGCAACGTCTCCCACGGAGTCGCCTTCGTTCTGCTCTGGGTCTCCCGTGGCCACTGAGACCTCGGAGCTCGACC GGCGCCTGCCCGCCCGTGCGGCCCTCACTCCCCGAGGCTATCCAGGTGAGGCCGCCTGGGGTCCCTCCCCGGCTCC GGAGAGCCGACTGGTTTCCCTGCCG >cmyc_3_chr1_9530642_9530652_range_chr1_9530443_9530852_intensity_36 GTAGTCCCAACCAGGTCCTGAGCTGGTTAGCCAACCCTCAGCGCCAGTCGGGCCAACATCCGGTGACGAATCCAAGTCCCG CCTCTAAGCCCATCTGCTGTCCAATGCCGCCCTCTGCCGGTCTTTACCTCCCCGCCTAGCTGTGAGCCGCTTCCAGAC AACCCGGAAGTGATCTTTCCTCTTCCGGATTACGGGTCCGGACGTCCGCACGTGGTTGCCGGTTTAGGGTGCTGCTGT AGTGGCGATACGTCCCGCCGCTGTCCCGAAGTGAGGGATCCGAGCCGCAGCGAGAGCCATGGAGGGCCAGCGCGTG GAGGAGCTGCTGGCCAAGGCAGAGCAGGAGGAGGCGGAGAAGCTGCAGCGCATCACGGTGCACAAGGAGCTGGAG CTGGAGTTCGACCTGGGCAACC …………… Input (II) A set of sequences which are likely containing no motif. >SEQ_1 AACAAGGGAAAGAGTAGTGAGTGCTTCTTTCTATTCAGAGGGAGGGGAAGTTGCTGTTAGCTAAGACAGTCAGGACTGAGA AGGGGGGGGGGGGTTTAACTCTCCTGGAGGGAGCTGAGAGGTAAAGGGAGGGGCGTGAGGTAGAACAAGCCGAGA ACACAGGGCAGGTTGGTCTGACTCCAGAGCACAGTGCAGGAGCCCGGAAGTTGACTCAGTTCAGTTAGCAAGTATTTT CACACAAGGCGTGAACACTGAAGACAAAAGCAAGAGACACAGCTCTATCTCTAAGAAGATTTTCAGAGCCAAGATCGA TGGGGCACACCTGTTAATCCCAGCACTTAGGAGGCTGAGGCAGGAGGATCCCAAGTTCAAAACCAGCCTGGACTTGT TTTAAGGAAAA >SEQ_2 AAAAAAAAAAAAAAGACTTCCAGTTTAATAAATGACCAATTCAGGAATGGAGATTAGGGCTGGATGACAAGTTTTTAATTGTC AAGGACTCAATTCTGTTTATCAGTTGGTATGGAATTATGTAAGCTTTTAGCGATATGACCGCACGGAGCAGTGTAGAGA GTGATCTGAGAGACGCTTGGGGGTCAGGATGGAGATAGAACTCCCTCTCTATTAGAAGGTGTTTGGTGGTAGGTAACC CTGGGCTAGCATGGTGGGTCTCTTCTTACTTAGGCTTCCATCTTTGTGGTTCAAATCCAAGAAGGACCTGCGTTCCCTC CCTCCTTGTGATCAGCTGATTGCTAGAGCATAACTCATCTTAACTTCTCATGTACTCTCCGGGTACAGGAAGGGAGGGG GC >SEQ_3 CCACTGCTGACAGTGGAGCATGAAACGACCGGCTTCCTGACTATGTTGGTACCCTTTCAGGAGCCTAAAACAGTGCTTTCAA TACTTGTGTCTATGTCTGTTAGCCACAACTTTCTAGTTTCCCAGAGAGATTTTGAAGTGTAGTTTTGTATTTGCTCAAAT ATATATTCATATGGTGAGGTGCACATTTTTTATATTATATTTTTATTCATTTATTTTTGGTGCTTGGGAATTATACTCTAGGA ATAAAGCGCCTGGTAGAAAGTGGCACACATCTTTAATCCCAGCACTCAGGAAGCAGAGGCAGACAAATCTCTGCGTTC CAGGACAGCCTGGTCTATAGAGCAAGGTCCAAGCCAGCCAGGTTTACACAAAGAAACCTAGTGTGGAAAAGACAAAA …………… Output You need to output a list of candidate (ranked) motifs. You can model the motif as PWM or consensus sequence. If you model the motif as a PWM, one of the answer for the previous dataset is You may also return other significant motifs. Aim of the project Given a sample file and a background file, you need to implement a method which output a list of motifs. You need to take advantage of the fact that this is a ChIP-seq dataset Hint: Read papers on ChIP-seq and understand its properties.