Algorithms in Bioinformatics: A Practical Introduction
Project: Motif finding using ChIP-seq peak data
Transcriptional Control (I)
Transcriptional Control (II)
TATAAT is the motif!
Motif model
TTGACA
TCGACA
TTGACA
TTGAAA
ATGACA
TTGACA
GTGACA
TTGACT
TTGACC
TTGACA

Consensus pattern: TTGACA

Positional Weight Matrix (PWM):

nucleotide   alignment position
             1     2     3     4     5     6
A            0.1   0     0     1     0.1   0.8
C            0     0.1   0     0     0.9   0.1
G            0.1   0     1     0     0     0
T            0.8   0.9   0     0     0     0.1

A motif can be described in two ways, both derived from the
binding sites discovered: a consensus pattern or a PWM.
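
The two descriptions can be derived mechanically from the aligned binding sites. The following is a minimal Python sketch, not part of the original slides; the function names `build_pwm` and `consensus` are illustrative, and it assumes the sites are already aligned, of equal length, and contain only A, C, G and T.

```python
from collections import Counter

# The ten binding sites listed on the "Motif model" slide.
SITES = [
    "TTGACA", "TCGACA", "TTGACA", "TTGAAA", "ATGACA",
    "TTGACA", "GTGACA", "TTGACT", "TTGACC", "TTGACA",
]


def build_pwm(sites):
    """Return {nucleotide: [frequency at each alignment position]}."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    pwm = {nt: [] for nt in "ACGT"}
    for pos in range(length):
        # Count how often each nucleotide appears in this column.
        counts = Counter(site[pos] for site in sites)
        for nt in "ACGT":
            pwm[nt].append(counts[nt] / len(sites))
    return pwm


def consensus(pwm):
    """Pick the most frequent nucleotide at each alignment position."""
    length = len(pwm["A"])
    return "".join(
        max("ACGT", key=lambda nt: pwm[nt][pos]) for pos in range(length)
    )


pwm = build_pwm(SITES)
print("consensus:", consensus(pwm))
for nt in "ACGT":
    print(nt, pwm[nt])
```

Each column of the resulting matrix sums to 1, and the consensus is simply the per-column maximum, so the consensus pattern carries strictly less information than the PWM.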
ChIP experiment

Chromatin immunoprecipitation (ChIP) experiment

Detects the interaction between a protein
(transcription factor) and DNA.
Peak data


Peak data represents the locations where a
particular TF binds.
The data tells us the locations and intensities of the peaks.

(Note that, due to experimental error, peaks of low
intensity may be noise.)
Figure: ChIP-seq data for human (MCF7), E2 treatment at 45 min, chr1:883,686-958,485
Our aim


Given the DNA sequences of those peaks, find the motifs
that occur in the peak regions.
In the example below, there are two motifs: TTGACA
and GCATC.

Note that each instance has at most 1 mutation (mismatch) relative to the motif.
GCACGCGGTATCGTTAGCTTGACAATGAAGAATCCCCCCGCTCGACAGT
GCATACTTTGACACTGACTTCGCTTCTTTAATGTTTAATGAAACATGCG
CCCTCTGGAAATTAGTGCGGCATCTCACAACCCGAGGAATGACCAAATG
GTATTGAAAGTAAGGCAACGGTGATCCCCATGACACCAAAGATGCTAAG
CAACGCTCAGGCAACGTTGACAGGTGACACGTTGACTGCGGCCTCCTGC
GTCTCTTGACCGCTTAATCCTAAAGGCCTCCTATTAGTATCCGCAATGT
GAACAGGAGCGCGAGCCATCAATTGAAGCGAAGTTGACACCTAATAACT
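
To make "at most 1 mutation" concrete, the sketch below scans a sequence for windows whose Hamming distance to a given motif is within a mismatch budget. It is an illustration only: the helper names `hamming` and `find_instances` are hypothetical, mutations are assumed to be substitutions (no insertions or deletions), and the reverse strand is not searched.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_instances(sequence, motif, max_mismatches=1):
    """Yield (start, window, mismatches) for every window within the budget."""
    k = len(motif)
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        distance = hamming(window, motif)
        if distance <= max_mismatches:
            yield start, window, distance


# Second example sequence from the slide above.
seq = "GCATACTTTGACACTGACTTCGCTTCTTTAATGTTTAATGAAACATGCG"
for start, window, distance in find_instances(seq, "TTGACA"):
    print(start, window, distance)
```

Note that this only verifies a motif that is already known. The project asks for the harder problem of discovering the motif when it is not given.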
Input (I)

For every peak, we take the DNA sequence covering approximately ±200 bp around the peak.
>cmyc_1_chr1_4842133_4842148_range_chr1_4841934_4842348_intensity_20
CCTCCATACCAGCCCCAATGTTCTGCGTTCCCGAATGAAAGACACACAACACAGCCTTTATATTTTGATATGCCTAAAACTGCT
CAATGGCTGGGCCACTTCCTAGCTAGTATCCACGTGGCTATCCCACCTCTCTCTGATATTCCCAAGTCATTACTTACTAA
AATCTGTAATTACATCTTTGCTGCCCTAGGCCCAATCTGGCAGCCCTCCTGTGGCCCCTCAGGCTACTACATGGCAGCT
AAGCTCTCTGACCCACATCTTCTCAGGCACCGTGCCTCCTCTTCTCCACCTTATTCAAACATGGTGGCTCTCCTTCCTC
CTTCTTCCTGTCTGTCCCCAGCCTGGGAATTCTAAAAGTCCCACCTCTGTCTGCCCTGTTCAGCCATTGGCTGTCGGCA
TCTTTATTTACGAG
>cmyc_2_chr1_5073201_5073215_range_chr1_5073002_5073415_intensity_15
GGTCATAAACCAAGCTTCTTCAAAGATTTTTGGCTTTTTGGCACCAGTGGCCTGCAGGGTGGCGAGCTCTGCCAGTTTGAAG
TGACCAAGTTAAGTGGCCTGGGAAAGGCCATTTGGTGCGCGGTCCAGCAGTTTTGGGCGCTCTCGGCTTCCGCCCTC
AGCTGCGGTCACGTGCGGCTGCTCACGTGCCAGACGCTGCTGTCACTTCGTAGCTGTTCCGGCTTCCTCTGAGTGAG
GCTCGCAACGTCTCCCACGGAGTCGCCTTCGTTCTGCTCTGGGTCTCCCGTGGCCACTGAGACCTCGGAGCTCGACC
GGCGCCTGCCCGCCCGTGCGGCCCTCACTCCCCGAGGCTATCCAGGTGAGGCCGCCTGGGGTCCCTCCCCGGCTCC
GGAGAGCCGACTGGTTTCCCTGCCG
>cmyc_3_chr1_9530642_9530652_range_chr1_9530443_9530852_intensity_36
GTAGTCCCAACCAGGTCCTGAGCTGGTTAGCCAACCCTCAGCGCCAGTCGGGCCAACATCCGGTGACGAATCCAAGTCCCG
CCTCTAAGCCCATCTGCTGTCCAATGCCGCCCTCTGCCGGTCTTTACCTCCCCGCCTAGCTGTGAGCCGCTTCCAGAC
AACCCGGAAGTGATCTTTCCTCTTCCGGATTACGGGTCCGGACGTCCGCACGTGGTTGCCGGTTTAGGGTGCTGCTGT
AGTGGCGATACGTCCCGCCGCTGTCCCGAAGTGAGGGATCCGAGCCGCAGCGAGAGCCATGGAGGGCCAGCGCGTG
GAGGAGCTGCTGGCCAAGGCAGAGCAGGAGGAGGCGGAGAAGCTGCAGCGCATCACGGTGCACAAGGAGCTGGAG
CTGGAGTTCGACCTGGGCAACC
……………
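
The headers above appear to encode the peak name, the peak coordinates, the extracted range and the peak intensity. The following sketch parses them under that assumption; the field interpretation, the `Peak` type, and the `min_intensity` threshold are illustrative and are not specified by the slides. The demo sequences are shortened stand-ins for the real ones.

```python
import re
from typing import NamedTuple


class Peak(NamedTuple):
    name: str
    chrom: str
    peak_start: int
    peak_end: int
    range_start: int
    range_end: int
    intensity: int


# Assumed layout:
# >NAME_chrom_peakStart_peakEnd_range_chrom_rangeStart_rangeEnd_intensity_N
HEADER_RE = re.compile(
    r"^>(?P<name>.+?)_(?P<chrom>chr[^_]+)_(?P<peak_start>\d+)_(?P<peak_end>\d+)"
    r"_range_chr[^_]+_(?P<range_start>\d+)_(?P<range_end>\d+)"
    r"_intensity_(?P<intensity>\d+)$"
)


def parse_header(line):
    """Turn one FASTA header line into a Peak record."""
    m = HEADER_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognised header: {line!r}")
    return Peak(
        m["name"], m["chrom"],
        int(m["peak_start"]), int(m["peak_end"]),
        int(m["range_start"]), int(m["range_end"]),
        int(m["intensity"]),
    )


def read_peaks(lines, min_intensity=0):
    """Yield (Peak, sequence) pairs, joining wrapped sequence lines."""
    peak, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if peak is not None and peak.intensity >= min_intensity:
                yield peak, "".join(chunks)
            peak, chunks = parse_header(line), []
        else:
            chunks.append(line)
    if peak is not None and peak.intensity >= min_intensity:
        yield peak, "".join(chunks)


demo = [
    ">cmyc_1_chr1_4842133_4842148_range_chr1_4841934_4842348_intensity_20",
    "CCTCCATACC", "AGCCCCAATG",
    ">cmyc_2_chr1_5073201_5073215_range_chr1_5073002_5073415_intensity_15",
    "GGTCATAAAC",
]
for peak, sequence in read_peaks(demo, min_intensity=18):
    print(peak.name, peak.intensity, sequence)
```

Filtering on intensity is one simple way to act on the earlier remark that low-intensity peaks may be noise; a suitable cutoff depends on the dataset.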
Input (II)

A set of background sequences that are unlikely to contain the motif.
>SEQ_1
AACAAGGGAAAGAGTAGTGAGTGCTTCTTTCTATTCAGAGGGAGGGGAAGTTGCTGTTAGCTAAGACAGTCAGGACTGAGA
AGGGGGGGGGGGGTTTAACTCTCCTGGAGGGAGCTGAGAGGTAAAGGGAGGGGCGTGAGGTAGAACAAGCCGAGA
ACACAGGGCAGGTTGGTCTGACTCCAGAGCACAGTGCAGGAGCCCGGAAGTTGACTCAGTTCAGTTAGCAAGTATTTT
CACACAAGGCGTGAACACTGAAGACAAAAGCAAGAGACACAGCTCTATCTCTAAGAAGATTTTCAGAGCCAAGATCGA
TGGGGCACACCTGTTAATCCCAGCACTTAGGAGGCTGAGGCAGGAGGATCCCAAGTTCAAAACCAGCCTGGACTTGT
TTTAAGGAAAA
>SEQ_2
AAAAAAAAAAAAAAGACTTCCAGTTTAATAAATGACCAATTCAGGAATGGAGATTAGGGCTGGATGACAAGTTTTTAATTGTC
AAGGACTCAATTCTGTTTATCAGTTGGTATGGAATTATGTAAGCTTTTAGCGATATGACCGCACGGAGCAGTGTAGAGA
GTGATCTGAGAGACGCTTGGGGGTCAGGATGGAGATAGAACTCCCTCTCTATTAGAAGGTGTTTGGTGGTAGGTAACC
CTGGGCTAGCATGGTGGGTCTCTTCTTACTTAGGCTTCCATCTTTGTGGTTCAAATCCAAGAAGGACCTGCGTTCCCTC
CCTCCTTGTGATCAGCTGATTGCTAGAGCATAACTCATCTTAACTTCTCATGTACTCTCCGGGTACAGGAAGGGAGGGG
GC
>SEQ_3
CCACTGCTGACAGTGGAGCATGAAACGACCGGCTTCCTGACTATGTTGGTACCCTTTCAGGAGCCTAAAACAGTGCTTTCAA
TACTTGTGTCTATGTCTGTTAGCCACAACTTTCTAGTTTCCCAGAGAGATTTTGAAGTGTAGTTTTGTATTTGCTCAAAT
ATATATTCATATGGTGAGGTGCACATTTTTTATATTATATTTTTATTCATTTATTTTTGGTGCTTGGGAATTATACTCTAGGA
ATAAAGCGCCTGGTAGAAAGTGGCACACATCTTTAATCCCAGCACTCAGGAAGCAGAGGCAGACAAATCTCTGCGTTC
CAGGACAGCCTGGTCTATAGAGCAAGGTCCAAGCCAGCCAGGTTTACACAAAGAAACCTAGTGTGGAAAAGACAAAA
……………
Output




You need to output a ranked list of candidate motifs.
You can model each motif as a PWM or as a consensus sequence.
If you model the motif as a PWM, one of the answers for the
previous dataset is:
[figure: PWM of one answer motif]
You may also return other significant motifs.
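
When a motif is reported as a PWM, a common way to use it is to score every window of a sequence with log-odds against a background model. The sketch below assumes a uniform background (0.25 per nucleotide) and a small pseudocount to avoid taking the logarithm of zero; both values, and the function names, are illustrative choices rather than anything prescribed by the project. It also assumes sequences contain only uppercase A, C, G and T.

```python
import math

# Frequencies from the "Motif model" slide.
PWM = {
    "A": [0.1, 0.0, 0.0, 1.0, 0.1, 0.8],
    "C": [0.0, 0.1, 0.0, 0.0, 0.9, 0.1],
    "G": [0.1, 0.0, 1.0, 0.0, 0.0, 0.0],
    "T": [0.8, 0.9, 0.0, 0.0, 0.0, 0.1],
}


def log_odds(pwm, pseudo=0.01, background=0.25):
    """Convert frequencies to log2(p / background), smoothing zero entries."""
    width = len(pwm["A"])
    table = {nt: [] for nt in "ACGT"}
    for pos in range(width):
        # Renormalise the column after adding the pseudocount.
        total = sum(pwm[nt][pos] + pseudo for nt in "ACGT")
        for nt in "ACGT":
            p = (pwm[nt][pos] + pseudo) / total
            table[nt].append(math.log2(p / background))
    return table


def best_hit(sequence, table):
    """Return (score, start, window) for the highest-scoring window."""
    width = len(table["A"])
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(table[nt][i] for i, nt in enumerate(window))
        hits.append((score, start, window))
    return max(hits)


score, start, window = best_hit("GCATACTTTGACACTG", log_odds(PWM))
print(f"{window} at {start}, score {score:.2f}")
```

Unlike consensus matching with a fixed mismatch budget, the PWM score penalises a mismatch according to how rarely that nucleotide was seen at that position.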
Aim of the project

Given a sample file and a background file,


you need to implement a method that outputs a
list of motifs.
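
As a naive baseline for turning a sample file and a background file into a ranked list, one can compare how many sequences in each set contain a given k-mer. The sketch below is an assumption-laden starting point, not the intended solution: it uses exact k-mers of a fixed length `k`, ignores mismatches and the reverse strand, and ranks by a pseudocount-smoothed ratio of presence fractions. The toy sequences are invented for illustration.

```python
from collections import Counter


def kmer_presence(sequences, k):
    """Count, for each k-mer, how many sequences contain it at least once."""
    counts = Counter()
    for seq in sequences:
        seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(seen)
    return counts


def rank_kmers(sample, background, k=6, pseudo=1.0, top=10):
    """Rank k-mers by sample/background presence ratio, highest first."""
    s_counts = kmer_presence(sample, k)
    b_counts = kmer_presence(background, k)
    scored = []
    for kmer, s in s_counts.items():
        # Pseudocounts keep the ratio finite when a k-mer is absent
        # from the background.
        s_frac = (s + pseudo) / (len(sample) + 2 * pseudo)
        b_frac = (b_counts[kmer] + pseudo) / (len(background) + 2 * pseudo)
        scored.append((s_frac / b_frac, kmer))
    scored.sort(reverse=True)
    return scored[:top]


# Invented toy data: TTGACA is planted in every sample sequence.
sample = ["AATTGACAGG", "CCTTGACATT", "GTTGACACCA"]
background = ["AAAAAAAAAA", "CCCCCCCCCC", "GGGGGGGGGG"]
for ratio, kmer in rank_kmers(sample, background, top=3):
    print(kmer, round(ratio, 2))
```

Counting each k-mer once per sequence, rather than once per occurrence, keeps a single repetitive sequence from dominating the ranking.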
You need to take advantage of the fact that
this is a ChIP-seq dataset.

Hint: Read papers on ChIP-seq and understand its
properties.