De novo motif finding using ChIP-seq

De novo Motif Finding using ChIP-Seq
Presenter: Zhizhuo Zhang
Supervisor: Wing-Kin Sung
Outline
•Introduction to ChIP-seq Data
•The Impact of ChIP-seq’s Properties on Motif Finding
•Our Proposed Algorithm (Pomoda)
•Experimental Results
•Exploring Center Distribution
7/1/2016
2
Copyright 2009 @ Zhang ZhiZhuo
ChIP-seq Technique
Comparison with ChIP-chip
What Does ChIP-seq Mean to Us?
[Diagram: Sequences, Motif Finding Tools, Motif models]
•More data: good news for data mining, but also necessary for de novo motif finding
•Higher resolution: the job becomes easier (localization)
How Large Is the Data?
The definition of “large data” keeps changing!
•10 years ago: tens of sequences (promoter sequences: MEME, AlignACE)
•5 years ago: hundreds of sequences (ChIP-chip: Weeder)
•2 years ago: thousands of sequences (higher-throughput ChIP-chip: Trawler, Amadeus)
•Now: tens of thousands of sequences (ChIP-seq: ?)
What Does Higher Resolution Mean?
•Finding the main motif (that of the TF targeted by the antibody) becomes an easy job.
•The main motif is highly over-represented.
•The peak range is only about 50 bp, so simply aligning all the peak regions yields a good motif.
•Our focus may therefore shift from the main TF to the TFs that work with it.
Localization =? Over-Representation
[Figure: two histograms, labeled AR and GATA, showing Frequency (y-axis, scales up to 1000 and 450) across Location bins 1 to 61 (x-axis).]
Peak Oriented Motif Discovery
What information about a peak can be helpful?
•Peak intensity
•Peak location
Our targets: not only the main motif, but also the co-motifs sitting around the main motif.
POMODA
• Peak Oriented Motif Discovery Algorithm
[Figure: sequences centered on the ChIP-seq peak, with three labeled patterns: the main motif; a co-motif; and a pattern that should be treated as noise, as it does not exhibit a distance preference to the main motif.]
Motif Modeling
•String motif: smaller search space; enables fast string-matching algorithms
•PWM motif (PWM: position weight matrix): a more precise approximation of the real motif; statistically sound
Background Modeling
•Organism-specific background: hard to capture the negative information in the background
•Position-specific background: reveals the biological context and makes it easier to capture the negative information
Position-Specific Background
Given the peak positions in ChIP-seq, we can identify not only the active position (center) of the master TF, but also the active region of its co-motifs.
[Figure label: Peak in ChIP-seq]
Center Enrichment Score
We do not know the exact size of the active region, and it may vary between motifs. Hence, we define an odds-ratio score based on a dynamic window size:
\[
\mathrm{Score} = \max_{\mathrm{windowsize}} \left\{ \frac{\mathrm{CenterOcc} \times (\mathrm{Seqlen} - \mathrm{windowsize})}{\mathrm{BgOcc} \times \mathrm{windowsize}} \right\},
\quad \text{where } \mathrm{CenterOcc} \ge \text{minimal support}
\]
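The score can be illustrated with a minimal sketch that reads it as the odds ratio CenterOcc × (Seqlen − windowsize) / (BgOcc × windowsize), maximized over window sizes. The function name, the representation of occurrences as signed offsets from the peak center, the window step, and the pseudocount that avoids division by zero are all assumptions made for illustration and are not taken from the slides.

```python
def center_enrichment_score(offsets, seq_len, min_support=5, step=10,
                            pseudocount=1.0):
    """Return (best_score, best_window) over centered windows.

    offsets: signed distances (in bp) of motif occurrences from the peak
             center (hypothetical input format).
    seq_len: total length of the sequence around the peak.
    """
    best_score, best_window = 0.0, None
    for window in range(step, seq_len, step):
        half = window / 2.0
        # Occurrences that fall inside the centered window.
        center_occ = sum(1 for off in offsets if abs(off) <= half)
        if center_occ < min_support:
            continue  # the slide requires a minimal support for CenterOcc
        # Occurrences outside the window; the pseudocount is an assumption
        # added here so that an empty background does not divide by zero.
        bg_occ = (len(offsets) - center_occ) + pseudocount
        score = (center_occ * (seq_len - window)) / (bg_occ * window)
        if score > best_score:
            best_score, best_window = score, window
    return best_score, best_window


# Occurrences packed around the center versus spread evenly.
centered = list(range(-10, 10)) + [-400, -250, 150, 300, 450]
uniform = list(range(-500, 500, 40))
centered_score, centered_window = center_enrichment_score(centered, 1000)
uniform_score, _ = center_enrichment_score(uniform, 1000)
```

A motif whose occurrences cluster at the peak center gets a high score at a narrow window, while evenly spread occurrences score close to 1 at every window size.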
Algorithm Overview
1. Seed Finding
2. PWM Extending & Refinement
3. Redundant Motifs Filtering
Seeds Finding
Enumerate all length 6 patterns:
GGTCAC, CGGTCA, GGGTCA, AGGTCA, …, AACTTG, …, ATGACC, CAGGTC, AGGTCG, CGTGAC, CTGACC
Example PWM for the pattern AACTTG:
Po   1     2     3     4     5     6
A    0.97  0.97  0.01  0.01  0.01  0.01
C    0.01  0.01  0.97  0.01  0.01  0.01
G    0.01  0.01  0.01  0.01  0.01  0.97
T    0.01  0.01  0.01  0.97  0.97  0.01
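The seed step can be sketched as follows: enumerate every length 6 pattern over A, C, G, T and turn a pattern into a PWM that gives 0.97 to the pattern's base and 0.01 to each other base at every position. The function names are hypothetical; only the pattern length and the 0.97/0.01 values come from the slide.

```python
from itertools import product

BASES = "ACGT"


def enumerate_kmers(k=6):
    """All 4**k strings of length k over the DNA alphabet."""
    return ["".join(p) for p in product(BASES, repeat=k)]


def seed_to_pwm(seed, high=0.97, low=0.01):
    """Map a string seed to a PWM stored as {base: [prob per position]}."""
    return {b: [high if c == b else low for c in seed] for b in BASES}


kmers = enumerate_kmers(6)
pwm = seed_to_pwm("AACTTG")
for base in BASES:
    print(base, pwm[base])
```

With 0.97 for the matching base and 0.01 for each of the three others, every column sums to 1.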
PWM Extending & Refinement
Encapsulate the core PWM into a wider PWM. For example, we implant the length 6 PWM into a length 26 PWM, as follows:
Po   1     2     …    9     10    11    12    13    14    15    16    …    25    26
A    0.25  0.25  …    0.25  0.97  0.97  0.01  0.01  0.01  0.01  0.25  …    0.25  0.25
C    0.25  0.25  …    0.25  0.01  0.01  0.97  0.01  0.01  0.01  0.25  …    0.25  0.25
G    0.25  0.25  …    0.25  0.01  0.01  0.01  0.01  0.01  0.97  0.25  …    0.25  0.25
T    0.25  0.25  …    0.25  0.01  0.01  0.01  0.97  0.97  0.01  0.25  …    0.25  0.25
(Core PWM: the six columns with 0.97/0.01 entries)
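A minimal sketch of the implanting step, assuming the PWM is stored as one list of probabilities per base and that flanking columns start uniform at 0.25. The function name and the 0-based `start` parameter are hypothetical; the choice `start=9` in the usage lines is an assumption about where the core sits in the wide matrix.

```python
def embed_core_pwm(core, total_len, start, flank=0.25):
    """Place a core PWM inside a wider PWM with uniform flanking columns.

    core:  {base: [prob per position]} for the bases A, C, G, T.
    start: 0-based column of the wide PWM where the core begins.
    """
    core_len = len(core["A"])
    if start < 0 or start + core_len > total_len:
        raise ValueError("core does not fit at the requested position")
    wide = {}
    for base in "ACGT":
        row = [flank] * total_len              # uninformative flanks
        row[start:start + core_len] = core[base]  # copy the core columns
        wide[base] = row
    return wide


core = {
    "A": [0.97, 0.97, 0.01, 0.01, 0.01, 0.01],
    "C": [0.01, 0.01, 0.97, 0.01, 0.01, 0.01],
    "G": [0.01, 0.01, 0.01, 0.01, 0.01, 0.97],
    "T": [0.01, 0.01, 0.01, 0.97, 0.97, 0.01],
}
wide = embed_core_pwm(core, total_len=26, start=9)
```

Starting the flanks at 0.25 keeps them neutral, so later refinement can decide column by column whether a flanking position carries signal.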
PWM Extending & Refinement
Background instances:
A…A…GGTCA…C…C
T…G…GGTCA…A…G
G…A…GGTCA…T…T
T…G…GGTCA…G…G
……
C…T…GGTCA…T…A
Center instances:
A…A…GGTCA…C…C
T…G…GGTCA…C…G
……
C…T…GGTCA…C…A
Select the best column to update based on the Center PWM and the Bg PWM.
Result: GGTCANNNNC
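The slide does not state how the "best" column is chosen. One possible reading, purely an assumption here, is to pick the non-core column whose base distribution in the center instances differs most from that in the background instances, measured by relative entropy. The function names, the pseudocount, and the toy instances below are all hypothetical.

```python
import math

BASES = "ACGT"


def column_frequencies(instances, col, pseudocount=0.5):
    """Base frequencies at one column, smoothed with a pseudocount."""
    counts = {b: pseudocount for b in BASES}
    for seq in instances:
        if seq[col] in counts:
            counts[seq[col]] += 1
    total = sum(counts.values())
    return {b: counts[b] / total for b in BASES}


def best_column_to_update(center, background, fixed_cols):
    """Return (column, gain) for the most center-specific free column.

    Gain is the relative entropy of the center column against the
    background column (an assumed criterion, not taken from the slides).
    """
    best_col, best_gain = None, 0.0
    for col in range(len(center[0])):
        if col in fixed_cols:
            continue  # core columns are already informative
        p = column_frequencies(center, col)
        q = column_frequencies(background, col)
        gain = sum(p[b] * math.log(p[b] / q[b]) for b in BASES)
        if gain > best_gain:
            best_col, best_gain = col, gain
    return best_col, best_gain


# Toy aligned instances: the last column is always C near the peak center
# but varies in the background, as in the GGTCANNNNC example.
center = ["GGTCAACGTC", "GGTCATTAGC", "GGTCAGGCAC", "GGTCACATTC"]
background = ["GGTCAACGTA", "GGTCATTAGG", "GGTCAGGCAT", "GGTCACATTC"]
col, gain = best_column_to_update(center, background, fixed_cols={0, 1, 2, 3, 4})
```

In this toy input only the last column separates center from background, so it is the one selected.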
Redundant Motifs Filtering
1. Positions overlap by more than 5%
2. PWM divergence is less than 0.18
\[
\text{PWM divergence} = ED(P^1, P^2) = \frac{1}{\sqrt{2}\, l} \sum_{i=1}^{l} \sqrt{\sum_{b \in \{A,C,G,T\}} \left( P^1_{i,b} - P^2_{i,b} \right)^2}
\]
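A minimal sketch of the divergence, assuming it is the per-column Euclidean distance between two equal-length PWMs, averaged over the l columns and scaled by 1/√2 so that the value lies between 0 and 1. The PWM representation (a list of per-column dictionaries) and the function name are assumptions for illustration.

```python
import math


def pwm_divergence(p1, p2):
    """Average normalized Euclidean distance between aligned PWM columns.

    p1, p2: lists of columns, each column a dict {base: probability}.
    """
    if len(p1) != len(p2) or not p1:
        raise ValueError("PWMs must be non-empty and of equal length")
    total = 0.0
    for col1, col2 in zip(p1, p2):
        total += math.sqrt(sum((col1[b] - col2[b]) ** 2 for b in "ACGT"))
    return total / (math.sqrt(2) * len(p1))


col_a = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
col_c = {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0}
same = pwm_divergence([col_a, col_c], [col_a, col_c])
opposite = pwm_divergence([col_a, col_c], [col_c, col_a])
half = pwm_divergence([col_a, col_c], [col_a, col_a])
is_similar = same < 0.18  # threshold from the filtering rule
```

Two columns that put all their weight on different bases are √2 apart, which is why the 1/√2 factor caps the divergence at 1.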
Results – Comparison
1. Datasets:
   a. MCF7 dataset (ER), 4361 sequences
   b. LNCAP dataset (AR), 10000 sequences
2. Evaluation: “PWM divergence” against TRANSFAC motifs, as in Harbison et al. (2004) and Amadeus (2008)
3. Input: +/- 5000 bases around the peak for Pomoda, and +/- 200 bases around the peak for the other algorithms
4. Each motif finder reports its top 20 results
[Table: results per cell line and TF for each motif finder (columns: Pomoda, Amadeus, Trawler, Weeder); legend: <0.12, <0.18, <0.24.]
Cell    TF
MCF7    ER, HNF3, GATA, AP1, SP1, BACH1, E2F, OCT1, AP4
LNCAP   AR, HNF3, NF1, GATA, OCT, ETS
Comparison
                     | Pomoda                                  | Amadeus                              | Trawler                            | Weeder
Background model     | Position-specific                       | Organism-specific                    | Organism-specific                  | Organism-specific
Motif model          | PWM (k-mer exact match)                 | PWM (k-mer with mismatches)          | PWM (IUPAC string in initial scan) | k-mer with mismatches
Algorithm            | Exhaustive search + PWM column updating | Add mismatches, merge (recursively), EM | Exhaustive search + clustering  | Exhaustive search
Motif length         | Variable length                         | Fixed length                         | Semi-variable length               | Semi-variable length
Gap detection        | Supported                               | Not supported                        | Not supported                      | Not supported
Localization         | Center window size                      | Over-represented bins                | Not supported                      | Not supported
Sequence weighting   | Supported                               | Not supported                        | Not supported                      | Not supported
Average running time | 30 min                                  | 93 min                               | >4 hours                           | >4 hours
Center Distribution
[Figure: Foxa1 center distribution histogram; x-axis ticks from -1900 to 1900, y-axis ticks from 0 to 1600.]
Mixture Model: y is the sum of an exponential term in x and a constant term in c, of the form y = (…) · e^(… x) + (1 − …) · c
Values shown on the slide: 0.1503, 1.3738, c = 0.0506, x = 1.94
Range = x × binsize = 194 bp