ChIP-seq Methods & Analysis

Gavin Schnitzler
Asst. Prof. Medicine TUSM, Investigator at MCRI, TMC
gschnitzler@tuftsmedicalcenter.org
617-636-0615
ChIP-seq COURSE OUTLINE
• Day 1: ChIP techniques, library production,
UCSC browser tracks
• Day 2: QC on reads, Mapping binding site
peaks, examining read density maps.
• Day 3: Analyzing peaks in relation to
genomic features, etc.
• Day 4: Analyzing peaks for
transcription factor binding site
consensus sequences.
• Day 5: Variants & advanced approaches.
DAY 4 OUTLINE
• Position weight matrices to find
transcription factor binding sites (TFBSes)
• TFBS enrichment in peaks using CentDist
• TFBS enrichment using Storm in UNIX
• Mining Storm results
• Disambiguating similar matrices w/
STAMP
DAY 3 REMNANT
• Analyzing overlaps between peak &
regulated genes in UNIX
How can we test the significance of binding
site association w/ regulated genes?
If you haven’t already, go to the cluster & copy the .bed
and .txt files to your /cluster/shared/userID/chip folder
(mkdir chip & cd chip if you don’t have this folder yet):
cp /cluster/tufts/cbi*/Ch*/Sam*/ER*beds/*.* .
The .txt files list the transcription start sites (TSSes) of
genes that were up- or down-regulated by estrogen in
aorta or liver (by RNA-seq analysis).
Overlaps between peaks & genes
Take a look at one of them using head [name].txt
chr6    73171625    +    Dnahc6
chr2    25356026         C8g
chr6    65540391         Tnip3
…
The file format is (tab-delimited) chromosome, TSS,
transcription direction (+=sense) & geneID.
You can get all this info easily from the UCSC browser, for individual genes
(by hand)…
… or you can get this information for all genes & extract what you want for
your gene set of interest. Check out the RNA-seq module for info on
making & handling .gtf files.
Overlaps between peaks & genes 2
The overlap program can recognize this type of file & will test
for overlaps between ChIP-seq peaks and regions around the
listed TSSes (default +/-1000 bp).
You can also change this range by specifying a -range
variable.
Find the overlaps between 10-kb regions around TSSes of
genes up- or downregulated in each tissue & the corresponding
ER binding site data using variations on:
bsub perl /cluster/home/g/s/gschni01/perl*/overlap_1.3.pl
Ao_up_TSS.txt AoE_all.bed -outfile Ao_up_v_AoE.overlap
(these commands are in /cluster/tufts/cbi*/Ch*/Sam*/Fin*/workflow2.txt)
Note that the number of overlaps (hits), the number of genes (tests) and the
number of overlaps expected by chance divided by the number of genes
(background frequency) provide all the information you need for binomial
tests. Write these numbers down for each comparison.
Accessing the R statistical language
On the PCs in this room:
Start->programs->R
To get R for your PC (free): http://cran.r-project.org/
To get RStudio (allows for easier management of R projects):
http://www.rstudio.com/
On the cluster type: module load R
Then: bsub -Ip -q int_public6 R
To exit use the R command q()
For more info on using R & Unix see the UNIX resources & R resources at:
http://sites.tufts.edu/cbi/resources/rna-seq-course/
Binomial tests in R
Use the R command binom.test(hits, tests, bkg_freq) to assess the
significance of the overlaps you see.
For Ao_down_TSS.txt vs. AoE.bed (2 hits, 118 genes, 1.03 overlaps expected by chance):
binom.test(2, 118, 1.03/118)
Which comparisons show significant enrichment? Do any show anti-enrichment?
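If R is not at hand, the same kind of test can be sketched in plain Python. This is a minimal illustration, not the course's method: the function names are made up for this example, and the two-sided value uses the usual "sum every outcome no more likely than the observed one" definition.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k hits in n tests with per-test hit rate p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_test(hits, tests, bkg_freq):
    """Exact binomial p-values: (enrichment, depletion, two-sided)."""
    pmf = [binom_pmf(k, tests, bkg_freq) for k in range(tests + 1)]
    p_enriched = sum(pmf[hits:])         # P(X >= hits)
    p_depleted = sum(pmf[:hits + 1])     # P(X <= hits)
    # Two-sided: add up every outcome no more likely than the observed one
    cutoff = pmf[hits] * (1 + 1e-7)
    p_two_sided = min(1.0, sum(q for q in pmf if q <= cutoff))
    return p_enriched, p_depleted, p_two_sided

# Slide example: 2 hits, 118 genes, 1.03 overlaps expected by chance
enriched, depleted, two_sided = binom_test(2, 118, 1.03 / 118)
print(f"enrichment p={enriched:.3f}  depletion p={depleted:.3f}  two-sided p={two_sided:.3f}")
```

A small enrichment p suggests more overlaps than chance; a small depletion p suggests anti-enrichment.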
What is a PWM?
• Transcription factor binding sites (TFBSs) are
usually slightly variable in their sequences.
• A position frequency matrix (PFM) records how often
each base is seen at each index position of the motif;
once normalized, it gives the probability of seeing a
given base at each position.
• This is built from sequences known to bind the TF
(e.g. 46 sequences for the PFM below).
Pos    1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
A     18   8   5   4   1  29   7   7   7   0   1  39   1   1   6
C      8   3   3   9  33   4  21  15  14   0   0   1  43  39  18
G     13  31  34   9   8  10  11  15  19   4  44   3   0   1   6
T      7   4   4  24   4   3   7   9   6  42   1   3   2   5  16
Con    N   G   G   T   C   A   N   N   N   T   G   A   C   C   N
(position 1 = 5’ end, position 15 = 3’ end)
Adapted from presentation by Victor Jin, Department of Biomedical Informatics, The Ohio State University
PFM->normalized PFM->PWM
Binding site data
1. acggcagggTGACCc
2. aGGGCAtcgTGACCc
3. cGGTCGccaGGACCt
4. tGGTCAggcTGGTCt
5. aGGTGGcccTGACCc
6. cTGTCCctcTGACCc
7. aGGCTAcgaTGACGt
...
41. cagggagtgTGACCc
42. gagcatgggTGACCa
43. aGGTCAtaacgattt
44. gGAACAgttTGACCc
45. cGGTGAcctTGACCc
46. gGGGCAaagTGACTg
Position frequency matrix (PFM)
(also known as raw count matrix)
Given N sequence fragments of fixed length, one
can assemble a position frequency matrix
(number of times a particular nucleotide appears
at a given position).
A normalized PFM, in which each column adds up
to a total of one, is a matrix of probabilities for
observing each nucleotide at each position (e.g.
divide by 46).
Position weight matrix (PWM)
(also known as position-specific scoring matrix)
The normalized PFM is converted to log-scale for
efficient computational analysis. To eliminate null
values before log-conversion, and to correct for small
samples of binding sites, a sampling correction,
known as pseudocounts, is added to each cell of the
PFM.
Position Weight Matrix for ERE
Converting a PFM into a PWM
PFM (raw counts, N = 46):
Pos    1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
A     18   8   5   4   1  29   7   7   7   0   1  39   1   1   6
C      8   3   3   9  33   4  21  15  14   0   0   1  43  39  18
G     13  31  34   9   8  10  11  15  19   4  44   3   0   1   6
T      7   4   4  24   4   3   7   9   6  42   1   3   2   5  16

For each matrix element do:
\[ w(b,i) = \log_2 \frac{p(b,i)}{p(b)} = \log_2 \frac{\left(f_{b,i} + \sqrt{N}/4\right) / \left(N + \sqrt{N}\right)}{p(b)} \]

PWM:
Pos      1      2      3      4      5      6      7      8      9     10     11     12     13     14     15
A     0.58  -0.44  -0.98  -1.21  -2.29   1.22  -0.60  -0.60  -0.60  -2.96  -2.29   1.62  -2.29  -2.29  -0.72
C    -0.44  -1.49  -1.49  -0.30   1.39  -1.21   0.78   0.34   0.25  -2.96  -2.96  -2.29   1.76   1.62   0.46
G     0.16   1.31   1.44  -0.30  -0.44  -0.17  -0.06   0.34   0.65  -1.21   1.79  -1.49  -2.96  -2.29  -0.64
T    -0.60  -1.21  -1.21   0.96  -1.21  -1.49  -0.60  -0.30  -0.78   1.73  -2.29  -1.49  -1.84  -0.98   0.23

f_{b,i} - raw count (PFM matrix element) of nucleotide b in column i
N - number of sequences used to create PFM (= column sum)
\sqrt{N} and \sqrt{N}/4 - pseudocounts (correction for small sample size)
p(b) - background frequency of nucleotide b
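The conversion can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the code used to build the slide: the function name is made up, the background is assumed to be 0.25 per base, and the pseudocounts are sqrt(N)/4 per cell and sqrt(N) per column. With these choices a count of 18 gives 0.58 and a count of 0 gives -2.96, matching the slide's weights for positions 1-14; the slide's position-15 weights differ slightly from what the listed counts give.

```python
from math import log2, sqrt

# ERE position frequency matrix from the slide (46 sites, 15 positions)
PFM = {
    "A": [18, 8, 5, 4, 1, 29, 7, 7, 7, 0, 1, 39, 1, 1, 6],
    "C": [8, 3, 3, 9, 33, 4, 21, 15, 14, 0, 0, 1, 43, 39, 18],
    "G": [13, 31, 34, 9, 8, 10, 11, 15, 19, 4, 44, 3, 0, 1, 6],
    "T": [7, 4, 4, 24, 4, 3, 7, 9, 6, 42, 1, 3, 2, 5, 16],
}

def pfm_to_pwm(pfm, background=None):
    """Convert raw counts to log2 weights using sqrt(N) pseudocounts."""
    background = background or {b: 0.25 for b in pfm}  # assumed uniform background
    width = len(next(iter(pfm.values())))
    pwm = {b: [] for b in pfm}
    for i in range(width):
        n = sum(pfm[b][i] for b in pfm)                 # column sum = number of sites
        for b in pfm:
            p_bi = (pfm[b][i] + sqrt(n) / 4) / (n + sqrt(n))
            pwm[b].append(log2(p_bi / background[b]))
    return pwm

pwm = pfm_to_pwm(PFM)
for base in "ACGT":
    print(base, " ".join(f"{w:6.2f}" for w in pwm[base]))
```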
Scoring putative EREs by scanning the promoter w/ PWM
GGGTCAGCATGGCCA
Pos      1      2      3      4      5      6      7      8      9     10     11     12     13     14     15   Row Sum
A     0.58  -0.44  -0.98  -1.21  -2.29   1.22  -0.60  -0.60  -0.60  -2.96  -2.29   1.62  -2.29  -2.29  -0.72
C    -0.44  -1.49  -1.49  -0.30   1.39  -1.21   0.78   0.34   0.25  -2.96  -2.96  -2.29   1.76   1.62   0.46
G     0.16   1.31   1.44  -0.30  -0.44  -0.17  -0.06   0.34   0.65  -1.21   1.79  -1.49  -2.96  -2.29  -0.64
T    -0.60  -1.21  -1.21   0.96  -1.21  -1.49  -0.60  -0.30  -0.78   1.73  -2.29  -1.49  -1.84  -0.98   0.23
Max   0.58   1.31   1.44   0.96   1.39   1.22   0.78   0.34   0.65   1.73   1.79   1.62   1.76   1.62           17.20
Min  -0.60  -1.49  -1.49  -1.21  -2.29  -1.49  -0.60  -0.60  -0.78  -2.96  -2.96  -2.29  -2.96  -2.29          -24.02

Absolute score of the site:
\[ S = \sum_{i=1}^{m} w(b,i) = 11.57 \]

Relative score (this is also called “functional depth”):
\[ \text{relative\_score} = \frac{\text{Absolute\_score} - \text{Minimum\_score}}{\text{Maximum\_score} - \text{Minimum\_score}} = \frac{11.57 - (-24.02)}{17.20 - (-24.02)} = 0.86 \]
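The scoring step can be sketched as follows. The function names are made up for this illustration, and the matrix is the slide's PWM restricted to positions 1-14, because the slide's Max/Min rows and the 11.57 total cover only those 14 positions. Only the first 14 bases of the site are therefore scored.

```python
# Slide PWM, positions 1-14 (the positions covered by the Max/Min rows)
PWM = {
    "A": [0.58, -0.44, -0.98, -1.21, -2.29, 1.22, -0.60, -0.60, -0.60, -2.96, -2.29, 1.62, -2.29, -2.29],
    "C": [-0.44, -1.49, -1.49, -0.30, 1.39, -1.21, 0.78, 0.34, 0.25, -2.96, -2.96, -2.29, 1.76, 1.62],
    "G": [0.16, 1.31, 1.44, -0.30, -0.44, -0.17, -0.06, 0.34, 0.65, -1.21, 1.79, -1.49, -2.96, -2.29],
    "T": [-0.60, -1.21, -1.21, 0.96, -1.21, -1.49, -0.60, -0.30, -0.78, 1.73, -2.29, -1.49, -1.84, -0.98],
}
WIDTH = len(PWM["A"])

def absolute_score(site, pwm=PWM):
    """Sum the weight of the observed base at each matrix position."""
    return sum(pwm[base][i] for i, base in enumerate(site[:WIDTH].upper()))

def relative_score(site, pwm=PWM):
    """Functional depth: where the site falls between the worst and best possible score."""
    best = sum(max(pwm[b][i] for b in pwm) for i in range(WIDTH))
    worst = sum(min(pwm[b][i] for b in pwm) for i in range(WIDTH))
    return (absolute_score(site, pwm) - worst) / (best - worst)

def scan(sequence, min_depth=0.85, pwm=PWM):
    """Slide the matrix along a sequence and report windows at or above min_depth."""
    hits = []
    for start in range(len(sequence) - WIDTH + 1):
        depth = relative_score(sequence[start:start + WIDTH], pwm)
        if depth >= min_depth:
            hits.append((start, round(depth, 2)))
    return hits

site = "GGGTCAGCATGGCCA"
print(round(absolute_score(site), 2), round(relative_score(site), 2))
```

The 0.85 threshold in scan is an arbitrary value chosen for the example, and only the forward strand is scanned here.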
Estimating p. values for a match to the matrix
GGGTCAGCATGGCCA
This sequence had a functional depth (f) of 0.86
The summed probabilities of all sequences with f >=.86 gives
the p.value for this sequence = chance that f>=.86 would be
achieved by a randomized DNA sequence.
Short matrices can reach f > .9 but still have high p. values
– thus f is the best measure for short seqs.
Long matrices can have very low p. values but still have f< .9
– thus p.value is the best measure for long seqs.
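One way to get such a p. value without enumerating every possible sequence is to build the score distribution one matrix position at a time. The sketch below is an assumption-laden illustration, not the method any particular program uses: it assumes a uniform 25% background, rounds weights to 0.01, uses the slide's PWM for positions 1-14, and thresholds on the absolute score (equivalent to thresholding on f, since f rises with the score).

```python
from collections import defaultdict

# Slide PWM, positions 1-14
PWM = {
    "A": [0.58, -0.44, -0.98, -1.21, -2.29, 1.22, -0.60, -0.60, -0.60, -2.96, -2.29, 1.62, -2.29, -2.29],
    "C": [-0.44, -1.49, -1.49, -0.30, 1.39, -1.21, 0.78, 0.34, 0.25, -2.96, -2.96, -2.29, 1.76, 1.62],
    "G": [0.16, 1.31, 1.44, -0.30, -0.44, -0.17, -0.06, 0.34, 0.65, -1.21, 1.79, -1.49, -2.96, -2.29],
    "T": [-0.60, -1.21, -1.21, 0.96, -1.21, -1.49, -0.60, -0.30, -0.78, 1.73, -2.29, -1.49, -1.84, -0.98],
}
SCALE = 100  # scores are tracked as integers in units of 0.01

def score_distribution(pwm, background=None):
    """Map each reachable total score (integer units) to its probability in random DNA."""
    background = background or {b: 0.25 for b in pwm}  # assumed base composition
    dist = {0: 1.0}
    for i in range(len(next(iter(pwm.values())))):
        nxt = defaultdict(float)
        for score, prob in dist.items():
            for base in pwm:
                nxt[score + round(pwm[base][i] * SCALE)] += prob * background[base]
        dist = nxt
    return dist

def p_value(abs_score, pwm=PWM):
    """Chance that a random sequence scores at least abs_score against the matrix."""
    cutoff = round(abs_score * SCALE)
    return sum(prob for score, prob in score_distribution(pwm).items() if score >= cutoff)

print(f"p = {p_value(11.57):.2e}")
```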
Preparing for PWM search
Launch WinSCP (Start->programs->WinSCP)
Navigate to:
/cluster/tufts/cbicourse/ChIPseq/Sample_NGS_data/Final_output_files
Pull over the “ipvinput19_peaks.xls” file to the PC.
(this is the MACS output file that we generated yesterday)
Open it in Excel
Making .bed file w/ +/-200 bp around
peak summit (where we expect TFBS
enrichment will be greatest)
chr    start     end       length  summit  tags  -10*LOG10(pvalue)  fold_enrichment  FDR(%)
chr7   74606586  74607824  1239    571     181   3132.99            34.87            0
chr10  94601968  94603119  1152    541     174   3135.11            34.76            0
chr2   1.67E+08  1.67E+08  809     377     18    100.44             4.7              0.06
chr12  34760179  34761206  1028    496     22    101.03             4.17             0.06
chrX   48371756  48372420  665     437     18    100.29             4.12             0.06
Add 3 new columns for the .bed file:
chr:   =same row, chr column
start: =start col+summit-200
end:   =start col+summit+200
•Copy these 3 columns (without any header row).
•In WinSCP click on any file on the PC, then on files->new->file
& provide a name (“LiE_chr19_400bp.bed”) to edit a new
simple text file.
•Paste, save & close.
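The same +/-200 bp windows can be produced without Excel. This is a minimal sketch, assuming the MACS file is tab-delimited with "#" comment lines and the column names shown on the slide (chr, start, summit); the function name is made up, and clamping the start at 0 is an added safeguard that is not part of the slide's formula.

```python
import csv

def summit_windows(macs_lines, flank=200):
    """Return (chr, start, end) windows of +/-flank bp around each peak summit."""
    data = (line for line in macs_lines if line.strip() and not line.startswith("#"))
    reader = csv.DictReader(data, delimiter="\t")
    windows = []
    for rec in reader:
        # As in the slide's formula, summit is an offset from the peak start
        center = int(rec["start"]) + int(rec["summit"])
        windows.append((rec["chr"], max(0, center - flank), center + flank))
    return windows

# Two peaks from the slide; with a real file, pass open("ipvinput19_peaks.xls") instead
example = [
    "# comment lines at the top of the file are skipped\n",
    "chr\tstart\tend\tlength\tsummit\ttags\n",
    "chr7\t74606586\t74607824\t1239\t571\t181\n",
    "chr10\t94601968\t94603119\t1152\t541\t174\n",
]
for chrom, start, end in summit_windows(example):
    print(f"{chrom}\t{start}\t{end}")
```

Reading the raw file also sidesteps Excel displaying large coordinates as 1.67E+08.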
Making a file of control .bed regions
[Spreadsheet: the MACS peak table (chr7, chr10, chr2, chr12, chrX, …) with a “peak ctrs.” column group (chr, start, end) and a “control regions” column group (chr, start, end)]
Control region formulas:
chr:   =peaks:chr
start: =peaks:start-5000
end:   =peaks:end-5000
•5000 bp is far enough away to not be part of an enhancer composed of the
ER binding site... but is close enough to likely be in the same general
chromatin territory (e.g. accessible euchromatin vs. inaccessible
heterochromatin)
•Copy these columns & make a “CTRL_chr19_400bp.bed” file with WinSCP
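The offset step is simple enough to script. A minimal sketch, with made-up function names; skipping windows that would start before position 0 is an added assumption, not something the slide specifies.

```python
def offset_controls(windows, offset=5000):
    """Shift each (chr, start, end) window upstream by offset bp to make control regions."""
    controls = []
    for chrom, start, end in windows:
        if start - offset >= 0:          # skip windows that would run off the chromosome start
            controls.append((chrom, start - offset, end - offset))
    return controls

def to_bed(rows):
    """Format rows as headerless, tab-delimited BED text."""
    return "".join(f"{chrom}\t{start}\t{end}\n" for chrom, start, end in rows)

# +/-200 bp windows for the first two slide peaks
peak_windows = [("chr7", 74606957, 74607357), ("chr10", 94602309, 94602709)]
print(to_bed(offset_controls(peak_windows)), end="")
```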
CentDist
A TFBS enrichment program designed for ChIP-seq data
Assumes that TFBS-matrix hits will be most highly enriched at
the centers of ChIP-seq peaks.
Adds PWM score to “peakiness” score (i.e. how much more
enriched the TF site is in the center of the peak) -> final p. val.
[Figure: three example hit-distribution panels]
• Good enrichment, good shape (best p.)
• Good enrichment, OK shape
• Good enrichment, poor shape (higher p. val.)
Go to:
http://biogpu.ddns.comp.nus.edu.sg/~chipseq/webseqtools2/TASKS/Motif_Enrichment/submit.php?email=guest
…or (more simply) just google centdist and click on
the first link (should end in /centdist/)
Run CentDist
Give centdist a name for your run
Upload your +/-200 bp .bed file
(CentDist does not need a separate background file, instead using flanking
sequences at a case-specific optimized distance as background)
Check “Jaspar”, “vertebrate”
& set max-co-motif distance to 3000
Then click Submit Job
On the new window that opens click “turn on autorefresh” so
you will be notified when the job ends
Jaspar vs. Transfac
Jaspar is a freely-available set of TFBS matrices that can be
downloaded from jaspar.genereg.net
Transfac is a commercial product; the latest release
must be paid for. A version of Transfac (from ~2006) is
available on the cluster (e.g.
/cluster/home/g/s/gschni01/vertebrates.mat)
Which to use? Both, ideally, but beware that programs like
CentDist will give you results from Transfac matrices – and you
won’t be able to find out details of what they are.
CentDist Results
View by factors, put in max number & hit go.
•P. values (based on a Score composed of Z0 (enrichment) & Z1 (peakiness))
•Distribution graph
•Weblogo representation of Jaspar matrix
Shows information content at each position. A, G, C & T at 25% each -> 0 bits;
only 1 base at 100% -> 2 bits. Bases most highly over-represented relative to
chance are largest.
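The bit values can be checked with a short calculation. This sketch uses the standard information-content formula for DNA (2 bits minus the column's entropy) and ignores the small-sample correction some logo tools apply; the function names are made up for the example.

```python
from math import log2

def information_content(probs):
    """Bits of information in one motif column: 2 minus the Shannon entropy."""
    return 2.0 + sum(p * log2(p) for p in probs if p > 0)

def letter_heights(probs, bases="ACGT"):
    """Height of each letter in the logo column: its frequency times the column's bits."""
    bits = information_content(probs)
    return {b: p * bits for b, p in zip(bases, probs)}

print(information_content([0.25, 0.25, 0.25, 0.25]))  # all four bases equally likely
print(information_content([0.0, 0.0, 1.0, 0.0]))      # a single base every time
print(letter_heights([1 / 46, 0 / 46, 44 / 46, 1 / 46]))  # position 11 of the ERE PFM
```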
How many enriched TF sites are
there really?
Matrix hit enrichment for many factors, are all of them real?
V$jaspar_HNF4A
V$jaspar_NR2F1
V$jaspar_ESR1
Maybe not, notice how similar top sites are to each other and to
estrogen response elements (EREs) such as V$jaspar_ESR1
Downloading CentDist Results
Click on download as text & save the file somewhere you
remember.
Open it in Excel. It gives basic summary statistics & a few other
things.
Many questions unanswered:
-What is the fold enrichment over background?
-What are the peaks with the greatest enrichment for
each factor?
-What are the TFBS hit locations in each peak?
-Which are the real enriched TFBSes & which are just
showing up by homology?
-Do certain factors group together into the same
peaks?
Storm
Storm is a straightforward PWM scanning program that runs in
UNIX.
Its greatest advantage is that it gives you all of the
unprocessed output data, which allows you to do much
more powerful analyses!
It also allows us to specify thresholds for matches to the matrix
– allowing us to use functional depth as well as p. value
Getting DNA for Storm
To run storm, we first need to get the actual DNA sequence
for centers of our peaks (where we expect the greatest
enrichment for TFBSes to be).
Go to the UCSC genome browser at: genome.ucsc.edu
Under genome choose mouse mm9
Then choose add custom track & upload your +/-200 bp .bed file.
Click on Tools->Table Browser
Select your new track
Select output format “sequence”
Provide a file name “LiE_chr19_400bp.fa” & hit “get output”
(.fa denotes a simple ‘fasta’ format sequence file.)
Hit ‘get output’ again on the next page
Now do the same for your “CTRL_chr19_400bp.bed” file.
Cleaning up our .fa files
Use WinSCP to move these .fa files and their corresponding
.bed files to your …/chip directory.
Each entry in the .fa file has a header with special
characters in it that confuse storm.
All of the commands below are in the file
/cluster/tufts/cbi*/Ch*/Sam*/Final*/workflow2.txt… cat this to your
screen, to copy & paste commands.
To fix this, go to your …/chip directory in Putty & do:
perl /cluster/home/g/s/gschni01/perl*/Lax_convert.pl
LiE_chr19_400bp.fa > LiE_chr19_400bp_converted.fa
To see what has changed use:
head *.fa
Do the same for your “CTRL_chr19_400bp.fa” file.
Running storm
First set some path variables:
export CREAD=/cluster/home/g/s/gschni01/cread-0.84
export PATH=$PATH:$CREAD/bin
Then run storm for your IP .fa file:
bsub -oo LiE_chr19_400bp_p.storminfo storm -p -t 0.0005 -s
LiE_chr19_400bp_converted.fa -o LiE_chr19_400bp_p.storm
/cluster/home/g/s/gschni01/Jaspar_non_redundant_vertebrate.mat
And for your control .fa file:
bsub -oo CTRL_chr19_400bp_p.storminfo storm -p -t 0.0005 -s
CTRL_chr19_400bp_converted.fa -o CTRL_chr19_400bp_p.storm
/cluster/home/g/s/gschni01/Jaspar_non_redundant_vertebrate.mat
Use more to look at one of your .storm output files
(press space for the next page, ctrl-c to exit)
Interpreting Storm data
Run the dme_parse perl program to gather and tabulate
your storm data:
bsub -oo LiE_chr19_400bp_p.dmeparseinfo perl
/cluster/home/g/s/gschni01/perl*/dme_parse5.4.pl
LiE_chr19_400bp_p.storm LiE_chr19_400bp.bed peaks
bsub -oo CTRL_chr19_400bp_p.dmeparseinfo perl
/cluster/home/g/s/gschni01/perl*/dme_parse5.4.pl
CTRL_chr19_400bp_p.storm CTRL_chr19_400bp.bed
peaks
dme_parse outputs
…storm.bed file:
Has UCSC browser tracks for each TFBS matrix with
locations of all hits in bed format.
…storm.map file:
Lists all input matrices followed by the PFM derived
from all of the hits to this matrix from our data.
…storm.info file:
Summarizes a lot of information about matrix hits
Move the .info files to your PC with WinSCP & open
them in Excel. The file provides summary statistics for # of peaks
with 0, 1, 2, etc. hits, total hits, and normalized hits per 50 bp vs
distance from peak center.
dme_parse outputs
Using the .info file to plot relative density of TFBS hits in aorta IP,
liver IP & offset controls:
[Plots: average matches/kb (y axis) vs. BP from peak apex (x axis, -225 to 225), one panel each for ER, MYC and EBF. Series in each panel: AoE_*_avg, BKG4_AoE_*_avg, LiE_*_avg, BKG4_LiE_*_avg]
dme_parse outputs
Using the .info files to structure binomial tests
Hits= # of matches to each matrix in IP data
Tests=# of times storm tested for a match
=(# of peaks) * (400 bp length of peaks - matrix length)
Background freq= matches to offset control peak data/# tests
(same # tests as for IP)
Using the .info files to determine fractional enrichment
Hit frequency in IP data/Hit frequency in offset control
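These quantities are easy to compute once the hit counts are read off the .info files. A minimal sketch: the function name and the example counts are hypothetical, and the test count follows the slide's definition, (# of peaks) * (400 - matrix length).

```python
def storm_enrichment(ip_hits, ctrl_hits, n_peaks, matrix_length, peak_length=400):
    """Return (tests, background frequency, fold enrichment) for one matrix."""
    tests = n_peaks * (peak_length - matrix_length)   # slide's definition of # tests
    ip_freq = ip_hits / tests
    bkg_freq = ctrl_hits / tests                      # offset controls use the same # tests
    fold = ip_freq / bkg_freq if ctrl_hits else float("inf")
    return tests, bkg_freq, fold

# Hypothetical counts: 150 IP hits and 50 control hits over 100 peaks, 15-bp matrix
tests, bkg_freq, fold = storm_enrichment(150, 50, 100, 15)
print(tests, round(bkg_freq, 5), round(fold, 2))
```

The hits, tests and bkg_freq values then go into binom.test(hits, tests, bkg_freq) as before.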
dme_parse outputs
.freqs file: Number of hits to each matrix for each peak
The distribution of hits per peak in the offset background
establishes the # of hits a peak needs to be p.<=.05 enriched over background
• Allows identification of sites at which a given TFBS may be
functionally targeted (candidates for further testing)
• Can also look for significant overlaps between the peaks with
enrichment for 2 different factors - to identify cooperative versus
antagonistic interactions.
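The cutoff idea can be sketched as follows. It assumes the hits-per-peak counts have already been read from the .freqs files into Python lists and dicts; the function names, peak names and counts are all hypothetical.

```python
def enrichment_cutoff(background_hits_per_peak, alpha=0.05):
    """Smallest hit count reached by no more than alpha of the background peaks."""
    n = len(background_hits_per_peak)
    if n == 0:
        raise ValueError("need at least one background peak")
    k = 0
    while sum(1 for h in background_hits_per_peak if h >= k) / n > alpha:
        k += 1
    return k

def enriched_peaks(ip_hits_per_peak, cutoff):
    """Names of IP peaks whose hit count meets the cutoff."""
    return [peak for peak, hits in ip_hits_per_peak.items() if hits >= cutoff]

# Hypothetical background: 100 offset-control peaks, most with no hits
background = [0] * 90 + [1] * 6 + [2] * 3 + [3] * 1
cutoff = enrichment_cutoff(background)
print(cutoff, enriched_peaks({"peak1": 0, "peak2": 2, "peak3": 4}, cutoff))
```

Peak lists made this way for two factors can then be compared for overlap.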
Details on how to do these analyses are in
ChIPseq_analysis_methods_2013_02_11 on the cbi
website.
STAMP
Go to www.benoslab.pitt.edu/stamp/index.php
STAMP lets you compare matrices for evolutionary
similarities to each other.
Go to your CentDist output.
Create a new column in which you change the names of
the factors to fit with the names in the
Jaspar_non_redundant_vertebrate.mat file you used for Storm.
=SUBSTITUTE(B2,"V$jaspar_","Jaspar$"), & propagate down
Select all matrix names w/ p.<.05 & paste them into a new
file called “select_mats.txt” in your /chip folder on the
cluster using WinSCP.
Getting STAMP to help
classify our CentDist top hits
perl /cluster/home/g/s/gschni01/perl*/MatrixSelect.pl
/cluster/home/g/s/gschni01/Jaspar_non_redundant_vertebrate.mat
select_mats.txt select_mats.mat
Now, open the select_mats.mat file with WinSCP, copy
everything & paste it into STAMP.
Keep all the STAMP defaults & hit submit.
STAMP Tree
This indicates that
enrichment of PPARG,
RORA, NR4A2 could be
just because of their
similarity to EREs.
Other enriched sites, such
as SP1, FoxA2 & Myf, fall
in separate homology
classes.
To further distinguish
which one is real, you can
use the enrichment ratios
& p. values (the “real”
TFBS should be best in
both of these).