ChIPseq_analysis_methods_2013_02_11

Workflow for ChIP-seq & RNA-seq data analysis, GRS 2012
WORKFLOW FOR ANALYSIS OF CHIP-SEQ DATA INCLUDING ANALYSIS OF ENRICHED
TRANSCRIPTION FACTOR BINDING SITES
-Gavin Schnitzler 02/11/2013
1) QUALITY CONTROL ON SEQUENCING RESULTS
Open the fastqc_report.html file.
Check per base sequence quality. This should be in the green range throughout. Somewhat
lower quality is expected in the first ~8bp (due to technical aspects of the sequencing method),
which is why these bases are often excluded from the analysis. If other 5’ or 3’ end bases fall
out of the green range you may also want to exclude these with the caveats that you’ll need at
least ~25 bp to map to a mammalian genome & you will want to perform the same trim on all
samples.
Check per sequence quality scores. Ideally there will be a sharp peak in the 30’s & relatively few
reads below this.
Other metrics often don’t link properly to the .html report, but you can access each of these by
opening the images subfolder of the fastqc results folder:
duplication_levels.png shows the number of exact duplicate reads. If most of your reads have a
duplication value greater than 1 this indicates that you are sequencing multiples of the same
fragment that were generated by the PCR amplification step in library production. Avoiding this
is the major reason why you need to start with ~3ng of ChIP fragments and why you want to
limit amplification to ~18 cycles. If there is a single curve centered around 2 or 3 your data
should still be fine – but will be about as accurate as having 1/3 as many unique sequences.
The kmer content values in the .html report & kmer_profiles.png image describe certain short
sequences that are over represented compared to chance predicted from GC content. Usually
TTTTT & AAAAA show up & a few others, which are just normal characteristics of mammalian
DNA. Don’t worry about this unless certain sequences show up 100s of times greater than
background and/or the kmers in your input DNA sample differ greatly from those in your ChIP
sample… in which case these kmers could be an indicator of contamination by very high copies
of the same sequence – potentially a PCR fragment from some other experiment that
contaminated your ChIP sample.
Per base GC content & sequence content .png files should show roughly straight lines with the
exception of the first ~8 or 9 bp, which is expected (as mentioned above).
The per sequence GC content .png file should have a single peak centered over the expected
%GC for the genome in question (~45% for mammals). Sharp peaks of any given GC content
probably represent highly repeated sequences (resulting perhaps from contamination of your
ChIP DNA with a PCR product of some specific sequence that was hanging around the lab). A
second broad peak of higher GC content probably indicates bacterial DNA contamination. An
example of contamination with a common soil bacterium that was probably growing in a ChIP
buffer is shown here:
[Figure: per sequence GC content plot for a contaminated sample]
Note that none of these issues should completely prevent analysis of your data. For instance, non-mammalian sequences won’t map to the genome & over-amplified artifact sequences are
removed by the recommended procedure of removing duplicate reads before calling peaks.
These problems will, however, decrease your total number of reads and thus decrease the
accuracy of peak calls. Similarly a library of too low a concentration will result in fewer than
optimal numbers of sequenced clusters. What will prevent analysis is low sequence quality
across all bases of reads, or high N (unknown base) calls, which will prevent any reliable
mapping to your target genome. This sort of effect is rare & is likely to indicate a problem with
the Illumina run itself, which you will want to contact your core about.
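The per sequence GC content check described above can also be approximated directly from a FASTQ file. The following is a minimal Python sketch, not part of the original workflow; the function names are hypothetical and it assumes standard 4-line FASTQ records.

```python
from collections import Counter
from typing import Iterable


def gc_percent(seq: str) -> float:
    """Percent G/C among called (non-N) bases; 0.0 if no bases were called."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / called


def gc_histogram(fastq_lines: Iterable[str]) -> Counter:
    """Count reads per rounded %GC value (assumes 4 lines per FASTQ record)."""
    hist = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:  # the sequence is the 2nd line of each record
            hist[round(gc_percent(line.strip()))] += 1
    return hist
```

A single peak near the expected genome %GC is the normal case; a second, higher-GC peak in the histogram would be the pattern the text associates with bacterial contamination.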
2) DOWNLOADING & UNPACKING ILLUMINA SEQUENCING RESULTS
#go to the web site provided by the core facility. Find the correct file ending in “fastq.gz”. Control
click to get menu & select “copy link location”
#log in to the Tufts computing cluster, either using a shell program like TeraTerm or using “ssh
your_account@cluster.uit.tufts.edu” from a Mac OSX shell (sftp only transfers files & cannot run the commands below).
#Navigate to an appropriate folder in your shared directory & do…
wget {pasted URL copied from website: e.g.
ftp://gschnitzler:ExMGhvk9@genomics.med.tufts.edu/120209214/sequence_data_illumina/Unaligned/Sample_LiverE2_ERa_ChIP_man2.R1.fastq.gz}
#unpack using…
bsub gunzip FILE.fastq.gz
**Remember to create a backup of the QC and .fastq.gz files in someplace other than the cluster,
such as an external hard drive.
3) CHIP-SEQ STEP 1: MAPPING READS TO TARGET GENOME WITH BOWTIE
#run bowtie using:
bsub -oo LiE_man2_bowtie.runinfo /cluster/shared/gschni01/bowtie-0.12.5/bowtie -n 1 -m 1 -5 8 -3 10 --best --strata mm9 FILE.fastq FILE.map
# “bsub -oo filename” submits the bowtie run as a batch job and records any output that would have been sent to
the screen in the provided file. bsub is necessary for any job that will take more than a few seconds to
run. “-n #” specifies the number of mismatches allowed in the seed (the first 28 bp of the read by default, adjustable with “-l”), “-m #” specifies the maximum
number of different genomic locations a read can map to before being rejected, “-5 #” indicates the
number of bp trimmed from 5’ end (8 generally advised), “-3 #” number of bp trimmed from 3’ end, “--best”
& “--strata” are recommended parameters to find likely best fit in the genome.
#note, no commands to set path parameters are necessary to run bowtie, so long as the full path to the
bowtie executable file is used. You can identify this path by going to that directory and typing “pwd”.
#Examine the bowtie.runinfo file & record the results, which will tell you what % of sequences
mapped to the genome, what % failed to map & what % were suppressed due to the -m
parameter (e.g. mapping to more than one genomic location if -m is set to 1).
# To install bowtie go to: http://bowtie-bio.sourceforge.net/index.shtml & follow instructions for installation. To
install indexes for a genome of interest, right click on the link for the pre-assembled index you want (along
right of page) & select copy link location. In UNIX go to the “indexes” folder in bowtie & do:
wget [pasted link location]
unzip [downloaded file]
4) CHIP-SEQ STEP 2: CONVERTING FROM MAP TO BED FORMAT
#MACS requires that the .map bowtie output be converted into a .bed file. Do:
awk 'OFS="\t" {print $4,$5,$5+length($6),$1,".",$3}' FILE.map > FILE.bed
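# OFS="\t" tells awk to separate the output columns with tabs, $4 means the 4th column, $5+length($6)
means add the numeric value in the 5th column to the (text) length of the entry in column 6 & “.” means to
insert a literal period in this tabbed position. Note that awk splits each input line on any whitespace, so
these column numbers assume read names that contain one space (as in recent Illumina fastq headers). If your
read names contain no spaces, every column number other than $1 drops by one ($3,$4,$4+length($5),$1,".",$2).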
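For readers who prefer to see the conversion spelled out, here is a hedged Python sketch of the same map-to-BED step. It assumes bowtie's default tab-separated output (read name, strand, reference name, 0-based offset, sequence, qualities, ...) and splits on tabs rather than on all whitespace, so a read name containing a space does not shift the columns. The function name is hypothetical.

```python
def bowtie_to_bed(line: str) -> str:
    """Convert one line of default bowtie output to a 6-column BED line.

    Assumed columns: read name, strand, reference name, 0-based offset,
    read sequence, qualities, ... (tab separated).
    """
    fields = line.rstrip("\n").split("\t")
    name, strand, chrom, offset, seq = fields[:5]
    start = int(offset)
    end = start + len(seq)          # end = start + read length
    short_name = name.split()[0]    # keep only the part before any space
    return "\t".join([chrom, str(start), str(end), short_name, ".", strand])
```

Applied line by line to FILE.map, this should produce the same six columns (chromosome, start, end, read name, ".", strand) as the awk command.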
5) IDENTIFYING CHIP-SEQ PEAKS WITH MACS
#To prepare to run macs:
module load python/2.6.5
export PYTHONPATH=/cluster/shared/gschni01/lib/python2.6/site-packages:$PYTHONPATH
export PATH=/cluster/shared/gschni01/bin:$PATH
#note the paths used here depend on parameters specified when MACS was installed
#Running MACS using an input control (-c) and ChIP experimental (-t) files:
bsub -oo IPvINPUT.macsinfo macs14 --format=BED --bw=210 --tsize=33 --keep-dup=1 -B -S -c INPUT.bed -t IP.bed --name IPvINPUT
#--format=BED tells MACS that the input file is in .bed format.
#--bw=210 tells MACS the expected size of sequenced fragments (before addition of linkers, which add an
additional ~90 bp); MACS uses this value when it attempts to build a model from sense and antisense sequence reads.
#--tsize=33 indicates that each read (after trimming in bowtie) is 33 bp long (not strictly necessary: MACS
can figure this out on its own).
#--keep-dup=1 instructs MACS to consider only the first instance of a read starting at any given genomic
base pair coordinate & pointing in the same direction, assuming that additional reads starting at the
same base pair are due to amplified copies of the same ChIP fragment in the library (by default MACS
estimates the number of duplicates that are likely to arise by linear amplification of all fragments from a
limited starting sample, and sets the threshold to cut out replicate reads with a much higher number, which
are likely artifacts).
#-B tells MACS to make a bedgraph file of read density at each base pair (which can be used to visualize
the results on the UCSC browser) & -S tells MACS to make a single .bedgraph file instead of one for each
chromosome.
#--name gives the prefix name for all output files.
# Carefully scan through the MACS output file & examine the results for:
INFO @ Tue, 21 Feb 2012 18:30:07: #1 total tags in treatment: 54992015
INFO @ Tue, 21 Feb 2012 18:30:07: #1 tags after filtering in treatment: 38314984
#(reads left after keeping only one of each repeated sequence based on keep-dup=1)
INFO @ Tue, 21 Feb 2012 18:30:07: #1 Redundant rate of treatment: 0.30
# also look at these numbers for your control file. A high redundant rate reduces your read count &
your ability to detect peaks accordingly.
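As a quick check on these log lines, the redundant rate appears to be simply the fraction of tags removed by duplicate filtering. A small Python sketch (the function name is hypothetical) using the numbers from the log above:

```python
def redundant_rate(total_tags: int, tags_after_filtering: int) -> float:
    """Fraction of tags discarded as duplicates."""
    return 1.0 - tags_after_filtering / total_tags


rate = redundant_rate(54992015, 38314984)
print(f"{rate:.2f}")  # prints 0.30, matching the MACS log line above
```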
# MACS identifies + strand and – strand peaks, assumes that some of these come from left & right
reads of binding site peak fragments & builds a model that determines the optimum separation
between these peaks…
INFO @ Tue, 21 Feb 2012 18:31:19: #2 Build Peak Model...
INFO @ Tue, 21 Feb 2012 18:32:01: #2 number of paired peaks: 9916
INFO @ Tue, 21 Feb 2012 18:32:04: #2 predicted fragment length is 133 bps
# You ideally want >1000 peak pairs in the model, but MACS will run (giving a warning) so long as
you have ~100. Importantly, the predicted fragment size should be within +/-10 or 20 bp of your
estimate of the ChIP fragment size in your library (equal to your average final library fragment
length minus the ~90 bp length of the adapters). If it is not, then the model is probably based on
spurious read peaks resulting from, for instance, sonication sensitive sequences that get
cleaved at high frequency.
#You can visualize the quality of the model as follows: copy the “.r” file generated by MACS (e.g.
LiE_man2.33_v_Li_Input_bw210_dup1_model.r) to your PC. Open R. Set the working directory
to equal the folder with the .r file using:
setwd("[file path]")
#Then generate a viewable .pdf from the .r file using:
source("filename.r")
#This should show two broad peaks for plus strand & minus strand reads and a 3rd for the read
density center called by MACS based on its predicted fragment length. Sharp peaks separated
by less than 50 bp are peak call errors due perhaps to sonication-sensitive sequences. If you
also see broad peaks flanking these sharp peaks it means that it may be possible to identify the
real peaks if you adjust the MACS parameters. In particular the --mfold=X,Y parameter tells
MACS to use peaks that are between X and Y fold over background in calculating the model
(default 10 & 30). Sharp false peaks are often very high over background, so reducing the max
--mfold value can often help (e.g. --mfold=10,20). If signal for real binding events is not strong, you can
also reduce the lower parameter (I’ve had good results sometimes with --mfold=7,12), with the
caveat that this is more likely to call true noise as a peak.
# Finally you get to the number of peaks called:
INFO @ Tue, 21 Feb 2012 18:38:19: #3 Finally, 55920 peaks are called!
# Then, as a control, MACS swaps treatment and control data, considering treatment as
background and asking for the peaks detected in the control data.
INFO @ Tue, 21 Feb 2012 18:38:19: #3 find negative peaks by swapping treat and control
INFO @ Tue, 21 Feb 2012 18:39:08: #3 Finally, 1119 peaks are called!
# This is a measure of how many peaks result from random noise. Ideally you want “negative
peaks” to be 10-fold or more lower than “positive peaks”. This is how MACS calculates empirical
false discovery rate, so the lower the ratio of negative to positive peaks, the better your FDRs
will be. If these values are roughly equal, this indicates that non-specific fragments in your ChIP
DNA are more numerous than the fragments specifically brought down with your antibody &
may suggest you need to optimize your ChIP conditions or try other antibodies.
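The 10-fold rule of thumb above can be checked with a couple of lines of arithmetic. This Python sketch uses the peak counts from the example log; it is only the overall negative/positive ratio, not the per-peak FDR that MACS reports, and the function names are hypothetical.

```python
def negative_to_positive_ratio(positive_peaks: int, negative_peaks: int) -> float:
    """Overall ratio of negative (swapped) peaks to positive peaks."""
    return negative_peaks / positive_peaks


def passes_tenfold_rule(positive_peaks: int, negative_peaks: int) -> bool:
    """True if negative peaks are at least 10-fold fewer than positive peaks."""
    return negative_peaks * 10 <= positive_peaks


ratio = negative_to_positive_ratio(55920, 1119)
print(f"{ratio:.3f}", passes_tenfold_rule(55920, 1119))  # prints 0.020 True
```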
# Keep tweaking --mfold ranges until you get the correct estimated fragment size & a high
positive/negative peak ratio, if possible. Beware that the smaller the mfold range you allow the
fewer qualifying peaks there will be. If the model won’t resolve, you can tell MACS to forget the
model & just use your estimate of fragment size by adding these parameters to your MACS
command “--bw=fragment_length --shiftsize=fragment_length/2 --nomodel" & hope that this
gives a good positive/negative peaks ratio.
#To install MACS 1st go to: liulab.dfci.harvard.edu/MACS/ & choose the download option & ctrl-click to copy
the link location. If you need to login use default username=macs & password=chipseq
To download directly to the cluster, you will need to include login information in the wget command:
wget --http-user=macs --http-password=chipseq [paste url]
Follow the instructions for installation. To get it to work, you will probably need to change the default version
of python with:
module add python/2.6.5
6) SUBSAMPLING BEFORE MACS, PLUSES and MINUSES
# I am told that the P-values and fold-enrichment values given by MACS are sensitive to the
relative number of reads in the treatment vs control files, and if this number is very different
(maybe > 1.5x different) MACS may not accurately report these numbers (even though the
accuracy of peak calls should be mostly unaffected).
# One way to handle this is to run MACS with all of the data initially, and compare tags after
filtering for treat and control. Then take a subsample of the dataset with the higher number of
reads & re-run MACS. E.g. if control has 60M reads after filtering and treat has 20M, take a 33%
subsample of the control & rerun MACS. The following small command-line perl script can be
used to subsample reads, in this case ~33% of lines in FILE.bed (e.g. LiE_man2.bed) are put into
FILE_33pct.bed (e.g. LiE_man2_33pct.bed).
perl -e 'open(F1,"<","FILE.bed") or die; open(FH2,">","FILE_33pct.bed") or die; while(<F1>){if(rand()>.666){print FH2 $_;} }; close F1; close FH2'
#This will not give an exactly equal number of reads, especially if % redundancy is high, but may
get you close – especially if you try a few iterations. To get a truly equal number requires
deduplicating the reads _outside_ of MACS and feeding MACS an equal number of these
reads. Unfortunately, this is trickier than it might seem, involving converting your bed into a .sam
file, running two samtools programs and then converting back to .bed.
#Be aware that subsampling is a delicate trade-off which increases the effective noise & reduces
the accuracy of the subsampled dataset in order to improve MACS fold enrichment and p.value
output accuracy.
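If you want a subsample of an exact size rather than an approximate fraction, one option is to draw line indices without replacement. The following Python sketch is an alternative to the perl one-liner, not the method used in this workflow; the function name is hypothetical, and it holds all lines in memory, which may be impractical for very large .bed files.

```python
import random
from typing import List, Optional


def subsample_lines(lines: List[str], fraction: float,
                    seed: Optional[int] = None) -> List[str]:
    """Return round(fraction * len(lines)) randomly chosen lines, in original order."""
    rng = random.Random(seed)            # a fixed seed makes the draw repeatable
    k = round(fraction * len(lines))
    keep = sorted(rng.sample(range(len(lines)), k))
    return [lines[i] for i in keep]
```

Note that this equalizes the number of reads fed to MACS, not the number left after MACS removes duplicates, so the caveat about redundancy above still applies.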
7) COMPARING PEAK CALLS TO UNDERLYING READ DENSITY DATA
#If you included the -B & -S parameters in MACS it will create subfolders containing treat and
control bedgraph (.bdg) files which map the read density base-pair per base-pair across the
whole genome.
#Unfortunately MACS often extends predicted density a few bases past the UCSC-browser
recognized chromosome ends, causing errors. To fix this you can modify and run a short perl
program as follows:
vi trim_bdg_chrom_ends.pl
i
#cut and paste this content after adjusting the chromosome lengths to what the UCSC genome
browser reports for the species and build you are using (you can change the number of base
pairs per chromosome, add or subtract numbered chromosomes without problem, but you’ll
need to make further adjustments to the program if you add a non-numbered chromosome other
than X, Y and M):
=head1 Simple file to remove bed or bedgraph regions that exceed chromosome ends in mouse mm9
=head1 Usage
Input: At command line, type:
> perl trim_bdg_chrom_ends.pl input_bed_or_bedgraph_file.bdg output_filename.bdg
=head1 Version Information
Gavin Schnitzler 6/29/2012
=cut
my %chromhash=(
chr1=>197195432,
chr2=>181748087,
chr3=>159599783,
chr4=>155630120,
chr5=>152537259,
chr6=>149517037,
chr7=>152524553,
chr8=>131738871,
chr9=>124076172,
chr10=>129993255,
chr11=>121843856,
chr12=>121257530,
chr13=>120284312,
chr14=>125194864,
chr15=>103494974,
chr16=>98319150,
chr17=>95272651,
chr18=>90772031,
chr19=>61342430,
chrX=>166650296,
chrY=>15902555,
chrM=>16299
);
#print "Chr1: $chromhash{chr1}\n";
open (FH1, "<", $ARGV[0]) or die ("Could not open input bed or bedgraph file $ARGV[0]\n");
open (FH2, ">", $ARGV[1]) or die ("Could not open output file $ARGV[1]\n");
while(<FH1>){
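chomp;
my @tmp_dat = split(/\t/, $_);
# pass track header lines through unchanged
if($tmp_dat[0] =~/track/){print FH2 "$_\n"; next;}
# skip regions that reach or pass the chromosome end (or whose chromosome is not listed above)
if($tmp_dat[2]>=$chromhash{$tmp_dat[0]}){next;}
# chomp removed the newline, so add it back when printing
print FH2 "$_\n";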
}
close FH1;
close FH2;
#finish editing in vi by
[esc]
:wq
#then run the program with:
perl trim_bdg_chrom_ends.pl filename.bdg outputfilename.bdg
#Next compress your file for uploading to the browser with:
gzip outputfilename.bdg
#copy this file to your desktop using WinSCP (in windows) or sftp within a shell window (in mac
OSX)
#open the genome browser, select your genome, click add custom tracks, browse for the file name
& hit upload (it will take some time, but should eventually finish). Do the same with your input
control .bdg file.
# click add custom tracks again & upload the MACS result file ending with peaks.bed
# if you are comparing ChIP results from more than one condition or with more than one antibody
you can upload .bdg and .bed files for each one.
# Now scan along any chromosome & examine the peaks MACS called. Are they believably above
background? Make note of the coordinates of believable and non-believable peaks, then open
the peaks.xls file made by MACS into Excel on your desktop. Is there a p-value or fold-enrichment threshold that separates most good peaks from most bad peaks? If so, you may
want to apply this (these) threshold(s) to create your final list of peaks.
# Note that the height of the browser graphs for each track will be proportional to the number of
reads after filtering that MACS used for your treat and control samples. If these differ
considerably, you can adjust for this easily enough by just mentally multiplying the axis values
for the sample with more reads by the ratio of [reads for the smaller sample]/[reads for the larger
sample] (e.g. in the example above, you’d multiply the control sample axis values by 1/3).
# A better approach would be to do the following to normalize each .bdg entry by dividing by the
number of millions of reads after filtering (e.g. 20 for treat and 60 for control in the example
above). This reads per million (RPM) normalization is easy enough to do in awk (track header lines are passed through unchanged), e.g.:
awk 'BEGIN{OFS="\t"} /^track/{print; next} {print $1,$2,$3,$4/20}' treat.bdg > treat_normalized.bdg
#Note: the browser is fine with non-integral values in .bdg files
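The same normalization can be sketched in Python. This is an illustrative equivalent rather than part of the original workflow; the function name is hypothetical, and it assumes tab-separated 4-column bedgraph data lines with optional "track", "browser" or "#" header lines, which are passed through unchanged.

```python
def normalize_bedgraph_line(line: str, millions_of_reads: float) -> str:
    """Divide the 4th bedgraph column by the number of millions of reads."""
    if line.startswith(("track", "browser", "#")):
        return line  # header lines are not data and are left as they are
    chrom, start, end, value = line.rstrip("\n").split("\t")[:4]
    scaled = float(value) / millions_of_reads
    return f"{chrom}\t{start}\t{end}\t{scaled:g}\n"
```

For the example above you would call it with 20 for the treat file and 60 for the control file, so that both tracks are on the same reads-per-million scale.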
8a) EXAMINING OVERLAP BETWEEN PEAKS AND OTHER GENOMIC FEATURES,
INCLUDING TRANSCRIPTION START SITES OR OTHER CHIP-SEQ PEAKS
#To determine the degree to which two sets of bed regions overlap with each other use:
bsub perl /cluster/home/g/s/gschni01/perl_programs/overlap_1.2.txt File1.bed File2.bed -outfile File1_v_File2.overlap
#the output file will summarize the number of ranges in each input file & the average length of each
range & then provide details of all ranges from file1 that overlapped 1bp or more with those in
file 2.
#the program will also provide an estimate of the number of overlaps that would be expected by
random chance, from the formula:
(avg_length_file1+avg_length_file2)*regions_in_file1*regions_in_file2/genomesize
#this provides a reasonable first pass estimate, which can be used in binomial tests (see section 12
below) which would be structured as follows: hits=overlaps, tests=regions_in_file1,
bkg_freq=overlaps_by_chance/regions_in_file1.
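To make the chance-overlap formula concrete, here is a small Python sketch. The function names and the example numbers (including the approximate 2.7e9 bp genome size) are hypothetical and chosen only for illustration.

```python
def expected_overlaps(avg_len1: float, avg_len2: float,
                      n1: int, n2: int, genome_size: float) -> float:
    """(avg_length_file1 + avg_length_file2) * n1 * n2 / genomesize."""
    return (avg_len1 + avg_len2) * n1 * n2 / genome_size


def background_frequency(overlaps_by_chance: float, n1: int) -> float:
    """bkg_freq for the binomial test: chance overlaps per region in file1."""
    return overlaps_by_chance / n1


# Illustrative values: 1000 regions of ~300 bp vs 2000 regions of ~500 bp
chance = expected_overlaps(300, 500, 1000, 2000, 2.7e9)
print(round(chance, 3), background_frequency(chance, 1000))
```

With these inputs less than one overlap is expected by chance, so even a handful of observed overlaps would be far above background.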
# this background estimate, however, fails to account for the fact that bed regions in each file are
non-overlapping, and the fact that more than one bed region in file1 might overlap with a single
region from file2. If you have relatively small numbers of not-too-long peaks in file1 & file2 (such
that their combined total bp is less than 1% of the genome) this estimate is probably good
enough. If not, it is better to establish a background overlap frequency empirically. This can be
done by creating a randomly-distributed non-overlapping set of regions of equal length to those
in file2 & repeating the overlap of file1 with this file2_background set. To do this use:
perl /cluster/home/g/s/gschni01/overlap_bkg_generator.pl file2.bed random_bkg_for_file2.bed
# Running overlaps.pl with file1 versus random_bkg_for_file2.bed gives an empirically-determined
number of overlaps expected by regions in file1 with randomly distributed same-length regions
in file 2. If there are more than 5000 regions in file 2 this should be pretty accurate. If the regions
in file2 are less than this you may want to make multiple bkg set files, run overlaps.pl on each of
them & take the average as your background hits.
#Note, if you want to generate any number of ranges of a given fixed size (e.g. for +/-50 kb from
TSSes in the example below), create a raw feed file for overlap_bkg_generator.pl using the
following template (where the chromosome number and exact bp positions are irrelevant, the
only thing that matters is the distance between start & end).
perl -e 'for($x=1;$x<=10000;$x++){print "chr1\t100000\t200000\n";}' > plus_minus_50kb_feed_to_rand.bed
perl /cluster/home/g/s/gschni01/overlap_bkg_generator.pl plus_minus_50kb_feed_to_rand.bed random_100_kb_regions.bed
8b) USE OF CEAS TO LOOK AT DISTRIBUTION RELATIVE TO TSSES, EXONS &
CHROMOSOMES
Go to the Galaxy/Cistrome website at: http://cistrome.dfci.harvard.edu/
Upload or paste .bed file of peaks w/ no header
Run Integrative analysis/CEAS, choosing appropriate range from TSSes (note: it appears that
span should be set to equal the highest range value). Note also that the number of peaks in
intergenic regions isn’t given directly but can be calculated, & p. value is not given at all (but can
probably be assumed based on p. values for similar differences).
8c) OTHER CISTROME TOOLS:
peak2gene gives location of nearest gene to peaks
Conservation Plot shows average degree of conservation of sequences relative to the center of
the peaks
Venn Diagram shows overlap of up to 3 sets of peaks or features
GCA: Gene Centered Annotation, finds nearest binding site for each gene in the genome
(inverse of peak2gene).
Cistrome has many additional tools that might also be useful for certain applications.
9) GETTING SEQUENCES FOR PEAKS
Create an excel file containing the regions of interest* in bed format:
chr# [tab] start [tab] end [tab] optional_additional_columns
*These are just the bed coordinates from the …peaks.xls file from MACS. A perhaps better
approach is to limit the sequence to +/-200 bp (400 bp total) around the peak apex, which is also
reported in the peaks.xls file.
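The +/-200 bp window around the apex can be computed as in the following Python sketch. It assumes that the summit value in peaks.xls is an offset from the region start; if your file reports an absolute coordinate instead, pass 0 as the start. The function name is hypothetical.

```python
from typing import Tuple


def summit_window(chrom: str, start: int, summit_offset: int,
                  half_width: int = 200) -> Tuple[str, int, int]:
    """Return (chrom, start, end) of a window centered on the peak apex.

    summit_offset is ASSUMED to be relative to the region start.
    """
    summit = start + summit_offset
    # never let the window start before the beginning of the chromosome
    return chrom, max(0, summit - half_width), summit + half_width
```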
*In addition to your bed regions of interest (e.g. ERalpha binding sites in aorta), you should also
prepare a similar-sized set of control regions (of similar average length to the ChIP peak bed
regions). One option is to take flanking region DNA that is, say, ~2000 bp from the edge of your
peak. You can easily get this in excel by setting the 3’ edge (end) of the control region to the 3’ peak
edge -2000 bp and the 5’ edge (start) of the control region to the 3’ peak edge -(2000+length of the peak, or average peak size).
Alternatively, you can select random regions from the mouse genome, distributed on
chromosomes in roughly the same proportion as ChIP peaks. I did this by adding columns after
the ChIP-seq bed columns, the first equal to the bed region chr#, the second being a random
number from 1 to the size of the chromosome, and the 3rd being that + peak length (or average
peak length). After creating these, sort by chr# and starting bp and eliminate any overlap.
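The flanking-region option can be written out as a short Python sketch. This follows one reading of the description above (control end = peak end - 2000, control length = peak length); the function name is hypothetical, and it returns None when the control would run off the chromosome start or overlap the peak.

```python
from typing import Optional, Tuple


def flanking_control(chrom: str, start: int, end: int,
                     gap: int = 2000) -> Optional[Tuple[str, int, int]]:
    """Control region of the same length as the peak, upstream of it."""
    length = end - start
    ctrl_end = end - gap              # 3' edge of control = 3' peak edge - gap
    ctrl_start = ctrl_end - length    # 5' edge = 3' peak edge - (gap + length)
    if ctrl_start < 0 or ctrl_end > start:
        return None                   # off the chromosome or overlapping the peak
    return chrom, ctrl_start, ctrl_end
```

As with the random-region option, the resulting control set should still be sorted and checked for overlaps with other peaks before use.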
Copy and paste these bed regions into text files for both the ChIPseq peaks and the control
regions.
Add a first line that says:
track name=short_descriptor_of_data
save again as text.
Upload as a custom track in the UCSC genome browser
Select “go to table browser”
Select the track you want, and output format: sequence
Give a name for the output file, ending in .fa
Hit “get output”.
Transfer that file to your cluster account.
Run this convert script to simplify the first line for each entry (otherwise Storm will choke):
perl /cluster/home/g/s/gschni01/perl_programs/Lax_convert.pl FILE.fa > FILE_corrected.fa
10) LOOKING FOR MATCHES TO TRANSFAC MATRICES USING STORM
export CREAD=/cluster/shared/gschni01/cread-0.84
export PATH=$PATH:$CREAD/bin
bsub -oo FILE_f.85.storminfo storm -f -t 0.85 -s FILE_corrected.fa -o FILE_f.85.storm /cluster/shared/gschni01/cread-0.84/vertebrates.mat
# -f -t 0.85 tells storm to identify matches between test sequences and matrices that give 85% of the
maximal score. Short sequence elements reach this threshold easily, while long elements reach
it rarely.
bsub -oo FILE_p.0005.storminfo storm -p -t 0.0005 -s FILE_corrected.fa -o FILE_p.0005.storm /cluster/shared/gschni01/cread-0.84/vertebrates.mat
# -p -t 0.0005 tells storm to identify matches between test sequences & matrices that would occur
by chance less than or equal to .05% of the time. This is a better way of identifying long
sequence elements, but might be poor at detecting shorter elements, since even perfect
matches to the matrix might happen by chance at higher than this rate.
# Below, I describe a method that allows the use of a combination of both of these measures.
11) RUNNING DME_PARSE TO INTERPRET STORM OUTPUT
bsub -oo FILE_f.85.dmeparseinfo perl /cluster/home/g/s/gschni01/perl_programs/dme_parse5.4.pl FILE_f.85.storm FILE.bed peaks
# where .storm is your storm output file and .bed is a file simply containing 3 tab separated values
identifying the bed ranges fed to the UCSC browser to get sequence. “peaks” at the end
specifies that this is ChIP-seq peak data with site enrichment expected to be greatest at the
center of the peak, and varying region lengths allowed. The alternative here is “promoters”
which does not make any assumption about where sites will be enriched and which requires
that all sequences be the same length.
# note that this also works fine if the storm file is further processed using cread programs such as
the following:
motifclass -Orv -f ChIP_regions.fa -b background_regions.fa -o outputfilename.motifclass ChIP_region_storm_output.storm -P 1000
#this uses CREAD’s methods of assessing significance of enrichment relative to bkg seqs
sortmotifs -k RELATIVE_ERRORRATE -a -o outputfilename.sortmotifs previous_file.motifclass
#this gives a ranking of significant enrichment by the relative error rate metric from motifclass
#DME parse gives several output files, with these endings appended to the input storm file name:
“.runinfo”: summarizes the parameters and input files as well as the names & contents of output
files. It also contains a table of the number of times a bin of 50bp size outwards in both
directions from the peak center was covered by peak bed regions (used for some internal
calculations when raw peak bed regions of differing sizes are used as input).
“.info”: contains tab separated data including: [all summary information provided by storm (as well
as any potential additional programs) for each TFBS matrix], e.g. MATCH1, INFO,
RELATIVE_ERRORRATE, SCORE, etc., followed by one letter code words for the input
position weight matrix (“consensus”) and the matrix derived from matches found by storm
(“data_consensus”) (which unfortunately don’t work very well, for complete info see the .mat
file), followed by 12 columns tabulating frequency of regions with 0, 1,2,3… 10 & 11 or more of
each given site, a column reporting total matches found & columns (BP of sequence/binsize,
default binsize=50) which tabulate site distance from sequence edge.
“.freqs” lists tab separated summary information about each peak, starting with the chr#, start &
end information from the .bed file fed to storm (as well as any additional following columns of
information that may have been there), followed by columns for each matrix included in the
analysis, with each cell giving the number of matches to that matrix found in that bed region.
“.bed” contains bed format regions for each TFBS, suitable to view on the UCSC genome
browser.
“.mat” contains the position frequency matrix used by storm followed by the position frequency
matrix derived from the binding site matches in your data that storm identified.
12) BINOMIAL TESTS TO CALCULATE P VALUES
To do the binomial tests you will need the “total_sites” number in the .info file generated by
dme_parse & the number of bed regions, for both your background and your experimental sets
(run Storm and DME parse for both background and foreground sets).
In excel create four columns for each transcription factor binding site:
Column 1: transfac matrix identifier for the site [make sure that this column header is “Name”
(without the quote marks)].
Column 2: total_sites reported for your foreground (e.g. ChIP) data
Column 3: total number of base pairs searched for these sites, which is calculated as (number of
bed regions)*(average # of bp per region). Considering that TFBS sites are generally longer
than 1 bp (thus the last few bases cannot match to anything in the matrix), you could use (BP
per promoter)-(average length of TFBS sites in the matrix). I used 1200-6, but it would be close
enough just to use 1200. More accurately, you could subtract one less than the length of the
consensus sequence column text.
Column 4: background frequency: (total_sites_from background)/(# of background promoters *
length of bkg promoters). These numbers are derived from your storm & then dme_parse
analysis of your background region sequences.
Open R and then do the following…
Select the table you prepared above and copy it with ctrl-C.
In R:
btest<-read.delim("clipboard",header=T)
#if this doesn’t work, you’ll need to save the table to a text file and use file="filepath/filename.txt" in
read.delim
btest[1:4,] #make sure it input properly
for(x in c(1:length(btest$Name))){
btest[x,5]=binom.test(btest[x,2],btest[x,3],btest[x,4])$p.value
}
btest[1:10,]
write.table(btest,file="E176vWT_UPvCTRL_binom.xls", sep="\t")
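For readers without R at hand, the core of the calculation can be sketched with the Python standard library. Note the difference: R's binom.test is two-sided by default, whereas this sketch computes only the one-sided upper tail P(X >= hits), which is the relevant direction when testing for enrichment. The function names are hypothetical, and the simple summation is exact but slow when the number of trials is in the millions.

```python
import math


def binom_pmf(k: int, n: int, p: float) -> float:
    """Binomial probability of exactly k hits in n trials (requires 0 < p < 1)."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log1p(-p))
    return math.exp(log_pmf)


def binom_upper_tail(hits: int, trials: int, bkg_freq: float) -> float:
    """One-sided p-value: probability of at least `hits` successes by chance."""
    total = sum(binom_pmf(i, trials, bkg_freq) for i in range(hits, trials + 1))
    return min(1.0, total)
```

In the terms used above, hits is the foreground total_sites (column 2), trials is the number of base pairs searched (column 3), and bkg_freq is the background frequency (column 4).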
Next, in Excel, open output file & choose all but the index column.
Sort by Name & then sort the dme_parse output in the original file so it's also by Motif_ID
Insert the new data after the >=11 column using Insert/Cells
If desired, delete Name column & other extraneous ones.
Add adjusted_p column, multiply p.value by 585 (# of tests=# of motif matrices searched)*
* Alternatively, use the less conservative Benjamini-Hochberg correction. To do so, sort your raw p
values from lowest to highest and give each a rank (with lowest=1, next=2, etc.). For each
TFBS adjusted_p = raw_p*(# of tests)/rank, using the same # of tests as in the multiplication above.
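Both corrections are easy to script. The Python sketch below is illustrative only and its function names are hypothetical. The Benjamini-Hochberg version computes raw_p*(# of tests)/rank as described above and then adds the usual extra step of the standard procedure, which keeps an adjusted value from exceeding the adjusted value of any larger raw p.

```python
from typing import List, Optional


def bonferroni(p_values: List[float], n_tests: Optional[int] = None) -> List[float]:
    """Multiply each raw p by the number of tests, capped at 1."""
    m = n_tests or len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values: List[float],
                       n_tests: Optional[int] = None) -> List[float]:
    """raw_p * m / rank, then a running minimum from the largest rank down."""
    m = n_tests or len(p_values)
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    adjusted = [0.0] * len(p_values)
    running_min = 1.0
    for rank in range(len(order), 0, -1):      # largest raw p first
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Results are returned in the same order as the input, so they can be pasted back next to the raw p-values without re-sorting.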
Next calculate enrichment ratios: foreground/background frequency = (fg_counts/fg_bp) /
(bkg_counts/bkg_bp), which will indicate how great the enrichment/anti-enrichment of each
site is.
Once you know the enrichment ratio, sort first by p.value & then by this to set up 2 lists of
significant enrichment differences, sites that are significantly overenriched (adjusted p.value
<.05 & fg/bkg>1) and sites that are underenriched relative to chance (fg/bkg<1).
Obviously, sites with great enrichment over background are potentially most interesting, but they
should also, ideally, be present in a large fraction of sequences. To estimate the fraction of
ChIP-seq peaks that have enrichment for any given TFBS (assuming each sequence has one
more site than background sequences) simply calculate, in Excel:
=(foreground_matrix_hits-background_matrix_hits)/total_bed_regions
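The same two calculations can be sketched in R. All numbers below are made up, and the column names simply follow the fg_counts/fg_bp/bkg_counts/bkg_bp labels used above. The fraction-of-peaks line mirrors the Excel formula literally, so the subtraction is only meaningful if the foreground and background sets cover a similar number of bp (assumed here).

```r
# Made-up counts for three motif matrices
tfbs <- data.frame(
  Name       = c("V$A_01", "V$B_01", "V$C_01"),
  fg_counts  = c(300, 120, 40),
  fg_bp      = c(1e6, 1e6, 1e6),
  bkg_counts = c(100, 120, 80),
  bkg_bp     = c(1e6, 1e6, 1e6)
)
total_bed_regions <- 2000   # hypothetical number of ChIP-seq peaks

# Enrichment ratio: foreground frequency divided by background frequency
tfbs$ratio <- (tfbs$fg_counts / tfbs$fg_bp) / (tfbs$bkg_counts / tfbs$bkg_bp)

# Estimated fraction of peaks carrying one site more than background
tfbs$frac_peaks <- (tfbs$fg_counts - tfbs$bkg_counts) / total_bed_regions

print(tfbs)
```

Here the first matrix is 3-fold enriched and would account for an extra site in roughly 10% of peaks, the second is at background, and the third is 2-fold anti-enriched.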
13) PLOTTING TFBS FREQUENCY RELATIVE TO PEAK CENTERS
In the final columns of the “.info” file DME parse also measures the frequency of matrix matches
relative to peak centers, one row for each transcription factor. To look at this distribution, create a
new row with base pair position at the center of each bin (the first one goes from -1000 to -950, so
the value would be -975). Plot the desired row data on the Y axis with BP position on the X. Note,
this data has been normalized to matrix matches per kb of DNA sequence (i.e. number of
matches/(# of bed regions*50 bp bin size/1000 bp)). This allows you to directly compare results
across different conditions (e.g. ERalpha ChIP from liver and aorta).
This will tell you whether enrichment is tightly associated with peak centers or broadly distributed,
or even (potentially) associated with flanks but not centers, as might be the case for a factor that
contributes to an enhancer that your ChIP’ed factor often is part of, but which does not directly
recruit your ChIP’ed factor to chromatin.
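If you prefer to plot in R, the bin centers and the plot itself can be sketched as follows. The window running from -1000 to +1000 is an assumption (only the first bin, -1000 to -950, is stated above), and freq_per_kb is a made-up stand-in for one TFBS row copied from the “.info” file.

```r
# Centers of 50 bp bins; assumes the window spans -1000 to +1000
bin_centers <- seq(-975, 975, by = 50)

# Made-up normalized frequencies (matches per kb) peaking at the center
freq_per_kb <- 0.2 + 300 * dnorm(bin_centers, mean = 0, sd = 150)

plot(bin_centers, freq_per_kb, type = "l",
     xlab = "Position relative to peak center (bp)",
     ylab = "Matrix matches per kb")
```

A sharp central peak like this toy curve would indicate enrichment tightly associated with peak centers; a flat or bimodal curve would point to broad or flank-associated enrichment.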
14) USING THE .FREQS FILE TO IDENTIFY PEAK SUBSETS ENRICHED IN ANY GIVEN
FACTOR
Paste the .freqs file from your background region dme_parse run into excel & in the rows above
calculate:
=PERCENTILE(range_including_all_values_in_column,0.95)
# this is the 95th percentile of sites per region values
=PERCENTRANK(all_values_in_column,value_from_above+1)
# this is the fraction of bed regions w/ sites per region equal to 95th percentile value or lower
Copy & “paste special” these as values in rows above the same TFBS matrix IDs of the dme_parse
results from your foreground ChIP-seq data. Then in additional rows calculate:
=COUNT(IF(range_including_all_values_in_fg_data_column>95th_percentile_bkg_value_cell,)) [hit
ctrl shift enter to activate, should be surrounded by {} brackets when you click on the cell again]
#this is the number of ChIP seq peaks with more sites than the 95th percentile value
=INT(number_from_countif-(1-percentrank_value_from_bkg_set)*total_number_of_peaks)
# This is the number of peaks in excess of those that would have this characteristic by random
chance. If this number is high, especially if it is greater than say ~1% of your total peaks, it
suggests that choosing all peaks with more sites than the 95th percentile value is likely to
identify peaks with functional enrichment of that site.
If you then perform a descending sort on the chosen column (making sure to sort ALL columns as
well) and take those greater than the 95th percentile value, you’ve got a set of putative target
locations.
#This analysis will let you identify binding site peaks for your ChIP’ed factor that are highly
enriched in consensus motifs for another factor. These can serve as candidate regions to check
by ChIP for the presence of the other factor and/or on which to test for the loss of binding of
your ChIP’ed factor after knock down or inhibition of the other factor.
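The Excel steps in this section can be sketched in R for a single TFBS column. The data below are simulated, and one step is swapped for simplicity: the background fraction is computed directly as the share of regions at or below the cutoff, rather than with PERCENTRANK(..., value+1), so numbers may differ slightly from the Excel route.

```r
# Simulated sites-per-region counts for one TFBS matrix (made-up data)
set.seed(42)
bkg <- rpois(5000, lambda = 1)       # background regions
fg  <- c(rpois(1800, lambda = 1),    # ChIP-seq peaks that look like background...
         rpois(200,  lambda = 5))    # ...plus a subset rich in this site

cutoff  <- quantile(bkg, 0.95, names = FALSE)  # 95th percentile in background
frac_le <- mean(bkg <= cutoff)                 # fraction of bkg regions at or below it
n_above <- sum(fg > cutoff)                    # peaks with more sites than the cutoff

# Peaks above the cutoff beyond what chance alone would give
excess <- floor(n_above - (1 - frac_le) * length(fg))

# Candidate peaks: indices of foreground regions above the cutoff
candidates <- which(fg > cutoff)
cat("cutoff:", cutoff, " peaks above:", n_above, " excess:", excess, "\n")
```

With real data, fg and bkg would be one column each from the foreground and background .freqs tables, and candidates would index the rows of your bed regions.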
15) ALL-BY-ALL OVERLAPS ANALYSIS TO IDENTIFY CO-ENRICHED OR MUTUALLY EXCLUSIVE TFBS SITES
Create a new column that makes a single-word identifier for these bed regions using:
=concatenate([chromosome_cell],":",[start_cell],"-",[end_cell])
Copy & “paste special” (as values) these putative target IDs, one column per TFBS, into a
separate sheet. Then copy this new table (capturing the longest column, and including empty
cells for shorter columns). If you ran both f .85 and p.0005 storm analyses you can repeat the
analysis below for each one, or (since many peaks will show site enrichment by both methods)
simply combine both lists together, removing duplicates.
First, modify the R commands below as follows: replace “45” with the number of columns you have
and replace “11975” with the number of original ChIP-seq peaks used for the storm analysis.
Then, in R do:
test<-read.delim("clipboard",na.strings="",fill=T,header=T)
dim(test) #tells the number of rows and columns
colnums=c(1:45) #this should be equal to the number of columns in test
ids=colnames(test)[colnums] #TFBS matrix IDs, used to label every output table
intersections=matrix(data=NA,ncol=45,nrow=45)
testout=matrix(data=NA,ncol=45,nrow=45)
chance=matrix(data=NA,ncol=45,nrow=45)
ratios=matrix(data=NA,ncol=45,nrow=45)
#overlap expected by chance for each pair of columns
for(x in colnums){for(y in colnums){chance[x,y]<-(length(na.omit(test[,x]))*length(na.omit(test[,y]))/11975)}}
ids
colnames(chance)=ids
rownames(chance)=ids
chance
#binomial p value for the observed overlap of each pair (self-to-self comparisons stay NA)
for(x in colnums){for(y in colnums){if(x!=y){testout[x,y]<-binom.test(length(intersect(na.omit(test[,x]),na.omit(test[,y]))),length(na.omit(test[,x])),length(na.omit(test[,y]))/11975)$p.value}}}
colnames(testout)=ids
rownames(testout)=ids
testout
#observed number of shared peak IDs for each pair of columns
for(x in colnums){for(y in colnums){intersections[x,y]<-length(intersect(na.omit(test[,x]),na.omit(test[,y])))}}
colnames(intersections)=ids
rownames(intersections)=ids
intersections
#observed/expected overlap: >1 means co-enrichment, <1 means anti-enrichment
ratios=intersections/chance
colnames(ratios)=ids
rownames(ratios)=ids
ratios
write.table(testout,file="temp.txt",sep="\t",col.names=NA,quote=F)
#copy this table of p. values for the overlap of sites enriched in each factor relative to each other
factor into Excel, then repeat write.table & copy for intersections, chance & ratios. To be
conservative, multiply the p.values by 45*44/2 (the number of relevant non-self-to-self
comparisons for the binomial tests) – essentially the Bonferroni correction for multiple testing.
Be aware that this is p. value for either enrichment (more overlaps than chance) or anti
enrichment (fewer overlaps than chance) – which you can easily determine by looking at the
ratios table.
# This analysis will tell you which TFBS matrices tend to group together (very low p values & high
ratios) versus what ones group separately (very low p values & low ratios), which can give
insights into functional modules of factors. Note that some will group together because their
sites are highly homologous (with only one of those factors likely to be relevant) – and this can
be determined by STAMP analysis as described in section 18 below.
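To see what the overlap tables contain before running your own data, here is a compact sketch of the same calculation on a toy table. The peak IDs, the three column names and n_peaks are all made up, and the number of columns is taken from the table rather than typed in by hand. Unlike the commands above, each column is also de-duplicated first, which is an added assumption.

```r
# Toy table: each column lists peak IDs enriched for one TFBS (made-up IDs)
test <- data.frame(
  A = c("chr1:100-200", "chr1:300-400", "chr2:50-150", NA),
  B = c("chr1:100-200", "chr2:50-150", "chr3:10-110", "chr4:5-105"),
  C = c("chr5:1-101", NA, NA, NA),
  stringsAsFactors = FALSE
)
n_peaks <- 20   # hypothetical total number of ChIP-seq peaks

sets  <- lapply(test, function(col) unique(na.omit(col)))  # one ID set per column
ids   <- names(sets)
n     <- length(sets)
sizes <- sapply(sets, length)

# Observed overlaps, overlaps expected by chance, and their ratio
intersections <- sapply(sets, function(a) sapply(sets, function(b) length(intersect(a, b))))
chance        <- outer(sizes, sizes) / n_peaks
ratios        <- intersections / chance

# Binomial p value for each pair; self-to-self comparisons stay NA
pvals <- matrix(NA, n, n, dimnames = list(ids, ids))
for (x in seq_len(n)) for (y in seq_len(n)) if (x != y) {
  pvals[x, y] <- binom.test(intersections[x, y], sizes[x], sizes[y] / n_peaks)$p.value
}

print(ratios)
print(pvals)
```

In this toy example columns A and B share 2 peaks where 0.6 would be expected by chance, giving a ratio above 3, while C shares none with either.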
16) ALTERNATIVE METHOD FOR TFBS IDENTIFICATION: CENTDIST
CentDist is specially designed for detecting enriched TFBSes in peak centers relative to
surrounding sequences. Its two improvements over simple Storm analysis are a dynamic
determination of the optimal range from the center to give the best fg/bkg ratio & its
consideration of “peakiness” as part of the score. The downside of using CentDist is that it does
not make the individual matrix hit locations available, it doesn’t allow adjustments to parameters
(e.g. what constitutes a hit), it doesn’t provide any direct way of quantitating fold enrichment,
and it can’t identify anti-enrichment.
CentDist is available only online, with limited storage capacity for jobs, at:
http://biogpu.ddns.comp.nus.edu.sg/~chipseq/webseqtools2/TASKS/Motif_Enrichment/submit.php?email=guest
It allows results to be viewed by top “families” of related TFBSes or by top individual
factors/matrices (which is more useful). Viewing results from all matrices may crash your
browser. Choosing export data however gives you all data in a text file (unfortunately, losing the
also-useful graphs).
17) INTEGRATING RESULTS FROM SEVERAL METHODS, SIMPLE THRESHOLDS VERSUS
PERMUTATION OF RANKS APPROACHES
If you have multiple measures that assess TFBS enrichment, you can arbitrarily decide to take only
TFBSes that are significantly enriched at some threshold in all (or most) methods. This is often
OK, but can be too conservative and leave you with few positives.
Alternatively you can estimate the significance of the combined results of all measures (e.g. what is
the FDR for enrichment of V$SP1_01 matrix based on combined Storm f.85, Storm p.0005 &
CentDist results). You can do this by setting up a system to rank the results from each method,
randomly permuting the ranks from each method, determining the average ranks, & then using the
distribution of permuted average ranks to estimate the probability that any given average rank (e.g.
for V$SP1_01 enrichment) would occur by chance (a measure of false discovery rate).
First, you need to give ranks to each entry for each measure. It is important that these ranks be
based on something that is not a property only of that dataset, and ideally that is tied to
probability of occurrence in some way (e.g. use reported p.value or fold enrichment, not a
percentile of data values). The reason for this is that 3 entries that are all 99th percentile from a
dataset with poor underlying enrichment (e.g. 99th percentile is 1.1x enriched and raw p.=0.1 by
each measure) should not give the same very low FDR value that would be associated with a
dataset with strong underlying enrichment (e.g. 99th percentile is 4x enriched and p.=1e-20 by
each measure).
Now take these columns of ranks, plus a first column of identifiers (like V$SP1_01), copy them
from Excel, open R & do:
Dat<-read.delim("clipboard",sep="\t", header=T)
Outside of R edit a text file to read as follows & save this as Permute_ranks.txt.
# Permutes average of ranks from >= 2 samples to give a single p value
# Required input:
# Dat is a matrix. Column 1= names, Column 2 = Ranks for dataset 1, Column 3 = ranks for dataset 2, etc.
# Default number of permutations is 100. To set something else do: nperm<-# before running script
# IMPORTANT NOTE: Scale your rank values so that your maximum possible average rank is less than 20.
if(!exists("nperm")){nperm <- 100}
print ('Number of permutations was:')
print (nperm)
ravranks<-c(0,0)
freq<-c(0,0)
pval<-c(0,0)
binfreq<-c(0,0)
origfreq<-c(0,0)
binlab<-c(0,0)
#cnames=c(0,0)
#fdr=c(0,0)
#cnames[1]<-colnames(Dat)[1]
avranks<-rowMeans(Dat[,2:length(colnames(Dat))])
print('Number of rows in dataset was:')
print(length(avranks))
for (i in c(1:200)){
binfreq[i]<-0
origfreq[i]<-0
binlab[i]<-(i/10)
}
for (i in 1:length(avranks)){
ravranks[i]<-0
freq[i]<-0
pval[i]<-0
if(avranks[i]<20){ origfreq[as.integer(10*avranks[i])]=origfreq[as.integer(10*avranks[i])]+1}
}
print ('Calculating permutations of summed ranks...')
for (i in 1:nperm) {
rDat<-data.frame("data"=sample(Dat[,2]))
for(j in 3:length(colnames(Dat)) ) {
rDat<-data.frame(rDat,"data"=sample(Dat[,j]))
}
ravranks<-rowMeans(rDat[,1:length(colnames(rDat))])
for(k in 1:length(avranks)){
freq[k]<-freq[k]+sum(ravranks <= avranks[k])/length(avranks)
if(ravranks[k]<20){ binfreq[as.integer(10*ravranks[k])]=binfreq[as.integer(10*ravranks[k])]+1}
}
}
print ('Calculating p values...')
for (j in 1:length(avranks)) {
pval[j]<-freq[j]/nperm
}
Out<-data.frame(Dat,"Avg_rank"=avranks,"P.value"=pval)
freqout<-data.frame("bin_top"=binlab,"count"=binfreq,"rel_freq"=(binfreq/(length(avranks)*nperm)),"orig_freq"=(origfreq/length(avranks)))
write.table(Out, file="Permute_ranks.output", sep="\t", col.names=NA)
write.table(freqout, file ="Permute_ranks.freqs",sep="\t",col.names=NA)
print ('Output sent to Permute_ranks.output.')
print ('Frequencies of ranks in original & permuted data sent to Permute_ranks.freqs.')
print ('Be sure to rename files before additional runs.')
In R do:
source("Permute_ranks.txt")
The .output file lists the original values for each identifier & gives a p.value/FDR for each.
The .freqs file gives columns that can be plotted to give the normalized frequency of any given
rank (by each 0.1).
The straight-up results from the .output file are accurate only within one condition. If you have two
or more conditions and you want to compare FDRs from one condition to those from another,
use the method below.
First, take all of the values from all of your conditions and concatenate them together (3 samples=3x
longer columns) & run Permute_ranks on this combined data.
Next, take the results from the .freqs file to Excel, and perform a running sum of these frequencies
to give the FDR value for any given rank. To do this: if your ranks list column is A2:A100 & your
frequencies data column is B2:B100, in C2 type “=B2”, in C3 type “=C2+B3” & then propagate
this formula down. If it works properly the final value will be 1.0.
With this you can assign p.values to any rank value under any condition using:
=vlookup(int(10*rank)/10,$A$2:$C$100,3,FALSE)
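The running sum and lookup can also be done in R. This is a minimal sketch on a made-up five-bin table: the column names follow the bin_top and rel_freq columns of the .freqs file, and lookup_p is a hypothetical helper that bins a rank the same way as int(10*rank)/10.

```r
# Made-up stand-in for Permute_ranks.freqs: bin label and relative frequency
freqs <- data.frame(
  bin_top  = c(0.1, 0.2, 0.3, 0.4, 0.5),
  rel_freq = c(0.05, 0.15, 0.30, 0.30, 0.20)
)

# Running sum of frequencies; the last value should be 1.0
freqs$cum_freq <- cumsum(freqs$rel_freq)

# Hypothetical helper: cumulative frequency for the bin a given rank falls in.
# Bins are matched on integer tenths to avoid floating-point mismatches.
lookup_p <- function(rank) {
  freqs$cum_freq[match(floor(10 * rank), round(10 * freqs$bin_top))]
}

print(freqs)
print(lookup_p(0.27))
```

For an average rank of 0.27 this looks up the 0.2 bin and returns the running sum at that row; ranks that fall outside the table return NA, just as the exact-match vlookup returns #N/A.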
Finally, when you want to know the significance of an average rank across methods derived from
averages of more than one entry within a method (e.g. all six of the matrices for V$SP1_...), to
get an accurate FDR value you need to feed replicates of the rank numbers for all methods to
the ranking program (e.g. w/ a rank determined by averaging 6 entries from 3 methods, repeat
the ranks from each method 6 times for a total of 18 columns). This makes sense because if
rank of 1 occurs 10% of the time in each method, a random average of 1 occurs much more
frequently (1e-3) when you’ve averaged 3 ranks than when you’ve averaged 18 (1e-18).
18) USING STAMP-GENERATED TREE DIAGRAMS OF MATRIX SIMILARITIES TO FURTHER
EXPLORE THE MEANING OF SITE ENRICHMENT
a) Choose the TFBS IDs you want and create a file on the cluster with one line per entry, e.g.
V$TFBS_01
V$TFBS_02
…
b) perl /cluster/home/g/s/gschni01/perl_programs/MatrixSelect.pl [Storm_output.storm
or_matrix_file.mat] selected_sites_file.txt outputfile.mat
c) upload the output file (or copy and paste its contents) into STAMP at
http://www.benoslab.pitt.edu/stamp/index.php.
d) Run analysis: I tend to use mostly default settings, but w/ trim edges set to info content of 0.0
and requesting the 10 best matches to transfac v11.3.
e) The output will show the consensus for all your input matrices followed by a tree, and then by
the best matches for each input TFBS matrix to transfac v11.3.
f) You can display the tree better by using the “newick formatted tree”.
g) On your PC download & install MEGA from www.megasoftware.net.
h) Click on the “newick formatted tree” in your STAMP output, select all and copy that long string,
paste this into a simple text file and save with the .nwk extension.
i) In MEGA choose user tree->display newick tree, open your .nwk file. There are many options for
display. I like the circle under the tree/branch style button. When you have the tree (or branch)
you like, use “image-> copy to clipboard” to paste this into something like PowerPoint.
j) TFBSes that are tightly clustered at the same radial distance on the outer ring of the circle are
closely related. There is likely to be only one factor or family of factors that is enriched, while the
others just show up by homology.
Determining which TFBS or family in a closely related group is most likely to be real:
a) Note the p values, enrichments and number of sites you got from your Storm & binomial test
analysis. If one TFBS ID has by far the lowest p value & the best enrichment and is represented
by a large number of sites (>~100 and roughly equal or greater than the number of sites for the
other TFBSes) it is very likely to be the real one.
b) For less clear cut cases, you can examine the specific match sequences found by Storm to see
what they most resemble. Is it the matrix that they were originally found to match, or is there
another matrix that they match better to? Prepare a file with one TFBS ID (in V$TFBS_01
format) per line for all those IDs in a group you want to examine.
c) Move this file to the cluster and do:
perl /cluster/home/g/s/gschni01/MatrixSelect4dmeparseMat.pl dme_parse_output_file.mat
TFBS_IDs list.txt outputfile.mat
**This should select only the matrices derived from all of the ‘matches’ found in your sequence
data to the previously-established matrices you named, although it could potentially need some
debugging…
d) Either load this file or cut and paste its contents into STAMP & run as above (comparing to
Transfac v 11.3). In this case STAMP is identifying the transfac matrix that best fits the matrix
formed from the sequences in your data that “matched” your chosen matrices, e.g. V$TFBS_01,
etc.
e) Look at the best fits to transfac matrices and the p values. If you are trying to see which of sites
A, B, C & D are real, and find that A & C have highly significant matches to appropriate A* & C*
matrices in Transfac 11.3, while B & D do not have highly significant matches to B* and D* in
transfac, it suggests that A & C are real. Further support for this conclusion would be provided if
you see that B and D fit A* and C* better than B* and D* in Transfac 11.3.
For consideration & speculation on how to tell whether the enrichment of a short TFBS matrix can
be explained by its homology to a longer TFBS matrix in the same STAMP homology group, see
the file 2012_01_12_analyzing_enriched_TFBS_similarities.doc.
19a) MAPPING AVERAGE READ DENSITY RELATIVE TO GENOMIC FEATURES LIKE TSSES
**Coming soon
19b) MAPPING NUCLEOSOME DENSITY RELATIVE TO GENOMIC FEATURES LIKE TSSES
**Coming soon