Workflow for ChIP-seq & RNA-seq data analysis, GRS 2012 1 WORKFLOW FOR ANALYSIS OF CHIP-SEQ DATA INCLUDING ANALYSIS OF ENRICHED TRANCRIPTION FACTOR BINDING SITES -Gavin Schnitzler 02/11/2013 1) QUALITY CONTROL ON SEQUENCING RESULTS Open the fastqc_report.html file. Check per base sequence quality. This should be in the green range throughout. Somewhat lower quality is expected in the first ~8bp (due to technical aspects of the sequencing method), which is why these bases are often excluded from the analysis. If other 5’ or 3’ end bases fall out of the green range you may also want to exclude these with the caveats that you’ll need at least ~25 bp to map to a mammalian genome & you will want to perform the same trim on all samples. Check per sequence quality scores. Ideally there will be a sharp peak in the 30’s & relatively few reads below this. Other metrics often don’t link properly to the .html report, but you can access each of these by opening the images subfolder of the fastqc results folder: duplication_levels.png shows the number of exact duplicate reads. If most of your reads have a duplication value greater than 1 this indicates that you are sequencing multiples of the same fragment that were generated by the PCR amplification step in library production. Avoiding this is the major reason why you need to start with ~3ng of ChIP fragments and why you want to limit amplification to ~18 cycles. If there is a single curve centered around 2 or 3 your data should still be fine – but will be about as accurate as having 1/3 as many unique sequences. The kmer content values in the .html report & kmer_profiles.png image describe certain short sequences that are over represented compared to chance predicted from GC content. Usually TTTTT & AAAAA show up & a few others, which are just normal characteristics of mammalian DNA. Don’t worry about this unless certain sequences show up 100s of times greater than background and/or the kmers in your input DNA sample differ greatly from those in your ChIP sample… in which case these kmers could be an indicator of contamination by very high copies of the same sequence – potentially a PCR fragment from some other experiment that contaminated your ChIP sample. Per base GC content & sequence content .png files should show roughly straight lines with the exception of the first ~8 or 9 bp, which is expected (as mentioned above). The per sequence GC content .png file should have a single peak centered over the expected %GC for the genome in question (~45% for mammals). Sharp peaks of any given GC content probably represent highly repeated sequences (resulting perhaps from contamination of your ChIP DNA with a PCR product of some specific sequence that was hanging around the lab). A second broad peak of higher GC content probably indicates bacterial DNA contamination. An example of contamination with a common soil bacterium that was probably growing in a ChIP buffer is shown here: Workflow for ChIP-seq & RNA-seq data analysis, GRS 2012 2 Note that none of these issues should completely prevent analysis of your data. For instance, nonmammalian sequences won’t map to the genome & over-amplified artifact sequences are removed by the recommended procedure of removing duplicate reads before calling peaks. These problems will, however, decrease your total number of reads and thus decrease the accuracy of peak calls. Similarly a library of too low a concentration will result in fewer than optimal numbers of sequenced clusters. What will prevent analysis is low sequence quality across all bases of reads, or high N (unknown base) calls, which will prevent any reliable mapping to your target genome. This sort of effect is rare & is likely to indicate a problem with the Illumina run itself, that you will want to contact your core about. 2) DOWNLOADING & UNPACKING ILLUMINA SEQUENCING RESULTS #go to the web site provided by the core facility. Find the correct file ending in “fastq.gz”. Control click to get menu & select “copy link location” #go to the Tufts computing cluster either using a shell program like TeraTerm or using “sftp your_account@cluster.uit.tufts.edu” from a Mac OSX shell. #Navigate to an appropriate folder in your shared directory & do… wget {pasted URL copied from website: e.g. ftp://gschnitzler:ExMGhvk9@genomics.med.tufts.edu/120209214/sequence_data_illumina/Unaligned/Sample_LiverE2_ERa_ChIP_man2.R1.fastq.gz} #unpack using… bsub gunzip FILE.fastq.gz **Remember to create a backup of the QC and .fastq.gz files in someplace other than the cluster, such as an external hard drive. 3) CHIP-SEQ STEP 1: MAPPING READS TO TARGET GENOME WITH BOWTIE #run bowtie using: bsub -oo LiE_man2_bowtie.runinfo /cluster/shared/gschni01/bowtie-0.12.5/bowtie -n 1 -m 1 -5 8 -3 10 --best --strata mm9 FILE.fastq FILE.map # “bsub –oo filename” submits the bowtie run as batch and records any output that would have been sent to the screen in the provided file. bsub is necessary for any job that will take more than a few seconds to run. “-n #” specifies the number of mismatches allowed in 1st 25 bp of read, “-m #” specifies the maximum number of different genomic locations a read can map to before being rejected, “-5 #” indicates the number of bp trimmed from 5’ end (8 generally advised), “-3 #” number of bp trimmed from 3’ end, “--best" & “--strata" are recommended parameters to find likely best fit in the genome. #note, no commands to set path parameters are necessary to run bowtie, so long as the full path to the bowtie executable file is used. You can identify this path by going to that directory and typing “pwd”. #Examine the bowtie.runinfo file & record the results, which will tell you what % of sequences mapped to the genome, what percent failed to map & what % were suppressed due to the --m parameter (e.g. mapping to more than one genomic location if --m is set to 1). # To install bowtie go to: http://bowtie-bio.sourceforge.net/index.shtml & follow instructions for installation. To install indexes for a genome of interest, right click on the link for the pre-assembled index you want (along right of page) & select copy link location. In UNIX go to the “indexes” folder in bowtie & do: wget [pasted link location] unzip [downloaded file] 4) CHIP-SEQ STEP 2: CONVERTING FROM MAP TO BED FORMAT #MACs requires that the .map bowtie output be converted into a .bed file. Do: awk 'OFS="\t" {print $4,$5,$5+length($6),$1,".",$3}' FILE.map > FILE.bed Workflow for ChIP-seq & RNA-seq data analysis, GRS 2012 3 # “\t” tells awk that entries within a line are separated by tabs, $4 means the 4 th column, $5+length($6) means add the numeric value in the 5th column to the (text) length of the entry in column 6 & “.” means to insert a literal period in this tabbed position. 5) IDENTIFYING CHIP-SEQ PEAKS WITH MACS #To prepare to run macs: module load python/2.6.5 export PYTHONPATH=/cluster/shared/gschni01/lib/python2.6/site-packages:$PYTHONPATH export PATH=/cluster/shared/gschni01/bin:$PATH #note the paths used here depend on parameters specified when MACS was installed #Running MACs using an input control (-c) and ChIP experimental (-t) files: bsub -oo IPvINPUT.macsinfo macs14 --format=BED --bw=210 --tsize=33 --keep-dup=1 -B -S -c INPUT.bed -t IP.bed --name IPvINPUT #--format=BED tells MACs that the input file is in .bed format, bw=210 tells MACs the expected size of sequenced fragments (before addition of linkers, which add an additional ~90 bp) from which value it attempts to build a model from sense and antisense sequence reads, --tsize=33 indicates that each read (after trimming in bowtie) is 33 bp long (not strictly necessary: MACS can figure this out on its own), -keep-dup=1 instructs MACS to consider only the first instance of a read starting at any given genomic base pair coordinate & pointing in the same direction – assuming that additional reads starting at the same base pair are due to amplified copies of the same ChIP fragment in the library (by default MACS estimates the number of duplicates that are likely to arise by linear amplification of all fragments from a limited starting sample, and sets the threshold to cut out replicate reads with a much higher number – likely artifacts), -B tells MACS to make a bedgraph file of read density at each base pair (which can be used to visualize the results on the UCSC browser) & -S tells MACS to make a single .bedgraph file instead of one for each chromosome, & finally --name gives the prefix name for all output files. # Carefully scan through the MACS output file & examine the results for: INFO @ Tue, 21 Feb 2012 18:30:07: #1 total tags in treatment: 54992015 INFO @ Tue, 21 Feb 2012 18:30:07: #1 tags after filtering in treatment: 38314984 #(reads left after keeping only one of each repeated sequence based on keep-dup=1) INFO @ Tue, 21 Feb 2012 18:30:07: #1 Redundant rate of treatment: 0.30 # also look at these numbers for your control file. A high redundant rate reduces your read count & your ability to detect peaks accordingly. # MACS identifies + strand and – strand peaks, assumes that some of these come from left & right reads of binding site peak fragments & builds a model that determines the optimum separation between these peaks… INFO @ Tue, 21 Feb 2012 18:31:19: #2 Build Peak Model... INFO @ Tue, 21 Feb 2012 18:32:01: #2 number of paired peaks: 9916 INFO @ Tue, 21 Feb 2012 18:32:04: #2 predicted fragment length is 133 bps # You ideally want >1000 peak pairs in the model, but MACS will run (giving a warning) so long as you have ~100. Importantly, the predicted fragment size should be within +/-10 or 20 bp of your estimate of the ChIP fragment size in your library (equal to your average final library fragment length minus the ~90 bp length of the adapters). If it is not, then the model is probably based on spurious read peaks resulting from, for instance, sonication sensitive sequences that get cleaved at high frequency. #You can visualize the quality of the model as follows: copy the “.r” file generated by MACS (e.g. LiE_man2.33_v_Li_Input_bw210_dup1_model.r to your PC. Open R. Set the working directory to equal the folder with the .r file using: setwd(“[file path]”) #Then generate a viewable .pdf from the .r file using: source(“filename.r”) Workflow for ChIP-seq & RNA-seq data analysis, GRS 2012 4 #This should show two broad peaks for plus strand & minus strand reads and a 3rd for the read density center called by MACs based on its predicted fragment length. Sharp peaks separated by less than 50 bp are peak call errors due perhaps to sonication-sensitive sequences. If you also see broad peaks flanking these sharp peaks it means that it may be possible to identify the real peaks if you adjust the MACS parameters. In particular the --mfold=X,Y parameter tells MACS to use peaks that are between X and Y fold over background in calculating the model (default 10 & 30). Sharp false peaks are often very high over background, so reducing the max -mfold can often help (e.g. --mfold=10,20). If signal for real binding events is not strong, you can also reduce the lower parameter (I’ve had good results sometimes with --mfold=7,12) – with the caveat that this is more likely to call true noise as a peak. # Finally you get to the number of peaks called: INFO @ Tue, 21 Feb 2012 18:38:19: #3 Finally, 55920 peaks are called! # Then, as a control, MACS swaps treatment and control data, considering treatment as background and asking for the peaks detected in the control data. INFO @ Tue, 21 Feb 2012 18:38:19: #3 find negative peaks by swapping treat and control INFO @ Tue, 21 Feb 2012 18:39:08: #3 Finally, 1119 peaks are called! # This is a measure of how many peaks result from random noise. Ideally you want “negative peaks” to be 10-fold or more lower than “positive peaks”. This is how MACS calculates empirical false discovery rate, so the lower the ratio of negative to positive peaks, the better your FDRs will be. If these values are roughly equal, this indicates that non-specific fragments in your ChIP DNA are more numerous than the fragments specifically brought down with your antibody & may suggest you need to optimize your ChIP conditions or try other antibodies. # Keep tweaking --mfold ranges until you get the correct estimated fragment size & a high positive/negative peak ratio, if possible. Beware that the smaller the mfold range you allow the fewer qualifying peaks there will be. If the model won’t resolve, you can tell MACS to forget the model & just use your estimate of fragment size by adding these parameters to your MACS command “--bw=fragment_length --shiftsize=fragment_length/2 --nomodel" & hope that this gives a good positive/negative peaks ratio. #To install MACS 1st go to: liulab.dfci.harvard.edu/MACS/ & choose the download option & ctrl-click to copy the link location. If you need to login use default username=macs & password=chipseq To download directly to the cluster, you will need to include login information in the wget command: wget --http-user=macs --http-password=chipseq [paste url] Follow the instructions for installation. To get it to work, you will probably need to change the default version of python with: module add python/2.6.5 6) SUBSAMPLING BEFORE MACS, PLUSES and MINUSES # I am told that the P-values and fold-enrichment values given by MACS are sensitive to the relative number of reads in the treatment vs control files, and if this number is very different (maybe > 1.5x different) MACs may not accurately report these numbers (even though the accuracy of peak calls should be mostly unaffected). # One way to handle this is to run MACS with all of the data initially, and compare tags after filtering for treat and control. Then take a subsample of the dataset with the higher number of reads & re-run MACS. E.g. if control has 60M reads after filtering and treat has 20M, take a 33% subsample of the control & rerun MACS. The following small command-line perl script can be used to subsample reads, in this case 33% of lines in LiE_man2.bed are put into LiE_man2_33pct.bed. perl -e 'open(F1,"FILE.bed"); open(FH2,">","FILE_33pct.bed"); while(<F1>){if(rand()>.666){print FH2 $_;} }; close F1; close F2' #This will not give an exactly equal number of reads, especially if % redundancy is high, but may get you close – especially if you try a few iterations. To get a truly equal number requires Workflow for ChIP-seq & RNA-seq data analysis, GRS 2012 5 deduplicating the reads _outside_ of MACS and feeding MACS an equal number of these reads. Unfortunately, this is trickier than it might seem, involving converting your bed into a .sam file, running two samtools programs and then converting back to .bed. #Be aware that subsampling is a delicate trade-off which increases the effective noise & reduces the accuracy of the subsampled dataset in order to improve MACS fold enrichment and p.value output accuracy. 7) COMPARING PEAK CALLS TO UNDERLYING READ DENSITY DATA #If you included the -B & -S parameters in MACS it will create subfolders containing treat and control bedgraph (.bdg) files which map the read density base-pair per base-pair across the whole genome. #Unfortunately MACS often extends predicted density a few bases past the UCSC-browser recognized chromosome ends, causing errors. To fix this you can modify and run a short perl program as follows: vi trim_bdg_chrom_ends.pl i #cut and paste this content after adjusting the chromosome lengths to what the UCSC genome browser reports for the species and build you are using (you can change the number of base pairs per chromosome, add or subtract numbered chromosomes without problem, but you’ll need to make further adjustments to the program if you add a non-numbered chromosome other than X, Y and M): =head1 Simple file to remove bed or bedgraph regions that excede chromosome ends in mouse mm9 =head1 Usage Input: At command line, type: > perl trim_bdg_chrom_ends.pl input_bed_or_begraph_file.bdg output_filename.bdg =head1 Version Information Gavin Schnitzler 6/29/2012 =cut my %chromhash=( chr1=>197195432, chr2=>181748087, chr3=>159599783, chr4=>155630120, chr5=>152537259, chr6=>149517037, chr7=>152524553, chr8=>131738871, chr9=>124076172, chr10=>129993255, chr11=>121843856, chr12=>121257530, chr13=>120284312, chr14=>125194864, chr15=>103494974, chr16=>98319150, chr17=>95272651, chr18=>90772031, chr19=>61342430, chrX=>166650296, chrY=>15902555, chrM=>16299 ); #print "Chr1: $chromhash{chr1}\n"; open (FH1, "<", $ARGV[0]) or die ("Could not open input bed or begraph file $ARGV[0]\n"); open (FH2, ">", $ARGV[1]) or die ("Could not open output file $ARGV[1]\n"); while(<FH1>){ Workflow for ChIP-seq & RNA-seq data analysis, GRS 2012 # 6 chomp; @tmp_dat = split(/\t/, $_); if($tmp_dat[0] =~/track/){print FH2 $_; next;} if($tmp_dat[2]>=$chromhash{$tmp_dat[0]}){next;} print FH2 $_; } close FH1; close FH2; #finish editing in vi by [esc] :wq #then run the program with: perl trim_bdg_chrom_ends.pl filename.bdg outputfilename.bdg #Next compress your file for uploading to the browser with: gzip outpufilename.bdg #copy this file to your desktop using WinSCP (in windows) or sftp within a shell window (in mac OSX) #open the genome browser, select your genome, click add custom tracks, browse for the file name & hit upload (it will take some time, but should eventually finish). Do the same with your input control .bdg file. # click add custom tracks again & upload the MACS result file ending with peaks.bed # if you are comparing ChIP results from more than one condition or with more than one antibody you can upload .bdg and .bed files for each one. # Now scan along any chromosome & examine the peaks MACS called. Are they believably above background? Make note of the coordinates of believable and non-believable peaks, then open the peaks.xls file made by MACS into Excel on your desktop. Is there a p-value or foldenrichment threshold that separates most good peaks from most bad peaks? If so, you may want to apply this(these) threshold(s) to create your final list of peaks. # Note that the height of the browser graphs for each track will be proportional to the number of reads after filtering that MACS used for your treat and control samples. If these differ considerably, you can adjust for this easily enough by just mentally multiplying the axis values for the sample with more reads by the ratio of [reads for the smaller sample]/[reads for the larger sample] (e.g. in the example above, you’d multiply the control sample axis values by 1/3). # A better approach would be to do the following to normalize each .bdg entry by dividing by the number of millions of reads after filtering (e.g. 20 for treat and 60 for control in the example above). This reads per million base pairs (RPM) normalization is easy enough to do in awk, e.g.: awk 'OFS="\t" {print $1,$2,$3,$4/20}' treat.bdg > treat_normalized.bdg #Note: the browser is fine with non-integral values in .bdg files 8a) EXAMINING OVERLAP BETWEEN PEAKS AND OTHER GENOMIC FEATURES, INCLUDING TRANSCRIPTION START SITES OR OTHER CHIP-SEQ PEAKS #To determine the degree to which two sets of bed regions overlap with each other use: bsub perl /cluster/home/g/s/gschni01/perl_programs/overlap_1.2.txt File1.bed File2.bed -outfile File1_v_File2.overlap #the output file will summarize the number of ranges in each input file & the average length of each range & then provide details of all ranges from file1 that overlapped 1bp or more with those in file 2. #the program will also provide an estimate of the number of overlaps that would be expected by random chance, from the formula: (avg_length_file1+avg_length_file2)*regions_in_file1*regions_in_file2/genomesize Workflow for ChIP-seq & RNA-seq data analysis, GRS 2012 7 #this provides a reasonable first pass estimate, which can be used in binomial tests (see **12 below) which would be structured as follows: hits=overlaps, tests=regions_in_file1, bkg_freq=overlaps_by_chance/regions_in_file1. # this background estimate, however, fails to account for the fact that bed regions in each file are non-overlapping, and the fact that more than one bed region in file1 might overlap with a single region from file2. If you have relatively small numbers of not-too-long peaks in file1 & file2 (such that their combined total bp is less than 1% of the genome) this estimate is probably good enough. If not, it is better to establish a background overlap frequency empirically. This can be done by creating a randomly-distributed non-overlapping set of regions of equal length to those in file2 & repeating the overlap of file1 with this file2_background set. To do this use: perl /cluster/home/g/s/gschni01/overlap_bkg_generator.pl file2.bed random_bkg_for_file2.bed # Running overlaps.pl with file1 versus random_bkg_for_file2.bed gives an empirically-determined number of overlaps expected by regions in file1 with randomly distributed same-length regions in file 2. If there are more than 5000 regions in file 2 this should be pretty accurate. If the regions in file2 are less than this you may want to make multiple bkg set files, run overlaps.pl on each of them & take the average as your background hits. #Note, if you want to generate any number of ranges of a given fixed size (e.g. for +/-50 kb from TSSes in the example below), create a raw feed file for overlap_bkg_generator.pl using the following template (where the chromosome number and exact bp positions are irrelevant, the only thing that matters is the distance between start & end). perl -e 'for($x=1;$x<=10000;$x++){print "chr1\t100000\t200000\n";}' > plus_minus_50kb_feed_to_rand.bed perl /cluster/home/g/s/gschni01/overlap_bkg_generator.pl plus_minus_50kb_feed_to_rand.bed random_100_kb_regions.bed 8b) USE OF CEAS TO LOOK AT DISTRIBUTION RELATIVE TO TSSES, EXONS & CHROMOSOMES Go to the Galaxy/Cistrome website at: http://cistrome.dfci.harvard.edu/ Upload or paste .bed file of peaks w/ no header Run Integrative analysis/CEAS, choosing appropriate range from TSSes (note: it appears that span should be set to equal the highest range value). Note also that the number of peaks in intergenic regions isn’t given directly but can be calculated, & p. value is not given at all (but can probably be assumed based on p. values for similar differences). 8c) OTHER CISTROME TOOLS: peak2gene gives location of nearest gene to peaks Conservation Plot shows average degree of conservation of sequences relative to the center of the peaks Venn Diagram Shows overlap of up to 3 sets of peaks or features GCA: Gene Centered Annotation, finds nearest binding site for each gene in the genome (inverse of peak2gene). Cistrome has many additional tools that might also be useful for certain applications. 9) GETTING SEQUENCES FOR PEAKS Create an excel file containing the regions of interest* in bed format: chr# [tab] start [tab] end [tab] optional_additional_columns *These are just the bed coordinates from the …peaks.xls file from MACS. A perhaps better approach is to limit the sequence to +/-200 bp (400 bp total) around the peak apex, which also reported in the peaks.xls file. Workflow for ChIP-seq & RNA-seq data analysis, GRS 2012 8 *In addition to your bed regions of interest (e.g. ERalpha binding sites in aorta), you should also prepare a similar-sized set of control regions (of similar average length to the ChIP peak bed regions). One option is to take flanking region DNA that is, say, ~2000 bp from the edge of your peak. You can easily get this in excel by setting the 3’ edge of the control region to the 3’ peak edge -2000 bp and the 5’ peak edge to –(2000+length of the peak, or average peak size). Alteratively, you can select random regions from the mouse genome, distributed on chromosomes in roughly the same proportion as ChIP peaks. I did this by adding columns after the ChIP-seq bed columns, the first equal to the bed region chr#, the second being a random number from 1 to the size of the chromosome, and the 3rd being that + peak length (or average peak length). After creating these, sort by chr# and starting bp and eliminate any overlap. Copy and paste these bed regions into text files for both the ChIPseq peaks and the control regions. Add a first line that says: track name=short_descriptor_of_data save again as text. Upload as a custom track in the UCSC genome browser Select “go to table browser” Select the track you want, and output format: sequence Give a name for the output file, ending in .fa Hit “get output”. Transfer that file to your cluster account. Run this convert script to simplify the first line for each entry (otherwise Storm will choke): perl /cluster/home/g/s/gschni01/perl_programs/Lax_convert.pl FILE.fa > FILE_corrected.fa 10) LOOKING FOR MATCHES TO TRANSFAC MATRICES USING STORM export CREAD=/cluster/shared/gschni01/cread-0.84 export PATH=$PATH:$CREAD/bin bsub -oo FILE_f.85.storminfo storm -f -t 0.85 -s FILE_corrected.fa -o FILE_f.85.storm /cluster/shared/gschni01/cread-0.84/vertebrates.mat # -f –t 0.85 indicates to identify matches between test sequences and matrices that give 85% of the maximal score. Short sequence elements reach this threshold easily, while long elements reach it rarely. bsub -oo FILE_p.0005.storminfo storm -p -t 0.0005 -s FILE_corrected.fa -o FILE_p.0005.storm /cluster/shared/gschni01/cread-0.84/vertebrates.mat # -p –t 0.0005 indicates to identify matches between test sequences & matrices that would occur by chance less than or equal to .05% of the time. This is a better way of identifying long sequence elements, but might be poor at detecting shorter elements, since even perfect matches to the matrix might happen by chance at higher than this rate. # Below, I describe a method that allows the use of a combination of both of these measures. 11) RUNNING DME_PARSE TO INTERPRET STORM OUTPUT bsub -oo FILE_f.85.dmeparseinfo perl /cluster/home/g/s/gschni01/perl_programs/dme_parse5.4.pl FILE_f.85.storm FILE.bed peaks # where .storm is your storm output file and .bed is a file simply containing 3 tab separated values identifying the bed ranges fed to the UCSC browser to get sequence. “peaks” at the end specifies that this is ChIP-seq peak data with site enrichment expected to be greatest at the center of the peak, and varying region lengths allowed. The alternative here is “promoters” Workflow for ChIP-seq & RNA-seq data analysis, GRS 2012 9 which does not make any assumption about where sites will be enriched and which requires that all sequences be the same length. # note that this also works fine if the storm file is further processed using cread programs such as the following: motifclass –Orv –f ChIP_regions.fa –b background_regions.fa –o outputfilename.motifclass ChIP_region_storm_output.storm –P 1000 #this uses CREAD’s methods of assessing significance of enrichment relative to bkg seqs sortmotifs –k RELATIVE_ERRORRATE –a –o outputfilename.sortmotifs previous_file.motifclass #this gives a ranking of significant enrichment by the relative error rate metric from motifclass #DME parse gives several output files, with these endings appended to the input storm file name: “.runinfo”: summarizes the parameters and input files as well as the names & contents of output files. It also contains a table of the number of times a bin of 50bp size outwards in both directions from the peak center was covered by peak bed regions (used for some internal calculations when raw peak bed regions of differing sizes are used as input). “.info”: contains tab separated data including: [all summary information provided by storm (as well as any potential additional programs) for each TFBS matrix], e.g. MATCH1, INFO, RELATIVE_ERRORRATE, SCORE, etc., followed by one letter code words for the input position weight matrix (“consensus”) and the matrix derived from matches found by storm (“data_consensus”) (which unfortunately don’t work very well, for complete info see the .mat file), followed by 12 columns tabulating frequency of regions with 0, 1,2,3… 10 & 11 or more of each given site, a column reporting total matches found & columns (BP of sequence/binsize, default binsize=50) which tabulate site distance from sequence edge. “.freqs” lists tab separated summary information about each peak, starting with the chr#, start & end information from the ,bed file fed to storm (as well as any additional following columns of information that may have been there), followed by columns for each matrix included in the analysis, with each cell giving the number of matches to that matrix found in that bed region. “.bed” contains bed format regions for each TFBS, suitable to view on the UCSC genome browser. “.mat” contains the position frequency matrix used by storm followed by the position frequency matrix derived from the binding site matches in your data that storm identified. 12) BINOMIAL TESTS TO CALCULATE P VALUES To do the binomial tests you will need the “total_sites” number in the .info file generated by dme_parse & the number of bed regions, for both your background and your experimental sets (run Storm and DME parse for both background and foreground sets). In excel create four columns for each transcription factor binding site: Column 1: transfac matrix identifier for the site [make sure that this column header is “Name” (without the quote marks). Column 2: total_sites reported for your foreground (e.g. ChIP) data Column 3: total number of base pairs searched for these sites, which is calculated as (number of bed regions)*(average # of bp per region). Considering that TFBS sites are generally longer than 1 bp (thus the last few bases cannot match to anything in the matrix), you could use (BP per promoter)-(average length of TFBS sites in the matrix). I used 1200-6, but it would be close enough just to use 1200. More accurately, you could subtract one less than the length of the consensus sequence column text. Column 4: background frequency: (total_sites_from background)/(# of background promoters * length of bkg promoters).These numbers are derived from your storm & then dme_parse analysis of your background region sequences. Open R and then do the following… Workflow for ChIP-seq & RNA-seq data analysis, GRS 2012 10 With the table you prepared above ---->, select and ctrl-C copy In R: btest<-read.delim("clipboard",header=T) #if this doesn’t work, you’ll need to save the table to a text file and use file=”filepath/filename.txt” in read.delim btest[1:4,] #make sure it input properly for(x in c(1:length(btest$Name))){ btest[x,5]=binom.test(btest[x,2],btest[x,3],btest[x,4])$p.value } btest[1:10,] write.table(btest,file="E176vWT_UPvCTRL_binom.xls", sep="\t") Next, in Excel, open output file & choose all but the index column. Sort by Name & then sort the dme_parse output in the original file so it's also by Motif_ID Insert the new data after the >=11 column using Insert/Cells If desired, delete Name column & other extraneous ones. Add adjusted_p column, multiply p.value by 585 (# of tests=# of motif matrices searched)* * Alternatively, use the less conservative Benjamini-Hochberg correction. To do so, sort your raw p values from lowest to highest and give each a rank (with lowest=1, next =2, etc.). For each TFBS adjusted_p = raw_p*595/rank. Next calculate enrichment ratios: foreground/background frequency (fg_counts)/(fg_bp) / (bkg_counts)/(bkg_bp), which will indicate how great the enrichment/anti-enrichment of each site is. Once you know the enrichment ratio, sort first by p.value & then by this to set up 2 lists of significant enrichment differences, sites that are significantly overenriched (adjusted p.value <.05 & fg/bkg>1) and sites that are underenriched relative to chance (fg/bkg<1). Obviously, sites with great enrichment over background are potentially most interesting, but they should also, ideally, be present in a large fraction of sequences. To estimate the fraction of ChIP-seq peaks that have enrichment for any given TFBS (assuming each sequence has one more site than background sequences) simply calculate, in Excel: =(foreground_matrix_hits-background_matrix_hits)/total_bed_regions 13) PLOTTING TFBS FREQUENCY RELATIVE TO PEAK CENTERS In the final columns of the “.info” file DME parse also measures the frequency of matrix matches relative to peak centers, one row for each transcription factor. To look at this distribution, create a new row with base pair position at the center of each bin (the first one goes from -1000 to -950, so the value would be -975). Plot the desired row data on the Y axis with BP position on the X. Note, this data has been normalized to matrix matches per kb of DNA sequence (e.g. number of matches/(# of bed regions*50 bp bin size/1000bp). This allows you to directly compare results across different conditions (e.g. ERalpha ChIP from liver and aorta). This will tell you whether enrichment is tightly associated with peak centers or broadly distributed, or even (potentially) associated with flanks but not centers, as might be the case for a factor that contributes to an enhancer that your ChIP’ed factor often is part of, but which does not directly recruit your ChIP’ed factor to chromatin. 14) USING THE .FREQS FILE TO IDENTIFY PEAK SUBSETS ENRICHED IN ANY GIVEN FACTOR Paste the .freqs file from your background region dme_parse run into excel & in the rows above calculate: =PERCENTILE(range_including_all_values_in_column,0.95) # this is the 95th percentile of sitesper region values =PERCENTRANK(all_values_in_column,value_from_above+1) Workflow for ChIP-seq & RNA-seq data analysis, GRS 2012 11 # this is the fraction of bed regions w/ sites per region equal to 95th percentile value or lower Copy & “paste special” these as values in rows above the same TFBS matrix IDs of the dme_parse results from your foreground ChIP-seq data. Then in additional rows calculate: =COUNT(IF(range_including_all_values_in_fg_data_column>95th_percentile_bkg_value_cell,)) [hit ctrl shift enter to activate, should be surrounded by {} brackets when you click on the cell again] #this is the number of ChIP seq peaks with more sites than the 95th percentile value =INT(number_from_countif-(1-percentrank_value_from_bkg_set)*total_number_of_peaks) # This is the number of sites in excess of those that would have this characteristic by random chance. If this number is high, especially if it is greater than say ~1% of your total peaks, it suggests that choosing all peaks with sites greater than the 95th percentile value is likely to identify peaks with functional enrichment of that site. If you then perform a descending sort on the chosen column (making sure to sort ALL columns as well) and take those greater than the 95th percentile value, you’ve got a set of putative target locations. #This analysis will let you identify binding site peaks for your ChIP’ed factor that are highly enriched in consensus motifs for another factor. These can serve as candidate regions to check by ChIP for the presence of the other factor and/or on which to test for the loss of binding of your ChIP’ed factor after knock down or inhibition of the other factor. 15) ALL-BY-ALL OVERLAPS ANALYSIS TO IDENTIFY CO-ENRICHED OR MUTUALLYEXCLUSIVE TFBS SITES Create a new column that makes a single-word-identifier for these bed regions using: =concatenate([chromosome_cell],”:”,[start_cell],”-“,[end_cell]) & copy & “paste special” as values these putative target IDs in columns to a separate sheet, copy this new table (capturing the longest column, and including empty cells for shorter columns). If you ran both f .85 and p.0005 storm analyses you can repeat the analysis below for each one, or (since many peaks will show site enrichment by both methods) simply combine both lists together, removing duplicates. First, modify the R commands below as follows: replace “45” with the number of columns you have and replace “11975” with the number of original ChIP-seq peaks used for the storm analysis. Then, in R do: test<-read.delim("clipboard",na.strings="",fill=T,header=T) dim(test) #tells the number of columns and rows colnums=c(1:45) #this should be equal to the number of columns in test intersections=matrix(data=NA,ncol=45,nrow=45) testout=matrix(data=NA,ncol=45,nrow=45) chance=matrix(data=NA,ncol=45,nrow=45) ratios=matrix(data=NA,ncol=45,nrow=45) for(x in colnums){for(y in colnums){chance[x,y]<-(length(na.omit(test[,x]))*length(na.omit(test[,y]))/11975)}} for(x in colnums){ids[x]=colnames(test)[x]} ids colnames(chance)=ids rownames(chance)=ids chance Workflow for ChIP-seq & RNA-seq data analysis, GRS 2012 12 for(x in colnums){for(y in colnums){if(x!=y){testout[x,y]<binom.test(length(intersect(na.omit(test[,x]),na.omit(test[,y]))), length(na.omit(test[,x])),length(na.omit(test[,y]))/11975)$p.value}}} colnames(testout)=ids rownames(testout)=ids testout for(x in colnums){for(y in colnums){intersections[x,y]<-length(intersect(na.omit(test[,x]),na.omit(test[,y])))}} colnames(intersections)=ids rownames(intersections)=ids intersections ratios=intersections/chance colnames(ratios)=ids rownames(ratios)=ids ratios write.table(testout,file="temp.txt",sep="\t",col.names=NA,quote=F) #copy this table of p. values for the overlap of sites enriched in each factor relative to each other factor into Excel, then repeat write.table & copy for intersections, chance & ratios. To be conservative, multiply the p.values by 45*44/2 (the number of relevant non-self-to-self comparisons for the binomial tests) – essentially the Bonferroni correction for multiple testing. Be aware that this is p. value for either enrichment (more overlaps than chance) or anti enrichment (fewer overlaps than chance) – which you can easily determine by looking at the ratios table. # This analysis will tell you which TFBS matrices tend to group together (very low p values & high ratios) versus what ones group separately (very low p values & low ratios), which can give insights into functional modules of factors. Note that some will group together because their sites are highly homologous (with only one of those factors likely to be relevant) – and this can be determined by STAMP analysis as described in **18 below. 16) ALTERNATIVE METHOD FOR TFBS IDENTIFICATION: CENTDIST Centdist is specially designed for detecting enriched TFBSes in peak centers relative to surrounding sequences. It’s two improvements over simple Storm analysis is a dynamic determination of the optimal range from the center to give the best fg/bkg ratio & its consideration of “peakiness” as part of the score. The downside of using CentDist is that it does not make the individual matrix hit locations available, it doesn’t allow adjustments to parameters (e.g. what constitutes a hit), it doesn’t provide any direct way of quantitating fold enrichment, and it can’t identify anti-enrichment. CentDist is available only on line, with limited storage capacity for jobs, at: http://biogpu.ddns.comp.nus.edu.sg/~chipseq/webseqtools2/TASKS/Motif_Enrichment/submit.p hp?email=guest It allows results to be viewed by top “families” of related TFBSes or by top individual factors/matrices (which is more useful). Viewing results from all matrices may crash your browser. Choosing export data however gives you all data in a text file (unfortunately, losing the also-useful graphs). 17) INTEGRATING RESULTS FROM SEVERAL METHODS, SIMPLE THRESHOLDS VERSUS PERMUTATION OF RANKS APPROACHES If you have multiple measures that assess TFBS enrichment, you can arbitrarily decide to take only TFBSes that are significantly enriched at some threshold in all (or most) methods. This is often OK, but can be too conservative and leave you with few positives. Workflow for ChIP-seq & RNA-seq data analysis, GRS 2012 13 Alternatively you can estimate the significance of the combined results of all measures (e.g. what is the FDR for enrichment of V$SP1_01 matrix based on combined Storm f.0005, Storm p.85 & CentDist results). You can do this by setting up a system to rank the results from each method, randomly permuting the ranks from each method, determining the average ranks, & then the distribution of average ranks curve to estimate the probability that any given average rank (e.g. for V$SP1_01 enrichment) would occur by chance (a measure of false discovery rate). First, you need to give ranks to each entry for each measure. It is important that these ranks be based on something that is not a property only of that dataset, and ideally that is tied to probability of occurance in some way (e.g. use reported p.value or fold enrichment, not a percentile of data values). The reason for this, is that 3 entries that are all 99th percentile from a dataset with poor underlying enrichment (e.g. 99th percentile is 1.1x enriched and raw p.=0.1 by each measure) should not give the same very low FDR value that would be associated with a dataset with strong underlying enrichment (e.g. 99th percentile is 4x enriched and p.=1e-20 by each measure). Now take these columns of ranks, plus a first column of identifiers (like V$SP1_01), copy them from Excel, open R & do: Dat<-read.delim(“clipboard”,sep=”\t”, header=T) Outside of R edit a text file to read as follows & save this as Permute_ranks.txt. # Permutes average of ranks from >= 2 samples to give a single p value # Required input: # Dat is a matrix. Column 1= names, Column 2 = Ranks for dataset 1, Column 3 = ranks for dataset 2, etc. # Default number of permutations is 100. To set something else do: nperm<-# before running script # IMPORTANT NOTE: Scale your rank values so that your maximum possible average rank is less than 20. if(!nperm){nperm <- 100} print ('Number of permutations was:') print (nperm) ravranks<-c(0,0) freq<-c(0,0) pval<-c(0,0) binfreq<-c(0,0) origfreq<-c(0,0) binlab<-c(0,0) #cnames=c(0,0) #fdr=c(0,0) #cnames[1]<-colnames(Dat)[1] avranks<-rowMeans(Dat[,2:length(colnames(Dat))]) print('Number of rows in dataset was:') print(length(avranks)) for (i in c(1:200)){ binfreq[i]<-0 origfreq[i]<-0 binlab[i]<-(i/10) } for (i in 1:length(avranks)){ ravranks[i]<-0 freq[i]<-0 pval[i]<-0 if(avranks[i]<20){ origfreq[as.integer(10*avranks[i])]=origfreq[as.integer(10*avranks[i])]+1} } print ('Calculating permutations of summed ranks...') for (i in 1:nperm) { Workflow for ChIP-seq & RNA-seq data analysis, GRS 2012 14 rDat<-data.frame("data"=sample(Dat[,2])) for(j in 3:length(colnames(Dat)) ) { rDat<-data.frame(rDat,"data"=sample(Dat[,j])) } ravranks<-rowMeans(rDat[,1:length(colnames(rDat))]) for(k in 1:length(avranks)){ freq[k]<-freq[k]+sum(ravranks <= avranks[k])/length(avranks) if(ravranks[k]<20){ binfreq[as.integer(10*ravranks[k])]=binfreq[as.integer(10*ravranks[k])]+1} } } print ('Calculating p values...') for (j in 1:length(avranks)) { pval[j]<-freq[j]/nperm } Out<-data.frame(Dat,"Avg_rank"=avranks,"P.value"=pval) freqout<data.frame("bin_top"=binlab,"count"=binfreq,"rel_freq"=(binfreq/(length(avranks)*nperm)),"orig_freq"=(origfreq/length( avranks))) write.table(Out, file="Permute_ranks.output", sep="\t", col.names=NA) write.table(freqout, file ="Permute_ranks.freqs",sep="\t",col.names=NA) print ('Output sent to Permute_ranks.output.') print ('Frequencies of ranks in original & permuted data sent to Permute_ranks.freqs.') print ('Be sure to rename files before additional runs.') In R do: source(“Permute_ranks.txt”) The .output file lists the original values for each identifier & gives a p.value/FDR for each. The .freqs file gives columns that can be plotted to give the normalized frequency of any given rank (by each 0.1). The straight-up results from the .output file are accurate only within one condition. If you have two or more conditions and you want to compare FDRs from one condition to those from another, use the method below. First, take all of the values from all of your conditions and catenate them together (3 samples=3x longer columns) & run Permute_ranks on this combined data. Next, take the results from the .freqs file to Excel, and perform a running sum of these frequencies to give the FDR value for any given rank. To do this: if your ranks list column is A2:A100 & your frequencies data column is B2:B100, in C2 type “=B2”, in C3 type “=C2+B3” & then propogate this formala down. If it works properly the final value will be 1.0. With this you can assign p.values to any rank value under any condition using: =vlookup(int(10*rank)/10,$A$2:$C$100,3,FALSE) Finally, when you want to know the significance of an average rank across methods derived from averages of more than one entry within a method (e.g. all six of the matrices for V$SP1_...), to get an accurate FDR value you need to feed replicates of the rank numbers for all methods to the ranking program (e.g. w/ a rank determined by averaging 6 entries from 3 methods, repeat the ranks from each method 6 times for a total of 18 columns). This makes sense because if rank of 1 occurs 10% of the time in each method, a random average of 1 occurs much more frequently (1e-3) when you’ve averaged 3 ranks than when you’ve averaged 18 (1e-18). 18) USING STAMP-GENERATED TREE DIAGRAMS OF MATRIX SIMILARITIES TO FURTHER EXPLORE THE MEANING OF SITE ENRICHMENT a) Chose TFBS ID’s you want and create a file on the cluster one line per entry, e.g. Workflow for ChIP-seq & RNA-seq data analysis, GRS 2012 15 V$TFBS_01 V$TFBS_02 … b) perl /cluster/home/g/s/gschni01/perl_programs/MatrixSelect.pl [Storm_output.storm or_matrix_file.mat] selected_sites_file.txt outputfile.mat c) upload the output file (or copy and paste its contents) into STAMP at http://www.benoslab.pitt.edu/stamp/index.php. d) Run analysis: I tend to use mostly default settings, but w/ trim edges set to info content of 0.0 and requesting the 10 best matches to transfac v11.3. e) The output will show the consensus for all your input matrices followed by a tree, and then by the best matches for each input TFBS matrix to transfac v11.3. f) You can display the tree better by using the “newick formatted tree”. g) On your PC download & install MEGA from www.megasoftware.net. h) Click on the “newick formatted tree” in your STAMP output, select all and copy that long string, paste this into a simple text file and save with the .nwk extension. i) In MEGA chose user tree->display newick tree, open your .nwk file. There are many options for display. I like the circle under the tree/branch style button. When you have the tree (or branch) you like, use “image-> copy to clipboard” to paste this into something like powerpoint. j) TFBSes that are tightly clustered at the same radial distance on the outer ring of the circle are closely related. There is likely to be only one factor or family of factors that is enriched, while the others just show up by homology. Determining which TFBS or family in a closely related group is most likely to be real: a) Note the p values, enrichments and number of sites you got from your Storm & binomial test analysis. If one TFBS ID has by far the lowest p value & the best enrichment and is represented by a large number of sites (>~100 and roughly equal or greater than the number of sites for the other TFBSes) it is very likely to be the real one. b) For less clear cut cases, you can examine the specific match sequences found by Storm to see what they most resemble. Is it the matrix that they were originally found to match, or is there another matrix that they match better to? Prepare a file with one TFBS ID in (V$TFBS_01 format) per line for all those IDs in a group you want to examine. c) Move this file to the cluster and do: perl /cluster/home/g/s/gschni01/MatrixSelect4dmeparseMat.pl dme_parse_output_file.mat TFBS_IDs list.txt outputfile.mat **This should select only the matrices derived from all of the ‘matches’ found in your sequence data to the previously-established matrices you named, although it could potentially need some debugging… d) Either load this file or cut and paste its contents into STAMP & run as above (comparing to Transfac v 11.3). In this case STAMP is identifying the transfac matrix that best fits the matrix formed from the sequences in your data that “matched” your chosen matrices, e.g. V$TFBS_01, etc. e) Look at the best fits to transfac matrices and the p values. If you are trying to see which of sites A, B, C & D are real, and find that A & C have highly significant matches to appropriate A* & C* matrices in Transfac 11.3, while B & D do not have highly significant matches to B* and D* in transfac, it suggests that A & C are real. Further support for this conclusion would be provided if you see that B and D fit A* and C* better than B* and D* in Transfac 11.3. For consideration & speculation on how to tell whether the enrichment of a short TFBS matrix can be explained by its homology to a longer TFBS matrix in the same STAMP homology group, see the file 2012_01_12_analyzing_enriched_TFBS_similarities.doc. Workflow for ChIP-seq & RNA-seq data analysis, GRS 2012 16 19a) MAPPING AVERAGE READ DENSITY RELATIVE TO GENOMIC FEATURES LIKE TSSES **Coming soon 19b) MAPPING NUCLEOSOME DENSITY RELATIVE TO GENOMIC FEATURES LIKE TSSES **Coming soon