ChIP-seq Methods & Analysis Gavin Schnitzler Asst. Prof. Medicine TUSM, Investigator at MCRI, TMC gschnitzler@tuftsmedicalcenter.org 617-636-0615 ChIP-seq COURSE OUTLINE • Day 1: ChIP techniques, library production, USCS browser tracks • Day 2: QC on reads, Mapping binding site peaks, examining read density maps. • Day 3: Analyzing peaks in relation to genomic feature, etc. • Day 4: Analyzing peaks for transcription factor binding site consensus sequences. • Day 5: Variants & advanced approaches. DAY 4 OUTLINE • Position weight matrices to find transcription factor binding sites (TFBSes) • TFBS enrichment in peaks using CentDist • TFBS enrichment using Storm in UNIX • Mining Storm results • Disambiguating similar matrices w/ STAMP DAY 3 REMNANT • Analyzing overlaps between peak & regulated genes in UNIX How can we test the significance of binding site association w/ regulated genes? If you haven’t already, go to the cluster & move bed and txt files to your /cluster/shared/userID/chip folder (mkdir chip & cd chip if you don’t have this folder yet): cp /cluster/tufts/cbi*/Ch*/Sam*/ER*beds/*.* . The .txt files list the transcription start sites (TSSes) of genes that were up- or down-regulated by estrogen in aorta or liver (by RNA-seq analysis). Overlaps between peaks & genes Take a look at one of them using head [name].txt chr6 chr2 chr6 … 73171625 25356026 65540391 + Dnahc6 C8g Tnip3 The file format is (tab-delimited) chromosome, TSS, transcription direction (+=sense) & geneID. You can get all this info easily from the UCSC browser, for individual genes (by hand)… … or you can get this information for all genes & extract what you want for your gene set of interest.. Check out the RNA-seq module for info on making & handling .gtf files. Overlaps between peaks & genes 2 The overlap program can recognize this type of file & will test for overlaps between ChIP-seq peaks and regions around the listed TSSes (default +/-1000 bp). You can also change this range by specifying a –range variable. Find the overlaps between 10-kb regions around TSSes of genes up- or downregulated in each tissue & the corresponding ER binding site data using variations on: bsub perl /cluster/home/g/s/gschni01/perl*/overlap_1.3.pl Ao_up_TSS.txt AoE_all.bed –outfile Ao_up_v_AoE.overlap (these commands are in /cluster/tufts/cbi*/Ch*/Sam*/Fin*/workflow2.txt) Note the number of overlaps (hits), number of genes (tests) and the number of overlaps expected by chance divided by the number of genes (background frequency) provides all the information you need for binomial tests. Note these numbers down for each comparison. Accessing the R statistical language On the PCs in this room: Start->programs->R To get R for your PC (free): http://cran.r-project.org/ To get RStudio (allows for easier management of R projects): http://www.rstudio.com/ On the cluster type: module load R Then: bsub -Ip -q int_public6 R To exit use the R command q() For more info on using R & Unix see: http://sites.tufts.edu/cbi/resources/rna-seq-course/ UNIX resources & R resources Binomial tests in R Use the R command: binom.test(hits, tests, bkg_freq) to address the significance of overlaps you see For Ao_down_TSS.txt vs. AoE.bed: binom.test(118,2, 1.03/118) Which comparisons show significant enrichment. Do any show antienrichment? DAY 4 OUTLINE • Position weight matrices to find transcription factor binding sites (TFBSes) • TFBS enrichment in peaks using CentDist • TFBS enrichment using Storm in UNIX • Mining Storm results • Disambiguating similar matrices w/ STAMP What is PWM? Transcription factor binding sites (TFBSs) are usually slightly variable in their sequences. A positional frequency matrix (PFM) specifies the probability that you will see a given base at each index position of the motif. This is built from sequences known to bind the TF (e.g. 46 sequences for the PFM below). Pos 1 A 18 C 8 G 13 T 7 Con N 5’ 2 8 3 31 4 G 3 4 5 4 3 9 34 9 4 24 G T 5 1 33 8 4 C 6 29 4 10 3 A 7 7 21 11 7 N 8 7 15 15 9 N 9 7 14 19 6 N 10 0 0 4 42 T 11 1 0 44 1 G 12 39 1 3 3 A 13 1 43 0 2 C 14 1 39 1 5 C 15 6 18 6 16 N 3’ Adapted from presentation by Victor Jin, Department of Biomedical Informatics, The Ohio State University PFM->normalized PFM->PWM Binding site data 1. 2. 3. 4. 5. 6. 7. acggcagggTGACCc aGGGCAtcgTGACCc cGGTCGccaGGACCt tGGTCAggcTGGTCt aGGTGGcccTGACCc cTGTCCctcTGACCc aGGCTAcgaTGACGt 41. 42. 43. 44. 45. 46. cagggagtgTGACCc gagcatgggTGACCa aGGTCAtaacgattt gGAACAgttTGACCc cGGTGAcctTGACCc gGGGCAaagTGACTg ... Position frequency matrix (PFM) (also known as raw count matrix) Given N sequence fragments of fixed length, one can assemble a position frequency matrix (number of times a particular nucleotide appears at a given position). A normalized PFM, in which each column adds up to a total of one, is a matrix of probabilities for observing each nucleotide at each position (e.g. divide by 46). Position weight matrix (PWM) (also known as position-specific scoring matrix) The normalized PFM is converted to log-scale for efficient computational analysis. To eliminate null values before log-conversion, and to correct for small samples of binding sites, a sampling correction, known as pseudocounts, is added to each cell of the PFM. Adapted from presentation by Victor Jin, Department of Biomedical Informatics, The Ohio State University Position Weight Matrix for ERE Converting a PFM into a PWM A C G T 18 8 13 7 8 3 31 4 5 3 34 4 For each matrix element do: 4 9 9 24 1 33 8 4 29 4 10 3 7 21 11 7 7 15 15 9 7 14 19 6 0 0 4 42 1 0 44 1 39 1 3 3 1 43 0 2 1 39 1 5 6 18 6 16 N 4 N N p(b) f b ,i w(b, i ) log2 pb, i log2 pb A 0.58 -0.44 -0.98 -1.21 -2.29 1.22 -0.60 -0.60 -0.60 -2.96 -2.29 1.62 -2.29 -2.29 -0.72 C -0.44 -1.49 -1.49 -0.30 1.39 -1.21 0.78 0.34 0.25 -2.96 -2.96 -2.29 1.76 1.62 0.46 G 0.16 1.31 1.44 -0.30 -0.44 -0.17 -0.06 0.34 0.65 -1.21 1.79 -1.49 -2.96 -2.29 -0.64 T -0.60 -1.21 -1.21 0.96 -1.21 -1.49 -0.60 -0.30 -0.78 1.73 -2.29 -1.49 -1.84 -0.98 0.23 f b ,i – raw count (PFM matrix element) of nucleotide b in column i N – number of sequences used to create PFM (= column sum) N and N 4 - pseudocounts (correction for small sample size) p(b) - background frequency of nucleotide b Adapted from presentation by Victor Jin, Department of Biomedical Informatics, The Ohio State University Scoring putative EREs by scanning the promoter w/ PWM GGGTCAGCATGGCCA A 0.58 -0.44 -0.98 -1.21 -2.29 1.22 -0.60 -0.60 -0.60 -2.96 -2.29 1.62 -2.29 -2.29 -0.72 C -0.44 -1.49 -1.49 -0.30 1.39 -1.21 0.78 0.34 0.25 -2.96 -2.96 -2.29 1.76 1.62 0.46 G 0.16 1.31 1.44 -0.30 -0.44 -0.17 -0.06 0.34 0.65 -1.21 1.79 -1.49 -2.96 -2.29 -0.64 T -0.60 -1.21 -1.21 0.96 -1.21 -1.49 -0.60 -0.30 -0.78 1.73 -2.29 -1.49 -1.84 -0.98 0.23 m Absolute score of the site S w(b, i) =11.57 Row Sum Max 0.58 1.31 1.44 0.96 1.39 1.22 0.78 0.34 0.65 1.73 1.79 1.62 1.76 1.62 17.20 Min -0.60 -1.49 -1.49 -1.21 -2.29 -1.49 -0.60 -0.60 -0.78 -2.96 -2.96 -2.29 -2.96 -2.29 -24.02 i 1 relative_ score This is also called “functional depth” Absolute_ score Minim um_ score Maxim um_ score Minim um_ score 11.57 24.02 0.86 17.20 24.02 Adapted from presentation by Victor Jin, Department of Biomedical Informatics, The Ohio State University Estimating p. values for a match to the matrix GGGTCAGCATGGCCA A 0.58 -0.44 -0.98 -1.21 -2.29 1.22 -0.60 -0.60 -0.60 -2.96 -2.29 1.62 -2.29 -2.29 -0.72 C -0.44 -1.49 -1.49 -0.30 1.39 -1.21 0.78 0.34 0.25 -2.96 -2.96 -2.29 1.76 1.62 0.46 G 0.16 1.31 1.44 -0.30 -0.44 -0.17 -0.06 0.34 0.65 -1.21 1.79 -1.49 -2.96 -2.29 -0.64 T -0.60 -1.21 -1.21 0.96 -1.21 -1.49 -0.60 -0.30 -0.78 1.73 -2.29 -1.49 -1.84 -0.98 0.23 This sequence had a functional depth (f) of 0.86 The summed probabilities of all sequences with f >=.86 gives the p.value for this sequence = chance that f>=.86 would be achieved by a randomized DNA sequence. Short matrices can reach f > .9 but still have high p. values – thus f is the best measure for short seqs. Long matrices can have very low p. values but still have f< .9 – thus p.value is the best measure for long seqs. DAY 4 OUTLINE • Position weight matrices to find transcription factor binding sites (TFBSes) • TFBS enrichment in peaks using CentDist • TFBS enrichment using Storm in UNIX • Mining Storm results • Disambiguating similar matrices w/ STAMP Preparing for PWM search Lauch WinSCP (Start->programs->WinSCP) Navigate to: /cluster/tufts/cbicourse/ChIPseq/Sample_NGS_data/Final_ output_files Pull over the “ipvinput19_peaks.xls” file to the PC. (this is the MACS output file that we generated yesterday) Open it into Excel Making .bed file w/ +/-200 bp around peak summit (where we expect TFBS enrichment will be greatest) chr chr7 chr10 chr2 chr12 chrX start 74606586 94601968 1.67E+08 34760179 48371756 end length summit tags -10*LOG10(pvalue)fold_enrichment FDR(%) 74607824 1239 571 181 3132.99 34.87 0 94603119 1152 541 174 3135.11 34.76 0 1.67E+08 809 377 18 100.44 4.7 0.06 34761206 1028 496 22 101.03 4.17 0.06 48372420 665 437 18 100.29 4.12 0.06 =same row, chr column =start col+summit+200 =start col+summit-200 •Copy these 3 columns (without any header row). •In WinSCP click on any file on the PC, then on files->new->file & provide a name (“LiE_chr19_400bp.bed”) to edit a new simple text file. •Paste, save & close. Making a file of control .bed regions peak ctrs. chr chr7 chr10 chr2 chr12 chrX start 74606586 94601968 1.67E+08 34760179 48371756 control regions start end chr -10*LOG10(pvalue) start end chrfold_enrichment end length summit tags FDR(%) 74607824 1239 571 181 3132.99 34.87 0 94603119 1152 541 174 3135.11 34.76 0 1.67E+08 809 377 18 100.44 4.7 0.06 34761206 1028 496 22 101.03 4.17 0.06 48372420 665 437 18 100.29 4.12 0.06 … =peaks:chr =peaks:start-5000 =peaks:end-5000 •5000 bp is far enough away to not be part of an enhancer composed of the ER binding site... but is close enough to likely be in the same general chromatin territory (e.g. accessible euchromatin vs. inaccessible heterochromatin) •Copy these columns & make a “CTRL_chr19_400bp.bed” file with WinSCP CentDist A TFBS enrichment program designed for ChIP-seq data Assumes that TFBS-matrix hits will be most highly enriched at the centers of ChIP-seq peaks. Adds PWM score to “peakiness” score (e.g. how much more enriched the TF site is in the center of the peak) final p. val. Good enrichment good shape (best p.) Good enrichment OK shape Good enrichment poor shape (higher p.val.) Go to: http://biogpu.ddns.comp.nus.edu.sg/~chipseq/webseqto ols2/TASKS/Motif_Enrichment/submit.php?email=guest …or (more simply) just google centdist and click on the first link (should end in /centdist/) Run CentDist Give centdist a name for your run Upload your +/-200 bp .bed file (CentDist does not need a separate background file, instead using flanking sequences at a case-specific optimized distance as background) Check “Jaspar”, “vertebrate” & set max-co-motif distance to 3000 Then click Submit Job On the new window that opens click “turn on autorefresh” so you will be notified when the job ends Jaspar vs. Transfac Jaspar is a freely-available set of TFBS matrices that can be downloaded from jaspar.genereg.net Transfac is a commercial product that you need to pay for the latest release of. A version of Transfac (from ~2006) is available on the cluster (e.g. /cluster/home/g/s/gschni01/vertebrates.mat) Which to use? Both, ideally, but beware that programs like CentDist will give you results from Transfac matrices – and you won’t be able to find out details of what they are. CentDist Results View by factors, put in max number & hit go. •P. Values (based on Score compose of Z0 (enrichment) & Z1 (peakiness) •Distribution graph •Weblogo representation of Jaspar matrix Shows information content at each position. A,G,C&T 25% each-> 0 bits, only 1 base 100%->2 bits. Bases most highly over-represented relative to chance are largest. How many enriched TF sites are there really? Matrix hit enrichment for many factors, are all of them real? V$jaspar_HNF4A V$jaspar_NR2F1 V$jaspar_ESR1 Maybe not, notice how similar top sites are to each other and to estrogen response elements (EREs) such as V$jaspar_ESR1 Downloading CentDist Results Click on download as text & save the file somewhere you remember. Open it into excel. Basic summary statistics & a few other things. Many questions unanswered: -What is the fold enrichment over background? -What are the peaks with the greatest enrichment for each factor? -What are the TFBS hit locations in each peak? -Which are the real enriched TFBSes & which are just showing up by homology? -Do certain factors group together into the same same peaks? DAY 4 OUTLINE • Position weight matrices to find transcription factor binding sites (TFBSes) • TFBS enrichment in peaks using CentDist • TFBS enrichment using Storm in UNIX • Mining Storm results • Disambiguating similar matrices w/ STAMP Storm Storm is a straightforward PWM scanning program that runs in UNIX. Its greatest advantage is that it gives you all of the unprocessed output data, which allows you to do much more powerful analyses! It also allows us to specify thresholds for matches to the matrix – allowing us to use functional depth as well as p. value Getting DNA for Storm To run storm, we first need to get the actual DNA sequence for centers of our peaks (where we expect the greatest enrichment for TFBSes to be). Go to the UCSC genome browser at: genome.ucsc.edu Under genome choose mouse mm9 Then choose add custom track & upload your +/-200 bp .bed file. .fa denotes a simple Click on Tools->Table Browser ‘fasta’ format sequence Select your new track file. Select output format “sequence” Provide a file name “LiE_chr19_400bp.fa” & hit “get output” Hit ‘get output’ again on the next page Now do the same for your “CTRL_chr19_400bp.bed” file. Cleaning up our .fa files Use WinSCP to move these .fa files and their corresponding .bed files to your …/chip directory. Each entry in the .fa file has a header with special characters in it that confuse storm. All of the commands below are in the file /cluster/tufts/cbi*/Ch*/Sam*/Final*/workflow2.txt… cat this to your screen, to copy & paste commands. To fix this, go to your …/chip directory in Putty & do: perl /cluster/home/g/s/gschni01/perl*/Lax_convert.pl LiE_chr19_400bp.fa > LiE_chr19_400bp_converted.fa To see what has changed use: head *.fa Do the same for your “CTRL_chr19_400bp.fa” file. Running storm First set some path variables: export CREAD=/cluster/home/g/s/gschni01/cread-0.84 export PATH=$PATH:$CREAD/bin Then run storm for your IP .fa file: bsub -oo LiE_chr19_400bp_p.storminfo storm -p -t 0.0005 -s LiE_chr19_400bp_converted.fa -o LiE_chr19_400bp_p.storm /cluster/home/g/s/gschni01/Jaspar_non_redundant_vertebrate.mat And for your control .fa file: bsub -oo CTRL_chr19_400bp_p.storminfo storm -p -t 0.0005 -s CTRL_chr19_400bp_converted.fa -o CTRL_chr19_400bp_p.storm /cluster/home/g/s/gschni01/Jaspar_non_redundant_vertebrate.mat Use more to look at one of your .storm output files (space for next page ctrl c to exit) DAY 4 OUTLINE • Position weight matrices to find transcription factor binding sites (TFBSes) • TFBS enrichment in peaks using CentDist • TFBS enrichment using Storm in UNIX • Mining Storm results • Disambiguating similar matrices w/ STAMP Interpreting Storm data Run the dme_parse perl program to gather and tabulate your storm data: bsub -oo LiE_chr19_400bp_p.dmeparseinfo perl /cluster/home/g/s/gschni01/perl*/dme_parse5.4.pl LiE_chr19_400bp_p.storm LiE_chr19_400bp.bed peaks bsub -oo CTRL_chr19_400bp_p.dmeparseinfo perl /cluster/home/g/s/gschni01/perl*/dme_parse5.4.pl CTRL_chr19_400bp_p.storm CTRL_chr19_400bp.bed peaks dme_parse outputs …storm.bed file: Has USCS browser tracks for each TFBS matrix with locations of all hits in bed format. …storm.map file: Lists all input matrices followed by the PFM derived from all of the hits to this matrix from our data. …storm.info file: Summarizes a lot of information about matrix hits Move the .info files to your PC with WinSCP & open them into Excel. File provides summary statistics for # of peaks with 0,1,2,etc. hits, total hits, and normalized hits per 50 bp vs distance from peak center. average matches/kb -225 4 3.5 3 2.5 2 1.5 1 0.5 0 -175 -125 -75 -25 25 75 125 175 BP from pe ak apex AoE_ER_avg BKG4_AoE_ER_avg LiE_ER_avg BKG4_LiE_ER_avg 2 225 1.8 1.6 1.4 1.2 1 -225 average matches/kb Using the .info file to plot relative density of TFBS hits in aorta IP, liver IP & offset controls: average matches/kb dme_parse outputs 0.8 -75 -25 25 75 125 175 225 BP from pe ak apex AoE_M YC_avg BKG4_AoE_M YC_avg LiE_M YC_avg BKG4_LiE_M YC_avg 2.7 -175 -125 2.5 2.3 2.1 1.9 1.7 -225 -175 -125 1.5 -75 -25 25 75 125 BP from pe ak apex AoE_EBF_avg BKG4_AoE_EBF_avg LiE_EBF_avg BKG4_LiE_EBF_avg 175 225 dme_parse outputs Using the .info files to structure binomial tests Hits= # of matches to each matrix in IP data Tests=# of times storm tested for a match =(# of peaks) * (400 bp length of peaks - matrix length) Background freq= matches to offset conrol peak data/# tests (same as for IP) Using the .info files to determine fractional enrichment Hit frequency in IP data/Hit frequency in offset control dme_parse outputs .freqs file: Number of hits to each matrix for each peak Distribution of hits per peak in offset background establishes # of hits to be p.<=.05 enriched over backgound • Allows identification of sites at which a given TFBS may be functionally targeted (candidates for further testing) • Can also look for significant overlaps between the peaks with enrichment for 2 different factors - to identify cooperative versus antagonistic interactions. Details on how to do these analyses are in ChIPseq_analysis_methods_2013_02_11 on the cbi website. DAY 4 OUTLINE • Position weight matrices to find transcription factor binding sites (TFBSes) • TFBS enrichment in peaks using CentDist • TFBS enrichment using Storm in UNIX • Mining Storm results • Disambiguating similar matrices w/ STAMP STAMP Go to www.benoslab.pitt.edu/stamp/index.php STAMP lets you compare matrices for evolutionary similarities to each other. Go to your CentDist output. Create a new column in which you change the names of the factors to fit with the names in the Jaspar_non_redundant_vertebrate.mat file you used for Storm. =substitute(b2,“V$jaspar_”,”Jaspar$”), & propogate down Select all matrix names w/ p.<.05 & paste them into a new file called “select_mats.txt” in your /chip folder on the cluster using WinSCP. Getting STAMP to help classify our CentDist top hits perl /cluster/home/g/s/gschni01/perl*/MatrixSelect.pl /cluster/home/g/s/gschni01/Jaspar_non_redundant_vertebrate. mat select_mats.txt select_mats.mat Now, open the select_mats.mat file with WinSCP, copy everything & paste it into STAMP. Keep all the STAMP defaults & hit submit. STAMP Tree This indicates that enrichment of PPARG, RORA, NR4A2 could be just because of their similarity to EREs. Other enriche sites, such as SP1, FoxA2 & Myf fall in separate homology classes. To further distinguish which one is real, you can use the enrichment ratios & p. values (the “real” TFBS should be best in both of these.