ChIP-seq Data Analysis

ChIP-seq overview
DNA + bound protein -> Fragment DNA -> Immunoprecipitate -> Release DNA -> Prepare sequencing library -> Sequence -> Map sequence tags to genome & identify peaks
Adapted from slide set by: Stuart M. Brown, Ph.D., Center for Health Informatics & Bioinformatics, NYU School of Medicine

ChIP-seq big picture
• Combine high-throughput sequencing with chromatin immunoprecipitation (ChIP) to identify specific protein-DNA interactions genome-wide, including those of:
  - Transcription factors
  - Histones (various types and modifications)
  - RNA polymerase (survey of transcription)
  - DNA polymerase (investigate DNA replication)
  - DNA repair enzymes
• … or fragments of DNA that are modified (e.g. methylated)

ChIP-seq Workflow
Confirm ChIP -> Prepare library -> Submit for sequencing -> Get raw sequence data & do QC -> Map sequence reads to genome -> Identify ChIP peaks over input background -> Downstream bioinformatic analyses

Initial quality control measures
FastQC, as for the RNA-seq data.

A note on using duplication levels to estimate your library size (complexity)
Assuming you have 100 initial fragments in your library (before amplification) & which fragment gets read is random:

# reads:              25     50     75     100    150    200
# unique reads:       23     37     52     63     78     87
% duplicated:         8%     26%    31%    43%    55%    69%
x-more left in lib:   4.3    2.7    1.9    1.6    1.3    1.15
x-more than prev:     -      1.6    1.4    1.2    1.24   1.11

Given ~8% duplicates (the 25-read column), an additional sequencing run of the same size (from the same library) will give you 1.6x more unique reads. Two additional runs will give you 2.2x more (1.6*1.4).
…but if you have a high % duplicates (e.g. 43%), adding one more lane will only give you 1.37x more unique reads than you had initially.
Depending on sequencing depth, a high duplication level could indicate that your library has low complexity, either because too few fragments from your ChIP survived to the library amplification step, or because the protein binds few sites.
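The duplication table above can be approximated with a simple sampling model. The sketch below is an illustration only: it assumes every fragment in the library is equally likely to be read (as the slide does), and the function names are made up for this example. Under that assumption the expected number of unique reads after r reads from N fragments is N * (1 - (1 - 1/N)^r), which gives about 63 unique reads for 100 reads from 100 fragments, in line with the table.

```python
import random


def expected_unique(n_fragments: int, n_reads: int) -> float:
    """Expected number of distinct fragments after n_reads random draws
    (with replacement) from n_fragments equally likely fragments."""
    return n_fragments * (1.0 - (1.0 - 1.0 / n_fragments) ** n_reads)


def simulate_unique(n_fragments: int, n_reads: int, seed: int = 0) -> int:
    """One simulated sequencing run: count distinct fragments among the reads."""
    rng = random.Random(seed)
    return len({rng.randrange(n_fragments) for _ in range(n_reads)})


if __name__ == "__main__":
    library = 100  # initial fragments, as in the slide's example
    for reads in (25, 50, 75, 100, 150, 200):
        uniq = expected_unique(library, reads)
        dup = 1.0 - uniq / reads  # fraction of reads that are duplicates
        gain = expected_unique(library, 2 * reads) / uniq  # effect of doubling depth
        print(f"{reads:>4} reads: ~{uniq:5.1f} unique, {dup:4.0%} duplicated, "
              f"doubling depth gives {gain:.2f}x unique")
```

Real libraries are not sampled uniformly (PCR bias, true biological duplicates at strong peaks), so treat the output as a rough guide rather than a complexity estimate.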
Alternatively, to assess library complexity: sub-sample your data and check whether peak calling saturates.

Read mapping - Bowtie
Algorithm (Burrows-Wheeler Transform): provides an identifier for any sequence, allowing fast lookup of all its genomic positions in an indexed genome file (.ebwt file). This avoids having to search the genome for matches each time (as BLAST would do).

How do peak-finders map binding sites?
• Fragments contain the TF binding site at a (mostly) random position within them.
• Reads are (randomly) from the left or right edges (sense or antisense) of fragments.
• Thus the peak for sense tags will be ½ the fragment length upstream of the binding site, and the peak for antisense tags ½ the fragment length downstream.
• Binding site position = mid-way between sense tag peak & antisense tag peak.
• To get the binding site peak, shift sense tags downstream by ½ fragsize & antisense tags upstream by ½ fragsize.
Adapted from slide set by: Stuart M. Brown, Ph.D., Center for Health Informatics & Bioinformatics, NYU School of Medicine & from Jothi, et al. Genome-wide identification of in vivo protein–DNA binding sites from ChIP-Seq data. NAR (2008), 36: 5221-31

MACS procedure
IGV visualization of MACS results using the FoxA1 data set.
IGV visualization of MACS results using the University of Washington H3K4me3 data set.
IGV visualization of MACS results using the Broad Institute H3K36me3 data set.

MACS' "shiftsize" model
You can find info on the estimated parameters in your .macsinfo file…

INFO @ Sun, 10 Feb 2013 21:27:51: #2 Build Peak Model...
INFO @ Sun, 10 Feb 2013 21:27:51: #2 number of paired peaks: 0
WARNING @ Sun, 10 Feb 2013 21:27:51: Too few paired peaks (0) so I can not build the model! Broader your MFOLD range parameter may erase this error. If it still can't build the model, please use --nomodel and --shiftsize 100 instead.
WARNING @ Sun, 10 Feb 2013 21:27:51: Process for pairing-model is terminated!
WARNING @ Sun, 10 Feb 2013 21:27:51: #2 Skipped...
WARNING @ Sun, 10 Feb 2013 21:27:51: #2 Use 100 as shiftsize, 200 as fragment length

Here MACS tried to estimate the "shift size" for moving sense & antisense reads to get a final peak position, by identifying sets of strong + & - strand peaks at a certain distance from each other. There was not enough info on chromosome 19 to do this, so it made a guess that the fragment size was 200 & the shiftsize was 100. 200 is close enough to the actual fragment size of ~150 bp that we can go with this.

MACS model file
This is a representative result from running MACS:

#2 Build Peak Model...
#2 number of paired peaks: 683
Fewer paired peaks (683) than 1000! Model may not be build well! Lower your MFOLD parameter may erase this warning. Now I will use 683 pairs to build model!
finished!
predicted fragment length is 125 bps
Generate R script for model : LiE_IP_v_INPUT_11_2012_dup1_model.r
Call peaks...
use control data to filter peak candidates...
Finally, 9504 peaks are called!
find negative peaks by swapping treat and control
Finally, 337 peaks are called!

To generate the model plot, go into R and enter: source("MACS_output_file.r"), which will generate a .pdf.

Peaks & negative peaks
Keep scrolling down your .macsinfo file until you find…

…
INFO @ Sun, 10 Feb 2013 21:36:47: #3 Finally, 364 peaks are called!
INFO @ Sun, 10 Feb 2013 21:36:47: #3 find negative peaks by swapping treat and control
INFO @ Sun, 10 Feb 2013 21:36:52: #3 Finally, 364 peaks are called!
INFO @ Sun, 10 Feb 2013 21:36:52: #4 Write output…

This is the pay-off, where MACS identifies your ER alpha peak locations! 364 peaks on chromosome 19 (which is ~1/50th of the genome) suggests ~20,000 peaks for the whole genome, which is not bad!
Equally critical, MACS now swaps treat & control (pretending your INPUT data is your IP & your ChIP data is your input) and looks again for peaks. The number of "negative" peaks found in this way should be far less than the number of positive peaks, and the 10:1 ratio here is fine.
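The shift-size idea described above (sense and antisense tag peaks sit about one fragment length apart, and each is moved half a fragment length toward the binding site) can be sketched in a few lines. This is a toy illustration, not MACS's actual model-building procedure: the tag positions are hypothetical, a single binding site is assumed, and the function names are invented for the example.

```python
from collections import Counter
from statistics import mean

# Hypothetical toy tags (5' read position, strand) around one binding site
# at position 1000, assuming ~200 bp fragments.
TOY_TAGS = [(890, "+"), (900, "+"), (900, "+"), (910, "+"),
            (1090, "-"), (1100, "-"), (1100, "-"), (1110, "-")]


def estimate_shiftsize(tags):
    """Half the distance between the mean '+' and mean '-' tag positions.
    Only meaningful when all tags come from a single binding site."""
    plus = [pos for pos, strand in tags if strand == "+"]
    minus = [pos for pos, strand in tags if strand == "-"]
    return round((mean(minus) - mean(plus)) / 2)


def shift_tags(tags, shiftsize):
    """Move '+' tags downstream and '-' tags upstream by shiftsize."""
    return [pos + shiftsize if strand == "+" else pos - shiftsize
            for pos, strand in tags]


def pileup_summit(positions):
    """Return (position, tag count) of the most frequent shifted position."""
    return Counter(positions).most_common(1)[0]


if __name__ == "__main__":
    shift = estimate_shiftsize(TOY_TAGS)
    summit, depth = pileup_summit(shift_tags(TOY_TAGS, shift))
    print(f"estimated shiftsize: {shift} bp (fragment length ~{2 * shift} bp)")
    print(f"summit after shifting: position {summit}, {depth} tags")
```

With the toy tags, both strands pile up on the same position after shifting, which is the behaviour the "shiftsize" parameter is meant to produce.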
Troubleshooting MACS
Briefly…
MACS can't build a model:
- Adjust the mfold values (the fold-over-background range MACS considers for paired peaks).
- Tell MACS not to build a model, but instead to use the shiftsize you specify.
Peaks/negative peaks ratio is poor or too few peaks are detected:
- Adjust model settings to see if you can improve both. Otherwise, you may have to conclude that 1) your library was no good or 2) the factor just doesn't bind to many places in the genome.

Troubleshooting MACS…
Be on the lookout for MACS building a model from short-separation noise peaks (that may arise from sonication-sensitive breakpoints or other things unrelated to your protein binding). To avoid this, you can decrease the maximum "mfold" so that these strong irrelevant peaks are ignored when the model is built.

Downstream analysis
• Find the nearest gene to each peak
• Check distribution relative to gene features (start site, exon, intron, upstream/downstream)
• Find overrepresented motifs in peak regions (binding sites of our factor + possible co-binders): kmers/logos
• Check if peaks are clustered or co-occur with other binding events
• Sequence conservation (or conservation of binding event, if data is available)
• Gene set functional analysis
• …

Overlaps in Galaxy
Operate On Genomic Intervals -> Intersect
This lets you create a new .bed file which has only the regions that intersect between two datasets.
Overlapping Intervals: saves complete intervals from file 1 that overlap anything in file 2.
Overlapping Pieces of Intervals: saves only the regions shared between 1 & 2.

Additional notes on overlaps
Comparing peaks from multiple samples (Bardet et al. 2011)
During genome-wide peak calling, only the best peaks pass the stringent thresholds required for low false discovery rates (FDRs) due to the correction for multiple testing (orange lines). Regions may show substantial tag enrichment, yet are not called as peaks (green line).
When comparing peaks across conditions, we advocate using 'significant enrichment' (not multiple testing–corrected) as the measure to assess whether a peak is shared across conditions or is truly condition-specific. Merely intersecting peaks called at each condition would miss conserved peaks (e.g., middle examples).

How can we tell whether overlaps are significantly greater than chance? Assigning a p-value?
We have 3 pieces of count frequency information: the number of overlaps, the number of regions compared, and the expected background frequency, which we can generate through shuffling.
This type of data is like coin tosses & is ideally suited for a binomial test, which uses "number of matches", "number of tests" and "expected background frequency" to calculate p-values.
If you flip a coin, say, 10 times and it comes up heads 6 out of 10 (frequency 0.6 vs. expected 0.5), that would not seem unlikely, and a binomial test would tell you this (p ≈ 0.75).
However, if you flip a coin 1000 times & get heads 600 out of 1000, that would seem a bit odd, and the binomial test would indicate this by saying that such a result is very unlikely under the null hypothesis that the frequency of heads is 0.5 (p = 2e-10).

Resampling statistic or binomial tests for overlaps
Let's say we have peak regions from two samples (7000 and 8260), and the number of overlaps is 1653.
We can estimate the background chance by randomly placing the regions in the first dataset in new locations, and then counting the number of overlaps. We would repeat this procedure 100-1000 times.
Then, we can ask: how many times, out of the 1000 repeated procedures, was the number of overlaps greater than that observed for the real dataset? This is our p-value (3/1000 -> p=0.003; 0 -> p<0.001).
Alternatively, we can take the mean of the number of overlaps in the shuffled sets, and use this in a binomial test.
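Both ways of assigning a p-value described above can be sketched in plain Python. This is a minimal illustration under stated assumptions: the function names are invented here, the shuffled overlap counts are assumed to have been produced elsewhere, ties with the observed count are included in the empirical p-value (a slightly more conservative choice than "greater than"), and the two-sided binomial test follows the common "sum all outcomes no more likely than the observed one" definition, which is also what R's binom.test uses.

```python
import math


def empirical_p(observed, shuffled_counts):
    """Fraction of shuffled datasets with at least as many overlaps as observed.
    A result of 0.0 should be reported as p < 1/len(shuffled_counts)."""
    hits = sum(1 for count in shuffled_counts if count >= observed)
    return hits / len(shuffled_counts)


def log_binom_pmf(k, n, p):
    """Log of the binomial probability of k successes in n trials (0 < p < 1)."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log1p(-p)


def binom_test_two_sided(k, n, p):
    """Exact two-sided binomial test: total probability of every outcome
    that is no more likely than the observed one."""
    cutoff = log_binom_pmf(k, n, p) + 1e-7  # small tolerance for ties
    total = 0.0
    for i in range(n + 1):
        log_prob = log_binom_pmf(i, n, p)
        if log_prob <= cutoff:
            total += math.exp(log_prob)
    return min(1.0, total)


if __name__ == "__main__":
    # Coin-toss examples from the text.
    print(f"6 heads in 10 flips:     p = {binom_test_two_sided(6, 10, 0.5):.2g}")
    print(f"600 heads in 1000 flips: p = {binom_test_two_sided(600, 1000, 0.5):.2g}")
    # Hypothetical shuffling result: 3 of 1000 shuffles reach the observed count.
    shuffled = [12, 11, 13] + [1] * 997
    print(f"empirical p = {empirical_p(10, shuffled)}")
```

For very extreme results the binomial p-value underflows to 0.0 in floating point; report it as "below machine precision" rather than as exactly zero.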
For example, let's say the mean number of overlaps in the shuffled sets was 95.11.
Run a binomial test in R by typing:
binom.test(1653, 8260, 95.11/8260)
The p-value is <2e-16. Very low.
So, yes, the binding sites in the two samples overlapped more than expected by chance… but about 80% of the binding events are still in different places in these two samples.

Data in the UCSC browser

ChIP-seq for histone modifications
Method:
• Mouse ES cells vs ES-derived primary neural progenitor cells (NPCs).
• Prepare chromatin & ChIP with antibodies to specific histone modifications.
• H3K4 methylation marks active genes, H3K27 methylation marks repressed genes, and both marks together in ES cells mark "poised" genes that will become activated in certain developmental lineages.
Meissner et al. 2008, Nature 454:766.

The ENCODE Project
Dozens of labs did ChIP-seq, under rigorous quality guidelines, for over 100 transcription factors and histone modifications, plus related assays for DNA methylation, chromatin accessibility etc.
Major paper (many others provide additional details):
ENCODE Project Consortium (over 100 authors). An integrated encyclopedia of DNA elements in the human genome. Nature. 2012 Sep 6;489(7414):57-74. doi: 10.1038/nature11247.
Some ways to access this data:
• Nature.com/encode (Nature's summary & links to all related papers)
• factorbook.org (a way to explore the data in a wiki format)
• UCSC genome browser (hands-on examples next)
• SRA (Sequence Read Archive, repository for raw data; more on this later!)

Sample of ENCODE Data
Example of ChIP-seq data tracks on the UCSC browser
• Peak calls: regions of significant enrichment over background.
• Processed read density data (the number of reads overlapping a given bp position), used to make the peak calls.
• H3K4me3 & H3K27ac are marks of active promoters (e.g. MTRF1L).
• H3K27me2 is a mark of repressed promoters.
• Genes in ESCs that are required for differentiation are often "poised" and bear the K4me3 & K27me3 "bivalent mark" (e.g. SYNE1).
• Differentiation resolves the bivalent mark to all activating marks (Osteoblasts) or all repressive marks (HepG2).

Downloading data from UCSC browser
Try zooming in (you can go all the way to base pair resolution).
Want to learn more about a gene? Control-click on its ideogram & select "open details page in new window".
What if you want to use this data somewhere else?
• Select Tools -> Table Browser
• Select Group: Regulation, Track: Broad Histone, Table: H1-hESC H3K4me3 … Pk (for the peaks data; the signal file will be huge)
• In output format, select "all fields from selected table"
  -- Note that you could have selected "sequence"… if you had, you'd get the actual DNA sequence for each one of these peaks. We'll use this later.
• Check "Galaxy" next to "send output to:"
  -- Note that you could have selected send to file; we'll use this later as well.
• Click "Get output" & then click "send query to galaxy"

Introduction to Galaxy Tools
Galaxy is a web platform providing a lot of basic tools for manipulating genomics data.
On the left are input & analysis options. On the right is your history of uploaded files & analyses.
You'll have one item in process, which will finish soon & turn green.
Click on the title to get a sample of what the data looks like.
Click on the eye to see the data in the central panel. Each entry has a chromosome #, bp for start & bp for end & some other values (signal value = enrichment over background; p.value = -log10 of the p-value, so for p=.0001 this would be 4).
Click on the pencil to look at and edit the name & other attributes of any item.
We'll look more at Galaxy tools later…

What about data that's not on the UCSC Browser?
The ENCODE project was UNUSUALLY considerate when compared to most other researchers who generate genomics data. Even though ENCODE is huge, it's probably <10% of published NGS data.
To publish, researchers must make their data accessible, but they will very rarely provide a link to a UCSC browser track.
If you're lucky, they will have put processed data up somewhere: generally on GEO…

The GENE EXPRESSION OMNIBUS (GEO)
Key repository for microarray and genomics data.
Open a new browser tab (ctrl-T) & go to: http://www.ncbi.nlm.nih.gov/geo/
Search for "encode h3k4me3 h1-hesc". You'll see several entries. The first few are larger datasets that include this specific data. The one at the bottom is just the data for this track.
First note the "accession number GSM733657": publications will often give this accession number, providing an easy way to get directly to the right place.
---> Now, click on the title for this entry, "Bernstein…"
Scroll down the next page: there's lots of info about this experiment, with links for more information.
At the bottom are the processed data files. They've been nice & offer us a "BROADPEAK" file (the same as what we just uploaded to Galaxy), a "BAM" file for each experimental replicate (which has the genomic coordinates for each read), and a "BIGWIG" file (the filetype for the "signal" track on the UCSC browser).

What if the data I'm interested in isn't in GEO?
Authors are almost always required to make their NGS data accessible in order to publish…
…but they're often not required to make it easy!
Many times the only thing that's available is the raw data, stored in…

The Sequence Read Archive (SRA) - Repository for raw NGS data
Open a new browser tab (ctrl-T) & go to: http://www.ncbi.nlm.nih.gov/sra/
Search again for "encode h3k4me3 h1-hesc". A single record will be called up…
Clicking on link 1 or 2 under "run" will give you information about that particular biological replicate sample.
Clicking on link 1 or 2 under "size" will take you to a page with a single linked file with a ".sra" extension.
Note how big this file is… just a few of these would rapidly fill up a PC hard drive!
So what is a .sra file & what can you do with it?
Don't even try to download & open it… besides being huge, it's not even normal text.
In the next few lectures we'll find out how to handle this sort of raw data.

How many reads do I need?
Minimum for ChIP-seq of a transcription factor with < ~30,000 binding sites in a mammalian genome:
• 2 replicates per condition
• 20+ million reads per sample (>40M per condition; proportionately less for smaller genomes & fewer binding peaks)
• One HiSeq lane gives ~150 million reads…
  …& can multiplex ~4 samples (2 exp + 2 input / lane)
• Single-end 50 bp reads are almost always good enough (unlike RNA-seq, where longer and/or paired-end reads are required for many downstream questions).
Some applications need many more reads (e.g. mapping nucleosome positions needs >400 M).
Make your best estimate. If you have too few reads, you can re-sequence the same samples or add additional samples. Reads from all runs can be pooled in the end.
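The lane arithmetic above can be checked with a short calculation. This sketch assumes reads are split evenly between multiplexed samples, which real runs rarely achieve; the function names are made up for the example, and the numbers are the ones quoted on the slide (~150 million reads per lane, 20 million reads minimum per sample).

```python
def reads_per_sample(lane_reads: float, samples_per_lane: int) -> float:
    """Reads each sample gets if the lane is shared evenly (an idealisation)."""
    return lane_reads / samples_per_lane


def max_samples_per_lane(lane_reads: float, min_reads_per_sample: float) -> int:
    """Largest number of samples that still meets the per-sample minimum."""
    return int(lane_reads // min_reads_per_sample)


if __name__ == "__main__":
    lane = 150e6     # ~150 million reads per HiSeq lane (from the slide)
    minimum = 20e6   # 20+ million reads per sample (from the slide)
    per_sample = reads_per_sample(lane, 4)
    print(f"4 samples per lane: {per_sample / 1e6:.1f} M reads each")
    print(f"upper limit at exactly {minimum / 1e6:.0f} M each: "
          f"{max_samples_per_lane(lane, minimum)} samples")
```

Multiplexing 4 samples leaves each well above the 20 million minimum, which gives headroom for uneven pooling and for reads lost to duplicates or failed mapping.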