Analysis of a ChIP-chip experiment

Pipeline:
• Raw data processing to obtain an enrichment profile
• Visualization of the data
• Finding enriched regions and/or peaks
• Feature analysis:
– Calculating average profiles for different regions of interest (genes, intergenic regions, exons, etc.)
– Analysis of the profiles for different genome regions and groups of genes (heterochromatin vs. euchromatin, silent vs. expressed genes, etc.)

The ChIP-chip data shown below were obtained as part of the modENCODE project (PIs G. Karpen, S. Elgin, V. Pirrotta, M. Kuroda, P. Park). The data analysis pipeline was developed and is maintained by Peter Kharchenko, the Park Lab.

Raw data processing
• Fluorescence intensity for each probe for ChIP and control (if a control is available)
• Probe position on the array and in the genome (provided by the array manufacturer)
• Correction for probe copy number and sequence bias
• Calculation of a smoothed log2 intensity ratio profile or P-value profile
Data look great!
(based on a slide by Peter Park)

Software packages for tiling array analysis
• Bioconductor packages for R:
– oligo, Ringo, ACME… (http://www.bioconductor.org)
• MAT, MA2C (the Liu Lab at Dana-Farber Cancer Institute) http://liulab.dfci.harvard.edu/
• TAS (Affymetrix) http://www.affymetrix.com/partners_programs/programs/developer/tools/affytools.affx

Visualization

GBrowse
Perl-based application; widely used for genome browsing and visualization of various features and annotations; a very popular browser in the Generic Model Organism Database project

UCSC Genome Browser
Integrative web-based application supported at the UCSC Genome Bioinformatics site. The human ENCODE data are available through this browser.
IGB (Integrated Genome Browser)
An application available through the Affymetrix site for genome visualization and exploration of data annotations from various sources

IGB installation
• Go to http://www.affymetrix.com/partners_programs/programs/developer/tools/download_igb.affx and press “Launch IGB (768 MB)”, or go to https://compbio.med.harvard.edu/wiki/x/g4RG and follow the instructions on IGB installation
• NOTE: If Java is not installed on your system, follow the “Free Java Download” link on the IGB page, or go to http://www.java.com and press “Free Java Download”
• Download the enrichment profiles to the browser from https://compbio.med.harvard.edu/wiki/x/g4RG
• The IGB controls allow changing the chromosome in view, data scale, graph color, etc.

Questions to explore
• Which modifications correspond to activation marks and which to repression marks?
• Compare expression and modification profiles in two cell lines. Do changes in expression correlate with changes in modification levels?
– Examples to examine: H3K4me3/H3K36me3/mRNA profiles for S2 and BG3 cells at
  chr2L: 21,808,548 - 21,857,626
  chr3L: 22,243,063 - 22,460,589
  chrX: 2,821,457 - 2,936,998
  chrX: 2,965,624 - 3,103,080
  chr3R: 11,356,133 - 11,384,217

Finding enriched regions/peaks
• Using Bayesian networks / Hidden Markov Models
  e.g. BAC package (Bioconductor, http://www.bioconductor.org)
• Thresholding
  e.g. TAS (Affymetrix, http://www.affymetrix.com), Ringo (Bioconductor, http://www.bioconductor.org)

Thresholding
How many clusters?
• Threshold – minimal allowed enrichment value
• Min run – minimal width of an enriched region to be identified as a ‘cluster’
• Max gap – maximal allowed gap inside a cluster
These parameters are manually adjustable in IGB!
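The three thresholding parameters can be illustrated with a minimal Python sketch. It is not the IGB, TAS, or Ringo implementation; the function name call_clusters and the exact interpretation of min run and max gap (both measured in base pairs between probe positions) are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def call_clusters(
    positions: Sequence[int],
    values: Sequence[float],
    threshold: float,
    min_run: int,
    max_gap: int,
) -> List[Tuple[int, int]]:
    """Return (start, end) coordinates of enriched clusters.

    Assumed semantics (hypothetical, for illustration only):
    - a probe is enriched if its value is >= threshold;
    - consecutive enriched probes are merged into one cluster when the
      distance between them is <= max_gap;
    - a cluster is reported only if its width (end - start) is >= min_run.
    positions must be sorted in increasing order.
    """
    clusters: List[Tuple[int, int]] = []
    start = end = None
    for pos, val in zip(positions, values):
        if val < threshold:
            continue  # probe is below the enrichment threshold
        if start is not None and pos - end <= max_gap:
            end = pos  # extend the current cluster across the gap
        else:
            # close the previous cluster, if it is wide enough
            if start is not None and end - start >= min_run:
                clusters.append((start, end))
            start = end = pos  # open a new cluster
    if start is not None and end - start >= min_run:
        clusters.append((start, end))
    return clusters


probe_positions = [0, 100, 200, 300, 400, 500, 600, 700, 800, 900]
log2_ratios = [0.1, 1.2, 1.5, 0.2, 1.1, 0.1, 0.1, 0.1, 1.4, 0.1]
print(call_clusters(probe_positions, log2_ratios,
                    threshold=1.0, min_run=200, max_gap=250))
```

In this toy input the probes at 100, 200 and 400 merge into one cluster because each gap is within max gap, while the isolated probe at 800 is dropped because it is narrower than min run. Raising the threshold or min run yields fewer clusters; raising max gap merges neighbouring clusters.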
• Calculating the threshold
– Generate randomizations
  • More randomizations allow a higher level of statistical significance
– Find a threshold that corresponds to a given expected value (EV) of the false discovery rate (FDR); this is an optimization problem
[Figure: number of clusters vs. threshold value, with "few clusters" and "many clusters" regions marked; the chosen threshold value is the one that corresponds to the required expected value (EV)]

Feature analysis
[Figure: modENCODE data. Average gene profiles; labels: expression quintiles, meta-gene (5’ to 3’), gene size. Based on a slide by Peter Park]

Calculating the average enrichment profile for a ‘meta-gene’
Challenges:
• Genes have different sizes
• Genes are oriented differently in the genome
• Array probes are distributed non-uniformly along genes

For each gene in a set:
• Determine the set of probes that belong to the i-th gene
• Calculate a relative gene coordinate Prel for each probe in the set from its chromosomal position Pchr:
– Direct gene orientation: Prel = (Pchr − Gene.Start_i) / (Gene.End_i − Gene.Start_i)
– Reverse gene orientation: Prel = 1 − (Pchr − Gene.End_i) / (Gene.Start_i − Gene.End_i)
  (assuming Gene.Start_i denotes the 5’ end and Gene.End_i the 3’ end of the gene, both cases give Prel = 0 at the 5’ end and Prel = 1 at the 3’ end)
• Group the probes into bins that are regularly distributed along the gene to obtain a relative intensity profile for each gene (or interpolate the intensity values at regularly spaced points along the gene)
Then calculate the position-wise average of the relative intensity profiles over all genes

[Figure: modENCODE data. Average enrichment, H3K36me3 profile in S2 cells; curves for all, expressed, and silent genes; x-axis: meta-gene (5’ to 3’), position relative to TSS]
[Figure: modENCODE data. Average enrichment, H3K4me3 profile in S2 cells; curves for all, expressed, and silent genes; x-axis: meta-gene (5’ to 3’), position relative to TSS]

Specific issues of ChIP-Seq data analysis
• Alignment of the sequenced tags to a reference sequence (genome):
– Eland, BLAT, Maq, SOAP
• Correction for the biases due to DNA fragmentation and sequencing
• Accounting for specific patterns of sequenced tag distribution at protein binding sites

MNase digestion and sonication have different sequence signatures at DNA fragmentation sites, and different ‘background’ GC-content
[Figure: GC-profiles for H3K4me3-enriched human nucleosomes, MNase digestion vs. sonication. MNase-digested dataset from Barski et al, Cell, 2007; sonicated dataset from Roberson et al, Genome Res, 2008]

GC-content of Solexa tags is biased towards higher values compared with randomly selected genomic sequences
[Figure: ChIP vs. control; NRSF Solexa data, Johnson et al, Science, 2007; Kharchenko et al., 2008]

• Search for two peaks on the positive and negative strands separated by a characteristic length
• Shift the tag density profiles on the positive and negative strands by half of that characteristic length to match the peaks on both strands
(based on a slide by Peter Park)

Sequenced tag distribution around well-positioned H3K4me3 nucleosomes
[Figure: TIMM17A; Solexa sequencing data, Barski et al, Cell, 2007]

Packages for analysis of ChIP-Seq data
• SPP – the Park Lab, Harvard Med School (Kharchenko et al, Nat Biotechnol, 2008)
• QuEST – the Sidow Lab, Stanford University (Valouev et al, Nat Methods, 2008)
• SISSRs – the Zhao Lab, NHLBI, NIH (Jothi et al, NAR, 2008)
• CisGenome – the Wong Lab, Stanford University (Ji et al, Nat Biotechnol, 2008)
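
The strand-shift step described earlier (peaks on the positive and negative strands separated by a characteristic length, then shifted towards each other by half of it) can be sketched in Python. This is a toy cross-correlation on exact tag positions, not the algorithm used by SPP, QuEST, SISSRs, or CisGenome; the function names estimate_strand_shift and shift_tags, and the choice to search shifts from 0 to max_shift, are assumptions made for illustration. Real data would normally be binned or smoothed first.

```python
from collections import Counter
from typing import List, Sequence


def estimate_strand_shift(
    plus_tags: Sequence[int], minus_tags: Sequence[int], max_shift: int
) -> int:
    """Estimate the characteristic distance between strand peaks.

    For every candidate shift d in 0..max_shift, count how many
    positive-strand tags land exactly on a negative-strand tag after being
    moved d bp downstream, and return the d with the highest count
    (the smallest such d if there are ties).
    """
    minus_counts = Counter(minus_tags)  # missing positions count as 0
    best_shift, best_score = 0, -1
    for d in range(max_shift + 1):
        score = sum(minus_counts[p + d] for p in plus_tags)
        if score > best_score:
            best_shift, best_score = d, score
    return best_shift


def shift_tags(
    plus_tags: Sequence[int], minus_tags: Sequence[int], shift: int
) -> List[int]:
    """Move both strands towards each other by half of the shift."""
    half = shift // 2
    return sorted([p + half for p in plus_tags] + [m - half for m in minus_tags])


# Toy data: two binding sites, strands offset by 150 bp
plus = [100, 101, 102, 500, 501]
minus = [250, 251, 252, 650, 651]

d = estimate_strand_shift(plus, minus, max_shift=300)
merged = shift_tags(plus, minus, d)
print(d, merged)
```

After the shift, tags from both strands pile up at the same positions, so a single combined density profile peaks at the presumed binding site rather than on either side of it.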