THE EPIGENOME Network of Excellence

A Guideline for ChIP-Chip Data Quality Control and Normalization (PROT 47)

Matthias Siebert, Michael Lidschreiber, Holger Hartmann, and Johannes Soding
Gene Center Munich
Ludwig-Maximilians-Universität
Feodor-Lynen-Str. 25
81377 Munich, Germany
email feedback to: soeding@genzentrum.lmu.de

Last reviewed: 01 Dec 2009 by Tobias Straub, Adolf-Butenandt-Institut, Ludwig Maximilians University, München, Germany

Introduction

Chromatin immunoprecipitation coupled to tiling microarray analysis (ChIP-on-chip) is used to map the DNA binding sites of a protein of interest genome-wide. In ChIP-on-chip, proteins are covalently cross-linked to the DNA by formaldehyde, cells are lysed, the chromatin is immunoprecipitated with an antibody to the protein of interest, and the fragmented DNA that is directly or indirectly bound to the protein is analyzed with tiling arrays. For this purpose, the fragmented DNA is fluorescently labeled and hybridized to the tiling array, which consists of millions of short (25 to 60 nucleotides long) probes that cover the genome at a constant spacing (4 to 100s of nucleotides), like tiles covering a roof. The data generated by one experiment consist of an intensity value for each DNA probe. These values measure the relative quantity of DNA at the probe's genomic position in the immunoprecipitated material.

This guideline describes the first steps in the data analysis for ChIP-on-chip measurements in a bare-bones fashion. The steps comprise the quality control of the obtained data and normalizations to render the data comparable between different arrays, to correct for saturation effects, and to obtain enrichment and occupancy values. Peak calling and other downstream procedures are not part of this exposé. For each step, we give detailed recommendations, warn about problematic but often popular procedures, and occasionally suggest improved versions of standard procedures.
Although we have gained experience in ChIP-chip data analysis mainly in yeast, this protocol tries to give a general guideline that should be applicable to other species and to both single- and two-color arrays. Most advocated procedures can be carried out within the R environment for statistical data analysis, using packages from the Bioconductor project (see, e.g., Toedling and Huber, 2008). A Bioconductor package, "Starr", for Affymetrix platforms that supports the analysis steps described here is available and will be described in an upcoming protocol (Zacher B. and Tresch A., 2010).

Procedure

Experimental Design

Number of Replicates

It is advisable to measure at least two biological replicates of all factors or conditions, for two reasons. First and foremost, corrupted measurements will normally be easily identifiable by comparing replicates. Even if both of two measurements are corrupted, corruption tends to be erratic and will normally not result in the expected high correlation between replicates (step 2 below). Second, averaging over N replicates reduces the standard deviation of unsystematic noise (i.e. the random scattering of measured values) by a factor of √N. This means a reduction by a factor of 1.4 for two replicate measurements, 1.7 for three, and 2.0 for four replicates. The payoff is thus particularly high for the second replicate and falls off slowly for higher N.

Mock IP or Genomic Input?

In any ChIP-on-chip experiment, it is important to correct for sequence- and genomic region-specific biases in the efficiency of the various biochemical and biophysical steps of the ChIP-on-chip protocol. Crosslinking of the DNA to proteins, fragmentation of the chromatin, immunoprecipitation, PCR amplification, and hybridization to the array all have strong biases that must be corrected for. This is done by measuring a reference signal and dividing the true signal intensities obtained from the immunoprecipitated protein by the reference intensities.
In our experience, this is by far the most important step in data normalization.

A frequent point of contention is whether it is better to use the genomic input fraction or a mock immunoprecipitation (mock IP) for this normalization. (The mock IP is performed with the wild-type strain if the true signal is obtained by immunoprecipitating the protein of interest with a protein tag. It is done using an unspecific antibody if the protein of interest is purified with a specific antibody.) Our experience is that normalization with the mock IP is preferable if a good signal can be obtained from the mock IP hybridization, i.e., if the signal-to-noise ratio of the mock IP is not much inferior to that of the true IP. We have seen higher noise levels and occasional artefacts when using normalization with the genomic input. One reason might be that normalization with the genomic input does not correct for sequence- and region-dependent biases of unspecific binding to the antibody.

If the signal-to-noise ratio of the mock IP is a problem, or if the chromatin preparation is deemed hard to reproduce in different strains, then normalization with matched genomic input is advisable. An obvious drawback for single-color arrays is, however, that a matched genomic input must be hybridized for each measurement, which requires almost twice as many arrays as when all measurements are normalized with a single mock IP. We have found that genomic input measurements can be highly reproducible for different experiments and chromatin preparations. In this case it may be justified to use only a few representative input measurements for data normalization and to forego the costly measurement of a matched genomic input sample for each IP (see comment 1). A reasonable procedure is to measure both genomic input and mock IP for representative factors and to compare the signal-to-noise ratio for both normalizations.
Another possibility for two-color arrays is to normalize with both input and mock IP:

log enrichment = log [ (signal1/input1) / (mock2/input2) ] = log(signal1/input1) - log(mock2/input2)

In this case, a dye exchange should be performed (next paragraph).

Dye Exchange

When using two-color arrays, such as those from NimbleGen or Agilent, the dye-exchange procedure is strongly recommended (e.g. Do and Choi, 2006). This means that the dyes Cy3 and Cy5, with which the signal (true IP) and reference fractions are labeled, should be switched between the two (or four) biological replicates, such that the signal fraction is labeled with Cy5 in half the measurements and with Cy3 in the other half. Dye exchange corrects for dye-dependent saturation effects in a much better way than downstream data normalization could ever achieve (steps 5 and 5a below).

Data Analysis Protocol

1. Spatial Flaws in Raw Image Data

Arrays should be checked for localized spatial defects and for nonuniform mean intensity distributions. On most platforms except Affymetrix, probes are randomly arranged across the array. Therefore, local or global spatial patterns in measured intensities can be assumed to be artifacts. They will lead to increased noise, in particular if many probes are affected. Localized effects arise from scratches, bubbles, manufacturing problems etc. They can be detected by looking at the array's raw intensity image, ideally directly after the exposure of the array, as flaws can lead to underexposure and other problems that might still be corrected while the array is mounted (see protocol 43 by Tobias Straub). In case of strong defects, either the affected probes or the entire array measurement should be discarded.

On Affymetrix arrays, probes are arranged randomly with respect to genomic location, but probes with similar sequences are placed near each other.
This leads to horizontal and vertical stripes around 20-50 rows or columns wide with systematically varying intensity, and also to larger-scale variation of intensities across the array.

2. Log-Log Scatter Plots

First, we correct for different scales in the raw intensities by subtracting the median from the log intensities over each array. This is necessary since the scale of measurements can easily change due to varying amounts of chromatin applied, changing exposure times or hybridization efficiencies, and other effects.

To identify flawed array measurements, one can plot, for each probe, the logs of the rescaled intensities of pairs of biological replicates (i.e. measurements under the same conditions, wild type vs. wild type, signal vs. signal) in X-Y scatter or (better) contour plots. The data points on these scatter plots should lie scattered close to a straight line. If the scales are fairly different or if the data points lie scattered about a bent curve, one should try to find out the experimental causes, as this may lead to nonlinearities between measurements for different conditions, which may be difficult to correct. (Note that different scales and nonlinearities are common for Affymetrix arrays and do not seem to indicate experimental problems, whereas they are uncommon for NimbleGen arrays with their longer probes.)

If the log intensity points for two replicates are approximately distributed along a straight line, we can use scale normalization (Smyth and Speed, 2003; Do and Choi, 2006) to bring them onto the diagonal (step 2a). If the relationship between replicates is not linear, quantile normalization (Bolstad et al., 2003) is required to map them onto the diagonal (step 2b). This normalization makes the replicate measurements comparable and allows averaging over them (step 3).

One can also plot array measurements taken under different conditions (averaged over replicates) in a scatter plot to check if the distributions are similar.
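As a minimal sketch of the median centering and replicate comparison described in this step, the following Python snippet subtracts the array-wide median from the log intensities and computes a Pearson correlation between two replicates. The function names and the five-probe intensity values are hypothetical and chosen only for illustration; natural logarithms are assumed, and the correlation coefficient is used here merely as a numerical companion to the scatter plot, not as a replacement for inspecting it.

```python
import math
from statistics import median


def center_log_intensities(raw_intensities):
    """Return natural-log intensities with the array-wide median subtracted."""
    logs = [math.log(x) for x in raw_intensities]
    array_median = median(logs)
    return [value - array_median for value in logs]


def pearson_correlation(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical raw intensities for five probes; replicate 2 is roughly twice as bright.
replicate1 = [120.0, 340.0, 95.0, 2100.0, 560.0]
replicate2 = [260.0, 700.0, 180.0, 4000.0, 1150.0]

# After centering, the overall brightness difference between the arrays is removed.
centered1 = center_log_intensities(replicate1)
centered2 = center_log_intensities(replicate2)
replicate_correlation = pearson_correlation(centered1, centered2)
```

After centering, points plotted as (centered1, centered2) should scatter around the diagonal if the two replicates differ only in overall scale.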
Outliers in the scatter plot indicate corrupted probes in one of the arrays. Their values should be set to the geometric average of the two neighbouring probes, or, less ideally, to NA ("not available"). Many outliers indicate a corrupted array.

2a. Scale Normalization

This is recommended if a linear relationship between replicate measurements is observed in the log-log scatter plots in step 2. In scale normalization, one divides the log intensities of each replicate by its median absolute deviation (MAD) (Smyth and Speed, 2003). The MAD estimates the spread of a distribution, similar to the standard deviation, but it is preferable because it is much more robust with respect to outliers. It is simply the median of the absolute deviations of all measurements from the median. Let median{X1,...,Xm} denote the median of m measurements X1,...,Xm. Then, the absolute deviation of X1 from the median is |X1 - median{X1,...,Xm}|. The MAD of measurements X1,...,Xm is defined as

MAD = median{ |X1 - median{X1,...,Xm}|, ..., |Xm - median{X1,...,Xm}| }

To conserve the meaning of the rescaled profiles in terms of log enrichment, we recommend multiplying all profiles by the geometric mean of the MAD values of all arrays that are scale-normalized together. MAD values that differ by more than a factor of, say, 1.2 between replicates might indicate problems with the measurements. One should check experimental procedures to get rid of the nonlinearities in this case. Scale normalization should not be applied to non-replicate measurements, except in the case that the different conditions are known to have only a minor effect on the expected enrichment distribution.

2b. Quantile Normalization

This is recommended if a nonlinear relationship between replicate measurements is observed in the log-log scatter plots in step 2. Quantile normalization can be performed within the Bioconductor/R framework (see, e.g., Toedling and Huber, 2008).
It maps the cumulative log intensity (or enrichment) distributions of two or more replicates onto the average cumulative distribution. Ideally, quantile normalization should not be necessary. In case of a strongly nonlinear relationship between replicates, one should consider taking another replicate measurement instead and discarding the data of the affected array. Quantile normalization should not be performed at all between non-replicates, since even minor differences in the expected distribution (e.g. concerning only the 10% most enriched probes) can lead to gross and unexpected distortions (Knott et al., 2009).

3. Averaging Over Replicates

For each condition (including the reference measurements), we average the signal for each probe over replicates by calculating the arithmetic mean of the log intensities. This is equivalent to taking the log of the geometric average over the replicate intensities. Averaging over N replicates reduces the unsystematic noise by a factor of approximately √N.

4. Normalization with Reference

By far the most effective normalization step in our experience consists in calculating the log enrichment by subtracting, for each probe, the averaged log reference intensity from the averaged log signal intensity. This is equivalent to taking the log of the ratio of the geometric mean over signal intensities to the geometric mean over reference intensities:

log enrichment = arithmetic mean of log(signal) - arithmetic mean of log(reference)
               = log( geometric mean of signal / geometric mean of reference )

Here, "reference" refers to either the mock IP array measurement or the genomic input measurement (see "Experimental Design"). This normalization corrects very effectively for the strong sequence-dependent probe hybridization biases and for chromatin-dependent cross-linking and fragmentation biases.

5. MA-Plots

This is a classic and important quality control plot to spot and correct saturation-dependent effects in the log enrichment.
For each probe, the log enrichment M is plotted versus the average log intensity of signal and reference, A:

M = arithmetic mean of log(signal) - arithmetic mean of log(reference)
A = ½ [ arithmetic mean of log(signal) + arithmetic mean of log(reference) ]

Ideally, the measured enrichment should be independent of the mean intensity A of signal and reference. But if the signal and the reference measurements have different saturation behavior, the expectation of M will show a dependence on A. Any marked dependence should be corrected by running median normalization (step 5a). Note that MA plots are very similar to log-log scatter plots: scaling the x-axis by a factor of 2 and rotating the MA plot counterclockwise by 45° results in a scatter plot. See Fig. 4 of Tobias Straub's protocol for an example of an MA-plot.

5a. Lowess or Running Median Normalization

Locally weighted scatter plot smoothing (Lowess) is a standard method to correct for the dependence of log enrichments (M) on intensity (A) (Cleveland, 1979; Smyth and Speed, 2003; Do and Choi, 2006). It fits a polynomial locally at each position, resulting in a smooth moving average of the scatter points for a certain A-range. The smoothed mean is subtracted from all M-values. Hence, after Lowess normalization, the data points will lie scattered around a horizontal line in the MA plot. Lowess normalization reduces the systematic noise caused by variations in total intensities (e.g. through GC-content variations, chromatin states etc.) that would otherwise systematically affect the calculated log enrichment. On the other hand, Lowess normalization does not lead to increased statistical noise.

We advise against using Lowess normalization, however. It relies on the problematic assumption that the entire data set is composed of background measurements that should be scattered around a mean M value. In fact, some regions may be strongly bound by the ChIPped factor, producing a cloud of points at high M and A values.
Lowess normalization uses least squares regression and is therefore not robust against these systematic outliers. As a better alternative, we recommend running median smoothing. Here, the data points (i.e. probe values) are sorted by A-value, and for each data point i the median of the M-values for points i-K to i+K is calculated, effectively averaging over the 2K+1 neighboring values (see, e.g., Härdle and Steiger, 1995). These medians are subtracted from the original M values. Since the median is much less affected by outliers than the average, this method is considerably more robust.

When analyzing two-color arrays, a strong nonlinear M-A dependence is often observed. In these cases, averaging over dye exchange replicates may be advisable and may even render running median normalization unnecessary (see "Experimental Design").

6. Spatial Flaws in Enrichment Images

Spatially nonuniform intensity distributions are easy to spot in the array image using a temperature plot of the log enrichment values. These spatially nonuniform intensities seem to be a fairly frequent source of noise and may have more severe effects than the localized defects discussed in step 1, insofar as they affect more probes. The cause of these defects is likely to be a spatially varying hybridization efficiency (Koren et al., 2007). A fixed z-scale, e.g. between -2 and 2, should be used for the temperature plot, as automatic z-scaling in the presence of a few outlier z-values will lead to the false impression of a very homogeneous enrichment image.

Because probes on Affymetrix arrays are arranged on the array according to their sequence similarities, the spatial log enrichment plot should be done after the running median correction. In that case, the vertical and horizontal stripes seen in step 1 that are caused by varying GC content etc. should have disappeared completely.
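The running median correction from step 5a, on which the spatial plot above relies, can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function name and the toy A and M values are hypothetical, and the window is simply truncated at the low and high ends of the A range, which is one possible edge treatment that the protocol does not prescribe.

```python
from statistics import median


def running_median_normalize(a_values, m_values, k):
    """Subtract from each M the median of M over up to 2k+1 neighbours in A-order.

    Probes are ranked by A; for the probe at rank r, the median is taken over
    ranks r-k .. r+k (truncated at the ends of the ranking, an assumption here).
    The result is returned in the original probe order.
    """
    order = sorted(range(len(a_values)), key=lambda i: a_values[i])
    m_sorted = [m_values[i] for i in order]
    n = len(m_sorted)
    corrected = [0.0] * n
    for rank, probe in enumerate(order):
        lo = max(0, rank - k)
        hi = min(n, rank + k + 1)
        corrected[probe] = m_values[probe] - median(m_sorted[lo:hi])
    return corrected


# Toy data with a purely intensity-dependent trend, M = 0.5 * A, in shuffled order.
a_example = [3.0, 1.0, 4.0, 0.0, 2.0]
m_example = [1.5, 0.5, 2.0, 0.0, 1.0]
m_corrected = running_median_normalize(a_example, m_example, k=1)
# m_corrected == [0.0, 0.0, 0.25, -0.25, 0.0]
```

The trend is removed for all interior probes; the two probes at the extremes of the A range keep a small residual because their windows are truncated. In practice K would be much larger than in this toy example, so that a cluster of strongly enriched probes does not dominate any window.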
If flaws are still visible, they can be corrected by subtracting running medians over columns or rows (for vertical and horizontal stripes, respectively), over square quadrants (for gradually varying intensities), or by setting enrichment values on localized flaws to NA.

7. Calculating Factor Occupancies

How can the rather abstract log enrichment profiles be transformed into something biologically interpretable? And how can the ChIP profiles of different factors be compared with each other on the same scale? Factors to which the cognate antibody binds with high affinity will show variations over a wider range than factors whose antibody binds more weakly, leading to the false impression that the former factors have more strongly varying occupancy. We would really like to know the factor occupancies directly, i.e., the percentage of cells in which a certain genomic position is occupied (Struhl, 2007).

Unfortunately, the background level corresponding to zero occupancy often varies strongly but smoothly on a scale of 10 kb. We therefore estimate the local background level (corresponding to 0% occupancy) by calculating a running 10% quantile over the log enrichment (after scale or quantile normalization, reference normalization, and running median/Lowess correction). The 10% quantile in a running window of, say, 10 kb is the value for which only 10% of probes are lower and 90% are higher. The underlying assumption is that in most running windows, around 20% of the probes have zero occupancy, so that the 10% quantile lies in the middle of these background probes. The background-corrected enrichment is then

bg-corrected enrichment = exp(log enrichment) - exp(running 10%-quantile of log enrichment)

To be able to scale this corrected enrichment to 100% at sites of full occupancy, we need to normalize this enrichment by the enrichments seen at sites believed to be 100% occupied.
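The background correction given by the formula above can be sketched as follows in Python. This is a rough illustration rather than a reference implementation: the function names are hypothetical, the window is expressed as a number of probes on either side (which assumes roughly constant probe spacing, whereas the text specifies the window in genomic distance), natural logarithms are assumed to match the exp() in the formula, and the quantile is taken as a plain order statistic without interpolation.

```python
import math


def running_quantile(values, half_window, q):
    """Empirical q-quantile in a window of up to 2*half_window+1 probes around each probe."""
    result = []
    for i in range(len(values)):
        window = sorted(values[max(0, i - half_window): i + half_window + 1])
        index = int(q * (len(window) - 1))  # lower order statistic, no interpolation
        result.append(window[index])
    return result


def background_corrected_enrichment(log_enrichment, half_window, q=0.10):
    """exp(log enrichment) minus exp(running q-quantile of log enrichment), per probe."""
    background = running_quantile(log_enrichment, half_window, q)
    return [math.exp(x) - math.exp(b) for x, b in zip(log_enrichment, background)]


# Toy profile: flat background at log enrichment 0 with a single bound probe.
profile = [0.0] * 21
profile[10] = 2.0
corrected = background_corrected_enrichment(profile, half_window=5)
```

In this toy profile the local background is 0 everywhere, so background probes receive a corrected enrichment of 0 and the bound probe receives exp(2) - 1.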
If we assume that around 0.2% of sites are fully occupied, we could calculate the absolute occupancy by dividing the background-corrected enrichment by the genome-wide 99.9% quantile of the background-corrected enrichment:

occupancy = bg-corrected enrichment / (genome-wide 99.9%-quantile of bg-corrected enrichment)

An alternative would be to divide by the smoothed enrichment at one or more positions assumed to be 100% occupied by the factor (such as snoRNAs and ribosomal protein genes occupied by PolII during exponential growth).

8. Identification of Significantly Enriched Probes

We do not recommend a particular method for this purpose. Any plug-and-play method that calculates P-values for enrichment assumes a certain error model for the measured intensities. Therefore, the safest approach in our view is to actually determine the error model for oneself. We would plot, for each probe, the estimate of the variance as a function of the mean probe intensity u and fit the points with a suitable parametric function. A standard choice for this function is (Rocke and Durbin, 2001)

variance = (c1*u + c2)² + c3

where c1, c2, c3 are coefficients chosen to fit the model variance to that observed. We found that often the intensity dependence of the error is not well described by this model. In this case, other parametric dependencies can be tried. We found the following forms helpful:

variance = c1*u^c2 + c3*u^c4

or

variance = 1/{ 1/(c1*u^c2) + 1/(c3*u^c4) }

The error model can be used to determine P-values for high enrichment values according to the formula

P-value = (½) erfc{ (enrichment - mean enrichment) / sqrt(2 * variance) }

Here, erfc is the complementary error function, so that the P-value is the upper tail probability of a normal distribution with the given mean and variance. Regions where probes show consistently low P-values (e.g. smaller than 5%) are significantly enriched.
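As a small worked sketch of this step, the Python snippet below evaluates the Rocke and Durbin variance model and converts an enrichment value into a one-sided P-value under a normal error model. The function names and the coefficient values are hypothetical; in a real analysis c1, c2 and c3 would come from a fit to the observed variances. The P-value uses the standard Gaussian upper-tail expression ½ erfc(z/√2), where z is the deviation from the mean in units of the standard deviation.

```python
import math


def rocke_durbin_variance(u, c1, c2, c3):
    """Variance model (c1*u + c2)^2 + c3 as a function of mean probe intensity u."""
    return (c1 * u + c2) ** 2 + c3


def enrichment_p_value(enrichment, mean_enrichment, variance):
    """One-sided P-value for observing at least this enrichment under a normal error model."""
    z = (enrichment - mean_enrichment) / math.sqrt(variance)
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Hypothetical coefficients and probe values, for illustration only.
variance = rocke_durbin_variance(u=10.0, c1=0.1, c2=1.0, c3=0.5)  # (1 + 1)^2 + 0.5 = 4.5
p_value = enrichment_p_value(enrichment=5.0, mean_enrichment=0.0, variance=variance)
is_significant = p_value < 0.05
```

An enrichment equal to the mean gives a P-value of 0.5, and an enrichment about 1.645 standard deviations above the mean gives a P-value of about 5%.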
It is recommended to apply the above analysis, including the estimation of the error model, to traces smoothed with running medians, as this decreases the number of falsely predicted enriched regions considerably.

Other Procedures

Smoothing

Traces can be smoothed by assigning the average or, better, the median over data points in a sliding window to the central point. Smoothing is often applied for aesthetic reasons. We advise against displaying smoothed data, because smoothing effectively removes information about the noise present in the original measurements. Displaying smoothed traces can also be highly misleading: by hiding the actual measurement noise at high spatial frequencies, noise features at lower frequencies that are not filtered out by the smoothing procedure are easily mistaken for real signals. If smoothed data are shown nevertheless, the procedure should at least be clearly mentioned in the figure caption.

Smoothing can, however, be very helpful for computational analysis to increase robustness. It is recommended, for example, before calculating Pearson correlation coefficients between measurements subjected to strongly correlated noise, before applying peak detection methods or estimating P-values for significant enrichment, or when calculating quantiles of log enrichments for background subtraction (step 7).

Rank normalization is a method similar to quantile normalization intended to make replicate measurements comparable. It maps the cumulative log enrichment distributions onto a uniform distribution between 1 and m, the number of probes per array. This procedure is not recommended, as it distorts the intensity scale in an unpredictable and unintuitive way. For example, as it results in a uniform density of measurements along the range 1...m, intensities in the high and low range will be compressed, whereas intensities in the medium range will be stretched out. A better alternative that conserves the original scale is quantile normalization.
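The distortion caused by rank normalization is easy to see in a toy example. The following Python sketch replaces each value by its rank between 1 and m; the function name and the five log enrichment values are hypothetical, and ties are broken by order of appearance, which is an arbitrary choice made only for this illustration.

```python
def rank_normalize(values):
    """Replace each value by its rank (1 = smallest, m = largest)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, probe in enumerate(order, start=1):
        ranks[probe] = rank
    return ranks


# Four background-like probes and one strongly enriched probe.
log_enrichment = [0.1, 3.0, 0.0, 0.2, 0.3]
ranks = rank_normalize(log_enrichment)
# ranks == [2, 5, 1, 3, 4]
```

The enriched probe lies 2.7 log units above its nearest neighbour, yet after rank normalization it is separated from it by a single rank step, the same distance as between background probes that differ by only 0.1.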
In variance stabilization normalization (VSN) (Huber et al., 2002), the measured intensities y are transformed y → h(y) in such a way that the variance of the transformed measurements h(y) will be equal along the entire h-scale. This method was proposed for expression arrays to simplify the subsequent statistical procedures for identifying significantly up- or down-regulated genes. Since the variance of measurements is determined by technical effects (such as electronic noise in the photomultiplier measuring the probe fluorescence), the transformation is arbitrary with regard to the ChIP enrichment that we want to measure and visualize. But worse, since the noise depends strongly on intensity, which in turn varies systematically along the genome (depending, for example, strongly on GC content), the VSN procedure introduces systematic errors into the binding profiles. Both VSN transformation and rank normalization lead to distorted profiles that are no longer directly interpretable in terms of x-fold enrichment or occupancy. However, for the purpose of calculating P-values for ChIP enrichment, VSN is justified.

Probe sequence-dependent normalization methods correct the probe sequence-dependent hybridization bias, such as the strong GC bias. A popular method of this kind is implemented in Model-based Analysis of Tiling arrays for ChIP-chip (MAT) (Johnson et al., 2006). These methods train a linear or nonlinear regression model to predict the background probe intensity from the probe sequence and divide probe intensities by the predicted background intensities. Recently it has been demonstrated that the simple normalization with a reference measurement (step 4) renders probe sequence-dependent normalization completely unnecessary (Chung and Vingron, 2009). Worse, in contrast to reference normalization, these methods are not able to correct for many strong systematic sequence- and genome context-dependent effects and may even obscure biological GC-biased signal (Gilbert and Rechtsteiner, 2009).
Last, the procedures introduce an arbitrary offset and scaling factor, hindering the interpretation of the obtained profiles in terms of x-fold enrichment.

Deconvolution methods aim to increase the spatial resolution of the measured binding profiles by post-processing. Two articles have been published on this topic so far (Reiss et al., 2006; Qi et al., 2006). We find that, in practice, these methods are rarely applied, for reasons unknown to us. We will investigate this topic in the future.

Acknowledgements

We are grateful to Tobias Straub for detailed discussions, on which several parts of this protocol are based. See his Protocol on the Analysis of NimbleGen ChIP Chip Data using Bioconductor/R (PROT43). We would also like to thank Achim Tresch and Benedikt Zacher for fruitful discussions.

Reviewer Comments

Reviewed by: Tobias Straub, Adolf-Butenandt-Institut, Ludwig Maximilians University, München, Germany

1. Given the fact that chromatin preparations can be subject to very strong variations regarding, for example, fragment size and crosslink efficiency, we strongly support the idea that a replicate measurement will always involve the comparative analysis of IP material with a matched genomic sample. An experiment comprising three biological replicates would therefore require three dual-color arrays with matched IP and input hybridisation on each array. Alternatively, six single-color arrays have to be hybridized individually.

References

1. Bolstad BM, Irizarry RA, Astrand M, Speed T (2003) A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19:185-193
2. Chung H-R, Vingron M (2009) Comparison of sequence-dependent tiling array normalization approaches. BMC Bioinformatics 10:204
3. Cleveland WS (1979) Robust locally weighted regression and smoothing scatterplots. J Am Stat Assoc 74:829-836
4. Do JH, Choi DK (2006) Normalization of microarray data: single-labeled and dual-labeled arrays. Mol Cells 22:254-261
5. Gilbert D, Rechtsteiner A (2009) Comments on sequence normalization of tiling array expression. Bioinformatics, in press
6. Härdle W, Steiger W (1995) Optimal median smoothing. Applied Statistics 44:258-264
7. Huber W, von Heydebreck A, Sültmann H, Poustka A, Vingron M (2002) Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics 18:96-104
8. Johnson WE, Li W, Meyer CA, Gottardo R, Carroll JS, Brown M, Liu XS (2006) Model-based analysis of tiling-arrays for ChIP-chip. Proc Natl Acad Sci U S A 103:12457-12462
9. Knott SR, Viggiani CJ, Aparicio OM, Tavare S (2009) Strategies for analyzing highly enriched IP-chip datasets. BMC Bioinformatics 10:305
10. Koren A, Tirosh I, Barkai N (2007) Autocorrelation analysis reveals widespread spatial biases in microarray experiments. BMC Genomics 8:164
11. Qi Y, Rolfe A, MacIsaac KD, Gerber GK, Pokholok D, Zeitlinger J, Danford T, Dowell RD, Fraenkel E, Jaakkola TS, Young RA, Gifford DK (2006) High-resolution computational models of genome binding events. Nat Biotechnol 24:963-970
12. Reiss DJ, Facciotti MT, Baliga NS (2006) Model-based deconvolution of genome-wide DNA binding. Bioinformatics 24:396-403
13. Rocke DM, Durbin B (2001) A model for measurement error for gene expression arrays. J Comput Biol 8:557-569
14. Smyth GK, Speed T (2003) Normalization of cDNA microarray data. Methods 31:265-273
15. Struhl K (2007) Interpreting chromatin immunoprecipitation experiments. In: Zuk D, editor. Evaluating Techniques in Biochemical Research. Cambridge, MA: Cell Press. pp. 29-33
16. Toedling J, Huber W (2008) Analyzing ChIP-chip data using Bioconductor. PLoS Comput Biol 4:e1000227