THE EPIGENOME Network of Excellence
A Guideline for ChIP - Chip Data Quality Control
and Normalization (PROT 47)
Matthias Siebert, Michael Lidschreiber, Holger
Hartmann, and Johannes Söding
Gene Center Munich
Ludwig-Maximilians-Universität
Feodor-Lynen-Str. 25
81377 Munich, Germany
email feedback to: soeding@genzentrum.lmu.de
Last reviewed: 01 Dec 2009 by Tobias Straub, Adolf-Butenandt-Institut, Ludwig-Maximilians
University, München, Germany
Introduction
Chromatin immunoprecipitation coupled to tiling microarray analysis (ChIP-on-chip) is used to
measure genome-wide the DNA binding sites of a protein of interest. In ChIP-on-chip, proteins
are covalently cross-linked to the DNA by formaldehyde, cells are lysed, the chromatin is
immunoprecipitated with an antibody to the protein of interest and the fragmented DNA that is
directly or indirectly bound to the protein is analyzed with tiling arrays. For this purpose, the
fragmented DNA is fluorescently labeled and hybridized to the tiling array, which consists of
millions of short (25 to 60 nucleotides long) probes that cover the genome at a constant spacing
(4 to 100s of nucleotides), like tiles covering a roof. The data generated by one experiment
consists of an intensity value for each DNA probe. These values measure the relative quantity of
DNA at the probe's genomic position in the immunoprecipitated material.
This guideline describes the first steps in the data analysis for ChIP-on-chip measurements in a
bare-bones fashion. The steps comprise the quality control of the obtained data and
normalizations to render the data comparable between different arrays, to correct for
saturation effects, and to obtain enrichment and occupancy values. Peak calling and other
downstream procedures are not part of this exposé. For each step, we give detailed
recommendations, warn about problematic but often popular procedures, and occasionally
suggest improved versions of standard procedures.
Although we have gained experience in ChIP-chip data analysis mainly in yeast, this protocol
tries to give a general guideline that should be applicable to other species and to both single-
and two-color arrays. Most advocated procedures can be carried out within the R environment
for statistical data analysis, using packages from the Bioconductor project (see, e.g., Toedling
and Huber, 2008). A Bioconductor package "Starr" for Affymetrix platforms supporting the
analysis steps described here is available and will be described in an upcoming protocol (Zacher
B. and Tresch A., 2010).
Procedure
Experimental Design
Number of Replicates
It is advisable to measure at least two biological replicates of all factors or conditions, for two
reasons. First and foremost, corrupted measurements will normally be easily identifiable by
comparing replicates. Even if both of two measurements are corrupted, corruption tends to be
erratic and will normally not result in the expected high correlation between replicates (step 2
below). Second, averaging over N replicates reduces the standard deviation of unsystematic
noise (i.e. the random scattering of measured values) by a factor vN. This means a factor of 1.4
reduction for two replicate measurements, 1.7 for three, and 2.0 for four replicates. The payoff is thus particularly high for the second replicate and falls of slowly for higher N.
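The 1/√N noise reduction is easy to verify by simulation. Below is a minimal NumPy sketch (the protocol's own analyses are carried out in R/Bioconductor; this is only an illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def averaged_noise_sd(n_replicates, n_probes=100_000, noise_sd=1.0):
    """Standard deviation of the per-probe mean over n_replicates
    simulated replicate measurements with unsystematic Gaussian noise."""
    measurements = rng.normal(0.0, noise_sd, size=(n_probes, n_replicates))
    return measurements.mean(axis=1).std()

# roughly 1.0, 0.71, 0.58, 0.50: the 1/sqrt(N) reduction
for n in (1, 2, 3, 4):
    print(n, round(float(averaged_noise_sd(n)), 2))
```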
Mock IP or Genomic Input?
In any ChIP-on-chip experiment, it is important to correct for sequence- and genomic
region-specific biases in the efficiency of the various biochemical and biophysical steps of the
ChIP-on-chip protocol. Crosslinking of the DNA to proteins, fragmentation of the chromatin,
immunoprecipitation, PCR amplification, and hybridization to the array all have strong biases
that must be corrected for. This is done by measuring a reference signal and dividing the true
signal intensities obtained from the immunoprecipitated protein by the reference intensities.
This is in our experience by far the most important step in data normalization. A frequent point
of contention is whether it is better to use the genomic input fraction or a mock
immunoprecipitation (mock IP) for this normalization. (The mock IP is performed with the
wild-type strain if the true signal is obtained by immunoprecipitating the protein of interest with a
protein tag. It is done using an unspecific antibody if the protein of interest is purified with a
specific antibody.)
Our experience is that normalization with the mock IP is preferable if a good signal can be
obtained from the mock IP hybridization, i.e., if the signal-to-noise ratio of the mock IP is not
much inferior to that of the true IP. We have seen higher noise levels and occasional artefacts
when using normalization with the genomic input. One reason might be that normalization with
the genomic input does not correct for sequence- and region-dependent biases of unspecific
binding to the antibody. If signal-to-noise of the mock is a problem, or if the chromatin
preparation is deemed hard to reproduce in different strains, then normalization with matched
genomic input is advisable. An obvious drawback for single-color arrays is, however, that for
each measurement, a matched genomic input is hybridized, necessitating almost twice the
number of arrays as if all measurements are normalized with a single mock IP. We have found
that genomic input measurements can be highly reproducible for different experiments and
chromatin preparations. In this case it may be justified to use only a few representative input
measurements for data normalization and to forego the costly measurement of a matched
genomic input sample for each IP (see comment 1). A reasonable procedure is to measure both
genomic input and mock IP for representative factors and to compare the signal-to-noise ratio for both
normalizations. Another possibility for two-color arrays is to normalize with both input and
mock IP:
log enrichment = log[ (signal1/input1) / (mock2/input2) ] = log(signal1/input1) - log(mock2/input2)
In this case, a dye exchange should be performed (next paragraph).
Dye Exchange
When using two-color arrays, such as those from NimbleGen or Agilent, the dye-exchange
procedure is very much recommended (e.g. Do and Choi, 2006). This means the dyes Cy3 and
Cy5, with which signal (true IP) and reference fractions are labeled, should be switched
between the two (or four) biological replicates, such that the signal fraction is labeled with Cy5
in half the measurements and with Cy3 in the other half. Dye exchange corrects for
dye-dependent saturation effects in a much better way than downstream data normalization could
ever achieve (steps 5 and 5a below).
Data Analysis Protocol
1. Spatial Flaws in Raw Image Data
Arrays should be checked for localized spatial defects and for nonuniform mean intensity
distributions. On most platforms except Affymetrix, probes are randomly arranged across the
array. Therefore, local or global spatial patterns in measured intensities can be assumed to be
artifacts. They will lead to increased noise, in particular if many probes are affected. Localized
effects arise from scratches, bubbles, manufacturing problems etc. They can be detected by
looking at the array's raw intensity image, ideally directly after the exposure of the array, as
flaws can lead to underexposure and other problems that might still be corrected while the
array is mounted (see protocol 43 by Tobias Straub). In case of strong defects, either the
affected probes or the entire array measurement should be discarded. On Affymetrix arrays,
probes are arranged randomly with respect to genomic location, but probes with similar
sequences are placed near each other. This leads to horizontal and vertical stripes around 20-50
rows or columns wide with systematically varying intensity, and also to more large-scale
variation of intensities across the array.
2. Log-Log Scatter Plots
First, we correct for different scales in the raw intensities by subtracting the median from the
log intensities over each array. This is necessary since the scale of measurements can easily
change due to varying amounts of chromatin applied, changing exposure times or hybridization
efficiencies and other effects.
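The median centering described above can be sketched in a few lines of NumPy (an illustration only; the protocol's analyses are done in R/Bioconductor):

```python
import numpy as np

def median_center(log_intensities):
    """Subtract the array-wide median log intensity, putting all arrays
    on a common scale (step 2 of the protocol)."""
    x = np.asarray(log_intensities, dtype=float)
    return x - np.median(x)

# the same binding profile measured with 3x more chromatin on the second array
rep1 = np.log([100.0, 200.0, 400.0, 800.0])
rep2 = np.log([300.0, 600.0, 1200.0, 2400.0])
print(np.allclose(median_center(rep1), median_center(rep2)))  # True
```

A constant multiplicative factor on the raw intensities becomes an additive shift in log space, which the median subtraction removes exactly.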
To identify flawed array measurements, one can plot the logs of the rescaled intensities of pairs
of biological replicates for each probe in X-Y scatter or (better) contour plots (i.e.
measurements under the same conditions, wild type vs. wild type, signal vs. signal). The data
points on these scatter plots should lie scattered close to a straight line. If the scales are fairly
different or if the data points lie scattered about a bent curve, one should try to find out the
experimental causes, as this may lead to nonlinearities between measurements for different
conditions, which may be difficult to correct. (Note that different scales and non-linearities are
common for Affymetrix arrays and do not seem to indicate experimental problems, whereas
they are uncommon for Nimblegen arrays with their longer probes.) If the log intensity points
for two replicates are approximately distributed along a straight line, we can use scale
normalization (Smyth and Speed, 2003; Do and Choi, 2006) to bring them onto the diagonal
(step 2a). If the relationship between replicates is not linear, quantile normalization (Bolstad
et al., 2003) is required to map them onto the diagonal (step 2b). This normalization will make the
replicate measurements comparable and allow averaging over them (step 3). One can also
plot array measurements taken under different conditions (averaged over replicates) in a scatter
plot to check if the distributions are similar.
Outliers in the scatter plot indicate corrupted probes in one of the arrays. Their values should
be set to the geometric average of the two neighbouring probes, or, less ideally, to NA (“not
available”). Many outliers indicate a corrupted array.
2a. Scale Normalization
This is recommended if a linear relationship between replicate measurements is observed in the
log-log scatter plots in step 2. In scale normalization, one divides the log intensities of each
replicate by its median absolute deviation (MAD) (Smyth and Speed, 2003). The MAD estimates
the spread of a distribution, similar to the standard deviation, but it is preferable because of its
much better robustness with respect to outliers. It is simply the median of the absolute
deviation of all measurements from the median. Let median{X1,...,Xm} denote the median of m
measurements X1,...,Xm. Then, the absolute deviation of X1 from the median is
|X1 - median{X1,...,Xm}|. The MAD of measurements X1,...,Xm is defined as
MAD = median{ |X1 - median{X1,...,Xm}|, ..., |Xm - median{X1,...,Xm}| }
To conserve the meaning of the rescaled profiles in terms of log enrichment, we recommend
multiplying all profiles by the geometric mean of the MAD values of all arrays that are
scale-normalized together. MAD values that differ by more than a factor of, say, 1.2 between
replicates might indicate problems with the measurements. One should check the experimental
procedures to find the cause of such discrepancies. Scale normalization should not be
applied to non-replicate measurements, except in the case that the different conditions are
known to have only a minor effect on the expected enrichment distribution.
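The MAD and the scale normalization defined above can be sketched as follows (a NumPy illustration; in practice this is done within R/Bioconductor):

```python
import numpy as np

def mad(x):
    """Median absolute deviation from the median (no consistency factor)."""
    x = np.asarray(x, dtype=float)
    return np.median(np.abs(x - np.median(x)))

def scale_normalize(log_replicates):
    """Divide each replicate's log intensities by its MAD, then multiply by
    the geometric mean of all MADs to conserve the log-enrichment scale."""
    mads = np.array([mad(r) for r in log_replicates])
    geo_mean_mad = np.exp(np.log(mads).mean())
    return [np.asarray(r, dtype=float) / m * geo_mean_mad
            for r, m in zip(log_replicates, mads)]
```

After normalization, all replicates have the same MAD, equal to the geometric mean of the original MADs, so the overall log-enrichment scale is preserved.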
2b. Quantile Normalization
This is recommended if a nonlinear relationship between replicate measurements is observed in
the log-log scatter plots in step 2. Quantile normalization can be performed within the
Bioconductor/R framework (see, e.g. Toedling and Huber, 2008). It maps the cumulative log
intensity (or enrichment) distributions of two or more replicates onto the average cumulative
distribution. Ideally, quantile normalization should not be necessary. In case of a strongly
nonlinear relationship between replicates, one should consider taking another replicate
measurement and discarding the data of the affected array. Quantile normalization
should not be performed at all between non-replicates, since even minor differences in the
expected distribution (e.g. concerning only the 10% most enriched probes) can lead to gross and
unexpected distortions (Knott et al., 2009).
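The mapping onto the average cumulative distribution can be sketched as follows (a NumPy illustration of the Bolstad et al. procedure; ties are broken arbitrarily in this minimal version):

```python
import numpy as np

def quantile_normalize(replicates):
    """Quantile normalization: sort each array, average across arrays rank by
    rank, then put the averaged values back at each probe's original rank."""
    data = np.column_stack([np.asarray(r, dtype=float) for r in replicates])
    ranks = np.argsort(np.argsort(data, axis=0), axis=0)   # rank of each probe per array
    mean_sorted = np.sort(data, axis=0).mean(axis=1)       # average cumulative distribution
    return [mean_sorted[ranks[:, j]] for j in range(data.shape[1])]

a, b = quantile_normalize([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]])
print(a, b)  # identical distributions after normalization
```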
3. Averaging Over Replicates
For each condition (including the reference measurements), we average the signal for each
probe over replicates by calculating the arithmetic mean of the log intensities. This is
equivalent to taking the log of the geometric average over the replicate intensities. Averaging
over N replicates reduces the unsystematic noise by a factor of approximately √N.
4. Normalization with Reference
By far the most effective normalization step in our experience consists in calculating the log
enrichment by subtracting for each probe the averaged log reference intensity from the
averaged log signal intensity. This is equivalent to taking the log of the ratio of the geometric
mean over signal intensities to the geometric mean over reference intensities:
log enrichment
= arithmetic mean of log (signal) – arithmetic mean of log (reference)
= log( geometric mean of signal / geometric mean of reference)
Here, "reference" refers to either the mock IP array measurement or the genomic input
measurement (see “Experimental Design”). This normalization corrects very effectively for the
strong sequence-dependent probe hybridization biases, chromatin-dependent cross-linking and
fragmentation biases.
5. MA-Plots
This is a classic and important quality control plot to spot and correct saturation-dependent
effects in the log enrichment. For each probe, the log enrichment M is plotted versus the
average log intensity of signal and reference, A:
M = arithmetic mean of log(signal) - arithmetic mean of log (reference)
A = ½ [ arithmetic mean of log(signal) + arithmetic mean of log (reference) ] .
Ideally, the measured enrichment should be independent of the mean intensity A of signal and
reference. But if the signal and the reference measurements have different saturation behavior,
the expectation of M will show a dependence on A. Any marked dependence should be corrected
by running median normalization (step 5a). Note that MA plots are very similar to log-log scatter
plots: scaling the x-axis by a factor of 2 and rotating the MA plot counterclockwise by 45° results
in a scatter plot. See Fig. 4 of Tobias Straub's protocol for an example of an MA-plot.
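The M and A coordinates defined above are a one-line computation (a NumPy sketch for illustration):

```python
import numpy as np

def ma_values(mean_log_signal, mean_log_reference):
    """Per-probe M (log enrichment) and A (mean log intensity) coordinates,
    from log intensities already averaged over replicates (step 3)."""
    s = np.asarray(mean_log_signal, dtype=float)
    r = np.asarray(mean_log_reference, dtype=float)
    return s - r, 0.5 * (s + r)

M, A = ma_values([3.0, 5.0], [1.0, 1.0])
print(M, A)  # M = [2. 4.], A = [2. 3.]
```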
5a. Lowess or Running Median Normalization
Locally weighted scatter plot smoothing (Lowess) is a standard method to correct for the
dependence of log enrichments (M) on intensity (A) (Cleveland, 1979; Smyth and Speed, 2002;
Do and Choi, 2006). It fits a polynomial locally at each position, resulting in a smooth moving
average of the scatter points for a certain A-range. The smoothed mean is subtracted from all
M-values. Hence, after Lowess normalization, the data points will lie scattered around a
horizontal line in the MA plot. Lowess normalization reduces the systematic noise caused by
variations in total intensities (e.g. through GC-content variations, chromatin states etc.) that
would otherwise systematically affect the calculated log enrichment. At the same time, Lowess
normalization does not lead to increased statistical noise.
We advise against using Lowess normalization, however. It relies on the problematic assumption
that the entire data set is composed of background measurements that should be scattered
around a mean M value. In fact, some regions may be strongly bound by the ChIPped factor,
producing a cloud of points at high M and A values. Lowess normalization uses least squares
regression and is therefore not robust against these systematic outliers. As a better alternative,
we recommend using running median smoothing. Here, the data points (i.e. probe
values) are sorted by A-value and for each data point i the median of the M-values for points i-K
to i+K is calculated, effectively averaging over the 2K+1 neighboring values (see, e.g., Härdle
and Steiger, 1995). These medians are subtracted from the original M values. Since the median
is much less affected by outliers than the average, this method is considerably more robust.
When analyzing two-colour arrays, a strong nonlinear M-A dependence is often observed. In
these cases, averaging over dye exchange replicates may be advisable and may even render
running median normalization unnecessary (see “Experimental Design”).
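The running median normalization described above can be sketched as follows (a straightforward NumPy illustration, not optimized for millions of probes):

```python
import numpy as np

def running_median_normalize(M, A, K):
    """Sort probes by A; from each probe's M subtract the median of the
    M-values of probes i-K..i+K in that ordering (window truncated at the
    ends); return the corrected M in the original probe order."""
    M = np.asarray(M, dtype=float)
    order = np.argsort(A)
    M_sorted = M[order]
    n = len(M_sorted)
    corrected = np.empty(n)
    for i in range(n):
        lo, hi = max(0, i - K), min(n, i + K + 1)
        corrected[i] = M_sorted[i] - np.median(M_sorted[lo:hi])
    out = np.empty(n)
    out[order] = corrected
    return out
```

A probe cloud whose expected M drifts with A is flattened onto the M = 0 line, while a small fraction of strongly bound probes barely shifts the window medians.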
6. Spatial Flaws in Enrichment Images
Spatially non-uniform intensity distributions are easy to spot in the array image using a
temperature plot of the log enrichment values. These spatially nonuniform intensities seem to
be a fairly frequent source of noise and may have more severe effects than the localized defects
discussed in step 1, since they affect more probes. The cause for these defects is likely to
be a spatially varying hybridization efficiency (Koren et al., 2007). A fixed z-scale, e.g. between -2
and 2, should be used for the temperature plot, as automatic z-scaling in the presence of a few
outlier z-values will lead to the false impression of a very homogeneous enrichment image.
Because probes on Affymetrix arrays are arranged on the array according to their sequence
similarities, the spatial log enrichment plot should be done after the running median correction.
In that case, the vertical and horizontal stripes seen in step 1 that are caused by varying GC
content etc. should have disappeared completely. If flaws are still visible, they can be
corrected by subtracting running medians over columns or rows (for vertical and horizontal
stripes, respectively), over square quadrants (for gradually varying intensities), or by setting
enrichment values on localized flaws to NA.
7. Calculating Factor Occupancies
How can the rather abstract log enrichment profiles be transformed into something biologically
interpretable? And how can the ChIP profiles of different factors be compared with each other
on the same scale? Factors to which the cognate antibody binds with high affinity will show
variations over a wider range than factors whose antibody binds more weakly, leading to the
false impression that the former factors have more strongly varying occupancy. We would really
like to know the factor occupancies directly, i.e., the percentage of cells in which a certain
genomic position is occupied (Struhl, 2007). Unfortunately, the background level corresponding
to zero occupancy often varies strongly but smoothly on a scale of 10kb. We therefore estimate
the local background level (corresponding to 0% occupancy) by calculating a running 10%
quantile over the log enrichment (after scale or quantile normalization, reference
normalization, and running median /Lowess correction). The 10% quantile in a running window
of, say 10kbp, is the value for which only 10% of probes are lower and 90% are higher. The
underlying assumption is that in most running windows, around 20% of the probes have zero
occupancy. The background-corrected enrichment is then
bg-corrected enrichment = exp(log enrichment) - exp(running 10%-quantile of log enrichment)
To be able to scale this corrected enrichment to 100% at sites of full occupancy, we need to
normalize this enrichment by the enrichments seen at sites believed to be 100% occupied. If we
assume that around 0.2% of sites are fully occupied, we could calculate the absolute occupancy
by dividing the background-corrected enrichment by the genome-wide 99.9% quantile of the
background-corrected enrichment:
occupancy = bg-corrected enrichment
/ (genome-wide 99.9%-quantile of bg-corrected enrichment)
An alternative would be to divide by the smoothed enrichment at a position (or positions)
assumed to be 100% occupied by the factor (such as snoRNAs and ribosomal protein genes
occupied by PolII during exponential growth).
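The occupancy calculation of step 7 can be sketched as follows. This is a NumPy illustration with a window size and quantile choices taken from the text; a real genome-wide run would need a faster running-quantile implementation:

```python
import numpy as np

def occupancy(log_enrichment, window=101):
    """Background-correct with a running 10% quantile of the log enrichment,
    then scale by the genome-wide 99.9% quantile (both choices as in step 7)."""
    x = np.asarray(log_enrichment, dtype=float)
    n = len(x)
    half = window // 2
    bg = np.array([np.quantile(x[max(0, i - half):min(n, i + half + 1)], 0.10)
                   for i in range(n)])                      # local zero-occupancy level
    corrected = np.exp(x) - np.exp(bg)                      # bg-corrected enrichment
    return corrected / np.quantile(corrected, 0.999)        # scale to ~1 at full occupancy

x = np.zeros(1001)
x[[100, 300, 500, 700, 900]] = np.log(10.0)   # a few strongly bound probes
print(round(float(occupancy(x, window=101)[500]), 2))  # 1.0
```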
8. Identification of Significantly Enriched Probes
We do not recommend a particular method for this purpose. Any plug-and-play method that
calculates P-values for enrichment assumes a certain error model for the measured intensities.
Therefore, the safest approach in our view is to actually determine the error model for oneself.
We would plot for each probe the estimate of the variance as a function of the mean probe
intensity u and fit the points with a suitable parametric function. A standard choice for this
function is (Rocke and Durbin, 2001)
variance = (c1*u + c2)² + c3
where c1, c2, and c3 are coefficients fitted so that the model variance matches the observed variance.
We found that often the intensity dependence of the error is not well described by this model.
In this case, other parametric dependencies can be tried. We found the following forms helpful:
variance = c1*u^c2 + c3*u^c4 or variance = 1/{ 1/(c1*u^c2) + 1/(c3*u^c4) }
The error model can be used to determine P-values for high enrichment values according to the
formula
P-value = (½) erfc{ (enrichment - mean enrichment) / sqrt(2 × variance) }
Here, erfc is the complementary error function. Regions where probes show consistently low
P-values (e.g. smaller than 5%) are significantly enriched. It is recommended to apply the above
analysis – including the estimation of the error model – to traces smoothed with running
medians, as this decreases the number of falsely predicted enriched regions considerably.
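The variance model and the P-value formula above can be sketched as follows (the coefficients c1, c2, c3 would come from the user's own fit; the P-value uses the standard Gaussian upper tail ½·erfc((x − μ)/(σ√2))):

```python
from math import erfc, sqrt

def rocke_durbin_variance(u, c1, c2, c3):
    """Variance model of Rocke and Durbin (2001) as a function of mean intensity u."""
    return (c1 * u + c2) ** 2 + c3

def enrichment_pvalue(enrichment, mean_enrichment, variance):
    """Upper-tail Gaussian P-value for an observed enrichment."""
    z = (enrichment - mean_enrichment) / sqrt(2.0 * variance)
    return 0.5 * erfc(z)

print(round(enrichment_pvalue(1.96, 0.0, 1.0), 3))  # ~0.025, the familiar normal tail
```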
Other Procedures
Smoothing
Traces can be smoothed by assigning the average or, better, median over data points in a sliding
window to the central point. Smoothing is often applied for aesthetic reasons. We advise against
displaying smoothed data, because smoothing effectively removes information about the noise
present in the original measurements. Displaying smoothed traces can also be highly misleading:
By hiding the actual measurement noise at high spatial frequencies, noise features at lower
frequencies that are not filtered out by the smoothing procedure are easily mistaken for real
signals. The procedure should therefore at least be clearly mentioned in the figure caption.
Smoothing can, however, be very helpful for computational analysis to increase robustness. It is
recommended, for example, before calculating Pearson correlation coefficients between
measurements subjected to strongly correlated noise, before applying peak detection methods
or estimating P-values for significant enrichment (step 8), or when calculating quantiles of log enrichments for
background subtraction (step 7).
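A running median smoother of the kind recommended here fits in a few lines (a NumPy sketch for illustration):

```python
import numpy as np

def running_median(values, K):
    """Median over a sliding window of 2K+1 probes (truncated at the ends)."""
    v = np.asarray(values, dtype=float)
    n = len(v)
    return np.array([np.median(v[max(0, i - K):min(n, i + K + 1)])
                     for i in range(n)])

# a single-probe outlier is removed, unlike with a sliding mean
print(running_median([1.0, 1.0, 100.0, 1.0, 1.0], K=1))
```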
Rank Normalization
Rank normalization is a method similar to quantile normalization intended to make replicate
measurements comparable. It maps the cumulative log enrichment distributions onto a uniform
distribution between 1 and m, the number of probes per array. This procedure is not
recommended, as it distorts the intensity scale in an unpredictable and unintuitive way. For
example, as it results in a uniform density of measurements along the range 1...m, intensities in
the high and low range will be compressed, whereas intensities in the medium range will be
stretched out. A better alternative that conserves the original scale is quantile normalization.
Variance Stabilization Normalization
In variance stabilization normalization (VSN) (Huber et al., 2002), the measured intensities y are
transformed y → h(y) in such a way that the variance of the transformed measurements h(y)
will be equal along the entire h-scale. This method was proposed for expression arrays to
simplify the subsequent statistical procedures for identifying significantly up- or down-regulated
genes. Since the variance of measurements is determined by technical effects (such as
electronic noise in the photomultiplier measuring the probe fluorescence), the transformation is
arbitrary with regard to the ChIP enrichment that we want to measure and visualize. But worse,
since the noise depends strongly on intensity which again varies systematically along the
genome (depending for example strongly on GC content), the VSN procedure introduces
systematic errors into the binding profiles. Both VSN transformation and rank normalization lead
to distorted profiles that are not directly interpretable in terms of x-fold enrichment or
occupancy anymore. However, for the purpose of calculating P-values for ChIP enrichment, VSN
is justified.
Probe Sequence-Dependent Normalization
Probe sequence-dependent normalization methods correct for the probe sequence-dependent
hybridization bias, such as the strong GC bias. A popular method of this kind is implemented in
Model-based Analysis of Tiling arrays for ChIP-chip (MAT) (Johnson et al., 2006). These methods train a
linear or nonlinear regression method to predict the background probe intensity from the probe
sequence and divide probe intensities by the predicted background intensities. Recently it has
been demonstrated that the simple normalization with a reference measurement (step 4)
renders probe sequence-dependent normalization completely unnecessary (Chung and Vingron,
2009). Worse, in contrast to reference normalization, it is not able to correct for many strong
systematic sequence- and genome context-dependent effects and may even obscure biological
GC-biased signal (Gilbert and Rechtsteiner, 2009). Last, the procedures introduce an arbitrary
offset and scaling factor, hindering the interpretation of the obtained profiles in terms of x-fold
enrichment.
Deconvolution
Deconvolution methods aim to increase the spatial resolution of the measured binding profiles
by post-processing. Two articles have been published on this topic so far (Reiss et al., 2006;
Qi et al., 2006). We find that, in practice, these methods are rarely applied, for reasons
unknown to us. We will
investigate this topic in the future.
Acknowledgements
We are grateful to Tobias Straub for detailed discussions, on which several parts of this protocol
are based. See his protocol on the Analysis of NimbleGen ChIP-chip Data using
Bioconductor/R (PROT43). We would also like to thank Achim Tresch and Benedikt Zacher for
fruitful discussions.
Reviewer Comments
Reviewed by: Tobias Straub, Adolf-Butenandt-Institut, Ludwig-Maximilians University, München,
Germany
1. Given the fact that chromatin preparations can be subject to very strong variations
regarding, for example, fragment size and crosslink efficiency, we strongly support the idea
that a replicate measurement will always involve the comparative analysis of IP material
with a matched genomic sample. An experiment comprising three biological replicates
would therefore require three dual-color arrays with matched IP and input hybridisation
on each array. Alternatively, six single-color arrays have to be hybridized individually.
References
1. Bolstad BM, Irizarry RA, Astrand M, Speed T (2003) A comparison of normalization
methods for high density oligonucleotide array data based on variance and bias.
Bioinformatics 19:185-193
2. Chung H-R and Vingron M (2009) Comparison of sequence-dependent tiling array
normalization approaches. BMC Bioinformatics 10:204
3. Cleveland WS (1979) Robust locally weighted regression and smoothing scatterplots. J Am
Stat Assoc 74:829–836
4. Do JH and Choi DK (2006) Normalization of microarray data: single-labeled and
dual-labeled arrays. Mol Cells 22:254-261
5. Gilbert D and Rechtsteiner A. (2009) Comments on sequence normalization of tiling array
expression. Bioinformatics, in press
6. Härdle W and Steiger W (1995) Optimal median smoothing. Applied Statistics 44:258–264
7. Huber W, von Heydebreck A, Sültmann H, Poustka A, Vingron M (2002) Variance
stabilization applied to microarray data calibration and to the quantification of
differential expression. Bioinformatics 18:96-104
8. Johnson WE, Li W, Meyer CA, Gottardo R, Carroll JS, Brown M, Liu XS (2006) Model-based
analysis of tiling-arrays for ChIP-chip. Proc Natl Acad Sci U S A 103:12457-12462
9. Knott SR, Viggiani CJ, Aparicio OM, Tavare S. (2009) Strategies for analyzing highly
enriched IP-chip datasets. BMC Bioinformatics 10:305
10. Koren A, Tirosh I, and Barkai N. (2007) Autocorrelation analysis reveals widespread spatial
biases in microarray experiments. BMC Genomics 8:164
11. Qi Y, Rolfe A, MacIsaac KD, Gerber GK, Pokholok D, Zeitlinger J, Danford T, Dowell RD,
Fraenkel E, Jaakkola TS, Young RA, Gifford DK (2006) High-resolution computational
models of genome binding events. Nat Biotechnol. 24:963-70
12. Reiss DJ, Facciotti MT, Baliga NS. (2006) Model-based deconvolution of genome-wide DNA
binding. Bioinformatics 24:396-403
13. Rocke DM and Durbin B. (2001) A model for measurement error for gene expression
arrays. J. Comput Biol. 8:557-569
14. Smyth GK, Speed T (2003) Normalization of cDNA microarray data. Methods 31:265-273
15. Struhl K (2007) Interpreting chromatin immunoprecipitation experiments. In: Zuk D,
editor. Evaluating Techniques in Biochemical Research. Cambridge, MA: Cell Press. pp.
29–33
16. Toedling J, Huber W (2008) Analyzing ChIP-chip data using Bioconductor. PLoS Comput
Biol 4:e1000227