The Affymetrix® Chromatin Immunoprecipitation (ChIP

ChIP-chip and Genome Tiling Microarrays
Index
Summary
Experimental procedures of ChIP-chip
ChIP-chip data analysis
Implications of ChIP-chip
Concluding remarks
Sources
Recommended reading
Summary
Interactions between proteins and DNA are fundamental to many biological processes
including transcription, DNA modification, replication and repair, nucleosome assembly and
chromatin packing. These processes are not only involved in cell survival and proliferation, but
also required for maintaining cell identity and mediating distinct physiological functions. DNA
binding proteins include transcription factors, which directly modulate the process of
transcription, nucleases which cleave DNA molecules, and histones which are involved in DNA
packaging into high-order chromatin structure. Many cofactor proteins which indirectly associate
with DNA also play important roles in modulating the activity of interacting transcription factors.
A comprehensive knowledge of where proteins and their regulatory cofactors interact with the
genome would greatly facilitate our understanding of the mechanism and logic of cellular events.
ChIP-chip, a technique that combines chromatin immunoprecipitation (ChIP) with DNA
microarrays (chips), has been widely used to create high-resolution, genome-wide maps of the in
vivo association of DNA sequences with regulatory proteins and post-translationally modified
histones. This chapter reviews the experimental procedures, data analysis and implications of
ChIP-chip studies.
I. Experimental procedures
Experimental overview:
There are two kinds of ChIP procedure: N-ChIP, which uses ‘native chromatin’ as input,
and X-ChIP, which uses ‘crosslinked chromatin’ as input. N-ChIP is used to study proteins that
associate strongly and directly with DNA, such as histones. For cofactors that associate with DNA
indirectly or transiently, a crosslinking step that fixes protein-DNA interactions is crucial for
pulling down DNA elements together with the protein of interest. X-ChIP is therefore applicable
to a wider range of protein-DNA interactions and is used more frequently than N-ChIP.
In X-ChIP, cells are first fixed with formaldehyde to crosslink DNA to any associated
proteins. The cells are then lysed and DNA is sheared into smaller fragments using sonication.
Protein-DNA complexes are then immunoprecipitated with an antibody directed against the
specific protein of interest. Following the immunoprecipitation, crosslinking is reversed, samples
are protease-treated and the purified DNA sample is amplified, fragmented and labeled to
hybridize onto genome tiling arrays. By comparing the hybridization signals generated by an
immunoprecipitated sample versus a non-specific antibody control or input DNA control, the
regions of chromatin-protein interaction can be identified.
1. Chromatin immunoprecipitation (ChIP)
Cells grown under the desired experimental condition are fixed with formaldehyde, which
forms heat-reversible DNA-protein crosslinks. After crosslinking, cells are lysed and sonicated
to fragment the chromatin to a size of 500 bp – 1 kb. Random shearing does not always
produce small-enough chromatin fragments at the regions of interest. For this reason and to
conduct experiments on fresh and frozen tissues, some researchers have preferred to make use of
‘native chromatin’. In these protocols, the chromatin is fractionated by incubation of purified
nuclei with micrococcal nuclease (MNase), an enzyme that cleaves preferentially the linker DNA
between the nucleosomes. Specifically, by performing partial digestions with MNase, it is
possible to obtain native chromatin fragments of on average one to five nucleosomes in length.
These oligonucleosome fragments are purified from the nuclei and are then used to perform ChIP.
The choice of native, MNase-fractionated chromatin as the input material for ChIP is
advantageous because the epitopes recognized by the antibody remain intact during
chromatin preparation, whereas formaldehyde also crosslinks proteins to one another, which may
mask the epitopes recognized by the antibody. As a consequence, native chromatin
tends to give higher levels of precipitation for a specific histone modification than
formaldehyde-crosslinked chromatin.
ChIP is performed by incubating fractionated chromatin (input chromatin) with an
antibody directed against a protein of interest. The antibody recognizes its target protein and
precipitates the protein-DNA complexes from solution. In this way, only DNA fragments
associated with the protein of interest are enriched, while DNA-protein complexes that are not
recognized by the antibody are washed away.
However, this method has its own limitations. In particular, it requires antibodies that are
specific enough for the interaction of interest and that can also tolerate stringent binding and
washing conditions. Biotinylation-mediated ChIP (bioChIP), a modified ChIP technique, was
therefore adapted to study proteins that are biotinylated in vivo. The interaction between biotin
and streptavidin is one of the strongest known noncovalent interactions. Streptavidin affinity
capture of biotinylated proteins permits more stringent washing conditions, resulting in lower
background noise than classic ChIP experiments mediated by protein-antibody interactions. Thus,
bioChIP not only circumvents issues related to antibody availability, but also generates highly
specific protein-DNA interaction data and provides a uniform platform for comparing the
association of various proteins with similar DNA elements. However, bioChIP requires the
simultaneous introduction into cells of a bacterial biotin ligase (BirA) and the protein of interest
fused to a biotin recognition signal. In addition, the exogenously introduced biotin-tagged
protein should not disrupt protein equilibria in vivo, so that protein-DNA interactions can be
studied in a physiological context in the living cell.
2. Amplification of ChIP’ed DNA.
Crosslinks in the ChIP’ed material are reversed by heat incubation, and the sample is treated
with proteinase K to digest proteins and free the DNA fragments. As the amount of ChIP’ed DNA
is usually small (in the range of a few nanograms), an amplification step is required for DNA
microarray-based detection. Three amplification methods can be used. Exponential amplification
methods, namely ligation-mediated PCR (LM-PCR) and random-primed PCR (whole-genome
amplification, WGA), have been the most commonly used. Linear amplification mediated by in
vitro transcription (IVT) is likely to give higher-fidelity results, but it involves multiple
DNA-to-RNA and RNA-to-DNA conversion steps and is rather laborious. In LM-PCR, ChIP’ed DNA
is ligated to a pair of linker oligonucleotides and amplified by PCR using primers specific for the
linker sequences. Low PCR cycle numbers (usually 19 – 25 cycles) should be used to avoid
saturating the signal from ChIP’ed DNA and over-amplifying non-specifically bound DNA.
Amplified DNA (1 – 4 micrograms) with a size of 500 – 1000 bp is then fragmented into smaller
pieces (50 – 100 bp) and labeled with appropriate dyes for microarray hybridization.
ChIP-chip experiments require a background control. Input chromatin that has not gone
through the ChIP procedure is subjected to crosslink reversal, and its DNA is isolated, amplified,
fragmented and labeled in the same manner as ChIP’ed DNA. Input DNA measures genomic
background, cell line variation and PCR amplification bias. In addition, a mock ChIP or a ChIP
using a non-specific antibody can serve as a control that measures non-specific protein-DNA
interactions.
3. Hybridization to the tiling arrays.
Affymetrix, NimbleGen Systems and Agilent Technologies have developed
oligonucleotide arrays that tile all of the nonrepetitive genomic sequences of human and other
eukaryotes and are commonly used.
Affymetrix tiling arrays contain 25-nt probes tiled every 35 bp of DNA sequence.
Affymetrix promoter tiling arrays cover the region from -8 kb to +2 kb relative to the
transcription start sites of annotated genes in the human or mouse genome, while the whole-genome
tiling arrays (2.0R) comprise 7 chips covering the entire repeat-masked human or mouse genome.
The probes are synthesized on the arrays in situ using photolithographic technology
and tend to have high error rates at the 3’ end. NimbleGen promoter and whole-genome arrays
consist of unique 50-mers at 100 bp resolution, with the probes synthesized
in situ using maskless, photo-mediated array synthesizer technology. Agilent arrays use a
maskless, industrial-scale inkjet printing process to synthesize oligonucleotide probes that are
usually 60 nt long and unique in the genome. NimbleGen and Agilent arrays contain probes of
good quality but are more expensive than Affymetrix arrays for whole-genome studies.
However, the maskless probe synthesis used by NimbleGen and Agilent
allows quick iteration of microarray designs in response to fast-paced content changes in the
continuously evolving genomics environment. This gives researchers easy access to
high-quality, custom-tailored arrays.
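To make the tiling geometry concrete, the following minimal Python sketch lists the probes obtained for a given probe length and spacing. The function name `tile_probes` and the toy sequence are illustrative assumptions; real array design also masks repeats and filters probes for uniqueness, which is not shown here.

```python
def tile_probes(sequence, probe_len=25, step=35):
    """Return (start, probe) pairs for probes placed every `step` bases.

    The defaults mirror the 25-nt probes at 35-bp spacing described above;
    pass probe_len=50, step=100 for the 50-mer design.
    """
    probes = []
    for start in range(0, len(sequence) - probe_len + 1, step):
        probes.append((start, sequence[start:start + probe_len]))
    return probes


# Toy 200-bp sequence (hypothetical, for illustration only).
toy = "ACGT" * 50
tiled = tile_probes(toy)
starts = [start for start, _ in tiled]
```

With a 25-nt probe every 35 bp, 10 bp between consecutive probes is not directly interrogated, which is one reason the analysis methods described later pool neighbouring probes in windows.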
For Affymetrix arrays, ChIP’ed and control DNA samples are biotinylated, hybridized to
two separate chips and processed individually. For NimbleGen and Agilent arrays, ChIP’ed and
control DNA are labeled with two different fluorophores, such as Cy5 and Cy3, and hybridized to
a single chip. The labeled DNA binds to its complementary probes on the chip, and the resulting
fluorescence is recorded by an image scanner. Signals from ChIP’ed DNA are then normalized to
those from the control DNA. As the sequence of each spot on a chip is known, the DNA sequences
bound by a particular protein are revealed in a high-throughput manner.
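As a rough illustration of normalizing ChIP signal to control signal, the sketch below computes a per-probe log2 ratio and then centers the ratios on their median. The pseudocount, the median centering and the function names are assumptions chosen for simplicity; they are not taken from any vendor's software.

```python
import math
from statistics import median


def log2_ratios(chip, control, pseudocount=1.0):
    """Per-probe log2(ChIP / control); the pseudocount avoids log(0)."""
    if len(chip) != len(control):
        raise ValueError("chip and control need one value per probe")
    return [math.log2((c + pseudocount) / (k + pseudocount))
            for c, k in zip(chip, control)]


def median_center(values):
    """Shift values so their median is zero (a simple global normalization)."""
    m = median(values)
    return [v - m for v in values]


# Hypothetical raw intensities for four probes.
chip_signal = [3.0, 7.0, 15.0, 1.0]
control_signal = [1.0, 1.0, 1.0, 1.0]
ratios = log2_ratios(chip_signal, control_signal)
centered = median_center(ratios)
```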
II. ChIP-chip data analysis
The combination of chromatin immunoprecipitation and whole-genome tiling microarrays
allows biologists to conduct unbiased genome-wide location analysis. However, it also
generates massive amounts of data, creating a need for effective and efficient analysis algorithms.
Affymetrix whole-genome tiling arrays are attractive to ambitious biologists because of their low
cost and dense genome coverage, but the resulting data are very noisy and complex because of
the lower quality of Affymetrix probes.
Many methods have been developed to identify enriched regions based on statistics that
compare ChIP array data with control samples. The tiling array software (TAS) developed by
Affymetrix applies the Mann–Whitney U test to the ranks of ChIP and control probe signals
within 1 kb sliding windows but does not consider the variability in probe behavior. Some
researchers have modeled probe behavior using pooled ChIP-chip data from multiple
laboratories and then inferred ChIP-enriched states through a hidden Markov model (HMM).
Another method calculates Welch’s t statistic for each probe by comparing ChIP and control
replicates, and then uses a running window average of the t statistics to identify ChIP regions.
This method becomes unreliable when there are only a few replicates from which to estimate
probe variance. TileMap improves on it with empirical Bayes shrinkage, which combines the
observed variance of each probe with the pooled variance of all of the probes on the array.
TiMAT first calculates an average fold change between ChIPs and controls for each probe, then
uses a sliding-window trimmed mean to find ChIP regions.
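The per-probe Welch’s t approach described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function names are hypothetical, and the window here is defined by probe index rather than genomic distance, which is a simplification.

```python
import math
from statistics import mean, variance


def welch_t(chip_reps, ctrl_reps):
    """Welch's t statistic for one probe from its replicate intensities.

    Needs at least two replicates per group; with so few values the
    variance estimates are unstable, which is the weakness noted above.
    """
    n1, n2 = len(chip_reps), len(ctrl_reps)
    se = math.sqrt(variance(chip_reps) / n1 + variance(ctrl_reps) / n2)
    return (mean(chip_reps) - mean(ctrl_reps)) / se


def running_mean(values, half_width):
    """Average each value with its neighbours within half_width positions."""
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half_width)
        hi = min(len(values), i + half_width + 1)
        smoothed.append(mean(values[lo:hi]))
    return smoothed


# Hypothetical log intensities: three ChIP and three control replicates.
t_one_probe = welch_t([1.0, 2.0, 3.0], [0.0, 1.0, 2.0])
smoothed = running_mean([1.0, 2.0, 3.0, 4.0, 5.0], half_width=1)
```

If all replicates of a probe happen to be identical in both groups, the standard error is zero and the division fails, which is an extreme case of the instability that shrinkage methods such as TileMap are designed to reduce.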
In this chapter, I will focus on Model-based Analysis of Tiling-arrays (MAT), a fast and
powerful algorithm for identifying regions enriched by ChIP-chip on Affymetrix tiling
arrays. This software was developed by the laboratory of Dr. X. Shirley Liu at the Dana-Farber
Cancer Institute.
1. Model-based Analysis of Tiling-arrays (MAT)
The MAT probe model relies heavily on the fact that most of the probes on the array
measure nonspecific hybridization (i.e. background noise). Instead of estimating probe
behavior from multiple samples, MAT uses a simple linear model to estimate the baseline probe
behavior from the 25-mer sequence and copy number of all probes on a single tiling
array. Using this baseline probe model, MAT can standardize the signal of each probe on
each array individually, filter out much of the noise and detect the true ChIP signals in the
data. In theory, MAT can identify DNA binding regions from a single ChIP sample; however,
multiple ChIP samples with controls increase accuracy. A strategy diagram of MAT is shown
here.
The variability of probes has to be considered during data analysis for the following
reasons. Because Affymetrix probes are synthesized in situ, error rates increase toward the 3’ end
of each probe. It has been estimated that only 10% of Affymetrix probes have exactly correct
sequences. In addition, Affymetrix probe sequences are not optimized, in contrast to NimbleGen
and Agilent probes. The content and position of G and C nucleotides cause variance in
hybridization affinity, resulting in different probe signals. The following plots show the effect of
adenine (A, on the left), cytosine (C, center) and guanine (G, right) at each probe nucleotide
position on probe intensity. Clearly, the A, C and G content of a probe and the positions of these
nucleotides affect probe-DNA hybridization and the resulting signal intensity.
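One common way to expose nucleotide content and position to a linear model is indicator (one-hot) encoding. The sketch below builds such features for a single probe. The function name, the feature names and the choice of T as the reference base are assumptions made for illustration; they are not taken from the MAT source code.

```python
import math

BASES = "ACGT"


def probe_features(probe, copy_number=1):
    """Encode a probe as position indicators, base counts and log copy number."""
    probe = probe.upper()
    if set(probe) - set(BASES):
        raise ValueError("probe must contain only A, C, G and T")
    features = {}
    # Position-specific indicators for A, C and G; T is the implicit reference.
    for pos, base in enumerate(probe, start=1):
        for k in "ACG":
            features[f"{k}{pos}"] = 1 if base == k else 0
    # Overall nucleotide counts and their squares.
    for k in BASES:
        n = probe.count(k)
        features[f"n{k}"] = n
        features[f"n{k}_sq"] = n * n
    # Number of times the probe sequence occurs in the genome (log scale).
    features["log_copy"] = math.log(copy_number)
    return features


# Toy 5-mer for readability; a real Affymetrix probe would be a 25-mer.
example = probe_features("ACGTA")
```

For a 25-mer this yields 75 position indicators plus the count terms, so the several million probes on one array provide far more observations than parameters to estimate.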
As the majority of probes measure primarily nonspecific hybridization and an Affymetrix
tiling array contains about 6 million 25-mer probes, MAT uses the following equation to model
probe sequence effect and predict baseline intensity of each probe.
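The equation itself does not appear in this copy of the text. For orientation, the probe model published in the MAT paper (reference 1 in the list at the end of this chapter) has the following form; the notation is reproduced from memory of that paper and should be checked against it before use.

```latex
\log(PM_i) = \alpha\, n_{iT}
  + \sum_{j=1}^{25} \sum_{k \in \{A,C,G\}} \beta_{jk}\, I_{ijk}
  + \sum_{k \in \{A,C,G,T\}} \gamma_k\, n_{ik}^{2}
  + \delta \log(c_i) + \varepsilon_i
```

Here PM_i is the intensity of probe i, n_{ik} is the number of occurrences of nucleotide k in the probe, I_{ijk} equals 1 if the nucleotide at position j of probe i is k and 0 otherwise, c_i is the number of times the 25-mer occurs in the genome, and ε_i is the error term. The coefficients α, β_{jk}, γ_k and δ are fitted to all probes on one array.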
MAT then divides the probes on the array into “affinity bins”, each containing a few
thousand probes with similar baseline intensity. MAT estimates the observed sample variance
within each affinity bin and uses it as the probe variance for each probe in the bin. Then, MAT
standardizes each probe on an array as follows:
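The standardization formula does not appear in this copy of the text. Based on the description above and on the MAT paper (reference 1), it can be written as follows; the symbol names are my own labels and should be checked against the paper.

```latex
t_i = \frac{\log(PM_i) - \hat{m}_i}{s_{\mathrm{bin}(i)}}
```

Here log(PM_i) is the observed log intensity of probe i, \hat{m}_i is the baseline intensity predicted by the probe model, and s_{bin(i)} is the standard deviation estimated for the affinity bin that contains probe i.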
The distribution of t values is approximately normal, and t values can be compared across
experiments without further normalization.
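The binning and standardization steps can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and the default `bin_size` are hypothetical, the baseline values are taken as already predicted by some probe model, and the bin standard deviation is computed from the observed log intensities in the bin, following the wording above.

```python
from statistics import stdev


def standardize_by_affinity_bin(log_pm, baseline, bin_size=3000):
    """Return t_i = (log_pm_i - baseline_i) / s_bin for every probe.

    Probes are sorted by predicted baseline and cut into consecutive bins
    of `bin_size`; s_bin is the sample standard deviation of the observed
    log intensities in the bin.
    """
    if len(log_pm) != len(baseline):
        raise ValueError("need one baseline value per probe")
    order = sorted(range(len(baseline)), key=lambda i: baseline[i])
    t_values = [0.0] * len(baseline)
    for start in range(0, len(order), bin_size):
        members = order[start:start + bin_size]
        if len(members) < 2:
            raise ValueError("each affinity bin needs at least two probes")
        s_bin = stdev([log_pm[i] for i in members])
        if s_bin == 0:
            raise ValueError("affinity bin has zero variance")
        for i in members:
            t_values[i] = (log_pm[i] - baseline[i]) / s_bin
    return t_values


# Four hypothetical probes split into two bins of two.
t_demo = standardize_by_affinity_bin(
    log_pm=[1.0, 3.0, 10.0, 14.0],
    baseline=[2.0, 2.0, 12.0, 12.0],
    bin_size=2,
)
```

Because every probe is scaled by the spread of probes with similar predicted affinity, bright and dim probes end up on a common scale.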
A sliding window is then applied, the length of which can be adjusted to the average size of the
sheared DNA fragments. Within each window, MAT removes the top and bottom 10% of the
t-values and computes the arithmetic mean of the remaining t-values (a trimmed mean). A
MATscore is then calculated for each sliding window and assigned to the probe at the center of
the window:
$$\mathrm{MATscore} = \sqrt{n_p}\,\mathrm{TM}$$
where TM is the trimmed mean and $n_p$ is the number of probes in the window used to calculate
the TM. MATscores are comparable across
regions and samples.
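A minimal sketch of the window scoring follows. It takes the score as the square root of the number of retained probes times the trimmed mean, the form given in the MAT paper (reference 1). The function names and the default `window_bp` are assumptions, and the quadratic loop is written for clarity rather than speed.

```python
import math


def trimmed_mean(values, trim=0.10):
    """Arithmetic mean after dropping the lowest and highest `trim` fraction.

    Returns the mean together with the number of values kept.
    """
    ordered = sorted(values)
    k = int(len(ordered) * trim)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept), len(kept)


def mat_scores(positions, t_values, window_bp=600):
    """Score a window centred on each probe as sqrt(n_p) * trimmed mean."""
    half = window_bp / 2
    scores = []
    for center in positions:
        in_window = [t for p, t in zip(positions, t_values)
                     if abs(p - center) <= half]
        tm, n_kept = trimmed_mean(in_window)
        scores.append(math.sqrt(n_kept) * tm)
    return scores


tm_demo = trimmed_mean([float(v) for v in range(10)])
score_demo = mat_scores([0, 35, 70], [1.0, 1.0, 1.0])
```

Multiplying by the square root of the probe count puts windows with different numbers of probes on a comparable scale, since the mean of n independent standardized values has standard deviation proportional to 1/sqrt(n).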
With multiple ChIP and control replicates, a MATscore is calculated for each
window by pooling all of the probes across all replicates and subtracting the MATscore of the
control replicates from the MATscore of the ChIP replicates. This process removes any
cell-specific variation that is not modeled in the probe behavior equation and increases the
confidence of ChIP peak predictions that are marginally enriched. More replicates and more
probes in the sliding window give higher confidence in the prediction. MAT also calculates a
P-value for each window and FDR q-values. The MAT software generates a bar file and a bed file,
which can be viewed in the Affymetrix Integrated Genome Browser (IGB). One example is shown
below.
The MATscore histogram and FDR chart indicate whether a ChIP-chip experiment has worked.
In the examples shown below, a good experiment shows a nearly normal distribution of
MATscores with a long tail of positive peaks, while a bad ChIP-chip experiment lacks the
extended tail of positive peaks and sometimes shows an even longer tail of negative peaks.
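The comparison of positive and negative tails can be turned into a rough number. The sketch below counts windows scoring above a cutoff and windows scoring below the mirrored negative cutoff, and reports their ratio as an empirical false discovery estimate. It assumes that background scores are symmetric about zero; it is a simplified illustration, not necessarily the exact procedure implemented in MAT, and the function name is hypothetical.

```python
def empirical_fdr(scores, cutoff):
    """Estimate FDR as (# scores <= -cutoff) / (# scores >= cutoff).

    Returns None when nothing passes the positive cutoff.
    """
    if cutoff <= 0:
        raise ValueError("cutoff must be positive")
    positives = sum(1 for s in scores if s >= cutoff)
    negatives = sum(1 for s in scores if s <= -cutoff)
    if positives == 0:
        return None
    return min(1.0, negatives / positives)


# Hypothetical window scores from a good and a poor experiment.
good = [5.2, 4.8, 6.1, 4.1, -4.3, 0.2, -0.5, 1.1]
poor = [4.5, -5.0, -6.0, 0.0]
fdr_good = empirical_fdr(good, cutoff=4.0)
fdr_poor = empirical_fdr(poor, cutoff=4.0)
```

A ratio close to 1 means the positive tail is no heavier than the negative tail, which matches the description of a failed experiment above.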
2. Model-based analysis for 2-color arrays (MA2C).
MA2C, a normalization method based on the GC content of probes, was developed for
two-color tiling arrays by Dr. Shirley Liu’s lab. MA2C takes GC content and dye biases into
account and normalizes probes by GC bins within each array. Peak detection in MA2C is
similar to that in MAT: both score sliding windows with a robust summary statistic (a trimmed
mean in MAT, as described above, and a median in MA2C). MA2C has been
implemented as a stand-alone Java program, which can display various plots of statistical
analysis for quality control.
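The idea of normalizing within GC bins can be illustrated with a deliberately simplified stand-in. The sketch below groups probes by their G+C count and standardizes the log ratios in each group using the median and the median absolute deviation. This is not MA2C's own estimator, which is described in reference 6; the function names and the robust scaling are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median


def gc_count(probe):
    """Number of G and C bases in a probe sequence."""
    probe = probe.upper()
    return probe.count("G") + probe.count("C")


def normalize_by_gc_bin(probes, log_ratios):
    """Center and scale log ratios separately within each GC-count bin."""
    bins = defaultdict(list)
    for idx, probe in enumerate(probes):
        bins[gc_count(probe)].append(idx)
    normalized = [0.0] * len(probes)
    for members in bins.values():
        values = [log_ratios[i] for i in members]
        center = median(values)
        mad = median([abs(v - center) for v in values])
        # 1.4826 * MAD approximates the standard deviation for normal data.
        scale = 1.4826 * mad if mad > 0 else 1.0
        for i in members:
            normalized[i] = (log_ratios[i] - center) / scale
    return normalized


# Hypothetical 4-mers: three with GC count 4, three with GC count 0.
demo_probes = ["GGCC", "GCGC", "GGGC", "AATT", "ATAT", "TTAA"]
demo_ratios = [1.0, 2.0, 3.0, -1.0, 0.0, 1.0]
demo_norm = normalize_by_gc_bin(demo_probes, demo_ratios)
```

After this step the GC-rich probes, which started with systematically higher ratios, are centered on zero just like the GC-poor probes.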
III. Implications of ChIP-chip Analysis
1. Identification of novel transcription factor binding sites.
2. Motif finding.
ChIP-chip experiments are far better suited than expression microarray experiments to
discovering conserved DNA motifs in co-regulated genes. Common programs such as MDscan and
BioProspector can be used to identify motifs.
(For details, please see the previous version of this chapter.)
3. Revealing transcription regulatory networks. (Please see the previous version of this
chapter.)
4. Revealing epigenetic modifications. ChIP-chip has been used to analyze epigenetic
features including histone modification, DNA methylation, nucleosome positions and
open chromatin regions revealed by DNase I hypersensitive sites.
IV. Concluding remarks
ChIP-chip technology has marked the beginning of an era of rapid progress in
high-throughput studies and has provided a large amount of valuable data on transcription
regulation and epigenomic features in various organisms. This technology continues to serve
biomedical research in unraveling the mysteries of development in normal and pathologic states.
However, performing ChIP-chip experiments is not trivial. It involves multiple experimental
procedures and requires optimization at each step. Although MAT and MA2C are excellent tools
for identifying positively enriched DNA regions, data analysis after peak finding remains
challenging: how to extract biological meaning, how to construct transcription regulatory
networks from massive amounts of protein-DNA interaction data, and how to connect ChIP-chip
data with microarray expression profiling. Lastly, as high-throughput sequencing technology
(Solexa) becomes available, ChIP-chip is likely to be replaced within a few years by ChIP
sequencing (ChIP-seq), which offers even greater genome coverage, resolution and efficiency in
revealing transcription factor binding sites and epigenomic features.
Sources
http://www.chiponchip.org/
http://liulab.dfci.harvard.edu/
http://ai.stanford.edu/~xsliu/BioProspector/
http://ai.stanford.edu/~xsliu/MDscan/
http://research.dfci.harvard.edu/brownlab/chipchip.html
Reference
1. Johnson, W.E., et al., Model-based analysis of tiling-arrays for ChIP-chip. Proc Natl Acad Sci
U S A, 2006. 103(33): p. 12457-62.
2. Li, W., C.A. Meyer, and X.S. Liu, A hidden Markov model for analyzing ChIP-chip
experiments on genome tiling arrays and its application to p53 binding sequences.
Bioinformatics, 2005. 21 Suppl 1: p. i274-82.
3. Horak, C.E. and M. Snyder, ChIP-chip: a genomic approach for identifying transcription
factor binding sites. Methods Enzymol, 2002. 350: p. 469-83.
4. Buck, M.J. and J.D. Lieb, ChIP-chip: considerations for the design, analysis, and application
of genome-wide chromatin immunoprecipitation experiments. Genomics, 2004. 83(3): p. 349-60.
5. Viens, A., Mechold, U., Lehrmann, H., Harel-Bellan, A and Ogryzko, V. Use of protein
biotinylation in vivo for chromatin immunoprecipitation. Analytical Biochemistry. 2003.
6. Song, J.S., Johnson, W.E., Zhu, X., Zhang, X., Li W., Manrai, A.K., Liu, J.S., Chen, R., Liu,
X.S. Model-based analysis of two-color arrays (MA2C). Genome Biol. 2007. 8(8): R178.
7. Carroll, J.S., Liu, X.S., Brown, M. et al. Chromosome-wide mapping of estrogen receptor
binding reveals long-range regulation requiring the forkhead protein FoxA1. Cell. 2005.
122(1):33-43.
8. Kim, J., Chu, J., Shen, X., Wang, J. and Orkin, S.H. An extended transcriptional network for
pluripotency of embryonic stem cells. Cell. 2008.132(6):1049-61.
9. Ozsolak, F., Song, J.S., Liu, X.S. and Fisher, D.E. High-throughput mapping of the chromatin
structure of human promoters. Nat Biotechnol. 2007. 25(2):244-8.
10. Kim, T.H., Barrera, L.O., Ren, B. et al. A high-resolution map of active promoters in the
human genome. Nature. 2005. 436(7052):876-80.
11. Mikkelsen, T.S., Ku, M., Bernstein, B.E. et al. Genome-wide maps of chromatin state in
pluripotent and lineage-committed cells. Nature. 2007. 448(7153):553-60.
12. Huebert, D.J., Bernstein, B.E. et al. Genome-wide maps of chromatin state in pluripotent and
lineage-committed cells. Nature. 2007. 448(7153):553-60.
13. Schones, D.E., Zhao, K. et al. Dynamic regulation of nucleosome positioning in the human
genome. Cell. 2008. 132(5):887-98.
14. Barski, A., Zhao, K. et al. High-resolution profiling of histone methylations in the human
genome. Cell. 2007. 129(4):823-37.
15. Scacheri, P.C., G.E. Crawford, and S. Davis, Statistics for ChIP-chip and DNase
hypersensitivity experiments on NimbleGen arrays. Methods Enzymol, 2006. 411: p. 270-82.