ChIP-chip and Genome Tiling Microarrays

Index
Summary
Experimental procedures of ChIP-chip
ChIP-chip data analysis
Implications of ChIP-chip
Concluding remarks
Sources
Recommended reading

Summary

Interactions between proteins and DNA are fundamental to many biological processes, including transcription, DNA modification, replication and repair, nucleosome assembly and chromatin packing. These processes are involved not only in cell survival and proliferation but also in maintaining cell identity and mediating distinct physiological functions. DNA-binding proteins include transcription factors, which directly modulate transcription; nucleases, which cleave DNA molecules; and histones, which package DNA into higher-order chromatin structure. Many cofactor proteins that associate with DNA indirectly also play important roles in modulating the activity of the transcription factors they interact with. Comprehensive knowledge of where proteins and their regulatory cofactors interact with the genome would greatly facilitate our understanding of the mechanism and logic of cellular events. ChIP-chip, a technique that combines chromatin immunoprecipitation (ChIP) with DNA microarrays (chips), has been widely used to create high-resolution, genome-wide maps of the in vivo association of DNA sequences with regulatory proteins and post-translationally modified histones. This chapter reviews the experimental procedures, data analysis and implications of ChIP-chip studies.

I. Experimental procedures

Experimental overview: There are two kinds of ChIP procedure: N-ChIP, which uses ‘native chromatin’ as input, and X-ChIP, which uses ‘crosslinked chromatin’. N-ChIP is used to study proteins that associate strongly and directly with DNA, such as histones. For cofactors that associate with DNA indirectly or transiently, a crosslinking step that fixes protein-DNA interactions is crucial for pulling down the DNA elements bound by the protein of interest.
Compared with N-ChIP, X-ChIP is used more frequently because it is applicable to any protein-DNA interaction. In X-ChIP, cells are first fixed with formaldehyde to crosslink DNA to any associated proteins. The cells are then lysed and the DNA is sheared into smaller fragments by sonication. Protein-DNA complexes are immunoprecipitated with an antibody directed against the protein of interest. After immunoprecipitation, the crosslinks are reversed, the samples are protease-treated, and the purified DNA is amplified, fragmented, labeled and hybridized to genome tiling arrays. Comparing the hybridization signals of the immunoprecipitated sample with those of a non-specific antibody control or an input DNA control identifies the regions of chromatin-protein interaction.

1. Chromatin immunoprecipitation (ChIP)

Cells grown under the desired experimental condition are fixed with formaldehyde, which forms heat-reversible DNA-protein crosslinks. After crosslinking, the cells are lysed and sonicated to fragment the chromatin to a size of 500 bp to 1 kb. Random shearing does not always produce sufficiently small chromatin fragments at the regions of interest. For this reason, and in order to work with fresh and frozen tissues, some researchers prefer ‘native chromatin’. In these protocols, the chromatin is fractionated by incubating purified nuclei with micrococcal nuclease (MNase), an enzyme that preferentially cleaves the linker DNA between nucleosomes. Partial digestion with MNase yields native chromatin fragments that are, on average, one to five nucleosomes long. These oligonucleosome fragments are purified from the nuclei and then used for ChIP.
Native, MNase-fractionated chromatin is an advantageous input material for ChIP because the epitopes recognized by the antibody remain intact during chromatin preparation, whereas formaldehyde also crosslinks proteins to one another, which may mask those epitopes. As a consequence, native chromatin tends to give higher levels of precipitation for a specific histone modification than formaldehyde-crosslinked chromatin.

ChIP is performed by incubating fractionated chromatin (input chromatin) with an antibody directed against the protein of interest. The antibody recognizes its target protein and precipitates the protein-DNA complex from solution. In this way, only DNA fragments associated with the protein of interest are enriched, while DNA-protein complexes that are not recognized by the antibody are washed away. However, this method has its own limitations. In particular, it requires antibodies that are specific enough for the interaction and that can also tolerate stringent binding and washing conditions.

To address this, biotinylation-mediated ChIP (bioChIP), a modified ChIP technique, has been adapted to study proteins that are biotinylated in vivo. The interaction between biotin and streptavidin is one of the strongest known noncovalent interactions. Streptavidin affinity capture of biotinylated proteins permits more stringent washing conditions, resulting in lower background noise than classic antibody-mediated ChIP experiments. Thus, bioChIP not only circumvents issues of antibody availability, but also generates highly specific protein-DNA interaction data and provides a uniform platform for comparing the association of different proteins with similar DNA elements. However, bioChIP requires that a bacterial biotin ligase (BirA) and the protein of interest, tagged with a biotin recognition signal, be introduced into cells together.
In addition, the exogenously introduced biotin-tagged protein must not disrupt protein equilibria in vivo if protein-DNA interactions are to be studied in a physiological context in the living cell.

2. Amplification of ChIP’ed DNA.

Crosslinks in the ChIP’ed material are reversed by heat incubation, and the material is treated with Proteinase K to digest proteins and free the DNA fragments. Because the amount of ChIP’ed DNA is usually small (in the range of a few nanograms), an amplification step is required for DNA microarray-based detection. Three amplification methods can be used. Exponential amplification methods such as ligation-mediated PCR (LM-PCR) and random-primed PCR (whole-genome amplification, WGA) have been used most commonly. Linear amplification by in vitro transcription (IVT) is likely to give higher-fidelity results, but it involves multiple DNA-to-RNA and RNA-to-DNA conversion steps and is rather laborious. In LM-PCR, ChIP’ed DNA is ligated to a pair of linker DNAs and amplified by PCR using primers specific for the linker sequences. Low PCR cycle numbers (usually 19 to 25 cycles) should be used to avoid saturating the signal from ChIP’ed DNA and over-amplifying non-specifically bound DNA. Amplified DNA (1 to 4 micrograms), 500 to 1000 bp in size, is then fragmented into even smaller pieces (50 to 100 bp) and labeled with appropriate dyes for microarray hybridization.

ChIP-chip experiments require a background control. Input chromatin that has not gone through the ChIP procedure is de-crosslinked, and its DNA is isolated, amplified, fragmented and labeled in the same manner as ChIP’ed DNA. Input DNA measures genome background, cell line variation and PCR amplification bias. In addition, a mock ChIP or a ChIP with a non-specific antibody can serve as a control that measures non-specific protein-DNA interactions.

3. Hybridization to the tiling arrays.
Affymetrix, NimbleGen Systems and Agilent Technologies have developed commonly used oligonucleotide arrays that tile all of the nonrepetitive genomic sequence of human and other eukaryotes. Affymetrix tiling arrays contain 25-nt probes tiled every 35 bp of DNA sequence. Affymetrix promoter tiling arrays cover the region from -8 kb to +2 kb relative to the transcription start sites of annotated genes in the human or mouse genome, while the whole-genome tiling arrays (2.0R) comprise 7 chips covering the entire repeat-masked human or mouse genome. The probes are synthesized on the arrays in situ using photolithographic technology and usually have high error rates at the 3’ end. NimbleGen promoter and whole-genome arrays consist of unique 50-mers at 100 bp resolution, with the probes synthesized in situ using maskless, photo-mediated array synthesizer technology. Agilent arrays use a maskless, industrial-scale inkjet printing process to synthesize oligonucleotide probes that are usually 60-mers and unique in the genome. NimbleGen and Agilent arrays contain good-quality probes but are more expensive than Affymetrix arrays for whole-genome studies. However, the maskless probe synthesis used by NimbleGen and Agilent allows microarray designs to be revised quickly as genomic content changes, giving researchers easy access to high-quality, custom-tailored arrays.

For Affymetrix arrays, ChIP’ed and control DNA samples are biotinylated, hybridized to two separate chips and processed individually. For NimbleGen and Agilent arrays, ChIP’ed and control DNA are labeled with two different fluors, such as Cy5 and Cy3, and hybridized to one chip. The labeled DNA binds to its complementary DNA probes on the chip, and the resulting fluorescence is recorded by an image scanner. Signals from ChIP’ed DNA are then normalized to the control DNA.
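For two-color arrays, normalizing the ChIP signal to the control signal is commonly expressed as a per-probe log ratio. The sketch below is illustrative only: the function name, the pseudocount and the median-centering step are assumptions rather than any vendor's pipeline, and the intensities are made-up values.

```python
import math
from statistics import median


def log2_ratios(chip, control, pseudocount=1.0):
    """Return median-centered log2(ChIP / control) ratios, one per probe."""
    if len(chip) != len(control):
        raise ValueError("chip and control must have the same number of probes")
    # The pseudocount (an assumption) avoids taking the log of zero.
    ratios = [math.log2((c + pseudocount) / (k + pseudocount))
              for c, k in zip(chip, control)]
    # Centering on the median assumes most probes are not enriched.
    center = median(ratios)
    return [r - center for r in ratios]


# Hypothetical intensities for five probes (e.g. Cy5 = ChIP, Cy3 = control).
chip_signal = [120.0, 135.0, 980.0, 110.0, 125.0]
control_signal = [100.0, 110.0, 105.0, 95.0, 100.0]
normalized = log2_ratios(chip_signal, control_signal)
```

In this toy input only the third probe stands well above the others after normalization, which is the pattern expected for a probe inside an enriched region.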
Because the sequence of each spot on a chip is known, the DNA sequences bound by a particular protein are revealed in a high-throughput manner.

II. ChIP-chip data analysis.

Combining chromatin immunoprecipitation with whole-genome tiling microarrays allows biologists to conduct unbiased genome-wide location analysis. However, it also generates massive amounts of data, creating a need for effective and efficient analysis algorithms. Affymetrix whole-genome tiling arrays are attractive because of their low cost and dense genome coverage, but the resulting data are noisy and complex because of the lower quality of Affymetrix probes.

Many methods have been developed to identify enriched regions using statistics that compare ChIP array data with control samples. The Tiling Analysis Software (TAS) developed by Affymetrix applies the Mann–Whitney U test to the ranks of ChIP and control probe signals within 1 kb sliding windows, but it does not consider the variability in probe behavior. Some researchers have modeled probe behavior using pooled ChIP-chip data from multiple laboratories and then inferred ChIP-enriched states with a hidden Markov model (HMM). Another method calculates Welch’s t statistic for each probe by comparing ChIP and control replicates, then uses a running-window average of the t statistics to identify ChIP regions. This method becomes unreliable when there are too few replicates to estimate probe variance. TileMap improves on it with an empirical Bayes shrinkage estimate that weights the observed variance of each probe against the pooled variance of all probes on the array. TiMAT first calculates an average fold change between ChIPs and controls for each probe, then uses a sliding-window trimmed mean to find ChIP regions. This chapter focuses on Model-based Analysis of Tiling-arrays (MAT), a fast and powerful algorithm for identifying regions enriched by ChIP-chip on Affymetrix tiling arrays.
MAT was developed by the laboratory of Dr. X. Shirley Liu at the Dana-Farber Cancer Institute.

1. Model-based Analysis of Tiling-arrays (MAT)

The MAT probe model relies heavily on the fact that most of the probes on the array measure nonspecific hybridization (i.e. background noise). Instead of estimating probe behavior from multiple samples, MAT uses a simple linear model to estimate baseline probe behavior from the 25-mer sequence and genomic copy number of every probe on a single tiling array. With this baseline probe model, MAT can standardize the signal of each probe on each array individually, filter out much of the noise in the data and detect the true ChIP signals. In theory, MAT can identify DNA binding regions from a single ChIP sample; however, multiple ChIP samples with controls increase accuracy. A strategy diagram of MAT is shown here.

Probe variability has to be considered during data analysis for the following reasons. Because Affymetrix probes are synthesized in situ, error rates increase towards the 3’ end of the probe. It has been estimated that only 10% of Affymetrix probes have exactly the correct sequence. In addition, Affymetrix probe sequences are not optimized, unlike NimbleGen and Agilent probes. The content and position of G and C nucleotides cause variation in hybridization affinity, resulting in different probe signals. The following plots show the effect of adenine (A, on the left), cytosine (C, center) and guanine (G, right) at each probe nucleotide position on probe intensity. Clearly, nucleotide content and position affect probe-DNA hybridization and the resulting signal intensity. Because the majority of probes measure primarily nonspecific hybridization and an Affymetrix tiling array contains about 6 million 25-mer probes, MAT can fit a linear model of the probe sequence effect and use it to predict the baseline intensity of each probe.
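As a hedged sketch of the model's form, following the MAT paper (reference 1): the symbol names below are notational assumptions and should be checked against that paper before use.

```latex
\log(PM_i) = \alpha\, n_{iT}
  + \sum_{j=1}^{25} \sum_{k \in \{A,C,G\}} \beta_{jk}\, I_{ijk}
  + \sum_{k \in \{A,C,G,T\}} \gamma_k\, n_{ik}^{2}
  + \delta \log(c_i) + \varepsilon_i
```

Here PM_i is the measured intensity of probe i, n_{ik} is the number of occurrences of nucleotide k in the probe, I_{ijk} equals 1 if the nucleotide at position j of probe i is k and 0 otherwise, c_i is the number of times the probe sequence occurs in the genome, and \varepsilon_i is the probe-specific error term. Using T as the reference nucleotide is why position-specific effects appear only for A, C and G.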
MAT then divides the probes on the array into “affinity bins”, each containing a few thousand probes with similar baseline intensity. MAT estimates the observed sample variance within each affinity bin and uses it as the probe variance for every probe in the bin. Then, MAT standardizes each probe on an array as follows:

t_i = (log(PM_i) - m_i) / s_i

where PM_i is the measured intensity of probe i, m_i is its baseline intensity predicted by the model and s_i is the standard deviation of its affinity bin. The distribution of t values is approximately normal, and t values can be compared across experiments without further normalization.

The length of the sliding window can be adjusted to the average size of the sheared DNA fragments. Within each window, MAT removes the top and bottom 10% of the t values and computes the mean of the remaining t values (the trimmed mean). A MATscore is then calculated for each sliding window and assigned to the probe at the center of the window:

MATscore = sqrt(n_p) * TM

where TM is the trimmed mean and n_p is the number of probes in the window used to calculate the TM. MATscores are comparable across regions and samples. With multiple ChIP and control replicates, a MATscore is calculated for each window by pooling all of the probes across all replicates and subtracting the MATscore of the control replicates from the MATscore of the ChIP replicates. This removes cell-specific variation that is not captured by the probe behavior model and increases confidence in ChIP peak predictions that are only marginally enriched. More replicates and more probes in the sliding window give higher confidence in the prediction. MAT also calculates a P-value for each window and FDR q-values.

MAT generates a bar file and a bed file, which can be viewed in the Affymetrix Integrated Genome Browser (IGB). One example is shown below. The MATscore histogram and the FDR chart indicate whether a ChIP-chip experiment has worked.
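The standardization and window scoring described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the MAT implementation: the function names, the tiny bin size, the fallback for degenerate bins, the reading of n_p as the number of probes in the window, and all input values are hypothetical.

```python
import math
from statistics import mean, stdev


def standardize(log_pm, baseline, bin_size=3):
    """Return t-values: (log PM - baseline) / SD of the probe's affinity bin."""
    # Sort probe indices by predicted baseline so each bin holds similar probes.
    order = sorted(range(len(log_pm)), key=lambda i: baseline[i])
    t = [0.0] * len(log_pm)
    for start in range(0, len(order), bin_size):
        members = order[start:start + bin_size]
        observed = [log_pm[i] for i in members]
        # Assumption: use SD = 1 for degenerate bins (one probe or no spread).
        sd = stdev(observed) if len(observed) > 1 else 0.0
        if sd == 0.0:
            sd = 1.0
        for i in members:
            t[i] = (log_pm[i] - baseline[i]) / sd
    return t


def trimmed_mean(values, trim=0.10):
    """Arithmetic mean after dropping the top and bottom `trim` fraction."""
    ordered = sorted(values)
    k = int(len(ordered) * trim)
    return mean(ordered[k:len(ordered) - k])


def mat_scores(positions, t_values, window_bp=600):
    """Score per probe: sqrt(n_p) * trimmed mean of t in the centered window."""
    half = window_bp / 2
    scores = []
    for center in positions:
        window = [t for p, t in zip(positions, t_values)
                  if abs(p - center) <= half]
        # Assumption: n_p is the number of probes falling in the window.
        scores.append(math.sqrt(len(window)) * trimmed_mean(window))
    return scores


# Hypothetical probes spaced 35 bp apart, with log intensities and baselines.
positions = [0, 35, 70, 105, 140, 175]
log_pm = [1.0, 2.0, 3.0, 4.0, 5.0, 9.0]
baseline = [2.0, 2.0, 2.0, 5.0, 5.0, 5.0]
t_values = standardize(log_pm, baseline)
scores = mat_scores(positions, t_values, window_bp=70)
```

A real array has millions of probes, so an actual implementation would scan sorted positions with a moving window rather than rescanning every probe for each center, and would use bins of a few thousand probes as the text describes.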
For the examples shown below, a good experiment shows a nearly normal distribution of MATscores apart from a long tail of positive peaks, whereas a failed ChIP-chip experiment lacks the extended positive tail and sometimes shows an even longer tail of negative peaks.

2. Model-based analysis for 2-color arrays (MA2C).

MA2C, a normalization method based on the GC content of probes, was developed for two-color tiling arrays by Dr. Liu’s lab. MA2C takes GC content and dye bias into account and normalizes probes by GC bin within each array. Peak detection in MA2C is similar to that in MAT: both score sliding windows along the genome to find peaks. MA2C has been implemented as a stand-alone Java program that can display various statistical plots for quality control.

III. Implications of ChIP-chip Analysis

1. Identification of novel transcription factor binding sites.

2. Motif finding. ChIP-chip experiments are far better suited than expression microarray experiments to discovering conserved DNA motifs in co-regulated genes. Common programs such as MDscan and BioProspector can be used to identify motifs. (For details, please see the previous version of this chapter.)

3. Revealing transcription regulatory networks. (Please see the previous version of this chapter.)

4. Revealing epigenetic modifications. ChIP-chip has been used to analyze epigenetic features including histone modifications, DNA methylation, nucleosome positions and the open chromatin regions revealed by DNase I hypersensitive sites.

IV. Concluding remarks

ChIP-chip technology has marked the beginning of an era of rapid progress in high-throughput studies and has contributed a large body of valuable data on transcriptional regulation and the epigenome in various organisms. The technology continues to help biomedical researchers unravel the mysteries of development in normal and pathologic states. However, performing ChIP-chip experiments is not trivial.
It involves multiple experimental procedures and requires optimization at each step. Although MAT and MA2C are excellent tools for identifying positively enriched DNA regions, the analysis that follows peak finding remains challenging: how to extract biological meaning, how to construct transcription regulatory networks from massive amounts of protein-DNA interaction data, and how to connect ChIP-chip data with microarray expression profiling. Lastly, as high-throughput sequencing technology (Solexa) becomes available, ChIP-chip is likely to be replaced within a few years by ChIP sequencing (ChIP-seq), which offers even greater genome coverage, resolution and efficiency in revealing transcription factor binding sites and epigenomic features.

Sources
http://www.chiponchip.org/
http://liulab.dfci.harvard.edu/
http://ai.stanford.edu/~xsliu/BioProspector/
http://ai.stanford.edu/~xsliu/MDscan/
http://research.dfci.harvard.edu/brownlab/chipchip.html

References
1. Johnson, W.E., et al. Model-based analysis of tiling-arrays for ChIP-chip. Proc Natl Acad Sci U S A. 2006. 103(33):12457-62.
2. Li, W., Meyer, C.A. and Liu, X.S. A hidden Markov model for analyzing ChIP-chip experiments on genome tiling arrays and its application to p53 binding sequences. Bioinformatics. 2005. 21 Suppl 1:i274-82.
3. Horak, C.E. and Snyder, M. ChIP-chip: a genomic approach for identifying transcription factor binding sites. Methods Enzymol. 2002. 350:469-83.
4. Buck, M.J. and Lieb, J.D. ChIP-chip: considerations for the design, analysis, and application of genome-wide chromatin immunoprecipitation experiments. Genomics. 2004. 83(3):349-60.
5. Viens, A., Mechold, U., Lehrmann, H., Harel-Bellan, A. and Ogryzko, V. Use of protein biotinylation in vivo for chromatin immunoprecipitation. Analytical Biochemistry. 2003.
6. Song, J.S., Johnson, W.E., Zhu, X., Zhang, X., Li, W., Manrai, A.K., Liu, J.S., Chen, R., Liu, X.S.
Model-based analysis of two-color arrays (MA2C). Genome Biol. 2007. 8(8):R178.
7. Carroll, J.S., Liu, X.S., Brown, M. et al. Chromosome-wide mapping of estrogen receptor binding reveals long-range regulation requiring the forkhead protein FoxA1. Cell. 2005. 122(1):33-43.
8. Kim, J., Chu, J., Shen, X., Wang, J. and Orkin, S.H. An extended transcriptional network for pluripotency of embryonic stem cells. Cell. 2008. 132(6):1049-61.
9. Ozsolak, F., Song, J.S., Liu, X.S. and Fisher, D.E. High-throughput mapping of the chromatin structure of human promoters. Nat Biotechnol. 2007. 25(2):244-8.
10. Kim, T.H., Barrera, L.O., Ren, B. et al. A high-resolution map of active promoters in the human genome. Nature. 2005. 436(7052):876-80.
11. Mikkelsen, T.S., Ku, M., Bernstein, B.E. et al. Genome-wide maps of chromatin state in pluripotent and lineage-committed cells. Nature. 2007. 448(7153):553-60.
12. Huebert, D.J., Bernstein, B.E. et al. Genome-wide maps of chromatin state in pluripotent and lineage-committed cells. Nature. 2007. 448(7153):553-60.
13. Schones, D.E., Zhao, K. et al. Dynamic regulation of nucleosome positioning in the human genome. Cell. 2008. 132(5):887-98.
14. Barski, A., Zhao, K. et al. High-resolution profiling of histone methylations in the human genome. Cell. 2007. 129(4):823-37.
15. Scacheri, P.C., Crawford, G.E. and Davis, S. Statistics for ChIP-chip and DNase hypersensitivity experiments on NimbleGen arrays. Methods Enzymol. 2006. 411:270-82.