34 Overview of the Tools for Microarray Analysis
Transcription Profiling, DNA Chips, and Differential Display

Fig. 1. Polyacrylamide gel electrophoresis is used in differential display to identify fragments of regulated genes. Double-stranded complementary DNA is prepared from populations of RNA, then fragmented with restriction endonucleases. Linkers are attached to the ends of the fragments, then the fragments are amplified using the polymerase chain reaction (PCR). The PCR products are separated by polyacrylamide gel electrophoresis. Differentially expressed genes will yield restriction fragments of different intensity in the treated vs control samples. The primary treated and control samples give a complex mixture of unaffected and differentially expressed fragments. For this reason, fragments are often parsed or selectively amplified using one or more degenerate nucleotides at the 3'-position of one or both PCR primers to deconvolute the mixture of amplified species. Addition of a single degenerate nucleotide in one of the PCR primers results in a partial deconvolution of the sample, allowing differentially expressed fragments to be separated and excised more easily.

Fig. 2. Affymetrix uses several oligonucleotides to get broad coverage of individual genes. Each gene is represented by oligos designed to be a perfect match to the target sequence, as well as oligos designed to contain a single mismatch. Pairs of perfect match and single-nucleotide mismatch oligonucleotides are designed to different regions of each gene. Typically 16 or 20 oligo pairs are arrayed for each gene represented on the GeneChip®. The oligonucleotides are typically 25 bases in length.

Fig. 3. False color images demonstrate signal intensity on an Affymetrix GeneChip®. Signal intensity at each element is indicated by color, with lighter, hotter colors representing greater signal intensities.
Many single-channel oligonucleotide arrays, including Affymetrix arrays, include several oligonucleotides for each gene. For this reason, there tend to be clusters of elements with similar signal intensity, representing all the elements that comprise a single gene. In the expanded region of the array, individual perfect match/mismatch paired elements can be visualized. Perfect match/mismatch pairs for which the signal intensity of the perfect match oligo is greater than for the mismatch oligo (high average difference pairs) are used when comparing arrays to make calls regarding differential expression.

Fig. 4. Experiments using cDNA microarrays require two-channel hybridizations. Two-channel, or cDNA, arrays contain cloned or PCR-amplified fragments of genes. Two separate populations of RNA are labeled with different fluorescent dyes. The dye-labeled samples are applied to the array, and differential hybridization is measured by recording fluorescence in both channels at each element. Single-channel signal intensity at any element on a cDNA array may not accurately reflect the amount of the message present in the original biological sample relative to any other message. Instead, the ratio of Cy5- vs Cy3-labeled cDNA probe at each element contains information regarding differential expression.

Fig. 5. An expression histogram from a two-channel hybridization reveals how well the channels are balanced. An expression histogram is a convenient way to visualize differences in input material and/or labeling efficiency for a two-channel hybridization. After balancing the array data using a balance coefficient, the two lines should largely overlap. Balance coefficients can be determined in a number of ways. The simplest way to calculate a balance coefficient relies on a ratio of total average signal in both channels.

Fig. 6. False color images can be used to demonstrate signal intensity in both channels of a two-channel array hybridization.
Signal intensity at each element in either channel is indicated by color, with lighter, hotter colors representing greater signal intensities. After balancing, elements that give greater intensity in one channel than in the other are considered to represent differentially expressed genes.

Fig. 7. Unacceptable hybridization on a two-channel array is determined by careful examination of several parameters. In some cases, total average signal intensity, total average background, or both are very different between the two channels of a two-channel hybridization. If the background is uniform in each channel or if a gradient is present in both channels, it can often be corrected. If, as in the example shown, there is a gradient effect in one channel and not the other, it may be very difficult to achieve reliable expression data from the microarray.

Fig. 8. Imperfections and impurities that affect hybridization and therefore expression data may occur on microarrays. (A) Some microarrays may contain imperfections that occur during the fabrication process. (B) Elements can also be affected by impurities introduced at the time of the hybridization. The comet-like imperfection seen here is most likely due to a dust particle. (C) It is uncertain what caused the impurity on the lower portion of this array. It may be an artifact from the scanning process. Regardless of the cause, the affected elements may need to be discarded from further consideration.

Fig. 9. The TIGR Spotfinder tool allows the user to adjust the grid to be sure that an entire spot is included in the reference field. It is not uncommon for spots on a microarray to be off center. Spot finding software that uses a static grid may often cut off portions of spots, resulting in a loss of useful information. Additionally, badly off-center spots may result in a portion of one spot appearing in the grid of an adjacent spot, resulting in false-positive gene expression values.

Fig. 10.
Scatter plots can be generated from a two-channel hybridization or from any two single-channel hybridizations, after the data are balanced with respect to signal intensity. Signal intensity for each element from two oligo arrays, or from both channels of a cDNA array, is graphed on a logarithmic scale. Genes lying on the slanted line with a slope of one are not regulated. Those elements that demonstrate differences in signal intensity in the two channels (or on two different oligo arrays) after balancing may be differentially expressed. The differential expression value for the element marked in (A) may be reliable, even though it represents only a 1.6-fold difference in the signal intensities of the two samples. A similar differential expression value in (B) would almost certainly be meaningless.

Fig. 11. Confidence limits may be placed on microarray data. The lines parallel to the diagonal represent a reasonable cut-off based upon signal strength and differential expression. Elements with lower signal intensity must be highly differentially expressed to be considered reliable, owing to the increased error at the lower edge of the signal range. RNA quality, time since fabrication of the array, hybridization conditions, and numerous other factors can all affect the final outcome of a microarray experiment. Reproducibility and reliability of differential expression values are affected by these variables (see Fig. 9). Therefore, confidence limits should be adjusted to reflect better or worse hybridizations.

Fig. 12. The application of color gradients to differential expression values aids in rapid visualization of differential expression. The use of color gradients also permits the identification of genes that are differentially regulated by multiple treatments, and may also help identify experimental outliers. In the example shown, data from several microarrays hybridized with RNA from cells treated with vehicle (veh) and one of nine compounds (Tx1–9) are used.
This approach can be used with a simple spreadsheet. However, many software tools exist that facilitate this type of data visualization.

Fig. 13. Heat maps aid in the visualization of transcription profiling data. A heat map, such as this one produced by Spotfire™, may be used to visually identify broad patterns of gene expression. Typically, genes above and below user-defined thresholds are colored red (induction) and green (repression), respectively. Individual microarray experiments are organized on the X-axis, while individual genes are organized along the Y-axis. Several large groups of similar microarrays may be observed in this example. One might reasonably infer that similar gene expression patterns suggest similar mechanisms of action for the compounds used in these studies.

Fig. 14. Clustering of microarray data can be performed using a number of different techniques. Euclidean distance in n-dimensional space applies a variable derived from each microarray (a differential expression value or signal intensity) to every element present on the microarray. Each point on this graph represents an element on the microarray utilized. Elements in the same neighborhood are circled, representing six gene clusters.

Fig. 15. The European Bioinformatics Institute’s Expression Profiler tool allows users to cluster microarray data from multiple experiments. Genes whose expression patterns most closely match one another across many experiments cluster more closely than genes whose expression patterns are not closely matched. A dendrogram is generated for each element, with branch points representing clusters.

Fig. 16. Self-organizing maps can be used to cluster microarray data into groups of similarly regulated genes. Millennium Pharmaceuticals and others provide tools that enable scientists to generate self-organizing maps from their TxP data.

Fig. 17. Partek Pro 2000 offers scientists multiple data analysis options.
Principal component analysis, multidimensional scaling, and inferential analysis can all be performed by tools such as Partek Pro. These higher-order statistical analyses of microarray results are vital to avoid false positives and false negatives. The costs of following up on false results can be staggering. As transcription profiling finds application in medical diagnostics, false results may even endanger patients.

Fig. 18. Correlation analysis can be used to identify genes whose expression correlates with a specific phenotype of interest. In the experiment shown, 12 rats were randomly assigned to one of three diet groups, A, B, or C, for 4 wk. Fasted serum triglycerides were measured in individual rats prior to sacrifice. Liver RNA was used to prepare probes for hybridization onto microarrays. The triglyceride phenotype was used to calculate a balanced differential triglyceride (BDTG) value, using the same equation used to determine differential expression values for the array elements. The BDTG value was then compared to each element. In this example, the expression of gene 1 correlates well with the measured phenotype.

Table 1
Differential Expression

Gene ID   Control (Cy3)   Treated (Cy5)   bCy5   Ratio   Log ratio   BDE     fBDE
a         500             800             1000   2.00     0.30        2.00    1.00
b         500             200             250    0.50    -0.30       -2.00   -1.00
c         500             100             125    0.25    -0.60       -4.00   -3.00
d         500             80              100    0.20    -0.70       -5.00   -4.00
e         500             2000            2500   5.00     0.70        5.00    4.00
f         500             404             505    1.01     0.00        1.01    0.01
g         505             400             500    0.99     0.00       -1.01   -0.01
h         250             600             750    3.00     0.48        3.00    2.00
i         2500            3800            4750   1.90     0.28        1.90    0.90

Table 2
Clustering Expression Data

ID        Array 1   Array 2
Gene 1    1200      700
Gene 2    1050      750
Gene 3    990       730
Gene 4    700       500
Gene 5    690       1200
Gene 6    660       120
Gene 7    650       450
Gene 8    640       1090
Gene 9    630       1140
Gene 10   600       470
Gene 11   540       140
Gene 12   420       480
Gene 13   380       490
Gene 14   260       220
Gene 15   200       180
Gene 16   190       250
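
The columns of Table 1 can be reproduced with a few lines of arithmetic. The following Python sketch is illustrative only, and its rules are inferred from the tabulated values rather than stated in the captions. It assumes that bCy5 is the Cy5 signal multiplied by a balance coefficient of 1.25 (the value that maps every Cy5 entry onto its bCy5 entry), that the ratio is bCy5/Cy3, that the log ratio is base 10, that BDE is the ratio itself when it is at least 1 and the negative reciprocal otherwise, and that fBDE shifts BDE toward zero by 1. The function and variable names are hypothetical.

```python
import math

# Table 1 inputs: raw control (Cy3) and treated (Cy5) signal for genes a-i.
GENES = "abcdefghi"
CY3 = [500, 500, 500, 500, 500, 500, 505, 250, 2500]
CY5 = [800, 200, 100, 80, 2000, 404, 400, 600, 3800]

# Assumption: this coefficient is inferred from the table (bCy5 / Cy5 = 1.25
# in every row); the text does not state how it was obtained.
BALANCE = 1.25


def differential_expression(cy3, cy5, balance):
    """Return the derived Table 1 columns for one array element."""
    b_cy5 = cy5 * balance                 # balanced Cy5 signal
    ratio = b_cy5 / cy3                   # treated vs control
    log_ratio = math.log10(ratio)         # symmetric around 0
    # Inferred rule: repression is reported as a negative fold change.
    bde = ratio if ratio >= 1 else -1.0 / ratio
    # Inferred rule: shift toward zero so that "no change" maps to 0.
    fbde = bde - 1 if bde > 0 else bde + 1
    return {"bCy5": b_cy5, "ratio": ratio, "log_ratio": log_ratio,
            "BDE": bde, "fBDE": fbde}


if __name__ == "__main__":
    print("ID  Cy3   Cy5   bCy5  Ratio  Log    BDE    fBDE")
    for gene, cy3, cy5 in zip(GENES, CY3, CY5):
        r = differential_expression(cy3, cy5, BALANCE)
        print(f"{gene}   {cy3:<5} {cy5:<5} {r['bCy5']:<5.0f} "
              f"{r['ratio']:<6.2f} {r['log_ratio']:<6.2f} "
              f"{r['BDE']:<6.2f} {r['fBDE']:.2f}")
```

Note that genes f and g differ by about 1% in opposite directions and receive BDE values of 1.01 and -1.01, whereas their ratios (1.01 and 0.99) are not symmetric around 1. This symmetry is one plausible reason for reporting a signed fold change alongside the raw ratio.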