A Tutorial on Cross-Phenotype Normalization by Iterative Nonlinear Regression

1. Introduction

The cross-phenotype normalization (CPN) method by iterative nonlinear regression (INR) has been implemented in C/C++ and integrated into the DNA-Chip Analyzer (dChip) software (Li and Wong, 2003) for normalization of microarray data. Fig. T-1 shows a snapshot of the CPN-INR method and the dChip software working together for microarray data analysis. dChip is a software package that implements model-based expression analysis of oligonucleotide arrays (Li and Wong, 2001). It also implements many low-level data analysis routines (e.g., data I/O and gene filtering) and several high-level analysis procedures (e.g., clustering and classification). Normalization by invariant ranking (IR) (Schadt et al., 2001) is also implemented as one of the low-level routines, which is of particular interest to us for comparing our CPN-INR method (Xuan et al., 2005) with IR. The dChip program is a single executable running on Windows, under which it was originally developed in Wong’s Lab at Harvard University.

Fig. T-1. A snapshot of the INR method and dChip for data analysis.

The novel feature of our CPN-INR method is that it unifies the tasks of estimating the normalization regression function and identifying invariantly expressed genes (IEGs) for normalization. An iterative IEG selection algorithm was developed to couple the two processes so that regression function estimation and IEG selection bootstrap one another (Wang et al., 2002; Xuan et al., 2005). Fig. T-2 shows an example of the iterative IEG selection process in action for normalization.

In this tutorial, we briefly describe the CPN-INR algorithm and use an example to illustrate the usage of this utility. In particular, Section 2 summarizes the block diagram of the CPN-INR algorithm.
In Section 3, we describe the procedure for using INR to normalize a microarray data set. Users are strongly encouraged to follow the steps described in the procedure to explore the functionalities of CPN-INR.

Fig. T-2. Iterative IEG selection; selected IEGs are in red. [Scatter-plot panels; axis labels: floating array, baseline array.]

2. A Brief Description of the CPN-INR Method

As shown in Fig. T-3, the CPN-INR algorithm has been implemented as a two-step boosting approach that iterates between (1) within-phenotype normalization and (2) between-phenotype normalization. In the first step (within-phenotype normalization), the INR method is applied to all the arrays within a phenotype. In the second step (between-phenotype normalization), a within-phenotype “mean profile” is first generated from the normalized arrays for each phenotype. The CPN-INR method is then applied to these within-phenotype “mean profiles” to select the IEGs and estimate the regression function. The estimated regression function is then applied to all the arrays in the “floating” phenotype for normalization. This process cycles back and forth until the following two conditions are met: (1) the selected IEGs differ very little from the previously selected IEGs, and (2) the estimated regression function approaches the 45-degree line in the scatter plot.

Fig. T-3. Block diagram of the CPN-INR algorithm. [Diagram boxes. For each phenotype (A, floating; B, baseline): within-phenotype normalization; sector-shaped IEG selection; normalization by nonlinear regression based on selected IEGs; convergence check (repeat if not converged); normalization by nonlinear regression based on final selected IEGs; generation of the within-phenotype “mean” profile. Cross-phenotype normalization: sector-shaped IEG selection; normalization by nonlinear regression based on selected IEGs; convergence check; estimation of the nonlinear regression function based on final selected IEGs; application of the estimated regression function to all the arrays in the “floating” phenotype; overall convergence check.]

In practice, the CPN-INR method has been implemented so that normalization can be carried out either at the probe level, for all the probes on oligonucleotide arrays, or at the gene level, for all the genes on cDNA arrays. When carried out at the probe level, only perfect match (PM) probes are used to select IEGs (or, to be exact, invariantly expressed probes (IEPs)). Note that this is consistent with the implementation of the invariant ranking (IR) method (Schadt et al., 2001), but different from Bolstad’s implementation, where both PM and mismatch (MM) probes are used for selecting invariant genes (Bolstad et al., 2003).

3. A Procedural Description of the CPN-INR Method

In this section, we describe a procedure for normalizing microarray data with CPN-INR. The following steps are described and illustrated in detail: (1) preparing data for normalization, (2) reading in data, (3) normalizing arrays using CPN-INR, (4) exporting normalized data to files, and (5) plotting the normalization function.

3.1. Prepare Data

To use dChip, the user needs to provide Affymetrix array data files (in CEL or DAT format) and the CDF file (Chip Description File). To read in cDNA array data or gene expression data such as those generated by MAS5.0, one may prepare an external data file in tab-delimited text format and then read it into dChip with “Analysis/Get external file”. To use CPN-INR for normalizing CEL files across multiple phenotypes, create a directory that contains one subdirectory per phenotype. Each subdirectory holds all the CEL files of one specific phenotype.
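The directory layout described above can be checked before opening dChip. The sketch below is illustrative Python and is not part of dChip: the function names and the phenotype directory names used in the demo are hypothetical, and it only looks for files whose names end in .cel.

```python
import tempfile
from pathlib import Path


def list_phenotype_groups(root):
    """Map each phenotype subdirectory name to its sorted CEL file names."""
    groups = {}
    for sub in sorted(Path(root).iterdir()):
        if sub.is_dir():
            groups[sub.name] = sorted(
                p.name
                for p in sub.iterdir()
                if p.is_file() and p.suffix.lower() == ".cel"
            )
    return groups


def check_layout(root):
    """Return a list of human-readable problems; an empty list means OK."""
    groups = list_phenotype_groups(root)
    problems = []
    if len(groups) < 2:
        problems.append("expected at least two phenotype subdirectories")
    for name, cel_files in groups.items():
        if not cel_files:
            problems.append(f"subdirectory '{name}' contains no CEL files")
    return problems


if __name__ == "__main__":
    # Build a throwaway layout with hypothetical names to show the check.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for phenotype, files in {"phenotype_A": ["a1.cel", "a2.cel"],
                                 "phenotype_B": ["b1.cel"]}.items():
            (root / phenotype).mkdir()
            for file_name in files:
                (root / phenotype / file_name).touch()
        print(list_phenotype_groups(root))
        print(check_layout(root))
```

Running such a check first catches an empty or misplaced phenotype folder before any group is opened in the dialogs.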
To normalize gene expression data (external file), create a separate tab-delimited text file for each phenotype. Note that an upcoming version is under development that can use a label to assign each sample to a particular class (or phenotype).

3.2. Read in Data

dChip analysis is based on a group of array data files that a researcher generates, in either DAT or CEL format. All the arrays used in a single analysis should be of the same chip type. To read in the data, select the menu “Analysis/Open Group”. In the dialog window shown in Fig. T-4, the user can give a group name, select a data directory, and specify the data format of the files to be read in. After completing the “Analysis/Open group/Data files” dialog, click the “Other information” tab at the top to show another dialog (Fig. T-5). In this dialog, the CDF (Chip Description File) for the particular array type should be specified. See the Affymetrix website (URL: http://www.affymetrix.com) to obtain the CDF files.

Fig. T-4. The dialog window of “Analysis/Open group/Data files”.

Fig. T-5. The dialog window of “Analysis/Open group/Other information”.

To read in a group of multiple-phenotype arrays, click “Analysis/Open Supergroup (CPN)” to specify each group’s name, data directory, file format, and CDF, as shown in Fig. T-4 and Fig. T-5. For each group, a dialog window prompts the user for this information, and dChip reads in the data when the user clicks “OK”. After all groups of arrays have been read in, click “Cancel” to signal the completion of the data “read in” process. A similar procedure can be used to read in gene expression data files in tab-delimited text format for CPN normalization.
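For expression-level input, a single table can be split into one tab-delimited file per phenotype with a few lines of Python. This is only a sketch under assumptions: the layout (an identifier column followed by one column per sample), the header name probe_set, and the output file names are hypothetical, and the exact layout dChip expects for external files should be checked against the dChip documentation.

```python
import csv
import tempfile
from pathlib import Path


def write_phenotype_tables(probe_ids, samples, labels, out_dir):
    """Write one tab-delimited file per phenotype.

    probe_ids: list of row identifiers
    samples:   dict mapping sample name -> list of values aligned with probe_ids
    labels:    dict mapping sample name -> phenotype name
    Returns a dict mapping phenotype name -> path of the file written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Group sample names by phenotype, keeping their original order.
    by_phenotype = {}
    for sample_name in samples:
        by_phenotype.setdefault(labels[sample_name], []).append(sample_name)

    written = {}
    for phenotype, names in by_phenotype.items():
        path = out_dir / f"{phenotype}.txt"  # hypothetical naming scheme
        with open(path, "w", newline="") as handle:
            writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
            writer.writerow(["probe_set"] + names)  # assumed header layout
            for row, probe_id in enumerate(probe_ids):
                writer.writerow([probe_id] + [samples[n][row] for n in names])
        written[phenotype] = path
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        paths = write_phenotype_tables(
            ["g1", "g2"],
            {"s1": [1.0, 2.0], "s2": [3.0, 4.0], "s3": [5.0, 6.0]},
            {"s1": "A", "s2": "B", "s3": "A"},
            tmp,
        )
        for phenotype, path in paths.items():
            print(phenotype, path.read_text(), sep="\n")
```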
Click “OK” to perform the read-in. The output gives brief information about (1) the CDF file and (2) a summary of the arrays, including median intensities and the percentage of “Present” calls:

{Open group GL_SNB30
Read in existing binary CDF file G:\NormMS4Final\CDF\HG_U95A.cdf.bin (file format 3)...
Search and extract PM/MM data from 'CEL' files of chip type 'HG_U95A' in G:\NormMS4Final\Variance\GL_SNB30\TuT...
Found G:\NormMS4Final\Variance\GL_SNB30\TuT\94424hgu95a11.cel
Reading CEL file... Line 400000
Calculating presence calls... Gene 12600: P
Found G:\NormMS4Final\Variance\GL_SNB30\TuT\94425hgu95a11.cel
Reading CEL file... Line 400000
Calculating presence calls... Gene 12600: A
Found G:\NormMS4Final\Variance\GL_SNB30\TuT\94426hgu95a11.cel
Reading CEL file... Line 400000
Calculating presence calls... Gene 12600: P
Found G:\NormMS4Final\Variance\GL_SNB30\TuT\94427hgu95a11.cel
Reading CEL file... Line 400000
Calculating presence calls... Gene 12600: P
Calculating background...
Writing array summary file G:\NormMS4Final\IEGOverlap\GL_SNB30\GL_SNB30 arrays.xls
Group "GL_SNB30" contains 4 files, see G:\NormMS4Final\IEGOverlap\GL_SNB30\GL_SNB30 arrays.xls for details

Number  Array          File Name                                                 Median Intensity  P call %
1       94424hgu95a11  G:\NormMS4Final\Variance\GL_SNB30\TuT\94424hgu95a11.cel  235               59.1874
2       94425hgu95a11  G:\NormMS4Final\Variance\GL_SNB30\TuT\94425hgu95a11.cel  209               58.8389
3       94426hgu95a11  G:\NormMS4Final\Variance\GL_SNB30\TuT\94426hgu95a11.cel  222               58.633
4       94427hgu95a11  G:\NormMS4Final\Variance\GL_SNB30\TuT\94427hgu95a11.cel  170               58.7835

Treat 4 arrays as 4 experiments
Finished in 0 minutes 24 seconds}

3.3. Normalize Arrays

Normalization is an important prerequisite for almost all follow-up microarray data analysis steps. Since scanned images may have different overall brightness, normalization is needed to adjust the brightness of the arrays to a comparable level.
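To make the baseline-based idea concrete, the following Python sketch normalizes one floating array against a baseline array by alternating IEG selection and cubic regression, in the spirit of the loop summarized in Section 2. It is not the dChip C/C++ code and covers only a single array pair, not the full cross-phenotype cycle. The function names, the 5-degree half-angle, the convergence thresholds, and the synthetic data are all assumptions. Reading “sector-shaped” selection as an angular band around the 45-degree line is one plausible interpretation that this tutorial does not spell out.

```python
import numpy as np
from numpy.polynomial import Polynomial


def sector_mask(x, y, half_angle_deg):
    """True where the point (x, y) lies within half_angle_deg of the 45-degree line."""
    angle = np.degrees(np.arctan2(y, x))
    return np.abs(angle - 45.0) <= half_angle_deg


def normalize_against_baseline(floating, baseline, half_angle_deg=5.0,
                               max_iter=20, min_overlap=0.99, tol=1e-3):
    """Alternate invariant-set selection and cubic regression (assumed scheme).

    Returns the normalized floating values, the boolean mask of the points
    selected last, and the number of iterations performed.
    """
    current = np.asarray(floating, dtype=float).copy()
    baseline = np.asarray(baseline, dtype=float)
    scale = baseline.max() - baseline.min()
    selected = sector_mask(current, baseline, half_angle_deg)

    for iteration in range(1, max_iter + 1):
        if selected.sum() < 4:
            raise ValueError("too few invariant candidates to fit a cubic")
        # Fit baseline as a cubic function of the current floating values,
        # using only the currently selected (assumed invariant) points.
        curve = Polynomial.fit(current[selected], baseline[selected], deg=3)
        updated = curve(current)  # apply the curve to every probe
        new_selected = sector_mask(updated, baseline, half_angle_deg)

        # Condition 1: the selected set barely changes (Jaccard overlap).
        union = (selected | new_selected).sum()
        overlap = (selected & new_selected).sum() / max(union, 1)
        # Condition 2: the fitted curve is close to the identity (45-degree) line.
        shift = np.max(np.abs(updated[selected] - current[selected])) / scale

        current, selected = updated, new_selected
        if overlap >= min_overlap and shift <= tol:
            break
    return current, selected, iteration


# Synthetic demo: a smooth intensity distortion plus 10% changed probes.
rng = np.random.default_rng(0)
truth = rng.uniform(50.0, 1000.0, size=2000)
baseline = truth + rng.normal(0.0, 5.0, size=truth.size)
floating = 0.8 * truth + 0.0002 * truth**2 + rng.normal(0.0, 5.0, size=truth.size)
changed = np.zeros(truth.size, dtype=bool)
changed[:200] = True
other = rng.uniform(50.0, 1000.0, size=200)  # unrelated expression levels
floating[changed] = 0.8 * other + 0.0002 * other**2

normalized, invariant, n_iter = normalize_against_baseline(floating, baseline)
before = np.mean(np.abs(baseline[~changed] - floating[~changed]))
after = np.mean(np.abs(baseline[~changed] - normalized[~changed]))
print(f"iterations: {n_iter}, selected points: {int(invariant.sum())}")
print(f"mean |baseline - floating| on unchanged probes: {before:.1f} -> {after:.1f}")
```

The two stopping tests mirror the two conditions stated in Section 2: a nearly unchanged IEG set and a fitted curve close to the 45-degree line.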
To normalize multiple arrays, click “Analysis/Normalize” to bring up the normalization dialog shown in Fig. T-6. As we can see, the CPN-INR method has been implemented in dChip as an alternative to the IR method. In the dialog, we can select several parameters for the INR method, including (1) the baseline array and (2) the form of the nonlinear regression function. Both INR and IR are baseline-based normalization methods. By default, an array with median overall intensity is chosen as the baseline array against which the other arrays are normalized at the probe intensity level. The normalization dialog can be used to specify a different array as the baseline if necessary. For the nonlinear regression function, we have implemented the following forms: (1) quadratic polynomials, (2) cubic polynomials, and (3) smoothing splines with generalized cross-validation (GCVSS). Among them, cubic polynomials have some advantage over quadratic polynomials and GCVSS, owing to their accuracy in model fitting and their low computational cost in parameter estimation.

Fig. T-6. A snapshot of the INR method within dChip for data analysis.

Note that normalization in dChip first checks whether the arrays have been normalized before. If they have, dChip uses the existing normalized CEL intensities in the DCP files. To force dChip to always perform normalization, check “Ignore normalized data in DCP files”. Click “OK” to perform normalization. The output reports how many PM probes are selected in the “Invariant set” (i.e., the IEGs) to determine the normalization curve, and the median probe intensity before and after normalization:

{Normalize arrays using baseline: 94426hgu95a11...
Baseline chip
Accessing 'G:\NormMS4Final\Variance\GL_SNB30\TuT\94426hgu95a11.dcp' (file format 3)
94424hgu95a11
Accessing 'G:\NormMS4Final\Variance\GL_SNB30\TuT\94424hgu95a11.dcp' (file format 3)
Searching Invariant-set: 16290
Normalizing CEL: 400000
Median intensity: 235 -> 201
94425hgu95a11
Accessing 'G:\NormMS4Final\Variance\GL_SNB30\TuT\94425hgu95a11.dcp' (file format 3)
Searching Invariant-set: 18710
Normalizing CEL: 400000
Median intensity: 209 -> 218
94426hgu95a11
Accessing 'G:\NormMS4Final\Variance\GL_SNB30\TuT\94426hgu95a11.dcp' (file format 3)
Do not normalize the baseline array (Median intensity 222)
94427hgu95a11
Accessing 'G:\NormMS4Final\Variance\GL_SNB30\TuT\94427hgu95a11.dcp' (file format 3)
Searching Invariant-set: 19041
Normalizing CEL: 400000
Median intensity: 170 -> 199
Calculating background...
Finished in 0 minutes 58 seconds}

3.4. Save Normalized Data

After normalization, the user may use “Image/Export CEL” to export the probe-level data of one or all arrays into CEL files. The dChip-generated CEL file does not contain the standard deviation and number-of-pixels information found in the original CEL file. As a result, the Affymetrix MAS5 software cannot recognize dChip-generated CEL files. We have implemented a temporary workaround that fills in “fake” numbers for the standard deviation and number-of-pixels fields. Click “Image/Export CEL(Mas5.0)” to bring up the dialog shown in Fig. T-7. The user can choose to save either the current normalized array or all normalized arrays. For the normalized CEL files, the user can specify a word to be inserted into the file names.

Fig. T-7. Export normalized intensities to CEL files.

3.5. Plot Normalization Function

When in the “Image View” (where an array image is displayed), one can use “Image/Normalization plot” to view the normalization scatter plot between the current array and the baseline array. R must be installed for this step; see The R-Project (2005) for more information about the R software. Fig. T-8 shows the normalization plot dialog, where the user can specify whether or not to use a smoothing spline for computing the normalization curve. Fig. T-9 shows a scatter plot displayed by the R software. In the plot, each point represents PM or MM probe values in the two arrays. The blue line is the diagonal line y = x, the red circles are the probes selected in the IEG set, and the green curve is the normalization curve based on the selected IEGs.

Fig. T-8. Normalization scatter plot.

Fig. T-9. Scatter plot of two arrays before normalization. The IEGs selected by INR are shown in red, and the estimated normalization function is shown in green. The blue line is the 45-degree line.

4. Summary

In this tutorial, we have presented the CPN-INR method together with a procedural description of its use. The CPN-INR method has been implemented in C/C++ and integrated into the dChip software. The block diagram of the CPN-INR algorithm is summarized in Section 2. The procedural steps for performing normalization using CPN-INR are detailed and illustrated in Section 3.

References

1. Li, C., and Wong, W.H. (2003) DNA-Chip analyzer (dChip). In The analysis of gene expression data: methods and software. Edited by Parmigiani, G., Garrett, E.S., Irizarry, R. and Zeger, S.L. Springer.
2. Li, C., and Wong, W.H. (2001) Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application. Genome Biol., 2(8), 1-11.
3. Xuan, J., Khan, J., Hoffman, E., Clarke, R., and Wang, J. (2005) Normalization of microarray data by iterative nonlinear regression. Submitted to IEEE 5th Symposium on Bioinformatics and Bioengineering (BIBE05).
4. Schadt, E., Li, C., Ellis, B., and Wong, W.H. (2001) Feature extraction and normalization algorithms for high-density oligonucleotide gene expression array data. J. Cell. Biochem., 84(S37), 120-125.
5. Wang, Y., Lu, J., Lee, R., Gu, Z., and Clarke, R. (2002) Iterative normalization of cDNA microarray data. IEEE Trans. on Info. Tech. in Biomedicine, 6(1), 29-37.
6. Bolstad, B.M., Irizarry, R.A., Astrand, M., and Speed, T.P. (2003) A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19(2), 185-193.
7. The R-Project (2005) The R project for statistical computing, URL: http://www.r-project.org/.