中華民國 94 年 12 月 華岡農科學報第十六期:43-50 Hwa Kang Journal of Agriculture Vol. 16:43-50, December 2005 Microarray Image Pre-Analysis for Critical Gene Expression Computation with Implemented Algorithmic Kernel Ming-Yueh Tsai1, Chun-Fan Chang2, Hsueh-Ting Chu3, Chen-hsiung Chan4, King-Jen Chang5, Cheng-Yao Kao6, and Chaur-Chin Chen7 Abstract: Microarray hybridization analysis on transcriptomic specimens has become an efficient technology platform towards identifying valid biomarkers of phenotypic traits or diseases by interrogating simultaneously almost genome-wide genes simply with corresponding probes microarrayed on matrix of glass slide or nylon membrane. Still, the computational processing on microarray image is profoundly essential for mining transcriptomic biomarkers by means of acquiring truthful intensity data of respective spot objects in order for accurate gene expression analysis through critical preprocessing procedures including displayable TIFF image data input, applicable RGB-CMYK 24/16-bit greyscaling, global object layout gridding, intensity 16/8-bit converting, and so forth. This paper implemented an in-house algorithmic kernel of Otsu’s Simple Thresholding, Gaussian Mixture Model (GMM), and Iterated Conditional Modes (ICM) for reliable gridding and segmentation pre-analysis which computes gene expression pattern on microarray image data for subsequent automatic image data analysis (Aida). In comparison with commercial microarray software, our Aida algorithmic kernel demonstrated higher average Pearson correlation coefficient of 0.9933 (cy3, ICM/Otsu’s) and 0.9721 (cy3, ICM/GMM) despite of the inferior result of 0.9538 (cy3, ICM/ArrayPro) with commercial software. Additional procedures with implemented algorithmic Aida modules were also integrated for statistical computation on microarray images of 50 clinical hepatocellular carcinoma (HCC) specimens with different a priori etiology of hepatitis B virus (HBV) and hepatitis C virus (HCV). Keywords: Microarray, Gene expression, Gridding, Segmentation, Simple thresholding, Gaussian mixture model, Iterated conditional modes. 1 Graduate Student, Department of Computer Science, National Tsing Hua University; first author. Professor, Graduate Institute of Biotechnology, Chinese Culture University; joint first author. 3 Assistant Professor, Department of Computer and Information Engineering, Asia University. 4 Post-doctoral Research Fellow, Department of Computer Science and Information Engineering, National Taiwan University. 5 Professor, Department of Medicine, and Director, Angiogenesis Research Center, National Taiwan University. 6 Professor, Department of Computer Science and Information Engineering, National Taiwan University. 7 Professor, Department of Computer Science, National Tsing Hua University; corresponding author. 2 Associate 43 Microarray Image Pre-Analysis for Critical Gene Expression Computation with Implemented Algorithmic Kernel methods to accomplish the gene expression computation in a fast, accurate, and reproducible procedure. Introduction The microarray technique first appeared in E.M. Southern’s publication of 1970 has now become an efficient technology platform for identifying valid biomarkers by hybridizing transcriptomic specimens against almost genome-wide genes simultaneously (Brazma et al., 2001). Genome projects massively disclosing the life secret of human gene sequences attracted researchers cross disciplines to investigate biologically and computationally on the data potential for improving human health. Typical microarray chips of polymer-coated glass slides are arrayed with thousands of spots which respectively contains lots of DNA molecules identical in sequence. Microarray images are retrieved after hybridization with high resolution color scanner to acquire green and red light illumination of the labeled cy3 (green) and cy5 (red) dyes on specimens. A flowchart for microarray array image pattern analysis is given in Figure 1. Subsequent sections perform grid fitting, introduce three methods for gene expression computation, show experimental results, and draw a conclusion, respectively. In this paper, we address on the essential pre-analysis issues of microarray image data including gridding and object segmentation while disregard additional specificity issues on transcriptomic RNA specimen and microarray probes. Accurate gene expression computation is profoundly essential for identifying valid biomarkers through analyzing microarray image patterns (Causton et al., 2003). Regular microarray image of 1000-pixels per centimeter or 3000-pixels per inch resolution has often installed tough tasks in gene expression computation due to large dimensions of images. In addition, the noise on microarray image due to manual operations in biological experiments may need to adopt efficient image preprocessing kernel before statistical data analysis. Figure 1. A paradigm of microarray image pattern analysis. Despite of the commercial software GenePix Pro and Array-Pro Analyzer developed for microarray image data analysis, the gene expression computation still requires complex and intensive manual operations which make the computational analysis to be painfully slow and not repeatable. The purpose of this paper is to implement three algorithmic (a) (b) Figure 2. (a) The original image (b) The result image of gridding pre-processing. 44 華岡農科學報第十六期 p i ni / N Gridding The goal of gridding or addressing microarray objects is to reliably compute intensity of corresponding spots in accord with microarray layout design. The object intensity of respective microarray spot shows the expression level of arrayed probe sequence from related gene. Given top left corner as the reference of an intended microarray image, the gene expression of spot objects are computed by image segmentation algorithms along with the layout design of designated spot size and spacing in horizontal and vertical. Figure 2 shows our gridding result tested on partial microarray image with horizontal and vertical spacing of 150 um and 145.161 um, where the pixel size is 10 um. Inevitably, microarray experiments often produce skew noises in which rotation or smoothing modules are added for removal before the local segmentation (Gonzalez and Woods, 2002). L p pi 0 , i 1 (1) 1 i where ni is the number of pixels in gray level i, and N is the total number of pixels. We suppose that all pixels of each grid can be separated into two classes, denoted as C B and C F (background and foreground) with a threshold k. Then, calculated probability and mean of each class are as follows. PB Pr(C B ) k p i Lmin i (2) P F Pr(C F ) Lmax p i k 1 i 1 PB (3) B k k i Pr(i | C ) ip / P B i Lmin i i Lmin B (k ) / PB (4) Gene Expression Computation F The most essential task in microarray data processing is to compute gene expression. Here we introduce three computation methods of the gene expression. I. Lmax Lmax i Pr(i | C ) ip / P F i k 1 i k 1 i F (T (k )) / PF (5) where (k ) Simple thresholding algorithm k ip i Lmin The simple thresholding algorithm proposed by Otsu (1979) determines a threshold to separate the pixels of a grid into foreground and background. The optimal threshold is calculated by the discriminant criterion which maximizes the separatability of the classes in gray levels. i and T ( Lmax ) Lmax ip i Lmin i (6) The within-class scatter, between-class scatter and the total scatter of gray levels are defined as follows. (7) W2 w0 02 w1 12 B2 w0 w1 ( 1 0 ) 2 The algorithm loads microarray image to fit gridding and find the maximum and minimum gray level of each grid, denoted as Lmin and Lmax . Subsequently, to compute and normalize the gray level histogram as a probability distribution by the following equation. T2 W2 B2 (8) k (i i Lmin T ) 2 pi (9) To get the optimal threshold k*, we maximize B2 (Fukunaga, 1972) by applying equations (2) ~ (5) 45 Microarray Image Pre-Analysis for Critical Gene Expression Computation with Implemented Algorithmic Kernel B2 (k * ) max B2 (k ) max 1 k L 1 k L Maximum likelihood estimation: [ T w(k ) (k )]2 (10) w(k )[1 w(k )] J ( ) J ( x1 , x2 , , xn , ) (13) n f ( xi ) Consequently, the gene expression is computed by the difference of foreground and background G | f b | i 1 n f1 ( xi ) (1 ) f 2 ( xi ) i 1 (11) We want to find the maximum likelihood estimate ˆ such that J (ˆ ) J ( ) for all . where G is the gene expression, f and b are the mean intensity of foreground and background. Expectation maximization algorithm: II. Gaussian Mixture Model (GMM) We use expectation-maximization algorithm to find maximum likelihood estimates of the parameters. Then, we transfer the likelihood function into the log-likelihood J ( ) function. In microarray experiments, the statistical analysis is the basis of obtaining gene expression data from microarray images. The mean and median of pixel intensities are often used to represent gene expression in many statistical methods despite of the realistic controversy of which one may work better. Recently, Gaussian Mixture Models have been proposed for analyzing microarray data (Causton et al., 2003). The GMM can solve the problem of using median or mean values of pixel intensities for gene expression computation for that the mean and the median in a Gaussian distribution are ideally equal. n L( ) ln f ( xi ) i 1 n ln( f1 ( x) (1 ) f 2 ( x)) i 1 n 2 2 2 2 1 1 ln e( x 1 ) /(21 ) (1 ) e( x 2 ) /(2 2 ) i 1 21 2 2 With respect to mean and standard deviation, we do partial derivatives to solve the parameters 1, 2, 1 and 2. We suppose that all pixels of a grid on the microarray image can be separated into two clusters (foreground and background). Each cluster is assumed to be a Gaussian distribution (Dempster et al., 1977). f ( x) f1 ( x) (1 ) f 2 ( x) where f1 ~ N ( 1, 12 ) , f 2 ~ N (2 , 22 ) (14) L( ) L( ) 0 , 0, for j 1,2 j j (12) (15) The EM algorithm is further split into two steps to update the estimated parameters. and (1) The expectation step (E-step) (2) The maximization step (M-step) 0 1. The parameters of f (x) are θ = (α, μ1, μ2, σ1, σ2) Let α be 0.5 and pixels I = {x1, x2…, x2} are sorted into two equal groups S1 and S2. Compute the mean and standard deviation of each group as for the initial values of (μ1 ,σ1) and (μ2 ,σ2). There are two important statistical concepts that are further incorporated to solve optimization problem of the parameter . (1) maximum-likelihood estimation (MLE) (2) Expectation-Maximization (EM) algorithm. E-step: Compute the conditional expectation 46 華岡農科學報第十六期 1 ( x) f1 ( x) and f ( x) 2 ( x) (1 ) f 2 ( x) (16) f ( x) conditioned on y is a Gibbs distribution whose energy function U(x|y) is thus defined as follows. M-step: Update the parameters of new ~ likelihood estimate (~, ~1 ,~1 , ~2 ,~2 ) as n n i 1 1 1 ( x)( xi 1 ) i 1 n 1 ( x) (17) where J(a,b)= -1 if a=b , J(a,b)=0 if a≠b; c=2 for the 1st-order MRF and c=4 for the 2nd-order MRF. i 1 n 2 and 2 2 ( x)( xi 1 ) 2 i 1 n Besag (Besag, 1986) demonstrated in experiments that it was usually enough to take 6 complete scans or fewer to achieve a reasonable labeling. An ICM algorithm (Chen and Dubes, 1990; Besag, 1986) is listed below. 2 ( x) i 1 i 1 Repeat E-step and M-step recursively until achieving convergence condition. The gene expression level is thus computed by the difference of μ1 and μ2. G 2 1 ICM algorithm (a) Initialize a labeling by applying MLE for each pixel (18) where 1 is the mean of background cluster, and 2 is the mean of foreground cluster. (a) For t = 1 to MN x t g 0 if f ( x t g 0 | x t , y ) f ( x t g | x t , y ) III. Iterated Conditional Modes (ICM) for all g A, and g 0 A. The ICM algorithm was first proposed by Besag in 1986 to find an optimal labeling x based on a given intensity image y (Chen and Dubes, 1990). The Markov random field (MRF) is usually incorporated into the efficient MRF-based labeling algorithm for image segmentation (Dubes et al., 1990) in capturing contextual information. The ICM algorithm updates the label of each pixel iteratively until a prescribed criterion is achieved. (b) Repeat (b) until “the energy achieves a local minimum” (6 scans hereby). (c) x is the required labeling. All pixels of a grid are labeled as either foreground or background by ICM algorithm in order for computing the gene expression with the formula given in (11). Results We suppose the image of M by N is degraded with Gaussian noise. The observed image denoted by a random vector Y, is assumed to come from a known distribution conditioned on X=x, whose distribution is given by We have implemented in-house algorithmic kernel of Otsu’s Simple Thresholding, Gaussian Mixture Model (GMM), and Iterated Conditional Modes (ICM) for computing the gene expression levels of spots on microarray image. The analyzed 50 microarray images of clinical hepatocellular carcinoma (HCC) cases are from Angiogenesis Research Center (ARC) of National Taiwan University (NTU) and the average height and MN f ( y | x) f ( y i | xi ) (19) i 1 f ( y i | xi ) ~ N ( xi , x2i ) The a posteriori distribution of (20) ( yt xt )2 c MN 1 U ( x | y) ln( x2t ) r J ( xt , xt: r ) 2 t 1 2 r 1 2 xt n 1 ( xi ) 1 ( xi ) xi 2 ( xi ) xi i 1 , 1 i 1n , 2 i 1n n 1 ( xi ) 2 ( xi ) n f ( x | y ) eU ( x| y ) / Z x| y X 47 Microarray Image Pre-Analysis for Critical Gene Expression Computation with Implemented Algorithmic Kernel width are 2300 and 2200 pixels (Chen et al., 2005). The horizontal and vertical distances in-between spot centers are gridded according to the microarray layout design of 150 um and 145.161 um without intensive manual layout addressing. venous non-invasion anti-HBV(+) and (2) venous invasion anti-HBV(+) and as well HCV-related (3) venous non-invasion anti-HCV(+) and (4) venous invasion anti-HCV(+). Table 2. Up- and down-regulated gene sets from expression profiles of 50 microarray images pertaining to anti-viral antibody presence with or without HCC venous invasion. Table 1. The average Pearson’s correlation coefficients of gene expression between Otsu’s, GMM, ICM and commercial software “ArrayPro Analyzer.” Average Pearson correlation coefficient cy3 cy5 ArrayPro v.s. Otsu’s ArrayPro v.s. GMM ArrayPro v.s. ICM 0.9492 0.9563 0.9538 0.9586 0.9624 0.9612 Otsu’s v.s. GMM 0.9577 0.9608 Otsu’s v.s. ICM GMM v.s. ICM 0.9933 0.9721 0.9940 0.9734 HCC cases with anti-viral antibody(+) HBV HCV (Venous Non-invasion) HCCs Up-regulated Down-regulated (1) 17 36 5 (3) 16 34 7 (Venous Invasion) HCCs Up-regulated Down-regulated (2) 12 75 17 (4) 5 103 27 Up-regulated gene numbers at combinatorial subsets presence are listed as “(combo subsets) gene number” as of the followings: (1000) 6, (0200) 20, (0030) 1, (0004) 46, (1200) 1, (1030) 0, (1004) 4, (1230) 2, (1204) 7, (1034) 2, (0234) 8, and (1234) 14, respectively. The computational tests are run on a Windows-based PC with Pentium4 3.00GHz CPU for gene expression computation and the respective run time by the Otsu’s, GMM and ICM algorithms are 4, 14, and 21 seconds, respectively. The test results of the average Pearson’s correlation coefficients on the pair-wise algorithmic comparison are in general greater than 95% among three proposed methods and commercial software (Table 1). Notably, best correlations in (cy3, cy5) are observed in the cases of ICM versus either the Otsu’s (0.9933, 0.9940) or GMM (0.9721, 0.9734) despite of the inferior result while versus commercial ArrayPro software (0.9538, 0.9612). Conclusion and Discussion Microarray technology is popular because of its ample mine of information. In this paper, we demonstrate how to do gridding and computing on a microarray image towards an important basis of mining gene expression patterns. We propose three different methods: Otsu’s thresholding, GMM, and ICM algorithms to compute gene expression of microarray images and partially improve some drawbacks of commercial software. From the experimental results, we observe that the Pearson’s correlation coefficients between these three methods and commercial software are as high as over 95%. Moreover, the proposed gridding methods with three segmentation algorithms comprise nearly automatic procedures for computing gene expression levels in a huge microarray image without manual operation as of the compared commercial software. The gridding and segmentation may collectively substantiate The pre-analysis results of microarray image spots seem to be with significant object intensities which may consequently assist the accurate mining of valid biomarkers (VBM) of various HCC categories. In this regard, the preliminary data of the expression profile analysis among 50 HCC microarray images has revealed a collective set of 138 up-regulated genes (Table 2) with the up-regulated gene numbers of 36, 75, 34, and 103 within respective categories of HBV-related (1) 48 華岡農科學報第十六期 the software system of automatic image data analysis (Aida) on microarray images either without or with reference spots. F.C.P. Holstege, I.F. Kim, V. Markowitz, J.C. Matese, H. Parkinson, A. Robinson, U. Sarkans, S. Schulze-Kremer, J. Stewart, R. Taylor, J. Vilo, and M. Vingron. (2001) Minimum Information about Microarray Experiment (MIAME) toward Standards for Microarray Data. Nature Genetics. 29:365-371. With highly reproducible results of gene expression computation, subsequent normalization procedure can be imposed at both system level and expression level respectively with MA-plot and maximal invariant subset algorithms. Preliminarily selected biomarker genes of tumor angiogenesis (tag) may be with great clinical value of HCC stage definition and treatment prognosis. However, the value of tag-microarray may exert in assisting the development on preventative measures of diagnosis and therapy. Respectively, the preventative diagnostics may include serum proteins of connective tissue growth factor and peripheral blood mononuclear cells. Preventative therapeutics may include high-throughput screening of botanical drug products with health-regain efficacy evidenced with the gene expression profile of tag-microarray data along with drug toxicity genes onto the cell culture evaluation system. Causton, H.C., J. Quackenbush, and A. Brazma. (2003) Microarray Gene Expression Analysis. Blackwell Publishing, New York. USA. Chen, C.C. and R.C. Dubes. (1990) Environmental Studies of Iterated Conditional Modes Segmentation Algorithm. Journal of Information Science and Engineering. 6:325-337. Chen, C.N., J.J. Lin, J.J.W. Chen, P.H. Lee, C.Y. Yang, M.L. Kuo, K.J. Chang, and F.J. Hsieh. (2005) Gene expression profile predicts patient survival of gastric cancer after surgical resection. Journal of Clinical Oncology. 23(29):7286-7295. Dempster, A.P., N.M. Laird, and D.B. Rubin. (1977) Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. 39:1-38. Acknowledgments The authors would like to thank the National Science Council (NSC), the Council of Agriculture (COA), and the Department of Industrial Technology (DoIT) of Ministry of Economic Affairs (MOEA), Taiwan for financial support: NSC 94-2213-E-007-089, NSC 92-2745-B-034-001, COA 94-AS-5.2.1AD-U2, and MOEA 93-EC-17-A-19- S1- 0016. Dubes, R.C., A.K Jain, S.C. Sateesha, and C.C. Chen (1990) MRF Model-Based Algorithms for Image Segmentation. in Proc. IEEE International Conference on Pattern Recognition. 808-814. Fukunaga, K. (1972) Introduction to Statistical Pattern Recongnition. New York Academic. 260-267. References Gonzalez, R.C. and R.E. Woods. (2002) Digital Image Processing. Prentice-Hall Publishing, New York, USA. Besag, J. (1986) On the Statistical Analysis of Dirty Pictures. J. Royal Stat. Soc. Ser. B. 48:259-302. Otsu, N. (1979) A Threshold Selection Method from Gray-Level Histograms. Journal of IEEE Transactions on Systems, Man, and Cybernetics. SMC-9:62-66. Brazma, A., P. Hingamp, J. Quackenbush, G. Sherlock, P. Spellman, C. Stoeckert, J. Aach, W. Ansorge, C.A. Ball, H.C. Causton, T. Gaasterland, P. Glenisson, 49 中華民國 94 年 12 月 華岡農科學報第十六期:43-50 Hwa Kang Journal of Agriculture Vol. 16:43-50, December 2005 基因表現程度嚴謹計算的微陣影像初期分析演算核心 蔡明岳 1 張春梵 2 朱學亭 3 詹鎮熊 4 張金堅 5 高成炎 6 陳朝欽 7 摘要: 轉錄組檢體的微陣雜合分析是篩選表現型性狀或疾病之實效生物標記的有效技術平台,實 際操作係經由同時雜合詢答將近整體基因組相關基因探針所產製的微陣晶片。微陣影像的計算處 理至今仍是探採轉錄組生物標記的底層根基,主要任務為攫取個別圖點物件的真實緻密程度以便 進行精確的基因表現程度分析,嚴謹的前處理操作程序包括螢幕顯示輸入影像數據、特用型紅綠 藍與紅綠黃黑的 24 與 16 位元灰階轉換、整體物件佈局格網化、與物件緻密程度 16 與 8 位元互換 等;本篇報告實作 Otsu 簡單閾值、高斯混合模態、與迭代條件模式的自有演算核心,執行物件格 網化與分割的高度可信前處理,以便計算微陣影像數據的基因表現類型達成後續的阿伊達自動影 像數據分析專家系統。經比較微陣影像分析商業軟體,本篇報告的阿伊達演算核心展現較高的平 均皮爾森關連係數 0.9933 與 0.9721 於 ICM/Otsu 與 ICM/GMM 分組的整體 cy3 數值,而商業軟體卻 是呈現稍差的 0.9538 於 ICM/ArrayPro 分組;增益的相關議題所實作完成的阿伊達演算模組則是進 行整合調校,並應用於統計分析計算源自早先 HBV 與 HCV 不同感染病因的 50 個臨床肝癌檢體微 陣影像。 關鍵詞: 微陣、基因表現程度、格網化、分割、簡單閾值、高斯混合模態、迭代條件模式 1 國立清華大學資訊工程學系碩士研究生,第一作者。 2 中國文化大學農學院生物科技研究所副教授,共同第一作者。 3 亞洲大學資訊工程學系助理教授。 4 國立台灣大學資訊工程學系博士後研究員。 5 國立台灣大學醫學系教授暨血管新生研究中心主任。 6 國立台灣大學資訊工程學系教授。 7 國立清華大學資訊工程學系教授,通訊作者。 50