Gene Expression Computing For Microarray Image

advertisement
中華民國 94 年 12 月
華岡農科學報第十六期:43-50
Hwa Kang Journal of Agriculture
Vol. 16:43-50, December 2005
Microarray Image Pre-Analysis for Critical Gene Expression
Computation with Implemented Algorithmic Kernel
Ming-Yueh Tsai1, Chun-Fan Chang2, Hsueh-Ting Chu3, Chen-hsiung Chan4,
King-Jen Chang5, Cheng-Yao Kao6, and Chaur-Chin Chen7
Abstract: Microarray hybridization analysis on transcriptomic specimens has become an efficient
technology platform towards identifying valid biomarkers of phenotypic traits or diseases by
interrogating simultaneously almost genome-wide genes simply with corresponding probes
microarrayed on matrix of glass slide or nylon membrane. Still, the computational processing on
microarray image is profoundly essential for mining transcriptomic biomarkers by means of acquiring
truthful intensity data of respective spot objects in order for accurate gene expression analysis through
critical preprocessing procedures including displayable TIFF image data input, applicable RGB-CMYK
24/16-bit greyscaling, global object layout gridding, intensity 16/8-bit converting, and so forth. This
paper implemented an in-house algorithmic kernel of Otsu’s Simple Thresholding, Gaussian Mixture
Model (GMM), and Iterated Conditional Modes (ICM) for reliable gridding and segmentation
pre-analysis which computes gene expression pattern on microarray image data for subsequent
automatic image data analysis (Aida). In comparison with commercial microarray software, our Aida
algorithmic kernel demonstrated higher average Pearson correlation coefficient of 0.9933 (cy3,
ICM/Otsu’s) and 0.9721 (cy3, ICM/GMM) despite of the inferior result of 0.9538 (cy3, ICM/ArrayPro)
with commercial software. Additional procedures with implemented algorithmic Aida modules were also
integrated for statistical computation on microarray images of 50 clinical hepatocellular carcinoma
(HCC) specimens with different a priori etiology of hepatitis B virus (HBV) and hepatitis C virus
(HCV).
Keywords: Microarray, Gene expression, Gridding, Segmentation, Simple thresholding, Gaussian
mixture model, Iterated conditional modes.
1
Graduate Student, Department of Computer Science, National Tsing Hua University; first author.
Professor, Graduate Institute of Biotechnology, Chinese Culture University; joint first author.
3 Assistant Professor, Department of Computer and Information Engineering, Asia University.
4 Post-doctoral Research Fellow, Department of Computer Science and Information Engineering, National Taiwan
University.
5 Professor, Department of Medicine, and Director, Angiogenesis Research Center, National Taiwan University.
6 Professor, Department of Computer Science and Information Engineering, National Taiwan University.
7 Professor, Department of Computer Science, National Tsing Hua University; corresponding author.
2 Associate
43
Microarray Image Pre-Analysis for Critical Gene Expression Computation with Implemented Algorithmic Kernel
methods to accomplish the gene expression
computation in a fast, accurate, and
reproducible procedure.
Introduction
The microarray technique first appeared in
E.M. Southern’s publication of 1970 has now
become an efficient technology platform for
identifying valid biomarkers by hybridizing
transcriptomic specimens against almost
genome-wide genes simultaneously (Brazma et
al., 2001). Genome projects massively
disclosing the life secret of human gene
sequences
attracted
researchers
cross
disciplines to investigate biologically and
computationally on the data potential for
improving human health. Typical microarray
chips of polymer-coated glass slides are arrayed
with thousands of spots which respectively
contains lots of DNA molecules identical in
sequence. Microarray images are retrieved after
hybridization with high resolution color
scanner to acquire green and red light
illumination of the labeled cy3 (green) and cy5
(red) dyes on specimens.
A flowchart for microarray array image
pattern analysis is given in Figure 1. Subsequent
sections perform grid fitting, introduce three
methods for gene expression computation, show
experimental results, and draw a conclusion,
respectively.
In this paper, we address on the essential
pre-analysis issues of microarray image data
including gridding and object segmentation
while disregard additional specificity issues on
transcriptomic RNA specimen and microarray
probes. Accurate gene expression computation
is profoundly essential for identifying valid
biomarkers through analyzing microarray
image patterns (Causton et al., 2003). Regular
microarray image of 1000-pixels per centimeter
or 3000-pixels per inch resolution has often
installed tough tasks in gene expression
computation due to large dimensions of images.
In addition, the noise on microarray image due
to manual operations in biological experiments
may need to adopt efficient image
preprocessing kernel before statistical data
analysis.
Figure 1. A paradigm of microarray image pattern
analysis.
Despite of the commercial software
GenePix Pro and Array-Pro Analyzer
developed for microarray image data analysis,
the gene expression computation still requires
complex and intensive manual operations
which make the computational analysis to be
painfully slow and not repeatable. The purpose
of this paper is to implement three algorithmic
(a)
(b)
Figure 2. (a) The original image (b) The result image
of gridding pre-processing.
44
華岡農科學報第十六期
p i  ni / N
Gridding
The goal of gridding or addressing
microarray objects is to reliably compute
intensity of corresponding spots in accord with
microarray layout design. The object intensity
of respective microarray spot shows the
expression level of arrayed probe sequence
from related gene. Given top left corner as the
reference of an intended microarray image, the
gene expression of spot objects are computed
by image segmentation algorithms along with
the layout design of designated spot size and
spacing in horizontal and vertical. Figure 2
shows our gridding result tested on partial
microarray image with horizontal and vertical
spacing of 150 um and 145.161 um, where the
pixel size is 10 um. Inevitably, microarray
experiments often produce skew noises in
which rotation or smoothing modules are added
for removal before the local segmentation
(Gonzalez and Woods, 2002).
L
p
pi  0 ,
i 1
(1)
1
i
where ni is the number of pixels in gray level
i, and N is the total number of pixels.
We suppose that all pixels of each grid can
be separated into two classes, denoted as C B
and C F (background and foreground) with a
threshold k. Then, calculated probability and
mean of each class are as follows.
PB  Pr(C B ) 
k
p
i  Lmin
i
(2)
P F  Pr(C F ) 
Lmax
p
i  k 1
i
 1  PB
(3)
B 
k
k
 i Pr(i | C )   ip / P
B
i  Lmin
i
i  Lmin
B
 (k ) / PB
(4)
Gene Expression Computation
F 
The most essential task in microarray data
processing is to compute gene expression. Here
we introduce three computation methods of the
gene expression.
I.
Lmax
Lmax
 i Pr(i | C )   ip / P
F
i  k 1
i  k 1
i
F
(T   (k )) / PF
(5)
where
 (k ) 
Simple thresholding algorithm
k
 ip
i  Lmin
The simple thresholding algorithm
proposed by Otsu (1979) determines a
threshold to separate the pixels of a grid into
foreground and background. The optimal
threshold is calculated by the discriminant
criterion which maximizes the separatability of
the classes in gray levels.
i
and  T   ( Lmax ) 
Lmax
 ip
i  Lmin
i
(6)
The within-class scatter, between-class
scatter and the total scatter of gray levels are
defined as follows.
(7)
 W2  w0 02  w1 12
 B2  w0 w1 ( 1   0 ) 2
The algorithm loads microarray image to
fit gridding and find the maximum and
minimum gray level of each grid, denoted as
Lmin and Lmax . Subsequently, to compute and
normalize the gray level histogram as a
probability distribution by the following
equation.
 T2   W2   B2 
(8)
k
 (i  
i  Lmin
T
) 2 pi
(9)
To get the optimal threshold k*, we
maximize  B2 (Fukunaga, 1972) by applying
equations (2) ~ (5)
45
Microarray Image Pre-Analysis for Critical Gene Expression Computation with Implemented Algorithmic Kernel
 B2 (k * )  max  B2 (k )  max
1 k  L
1 k  L
Maximum likelihood estimation:
[ T w(k )   (k )]2 (10)
w(k )[1  w(k )]
J ( )  J ( x1 , x2 ,
, xn ,  )
(13)
n
  f ( xi )
Consequently, the gene expression is
computed by the difference of foreground and
background
G | f   b |
i 1
n
   f1 ( xi )  (1   ) f 2 ( xi ) 
i 1
(11)
We want to find the maximum likelihood
estimate ˆ such that J (ˆ )  J ( ) for all .
where G is the gene expression, f  and b
are the mean intensity of foreground and
background.
Expectation maximization algorithm:
II.
Gaussian Mixture Model (GMM)
We use expectation-maximization algorithm
to find maximum likelihood estimates of the
parameters. Then, we transfer the likelihood
function
into the log-likelihood
J ( )
function.
In microarray experiments, the statistical
analysis is the basis of obtaining gene
expression data from microarray images. The
mean and median of pixel intensities are often
used to represent gene expression in many
statistical methods despite of the realistic
controversy of which one may work better.
Recently, Gaussian Mixture Models have been
proposed for analyzing microarray data
(Causton et al., 2003). The GMM can solve the
problem of using median or mean values of
pixel intensities for gene expression
computation for that the mean and the median
in a Gaussian distribution are ideally equal.
n

L( )  ln   f ( xi ) 
i 1

n
  ln( f1 ( x)  (1   ) f 2 ( x))
i 1
n

2
2
2
2 
1
1
  ln 
e( x  1 ) /(21 )  (1   )
e( x  2 ) /(2 2 ) 
i 1 
21
2 2

With respect to mean and standard
deviation, we do partial derivatives to solve the
parameters 1, 2, 1 and 2.
We suppose that all pixels of a grid on the
microarray image can be separated into two
clusters (foreground and background). Each
cluster is assumed to be a Gaussian distribution
(Dempster et al., 1977).
f ( x)  f1 ( x)  (1   ) f 2 ( x)
where
f1 ~ N ( 1,  12 ) ,
f 2 ~ N (2 , 22 )
(14)
L( )
L( )
0 ,
 0, for j  1,2
 j
 j
(12)
(15)
The EM algorithm is further split into two
steps to update the estimated parameters.
and
(1) The expectation step (E-step)
(2) The maximization step (M-step)
0    1.
The parameters of f (x) are θ = (α, μ1, μ2, σ1, σ2)
Let α be 0.5 and pixels I = {x1, x2…, x2}
are sorted into two equal groups S1 and S2.
Compute the mean and standard deviation of
each group as for the initial values of (μ1 ,σ1)
and (μ2 ,σ2).
There are two important statistical
concepts that are further incorporated to solve
optimization problem of the parameter .
(1) maximum-likelihood estimation (MLE)
(2) Expectation-Maximization (EM)
algorithm.
E-step: Compute the conditional expectation
46
華岡農科學報第十六期
1 ( x) 
f1 ( x) and
f ( x)
 2 ( x) 
(1   ) f 2 ( x) (16)
f ( x)
conditioned on y is a Gibbs distribution whose
energy function U(x|y) is thus defined as
follows.
M-step: Update the parameters of new
~
likelihood estimate   (~, ~1 ,~1 , ~2 ,~2 ) as
n
n
i 1
1 
 1 ( x)( xi  1 )
i 1
n
 1 ( x)
(17)
where J(a,b)= -1 if a=b , J(a,b)=0 if a≠b; c=2
for the 1st-order MRF and c=4 for the
2nd-order MRF.
i 1
n
2
and  2 
  2 ( x)( xi  1 )
2
i 1
n
Besag (Besag, 1986) demonstrated in
experiments that it was usually enough to take
6 complete scans or fewer to achieve a
reasonable labeling. An ICM algorithm (Chen
and Dubes, 1990; Besag, 1986) is listed below.
  2 ( x)
i 1
i 1
Repeat E-step and M-step recursively until
achieving convergence condition. The gene
expression level is thus computed by the
difference of μ1 and μ2.
G   2  1
ICM algorithm
(a) Initialize a labeling by applying MLE for
each pixel
(18)
where 1 is the mean of background cluster, and
 2 is the mean of foreground cluster.
(a) For t = 1 to MN
x t  g 0 if f ( x t  g 0 | x  t , y )  f ( x t  g | x  t , y )
III. Iterated Conditional Modes (ICM)
for all g  A, and g 0  A.
The ICM algorithm was first proposed by
Besag in 1986 to find an optimal labeling x
based on a given intensity image y (Chen and
Dubes, 1990). The Markov random field (MRF)
is usually incorporated into the efficient
MRF-based labeling algorithm for image
segmentation (Dubes et al., 1990) in capturing
contextual information. The ICM algorithm
updates the label of each pixel iteratively until
a prescribed criterion is achieved.
(b) Repeat (b) until “the energy achieves a
local minimum” (6 scans hereby).
(c) x is the required labeling.
All pixels of a grid are labeled as either
foreground or background by ICM algorithm in
order for computing the gene expression with
the formula given in (11).
Results
We suppose the image of M by N is
degraded with Gaussian noise. The observed
image denoted by a random vector Y, is
assumed to come from a known distribution
conditioned on X=x, whose distribution is
given by
We
have
implemented
in-house
algorithmic
kernel
of
Otsu’s
Simple
Thresholding, Gaussian Mixture Model (GMM),
and Iterated Conditional Modes (ICM) for
computing the gene expression levels of spots
on microarray image. The analyzed 50
microarray images of clinical hepatocellular
carcinoma (HCC) cases are from Angiogenesis
Research Center (ARC) of National Taiwan
University (NTU) and the average height and
MN
f ( y | x)   f ( y i | xi )
(19)
i 1
f ( y i | xi ) ~ N (  xi ,  x2i )
The
a
posteriori
distribution
of
(20)

( yt   xt )2 c
MN  1
U ( x | y)    ln( x2t ) 
  r J ( xt , xt: r ) 
2
t 1  2
r 1
2 xt


n
 1 ( xi )
 1 ( xi ) xi
  2 ( xi ) xi
  i 1
, 1  i 1n
, 2  i 1n
n
 1 ( xi )
  2 ( xi )
n
f ( x | y )  eU ( x| y ) / Z x| y
X
47
Microarray Image Pre-Analysis for Critical Gene Expression Computation with Implemented Algorithmic Kernel
width are 2300 and 2200 pixels (Chen et al.,
2005). The horizontal and vertical distances
in-between spot centers are gridded according
to the microarray layout design of 150 um and
145.161 um without intensive manual layout
addressing.
venous non-invasion anti-HBV(+) and (2)
venous invasion anti-HBV(+) and as well
HCV-related
(3)
venous
non-invasion
anti-HCV(+) and (4) venous invasion
anti-HCV(+).
Table 2. Up- and down-regulated gene sets from
expression profiles of 50 microarray images
pertaining to anti-viral antibody presence
with or without HCC venous invasion.
Table 1. The average Pearson’s correlation
coefficients of gene expression between
Otsu’s, GMM, ICM and commercial
software “ArrayPro Analyzer.”
Average Pearson
correlation coefficient
cy3
cy5
ArrayPro v.s. Otsu’s
ArrayPro v.s. GMM
ArrayPro v.s. ICM
0.9492
0.9563
0.9538
0.9586
0.9624
0.9612
Otsu’s v.s. GMM
0.9577
0.9608
Otsu’s v.s. ICM
GMM v.s. ICM
0.9933
0.9721
0.9940
0.9734
HCC cases with anti-viral antibody(+)
HBV
HCV
(Venous Non-invasion) HCCs
Up-regulated
Down-regulated
(1) 17
36
5
(3) 16
34
7
(Venous Invasion) HCCs
Up-regulated
Down-regulated
(2) 12
75
17
(4) 5
103
27
Up-regulated gene numbers at combinatorial
subsets presence are listed as “(combo subsets)
gene number” as of the followings: (1000) 6,
(0200) 20, (0030) 1, (0004) 46, (1200) 1, (1030)
0, (1004) 4, (1230) 2, (1204) 7, (1034) 2, (0234)
8, and (1234) 14, respectively.
The computational tests are run on a
Windows-based PC with Pentium4 3.00GHz
CPU for gene expression computation and the
respective run time by the Otsu’s, GMM and
ICM algorithms are 4, 14, and 21 seconds,
respectively. The test results of the average
Pearson’s correlation coefficients on the
pair-wise algorithmic comparison are in general
greater than 95% among three proposed
methods and commercial software (Table 1).
Notably, best correlations in (cy3, cy5) are
observed in the cases of ICM versus either the
Otsu’s (0.9933, 0.9940) or GMM (0.9721,
0.9734) despite of the inferior result while
versus commercial ArrayPro software (0.9538,
0.9612).
Conclusion and Discussion
Microarray technology is popular because of
its ample mine of information. In this paper, we
demonstrate how to do gridding and computing
on a microarray image towards an important basis
of mining gene expression patterns. We propose
three different methods: Otsu’s thresholding,
GMM, and ICM algorithms to compute gene
expression of microarray images and partially
improve some drawbacks of commercial software.
From the experimental results, we observe that
the Pearson’s correlation coefficients between
these three methods and commercial software are
as high as over 95%. Moreover, the proposed
gridding methods with three segmentation
algorithms comprise nearly automatic procedures
for computing gene expression levels in a huge
microarray image without manual operation as of
the compared commercial software. The gridding
and segmentation may collectively substantiate
The pre-analysis results of microarray
image spots seem to be with significant object
intensities which may consequently assist the
accurate mining of valid biomarkers (VBM) of
various HCC categories. In this regard, the
preliminary data of the expression profile
analysis among 50 HCC microarray images has
revealed a collective set of 138 up-regulated
genes (Table 2) with the up-regulated gene
numbers of 36, 75, 34, and 103 within
respective categories of HBV-related (1)
48
華岡農科學報第十六期
the software system of automatic image data
analysis (Aida) on microarray images either
without or with reference spots.
F.C.P. Holstege, I.F. Kim, V. Markowitz,
J.C. Matese, H. Parkinson, A. Robinson, U.
Sarkans, S. Schulze-Kremer, J. Stewart, R.
Taylor, J. Vilo, and M. Vingron. (2001)
Minimum Information about Microarray
Experiment (MIAME) toward Standards
for Microarray Data. Nature Genetics.
29:365-371.
With highly reproducible results of gene
expression
computation,
subsequent
normalization procedure can be imposed at
both system level and expression level
respectively with MA-plot and maximal
invariant subset algorithms. Preliminarily
selected
biomarker
genes
of
tumor
angiogenesis (tag) may be with great clinical
value of HCC stage definition and treatment
prognosis.
However,
the
value
of
tag-microarray may exert in assisting the
development on preventative measures of
diagnosis and therapy. Respectively, the
preventative diagnostics may include serum
proteins of connective tissue growth factor and
peripheral
blood
mononuclear
cells.
Preventative
therapeutics
may
include
high-throughput screening of botanical drug
products with health-regain efficacy evidenced
with the gene expression profile of
tag-microarray data along with drug toxicity
genes onto the cell culture evaluation system.
Causton, H.C., J. Quackenbush, and A. Brazma.
(2003) Microarray Gene Expression
Analysis. Blackwell Publishing, New
York. USA.
Chen, C.C. and R.C. Dubes. (1990)
Environmental
Studies of
Iterated
Conditional
Modes
Segmentation
Algorithm.
Journal of Information
Science and Engineering. 6:325-337.
Chen, C.N., J.J. Lin, J.J.W. Chen, P.H. Lee, C.Y.
Yang, M.L. Kuo, K.J. Chang, and F.J.
Hsieh. (2005) Gene expression profile
predicts patient survival of gastric cancer
after surgical resection. Journal of Clinical
Oncology. 23(29):7286-7295.
Dempster, A.P., N.M. Laird, and D.B. Rubin.
(1977) Maximum likelihood from
incomplete data via the EM algorithm.
Journal of the Royal Statistical Society.
39:1-38.
Acknowledgments
The authors would like to thank the
National Science Council (NSC), the Council
of Agriculture (COA), and the Department of
Industrial Technology (DoIT) of Ministry of
Economic Affairs (MOEA), Taiwan for
financial support: NSC 94-2213-E-007-089,
NSC 92-2745-B-034-001, COA 94-AS-5.2.1AD-U2, and MOEA 93-EC-17-A-19- S1- 0016.
Dubes, R.C., A.K Jain, S.C. Sateesha, and C.C.
Chen
(1990)
MRF
Model-Based
Algorithms for Image Segmentation. in
Proc. IEEE International Conference on
Pattern Recognition. 808-814.
Fukunaga, K. (1972) Introduction to Statistical
Pattern Recongnition.
New York
Academic. 260-267.
References
Gonzalez, R.C. and R.E. Woods. (2002) Digital
Image
Processing.
Prentice-Hall
Publishing, New York, USA.
Besag, J. (1986) On the Statistical Analysis of
Dirty Pictures. J. Royal Stat. Soc. Ser. B.
48:259-302.
Otsu, N. (1979) A Threshold Selection Method
from Gray-Level Histograms. Journal of
IEEE Transactions on Systems, Man, and
Cybernetics. SMC-9:62-66.
Brazma, A., P. Hingamp, J. Quackenbush, G.
Sherlock, P. Spellman, C. Stoeckert, J.
Aach, W. Ansorge, C.A. Ball, H.C.
Causton, T. Gaasterland, P. Glenisson,
49
中華民國 94 年 12 月
華岡農科學報第十六期:43-50
Hwa Kang Journal of Agriculture
Vol. 16:43-50, December 2005
基因表現程度嚴謹計算的微陣影像初期分析演算核心
蔡明岳 1 張春梵 2
朱學亭 3 詹鎮熊 4
張金堅 5
高成炎 6 陳朝欽 7
摘要: 轉錄組檢體的微陣雜合分析是篩選表現型性狀或疾病之實效生物標記的有效技術平台,實
際操作係經由同時雜合詢答將近整體基因組相關基因探針所產製的微陣晶片。微陣影像的計算處
理至今仍是探採轉錄組生物標記的底層根基,主要任務為攫取個別圖點物件的真實緻密程度以便
進行精確的基因表現程度分析,嚴謹的前處理操作程序包括螢幕顯示輸入影像數據、特用型紅綠
藍與紅綠黃黑的 24 與 16 位元灰階轉換、整體物件佈局格網化、與物件緻密程度 16 與 8 位元互換
等;本篇報告實作 Otsu 簡單閾值、高斯混合模態、與迭代條件模式的自有演算核心,執行物件格
網化與分割的高度可信前處理,以便計算微陣影像數據的基因表現類型達成後續的阿伊達自動影
像數據分析專家系統。經比較微陣影像分析商業軟體,本篇報告的阿伊達演算核心展現較高的平
均皮爾森關連係數 0.9933 與 0.9721 於 ICM/Otsu 與 ICM/GMM 分組的整體 cy3 數值,而商業軟體卻
是呈現稍差的 0.9538 於 ICM/ArrayPro 分組;增益的相關議題所實作完成的阿伊達演算模組則是進
行整合調校,並應用於統計分析計算源自早先 HBV 與 HCV 不同感染病因的 50 個臨床肝癌檢體微
陣影像。
關鍵詞: 微陣、基因表現程度、格網化、分割、簡單閾值、高斯混合模態、迭代條件模式
1 國立清華大學資訊工程學系碩士研究生,第一作者。
2 中國文化大學農學院生物科技研究所副教授,共同第一作者。
3 亞洲大學資訊工程學系助理教授。
4 國立台灣大學資訊工程學系博士後研究員。
5 國立台灣大學醫學系教授暨血管新生研究中心主任。
6 國立台灣大學資訊工程學系教授。
7 國立清華大學資訊工程學系教授,通訊作者。
50
Download