
IEIE TRANSACTIONS ON SMART PROCESSING AND COMPUTING
www.ieiespc.org
ISSN 2287-5255
Officers
President
Steering Committee
Editor-in-Chief:
Byung Gook Park, Seoul National Univ.
Joonki Paik
Chung-Ang University, South Korea
paikj@cau.ac.kr
President-elect
Yong Seo Koo, Dankook Univ.
Vice President
Daesik Hong, Yonsei Univ.
Hong June Park, POSTECH
Hyun Wook Park, KAIST
Joonki Paik, Chung-Ang Univ.
Seung Kwon Ahn, LG Electronics
Seon Wook Kim
Korea University, South Korea
seon@korea.ac.kr
Founder
Minho Jo
Korea University, South Korea
minhojo@korea.ac.kr
Editors
Associate Editors
Anand PRASAD
NEC, Japan
anand@bq.jp.nec.com
André L. F. de ALMEIDA
Universidade Federal do Ceará, Brazil
andre@gtel.ufc.br
Aldo W. MORALES
Penn State University, USA
awm2@psu.edu
Aggelos K. KATSAGGELOS
Northwestern University, USA
aggk@ece.northwestern.edu
Athanasios (Thanos) VASILAKOS
University of Western Macedonia, Greece
vasilako@ath.forthnet.gr
Bong Jun KO
IBM T. J. Watson Research Center, USA
bongjun_ko@us.ibm.com
Dong Gyu SIM
Kwangwoon University, South Korea
dgsim@kw.ac.kr
Eun-Jun YOON
Jinsuk BAEK
Winston-Salem State University, NC, USA
baekj@wssu.edu
Kyu Tae LEE
SeungWon JUNG
Dongguk University, South Korea
swjung83@gmail.com
Sunghyun CHOI
Seoul National University, South Korea
schoi@snu.ac.kr
Kyungil University, South Korea
ejyoon@kiu.ac.kr
Kongju National University, South Korea
ktlee@kongju.ac.kr
Motorola, USA
faisal@motorola.com
Faisal ISHTIAQ
University of Ioannina, Greece
lkon@cs.uoi.gr
McGill University, Canada
heungsun.hwang@mcgill.ca
Heungsun HWANG
Osaka University, Japan
lei.shu@ieee.org
I2R, Singapore
sunsm@i2r.a-star.edu.sg
Ho-Jin CHOI
Dalian University of Technology, China
lei.wang@ieee.org
Lei WANG
Imperial College, UK
t.stathaki@imperial.ac.uk
Hae Yong KIM
Nanjing University of Posts and
Telecommunications, China
liang.zhou@ieee.org
Liang ZHOU
Huazhong University of Science and
Technology, Hubei, China
Tao.Jiang@ieee.org
Monson HAYES
Tokyo University of Agriculture and
Technology, Japan
tanakat@cc.tuat.ac.jp
Lisimachos P. KONDI
KAIST, South Korea
hojinc@kaist.ac.kr
Soo Yong CHOI
Yonsei University, South Korea
csyong@yonsei.ac.kr
Sumei SUN
Lei SHU
Tania STATHAKI
Tao JIANG
Byeungwoo JEON
University of São Paulo, Brazil
hae@lps.usp.br
Byonghyo SHIM
University of Palermo, Italy
ilenia.tinnirello@tti.unipa.it
Georgia Institute of Tech. USA
mhh3@gatech.edu
Chang D. YOO
Qualcomm, USA
insungk@qualcomm.com
Insung KANG
Soongsil University, South Korea
mhong@e.ssu.ac.kr
Charles Casimiro CAVALCANTE
Kangwon University, South Korea
ihwang@kangwon.ac.kr
Inchul HWANG
Singapore University of Technology and
Design, Singapore
ngaiman_cheung@sutd.edu.sg
Yonsei University, South Korea
wro@yonsei.ac.kr
Oscar AU
KAIST, South Korea
wchoi@ee.kaist.ac.kr
Peng MUGEN
Huazhong University of Science and
Technology, China
xhge@mail.hust.edu.cn
Sungkyunkwan University, South Korea
bjeon@skku.edu
Korea University, South Korea
bshim@korea.ac.kr
KAIST, South Korea
cdyoo@ee.kaist.ac.kr
Universidade Federal do Ceará, Brazil
charles@gtel.ufc.br
Chun Tung CHOU
Ilenia TINNIRELLO
Min-Cheol HONG
Ngai-Man CHEUNG
Jaehoon LEE
The University of New South Wales,
Australia
ctchou@cse.unsw.edu.au
Korea University, South Korea
ejhoon@korea.ac.kr
Chun-Ting CHOU
University of Wollongong, Australia
jhk@uow.edu.au
Daesik HONG
Yonsei University, South Korea
Jun JO
National Taiwan University, Taiwan
chuntingchou@cc.ee.ntu.edu.tw
daesikh@yonsei.ac.kr
Daniel da COSTA
Federal University of Ceara (UFC), Brazil
danielbcosta@ieee.org
Daji QIAO
Iowa State University, USA
daji@iastate.edu
Dongsoo KIM
Indiana University-Purdue University,
USA
dskim@iupui.edu
Daqiang ZHANG
Jung Ho KIM
Griffith University, Australia
j.jo@griffith.edu.au
Hong Kong University of Science and
Technology, Hong Kong
eeau@ust.hk
Beijing Univ. of Posts and
Telecommunications, China
pmg@bupt.edu.cn
Qiang NI
Brunel University, UK
Qiang.Ni@brunel.ac.uk
Joonki PAIK
University of Waterloo, Canada
rxlu@bbcr.uwaterloo.ca
Chung-Ang University, South Korea
paikj@cau.ac.kr
Jaime Lloret MAURI
Polytechnic University of Valencia, Spain
illoret@dcom.upv.es
Junmo KIM
KAIST, South Korea
junmo@ee.kaist.ac.kr
Dong Sun KIM
Nanyang Technological University,
Singapore
sxyu1@ualr.edu
Junsong YUAN
Stefan MANGOLD
Disney Research Laboratory, Switzerland
stefan@disneyresearch.com
Shucheng YU
University of Arkansas at Little Rock, USA
jsyuan@ntu.edu.sg
Sasitharan BALASUBRAMANIAM
Waterford Institute of Technology, Ireland
sasib@tssg.org
IMT, Italy
stsaft@gmail.com
Won Woo RO
Wan CHOI
Xiaohu GE
Xudong WANG
Univ. of Michigan-Shanghai Jiao Tong
Yo-Sung HO
Rongxing LU
Sotirios TSAFTARIS
Taesang YOO
Qualcomm, USA
taesangy@qualcomm.com
University Joint Institute, China
wxudong@ieee.org
Jianhua HE
Swansea University, UK
j.he@swansea.ac.uk
Institute Telecom, France
Nanjing Normal University, China
dqzhang@njnu.edu.cn
KETI, South Korea
dskim@keti.re.kr
Tanaka TOSHIHISA
Gwangju Institute of Science and
Technology (GIST), South Korea
hoyo@gist.ac.kr
Yan (Josh) ZHANG
Simula Research Laboratory, Norway
yanzhang@ieee.org
Young-Ro KIM
Myongji College, South Korea
foryoung@mjc.ac.kr
Yongsheng GAO
Griffith University, Australia
yongsheng.gao@griffith.edu.au
Won-Yong SHIN
Dankook University, South Korea
wyshin@dankook.ac.kr
Administration & Office
Journal Coordinator
Prof. Kang-Sun CHOI
Korea University of Technology and Education, South Korea
ks.choi@koreatech.ac.kr
Prof. Youngsun HAN
Kyungil University, South Korea
youngsun@kiu.ac.kr
Administration Manager
Ms. Yunju KIM
inter@theieie.org
TEL) +82-2-553-0255(ext.4) FAX) +82-2-552-6093
THE INSTITUTE OF ELECTRONICS AND INFORMATION ENGINEERS
Room #907 Science and Technology New Building (635-4, Yeoksam-dong) 22, Teheran-ro 7-gil, Gangnam-gu 135-703, Seoul, Korea
TEL : +82-2-553-0255~7 , FAX : +82-2-552-6093
http://www.ieiespc.org/
IEIE Transactions on Smart Processing and Computing, vol. 4, no. 4, August 2015
http://dx.doi.org/10.5573/IEIESPC.2015.4.4.195
Median Filtering Detection of Digital Images Using Pixel
Gradients
Kang Hyeon RHEE
Dept. of Electronics Eng. and School of Design and Creative Eng., Chosun University / Gwangju 501-759, Korea
khrhee@chosun.ac.kr
* Corresponding Author: Kang Hyeon RHEE (khrhee@chosun.ac.kr)
Received July 15, 2015; Revised August 5, 2015; Accepted August 24, 2015; Published August 31, 2015
* Short Paper: This paper is invited by Seung-Won Jung, the associate editor.
* Extended from a Conference: Preliminary results of this paper were presented at the ITC-CSCC 2015. The present paper
has been accepted by the editorial board through the regular reviewing process that confirms the original contribution.
Abstract: For median filtering (MF) detection in altered digital images, this paper presents a new
feature vector that is formed from autoregressive (AR) coefficients via an AR model of the
gradients between the neighboring row and column lines in an image. Subsequently, the defined
10-D feature vector is trained in a support vector machine (SVM) for MF detection among forged
images. The MF classification is compared to the median filter residual (MFR) scheme that had the
same 10-D feature vector. In the experiment, three test items are measured: the area under the receiver
operating characteristic (ROC) curve (AUC), the classification ratio, and the minimal average decision
error. The performance is excellent when unaltered (ORI) or once-altered images, such as 3×3
average-filtered (AVE3), QF=90 JPEG (JPG90), and 90% down-scaled and 110% up-scaled (DN0.9 and
UP1.1) images, are distinguished from 3×3 and 5×5 median-filtered images (MF3 and MF5, respectively)
and their composite set (MF35). When the forged image is post-altered with AVE3, DN0.9,
UP1.1 and JPG70 after MF3, MF5 and MF35, the performance of the proposed scheme is lower
than that of the MFR scheme, although the proposed feature vector retains a superior classification
ratio against AVE3. Nevertheless, across unaltered, once-altered and post-altered images versus MF3,
MF5 and MF35, the AUC obtained from 'sensitivity' (the true positive rate) and '1-specificity' (the
false positive rate) approaches 1. Thus, it is confirmed that the grade evaluation of the proposed
scheme can be rated as 'Excellent (A)'.
Keywords: Forgery image, Median filtering (MF), Median filtering detection, Median filter residual (MFR),
Median filtering forensic, Autoregressive (AR) model, Pixel gradient
1. Introduction
In image alteration, content-preserving manipulation
includes compression, filtering, averaging, rotation, mosaic
editing, scaling, and so on [1-4]. Median filtering (MF) is
especially preferred among some forgers because it is a
non-linear filter based on order statistics. Accordingly, an MF
detection technique can classify images altered by MF.
The state of the art is well-documented [5-9]. Consequently,
an MF detector becomes a significant forensic tool for
recovery of the processing history of a forged image.
To detect MF in a forged image, Cao et al. [10]
analyzed the probability that an image's first-order pixel
difference is zero in textured regions. In this regard,
Stamm et al. [2] described a method that is highly
accurate with unaltered or uncompressed images.
Meanwhile, to extract the feature vector for median
filtering detection, Kang et al. [6] obtained autoregressive
(AR) coefficients as feature vectors via an AR model to
analyze the median filter residual (MFR), which is the
difference between the original image and its median-filtered
version.
In this paper, a new MF detection algorithm is proposed in which the feature vector is formed from AR
coefficients via an AR model of the gradients of the
neighboring horizontal and vertical lines in an image.
The rest of the paper is organized as follows. Section 2
briefly introduces the theoretical background of MFR, and
a gradient of the neighboring lines in an image. Section 3
describes the extraction method of a new feature vector in
the proposed MF detection algorithm. The experimental
results of the proposed algorithm are shown in Section 4.
The performance evaluation is compared to a previous one,
and is followed by some discussion. Finally, the conclusion is drawn, and future work presented in Section 5.
2. Theoretical Background

In this section, the MFR and the gradient of the neighboring lines in an image are briefly introduced.

2.1 MFR

Kang et al. proposed the MFR [6], using a 10-D feature vector computed from the AR coefficients of the difference values between the original image (y) and its median-filtered image (med(y)). Yuan [7] attempted to reduce interference from an image's edge content and block artifacts from JPEG compression, proposing to gather detection features from an image's MFR.

The difference values (d) between the original image and its median-filtered image are AR-modeled. The difference is referred to as the median filter residual, which is formally defined as

$d(i, j) = \mathrm{med}_w(y(i, j)) - y(i, j) = z(i, j) - y(i, j)$    (1)

where $(i, j)$ is a pixel coordinate, and $w$ is the MF window size ($w \in \{3, 5\}$). The AR coefficients $a_k$ are computed as

$a_k^{(r)} = \mathrm{AR}(\mathrm{mean}(d^{(r)}))$    (2)

$a_k^{(c)} = \mathrm{AR}(\mathrm{mean}(d^{(c)}))$    (3)

$a_k = (a_k^{(r)} + a_k^{(c)}) / 2$    (4)

where $r$ and $c$ denote the row and column directions, $k$ is the AR order number, and $a_k^{(r)}$ and $a_k^{(c)}$ are the AR coefficients in the row and column directions, respectively. A single $a_k$ for a one-dimensional AR model is then obtained from Eq. (4) as the average of the AR coefficients of both directions in Eqs. (2) and (3).

The author attempts to reduce the dimensionality of the feature vector according to an image's statistical properties, fitting the MFR to a one-dimensional AR model in the row direction:

$d(i, j) = -\sum_{q=1}^{p} a_k^{(r)} d(i, j - q) + \varepsilon^{(r)}(i, j),$    (5)

and in the column direction:

$d(i, j) = -\sum_{q=1}^{p} a_k^{(c)} d(i - q, j) + \varepsilon^{(c)}(i, j),$    (6)

where $\varepsilon^{(r)}(i, j)$ and $\varepsilon^{(c)}(i, j)$ are the prediction errors [11], and $p$ refers to the order of the AR model. Here, again, the AR coefficients are computed from the difference image (d).

2.2 Gradient of Neighboring Lines in Image

The gradients of the neighboring row- and column-direction lines in an image (x) are defined as $G^{(r)}$ and $G^{(c)}$, respectively, as follows:

$G^{(r)}(i, j) = x(i, j + 1) - x(i, j)$    (7)

$G^{(c)}(i, j) = x(i + 1, j) - x(i, j)$    (8)

3. The Proposed MF Detection Algorithm

For the proposed MF detection algorithm, AR coefficients are computed via an AR model with Eqs. (7) and (8), and then Eqs. (9)-(11), as follows:

$g_k^{(r)} = \mathrm{AR}(\mathrm{mean}(G^{(r)}))$    (9)

$g_k^{(c)} = \mathrm{AR}(\mathrm{mean}(G^{(c)}))$    (10)

$g_k = (g_k^{(r)} + g_k^{(c)}) / 2$    (11)
In Eq. (11), $g_k$ [1st:10th] is formed as a 10-D feature
vector. The flow diagram of the proposed algorithm for
MF detection is shown in Fig. 1.
The MF detection algorithm is described in the
following steps, and is presented in Fig. 2.
[Step 1] Compute the neighboring row and column line
gradients in the image.
[Step 2] Build the AR model of Step 1’s gradients.
[Step 3] In Step 2, AR coefficients [1st :10th] of the
gradients are formed as a 10-D feature vector.
[Step 4] The feature vector is trained in an SVM
classifier.
[Step 5] Implement the MF detector via the trained
SVM.
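As an illustration of Steps 1-3, the sketch below computes the line gradients of Eqs. (7) and (8), fits a 10th-order AR model through the Yule-Walker equations, and averages the two coefficient vectors as in Eq. (11). The axis over which mean(G) is taken and the use of a Yule-Walker estimator are assumptions for illustration; the paper does not spell out either detail.

```python
import numpy as np

def ar_coeffs(signal, order=10):
    """Estimate AR coefficients of a 1-D signal via the Yule-Walker equations."""
    s = signal - signal.mean()
    n = len(s)
    # Biased autocovariance estimates r[0..order]
    r = np.array([np.dot(s[:n - k], s[k:]) / n for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:])

def mf_feature_vector(img, order=10):
    """10-D feature vector from AR models of row/column pixel gradients (Eqs. (7)-(11))."""
    img = img.astype(np.float64)
    g_row = img[:, 1:] - img[:, :-1]    # Eq. (7): row-direction (horizontal) gradient
    g_col = img[1:, :] - img[:-1, :]    # Eq. (8): column-direction (vertical) gradient
    sig_r = g_row.mean(axis=0)          # assumed: average into one 1-D signal per direction
    sig_c = g_col.mean(axis=1)
    g_r = ar_coeffs(sig_r, order)       # Eq. (9)
    g_c = ar_coeffs(sig_c, order)       # Eq. (10)
    return (g_r + g_c) / 2.0            # Eq. (11): 10-D feature vector g_k
```

The resulting 10-D vectors are what Step 4 feeds to the SVM classifier.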
4. Performance Evaluation
The proposed scheme uses a C-SVM with the Gaussian kernel of Eq. (12) on the 10-D feature vector:

$K(x_i, x_j) = \exp(-\gamma \| x_i - x_j \|^2)$    (12)
Fig. 3. The feature vector distribution of the MFR.
Fig. 1. The flow diagram of the proposed MF detection
algorithm.
Fig. 4. The feature vector distribution of the proposed
MF detection.
Main Median Filtering Detection
  Begin Feature Vector Extraction
    Gradients ← Neighboring Row and Col. Lines in Image
    Feature Vector ← AR_Model(Gradients)
  End Feature Vector Extraction
  Begin Training Feature Vector
    SVM Classifier(Feature Vector)
  End Training Feature Vector
  Begin Test Images
    Feature Vectors of Test Images → Trained SVM Classifier
  End Test Images
  Begin Classification and Analysis
    Score, Classification and Confusion Table by Trained SVM Classifier
    Median Filtering Decision ← Analyze Confusion Table
  End Classification and Analysis
Leave Median Filtering Detection
Fig. 2. Proposed MF detection.
Fig. 5. Average AR coefficients of ‘A’ group images of
the proposed MF detection scheme.
The 10-D feature vector is fed to the SVM classifier and trained with five-fold cross-validation in conjunction with a grid search for the best parameters C and γ in the multiplicative grid

$(C, \gamma) \in \{(2^i, 2^j) \mid 4i, 4j \in \mathbb{Z}\}$    (13)
Fig. 6. ROC curves of the MFR [6] scheme.
The searching step size for i,j is 0.25, and those
parameters are used to get the classifier model on the
training set.
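As a concrete illustration of this training procedure, the following sketch runs the five-fold cross-validated grid search over C and γ with scikit-learn. The choice of library, the exponent range of the grid, and the file names are assumptions; only the RBF kernel, the 0.25 exponent step, and the five-fold cross-validation come from the text.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Hypothetical arrays: rows are 10-D gradient-AR feature vectors (Eq. (11)),
# labels are 1 for median-filtered images and 0 otherwise.
X_train = np.load("features_train.npy")   # placeholder file names
y_train = np.load("labels_train.npy")

# Multiplicative grid of Eq. (13): exponents i, j on a 0.25 step (range assumed).
exponents = np.arange(-4, 4.25, 0.25)
param_grid = {"C": 2.0 ** exponents, "gamma": 2.0 ** exponents}

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)  # five-fold cross-validation
search.fit(X_train, y_train)
print(search.best_params_)
```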
The 1,388-image UCID (Uncompressed Color Image Database) [12] is used for MF detection, and the test image types prepared were MF3, MF5, unaltered (ORI), 3×3 average filtering (AVE3), JPEG (QF=90), and 90% down-scaled and 110% up-scaled (DN0.9 and UP1.1) images. Subsequently, the trained classifier model was used to perform classification on the testing set. From the 1,388-image UCID DB, 1,000 images were randomly selected for training, and the other 388 images were allocated to testing.
In Figs. 3 and 4, the feature vector distributions of the
MFR and the proposed MF detection scheme, respectively,
are presented.
The test image group was prepared with three kinds:
• Group A: the unaltered and the once-altered images.
- ORI
- AVE3
- JPG90
- DN0.9
- UP1.1
Fig. 7. ROC curves of the proposed MF detection scheme.
• Group B: Post-altered two times more, after MF3.
- MF3+AVE3+JPG70
- MF3+DN0.9+JPG70
- MF3+UP1.1+JPG70
• Group C: Post-altered two times more, after MF5.
- MF5+AVE3+JPG70
- MF5+DN0.9+JPG70
- MF5+UP1.1+JPG70
Fig. 5 presents the average AR coefficients of the ‘A’
group images.
In Fig. 6, ROC curves show the performance for MFw versus the test images under the MFR [6] scheme, and in Fig. 7, ROC curves show the performance for MFw versus the test images under the proposed MF detection scheme.
Table 1 shows the experimental results for MFw versus the test image types in terms of AUC, Pe (the minimal average decision error under the assumption of equal priors and equal costs [13]), and the classification ratio:

$P_e = \min \dfrac{P_{FP} + 1 - P_{TP}}{2}$    (14)
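For reference, the AUC and the minimal average decision error Pe of Eq. (14) can be computed from the detector scores as follows; the use of scikit-learn and the placeholder file names are assumptions.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical detector outputs: y_true holds ground-truth MF labels,
# scores holds the SVM decision values for the test images.
y_true = np.load("labels_test.npy")
scores = np.load("svm_scores_test.npy")

auc = roc_auc_score(y_true, scores)
fpr, tpr, _ = roc_curve(y_true, scores)
pe = np.min((fpr + (1.0 - tpr)) / 2.0)   # Eq. (14): minimal average decision error
print(f"AUC = {auc:.4f}, Pe = {pe:.4f}")
```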
The above procedure was repeated 30 times to reduce performance variations caused by different selections of the training samples. The detection accuracy, which is the arithmetic average of the true positive (TP) rate and true negative (TN) rate, was averaged over the 30 random experiments [9].

Table 1. Performance comparison between the MFR and the proposed MF detection scheme.
MFw: median filtering window size, w ∈ {3, 5, 35}
RI (Result Item) 1: AUC, 2: Pe, 3: Classification ratio
Group A (a: ORI, b: AVE3, c: JPG90, d: DN0.9, e: UP1.1)
Group B (a: MF3+AVE3+JPG70, b: MF3+DN0.9+JPG70, c: MF3+UP1.1+JPG70)
Group C (a: MF5+AVE3+JPG70, b: MF5+DN0.9+JPG70, c: MF5+UP1.1+JPG70)
From Table 1, it can be seen that the performance of the proposed scheme is excellent for MF3, MF5 and MF35 versus the ORI, AVE3, JPG90, DN0.9 and UP1.1 images compared to the MFR scheme. For forged images that were post-altered two more times with AVE3, DN0.9, UP1.1 and JPG70 after MF3 and MF5, the performance of the scheme is lower than that of the MFR scheme. In particular, the feature vector in this paper has a superior classification ratio against AVE3.
However, in the measured performances of all items,
AUC by ‘sensitivity’ (TP) and ‘1-specificity’ (FP) was
achieved closer to 1. Thus, it was confirmed that the grade
evaluation of the proposed algorithm can be rated as
‘Excellent (A)’.
In all the above experiments, the proposed MF detection considered only the AR coefficients of the image lines' gradients to form the feature vector in the spatial domain.

5. Conclusion

This paper proposes a new robust MF detection scheme. From an AR model of an image's pixel gradients, the scheme uses the AR coefficients as MF detection feature vectors. The proposed MF detection scheme is compared to the MFR [6], so these results will serve as further research content on MF detection.
This appears to be a complete solution of an AR model built from the gradients of the neighboring row and column lines in a variety of images. Moreover, despite the short length of the proposed feature vector, the performance results are excellent in terms of AUC, the classification ratio is more than 0.9, and Pe is close to 0.
Future work should consider a performance evaluation on smaller sizes of an altered image, such as 64 × 64 or 32 × 32, which were not considered in this paper.
Finally, the proposed approach can also be applied to solve different forensic problems, like previous MF detection techniques.

Acknowledgement

This research was supported by the Ministry of Trade, Industry and Energy (MOTIE), KOREA, through the Education Program for Creative and Industrial Convergence. (Grant Number N0000717)
References
[1] Kang Hyeon RHEE, “Image Forensic Decision
Algorithm using Edge Energy Information of Forgery
Image, ” IEIE, Journal of IEIE, Vol. 51, No. 3, pp.
75-81, March 2014. Article (CrossRef Link)
[2] Stamm, M.C., Min Wu, K.J.R. Liu, “Information
Forensics: An Overview of the First Decade,” Access
IEEE, pp. 167-200, 2013. Article (CrossRef Link)
[3] Kang Hyeon RHEE, “Forensic Decision of Median
Filtering by Pixel Value's Gradients of Digital
Image,” IEIE, Journal of IEIE, Vol. 52, No. 6, pp.
79-84, June 2015. Article (CrossRef Link)
[4] Kang Hyeon RHEE, “Framework of multimedia
forensic system,” Computing and Convergence Technology (ICCCT), 2012 7th International Conference
on, IEEE Conf. Pub., pp. 1084-1087, 2012. Article
(CrossRef Link)
[5] Chenglong Chen, Jiangqun Ni and Jiwu Huang,
“Blind Detection of Median Filtering in Digital
Images: A Difference Domain Based Approach,”
Image Processing, IEEE Transactions on, Vol. 22, pp.
4699-4710, 2013. Article (CrossRef Link)
[6] Xiangui Kang, Matthew C. Stamm, Anjie Peng, and
K. J. Ray Liu, “Robust Median Filtering Forensics
Using an Autoregressive Model,” IEEE Trans. on
Information Forensics and Security, vol. 8, no. 9, pp.
1456-1468, Sept. 2013. Article (CrossRef Link)
[7] H. Yuan, “Blind forensics of median filtering in
digital images,” IEEE Trans. Inf. Forensics Security,
Vol. 6, no. 4, pp. 1335–1345, Dec. 2011. Article
(CrossRef Link)
[8] Tomáš Pevný, “Steganalysis by Subtractive Pixel
Adjacency Matrix,” Information Forensics and
Security, IEEE Transactions on, Vol. 5, pp. 215-224,
2010. Article (CrossRef Link)
[9] Yujin Zhang, Shenghong Li, Shilin Wang and Yun
Qing Shi, “Revealing the Traces of Median Filtering
Using High-Order Local Ternary Patterns,” Signal
Processing Letters, IEEE, Vol. 21, pp. 275-279, 2014.
Article (CrossRef Link)
[10] G. Cao, Y. Zhao, R. Ni, L. Yu, and H. Tian,
“Forensic detection of median filtering in digital
images,” in Multimedia and Expo (ICME), 2010, Jul.
2010, pp. 89–94, 2010. Article (CrossRef Link)
[11] S. M. Kay, Modern Spectral Estimation: Theory and
Application, Englewood Cliffs, NJ, USA: Prentice-Hall, 1998.
[12] Article (CrossRef Link) (2015.4.1)
[13] M. Kirchner and J. Fridrich, “On detection of median
filtering in digital images.” In Proc. SPIE, Electronic
Imaging, Media Forensics and Security II, vol. 7541,
pp. 1–12, 2010. Article (CrossRef Link)
Kang Hyeon RHEE graduated and
received a BSc and an MSc in Electronics Engineering from Chosun University, Korea, in 1977 and 1981,
respectively. In 1991, he was awarded
a Ph.D. in Electronics Engineering
from Ajou University, Korea. Since
1977, Dr. Rhee has been with the Dept.
of Electronics Eng. and School of Design and Creative
Engineering, Chosun University, Gwangju, Korea. His
current research interests include Embedded System
Design related to Multimedia Fingerprinting/Forensics. He
is on the committee of the LSI Design Contest in Okinawa,
Japan. Dr. Rhee is also the recipient of awards such as the
Haedong Prize from the Haedong Science and Culture
Juridical Foundation, Korea, in 2002 and 2009.
IEIE Transactions on Smart Processing and Computing, vol. 4, no. 4, August 2015
http://dx.doi.org/10.5573/IEIESPC.2015.4.4.202
New Inference for a Multiclass Gaussian Process
Classification Model using a Variational Bayesian EM
Algorithm and Laplace Approximation
Wanhyun Cho1, Sangkyoon Kim2 and Soonyoung Park3
1 Department of Statistics, Chonnam National University, Gwangju 500-757, South Korea (whcho@chonnam.ac.kr)
2 Department of Electronics Engineering, Mokpo National University, Chonnam, South Korea (narciss76@mokpo.ac.kr)
3 Department of Electronics Engineering, Mokpo National University, Chonnam, South Korea (sypark@mokpo.ac.kr)
* Corresponding Author: Sangkyoon Kim
Received July 15, 2015; Revised August 5, 2015; Accepted August 24, 2015; Published August 31, 2015
* Short Paper
* Extended from a Conference: Preliminary results of this paper were presented at the ITC-CSCC, Summer 2015. The
present paper has been accepted by the editorial board through the regular reviewing process that confirms the original
contribution.
Abstract: In this study, we propose a new inference algorithm for a multiclass Gaussian process
classification model using a variational EM framework and the Laplace approximation (LA)
technique. This is performed in two steps, called expectation and maximization. First, in the
expectation step (E-step), using Bayes’ theorem and the LA technique, we derive the approximate
posterior distribution of the latent function, indicating the possibility that each observation belongs
to a certain class in the Gaussian process classification model. In the maximization step, we
compute the maximum likelihood estimators for hyper-parameters of a covariance matrix necessary
to define the prior distribution of the latent function by using the posterior distribution derived in
the E-step. These steps iteratively repeat until a convergence condition is satisfied. Moreover, we
conducted the experiments by using synthetic data and Iris data in order to verify the performance
of the proposed algorithm. Experimental results reveal that the proposed algorithm shows good
performance on these datasets.
Keywords: Multiclass Gaussian process classification model, Variational Bayesian EM algorithm, Laplace
approximation technique, Latent function, Softmax function, Synthetic data, Iris data
1. Introduction
Gaussian process (GP) can be conveniently used to
specify prior distributions of hidden functions for Bayesian
inference. In the case of regression with Gaussian noise,
inference can be done simply in closed form, since the
posterior is also a GP. But in the case of classification,
exact inference is analytically intractable because the
likelihood function is given as a non-Gaussian form.
One prolific line of attack is based on approximating
the non-Gaussian posterior with a tractable Gaussian
distribution. Three different types of solutions have been
suggested in the recent literature [1]. These are the Laplace
approximation (LA) and expectation propagation (EP),
Kullback-Leibler divergence minimization comprising
variational bounding as a special case, and factorial
approximation. First, Williams et al. proposed the use of a
second-order Taylor expansion around the posterior mode
as a natural way of constructing a Gaussian approximation
to the log-posterior distribution [2]. The mode is taken as
the mean of the approximate Gaussian. Linear terms of the
log-posterior vanish because the gradient at the mode is
zero. The quadratic term of the log-posterior is given by
the negative Hessian matrix. Minka presented a new
approximation technique (EP) for Bayesian networks [3].
This is an iterative method to find approximations based
on approximate marginal moments, which can be applied
to Gaussian processes. Second, Opper et al. discussed the
relationship between the Laplace and variational
approximations, and they show that for models with
Gaussian priors and factoring likelihoods, the number of
variational parameters is actually O(N) [4]. They also
considered a problem that minimizes the KL-divergence
measure between the approximated posterior and the exact
posterior. Gibbs et al. showed that the variational methods
of Jaakkola and Jordan are applied to Gaussian processes
to produce an efficient Bayesian binary classifier [5]. They
obtained tractable upper and lower bounds for the unnormalized posterior density. These bounds are
parameterized by variational parameters that are adjusted
to obtain the tightest possible fit. Using the normalized
versions of the optimized bounds, they then compute
approximations to the predictive distributions. Third, Csato
et al. presented three simple approximations for the
calculation of the posterior mean in Gaussian process
classification [6]. The first two methods are related to
mean field ideas known in statistical physics. The third
approach is based on a Bayesian outline approach. Finally,
Kim et al. presented an approximate expectation–
maximization (EM) algorithm and the EM-EP algorithm to
learn both the latent function and hyper-parameters in a
Gaussian process classification model [7].
We propose a new inference algorithm that can
simultaneously derive both a posterior distribution of a
latent function and maximum likelihood estimators of
hyper-parameters in a Gaussian process classification
model. The proposed algorithm is performed in two steps:
called the expectation step (E-step) and the maximization
step (M-step). First, in the expectation step, using the
Bayesian formula and LA, we derive the approximate
posterior distribution of the latent function based on
learning data. Furthermore, we calculate a mean vector and
covariance matrix of the latent function. Second, in the
maximization step, using a derived posterior distribution of
the latent function, we derive the maximum likelihood
estimator for hyper-parameters necessary to define a
covariance matrix. Moreover, we conducted the
experiments by using synthetic data and Iris data in order
to verify the performance of the proposed algorithm.
The rest of this paper is organized as follows. The next
section describes a multiclass Gaussian process
classification model. In Sections 3 and 4, we propose a new inference method that can derive the approximate distribution for the posterior distribution of the latent variables and estimate the hyper-parameters of the covariance function for the prior distribution of the latent function. Section 5 includes performance evaluations and a discussion of the effects of the proposed model. Finally, we conclude this paper in the last section.
2. Multiclass Gaussian Process
Classification Model
We first consider a multiclass Gaussian process
classification model (MGPCM). The model consists of
three components: a latent function with a Gaussian
process prior distribution, a multiclass response, and a link
function that relates between the latent function and
response mean. First, we consider the multivariate latent
function. Here, we define the latent function f (x) for
Gaussian process classification having C classes at a set of
observations $x_1, \ldots, x_n$ as

$\mathbf{f}(\mathbf{x} \mid \Theta) = (f_1^1(\mathbf{x}), \ldots, f_n^1(\mathbf{x}), \ldots, f_1^c(\mathbf{x}), \ldots, f_n^c(\mathbf{x}), \ldots, f_1^C(\mathbf{x}), \ldots, f_n^C(\mathbf{x}))^T$    (1)

Then, we assume a GP prior for the latent function $\mathbf{f}(\mathbf{x})$ as defined by

$\mathbf{f}(\mathbf{x} \mid \Theta) \sim GP(\mathbf{0}, K(x_i, x_j \mid \Theta))$    (2)

where $K(x_i, x_j)$ is the covariance matrix. In this paper, we assume that the latent function $\mathbf{f}(\mathbf{x})$ represents the $C$ classes, and the individual variables of the $c$-th component vector $\mathbf{f}^c(\mathbf{x})$ of the latent function $\mathbf{f}(\mathbf{x})$ are uncorrelated. Therefore, the GP covariance matrix $K(x_i, x_j)$ can be assumed to have the following block-diagonal form:

$K(x_i, x_j \mid \Theta) = \mathrm{diag}(K^1(x_i, x_j \mid \Theta_1), \ldots, K^c(x_i, x_j \mid \Theta_c), \ldots, K^C(x_i, x_j \mid \Theta_C))$    (3)

where $K^c(x_i, x_j \mid \Theta_c) = (k^c(x_i, x_j \mid \theta_1^c, \theta_2^c))_{(n \times n)}$, $i, j = 1, \ldots, n$, is the covariance matrix for the $c$-th component vector of the latent function.
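For concreteness, the following sketch builds the block-diagonal covariance matrix of Eq. (3) from a per-class kernel. The squared-exponential form of the kernel and the meaning of the two hyper-parameters $(\theta_1^c, \theta_2^c)$ are assumptions for illustration; the paper only states that each class kernel has two hyper-parameters.

```python
import numpy as np

def class_kernel(X, theta1, theta2):
    # Assumed squared-exponential kernel: k^c(x_i, x_j) = theta1 * exp(-||x_i - x_j||^2 / theta2)
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return theta1 * np.exp(-sq / theta2)

def block_diag_cov(X, thetas):
    """Eq. (3): block-diagonal covariance of the stacked latent vector, one n-by-n block per class."""
    n, C = X.shape[0], len(thetas)
    K = np.zeros((n * C, n * C))
    for c, (t1, t2) in enumerate(thetas):
        K[c * n:(c + 1) * n, c * n:(c + 1) * n] = class_kernel(X, t1, t2)
    return K

# Example: three classes over five 2-D inputs, each class with its own hyper-parameters.
X = np.random.default_rng(0).normal(size=(5, 2))
K = block_diag_cov(X, [(1.0, 2.0), (1.5, 1.0), (0.8, 3.0)])
print(K.shape)   # (15, 15)
```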
Second, the response vector $\mathbf{Y}$ consists of independent, identically distributed multinomial random variables, where each component variable represents one of the $C$ classes. That is, let us define the response vector $\mathbf{Y}$ as

$\mathbf{Y} = (y_1^1(\mathbf{x}), \ldots, y_n^1(\mathbf{x}), \ldots, y_1^c(\mathbf{x}), \ldots, y_n^c(\mathbf{x}), \ldots, y_1^C(\mathbf{x}), \ldots, y_n^C(\mathbf{x}))^T,$    (4)

where the response vector $\mathbf{Y}$ has the same length as $\mathbf{f}(\mathbf{x})$, and each component $y_k^c$ of the $c$-th response vector $\mathbf{y}^c = (y_1^c, \ldots, y_n^c)^T$ for $c = 1, \ldots, C$ is 1 for the class that is the label of the observation and 0 for the other $C - 1$ classes. Here, we assume that the multinomial density function $p(\mathbf{Y} \mid \boldsymbol{\pi})$ of the response vector $\mathbf{Y}$ is given in the following form:

$p(\mathbf{Y} \mid \boldsymbol{\pi}) = \prod_{c=1}^{C} \prod_{k=1}^{n} (\pi_k^c)^{y_k^c},$    (5)

where the indicator variable $y_k^c$ takes one or zero with probability $\pi_k^c$ and $1 - \pi_k^c$, and $\pi_k^c$ denotes the probability that the $k$-th observation vector belongs to the particular class $c$.
Third, we consider the link function that specifies the relation between the latent function $\mathbf{f}(\mathbf{x})$ and the response mean vector $E(\mathbf{Y} \mid \mathbf{f})$. Here, the link function can be defined as

$E(\mathbf{Y} \mid \mathbf{f}) = (E(\mathbf{y}^1 \mid \mathbf{f}), \ldots, E(\mathbf{y}^c \mid \mathbf{f}), \ldots, E(\mathbf{y}^C \mid \mathbf{f}))^T,$

where

$E(\mathbf{y}^c \mid \mathbf{f}) = (E(y_1^c \mid \mathbf{f}), \ldots, E(y_k^c \mid \mathbf{f}), \ldots, E(y_n^c \mid \mathbf{f})), \quad c = 1, \ldots, C,$

and

$E(y_k^c \mid \mathbf{f}) = \pi_k^c = \dfrac{\exp(f_k^c)}{\sum_{c'=1}^{C} \exp(f_k^{c'})}, \quad k = 1, \ldots, n.$    (6)
3. Variational EM Framework and Laplace
Approximation Method
One important issue in the Gaussian process
classification model is to both derive the approximate
distribution for a posterior distribution of latent variables
and to estimate the hyper-parameters of the covariance
function for prior distribution of the latent function. One
possible approach is to consider the variational EM
algorithm that is widely used in the incomplete data.
In the E-step of the variational EM algorithm, we
derive the approximate Gaussian posterior q(f | X, Y, Θ)
for latent function value f using Laplace approximation. In
the M-step of the variational EM algorithm, we seek an
estimator of hyper-parameter Θ that can maximize a
lower bound on a logarithm of the marginal likelihood
q(Y | X, Θ) using the approximate posterior q(f | X, Y, Θ)
obtained in the E-step. The E-step and M-step are
iteratively repeated until a convergence condition is
satisfied. Our algorithm is given in detail in the following
sections.
3.1 Variational E-Step and Laplace Approximation

First, using Bayes' rule in the variational E-step, the posterior over the latent variable $\mathbf{f}$ is given by

$p(\mathbf{f} \mid X, \mathbf{y}, \Theta) = p(\mathbf{y} \mid \mathbf{f})\, p(\mathbf{f} \mid X, \Theta) / p(\mathbf{y} \mid X, \Theta),$    (7)

but because the denominator $p(\mathbf{y} \mid X, \Theta)$ is independent of the latent function $\mathbf{f}$, we need only consider the un-normalized posterior when maximizing with respect to $\mathbf{f}$. Taking the logarithm of the un-normalized posterior of the latent function $\mathbf{f}$, it can be given as

$\Psi(\mathbf{f}) = \ln p(\mathbf{f} \mid \mathbf{Y}, X, \Theta) = \ln p(\mathbf{f} \mid X, \Theta) + \ln p(\mathbf{Y} \mid \mathbf{f}) = \ln p(\mathbf{Y} \mid \mathbf{f}) - \dfrac{1}{2} \mathbf{f}^T K^{-1} \mathbf{f} - \dfrac{1}{2} \ln |K| - \dfrac{nC}{2} \ln 2\pi.$    (8)

Here, taking the first and second derivatives of Eq. (8) with respect to $\mathbf{f}$, we obtain

$\nabla \Psi(\mathbf{f}) = \nabla \ln p(\mathbf{Y} \mid \mathbf{f}) - K^{-1} \mathbf{f},$    (9)

$\nabla \nabla \Psi(\mathbf{f}) = \nabla \nabla \ln p(\mathbf{Y} \mid \mathbf{f}) - K^{-1} = -W - K^{-1},$    (10)

where $W \equiv -\nabla \nabla \ln p(\mathbf{Y} \mid \mathbf{f})$ is diagonal, since the likelihood factorizes over cases.

A natural way of constructing a Gaussian approximation to the log-posterior $\Psi(\mathbf{f}) = \ln p(\mathbf{f} \mid \mathbf{Y}, X, \Theta)$ is to perform a second-order Taylor expansion at the mode $\mathbf{m}_f$ of the posterior, i.e.,

$\mathbf{m}_f = \arg\max_{\mathbf{f}} \Psi(\mathbf{f}) = \arg\max_{\mathbf{f}} \ln p(\mathbf{Y} \mid \mathbf{f})\, p(\mathbf{f} \mid X, \Theta).$

This gives us the following equation:

$\Psi(\mathbf{f}) = \Psi(\mathbf{m}_f) + \nabla \Psi(\mathbf{f})|_{\mathbf{f}=\mathbf{m}_f} (\mathbf{f} - \mathbf{m}_f) + \dfrac{1}{2} (\mathbf{f} - \mathbf{m}_f)^T (\nabla \nabla \Psi(\mathbf{f})|_{\mathbf{f}=\mathbf{m}_f}) (\mathbf{f} - \mathbf{m}_f) = \Psi(\mathbf{m}_f) - \dfrac{1}{2} (\mathbf{f} - \mathbf{m}_f)^T (W + K^{-1}) (\mathbf{f} - \mathbf{m}_f) \cong \ln N(\mathbf{f} \mid \mathbf{m}_f, (K^{-1} + W)^{-1}).$    (11)

Thus, we have obtained a Gaussian approximate posterior $q(\mathbf{f} \mid \mathbf{Y}, X, \Theta)$ to the true posterior $p(\mathbf{f} \mid \mathbf{Y}, X, \Theta)$ with mean vector $\mathbf{m}_f$ and covariance matrix $V = (K^{-1} + W)^{-1}$. That is, using the Laplace approximation, the true posterior $p(\mathbf{f} \mid X, \mathbf{Y}, \Theta)$ of the latent function $\mathbf{f}$ is approximated as a Gaussian posterior $q(\mathbf{f} \mid X, \mathbf{Y}, \Theta)$ as follows:

$q(\mathbf{f} \mid X, \mathbf{Y}, \Theta) \sim N(\mathbf{m}_f, V = (K^{-1} + W)^{-1}).$    (12)

Here, the mode or maximum $\mathbf{m}_f$ of the log-posterior $\Psi(\mathbf{f})$ can be found iteratively using the Newton-Raphson algorithm. That is, given an initial estimate $\mathbf{m}_f$, a new estimate is iteratively found as follows:

$\mathbf{m}_f^{new} = \mathbf{m}_f - (\nabla \nabla \Psi(\mathbf{f})|_{\mathbf{f}=\mathbf{m}_f})^{-1} \nabla \Psi(\mathbf{f})|_{\mathbf{f}=\mathbf{m}_f} = \mathbf{m}_f + (K^{-1} + W)^{-1} (\nabla \ln p(\mathbf{Y} \mid \mathbf{f})|_{\mathbf{f}=\mathbf{m}_f} - K^{-1} \mathbf{m}_f) = (K^{-1} + W)^{-1} (W \mathbf{m}_f + \nabla \ln p(\mathbf{Y} \mid \mathbf{f})|_{\mathbf{f}=\mathbf{m}_f}).$    (13)
Moreover, since the log-likelihood function $\ln p(\mathbf{Y} \mid \mathbf{f})$ can be expressed as $\sum_{k=1}^{n} \ln p(y_k^1, \ldots, y_k^C \mid \mathbf{f}_k)$, we obtain the following equation by differentiating the log-likelihood function $\ln p(\mathbf{Y} \mid \mathbf{f})$ with respect to $\mathbf{f}$:

$\nabla_{\mathbf{f}} \ln p(\mathbf{Y} \mid \mathbf{f}) = \nabla_{\mathbf{f}} \left( \sum_{k=1}^{n} \ln p(y_k^1, \ldots, y_k^C \mid \mathbf{f}_k) \right) = \nabla_{\mathbf{f}} \left( \sum_{k=1}^{n} \sum_{c=1}^{C} y_k^c f_k^c - \sum_{k=1}^{n} \ln \left( \sum_{c=1}^{C} \exp(f_k^c) \right) \right) = \mathbf{Y} - \boldsymbol{\pi},$    (14)

where the vector $\boldsymbol{\pi}$ is defined by

$\boldsymbol{\pi}_{(nC \times 1)} = (\pi_1^1, \ldots, \pi_i^c, \ldots, \pi_n^C)^T, \quad \pi_i^c = \dfrac{\exp(f_i^c)}{\sum_{c^*=1}^{C} \exp(f_i^{c^*})}, \quad i = 1, \ldots, n,\ c = 1, \ldots, C.$    (15)

Second, the matrix $W$ can be given as

$W = -\nabla \nabla \ln p(\mathbf{Y} \mid \mathbf{f}) = \left( -\dfrac{\partial^2 \ln p(\mathbf{Y} \mid \mathbf{f})}{\partial \mathbf{f}\, \partial \mathbf{f}^T} \right) = \mathrm{diag}(\boldsymbol{\pi}) - \Pi^T \Pi,$    (16)

where $\Pi$ is an $(n \times nC)$ matrix obtained by horizontally stacking the diagonal matrices $\mathrm{diag}(\boldsymbol{\pi}^c)$, $c = 1, \ldots, C$. This is given in the following form:

$\Pi = \begin{bmatrix} \mathrm{diag}(\pi_1^1) & \cdots & 0 & \cdots & \mathrm{diag}(\pi_1^C) & \cdots & 0 \\ \vdots & \ddots & \vdots & \cdots & \vdots & \ddots & \vdots \\ 0 & \cdots & \mathrm{diag}(\pi_n^1) & \cdots & 0 & \cdots & \mathrm{diag}(\pi_n^C) \end{bmatrix}$    (17)

3.2 Variational M-Step

As we assume the derived approximate Gaussian posterior $q(\mathbf{f} \mid X, \mathbf{Y}, \Theta)$ is held fixed, we seek the new parameter values $\Theta_{new}$ that maximize the lower bound $F(q, \Theta)$, given in the following Eq. (18), with respect to $\Theta$:

$\ln p(\mathbf{Y} \mid X, \Theta) = \ln \int p(\mathbf{f} \mid X, \Theta)\, p(\mathbf{Y} \mid \mathbf{f})\, d\mathbf{f} = \int q(\mathbf{f}) \ln \left( \dfrac{p(\mathbf{f}, \mathbf{Y} \mid X, \Theta)}{q(\mathbf{f})} \right) d\mathbf{f} + \int q(\mathbf{f}) \ln \left( \dfrac{q(\mathbf{f})}{p(\mathbf{f} \mid \mathbf{Y}, X, \Theta)} \right) d\mathbf{f} \ge F(q, \Theta) = \int q(\mathbf{f}) \ln \left( \dfrac{p(\mathbf{f}, \mathbf{Y} \mid X, \Theta)}{q(\mathbf{f})} \right) d\mathbf{f}.$    (18)

Here, the lower bound $F(q, \Theta)$ can be written as

$F(q, \Theta) = \int q(\mathbf{f}) \ln \left( \dfrac{p(\mathbf{f} \mid X, \Theta)\, p(\mathbf{Y} \mid \mathbf{f})}{q(\mathbf{f})} \right) d\mathbf{f} = \int q(\mathbf{f}) \ln p(\mathbf{f} \mid X, \Theta)\, d\mathbf{f} + \int q(\mathbf{f}) \ln p(\mathbf{Y} \mid \mathbf{f})\, d\mathbf{f} - \int q(\mathbf{f}) \ln q(\mathbf{f})\, d\mathbf{f} = E_{q(\mathbf{f})}(\ln p(\mathbf{f} \mid X, \Theta)) + E_{q(\mathbf{f})}(\ln p(\mathbf{Y} \mid \mathbf{f})) + H(q(\mathbf{f})).$    (19)

Moreover, since the second term and the third term are independent of the hyper-parameters $\Theta$, we only need to maximize the first term, $E_{q(\mathbf{f})}(\ln p(\mathbf{f} \mid X, \Theta))$, with respect to $\Theta$. By computing $E_{q(\mathbf{f})}(\ln p(\mathbf{f} \mid X, \Theta))$ using the Gaussian posterior, we obtain:

$E_{q(\mathbf{f})}(\ln p(\mathbf{f} \mid X, \Theta)) = -\dfrac{nC}{2} \ln 2\pi - \dfrac{1}{2} \ln |K(\Theta)| - \dfrac{1}{2} E_{q(\mathbf{f})}(\mathbf{f}^T K(\Theta)^{-1} \mathbf{f}) = -\dfrac{nC}{2} \ln 2\pi - \dfrac{1}{2} \ln |K(\Theta)| - \dfrac{1}{2} E_{q(\mathbf{f})}(\mathbf{f})^T K(\Theta)^{-1} E_{q(\mathbf{f})}(\mathbf{f}) - \dfrac{1}{2} \mathrm{tr}(K(\Theta)^{-1} \mathrm{Cov}(\mathbf{f})) = -\dfrac{nC}{2} \ln 2\pi - \dfrac{1}{2} \ln |K(\Theta)| - \dfrac{1}{2} \mathbf{m}_f^T K(\Theta)^{-1} \mathbf{m}_f - \dfrac{1}{2} \mathrm{tr}(K(\Theta)^{-1} \mathrm{Cov}(\mathbf{f})).$    (20)

Here, by differentiating $E_{q(\mathbf{f})}(\ln p(\mathbf{f} \mid X, \Theta))$ with respect to $\Theta$ using the E-step result, we obtain

$\dfrac{\partial E_{q(\mathbf{f})}(\ln p(\mathbf{f} \mid X, \Theta))}{\partial \Theta} = -\dfrac{1}{2} \mathrm{tr}\left( K(\Theta)^{-1} \dfrac{\partial K(\Theta)}{\partial \Theta} \right) + \dfrac{1}{2} \left( \mathbf{m}_f^T K(\Theta)^{-1} \dfrac{\partial K(\Theta)}{\partial \Theta} K(\Theta)^{-1} \mathbf{m}_f \right) + \dfrac{1}{2} \mathrm{tr}\left( K(\Theta)^{-1} \dfrac{\partial K(\Theta)}{\partial \Theta} K(\Theta)^{-1} \mathrm{Cov}(\mathbf{f}) \right).$    (21)

Therefore, we can obtain the hyper-parameters maximizing the free energy by the following gradient update rule:

$\Theta_{new} = \Theta_{old} + \eta \left( \dfrac{\partial E_{q(\mathbf{f})}(\ln p(\mathbf{f} \mid X, \Theta))}{\partial \Theta} \right)_{\Theta = \Theta_{old}}.$    (22)
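As an illustration of the E-step above, the following sketch iterates the Newton update of Eq. (13), using the gradient Y - π of Eq. (14) and the matrix W of Eq. (16). The kernel matrix K is assumed to be the block-diagonal matrix of Eq. (3), the response Y is the stacked one-hot vector of Eq. (4), and the fixed iteration count and plain matrix inverses are simplifications rather than details from the paper.

```python
import numpy as np

def softmax_pi(f, n, C):
    """Class probabilities pi_i^c of Eq. (15) from the stacked latent vector f (length n*C)."""
    F = f.reshape(C, n)
    E = np.exp(F - F.max(axis=0))            # subtract the max for numerical stability
    return (E / E.sum(axis=0)).reshape(-1)   # stacked class-by-class, like f

def laplace_e_step(K, Y, n, C, n_iter=20):
    """Newton-Raphson search for the posterior mode m_f (Eq. (13)) and the
    Laplace covariance V = (K^{-1} + W)^{-1} of Eq. (12)."""
    m = np.zeros(n * C)
    for _ in range(n_iter):
        pi = softmax_pi(m, n, C)
        grad = (Y - pi) - np.linalg.solve(K, m)                              # Eq. (9)
        Pi = np.hstack([np.diag(pi.reshape(C, n)[c]) for c in range(C)])     # Eq. (17), n x nC
        W = np.diag(pi) - Pi.T @ Pi                                          # Eq. (16)
        H = W + np.linalg.inv(K)                                             # negative Hessian, Eq. (10)
        m = m + np.linalg.solve(H, grad)                                     # Newton step, Eq. (13)
    V = np.linalg.inv(np.linalg.inv(K) + W)                                  # Eq. (12)
    return m, V
```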
4. Prediction Method
Here, if we denote a vector $\mathbf{f}_*$ as the latent function value corresponding to a test point $x_*$, then the joint prior distribution of the training latent function $\mathbf{f}$ and the test latent function $\mathbf{f}_*$ is

$p(\mathbf{f}_*, \mathbf{f} \mid x_*, X, \Theta) \sim N\left( \begin{bmatrix} \mathbf{f} \\ \mathbf{f}_* \end{bmatrix} \,\Big|\, \mathbf{0}, \begin{bmatrix} K & K_* \\ K_*^T & k_{**} \end{bmatrix} \right),$    (23)

where $K_* = \mathrm{Vertical\text{-}Diag}(K_*^1(x, x_*), \ldots, K_*^c(x, x_*), \ldots, K_*^C(x, x_*))$, $K_*^c(x, x_*) = (k^c(x_1, x_*), \ldots, k^c(x_n, x_*))^T$, $c = 1, \ldots, C$, and $k_{**} = \mathrm{diag}(k^1(x_*, x_*), \ldots, k^C(x_*, x_*))$.

Hence, given a novel test point $x_*$, the posterior distribution of the latent function $\mathbf{f}_*$ corresponding to $x_*$ can be obtained by marginalizing over the latent functions of the training set:

$p(\mathbf{f}_* \mid x_*, \mathbf{Y}, X, \Theta) = \int p(\mathbf{f}_*, \mathbf{f} \mid x_*, \mathbf{Y}, X, \Theta)\, d\mathbf{f} = \int p(\mathbf{f}_* \mid \mathbf{f}, x_*, X, \Theta)\, p(\mathbf{f} \mid \mathbf{Y}, X, \Theta)\, d\mathbf{f}.$    (24)

But the posterior distribution of the latent function is unfortunately not Gaussian due to the non-Gaussian likelihood, as mentioned above. Hence, an approximate posterior distribution of the latent function is necessary. Here, if we use the Laplace approximation posterior $q(\mathbf{f} \mid X, \mathbf{Y}, \Theta)$ to the true posterior $p(\mathbf{f} \mid X, \mathbf{Y}, \Theta)$, we obtain the approximate posterior distribution $q(\mathbf{f}_* \mid x_*, X, \mathbf{Y}, \Theta)$ of the latent function $\mathbf{f}_*$. It is given as the Gaussian with mean vector $K_*^T (K^{-1} + W) \mathbf{m}_F$ and covariance matrix $k_{**} - K_*^T K^{-1} K_*$.

Hence, the predictive mean vector for class $c$ of the latent function value $\mathbf{f}_*$ corresponding to test point $x_*$ is given by

$E_q(f_*^c \mid x_*, X, \mathbf{Y}, \Theta) = K_*^c(x, x_*)^T (K^c)^{-1} \mathbf{m}_F^c = K_*^c(x, x_*)^T (\mathbf{y}^c - \boldsymbol{\pi}^c),$    (25)

where the last equality comes from $K^{-1} \mathbf{m}_F = \mathbf{Y} - \boldsymbol{\pi}$ and $(K^c)^{-1} \mathbf{m}_F^c = (\mathbf{y}^c - \boldsymbol{\pi}^c)$. Moreover, if these are put into vector form, then the expectation of the latent function $\mathbf{f}_*$ under the Laplace approximation is given as

$\boldsymbol{\mu}_* = E_q(\mathbf{f}_* \mid x_*, X, \mathbf{Y}, \Theta) = Q_*^T (\mathbf{Y} - \boldsymbol{\pi}),$    (26)

where the matrix $Q_*$ is defined as the $(nC \times C)$ matrix

$Q_* = \begin{bmatrix} K_*^1(x, x_*) & 0 & \cdots & 0 \\ 0 & K_*^2(x, x_*) & \cdots & \vdots \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & K_*^C(x, x_*) \end{bmatrix}.$    (27)

And the covariance matrix of the latent function $\mathbf{f}_*$ can be represented as

$\Sigma_* = \mathrm{Cov}_q(\mathbf{f}_* \mid x_*, X, \mathbf{Y}, \Theta) = k_{**} - Q_*^T (K + W^{-1})^{-1} Q_*.$    (28)

Therefore, we have obtained the approximate Gaussian posterior distribution $G(\boldsymbol{\mu}_*, \Sigma_*)$ of the latent function $\mathbf{f}_*$.

Finally, in order to classify an input vector $x_*$ into its proper class, we first draw $n$ random samples $\mathbf{f}_*^1, \ldots, \mathbf{f}_*^n$ from the predictive distribution of the latent function $\mathbf{f}_*$ corresponding to the input vector. Further, using Eq. (6), we calculate the estimates of the classification probabilities $(\pi_*^{1c}, \ldots, \pi_*^{nc})$, $c = 1, \ldots, C$, and compute the mean vector of these probabilities $(\bar{\pi}_*^1, \ldots, \bar{\pi}_*^C)$. Therefore, we classify the input vector $x_*$ into the class whose classification probability is maximized. That is,

$c' = \arg\max_{1 \le c \le C} (\bar{\pi}_*^1, \ldots, \bar{\pi}_*^C).$    (29)

5. Performance Evaluation

In order to evaluate the performance improvement achieved by the proposed inference method, we consider bivariate normal synthetic data and the Iris data.

5.1 Synthetic Data

Here, we consider four partially overlapping Gaussian sources of data in two dimensions. First, in order to train a model, we generated four classes of bivariate Gaussian random samples. One hundred sixty data points were generated from the four bivariate normal distributions with the mean vectors and covariance matrices described in Table 1. Fig. 1(a) plots these data points in a two-dimensional space.

Table 1. Mean vector and covariance matrix for each class.
Class 1: mean (1.75, -1.0), covariance [[1, 0.5], [0.5, 1]]
Class 2: mean (-1.75, 1.0), covariance [[1, -0.5], [-0.5, 1]]
Class 3: mean (2, 2), covariance [[1, -0.5], [-0.5, 1]]
Class 4: mean (-2, -2), covariance [[1, 0], [0, 1]]

Fig. 1. (a) Training data, (b) testing data, and (c) class region and misclassification observations.
Fig. 2. Iris dataset.
Table 2. Classification of Iris species.
              setosa   versicolor   virginica
Setosa           1         0            0
Versicolor       0        0.96         0.04
Virginica        0        0.01         0.99
Second, in order to verify the performance of the model, we generated four different classes of bivariate Gaussian random samples. Four hundred data points were generated from the four bivariate normal distributions. Fig. 1(b) plots the testing data points, and Fig. 1(c) shows each class region and the misclassified data points. The total misclassification is about 7-8%; therefore, the proposed method classifies the data points well.
5.2 Iris Dataset
Here, we considered real data called an Iris dataset.
This dataset consists of 50 samples from each of three
species of Iris flowers: setosa, versicolor and virginica.
Four features were measured from each sample (length and
width of sepal and petal) in centimeters. Based on the
combination of the four features, we developed a GP
classifier model to distinguish one species from another.
Fig. 2 shows the Iris dataset from different viewpoints.
First, in order to train a model, we used a total of 90
observations from the three classes. In order to verify the performance of the model, we selected 60 samples that were not used in the training set.
Next, we want to measure the performance of our
proposed model when classifying the Iris species. To find
the best performance, we chose to find the optimal hyperparameters at the point where the marginal likelihood has a
maximum using the EM algorithm.
Table 2 shows the results of the Iris species classification.
To calculate the rates, we estimate the number of correctly
classified negatives and positives and divide by the total
number of each species.
We had to try many experiments to get meaningful
results using randomly selected samples. Experimental
results reveal that the average for a successful
classification rate is about 98%.
6. Conclusion
This paper proposed a new inference algorithm that can
simultaneously derive both a posterior distribution of a
latent function and estimators of hyper-parameters in the
Gaussian process classification model. The proposed
algorithm was performed in two steps: the expectation step
and the maximization step. In the expectation step, using a
Bayesian formula and Laplace approximation, we derived
the approximate posterior distribution of the latent function
on the basis of the learning data. Furthermore, we
considered a method of calculating a mean vector and
covariance matrix of a latent function. In the maximization
step, using the derived posterior distribution of the latent
function, we derived the maximum likelihood estimator for
hyper-parameters necessary to define a covariance matrix.
Finally, we conducted experiments by using synthetic
data and Iris data in order to verify the performance of the
proposed algorithm. Experimental results reveal that the
proposed algorithm shows good performance on these
datasets. Our future work will extend the proposed method
to other video recognition problems, such as 3D human
action recognition, gesture recognition, and surveillance
systems.
Acknowledgement
This work was jointly supported by the National Research
Foundation of the Korea Government (2014R1A1A4A0109398)
and the research fund of Chonnam National University
(2014-2256).
References
[1] H. Nickisch et al., “Approximations for Binary
Gaussian Process Classification,” Journal of Machine
Learning Research, Vol. 9, pp. 2035-2078, 2008.
Article (CrossRef Link)
[2] C. K. I. Williams et al., “Bayesian Classification with
Gaussian Processes,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, Vol. 12, pp.
1103-1118, 1998. Article (CrossRef Link)
[3] T. P. Minka, “Expectation Propagation for Approximate
Bayesian Inference,” Technical Report, Depart. of
Statistics, Carnegie Mellon University, Pittsburgh,
PA 15213, 2001. Article (CrossRef Link)
[4] M. Opper et al., “The Variational Gaussian Approximation
Revisited,” Neural Comput., Vol. 21, No. 3, pp. 786-792, 2009. Article (CrossRef Link)
[5] M. N. Gibbs et al., “Variational Gaussian Process
Classifiers,” IEEE Transactions on Neural Networks,
Vol. 11, No. 6, pp. 1458-1464, 2000. Article
(CrossRef Link)
[6] L. Csato et al., “Efficient Approaches to Gaussian
Process Classification,” in Neural Information
Processing Systems, Vol. 12, pp. 251-257, MIT Press,
2000. Article (CrossRef Link)
[7] H. Kim, et al., "Bayesian Gaussian Process Classification
with the EM-EP algorithm," IEEE Trans. on PAMI,
Vol. 28, No. 12, pp 1948-1959, 2006. Article
(CrossRef Link)
Wanhyun Cho received both a BSc
and an MSc from the Department of
Mathematics, Chonnam National University, Korea, in 1977 and 1981,
respectively, and a PhD from the
Department of Statistics, Korea University, Korea, in 1988. He is now
teaching at Chonnam National University. His research interests are statistical modeling,
pattern recognition, image processing, and medical image
processing.
Sangkyoon Kim received a BSc, an
MSc and a PhD in Electronics Engineering, Mokpo National University,
Korea, in 1998, 2000 and 2015, respectively. From 2011 to 2015, he was
a Visiting Professor in the Department
of Information & Electronics Engineering, Mokpo National University,
Korea. His research interests include image processing,
pattern recognition and computer vision.
Soonyoung Park received a BSc in
Electronics Engineering from Yonsei
University, Korea, in 1982 and an
MSc and PhD in Electrical and
Computer Engineering from State
University of New York at Buffalo, in
1986 and 1989, respectively. From
1989 to 1990 he was a Postdoctoral
Research Fellow in the Department of Electrical and
Computer Engineering at the State University of New
York at Buffalo. Since 1990, he has been a Professor with
the Department of Electronics Engineering, Mokpo
National University, Korea. His research interests include
image and video processing, image protection and
authentication, and image retrieval techniques.
IEIE Transactions on Smart Processing and Computing, vol. 4, no. 4, August 2015
http://dx.doi.org/10.5573/IEIESPC.2015.4.4.209
A Simulation Study on The Behavior Analysis of The
Degree of Membership in Fuzzy c-means Method
Takeo Okazaki1, Ukyo Aibara1 and Lina Setiyani2
1 Department of Information Engineering, University of the Ryukyus, Okinawa, Japan (okazaki@ie.u-ryukyu.ac.jp, ukyo@ms.ie.u-ryukyu.ac.jp)
2 Department of Information Engineering, Graduate School of Engineering and Science, University of the Ryukyus, Okinawa, Japan (tya.sachi@ms.ie.u-ryukyu.ac.jp)
* Corresponding Author: Takeo Okazaki
Received July 15, 2015; Revised August 5, 2015; Accepted August 24, 2015; Published August 31, 2015
* Short Paper
* Extended from a Conference: Preliminary results of this paper were presented at the ITC-CSCC 2015. This paper has
been accepted by the editorial board through the regular reviewing process that confirms the original contribution.
Abstract: The fuzzy c-means method is a typical soft clustering method, and requires a degree of membership
that indicates the degree of belonging to each cluster at the time of clustering. Parameter values
greater than 1 and less than 2 have been used by convention. According to the proposed data-generation
scheme and the simulation results, some behaviors of the degree of "fuzziness" were derived.
Keywords: Fuzzy c-means, Degree of membership, Numerical simulation, Correct ratio, Incorrect ratio
1. Introduction
Soft clustering is clustering that permits belonging to
more than one cluster, whereas hard clustering requires
belonging to just one cluster to provide crisp classification.
The fuzzy c-means (FCM) method [1, 2] is a typical soft
clustering method, which estimates a membership
value that indicates the degree of belonging to each cluster.
Since the parameters for the degree of “fuzziness” are
included, it is necessary to provide a parameter value at the
time of clustering. In most of the traditional research,
parameter values greater than 1 and less than 2 have been
used with little theoretical explanation.
In this study, we analyzed some behaviors of the
degree of fuzziness by numerical simulations.
2. Fuzzy c-means Method
Given a finite set of $n$ objects $X = \{x_1, \ldots, x_n\}$ and the number of clusters $c$, we consider partitioning $X$ into $c$ clusters while allowing duplicate belonging. With the belonging coefficient $u_{ki}$ ($k$: cluster id, $i$: object id), FCM aims to minimize the objective function $Err(u, \mu)$:

$Err(u, \mu) = \sum_{k=1}^{c} \sum_{i=1}^{n} (u_{ki})^m \| x_i - \mu_k \|^2$    (1)

$u_{ki} = \dfrac{1}{\sum_{j=1}^{c} \left( \dfrac{\| x_i - \mu_k \|}{\| x_i - \mu_j \|} \right)^{\frac{2}{m-1}}}$    (2)

$\mu_k = \dfrac{\sum_{i=1}^{n} (u_{ki})^m x_i}{\sum_{i=1}^{n} (u_{ki})^m}$    (3)

$\mu_k$ denotes each cluster center, and $m$ denotes the degree of fuzziness, with $m > 1$.
The degree m corresponds to the level of cluster
fuzziness. A larger m causes fuzzier clusters; in contrast,
m = 1 indicates crisp classification. We need to determine
the value of m at the time of clustering, and m = 2 has been
applied in the absence of domain knowledge by convention.
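A minimal sketch of the FCM iteration of Eqs. (1)-(3) is given below, alternating the membership update of Eq. (2) and the center update of Eq. (3). The fixed iteration count and the random initialization are simplifications for illustration, not details taken from the paper.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, seed=0):
    """FCM sketch: X is an (n, d) data matrix, c the number of clusters, m the fuzziness."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((c, n))
    U /= U.sum(axis=0)                       # memberships of each object sum to 1
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um @ X) / Um.sum(axis=1, keepdims=True)            # Eq. (3)
        dist = np.linalg.norm(X[None, :, :] - centers[:, None, :], axis=2)
        dist = np.fmax(dist, 1e-12)          # avoid division by zero for coincident points
        inv = dist ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=0)            # Eq. (2)
    return U, centers
```

With m close to 1 the memberships become nearly crisp, while larger m spreads each object's membership across clusters, which is exactly the behavior studied in the following sections.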
3. Approach to Finding Properties of The
Degree of Membership
Although we would like to find universal or mathematical properties of m, it is difficult to avoid the relation to data-specific characteristics. They may be found by considering a suitable m applied to various constructed data, that is, by designing a generation model that creates various data and values to estimate an optimal m.
In order to design the data model, the following indexes,
which concern cluster relationships, were picked up.
▪ Distance between clusters (cluster placement)
▪ Number of clusters
▪ Shape of clusters
▪ Number of objects in each cluster
Here, "cluster" in the distance and shape indexes refers to the set of objects that gives the initial object placement for our experiments, not to the target cluster. For distance,
regular intervals give the typical placement, and we can
arrange the number of cluster overlaps. For shape, the
circle type is easy to handle because of density; however,
the oval type requires consideration of bias.
The procedure for data generation and cluster assignment is as follows.

[Step 1] Decide the center vector $v_i$ for each cluster:

$v_i = \left( d \times \cos\left( \dfrac{2\pi}{c} i \right),\ d \times \sin\left( \dfrac{2\pi}{c} i \right) \right)$    (4)

where $d$ is the distance from the origin.

[Step 2] Generate normal random numbers for each cluster with mean vector $v_i$ and covariance matrix $E$.

[Step 3] Calculate the coefficient $p_{ik}$ that indicates that object $x_k$ belongs to cluster $i$:

$p_{ik} = \dfrac{1 / (d_{ik} + 1)}{\sum_{j=1}^{c} 1 / (d_{jk} + 1)}$    (5)

where $d_{ik}$ is the distance between the object and the cluster.

[Step 4] Calculate the mean $\bar{p}_{ik}$ of normal random numbers with mean $p_{ik}$ and standard deviation $\frac{1}{10c}$ for all objects. If $\bar{p}_{ik} \ge \frac{1}{c}$, then object $x_k$ is deemed to belong to cluster $i$.

To obtain an optimal m, we need some indexes to evaluate the FCM results. Assuming that $Cn_i^*$ is the number of objects belonging to cluster $i$ in the input data, $Cn_i$ is the number of objects belonging to cluster $i$ in the results, and $Cn_i^+$ is the number of correct objects belonging to cluster $i$ in the results, the correct ratio is used for overall suitability:

$CR = \dfrac{\sum_{i=1}^{c} Cn_i^+}{\sum_{i=1}^{c} Cn_i^*}$    (6)

In order to analyze the accuracy of each cluster, the correct ratio inside a cluster denotes the rate at which objects that should belong to it do belong to it:

$CR_i^{inside} = \dfrac{Cn_i^+}{Cn_i^*}$    (7)

The correct ratio outside a cluster denotes the rate at which objects that do not belong to that cluster should not belong to it:

$CR_i^{outside} = \dfrac{n - Cn_i}{n - Cn_i^*}$    (8)

On the other hand, the incorrect ratios for each cluster can be used as evaluation indexes in a similar manner:

$IR_i^{inside} = \dfrac{Cn_i - Cn_i^+}{Cn_i}$    (9)

$IR_i^{outside} = \dfrac{Cn_i^* - Cn_i^+}{Cn_i^*}$    (10)

4. Evaluation Experiments

According to the strategy in Section 3, we designed the evaluation experiment scheme as follows.

Table 1. Experimental conditions.
Parameters | Values
m | 1.1 ~ 7
c : number of clusters | 5 ~ 7
Cn*i | 50 ~ 100
Distance between clusters | regular interval or biased placement
Shape of clusters | circle or oval

The results of the basic case with a regular interval and a circle shape at c = 5, Cn*i = 50 are shown in Figs. 1 to 6.

Fig. 1. A case of input data with regular placement, c = 5, Cn*i = 50.
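The data-generation procedure of Steps 1-4 and the overall correct ratio of Eq. (6) can be sketched as follows. The radius d, the spherical unit covariance, and the single noisy draw standing in for the Step-4 mean are illustrative assumptions; the result_membership argument is a hypothetical matrix of cluster assignments produced from the FCM memberships.

```python
import numpy as np

def generate_clusters(c=5, n_per=50, d=4.0, sigma=1.0, seed=0):
    """Steps 1-4: centers on a circle of radius d (Eq. (4)), Gaussian points per
    center (Step 2), soft assignment coefficients (Eq. (5)), then thresholding (Step 4)."""
    rng = np.random.default_rng(seed)
    angles = 2.0 * np.pi * np.arange(1, c + 1) / c
    centers = d * np.column_stack([np.cos(angles), np.sin(angles)])      # Eq. (4)
    X = np.vstack([rng.normal(v, sigma, size=(n_per, 2)) for v in centers])

    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)   # d_ik per object/cluster
    p = 1.0 / (dist + 1.0)
    p /= p.sum(axis=1, keepdims=True)                                    # Eq. (5)

    # Step 4, simplified to one noisy draw: perturb with std 1/(10c) and let
    # object k belong to every cluster whose perturbed coefficient is >= 1/c.
    p_bar = rng.normal(p, 1.0 / (10 * c))
    true_membership = p_bar >= 1.0 / c
    return X, centers, true_membership

def correct_ratio(true_membership, result_membership):
    """Overall correct ratio CR of Eq. (6): correct assignments over input assignments."""
    correct = np.logical_and(true_membership, result_membership).sum()
    return correct / true_membership.sum()
```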
Fig. 2. CR : correct ratio overall with regular placement,
c = 5, Cn*i = 50 .
Fig. 6. IRioutside : incorrect ratio outside a cluster with
regular placement, c = 5, Cn*i = 50 .
Fig. 3. CRiinside : correct ratio inside a cluster with regular
placement, c = 5, Cn*i = 50 .
Fig. 7. A case of input data with regular placement,
c = 7, Cn*i = 50
Fig. 4. CRioutside : correct ratio outside a cluster with
regular placement, c = 5, Cn*i = 50 .
Fig. 8. CR : correct ratio overall with regular placement,
c = 7, Cn*i = 50 .
Fig. 5. IRiinside : incorrect ratio inside a cluster with
regular placement, c = 5, Cn*i = 50 .
The correct ratio overall had a peak from m = 3.5 to
m = 5.5, and a clear inflection point could be seen around
m = 4 for each cluster evaluation index.
The results of the basic case with a regular interval
and circle shape at c = 7, Cn*i = 50 are shown in Fig. 7 to
Fig. 12.
The correct ratio overall had a peak at m = 3, and a
clear inflection point could be seen around m = 3 for each
cluster evaluation index. Cluster C7 was located at the
center of the objects, therefore C7 was error prone, and its
evaluation values were bad compared with the other six
clusters.
The results of the modified case with biased placement
and a circle shape at c = 5, Cn*i = 100 are shown in Fig. 13
to Fig. 18.
Fig. 9. CRiinside : correct ratio inside a cluster with regular
placement, c = 7, Cn*i = 50 .
Fig. 13. A case of input data with biased placement,
c = 5, Cn*i = 100 .
Fig. 10. CRioutside : correct ratio outside a cluster with
regular placement, c = 7, Cn*i = 50 .
Fig. 11. IRiinside : incorrect ratio inside a cluster with
Fig. 14. CR : correct ratio overall with biased placement,
c = 5, Cn*i = 100 .
regular placement, c = 7, Cn*i = 50 .
Fig. 12. IRioutside : incorrect ratio outside a cluster with
Fig. 15. CRiinside : correct ratio inside a cluster with
regular placement, c = 7, Cn*i = 50 .
biased placement, c = 5, Cn*i = 100 .
Fig. 16. CRioutside : correct ratio outside a cluster with
Fig. 19. A case of input data with biased placement,
c = 7, Cn*i = 100 .
biased placement, c = 5, Cn*i = 100 .
Fig. 20. CR : correct ratio overall with biased placement,
c = 7, Cn*i = 100 .
Fig. 17. IRiinside : incorrect ratio inside a cluster with
biased placement, c = 5, Cn*i = 100 .
Fig. 21. CRiinside : correct ratio inside a cluster with
biased placement, c = 7, Cn*i = 100
Fig. 18. IRioutside : incorrect ratio outside a cluster with
biased placement, c = 5, Cn*i = 100 .
The correct ratio overall had a peak at m = 4.25, and a
inflection area could be seen from m = 3 to m = 4 for each
cluster evaluation index. Cluster C3 was located in
isolation, therefore it could be distinguished stably.
The results of the modified case with biased placement
and a circle shape with c = 7, Cn*i = 100 are shown in Fig.
Fig. 22. CRioutside : correct ratio outside a cluster with
19 to Fig. 24.
biased placement, c = 7, Cn*i = 100 .
Fig. 23. IRiinside: incorrect ratio inside a cluster with biased placement, c = 7, Cn*i = 100.
Fig. 24. IRioutside: incorrect ratio outside a cluster with biased placement, c = 7, Cn*i = 100.
The correct ratio overall had a peak at m = 4.25, and a clear inflection point could be seen around m = 4 for each cluster evaluation index. These values were larger than those from regular placement. Clusters C4 and C7 were located at the center of the objects; therefore, they were error prone. Cluster C1 was located apart from the other objects and could be distinguished stably.
With a limited number of clusters, both the regular and biased placement cases showed that the optimal m was larger than the conventional value. The optimal m for biased placement was larger than that for regular placement. A value of 3 or more for m was valid when the number of clusters was 7 or less.
5. Application to Motor Car Type
Classification
We confirmed the validity of the experimental results through application to actual data from motor car road tests [4]. The 32 cars had 5 variables: fuel consumption, amount of emissions, horsepower, vehicle weight, and 1/4 mile time. Based on the data description, we assumed four clusters: big sedan, midsize sedan, small sedan, and sports car.
The results of FCM for the conventional m = 2 and the proposed m = 4 are shown in Table 2 and Fig. 25. Blue lines correspond to m = 2, and red lines correspond to m = 4. Black line categories have no difference between m = 2 and m = 4.
Table 2. Comparison of clustering results (categories C1–C4).
m = 2:
Datsun 710
Merc 240D
Merc 230
Fiat 128
Honda Civic
Toyota Corolla
Toyota Corona
Fiat X1-9
Porsche 914-2
Lotus Europa
Volvo 142E
Hornet Sportabout
Duster 360
Cadillac Fleetwood
Lincoln Continental
Chrysler Imperial
Camaro Z28
Pontiac Firebird
Ford Pantera L
Maserati Bora
Hornet 4 Drive
Hornet Sportabout
Merc 450SE
Merc 450SL
Merc 450SLC
Dodge Challenger
AMC Javelin
Camaro Z28
Ford Pantera L
Maserati Bora
Mazda RX4
Mazda RX4 Wag
Hornet 4 Drive
Valiant
Merc 240D
Merc 230
Merc 280
Merc 280C
Toyota Corona
Ferrari Dino
Volvo 142E
m = 4:
Datsun 710
Merc 240D
Merc 230
Fiat 128
Honda Civic
Toyota Corolla
Toyota Corona
Fiat X1-9
Porsche 914-2
Lotus Europa
Volvo 142E
Hornet Sportabout
Duster 360
Cadillac Fleetwood
Lincoln Continental
Chrysler Imperial
Camaro Z28
Pontiac Firebird
Ford Pantera L
Maserati Bora
Hornet 4 Drive
Hornet Sportabout
Valiant
Duster 360
Merc 450SE
Merc 450SL
Merc 450SLC
Dodge Challenger
AMC Javelin
Camaro Z28
Pontiac Firebird
Ford Pantera L
Maserati Bora
Mazda RX4
Mazda RX4 Wag
Datsun 710
Hornet 4 Drive
Valiant
Merc 240D
Merc 230
Merc 280
Merc 280C
Toyota Corona
Porsche 914-2
Lotus Europa
Ferrari Dino
Volvo 142E
The cluster placement was biased, and the results for m = 4 gave a more appropriate classification with respect to the original car descriptions.
Fig. 25. Mapping of clustering results.
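For reference, a minimal NumPy sketch of the standard FCM iteration [2] is shown below, with the degree-of-membership parameter m exposed so that the conventional m = 2 and the larger values examined in this paper (e.g., m = 4) can be compared directly. Data loading, standardization, and initialization details are illustrative assumptions, not the authors' R implementation.

```python
import numpy as np

def fcm(X, c, m=2.0, iters=100, seed=0):
    """Standard fuzzy c-means: X is (n, p) data, c clusters, fuzzifier m > 1.
    Returns (centers V of shape (c, p), memberships U of shape (c, n))."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((c, n))
    U /= U.sum(axis=0)                       # memberships sum to 1 per object
    for _ in range(iters):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)          # cluster centers
        d = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2) + 1e-12
        # u_ik = 1 / sum_j (d_ik / d_jk)^(2/(m-1))
        U = 1.0 / np.sum((d[:, None, :] / d[None, :, :]) ** (2.0 / (m - 1.0)),
                         axis=1)
    return V, U

# e.g., standardize the 32x5 car data X and compare fuzzifiers:
# V2, U2 = fcm(X, c=4, m=2.0)
# V4, U4 = fcm(X, c=4, m=4.0)
```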
6. Conclusion
For typical soft clustering (FCM), the degree of membership has an important role. Parameter values greater than 1 and less than 2 have been used by convention, with little theoretical explanation. We analyzed the behavior of the parameter with simulation studies. The results showed the relations between the optimum value and the cluster placements or the number of clusters. We showed that at least a larger value than that used by convention is suitable. It is clear that a lower m provides a conservative decision that does not allow too much overlap among the clusters. For the correct ratio inside the cluster and the incorrect ratio outside the cluster, a smaller m is desirable. However, a larger m is desirable for the correct ratio outside the cluster and the incorrect ratio inside the cluster. With judgment from a multi-faceted perspective, the optimal m should be larger than the conventional value.
As future work, we need to investigate more sophisticated features of the parameter by using a greater variety of generated data.
References
[1] J. C. Dunn, “A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters”, Journal of Cybernetics, Vol. 3, pp. 32–57, 1974. Article (CrossRef Link)
[2] J. C. Bezdek, “Pattern Recognition with Fuzzy Objective Function Algorithms”, Plenum Press, New York, 1981. Article (CrossRef Link)
[3] S. Miyamoto, K. Umayahara and M. Mukaidono, “Fuzzy Classification Functions in the Methods of Fuzzy c-Means and Regularization by Entropy”, Journal of Japan Society for Fuzzy Theory and Intelligent Informatics, Vol. 10, No. 3, pp. 548–557, 1998. Article (CrossRef Link)
[4] H. Henderson and P. Velleman, “Building multiple regression models interactively”, Biometrics, Vol. 37, pp. 391–411, 1981. Article (CrossRef Link)
[5] S. Hotta and K. Urahama, “Retrieval of Videos by Fuzzy Clustering”, Image Information and Television Engineers Journal, Vol. 53, No. 12, pp. 1750–1755, 1999. Article (CrossRef Link)
[6] L. Bobrowski and J. C. Bezdek, “c-means clustering with the L1 and L∞ norms”, IEEE Transactions on Systems, Man and Cybernetics, Vol. 21, No. 3, pp. 545–554, 1991. Article (CrossRef Link)
[7] R. J. Hathaway, J. C. Bezdek and W. Pedrycz, “A parametric model for fusing heterogeneous fuzzy data”, IEEE Transactions on Fuzzy Systems, Vol. 4, No. 3, pp. 270–281, 1996. Article (CrossRef Link)
Takeo Okazaki is Associate Professor of Information Engineering at University of the Ryukyus, Japan. He received his B.Sci. and M.Sci. degrees in Algebra and Mathematical Statistics from Kyushu University, Japan, in 1987 and 1989, respectively, and his Ph.D. in Information Engineering from University of the Ryukyus in 2014. He was a research assistant at Kyushu University from 1989 to 1995. He has been an assistant professor at University of the Ryukyus since 1995. His research interests are statistical data normalization for analysis and statistical causal relationship analysis. He is a member of JSCS, IEICE, JSS, GISA, and BSJ Japan.
Ukyo Aibara received his B.Eng. degree in Information Engineering from University of the Ryukyus, Japan, in 2015. In 2013, he graduated from National Institute of Technology, Kumamoto College. He discussed the evaluation of the performance and characteristics of soft clustering in his graduate research and thesis. In particular, he investigated a variety of applications of the fuzzy c-means method, and developed an enhancement package in the statistical language R.
Lina Setiyani is a Master's course student of Information Engineering at University of the Ryukyus, Japan. She received her B.Comp. degree in Information Engineering from Janabadra University, Yogyakarta, Indonesia, in 2007. Her graduation research theme was an SMS-based information system for Janabadra University student admission. Her research interest is finding optimal solutions to problems with genetic algorithms using the Java programming language.
http://dx.doi.org/10.5573/IEIESPC.2015.4.4.216
Development of Visual Odometry Estimation for
an Underwater Robot Navigation System
Kandith Wongsuwan and Kanjanapan Sukvichai
Department of Electrical Engineering, Kasetsart University / Chatuchak, Bangkok, Thailand
kandithws@yahoo.com, fengkpsc@ku.ac.th
* Corresponding Author: Kandith Wongsuwan
Received July 15, 2015; Revised August 5, 2015; Accepted August 24, 2015; Published August 31, 2015
* Regular Paper: This paper reviews the recent progress, possibly including previous works in a particular research topic,
and has been accepted by the editorial board through the regular reviewing process.
* Extended from a Conference: Preliminary results of this paper were presented at the ITC-CSCC 2015. The present paper
has been accepted by the editorial board through the regular reviewing process that confirms the original contribution.
Abstract: The autonomous underwater vehicle (AUV) is being widely researched in order to
achieve superior performance when working in hazardous environments. This research focuses on
using image processing techniques to estimate the AUV's egomotion and the changes in orientation,
based on image frames from different time frames captured from a single high-definition web
camera attached to the bottom of the AUV. A visual odometry application is integrated with other
sensors. An inertial measurement unit (IMU) sensor is used to determine a correct set of answers
corresponding to a homography motion equation. A pressure sensor is used to resolve image scale
ambiguity. Uncertainty estimation is computed to correct drift that occurs in the system by using a
Jacobian method, singular value decomposition, and backward and forward error propagation.
Keywords: Underwater robot, Visual odometry, Monocular odometry, AUVs, Robot navigation
1. Introduction
The autonomous underwater vehicle (AUV) is still in development but aims to be effective when working in the
industrial field. To create an autonomous robot, one of the
important things is a strategy to autonomously navigate the
robot to desired destinations. Several techniques are used
to estimate its motion by using imaging sonar or Doppler
velocity log (DVL). Because the cost per sensor device is
extremely high, an alternative for AUV navigation is
implemented in this research by using a visual odometry
concept, which is normally used in mobile robots.
In our design procedure, the monocular visual odometry estimation was done by using a single high-definition camera attached to the bottom of the robot, grabbing image sequences at different times and calculating the robot's movement from the changes between two images. Assuming that the roll, pitch, and depth of the robot in relation to the floor of the testing field are known, the monocular visual odometry concept is designed as seen in Fig. 1.
2. Odometry Estimation via Homography
The implementation is based on a single pin-hole camera. The Shi-Tomasi method [1] and Lucas-Kanade pyramidal optical flow [2] are used in order to estimate the homography between images taken at different times. The OpenCV implementation of optical flow is used for the feature matching algorithm. A random sample consensus (RANSAC) method is used to eliminate any feature outliers. Let the estimated projective homography between frames be H12, and let the camera intrinsic parameter matrix be A1. Hence, the calibrated homography is shown in Eq. (1):
H12c = A1^-1 H12 A1 = R2 ( I − (t2 / d) n1^T )    (1)
where
- R2 is the camera's rotation matrix
- t2 is the camera's translation vector
- n1 is a normal vector to the object (ground) plane, and
- d is the distance to the object (ground) plane in meters.
Fig. 1. The monocular visual odometry concept.
In advance, every single image frame is adjusted for its rotation about the x and y axes; roll (ϕ) and pitch (θ) are known, by applying a compensated homography to the image. If t2 = [0 0 0]^T in Eq. (1), the compensated homography can be rewritten as Eq. (2):
H = A1 Rϕ Rθ A1^-1    (2)
From Eq. (1), two sets of solutions can be obtained: S1 = {R2^1, t2^1, n1^1} and S2 = {R2^2, t2^2, n1^2}. The criterion for choosing a correct solution is that, as both frames are compensated on the same plane (plane normal vector directed toward the camera), the roll (ϕ) and pitch (θ) of the correct rotation matrix (either R2^1 or R2^2) must be about zero, which must correspond to the correct set of answers, as shown in Eq. (3):
Si = argmin_i ( ϕi^2 + θi^2 ),  i = 1, 2    (3)
3. Covariance Matrix Estimation
An odometry estimation of the sensors' covariance matrix is needed in order to determine the uncertainty occurring in the system. To estimate the uncertainty of rotation about the z-axis, or yaw (ψ), and horizontal translation (x, y), there are two steps, as shown in Sections 3.1 and 3.2.
3.1 Homography Covariance Matrix
First, backward error propagation is used to find a homography error covariance matrix [3]:
Σp' = ( (∂p'/∂(hij, xi))^T Σx'^-1 (∂p'/∂(hij, xi)) )^†    (4)
where
- p' is the vector of the matched feature point in the second image,
- ∂p'/∂(hij, xi) is a Jacobian matrix of p',
- † is a pseudo-inverse,
- Σ is a covariance matrix.
For pi' and each element of pi' in the Jacobian ∂p'/∂(hij, xi), if the estimated point is normalized in the image plane, the estimated homography is affine in all cases (h31 = h32 = 0 and h33 = 1). So, we have
x̂i' = h11 x̂i + h12 ŷi + h13
ŷi' = h21 x̂i + h22 ŷi + h23    (5)
By taking partial derivatives of Eq. (5), the Jacobian elements are:
∂x̂i'/∂h11 = x̂i,  ∂x̂i'/∂h12 = ŷi,  ∂x̂i'/∂h13 = 1
∂ŷi'/∂h21 = x̂i,  ∂ŷi'/∂h22 = ŷi,  ∂ŷi'/∂h23 = 1
∂x̂i'/∂x̂i = h11,  ∂x̂i'/∂ŷi = h12,  ∂ŷi'/∂x̂i = h21,  ∂ŷi'/∂ŷi = h22    (6)
The other elements of the Jacobian are 0, since we assume that each pixel feature is independent.
In this case, the variance of the error that occurs in the matching algorithm is assumed to be less than 1 pixel, and there is no error in first-image acquisition. 6σ is used to determine the variance of every single pixel when all of them are independent. Then, the Jacobian of the SVD is computed by using Eq. (7) in order to solve another layer of backward error propagation:
∂H12c/∂hij = (∂U/∂hij) D V^T + U (∂D/∂hij) V^T + U D (∂V^T/∂hij)    (7)
From Eq. (7), the equation is solved as referred to by
Papadopoulo and Lourakis [4], as follows:
∂D/∂hij = ∂λk/∂hij = uik vjk,  k = 1, 2, 3    (8)
λl ΩU_ij,kl + λk ΩV_ij,kl = uik vjl
λk ΩU_ij,kl + λl ΩV_ij,kl = −uil vjk    (9)
where the index ranges are k = 1, 2,3 and l = i + 1, i + 2 .
Since λk ≠ λl , the 2x2 equation system (9) has a unique
solution that is practically solved by using Cramer’s Rule:
ΩU_ij,kl = det[ uik vjl, λk ; −uil vjk, λl ] / (λl^2 − λk^2)    (10)
ΩV_ij,kl = det[ λk, uik vjl ; λl, −uil vjk ] / (λl^2 − λk^2)    (11)
And finally, we obtain the Jacobian of U and V from
∂U/∂hij = U ΩU_ij    (12)
∂V/∂hij = − V ΩV_ij    (13)
From Eqs. (8), (12), and (13), ∂U/∂hij, ∂D/∂hij and ∂V^T/∂hij can be obtained, where U, D, V are the results of applying SVD to H12c. Their covariance matrices ΣU, ΣD, ΣV can be computed via a forward error propagation method as follows:
ΣU = JU ΣH JU^T
ΣD = JD ΣH JD^T
ΣV = JV ΣH JV^T    (14)
where JU, JD, JV are the Jacobians of U, D, V, respectively. From Eq. (14), the covariance matrix of U, V, D is ΣU,V,D, concatenated together as
diag(ΣU,V,D) = [ σ^2_u11 … σ^2_u33  σ^2_v11 … σ^2_v33  σ^2_λ1  σ^2_λ2  σ^2_λ3 ]    (15)
In this case, we assume that each parameter in the U, V, D matrices is independent, so as we propagate the values from Eq. (14), we force the other elements to be 0.
3.2 Rotation Matrix and Translation Vector Covariance Matrix Estimation
For the rotation matrix, recall that, in this paper, we are only interested in yaw (ψ), which can be determined from the camera's frame rotation matrix (R2), which can be computed from
R2 = UMV = U [ α 0 β ; 0 1 0 ; −sβ 0 sα ] V    (15)
where
- s = det(U)det(V)
- α = (λ1 + sλ3 δ^2) / (λ2 (1 + δ^2))
- δ = ± sqrt( (λ1^2 − λ2^2) / (λ2^2 − λ3^2) )
- β = ± sqrt(1 − α^2)
with criteria such that sgn(β) = −sgn(δ). And the yaw of R2 can be obtained as:
ψ = arctan( r21 / r11 )    (16)
where the rotation matrix elements r11, r21 can be derived from Eq. (17):
r11 = v11 (α u11 − sβ u13) + v12 u12 + v13 (β u11 + sα u13)
r21 = v21 (α u11 − sβ u13) + v22 u12 + v23 (β u11 + sα u13)    (17)
To determine the yaw uncertainty, we apply forward propagation to Eq. (17), where its Jacobian can be retrieved by taking the partial derivative of yaw with respect to the U, V, D parameters. Therefore,
Jψ = ∂ψ/∂(U, V, D) = (1 / (1 + a^2)) ∂a/∂(U, V, D)    (18)
where
- a = r21 / r11
And finally, the covariance of yaw can be obtained from forward error propagation:
σψ^2 = Jψ ΣU,V,D Jψ^T    (19)
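To make the forward-propagation step of Eqs. (18) and (19) concrete, the NumPy sketch below propagates a covariance through the yaw function ψ = arctan(r21/r11). As a simplification, the Jacobian here is taken with respect to (r11, r21) only, standing in for the full U, V, D parameterization, and the numeric values are purely illustrative.

```python
import numpy as np

def propagate(J, Sigma):
    """First-order (forward) error propagation: returns J Sigma J^T."""
    J = np.atleast_2d(J)
    return J @ Sigma @ J.T

# yaw from rotation-matrix elements, Eq. (16)
r11, r21 = 0.94, 0.34
psi = np.arctan2(r21, r11)

# Jacobian of psi w.r.t. (r11, r21):
# d(psi)/d(r11) = -r21/(r11^2 + r21^2), d(psi)/d(r21) = r11/(r11^2 + r21^2)
den = r11**2 + r21**2
J_psi = np.array([[-r21 / den, r11 / den]])

Sigma_r = np.diag([1e-4, 1e-4])            # assumed covariance of (r11, r21)
sigma_psi_sq = propagate(J_psi, Sigma_r)   # analogue of Eq. (19)
print(psi, float(sigma_psi_sq))
```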
For the translation covariance, from Eq. (1), the translation vector can be obtained by
t2 = (1/ω) ( −β u1 + (λ3/λ2 − sα) u3 )    (20)
where
- t2 = [x y z]^T
- ω is a scale factor of the normal vector.
ω is the factor that scales the normal vector so that ‖n1‖ = 1, so we have
ω = 1 / sqrt( nx^2 + ny^2 + nz^2 )    (21)
Fig. 2. Prototype frame for testing visual odometry.
By substituting Eq. (21) into Eq. (20), we get the full camera translation vector:
t2 = sqrt( nx^2 + ny^2 + nz^2 ) ( −β u1 + (λ3/λ2 − sα) u3 )    (22)
The translation vector Jacobian matrix Jt can be derived from the partial derivatives with respect to all 21 parameters (all elements of U, diag(D), and V) that have a dependency in Eq. (22). Finally, we apply forward propagation and we have
Σt = Jt ΣU,V,D Jt^T    (23)
Fig. 3. Vectornav VN-100 IMU.
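The overall estimation chain of Sections 2 and 3 (Shi-Tomasi features tracked with pyramidal LK flow, RANSAC homography, decomposition, and selection of the candidate with near-zero roll and pitch, Eq. (3)) can be sketched with OpenCV as follows. This is an illustrative outline rather than the authors' implementation: the ZYX Euler-angle convention, function and variable names, and the calibrated intrinsic matrix K are assumptions.

```python
import cv2
import numpy as np

def roll_pitch(R):
    # ZYX Euler convention: pitch about y, roll about x (one common choice)
    pitch = -np.arcsin(np.clip(R[2, 0], -1.0, 1.0))
    roll = np.arctan2(R[2, 1], R[2, 2])
    return roll, pitch

def estimate_motion(prev_gray, cur_gray, K):
    # Shi-Tomasi corners tracked with pyramidal Lucas-Kanade optical flow
    p0 = cv2.goodFeaturesToTrack(prev_gray, maxCorners=400,
                                 qualityLevel=0.01, minDistance=7)
    p1, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, cur_gray, p0, None)
    good0, good1 = p0[status.ravel() == 1], p1[status.ravel() == 1]

    # RANSAC removes feature outliers while fitting the projective homography
    H, _ = cv2.findHomography(good0, good1, cv2.RANSAC, 3.0)

    # decompose the calibrated homography into candidate {R, t, n} sets
    _, Rs, ts, normals = cv2.decomposeHomographyMat(H, K)

    # Eq. (3): keep the candidate whose roll and pitch are closest to zero
    def score(R):
        r, p = roll_pitch(R)
        return r * r + p * p
    best = min(range(len(Rs)), key=lambda i: score(Rs[i]))
    return Rs[best], ts[best], normals[best]
```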
4. Experimental Procedure and Results
In order to implement our visual odometry algorithm, the OpenCV library (in C++) is used for image processing. The visual odometry component is tested on a prototype frame to which the selected camera (Logitech C920) is attached, as shown in Fig. 2. A Vectornav VN-100 (Fig. 3) inertial measurement unit was chosen to measure the prototype frame orientation. The IMU specification is described in Table 1; we used fused data from a gyroscope, an accelerometer, and a magnetometer. Leaves and rocks were selected as the objects in order to simulate a real underwater environment. They are good objects for real-time feature tracking, giving good tracking results. The features of leaves and rocks are displayed in Fig. 4.
The implementation procedure for the monocular visual odometry is summarized in Fig. 5.
Table 1. Vectornav VN-100 IMU Specification.
Specification | Value
Range: Heading, Roll | ±180°
Range: Pitch | ±90°
Static Accuracy (Heading, Magnetic) | 2.0° RMS
Static Accuracy (Pitch/Roll) | 0.5° RMS
Dynamic Accuracy (Heading, Magnetic) | 2.0° RMS
Dynamic Accuracy (Pitch/Roll) | 1.0° RMS
Angular Resolution | < 0.05°
Repeatability | < 0.2°
Fig. 4. Real-time feature tracking using Lucas-Kanade pyramidal optical flow on a prototype frame.
In system integration, all of the software components run on the robot operating system (ROS). ROS is a middleware, or framework, for robot software development. Instead of programming every single module in one project or process, ROS provides tools and libraries for inter-process communication using a publish-and-subscribe mechanism to the socket servers that handle all
messages, parameters, and services that occur in the system, which supports Python and C++. Moreover, ROS also links many useful libraries for robotics programming, such as OpenCV and OpenNI, and it provides some hardware driver packages (e.g., Dynamixel Servo) and a visualization package to use with ROS messages. Furthermore, ROS handles sensors, images, and data flow in the system, which makes system integration easier.
Fig. 5. Implementation procedure for visual odometry.
Fig. 6. Visual odometry translation error.
Fig. 7. Visual odometry experimental results (trajectory).
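As an illustration of the publish-and-subscribe integration described above, a stripped-down rospy node might look like the following. The node name, topic names, and publish rate are hypothetical, and the odometry-filling details are omitted; this is only a sketch of the pattern, not the authors' code.

```python
#!/usr/bin/env python
# Minimal ROS node sketch: subscribe to IMU data, publish visual odometry.
import rospy
from sensor_msgs.msg import Imu
from nav_msgs.msg import Odometry

latest_imu = None

def imu_callback(msg):
    # keep the most recent orientation for roll/pitch compensation
    global latest_imu
    latest_imu = msg

def main():
    rospy.init_node("visual_odometry")                   # node name is illustrative
    rospy.Subscriber("/imu/data", Imu, imu_callback)     # hypothetical topic
    pub = rospy.Publisher("/visual_odom", Odometry, queue_size=10)
    rate = rospy.Rate(20)                                # roughly the camera frame rate
    while not rospy.is_shutdown():
        odom = Odometry()
        odom.header.stamp = rospy.Time.now()
        odom.header.frame_id = "odom"
        # ... fill odom.pose / odom.twist from the visual odometry estimate ...
        pub.publish(odom)
        rate.sleep()

if __name__ == "__main__":
    main()
```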
4.1 Error Evaluation
The system was tested over several iterations. In our experiment, a bias of 15.6344% of the translation was added to the system in order to compensate for the translation error. The percent error of translation using the LK optical flow algorithm is clustered tightly enough across the experiments for this compensation to work, as shown in Fig. 6.
4.2 Results
The visual odometry algorithm was subjected to
experimentation. The prototype frame was driven along
the ground in order to create translation motion as a fixed
trajectory. In the experiment, the prototype frame was
turned clockwise 90 degrees, and then sent straight for a
while; after that, we turned it back by 90 degrees. The real experimental results compared with the ground truth are shown in Fig. 7.
Fig. 8. Visual odometry experimental result (ψ).
The experimental results show that the estimated
trajectory is close to the real translation trajectory. The
second experiment was conducted in order to obtain the
estimated yaw angle by using the proposed algorithm. The
Vectornav VN-100 internal measurement unit (IMU) was
used in the second experiment in order to obtain the yaw
angle when the prototype frame is rotated. With the same
trajectory as in Fig. 7, the results from the visual odometry
estimation algorithm and from the real yaw angle from the
IMU were compared and are shown in Fig. 8. The
experimental results show that the estimated trajectory is
also close to the real translation trajectory, even when the
frame is rotated.
The covariance matrix estimation of the visual odometry algorithm was obtained from a real experiment in real time. As we calculated them frame by frame, each output parameter variance is shown compared with the x, y, and yaw outputs from the visual odometry estimation in Figs. 9, 10, and 11, respectively. Note that we scale the value of yaw so that it can be seen in relation to the value transition and its covariance value.
In addition, we applied our monocular visual odometry
algorithm to a video data log of a real underwater robot.
Fig. 9. X variance estimation from visual odometry results.
Fig. 10. Y variance estimation from visual odometry results.
Fig. 11. Yaw variance estimation from visual odometry results.
The data log was collected from the Robosub 2014 competition organized by the Association for Unmanned Vehicle Systems International (AUVSI), which was an international competition. By using ROS, IMU data and barometer data were also collected with time synchronization with the video data log so that we could apply our algorithm. Results are shown in Fig. 12, which demonstrates that our algorithm can be used in a hazardous underwater environment with good performance.
Despite the unavailability of the real ground truth of the competition field, we could still estimate the robot displacement using a GPS and the competition field plan given by AUVSI. The displacement from the AUV deployment point to where it stopped is about 8.5 meters, which is shown in Fig. 12, showing that our algorithm could estimate the AUV trajectory.
All of the experiments were well tested on a prototype frame. In practice, the algorithm will be implemented and tested in real time on our designed autonomous underwater robot, as shown in Fig. 15.
Fig. 12. Final trajectory result of the data log.
Fig. 13. Bird's eye view of testing the AUV in the Robosub 2014 competition.
Fig. 14. Visual odometry from the Robosub 2014 competition data log.
Fig. 15. Autonomous underwater robot.
5. Conclusion
A monocular visual odometry estimation was implemented and tested on a prototype frame by looking downward at the ground plane and compensating every input image using pitch and roll from an IMU, to guarantee that the input features were not distorted by the camera's direction. With Lucas-Kanade pyramidal optical flow, Shi-Tomasi features tracked on the ground can be used to calculate the between-frame homography. After the homography is decomposed and the two sets of answers are obtained, the criterion for choosing between them was explained. In order to control robot navigation, the covariance of the visual odometry outputs, x, y, and yaw, is computed by using backward and forward error propagation and Jacobian matrices, and all mathematical derivations are explained. The experimental visual odometry algorithm was tested, showed good results, and will later be implemented on a real underwater robot.
Acknowledgement
This research was supported by the Department of Electrical Engineering in the Faculty of Engineering, Kasetsart University, as part of courses 01205491 and 01205499, Electrical Engineering Project I & II. Finally, the authors want to express sincere gratitude to Mr. Somphop Limsoonthrakul, a doctoral student from the Asian Institute of Technology, Prathum-Thani, Thailand, who has always been a supporter of this research.
References
[1] J. Shi and C. Tomasi, "Good features to track," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '94), 1994. Article (CrossRef Link)
[2] B. D. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," IJCAI, 1981. Article (CrossRef Link)
[3] R. Hartley and A. Zisserman, "Multiple View Geometry in Computer Vision," Second Edition, Cambridge University Press, March 2004. Article (CrossRef Link)
[4] T. Papadopoulo and M. I. A. Lourakis, "Estimating the Jacobian of the singular value decomposition: theory and applications," in Proceedings of the 2000 European Conference on Computer Vision, vol. 1, pp. 554–570, 2000. Article (CrossRef Link)
[5] F. Caballero, L. Merino, J. Ferruz, and A. Ollero, "Vision-Based Odometry and SLAM for Medium and High Altitude Flying UAVs," J Intell Robot Syst, June 2008. Article (CrossRef Link)
[6] P. Drews, G. L. Oliveira, and M. da Silva Figueiredo, "Visual odometry and mapping for Underwater Autonomous Vehicles," in Robotics Symposium (LARS), 2009 6th Latin American. Article (CrossRef Link)
[7] D. Scaramuzza and F. Fraundorfer, "Visual Odometry [Tutorial]," IEEE Robotics & Automation Magazine, vol. 18, no. 4. Article (CrossRef Link)
[8] E. Malis and M. Vargas, "Deeper understanding of the homography decomposition for vision-based control," [Research Report] RR-6303, 2007, pp. 90. Article (CrossRef Link)
Kandith Wongsuwan received his BSc in Electrical Engineering from Kasetsart University (KU), Bangkok, Thailand, in 2015. He is now a team leader of the SKUBA robot team, which was a five-year world champion in the RoboCup small-size robot competition. He also has experience in other autonomous robot competitions, such as the @Home robot, which is an autonomous servant robot, and the autonomous underwater vehicle (AUV). His current research interests include computer vision, robotics, signal processing, image processing, and embedded systems applications.
Kanjanapan Sukvichai received a BSc
with first class honours in electrical
engineering from Kasetsart University,
Thailand, an MSc in electrical and
computer engineering from University
of New Haven, CT, U.S.A., in 2007 and
a DEng in Mechatronics from the Asian
Institute of Technology, Thailand, in
2014. He worked as a Professor in the Department of
Electrical Engineering, Faculty of Engineering, Kasetsart
University, Thailand. He is the advisor to the SKUBA robot
team. He served on the Executive Committee, Organizing
Committee and Technical Committee of the Small Size
Robot Soccer League, which is one division in the RoboCup
organization, from 2009 to 2014. His current research
interests include multi-agent autonomous mobile robot
cooperation, underwater robots, robot AI, machine vision,
machine learning, robot system design, robot system
integration and control of an unstable robot system.
http://dx.doi.org/10.5573/IEIESPC.2015.4.4.224
Real-time Full-view 3D Human Reconstruction using
Multiple RGB-D Cameras
Bumsik Yoon1, Kunwoo Choi2, Moonsu Ra2, and Whoi-Yul Kim2
1 Department of Visual Display, Samsung Electronics / Suwon, South Korea, bsyoon@samsung.com
2 Department of Electronics and Computer Engineering, Hanyang University / Seoul, South Korea
{kwchoi, msna}@vision.hanyang.ac.kr, wykim@hanyang.ac.kr
* Corresponding Author: Whoi-Yul Kim
Received July 15, 2015; Revised August 5, 2015; Accepted August 24, 2015; Published August 31, 2015
* Short Paper
* Extended from a Conference: Preliminary results of this paper were presented at the ITC-CSCC, Summer 2015. The
present paper has been accepted by the editorial board through the regular reviewing process that confirms the original
contribution.
Abstract: This manuscript presents a real-time solution for 3D human body reconstruction with
multiple RGB-D cameras. The proposed system uses four consumer RGB/Depth (RGB-D)
cameras, each located at approximately 90° from the next camera around a freely moving human
body. A single mesh is constructed from the captured point clouds by iteratively removing the
estimated overlapping regions from the boundary. A cell-based mesh construction algorithm is
developed, recovering the 3D shape from various conditions, considering the direction of the
camera and the mesh boundary. The proposed algorithm also allows problematic holes and/or
occluded regions to be recovered from another view. Finally, calibrated RGB data is merged with
the constructed mesh so it can be viewed from an arbitrary direction. The proposed algorithm is
implemented with general-purpose computation on graphics processing unit (GPGPU) for real-time
processing owing to its suitability for parallel processing.
Keywords: 3D reconstruction, RGB-D, Mesh construction, Virtual mirror
1. Introduction
Current advances in technology require three-dimensional (3D) information in many daily life applications,
including multimedia, games, shopping, augmented-reality,
and many other areas. These applications analyze 3D
information and provide a more realistic experience for
users. Rapid growth of the 3D printer market also directs
aspects of practical use for 3D reconstruction data.
However, 3D reconstruction is still a challenging task.
The 3D reconstruction algorithms for given point
clouds can be classified according to spatial subdivision
[1]: surface-oriented algorithms [2, 3], which do not
distinguish between open and closed surfaces; and volume-oriented algorithms [4, 5], which work in particular with
closed surfaces and are generally based on Delaunay
tetrahedralization of the given set of sample points.
Surface-oriented methods have advantages, such as the
ability to reuse the untouched depth map and to rapidly
reconstruct the fused mesh.
In this paper, 3D reconstruction is proposed by fusing
multiple 2.5D data, captured by multiple RGB/Depth
(RGB-D) cameras, specifically with the Microsoft Kinect
[6] device. The use of multiple capturing devices for
various applications means they can concurrently acquire
the image from various points of view. Examples of these
applications are motion capture systems, virtual mirrors,
and 3D telecommunications.
The approach proposed in this manuscript constructs a
mesh by removing the overlapping surfaces from the
boundaries. A similar approach was proposed by Alexiadis
et al. [7]. Meshes generated from the multiple RGB-D
cameras can introduce various noise problems, including
depth fluctuations during measurement, and holes caused by
the interference of infrared (IR) projections from the
multiple cameras. The proposed algorithm reduces these
issues, by considering the direction of the camera pose and
by analyzing various conditions of the captured point clouds.
Fig. 1. 3D reconstruction block diagram.
The paper is organized as follows. Section 2 explains
the proposed algorithm for 3D reconstruction. Section 3
presents the implementation method of the algorithm.
Section 4 discusses the results of the experiment. Finally,
Section 5 concludes this manuscript.
2. 3D Reconstruction Algorithm
In the proposed scheme, RGB-D cameras are installed
at 90° angles from the adjacent cameras and at a distance
of 1 to 2m from the target. The camera parameters and
their initial positions are estimated beforehand. If any
subtle adjustment to the camera positions is required,
optional online calibration can be performed.
At the beginning, depth and RGB data from each
camera are captured concurrently in each thread. The
captured data are synchronized for subsequent processing.
The depth data go through a bilateral filter [8] and are
transformed into the point clouds using the calibrated
intrinsic depth camera parameters. Then, the point clouds
are used to generate cell-based meshes, following the
removal of background points.
After the cell generation, each point cloud is
transformed to global coordinates with calibrated extrinsic
parameters. The redundancies between the point clouds are
removed after the transformation by the iterative boundary
removal algorithm, and the resultant meshes are clipped
together.
The RGB data is transformed to depth coordinates, and
the brightness level is adjusted by investigating the
intensity of the overlapped regions. Finally, the calibrated
RGB data are rendered with the triangulated mesh.
Fig. 1 shows a block diagram of the overall system.
Every color of the module represents a CPU thread, and
the bold and thin lines indicated in the figure show the
flow of data and parameters, respectively.
2.1 Calibrations
A set of checkerboard images is captured from
RGB/Depth cameras to estimate the intrinsic and extrinsic
parameters, for each camera, using a Matlab camera
calibration toolbox. For the depth camera calibration, IR
images are used instead of the depth images, because the
corner points of the checkerboard cannot be detected in a
depth image.
In addition to the depth camera parameters, the shifting
error between the IR and depth [9] is considered, because
the mapped color usually does not match the point cloud,
as shown in Fig. 2(c). Vertices of a colored cube (50×50×50 cm) from the IR and depth images are found to estimate the shifting value. The intersection point of the three edges in the IR image corresponds to the vertex of the cube in the depth image. The vertex can be found via the intersection of the estimated three planes. The found offset is applied in the color-to-depth transformation module.
Fig. 2. Calibration process (a) RGB image, (b) Depth image, (c, d) RGB-D mapped image before and after IR-depth shift correction, (e) Edge vectors from the point cloud of a cube, (f) Multi-Kinect registration result, (g) Point cloud aggregation.
Usually, the extrinsic parameters between two cameras
can be estimated by finding the corresponding corners of
the checkerboard images from the cameras. However, if
the angle between the two cameras is too large, this
method is difficult to use due to the narrow common view
angle. Therefore, a multi-Kinect registration method is
proposed that uses a cube for the calibration object. It
needs only one pair of RGB/depth images per RGB-D
camera in one scene.
Fig. 2(e) shows the edge vectors and the vertex,
identified by the colors of the intersecting three planes for
one camera. The found edge vectors are transformed to the
coordinates of a virtual cube, which has the same size as
the real cube so as to minimize the mean square error of
the distances for four vertices viewable from each camera.
The registered cube and the estimated pose of the depth
cameras are shown in Fig. 2(f), and the aggregated point
cloud is given in Fig. 2(g).
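The per-camera pose step, which aligns the detected cube vertices to the virtual cube so as to minimize the mean square error, can be sketched with the standard SVD-based rigid alignment (Kabsch/Procrustes). The four vertex correspondences are assumed to be given, and the coordinates below are illustrative; this is a stand-in for the authors' registration code, not a reproduction of it.

```python
import numpy as np

def rigid_align(P, Q):
    """Find R, t minimizing sum ||R p_i + t - q_i||^2 for paired points
    P, Q of shape (n, 3) (here n = 4 cube vertices per camera)."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)                 # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflection
    R = Vt.T @ D @ U.T
    t = cq - R @ cp
    return R, t

# P: cube vertices measured by one depth camera, Q: the virtual cube model
P = np.array([[0.1, 0.2, 1.5], [0.6, 0.2, 1.5], [0.1, 0.7, 1.5], [0.1, 0.2, 2.0]])
Q = np.array([[0.0, 0.0, 0.0], [0.5, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.5]])
R, t = rigid_align(P, Q)
print(np.round(R @ P.T + t[:, None] - Q.T, 6).T)   # residuals should be ~0
```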
Online calibration for the extrinsic parameters can be
performed if a slight change in the camera positions occurs
by some accidental movement. An iterative closest point
(ICP) [10] algorithm could be applied for this purpose.
However, there are two kinds of difficulty with traditional
ICP aligning all the point clouds in the proposed system.
First, traditional ICP works only in a pairwise manner.
Second, the point clouds do not have sufficient
overlapping regions to estimate the alignment parameters.
To resolve these problems, a combined solution of
generalized Procrustes analysis (GPA) [11] and sparse ICP
(S-ICP) [12] is adopted. The basic concept is as follows.
1. Extract the common centroid set that would become
the target of S-ICP for all the point clouds.
2. Apply S-ICP on the centroid set for each point cloud.
The difference between GPA presented by Toldo et al. [11] and our proposed method is that only left and right point clouds are used for centroid extraction, as seen in Fig. 3. The direction of the arrow indicates its closest vertex in the neighboring point cloud, and the black dot indicates its centroid.
S-ICP is repeatedly performed until all of the maximum square residual errors of the pairwise registration become less than a sufficiently low value. Fig. 4 shows the transition of the errors when three of the four cameras are initially misaligned by about 10 cm and at 5° in an arbitrary direction.
Fig. 3. Mutual neighboring closest points (a, b, c) Valid cases, (d, e) Invalid cases.
Fig. 4. Registration errors.
2.2 Cell Based Mesh
A cell-based mesh (CBM) is used for redundancy removal, rather than an unordered triangle-based mesh, because a CBM is quick to generate, and it is also feasible to utilize the grid property of the depth map. The projected location of the cell and its boundary condition can be examined rapidly, and this is used frequently in the proposed algorithms.
A cell is constructed if all four edges surrounding the rectangular area in the depth map grid are within the Euclidean distance threshold Dm. During the boundary removal stage, the center of the cell is used, which is calculated by averaging the positions of the four neighboring vertices around the cell (Fig. 5(a)). The normal of each cell is also generated by calculating the cross product of two vectors of the three vertices around the cell.
A boundary cell is simply defined as a cell that does not have any surrounding cells sharing an edge. The direction of the boundary cell is defined as the outward direction from the center to the boundary. For horizontal/vertical boundary cells, the direction is calculated as the weighted sum of vectors from the center to the vertices of the boundary edge (Figs. 5(b) and (c)):
bj = |vj2 − cj| (vj3 − cj) − |vj3 − cj| (vj2 − cj)    (1)
For the diagonal boundary cell, the direction is calculated as the weighted sum of vectors from the center to the diagonal vertices (Fig. 5(d)):
bj = |vj2 − vj3| (vj1 − vj2) − |vj1 − vj2| (vj2 − vj3)    (2)
There are undecidable one-way directional boundary cells, such as a thin line or a point cell. These cells are categorized as non-directional boundary cells and are dealt with accordingly.
Fig. 5. Various cell types (a) No boundary cells, (b, c, d) Examples of directional boundary cells.
Fig. 6. Camera positions.
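A rough NumPy sketch of the cell construction rule of Section 2.2 is shown below for an organized point cloud: it checks the four surrounding edges against Dm and derives the cell center and normal as described, while the boundary-direction handling of Eqs. (1) and (2) is omitted. Array layout and names are illustrative assumptions, not the actual data structures of the system.

```python
import numpy as np

def build_cells(points, Dm=0.02):
    """points: organized point cloud of shape (H, W, 3) in meters (NaN = invalid).
    A cell (y, x) spans vertices (y,x), (y,x+1), (y+1,x), (y+1,x+1) and is valid
    when all four surrounding edges are shorter than Dm."""
    v00 = points[:-1, :-1]; v01 = points[:-1, 1:]
    v10 = points[1:, :-1];  v11 = points[1:, 1:]

    def edge_ok(a, b):
        return np.linalg.norm(a - b, axis=2) < Dm

    valid = (edge_ok(v00, v01) & edge_ok(v10, v11) &
             edge_ok(v00, v10) & edge_ok(v01, v11))

    centers = (v00 + v01 + v10 + v11) / 4.0           # mean of the four vertices
    normals = np.cross(v01 - v00, v10 - v00)          # cross product of two cell edges
    norms = np.linalg.norm(normals, axis=2, keepdims=True)
    normals = np.where(norms > 0, normals / np.maximum(norms, 1e-12), 0.0)
    return valid, centers, normals
```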
2.3 Redundant Boundary Cell Removal
The transformed cells may have redundant surfaces
that overlap surfaces from other camera views. The
redundant cells are removed by the redundant boundary
cell removal (RBCR) algorithm. RBCR utilizes the
direction of the virtual camera ev (Fig. 6), which is the
middle view of its neighboring camera. Using this
direction, we can effectively estimate the redundant
surfaces, minimizing the clipping area. It is also used as
the projection axis for 2D Delaunay triangulation.
Let Mk be the cell mesh generated by camera k, let
Ck , j be the jth cell in Mk , and let ck , j be the center of
cell Ck , j . The index k is labeled in the order of circular
direction. Assuming that Ck , j in Mk is a boundary cell, it
is deemed redundant if a corresponding Ck +1,m can be
found that minimizes the projective distance, d p , with the
constraint that the Euclidean distance (da) between the centers of the cells should be smaller than the maximum distance, Da:
C*k+1,m = argmin over {C | da < Da, C ∈ Mk+1} of (dp)    (3)
The projective distance dp is defined as follows:
dp = da − | (ck+1,m − ck,j) · ev |    (4)
where ev is found by spherical linear interpolation, or “slerp” [13], with angle Ω between camera directions ek and ek+1:
ev = ( sin(Ω/2) / sin(Ω) ) (ek + ek+1)    (5)
To find C*, a projection search method is adopted, i.e., ck,j is projected to the target view of Mk+1, and the cells of Mk+1, in the fixed-size window centered on the projected ck,j, are tested for the conditions.
Once C* is found, the corresponding Ck,j is considered a potentially redundant cell. Additional conditions are tested to decide if the cell is truly redundant, and hence removable.
If the found C* is not a boundary cell and the normal is in the same direction, it is redundant because Ck,j is on or under the surface, not the cell of a thin object. Or, if C* is a directional boundary cell, Ck,j is redundant when Ck,j is close enough to C* so that dp is smaller than the maximum projective distance (Dp), and the boundary directions are not in the same direction. This could be regarded as the depth test in ray-tracing for the direction of ev of the boundary cell.
Mutual directionality is decided by the sign of the inner product of the two directions.
In one loop, RBCR is performed through all ks, for the outermost boundary cells in Mk w.r.t. Mk+1, and vice versa, and is applied iteratively until no cells are removed.
2.4 Boundary Clipping
In this stage, any boundary cell in Mk within distance Da from the boundary of Mk+1 is collected with the same search method as RBCR.
The collected cells are disintegrated into vertices, and are orthogonally projected to the plane of the virtual camera. Then, the projected points are triangulated via the 2D Delaunay algorithm.
2.5 Triangulation
Except for the triangulated cells in the previous boundary clipping stage, all the other cells are simply triangulated by dividing the cell into two triangles. The shorter diagonal edge is selected for triangulation.
2.6 Brightness Adjustment
Although the Kinect device provides auto-exposure functionality, it is not sufficient to tune the multiple RGB cameras. The brightness is tuned online by multiplying by a correction factor. Each factor is calculated by comparing the intensity of the overlapped region with the mean intensity of all overlapped regions. The overlapped regions can be directly extracted from the RBCR operation.
The propagation error from all the cameras is distributed to each correction factor so that the overall gain is 1.
Fig. 7. CUDA kernel composition.
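The core redundancy test of Eqs. (3)–(5) can be sketched as follows; the cell meshes are reduced to arrays of boundary-cell centers, the window-based projection search is replaced by a brute-force scan for brevity, and the threshold value follows Table 1. This is an illustrative outline in Python rather than the CUDA implementation, and the function names are invented for the sketch.

```python
import numpy as np

def virtual_camera_dir(e_k, e_k1):
    """Eq. (5): slerp midpoint of two unit camera direction vectors."""
    omega = np.arccos(np.clip(np.dot(e_k, e_k1), -1.0, 1.0))
    if omega < 1e-8:
        ev = e_k + e_k1
    else:
        ev = (np.sin(omega / 2.0) / np.sin(omega)) * (e_k + e_k1)
    return ev / np.linalg.norm(ev)

def find_redundant(centers_k, centers_k1, ev, Da=0.03):
    """For each boundary-cell center in mesh k, find the cell in mesh k+1 that
    minimizes the projective distance d_p = d_a - |(c' - c) . ev| (Eqs. (3), (4)),
    subject to d_a < Da. Returns the matched index per cell, or -1 if none."""
    matches = []
    for c in centers_k:
        diff = centers_k1 - c                       # (M, 3)
        da = np.linalg.norm(diff, axis=1)
        dp = da - np.abs(diff @ ev)
        dp[da >= Da] = np.inf                       # enforce the Euclidean constraint
        j = int(np.argmin(dp))
        matches.append(j if np.isfinite(dp[j]) else -1)
    return matches
```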
3. Implementation
Among the modules of Fig. 1, the bilateral filter
through the position transform, the redundancy removal,
and color-to-depth transform modules are implemented
under the Compute Unified Device Architecture (CUDA)
[14]. The rendering module is implemented with OpenGL
and all other modules with the CPU.
Fig. 7 shows all of the implemented CUDA kernels that
correspond to the logical modules in Fig. 1.
bilateralKernel is configured to filter one depth pixel with one thread each. The radius and the standard deviation of
the spatial Gaussian kernel were set to 5 pixels and 4.5
pixels, respectively. The standard deviation of the intensity
(i.e. depth value) Gaussian kernel was set to 60mm.
pcdKernel generates the point cloud by back-projecting the depth pixels with the calibrated intrinsic parameters. The kernel also eliminates the background depth pixels with a frustum of near 0.5 m and far 2.5 m.
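The first two kernels can be mimicked on the CPU as follows: OpenCV's bilateral filter stands in for bilateralKernel (with the radius and sigmas quoted above), and the back-projection uses the calibrated intrinsics. This is only an illustrative Python counterpart of the CUDA code, with assumed parameter names.

```python
import cv2
import numpy as np

def filter_and_backproject(depth_mm, fx, fy, cx, cy, near=500.0, far=2500.0):
    """depth_mm: (H, W) float32 depth image in millimeters."""
    # spatial sigma 4.5 px (radius ~5 px), intensity (depth) sigma 60 mm
    smoothed = cv2.bilateralFilter(depth_mm.astype(np.float32),
                                   d=11, sigmaColor=60.0, sigmaSpace=4.5)
    # frustum clipping: drop background outside [near, far]
    smoothed[(smoothed < near) | (smoothed > far)] = np.nan

    h, w = smoothed.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = smoothed / 1000.0                       # meters
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.dstack([x, y, z])                 # organized point cloud (H, W, 3)
```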
The cell generation module consists of three kernels.
gridKernel marks the valid vertices within distance Dm .
As the neighboring relationship is needed to check the
validity, four vertices are marked with an atomicOr barrier
function if they turn out to be valid. reduceKernel reduces
the grid vertices to a reduced stream, generating the
indices for the marked vertices. cellKernel constructs the
cell information if all of the neighboring vertices are valid.
The constructed cell information includes both the normal
and the center of the cell.
The positional transformation is done in taKernel. It
includes vertex, normal, center transform and the cell
projection. Although the kernel could be implemented with
a general transform kernel, as the transforms use the same
parameter, it is more efficient to process them all at once
by reducing the kernel launch time, rather than by calling
the general purpose kernel multiple times.
The RBCR algorithm is designed to run concurrently
for the four pairs of the mesh by using the CUDA stream
feature, not using the CPU thread, because the status of the
cells needs to be synchronized for every loop. rbKernel
just removes the first outermost boundary cells because the
measured boundary depth tends to be considerably
inaccurate.
The RBCR loop runs with the two CUDA kernels.
• rrKernel: Searches (3) and marks the flag for the cells
to be removed.
• updateKernel: Removes all the marked cells and returns
the number of removed cells.
The two-kernel approach makes the mesh maintain the
consistency of the boundary condition in a loop. The
search function in rrKernel is designed to use 32
concurrent threads per cell for a 16×16 search window. This leads to looping 8 times for one complete search, and to using 32 elements of shared memory for intermediate storage of
the partial reduction. The grid size is defined as the
number of cells ( N c ) divided by the number of cells per
block ( N cpb ). N cpb is tuned to 16 as a result of
performance tuning that maximizes the speed.
The synchronization of the cell status is done
automatically when the remove counter is copied from the
device to the host with the default stream.
To accelerate RBCR and keep a constant speed, the
loop is terminated after the eighth iteration (max_loop=8)
and one more search is done for all the remaining cells,
including the non-boundary cells.
The boundary cells of the RBCR results are collected
with collectKernel by a method similar to rrKernel but
without the iterative loop.
colorKernel maps the color coordinates to the depth
coordinates followed by correction of the radial, tangential
distortions, and IR-depth shifting error using the calibrated
parameters. The operation is performed only for the
reduced cells.
The boundary clipping module runs on CPU threads
other than the RBCR thread to reduce the waiting time for
RBCR. The Delaunay triangulation (DT) algorithm is
implemented with the Computational Geometry
Algorithms Library (CGAL) [15]. As DT generates the
convex set of triangles, long-edged triangles (> Dt) are eliminated after the triangulation.
Table 1. Experiment Parameters.
Parameters | Values
Dm (max cell edge) | 2cm
Da (max Euclidean distance) | 3cm
Dp (max projected distance) | 0.5cm
Dt (max triangle edge length) | 3cm
Table 2. Latencies.
Modules | Latencies
Sync | 16.2ms
Cell Gen. | 11.2ms
Pos. Trfm. | 8.1ms
Redun. Rem. | 26.5ms
Triangulation | 24.3ms
Total | 86.3ms
Fig. 8. Performance analysis.
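For the boundary-clipping step, the 2D Delaunay triangulation followed by removal of long-edged triangles (> Dt) can be sketched with SciPy in place of CGAL for illustration; the projection of the collected vertices onto the virtual-camera plane is assumed to have been done already.

```python
import numpy as np
from scipy.spatial import Delaunay

def clip_triangulate(points_2d, Dt=0.03):
    """points_2d: (N, 2) boundary vertices projected onto the virtual-camera plane.
    Returns triangles (vertex index triples) whose edges are all <= Dt."""
    tri = Delaunay(points_2d)
    keep = []
    for a, b, c in tri.simplices:
        pa, pb, pc = points_2d[a], points_2d[b], points_2d[c]
        longest = max(np.linalg.norm(pa - pb),
                      np.linalg.norm(pb - pc),
                      np.linalg.norm(pc - pa))
        if longest <= Dt:                 # discard long-edged convex-hull fillers
            keep.append((a, b, c))
    return np.array(keep, dtype=int)
```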
We adapt the sparse ICP module [16] using an external
kd-tree for mutual neighboring of closest points. The
point-to-point A 0.4 -ICP is used for optimization, with max
inner and outer iterations of 1 and 20, respectively.
The parameter values used in this paper are given in
Table 1.
The resolutions for input depth and RGB are both
640×480, and the equipment used for the implementation
was a desktop PC with an Intel i7 3.6GHz core and an
NVidia GTX970 graphics card.
4. Results
Fig. 8 gives the performance analysis results of NVidia Nsight for the implemented kernels. As expected, it shows that rrKernel is computationally the most expensive kernel. The timeline shows that the speed of the overall system is approximately 21fps. The latencies measured at the end of each module are described in Table 2.
Fig. 9(a) shows various views of the reconstructed
human mesh that can be seen on the run. The bottom row
is the color map of the reconstructed mesh, where color
represents the mesh from the corresponding camera. The
thin purple line indicates the clipped area. Compared to the
original unprocessed mesh in Fig. 9(c), we can see that the
resultant mesh has no redundancies and is clipped cleanly.
Fig. 9. Result of various views (a) max_loop = 8, (b) max_loop = 24, (c) The original.
Fig. 10. Result of brightness adjustment (a) before, (b) after.
Fig. 9(b) is the result of RBCR when max_loop is equal to
24, showing almost no difference when max_loop is equal
to 8.
Fig. 10 gives the result of the brightness adjustment
showing that the mismatched color in the cloth is
effectively corrected.
5. Conclusion
In this paper, it is shown that the proposed algorithm
and the implementation method could reconstruct a 3D
mesh effectively, supporting a 360-degree viewing
direction with multiple consumer RGB-D cameras. The
proposed calibration method, which uses a cube as a
calibration object, could estimate the color/depth camera
parameters and the global position of the cameras
effectively, accommodated by the online calibration
method that exploits mutual neighboring closest points,
and a sparse ICP algorithm. The constructed mesh had no
redundancies after application of the proposed algorithm,
which iteratively removes the estimated redundant regions
from the boundary of the mesh. In addition, the proposed
3D reconstruction system could adjust the mismatched
brightness between the RGB-D cameras by using the
collateral overlapping region of the redundancy removal
algorithm. The overall speed for implementation was 21fps
with a latency of 86.3ms, which is sufficient for real-time
processing.
References
[1] M. Botsch, et al., Polygon Mesh Processing, AK
Peters, London, 2010. Article (CrossRef Link)
[2] H. Hoppe, et al., “Surface reconstruction from
unorganized points,” SIGGRAPH ’92, 1992. Article
(CrossRef Link)
[3] R. Mencl and H. Müller. Graph–based surface
reconstruction using structures in scattered point sets.
In Proceedings of CGI ’98 (Computer Graphics
International), pp. 298–311, June, 1998. Article
(CrossRef Link)
[4] B. Curless and M. Levoy, “A volumetric method for
building complex models from range images,”
SIGGRAPH ’96, 1996. Article (CrossRef Link)
[5] M. Kazhdan, M. Bolitho, and H. Hoppe, “Poisson
Surface Reconstruction,” Proc. Symp. Geometry
Processing, 2006. Article (CrossRef Link)
[6] Microsoft Kinect Article (CrossRef Link)
[7] D. Alexiadis, D. Zarpalas, and P. Daras, “Real-time,
full 3-D reconstruction of moving foreground objects
from multiple consumer depth cameras,” IEEE
Trans on Multimedia, vol. 15, pp. 339 – 358, Feb.
2013. Article (CrossRef Link)
[8] C. Tomasi and R. Manduchi, “Bilateral filtering for gray and color images,” in Proc. of the ICCV, 1998. Article (CrossRef Link)
[9] J. Smisek, M. Jancosek, and T. Pajdla, “3D with Kinect,” 2011 ICCV Workshops, pp. 1154-1160, Nov. 2011. Article (CrossRef Link)
[10] P. Besl and N. McKay, “A Method for Registration of 3-D Shapes,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 14, pp. 239-256, 1992. Article (CrossRef Link)
[11] R. Toldo, A. Beinat, and F. Crosilla, “Global registration of multiple point clouds embedding the generalized procrustes analysis into an ICP framework,” in 3DPVT 2010 Conf., Paris, May 17-20, 2010. Article (CrossRef Link)
[12] S. Bouaziz, A. Tagliasacchi, and M. Pauly, “Sparse Iterative Closest Point,” Computer Graphics Forum, vol. 32, no. 5, pp. 1–11, 2013. Article (CrossRef Link)
[13] K. Shoemake, “Animating rotation with quaternion curves,” in Proc. of the SIGGRAPH ’85, 1985, pp. 245-254. Article (CrossRef Link)
[14] NVidia CUDA. Article (CrossRef Link)
[15] CGAL. Article (CrossRef Link)
[16] Sparse ICP. Article (CrossRef Link)
Bumsik Yoon received his BSc and MSc in Electrical Engineering from Yonsei University, Korea, in 1997 and 2000, respectively. He is a senior researcher at Samsung Electronics. Currently, he is pursuing his PhD at Hanyang University, Korea. His research interests include 3D reconstruction, pedestrian detection, time-of-flight, and computer graphics.
Kunwoo Choi received his BSc in
Electronics Engineering at Konkuk
University, Korea, in 2014. He is
currently pursuing an MSc in
Electronics and Computer Engineering
at Hanyang University. His research
interests include depth acquisition and
vehicle vision systems.
Moonsu Ra received his BSc and
MSc at Hanyang University, Korea, in
2011 and 2013, respectively. He is
currently pursuing his PhD at the same
university. His research interests
include visual surveillance, face
tracking and identification, and video
synopsis.
Whoi-Yul Kim received his PhD in
Electronics Engineering from Purdue
University, West Lafayette, Indiana, in
1989. From 1989 to 1994, he was with
the Erick Johansson School of
Engineering and Computer Science at
the University of Texas at Dallas. He
joined Hanyang University in 1994,
where he is now a professor in the Department of
Electronics and Computer Engineering. His research
interests include visual surveillance, face tracking and
identification, and video synopsis.
http://dx.doi.org/10.5573/IEIESPC.2015.4.4.231
Analysis of Screen Content Coding Based on HEVC
Yong-Jo Ahn1, Hochan Ryu2, Donggyu Sim1 and Jung-Won Kang3
1 Department of Computer Engineering, Kwangwoon University / Seoul, Rep. of Korea {yongjoahn, dgsim}@kw.ac.kr
2 Digital Insights Co. / Seoul, Rep. of Korea hc.ryu@digitalinsights.co.kr
3 Broadcasting & Telecommunications Media Research Laboratory, ETRI / Daejeon, Rep. of Korea jungwon@etri.re.kr
* Corresponding Author: Donggyu Sim
Received June 20, 2015; Revised July 15, 2015; Accepted August 24, 2015; Published August 31, 2015
* Short Paper: This paper reviews the recent progress, possibly including previous works in a particular research topic, and
has been accepted by the editorial board through the regular reviewing process.
Abstract: In this paper, the technical analysis and characteristics of screen content coding (SCC)
based on High efficiency video coding (HEVC) are presented. For SCC, which is increasingly used
these days, HEVC SCC standardization has proceeded. Technologies such as intra block copy (IBC), palette coding, and adaptive color transform have been developed and adopted into the HEVC SCC standard. This paper examines IBC and palette coding, which significantly impact the RD performance of SCC for screen content. The HEVC SCC reference model (SCM) 4.0 was used to comparatively
analyze the coding performance of HEVC SCC based on the HEVC range extension (RExt) model
for screen content.
Keywords: HEVC SCC, Screen content coding
1. Introduction
High efficiency video coding (HEVC) is the latest
video compression standard of the Joint Collaborative
Team on Video Coding (JCT-VC), which was established
by the ITU-T Video Coding Experts Group (VCEG) and
the ISO/IEC Moving Picture Experts Group (MPEG).
Several extensions and profiles of HEVC have been
developed according to application areas and objects to be
coded [1]. The standardization of HEVC version 1 for
major applications such as ultra-high definition (UHD)
content was completed in January 2013. Furthermore, the
standardization of HEVC version 2 for additional
applications, such as high-quality, scalable, and 3D video
services, was released in October 2014. HEVC version 2
includes 21 range extension profiles, two scalable
extension profiles, and one multi-view extension profile
[2]. First, the HEVC range extension (RExt) standard aims
to support the extended color formats and high bit depths
equal to or higher than 12 bits, which HEVC version 1
does not support. In addition, the HEVC scalable extension
(SHVC) standard supports multi-layer video coding
according to consumer communications and market
environments, and the HEVC multi-view extension (MV-HEVC) aims to support multi-view video services. Recently,
emerging requirements for screen content coding have
been issued, and the extension for SCC was kicked off
based on HEVC [3]. In addition, the HEVC HDR/WCG
extension for high-dynamic-range (HDR) and wide-color-gamut (WCG) coding is being discussed [4].
The HEVC SCC extension is designed for mixed
content that consists of natural videos, computer-generated
graphics, text, and animation, which are increasingly being
used. The HEVC SCC extension has been discussed since
the 17th JCT-VC meeting in March 2014, and it is being
standardized with the goal of completion in February 2016.
HEVC is known to be efficient for natural video but not
for computer-generated graphics, text, and so on. That
content has high-frequency components due to sharp edges
and lines. Conventional video coders, in general, remove
high-frequency components for compression purposes.
However, HEVC SCC includes all coding techniques
supported by HEVC RExt, and additionally, has IBC,
palette coding, and adaptive color space transform. In this
study, IBC and palette coding are explained in detail in
relation to screen content, among the newly added coding
techniques. In addition, the HEVC SCC reference model
(SCM) is used to present and analyze the coding performance for screen content. The result of the formal
subjective assessment and objective testing showed a clear
improvement in comparison to HEVC RExt [5].
Fig. 1. HEVC SCC decoder block diagram.

Fig. 2. Intra block copy and block vector.
This paper is organized as follows. Chapter 2 explains
IBC and palette coding, which are the coding techniques
added to HEVC SCC. Chapter 3 presents and analyzes the
coding performance of HEVC SCC for screen content.
Finally, Chapter 4 concludes this study.
2. HEVC Screen Content Coding
Techniques
HEVC SCC employs new coding techniques in
addition to all the techniques adopted to HEVC RExt. Fig.
1 shows the block diagram for the HEVC SCC decoder,
which includes newly adopted IBC, palette mode, and
adaptive color transform (ACT). This chapter explains IBC
mode coding and palette mode coding, considering the
characteristics of screen content.
2.1 IBC Mode
As mentioned in Chapter 1, screen content is likely to
have similar patterns on one screen, unlike natural images.
Such a spatial redundancy is different from the removable
spatial redundancy under the existing intra prediction
schemes. The most significant difference is the distance
and shapes from neighboring objects. Removable spatial
redundancy with the intra prediction schemes refers to the
similarity between the boundary pixels of the block to be
coded and the adjacent pixels located spatially within one
pixel. The removable spatial redundancy in IBC mode
refers to the similarity between the area in the reconstructed picture and the block to be coded [6]. Unlike the
conventional intra prediction schemes, a target 2D block is
predicted from a reconstructed 2D block that is more than
one pixel distant from it. Inter prediction should have the
motion information, the so-called motion vector, with
respect to the reference block to remove the temporal
redundancy in the previously decoded frames. In the same
manner, IBC mode also needs the location information for
the reference block in the form of a vector in the same
frame. Fig. 2 shows the concept of IBC and the block
vector.
Fig. 3. Candidates of spatial motion vector predictor
and block vector predictor.
In HEVC SCC, a vector that locates the reference block
is called the block vector (BV) [7]. Conceptually, this is
considered to be similar to the motion vector (MV) of inter
prediction. The block vector and motion vector have
differences and similarities. In terms of the accuracy of the
vectors, MV has quarter-pel accuracy to ensure improved
prediction accuracy, whereas BV has integer-pel accuracy.
This is because of the characteristics of the screen content
in IBC mode. In computer-generated graphics, objects are
generated pixel by pixel. The key feature of IBC is that prediction is conducted on the reconstructed area of the current frame, not on previously decoded frames. In
addition, the BV should be sent to the decoder side, but the
block vector is predicted to reduce the amount of data in a
manner similar to the motion vector. During the HEVC
SCC standardization process, various algorithms of BV
prediction techniques were discussed. They can be
classified into a BV prediction method independent from
the MV prediction method, and a prediction method
working in the same way as the MV prediction method.
In the BV prediction method that is independent of the
MV prediction method, the BV of an adjacent block and the BV of a non-adjacent but IBC-coded block are taken as BV predictor (BVP) candidates, and one of them is selected. Fig. 3 shows the spatial candidates under
MV and BV. In the case of MV, the MVs of all prediction
units (PUs), A0, A1, B0, B1, and B2, are used as spatial
candidates. With the BV, however, only A1 and B1 are
used as spatial candidates. When the left adjacent block,
A1, of the PU is coded with IBC mode, the BV of A1 is
selected as the BVP candidate. Then, when the above block B1 of the PU is coded with IBC mode, the
BV of B1 is also selected as the BVP candidate.
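To make this independent BVP selection concrete, the sketch below collects spatial candidates only from the A1 (left) and B1 (above) neighbors, and only when they are IBC-coded. The names NeighborPU and build_bvp_candidates are hypothetical illustrations, not SCM syntax.

```python
from typing import List, Optional, Tuple

BlockVector = Tuple[int, int]  # (bv_x, bv_y), integer-pel accuracy

class NeighborPU:
    """Minimal stand-in for a neighboring prediction unit (hypothetical)."""
    def __init__(self, coded_in_ibc: bool, bv: Optional[BlockVector] = None):
        self.coded_in_ibc = coded_in_ibc
        self.bv = bv

def build_bvp_candidates(a1: Optional[NeighborPU],
                         b1: Optional[NeighborPU]) -> List[BlockVector]:
    """Collect spatial BVP candidates: only the left (A1) and above (B1)
    neighbors contribute, and only when they are coded in IBC mode."""
    candidates: List[BlockVector] = []
    for neighbor in (a1, b1):
        if neighbor is not None and neighbor.coded_in_ibc and neighbor.bv is not None:
            candidates.append(neighbor.bv)
    return candidates

# Example: left neighbor IBC-coded, above neighbor not IBC-coded.
print(build_bvp_candidates(NeighborPU(True, (-16, 0)), NeighborPU(False)))
```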
In the BV prediction method that works in the same way as the MV prediction method, the partly reconstructed current picture is added to the reference picture list, and IBC mode of the current PU can refer to the reconstructed area of the current picture.
The second method was proposed at the 20th JCT-VC meeting and was adopted. Although the BV can be
predicted, as in the existing MV prediction method,
because the current picture is added to the reference list,
predictors may exist that do not conform to the characteristics of IBC during the advanced motion vector
prediction (AMVP) and merge candidate list creation
processes. However, it still has several problems, such as
the difference in vector accuracy between MV and BV,
optimal BVP candidates, and so on. To address these problems, studies are being conducted to change the zero candidate and the candidate list construction, considering the IBC characteristics.
2.2 Palette Mode
HEVC SCC has palette mode in addition to IBC mode.
In palette mode, pixel values that frequently appear in the
block are expressed in indices and then coded [8]. In
HEVC SCC, the index for palette mode coding is defined
as the color index, and the mapping information between
the pixel value and the color index is defined in the palette
table. Palette mode can improve coding efficiency when
the prediction does not work due to low redundancy and
when the number of pixel values for the block is small.
Unlike the existing coding modes, which use intra/inter prediction to remove spatial/temporal redundancy,
palette mode expresses the pixel values that form the
current block with the color index, so coding is possible
independent of the already restored adjacent information.
In addition, fewer color indexes are required than the total
number of pixels in a current block, which improves
coding efficiency.
The coding process in HEVC SCC palette mode
consists of two steps: the palette table coding process and
the color index map coding process. The palette table
coding process is conducted as follows. First, assuming
that N peaks exist when pixel values are shown in a
histogram, N peak pixel values are defined as major colors.
N major colors are mapped as the color indices via
quantization, with the colors in the quantization zone as
the major colors. The colors that exist outside the quantization zone, which are not expressed as major colors, are
defined as escape colors, which are coded not using the
color index but the quantization of the pixel values. The
table with the generated color indices is defined as the
palette table for each coding unit (CU). The palette table
conducts prediction coding by referring to the previous CU
coded by palette mode. Whether or not the prediction
coding is conducted is coded using the previous_palette_entry_flag. Prediction coding uses palette stuffing, which
utilizes the predictor of the previous CU [9]. Fig. 4 shows
the histogram of the pixel values and the resulting major/escape colors of palette mode.

Fig. 4. Histogram of the pixel values and major/escape colors of palette mode.
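The palette-table construction can be pictured with the following toy sketch: histogram peaks become major colors, pixels inside a quantization zone around a major color take its index, and the remaining pixels are marked as escape colors. The peak selection and the zone threshold here are illustrative assumptions, not the normative SCM procedure.

```python
import numpy as np

def build_palette(block: np.ndarray, num_major: int = 4, zone: int = 4):
    """Pick the num_major most frequent pixel values as major colors and
    classify every pixel as a color index or as an escape color."""
    values, counts = np.unique(block, return_counts=True)
    major = values[np.argsort(counts)[::-1][:num_major]]            # histogram peaks
    palette = {int(c): idx for idx, c in enumerate(sorted(major))}  # palette table

    index_map = np.full(block.shape, -1, dtype=int)   # -1 marks escape pixels
    for value, idx in palette.items():
        # Pixels inside the quantization zone of a major color share its index.
        index_map[np.abs(block.astype(int) - value) <= zone] = idx
    return palette, index_map

block = np.array([[20, 20, 200], [20, 22, 200], [90, 200, 200]], dtype=np.uint8)
palette, index_map = build_palette(block, num_major=2)
print(palette)     # two major colors, e.g. {20: 0, 200: 1}
print(index_map)   # the isolated value 90 stays -1 -> coded as an escape color
```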
Then, the current CU is coded through the color index
map coding process. The color index map refers to the
block expressed by replacing the pixel value of the current
CU with the color index. The information coded during the
color index map coding process includes the color index
scanning direction and the color index map. The scanning
direction of the color index is coded per CU using the palette_transpose_flag, and three types of color index
map, the INDEX mode, COPY_ABOVE mode, and
ESCAPE mode, are coded. In INDEX mode, run-length
coding is conducted for the color index value, and the
mode index, color index, and run-length are coded. In
COPY_ABOVE mode, which copies the color index of the
row above, the mode index and run-length are
coded. Finally, in ESCAPE mode, which uses the pixel
value as it is, the mode index and the quantized pixel value
are coded.
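A rough sketch of how a color index map could be serialized into INDEX, COPY_ABOVE, and ESCAPE symbols with run lengths is given below. The actual HEVC SCC syntax and run semantics are more elaborate, so treat this only as an illustration.

```python
ESCAPE = -1  # index value reserved for escape pixels in this sketch

def encode_index_map(index_map):
    """Scan the index map row by row and emit (mode, payload, run) symbols:
    COPY_ABOVE when a run matches the row above, INDEX for a run of one
    color index, and ESCAPE for pixels kept as (quantized) sample values."""
    symbols = []
    for r, row in enumerate(index_map):
        c = 0
        while c < len(row):
            if index_map[r][c] == ESCAPE:
                # Real SCC codes the quantized pixel value here.
                symbols.append(("ESCAPE", row[c], 1))
                c += 1
                continue
            # Try COPY_ABOVE first (only possible from the second row on).
            run = 0
            while (r > 0 and c + run < len(row)
                   and row[c + run] != ESCAPE
                   and row[c + run] == index_map[r - 1][c + run]):
                run += 1
            if run > 0:
                symbols.append(("COPY_ABOVE", None, run))
            else:
                while c + run < len(row) and row[c + run] == row[c]:
                    run += 1
                symbols.append(("INDEX", row[c], run))
            c += run
    return symbols

demo = [[0, 0, 1, 1],
        [0, 0, 1, ESCAPE],
        [0, 0, 1, ESCAPE]]
print(encode_index_map(demo))
```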
3. Analysis and Performance Evaluation
of HEVC SCC
In this chapter, the HEVC SCC reference model (SCM)
4.0 [10] is used to analyze the coding performance of
HEVC SCC for the screen content against the reference
model for HEVC RExt. All the tests were conducted under
the HEVC SCC common test condition (CTC) [11] to
obtain the coding performance of HEVC SCC. In
addition, the HEVC SCC common test sequences were
used in the test, which were classified into four categories.
Text and graphics with motion (TGM) contains images with text and graphics combined, and best shows the characteristics of screen content. Mixed contents (M) is
an image that contains mixed characteristics of screen
content and natural images. In addition, there are
categories such as animation (A) and natural image
camera-captured content (CC). Fig. 5 shows the four
categories of the HEVC SCC common test sequences.
The coding performance and speed were measured in the same test environment. Table 1 lists the details of the test environment. In addition, the Bjontegaard distortion-bitrate (BD-BR) was used to compare the coding performance, and the time ratio was used to measure the coding speed.

Table 1. Test environments.
            Specification
CPU         Intel Core i7-3960X 3.30 GHz
Memory      16 GB
OS          Windows 7
Compiler    VS 2012

Fig. 5. Four categories of common test sequences: text and graphics with motion (TGM), mixed contents (M), animation (A), and camera-captured content (CC).

Table 2. BD-BR performance evaluation of SCM 4.0 compared to HM 16.0 in All Intra.
Category               Y        U        V
TGM, 1080p & 720p     -57.4%   -61.2%   -62.7%
M, 1440p & 720p       -44.9%   -50.3%   -50.4%
A, 720p                 0.0%    -8.5%    -5.2%
CC, 1080p               5.4%     8.6%    12.5%
Encoding time (%): 347
Decoding time (%): 121

Table 3. BD-BR performance evaluation of SCM 4.0 compared to HM 16.0 in Random Access.
Category               Y        U        V
TGM, 1080p & 720p     -48.0%   -52.4%   -55.0%
M, 1440p & 720p       -36.3%   -43.6%   -43.7%
A, 720p                 1.8%    -5.3%    -2.2%
CC, 1080p               6.0%    15.8%    20.1%
Encoding time (%): 139
Decoding time (%): 147

Table 4. BD-BR performance evaluation of SCM 4.0 compared to HM 16.0 in Low delay B.
Category               Y        U        V
TGM, 1080p & 720p     -41.4%   -45.3%   -48.0%
M, 1440p & 720p       -23.7%   -32.1%   -32.2%
A, 720p                 2.7%    -2.0%     0.6%
CC, 1080p               6.0%    14.2%    17.6%
Encoding time (%): 141
Decoding time (%): 145
Tables 2 to 4 show the coding performance when the
HEVC SCC common test sequences were coded in HM
16.0 and SCM 4.0. For the screen content, SCM 4.0 had
19.1% Y BD-BR gain, 21.8% U BD-BR gain, and 20.7%
V BD-BR gain, compared with HM 16.0. In the TGM
category, which has strong screen content characteristics,
48.9% Y BD-BR gain, 53.0% U BD-BR gain, and 55.2%
V BD-BR gain were obtained, and the results show that the
newly added IBC mode and palette mode performed well.
In contrast, 5.8% Y BD-BR loss, 12.9% U BD-BR loss, and 16.7% V BD-BR loss were obtained in the CC category,
which has strong natural content characteristics. As shown
in Tables 2 to 4, the newly added coding tools, IBC and
palette mode, increased encoding time by 347%, 139%,
and 141% for All Intra, Random Access, and Low delay B,
respectively. Given these results, fast encoding algorithms for
IBC and palette mode are expected to be widely studied in
the future.
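For reference, the BD-BR figures quoted in this chapter follow the Bjontegaard procedure. A minimal sketch of one common formulation (cubic fit of log-rate over PSNR, integrated over the overlapping PSNR interval) is shown below with hypothetical RD points; it is not the exact JCT-VC reporting tool.

```python
import numpy as np

def bd_rate(rate_ref, psnr_ref, rate_test, psnr_test):
    """Average bitrate difference (%) of the test codec against the reference:
    fit a cubic polynomial of log10(rate) as a function of PSNR for each curve
    and integrate the gap over the overlapping PSNR interval."""
    lr_ref, lr_test = np.log10(rate_ref), np.log10(rate_test)
    p_ref = np.polyfit(psnr_ref, lr_ref, 3)
    p_test = np.polyfit(psnr_test, lr_test, 3)
    lo = max(min(psnr_ref), min(psnr_test))
    hi = min(max(psnr_ref), max(psnr_test))
    # Integrate each fitted curve over the common PSNR range.
    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)
    avg_diff = (int_test - int_ref) / (hi - lo)      # mean log10 rate gap
    return (10 ** avg_diff - 1) * 100                # percent bitrate change

# Hypothetical RD points (kbps, dB) for four QPs; a negative result means bitrate savings.
print(bd_rate([800, 1200, 2000, 3500], [34.0, 36.1, 38.3, 40.2],
              [500, 800, 1400, 2600], [34.2, 36.3, 38.5, 40.4]))
```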
4. Conclusion
In this paper, the newly added algorithms of HEVC
SCC were introduced, and the HEVC SCC coding
performance was also analyzed for screen content. The
coding characteristics of the new coding tools that consider
the screen content, which are intra block copy (IBC) mode
and palette mode, were introduced in detail. As for coding
performance of the screen content, HEVC SCC had 19.1%
BD-BR gain compared with HEVC RExt. The HEVC SCC
standardization will be completed in October 2015 and is
expected to be widely used in various screen content
applications.
Acknowledgement
This work was partly supported by Institute for
Information & communications Technology Promotion
(IITP) grant funded by the Korea government (MSIP) (No. R010-14-283, Cloud-based Streaming Service Development for UHD Broadcasting Contents) and by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (NRF-2014R1A2A1A11052210).
References
[1] G. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, "Overview of the high efficiency video coding (HEVC) standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649-1668, 2012. Article (CrossRef Link)
[2] G. Sullivan, J. Boyce, Y. Chen, J.-R. Ohm, C. Segall, and A. Vetro, "Standardized extensions of high efficiency video coding (HEVC)," IEEE Journal of Selected Topics in Signal Processing, vol. 7, pp. 1001-1016, Dec. 2013. Article (CrossRef Link)
[3] ITU-T Q6/16 Visual Coding and ISO/IEC JTC1/SC29/WG11 Coding of Moving Pictures and Audio, "Joint call for proposals for coding of screen content," ISO/IEC JTC1/SC29/WG11 MPEG2014/N14175, Jan. 2014.
[4] ISO/IEC JTC1/SC29/WG11 Coding of Moving Pictures and Audio, "Call for evidence (CfE) for HDR and WCG video coding," ISO/IEC JTC1/SC29/WG11 MPEG2014/N15083, Feb. 2015.
[5] ISO/IEC JTC1/SC29/WG11 Coding of Moving Pictures and Audio, "Results of CfP on screen content coding tools for HEVC," ISO/IEC JTC1/SC29/WG11 MPEG2014/N14399, April 2014.
[6] S.-L. Yu and C. Chrysafis, "New intra prediction using intra-macroblock motion compensation," JVT-C151, ISO/IEC JTC1/SC29/WG11 and ITU-T SG16 Q.6, May 2002.
[7] M. Budagavi and D.-K. Kwon, "AHG8: Video coding using intra motion compensation," JCTVC-M0350, Incheon, KR, Apr. 2013.
[8] Y.-W. Huang, P. Onno, R. Joshi, R. Cohen, X. Xiu, and Z. Ma, "HEVC screen content core experiment 3 (SCCE3): palette mode," JCTVC-Q1123, Valencia, ES, March 2014.
[9] C. Gisquet, G. Laroche, and P. Onno, "SCCE3 Test A.3: palette stuffing," JCTVC-R0082, Sapporo, JP, June 2014.
[10] HEVC screen content coding reference model (SCM): 'HM-16.4+SCM-4.0', Article (CrossRef Link), accessed July 2015.
[11] H. Yu, R. Cohen, K. Rapaka, and J. Xu, "Common test conditions for screen content coding," JCTVC-T1015, Geneva, CH, Feb. 2015.
Yong-Jo Ahn received the B.S. and
M.S. degrees in Computer Engineering
from Kwangwoon University, Seoul,
Korea, in 2010 and 2012, respectively.
He is currently a Ph.D. candidate at the same university. His current
research interests are high-efficiency
video compression, parallel processing
for video coding and multimedia systems.
Hochan Ryu received the B.S. and
M.S. degrees in Computer Engineering
from Kwangwoon University, Seoul,
Rep. of Korea, in 2013 and 2015,
respectively. Since 2015, he has been
an associate research engineer at
Digital Insights Co., Rep. of Korea.
His current research interests are
video coding, video processing, parallel processing for
video coding and multimedia systems.
Donggyu Sim received the B.S. and
M.S. degrees in Electronic Engineering
from Sogang University, Seoul, Korea,
in 1993 and 1995, respectively. He
also received the Ph.D. degree at the
same university in 1999. He was with
Hyundai Electronics Co., Ltd., from 1999 to 2000, where he was involved in MPEG-7 standardization. He was a senior research
engineer at Varo Vision Co., Ltd., working on MPEG-4
wireless applications from 2000 to 2002. He worked for
the Image Computing Systems Lab. (ICSL) at the
University of Washington as a senior research engineer
from 2002 to 2005, where he researched ultrasound image
analysis and parametric video coding. Since 2005, he has
been with the Department of Computer Engineering at
Kwangwoon University, Seoul, Korea. In 2011, he joined
Simon Fraser University as a visiting scholar. He was elevated to IEEE Senior Member in 2004. He is one of the main inventors of many essential patents licensed to MPEG-LA for the HEVC standard. His current research
interests are video coding, video processing, computer
vision, and video communication.
Jung-Won Kang received her BS and
MS degrees in electrical engineering
in 1993 and 1995, respectively, from
Korea Aerospace University, Seoul,
Rep. of Korea. She received her PhD
degree in electrical and computer
engineering in 2003 from the Georgia
Institute of Technology, Atlanta, GA,
US. Since 2003, she has been a senior member of the
research staff in the Broadcasting & Telecommunications
Media Research Laboratory, ETRI, Rep. of Korea. Her
research interests are in the areas of video signal processing, video coding, and video adaptation.
Copyrights © 2015 The Institute of Electronics and Information Engineers
http://dx.doi.org/10.5573/IEIESPC.2015.4.4.237
IEIE Transactions on Smart Processing and Computing
Real-time Speed Limit Traffic Sign Detection System for
Robust Automotive Environments
Anh-Tuan Hoang, Tetsushi Koide, and Masaharu Yamamoto
Research Institute for Nanodevice and Bio Systems, Hiroshima University, 1-4-2, Kagamiyama, Higashi-Hiroshima,
739-8527, Japan {anhtuan, koide}@hiroshima-u.ac.jp
* Corresponding Author: Anh-Tuan Hoang
Received July 15, 2015; Revised August 5, 2015; Accepted August 24, 2015; Published August 31, 2015
* Regular Paper
* Extended from a Conference: Preliminary results of this paper were presented at the ITC-CSCC 2015, Summer 2015. This
present paper has been accepted by the editorial board through the regular reviewing process that confirms the original
contribution.
Abstract: This paper describes a hardware-oriented algorithm and its conceptual implementation in
a real-time speed limit traffic sign detection system on an automotive-oriented field-programmable
gate array (FPGA). It solves the training and color dependence problems found in other research,
which saw reduced recognition accuracy under unlearned conditions when color has changed. The
algorithm is applicable to various platforms, such as color or grayscale cameras, high-resolution
(4K) or low-resolution (VGA) cameras, and high-end or low-end FPGAs. It is also robust under
various conditions, such as daytime, night time, and on rainy nights, and is adaptable to various
countries’ speed limit traffic sign systems. The speed limit traffic sign candidates on each grayscale
video frame are detected through two simple computational stages using global luminosity and
local pixel direction. Pipeline implementation using results-sharing on overlap, application of a
RAM-based shift register, and optimization of scan window sizes result in a small but high-performance implementation. The proposed system matches the processing speed requirement for a
60 fps system. The speed limit traffic sign recognition system achieves better than 98% accuracy in
detection and recognition, even under difficult conditions such as rainy nights, and is
implementable on the low-end, low-cost Xilinx Zynq automotive Z7020 FPGA.
Keywords: Advanced driver assistance systems (ADAS), Speed limit traffic sign detection, Rectangle pattern
matching, Circle detection, FPGA implementation
1. Introduction
Speed limit traffic sign recognition is very important
for the fast-growing advanced driver assistance systems
(ADAS). Under continual pressure for greater road safety
from governments, traffic sign recognition and active
speed limitation become urgent issues for an ADAS.
Important traffic sign information is provided in the
driver’s field of vision via road signs, which are designed
to assist drivers in terms of destination navigation and
safety. Most important for a camera-based ADAS is to
improve the driver's safety and comfort. Traffic sign detection can be used to warn drivers about current traffic situations, dangerous crossings, and children's paths, as
shown in Fig. 1. Although navigation systems are available, they cannot be applied to new roads, to places the navigation signal cannot reach, or to electronic speed limit traffic signs that change depending on the traffic conditions. An assistance system with speed limit
recognition ability can inform drivers about the change in
speed limit, as well as notify them if they drive over the
speed limit. Hence, the driver’s cognitive tasks can be
reduced, and safe driving is supported.
Speed limit traffic (SLT) sign recognition systems face
several problems in real-life usage, as shown in Fig. 2.
(1) Color: the color of an SLT sign will change depending
on the light, weather conditions, and age of the sign.
(2) Sign construction: light-emitting diode (LED) SLT
signs appear different in color, shape, and luminosity
from printed signs, depending on the angle between the camera and the sign.
(3) Light conditions: the presence of the sun and of some
types of lights, in daytime and at night, makes the sign
look different.
(4) Font: fonts on traffic signs are decided by governments,
and they are different in various countries. The
difference can be seen in the thickness and shape of the
number, as shown in Figs. 2 (a), (b), (c), and (d).
(5) Distortion: the image of an SLT sign has distortion
along three axes (x, y, and z), which depend on the
angle between camera and sign.
(6) Accuracy: high accuracy in recognition rate is required,
which means that the same sign should be correctly
recognized at least once in a sequence scene.
(7) Real-time processing: the automotive system must be
able to process 15 to 30 fps under various platforms.
(8) Stability: under various platforms, the system must not
need retraining when users change devices, such as the
camera.
A lot of research on SLT sign recognition (SLTSR) for
the ADAS has been done, but those SLTSR algorithms
have difficulty recognizing when color has changed due to
light conditions, such as the presence of sunshine (Fig.
2(h)), illumination at night (Fig. 2(i)), LED signs (Fig.
2(e)), and in recognizing signs in different countries. They
also have difficulty with high-accuracy, real-time processing
using few computational resources on available low-price
devices.
In this study, we aim to solve these color and
environment problems with a non-color–based recognition
approach, in which grayscale images are used in both
speed limit sign candidate detection and number
recognition. SLT sign candidates are detected from each
input frame before recognizing the limit speed in real time.
Our system combines many simple and easily computed features of SLT signs, such as area luminosity, pixel direction, and block histogram, into a real-time, high-accuracy, and low-computational-cost design. Hence, it is
implementable on a low-cost and resource-limited
automotive-oriented field-programmable gate array
(FPGA). The target platform is the Xilinx Zynq 7020,
which has 85K logic cells (1.3 M application-specific
integrated circuit gates), 53.2K lookup tables (LUTs),
106.4K registers, and 506KB block random access
memory. Its price is about $15 per unit [24].
Related works and an overview of our approach to SLT
sign recognition are presented in Section 2. The available
SLT sign recognition system architectures and related
algorithms for SLT sign candidate detection are discussed
in Section 3. Section 4 describes how to combine simple
features of non-color–based SLT signs for a real-time
recognition system. Section 5 offers an overview of the
hardware implementation of the proposed architecture.
Discussions on accuracy, hardware size, and throughput of
the proposed algorithm for SLT sign detection are given in
Section 6. Section 7 concludes this paper.
Fig. 1. Challenges of the ADAS toward active safety (driving assistant system: traffic sign recognition, car in front, lane detection, pedestrian).

Fig. 2. The appearance differences of speed limit traffic signs in various conditions: (a) painted speed signs in Japan, (b) sign in Australia, (c) sign in Switzerland, (d) sign in Germany, (e) LED sign, (f) after years of aging, (g) distortion (Australia), (h) shined and not, (i) at night.

Fig. 3. Difference in the image size of the SLT sign at different distances, speeds, and lanes (angles) with a 640×360 pixel camera [18].

Fig. 4. A scene in a real situation in which the image of the SLT sign gradually increases in size (frame 4: 18 pixels, frame 6: 21, frame 8: 23, frame 10: 26, frame 12: 28, frame 16: 38).

2. Image Size and Scan Window Size Requirement for SLT Sign Detection

The right side of Fig. 3 shows the differences in the image size of an SLT sign at different distances and angles in real life. We define a scene as a sequence of all frames in which the same sign appears in, and disappears from, the observation field of the camera. In one scene, when a 640×360 pixel camera is located more than 30 meters in front of the sign, the 60 cm diameter SLT sign in Japan will appear as small as 10×10 pixels, as shown on the left of Fig. 3. In that case, the size of the number in the sign is
as small as 7×6 pixels, which is really hard to recognize,
even with the human eye. This size increases to 20×20
pixels when the camera gets closer to the sign at a distance
of 30 meters (with a 10×10-pixel number); then, the size
gradually increases to 50×50 pixels at a distance of 16
meters (on the highway) before disappearing from the
observation field of the camera. At sizes bigger than 20×20,
the appearance of the speed sign becomes recognizable, as
shown at the bottom of Fig. 3. When a 1920×1080-pixel
(full HD) camera, or a camera with higher resolution, is
used, the size of an SLT sign becomes bigger. Although
the image size of an SLT sign and the processing size are a
trade-off, the range of 20×20 to 50×50 is always available
in a scene. An example of a scene in a real situation is
shown in Fig. 4.
From that real situation, SLT sign detection should not
use up too many computational resources to recognize the
SLT sign at sizes smaller than 20×20 or bigger than 50×50.
Since the SLT sign appears in a range of sizes, the system
must process each input frame with scan windows in a
range of sizes for SLT sign detection and recognition at the
proposed distance. Our proposed SLT sign detection
algorithm is designed to detect an SLT sign in the range of
20×20 to 50×50 pixels, which will appear in the
observation field of cameras with resolution higher than
640×360.
If the vehicle moves at 200 km/h, a 60 fps camera can take about 15 frames for SLT sign detection during the 14 meter
distance between 30 meters away and 16 meters away
from the sign. If the SLT sign can be recognized from
those frames, a detection and recognition system will work
well, even if the vehicle moves at 200 km/h. Our system
aims for this goal.
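As a quick check of this frame budget, the following sketch reproduces the arithmetic with the values given in the text (200 km/h, a 60 fps camera, and the 14-meter range between 30 m and 16 m from the sign).

```python
# Frame budget for SLT sign detection (values taken from the text above).
speed_mps = 200 * 1000 / 3600            # 200 km/h -> ~55.6 m/s
distance_m = 30 - 16                     # useful detection range in meters
time_in_range_s = distance_m / speed_mps
frames_available = time_in_range_s * 60  # 60 fps camera
print(round(frames_available))           # ~15 frames
```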
3. Related Works
3.1 Software-Oriented Implementations
3.1.1 Neural Network–Based Sign Recognition
The multi-column deep neural networks and the multi-scale convolutional network were introduced by Ciresan et al. [21] and Sermanet and LeCun [22] in the Neural Networks contest. They achieved as high as 99% recognition accuracy, but required a huge training data set, as well as huge computational resources (four 512-core GPUs), for traffic sign recognition. Recognition time for a full HD image will be significantly increased to an unacceptable level for real-life usage. In addition, they use color features in their recognition, and so face accuracy problems when recognizing signs under unknown light conditions.

3.1.2 Machine Learning–Based Recognition
Machine learning was used by Zaklouta and
Stanciulescu [20] and Zaklouta et al. [23] for traffic sign
recognition. They used support vector machine (SVM)
with a histogram of oriented gradient (HOG) feature for
traffic sign detection, and tree classifiers (K-d tree or
random forest) to identify the content of the traffic signs.
They achieved accuracy of 90% with a processing rate of
10-28 fps. Again, they faced the color problem in their
implementation, and so it is difficult to apply to other
situations (night, rain, and so on).
3.1.3 Color-Based, Shape-Based, and
Template-Based Recognition
A general feature of traffic signs is color, which is
predetermined to ensure they get the driver’s attention.
Hence, color is used as a feature in a lot of image
segmentation research. Torresen et al. [16] detected the red
circle of a traffic sign by utilizing a red–white–black filter
before applying a detection algorithm. Miura et al. [7]
detected traffic sign candidate regions by focusing on the
white circular region within some thresholds. Zaklouta and
Stanciulescu [20] also used color information within a
threshold for traffic sign image segmentation. This
detection method required a color camera and more
computation resources, such as memory, for color image
storage and detection. Similar to the neural network–based
and machine learning–based recognition approaches,
color-based segmentation relies on the red color, and so,
has difficulty with recognition when the color of the traffic
sign has changed due to age and lighting conditions.
Another method for traffic sign candidate detection is
based on the shape of the signs. In this approach [5], a
feature where a rectangular structure yields gradients with
high magnitudes at its borders is used in traffic sign
candidate detection. Moutarde et al. [8] used edge
detection for rectangle detection and a Hough transform
for circle detection. This method is robust to changes in
illumination, but it requires complex computation, such as
the Hough transforms. Processing the transformation and
extracting matching peaks from big image are computationally complex for real-time processing systems.
Template matching [16] uses a prepared template for
area comparison with various traffic sign sizes. The
approach simply takes the specific color information of an
area, and compares it with a prepared template for matches.
Because the sizes of the traffic signs vary from 32×32 to 78×78 pixels, a lot of hardware resources and computation
time are required.
3.2 Hardware Oriented Implementations
3.2.1 Hardware/Software Co-design Implementations
Hardware/software co-design on a low-cost Xilinx
Zynq FPGA system was presented by Han [1], in which an
input color image of the traffic sign is processed by
software on a PC before sending it to the Zynq 7000
system on chip (SoC). The traffic sign candidates are
detected with hardware using color information before
refining and performing template-based matching on an
ARM core. This hardware processing requires a lot of
memory access, as well as software processing on both the
ARM core and PC, so latency is high and throughput is
low. Big templates of 80×80 pixels are required for
improvement of detection accuracy.
Muller et al. [9] applied software and hardware design
flows on a Xilinx Virtex-4 FPGA to implement a traffic
sign recognition application. It combines multiple embedded
LEON3 processors for preprocessing, shape detection,
segmentation, and extraction with hardware IPs for
classification in parallel to achieve latency not longer than
600 ms to process one frame. However, this latency is not
fast enough to apply to real-time detection and recognition.
Irmak [3] also utilized an embedded processor approach
with minimal hardware acceleration on a Xilinx Virtex 5
FPGA. Color segmentation, shape extraction, morphological
processing, and template matching are performed on a
Power PC processor with software, and edge detection is
performed on a dedicated hardware block.
Waite and Oruklu [17] used a Xilinx Virtex 5 FPGA
device in a hardware implementation. Hardware IPs are
used for hue calculation, morphological filtering, and
scanning and labeling. The MicroBlaze embedded core is
used for data communication, filtering, scaling, and traffic
sign matching.
3.2.2 Neural Network on an FPGA
A neural network implementation on a Xilinx Virtex-4
FPGA for traffic sign detection was presented by Souani et
al. [14]. The system works with low-resolution images
(640×480 pixels) with two predefined regions of interest
(ROIs) of 200×280 pixels. The small ROI results in high
recognition speed. However, the accuracy is as low as 82%,
and the traffic signs must be well lit. Hence, this system
has difficulty when applied at night or to back-lit signs.
3.3 A Proposed Approach for Robust
Automotive Environments
All the related speed limit traffic sign recognition
systems use color as an important feature for detection and
recognition in their implementations, and so they have
difficulty in recognizing traffic signs with long-term deterioration, traffic signs under different lighting conditions, and LED traffic signs. They also use processing with high
computational complexity, such as the Hough transform,
for the traffic sign shape detection, and tree classifiers,
SVM, and complex multi-layer neural networks for traffic
sign identification. So the implementation resources
become large, and the hardware costs are also high. The
processing time of the available implementations also
presents problems when applied in real life to low-end
vehicles. In addition, the recognition approaches using
neural networks and machine learning require learning
processes with a huge dataset. For each user with a
different platform, such as camera type, a specific dataset
must be prepared and a relearning process is necessary to
guarantee high accuracy.
Different from the above works, which rely on color
features and high complexity processing in traffic sign
recognition, our approach utilizes simple yet effective
speed limit traffic sign features from grayscale images.
The proposed approach utilizes multiple rough, simple,
and easily computable features in three-step processing to
achieve a robust speed limit traffic sign detection system.
Using simple features makes the proposed system
applicable to various platforms, such as the type of camera
(color or grayscale, high- or low-resolution), and in the
line-up of FPGAs (low-end, automotive, high-end). It is
also robust under various conditions in Japanese scenarios,
such as illumination conditions (daytime, night time, and
rainy nights), and types of road (local roads and highways).
The simple yet effective features enable it to easily
optimize a parameter set so as to meet features of the speed
limit traffic sign systems in other countries. The proposed
features include area luminosity difference, pixel direction,
and an area histogram. The computation of these simple
features only requires simple and low-cost hardware
resources such as adders, comparators, and first-in, first-out (FIFO) buffers. So, it is possible to implement on any
line-up of FPGAs. The proposed speed limit traffic sign
recognition system can also be extended to the recognition of other traffic signs.
4. Rough and Simple Feature Combination for Real-Time SLTSR Systems
4.1 Multiple Simple Features of SLT Signs
4.1.1 Shared Luminosity Feature Between
Rectangle and Circle
Fig. 5 shows the area luminosity feature of a scan
window with a circular speed limit traffic sign inside, in
which the circle line of the sign is much darker than the
adjacent white area in a grayscale image. It means that the
luminosity of the areas that contain the circle is much
lower than the adjacent ones. If the circle fits within the
scan window, the location of the dark and the white areas
can be predefined as B1 to B8 (say, for black) and W1 to
W8 (say, for white) in Fig. 5. The luminosity differences
between those corresponding black and white areas exceed
a predefined threshold. Depending on the view angle, the
brightness of the black and white areas is different, and so
a variable threshold is used in our detection algorithm.
This luminosity feature is simple but effective because it is
extendable to the detection of other shape types, such as a
hexagon. In addition, this feature is applicable to any image size, from VGA to 4K. It is also stable if a high-resolution image is down-sampled to a lower resolution. Hence, a down-sampled image can be used in detection, instead of the high-resolution image, to save computational resources without sacrificing accuracy.

Fig. 5. Shared luminosity feature between rectangle and circle.

4.1.2 Pixel Local Direction Feature of Circle

Fig. 6 shows a local feature of a circle in a binary image, in which pixels at the edge of the circle have different directions depending on their location. For a pre-determined size, those locations and directions are pre-determined. The direction of a pixel can be verified using simple 3×3 pixel patterns. For a circle, the total number of pixels that match each specific direction falls into a predefined range.

Fig. 6. Local pixel direction feature of a circle: the local direction of a pixel is detected by simple pattern matching with 3×3 templates for the inner-circle and outer-circle lines, and pixels at different locations of the circle match different directions.

4.1.3 Block Histogram Feature of Numbers

Fig. 7 shows examples of a histogram for binary images of the speed sign numbers "4", "5", "6", "8" and "0" in Japan along two axes. The locations of the maximum and minimum in the histogram, as well as the ratio between those rows/columns and the others, are different. The maximum and minimum in the histogram for each axis and the area (the total number of pixels) are the features of the numbers and can be used to recognize the speed sign number [18].

Fig. 7. Difference in histogram between numbers.
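To make the block-histogram feature concrete, the sketch below computes the row and column histograms of a binary digit image and reduces them to a small descriptor (positions of the maxima/minima and the pixel count). The exact descriptor used in [18] may differ; the helper name is hypothetical.

```python
import numpy as np

def block_histogram_feature(digit: np.ndarray):
    """digit: 2-D binary array (1 = dark pixel of the number).
    Returns a compact descriptor built from the two axis histograms."""
    row_hist = digit.sum(axis=1)   # dark pixels per row
    col_hist = digit.sum(axis=0)   # dark pixels per column
    return {
        "row_max_pos": int(np.argmax(row_hist)),
        "row_min_pos": int(np.argmin(row_hist)),
        "col_max_pos": int(np.argmax(col_hist)),
        "col_min_pos": int(np.argmin(col_hist)),
        "area": int(digit.sum()),  # total number of dark pixels
    }

# A crude 7x6 "0": a ring of dark pixels.
zero = np.zeros((7, 6), dtype=int)
zero[0, 1:5] = zero[-1, 1:5] = 1
zero[1:-1, 0] = zero[1:-1, -1] = 1
print(block_histogram_feature(zero))
```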
4.2 Multiple Simple Feature-Based Speed
Limit Sign Recognition System
4.2.1 System Overview
Fig. 8 shows the speed limit recognition system
overview. The input grayscale image is raster scanned with
a scan window (SW) in the rectangle pattern matching
(RPM) [26] module. It computes the luminosity of
rectangular and circular traffic signs, as shown in Fig. 5.
The luminosity differences of those areas are then
compared with a dynamic threshold to roughly determine
if the SW contains a rectangle/circle shape as a traffic sign
candidate with the same size. Since the RPM algorithm
utilizes common features of the circle and rectangle, it can
be used to recognize both circular and rectangular signs.
The sign enhancement filter developed by our group
includes a hardware-oriented convolution filter and image
binarization, and is applied to the 8-bit grayscale pixels of
traffic sign candidates, changing them to 1-bit black-and-white (binary) pixels for circle detection and speed number
recognition. The sign enhancement process helps increase
the features of the speed numbers and reduces the amount
of data being processed. Circle detection uses a local
direction feature at the circle’s edge to decide if the
detected rectangle/circle candidates are really a circle mark
or not, using a binary image. Direction of pixels in
different areas in the SW and patterns for pixel local
direction confirmation are shown in Fig. 6. Finally, the
speed number recognition (NR) module analyzes block
histogram features of the regions of interest and compares
them with the predefined features of speed numbers for the
NR module [18].
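The data flow of Fig. 8 can be summarized as the composition below. Every callable passed in is a hypothetical stand-in for the corresponding module (RPM, sign enhancement and binarization, circle detection, number recognition), not an actual implementation of it.

```python
import numpy as np

def recognize_speed_limit(frame_gray,
                          rectangle_pattern_matching,
                          enhance_and_binarize,
                          is_circle,
                          recognize_number):
    """Run the four stages of Fig. 8 on one grayscale frame.
    Each argument after frame_gray is a caller-supplied callable."""
    results = []
    # Stage 1: RPM yields candidate regions (x, y, window_size).
    for (x, y, size) in rectangle_pattern_matching(frame_gray):
        candidate = frame_gray[y:y + size, x:x + size]
        binary = enhance_and_binarize(candidate)      # 8-bit -> 1-bit pixels
        if not is_circle(binary):                     # local pixel direction check
            continue                                  # not a speed traffic sign
        speed = recognize_number(binary)              # block-histogram matching
        if speed is not None:
            results.append(((x, y, size), speed))
    return results

# Trivial smoke test with stub modules.
frame = np.zeros((360, 640), dtype=np.uint8)
print(recognize_speed_limit(frame,
                            rectangle_pattern_matching=lambda f: [(10, 10, 20)],
                            enhance_and_binarize=lambda c: (c > 128).astype(int),
                            is_circle=lambda b: True,
                            recognize_number=lambda b: 60))
```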
Fig. 8. Speed limit traffic sign recognition system based on multiple simple features: the input image passes through rectangle pattern matching (global luminosity difference), sign enhancement and binarization, circle detection on the binary image (local pixel directions), and number recognition (block-based histogram), yielding the speed limit or rejecting the candidate.
Fig. 9. Multi-scan window size raster scan in parallel by last-column pixel processing.
4.2.2 Parallel Raster Scanning of Multiple
Scan Window Sizes
Although the image size of the system is variable, and
our system is applicable to 4K, full HD, and VGA sizes, in
the following section, for simplification of explanation, we
assume the input image size is 640×360. The detectable
speed limit traffic sign is in a range from 20×20 to 50×50
pixels. We use a raster scanning method with the above
scan window sizes, as shown in Fig. 9, to keep the
processing time constant. At any clock cycle, when a new
pixel (x,y) gets into the system, a column of 50 pixels from
(x,y) will be read from FIFO to SWs for detection, as
shown on the right of Fig. 9. All those SW sizes (from
20×20 to 50×50 pixels) are processed in parallel in one
clock cycle to find all candidates at different sign sizes at
that position. In the circle detection module, the three last
continuous columns of the scan window are buffered in
registers for detection in the same manner.
4.2.3 Feature and Strategy for LED Speed
Limit Sign Detection
Fig. 10 shows the difference between painted speed
signs and LED type speed signs, in which the number in
the LED speed sign is brighter than that in the painted sign,
and the background of the sign is black in Japan. It makes
the number in the LED sign become off-white in the
grayscale image, and so the color of the circle line in the
grayscale image becomes white, while the adjacent area
becomes dark, as shown in the middle of Fig. 10. Consequently, the luminosity feature in Fig. 5 is easily extended and applied to detect an LED speed sign. The detection of black and white luminosity is inverted when applied to grayscale images of the LED speed sign, so that the features of painted and LED signs become uniform.

Fig. 10. Extendibility to LED sign recognition: the LED sign is converted to grayscale and the black/white luminosity relation is inverted.
5. Hardware Implementation
5.1 Algorithm Modification for Efficient
Hardware Implementation
Fig. 11 shows the data processing flow, which is
optimized for hardware implementation. The hardware
module, once implemented, will occupy hardware
resources, even if it is used or not. Hence, in our hardware
implementation, the sign enhancement and binarization
(SEB) module will work in parallel with the rectangle
pattern matching module to reduce the recognition latency
by reducing random memory access and applying the
raster scan. If a high-resolution camera is used,
preprocessing, which only performs the down-sampling of
the grayscale input image to one-third, is an option to
reduce the computational resources occupied by RPM. The
number recognition (NR) and the circle detection (CD)
modules process each binary image candidate of a speed
limit traffic sign in sequence, and the two modules can
process the speed sign candidates in parallel to share the
input. The circle detection result can be used to enhance
the decision of speed number recognition. The hardware
size is small, but processing time for NR is increased
depending on the number of speed sign candidates.
Another approach is making CD process the speed sign candidates in parallel with RPM and SEB. The results of CD are
used to reduce the number of speed limit traffic sign
candidates detected by RPM. It reduces the processing
time, but the penalty is an increase in hardware size. In our
prototype design, we will introduce a pipeline design for
the first approach.
Fig. 11. Modification for the hardware-oriented speed limit sign recognition system.

Fig. 12. Pipeline implementation of the speed limit sign recognition system (RPM stage and NR stage, connected through the location and SW flags FIFO and two 640×390 binary image memories).

The optional pre-processing module is used if a high-resolution and/or an interlaced camera is used. If an
interlaced scan is used, the preprocessing module deinterlaces by taking odd or even lines and columns of the
input image before sending data to other modules. If a
high-resolution camera such as a full HD camera is used,
the preprocessing module is used to down-sample the input
image to the appropriate image size, such as 640×360 for
RPM and CD. It helps to reduce the number of candidates
that need to be processed in NR. Since the luminosity and
the local pixel direction features used in RPM and CD are
not affected by down-sampling, down-sampling is enough
for hardware resource reduction without sacrificing
detection accuracy.
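A minimal software sketch of this optional pre-processing, assuming a numpy grayscale frame: de-interlacing by keeping every other line and column, and down-sampling to one-third by keeping every third row and column.

```python
import numpy as np

def preprocess(frame: np.ndarray, deinterlace: bool = False, downsample: int = 1) -> np.ndarray:
    """Optional pre-processing: de-interlace by keeping every other line/column,
    then down-sample by keeping every n-th row and column."""
    if deinterlace:
        frame = frame[::2, ::2]
    if downsample > 1:
        frame = frame[::downsample, ::downsample]
    return frame

full_hd = np.zeros((1080, 1920), dtype=np.uint8)
print(preprocess(full_hd, downsample=3).shape)  # (360, 640)
```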
5.2 Hardware-Oriented Pipeline
Architecture
Fig. 12 shows the two-stage pipeline architecture of the
speed limit recognition system. The system is able to scan
for traffic signs up to 50×50 pixels in size. The proposed
input image is 8-bit grayscale with a resolution of 640×360
= 230,400 pixels.
It contains two stages of RPM and NR with four main
modules of RPM, SEB, CD and NR. A judgement module
is included to decide which speed limit is shown in the
traffic sign. RPM is supported by a number of 8-bit FIFOs.
The SEB and RPM processing are independent, and
both of them work with grayscale images, and so they
could be processed in parallel in the first pipeline stage as
suggested by Fig. 11. Results from the CR module are
used to strengthen the judgment of the speed limit
recognition. CR and NR modules process binary images,
and so are executed in parallel for input data sharing
before the final judgment in the second pipeline stage.
Those 8-bit processing modules and 1-bit binary
processing modules are connected with the others through
two memories. The first one, the location and scan window
flags FIFO (LSW-FIFO), is used to store the position of
the sign candidates in the input frame and the detected
scan window sizes at that position. The second one, a
general memory called binary image memory (BIM) with
a size of 640×360 bits, is used to store the black-and-white
bit value of each frame. Two independent memories and a
memory-swapping mechanism are necessary to allow the
8-bit and 1-bit processing modules to access the binary image
memories to read and write in parallel.
Fig. 13. Pipeline stages of the speed limit sign recognition system: while the RPM stage (RPM, SEB, and writing to BIM 1 or BIM 2) processes one frame, the NR stage (CD, NR, and judgement) processes the previous frame.

Fig. 14. Local overlap in adjacent scan windows and global overlap between scan windows, with computational results reusable.

Fig. 13 shows the timing of the two pipeline stages of the proposed SLT sign detection and recognition system. RPM and SEB occur at the RPM stage in parallel using the
same pixel data input. An optional scaling-down module
can be applied to input data for RPM if necessary for high-resolution camera usage. The scan windows in RPM and
SEB are pipeline-processed with one input pixel in each
clock cycle. Hence, about 640×360=230,400 clock cycles
are necessary for the first stage. During the processing
time of the first frame, the detection result is written into
the result FIFO, and the binarization image result is written
into BIM 1 for the second stage. At the second stage, the
CD and the NR modules read data from the LSW-FIFO
and BIM 1 for processing before handing the result to the
judgment module. At the same time, the data of the second
frame is processed in the RPM stage. The result is written
into BIM 2. Then, the NR stage of the second frame occurs
with the previously written data inside BIM 2. The same
process occurs with other frames, and so the system
processes all frames in the pipeline.
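The frame-level pipelining with the two binary image memories can be mimicked in software as below; run_pipeline, rpm_seb_stage, and nr_stage are hypothetical stand-ins used only to show the ping-pong buffering, not the hardware itself.

```python
def run_pipeline(frames, rpm_seb_stage, nr_stage):
    """Two-stage frame pipeline: while frame n goes through the RPM stage
    (writing its binary image into one memory), frame n-1 goes through the
    NR stage reading the other memory. The memories are swapped each frame."""
    bim = [None, None]              # ping-pong binary image memories (BIM 1 / BIM 2)
    pending = None                  # (candidates, bim_index) waiting for the NR stage
    outputs = []
    for n, frame in enumerate(frames):
        write_idx = n % 2
        candidates, bim[write_idx] = rpm_seb_stage(frame)   # RPM stage of frame n
        if pending is not None:                             # NR stage of frame n-1
            prev_candidates, read_idx = pending
            outputs.append(nr_stage(prev_candidates, bim[read_idx]))
        pending = (candidates, write_idx)
    if pending is not None:                                 # drain the last frame
        prev_candidates, read_idx = pending
        outputs.append(nr_stage(prev_candidates, bim[read_idx]))
    return outputs

# Smoke test with stub stages.
print(run_pipeline([1, 2, 3],
                   rpm_seb_stage=lambda f: ([f], f * 10),
                   nr_stage=lambda cands, bim: (cands, bim)))
```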
5.3 Implementation of Area Luminosity Computation for RPM using Computational Result-Sharing on Overlap
Depending on the distance between the vehicle
(camera) and the SLT sign, the size of the sign on the input
image varies from 20×20 to 50×50 pixels. We need to scan
the input frame with various scan window sizes, as shown
in Fig. 9 for SLT sign detection at all distances. Inside
each scan window, the shared luminosity feature between
rectangle and circle in Fig. 5 is used. Two reusable
computations generated by local and global overlaps, as
shown in Fig. 14, are applied to the RPM implementation
to reduce the hardware size.
The first re-usable computational result concerns local
overlap between two adjacent scan windows, as shown in
the left side of Fig. 14. It is locally applied to the brightness computation of the same area in two adjacent scan windows (e.g., the B1 area) and is compatible with the raster scan method. The algorithm and the hardware implementation for pipelined area luminosity computation are shown in Fig. 15. The overlapped area between S_(t-1) and S_t, denoted S_store(t-1), is reused without computation (Fig. 15(a)). The luminosity of the preferred area S_t is generated as that of the overlapped area S_store(t-1) plus the luminosity of the new input area S_add(t). The luminosity of the overlapped area S_store(t-1) is computed from the luminosity of area S_(t-1), which was computed during the processing of the previous SW, by subtracting that of the subtraction area S_sub(t-1). Hence, the computation now consists only of computing the luminosity of the addition area S_add(t) and storing the result for later use. At the same time, the newly computed luminosity of the S_add(t) area is added to that of the S_(t-1) area before subtracting the previously stored luminosity of the S_sub(t-1) area, yielding the luminosity of area S_t. The hardware design of one scan window size for the brightness computation of each area B1~W8 is shown in Fig. 15(b).
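In software terms, the local-overlap recurrence S_t = S_store(t-1) + S_add(t), with S_store(t-1) = S_(t-1) - S_sub(t-1), is a sliding-window sum. The sketch below shows it for one area of one scan-window size, with a Python list standing in for the hardware FIFO; the function name and interface are illustrative only.

```python
def sliding_area_sums(pixel_columns, width):
    """pixel_columns: per-position luminosity sums of the single columns that
    make up one area (e.g., B1) as the scan window slides one pixel at a time.
    width: number of columns covered by that area.
    Yields S_t for every window position, reusing S_(t-1) instead of re-adding
    all 'width' columns at each step."""
    window_sum = 0
    stored = []                      # plays the role of the hardware FIFO
    for t, s_add in enumerate(pixel_columns):
        window_sum += s_add          # add the newly entering column (S_add)
        stored.append(s_add)
        if len(stored) > width:      # subtract the column leaving the window (S_sub)
            window_sum -= stored.pop(0)
        if t >= width - 1:
            yield window_sum         # S_t for a fully covered area

cols = [3, 1, 4, 1, 5, 9, 2, 6]
print(list(sliding_area_sums(cols, width=3)))  # [8, 6, 10, 15, 16, 17]
```

For LED signs, the same comparison can be done on the absolute value of the luminosity difference, as described later in this section.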
The second overlap is globally applied to the scan
windows inside a frame, as shown in the right side of Fig.
14. During the processing of scan window t (SWt), the
brightness of areas W3, W4, B3, and B4 are computed.
During SWt+n processing, these areas become B7, B8, W7
and W8, respectively. Hence, the brightness computation
results of those areas in SWt can be stored in FIFO for later
reuse in SWt+n. The upper right part of Fig. 16 shows how
to use FIFO to design rectangle pattern matching for a
single scan window size using global overlap.
Combination of the luminosity computation using local
and global overlaps results in simple and compact
hardware design of a single scan window, as shown in the
upper part of Fig. 16. The computation for other scan
window sizes at the same position can be done in parallel
with the same input pixels in a column, as shown in Fig. 9.
The final design with all the necessary scan window sizes
operating in parallel is shown in the lower part of Fig. 16.
Fig. 15. Pipeline implementation of area luminosity computation using computational result-sharing among local overlap: (a) the recursive computation S_t = S_store(t-1) + S_add(t), and (b) the hardware design of the area luminosity computation using local overlap.

Fig. 16. Implementation of RPM using global and local overlap and FIFO: one RPM block per scan window size (20×20 to 50×50) operates in parallel on the same input column.

Due to the change in features of the LED speed sign shown in Fig. 10, that is, the black areas become white and vice versa, the luminosity difference in the LED sign is also reversed. Instead of reversing the grayscale image for LED
sign detection, which requires a lot of computational
resources, inversion of the luminosity difference between
black and white areas is enough for LED speed sign
detection. It can be done by taking the absolute value of
the luminosity difference before making a comparison with
the threshold.
5.4 Local Pixel Direction Based Circle
Detection Implementations
5.4.1 Straight-Forward Implementation
Fig. 17 shows the mechanism and design of the circle
recognition module. A 3×3 pixel array is used to detect the
direction of the input pixel using local border templates in
Fig. 6. The direction is then compared with the expected
direction of that pixel. The number of matches is voted on
and stored in the register. The final number of matches is
compared with a previously decided threshold. If the
number of matches is in the predefined threshold range,
the input SW is considered a circle.
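A toy software version of this straight-forward check is sketched below: each set pixel's 3×3 neighbourhood is matched against a few direction templates, and the scan window is accepted when every direction collects a plausible number of votes. The templates and thresholds are illustrative assumptions, not the ones used in the actual design.

```python
import numpy as np

# Illustrative 3x3 templates: each marks the neighbour that should also be set
# when the centre pixel lies on a circle edge running in that direction.
TEMPLATES = {
    "right": np.array([[0, 0, 0], [0, 1, 1], [0, 0, 0]]),
    "down":  np.array([[0, 0, 0], [0, 1, 0], [0, 1, 0]]),
    "left":  np.array([[0, 0, 0], [1, 1, 0], [0, 0, 0]]),
    "up":    np.array([[0, 1, 0], [0, 1, 0], [0, 0, 0]]),
}

def direction_votes(binary_sw: np.ndarray):
    """Count, per direction, the set pixels whose 3x3 neighbourhood matches the template."""
    votes = {name: 0 for name in TEMPLATES}
    h, w = binary_sw.shape
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            if binary_sw[y, x] != 1:
                continue
            patch = binary_sw[y - 1:y + 2, x - 1:x + 2]
            for name, tpl in TEMPLATES.items():
                # All template pixels must be present in the patch.
                if np.all(patch[tpl == 1] == 1):
                    votes[name] += 1
    return votes

def looks_like_circle(binary_sw, lo=3, hi=500):
    """Accept the scan window when every direction receives a number of votes in range."""
    return all(lo <= v <= hi for v in direction_votes(binary_sw).values())

# Quick check on a drawn ring (expected: True).
yy, xx = np.mgrid[0:21, 0:21]
dist = np.hypot(yy - 10, xx - 10)
ring = ((dist > 7) & (dist < 9)).astype(int)
print(looks_like_circle(ring))
```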
5.4.2 Fast and Compact Implementation
The above direct circle detection design in Fig. 17 can
be improved for faster operation using pipeline and
computation reuse of local overlap, which are suitable for
raster scanning. Since the templates are 3×3 pixels, data of
the last three columns for an SW is buffered for direction
confirmation and voting. The pipeline computation
mechanism of matched pixel voting for each expected
direction in a scan window is shown in Fig. 18. When a
new pixel arrives, its corresponding column in the SW and
two previous ones are buffered, making a 3-pixel column,
as shown in the left side of Fig. 18. There is overlap in
3×3-pixel-patterns among different lines in the 3-pixel
column; hence, a 1×3-pixel pattern of each line is checked
separately before combining three adjacencies for the final
direction confirmation. In a column, the upper part needs
to be verified with three directions: down-right (↘), down
(↓), and down-left (↙); the middle part needs to be
verified with two directions: right (→), left (←); and the
lower part needs to be verified with three directions: up-right (↗), up (↑), and up-left (↖). Dividing the input
column into three parts (upper, middle, and lower) helps to
reduce the number of directions that need to be verified at
each location to one-third. The right part of Fig. 18 shows
the direction voting for one area among the eight, using
reusable computational results, in which the direction
voting results in the overlap in each area between SWi-3,
and SWi is reused without revoting. The voting for an area
in SWi is generated by the voting result of the overlapped
area plus the voting result of the new input area (dircol+i).
The voting result of the overlapped area is the result of the
previously voted area (dirareai-3) without the subtraction
area (dircol-i). The voting is now performed only for the addition area (dircol+i), and the result is stored for later use as the subtraction term (dircol-i). At the same time, the newly voted column result dircol+i is added to dirareai-3, and the previously stored dircol-i voting result is subtracted to obtain that area's (areai) final voting result (dirareai). As shown in Fig. 18, the waiting time from when the direction of a newly input pixel is verified and voted on until it becomes the addition term dircol+i and then the subtraction term dircol-i differs among directions. The hardware implementation of CD using voting for the local pixel direction
at the edge is shown in Fig. 19. When a new binary pixel
arrives in the system, the corresponding three columns (the
last columns in Fig. 18) are given to the direction voting
module. Pattern confirmation, a shared module among various SW sizes, compares each of the three inputs in a
line with the line patterns. Three adjacent results are then
combined in a column-direction confirmation submodule
before voting for the number of pixels that match a specific direction for each SW size.

Fig. 20. Daytime scenes on a local road and highway with sign distortion: recognizable with no difficulty.

These results are
pushed into column-direction voting FIFOs and become
dircol+ and dircol- at a specific time, which depends on the
pixel location and size of the scan window. The voting
results of all columns in the corresponding area are
accumulated, forming area voting result dirarea. Then,
dirarea is stored in the local area direction voting register.
All local area voting results are added together, making the
SWi directions voting result for CD.
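The computation-reuse idea behind dirarea_i = dirarea_(i-3) + dircol+_i − dircol−_i can be modeled in software as a sliding sum, as in the sketch below (a behavioral model under assumed data structures, not the RTL itself): the vote for the current area reuses the previous area's vote, adds the newly entered column votes, and subtracts the votes of the column group that left the area.

```python
from collections import deque

def area_votes(column_votes, area_width=3):
    """Sliding sum of per-column direction votes for one direction/area.

    column_votes : iterable of ints, the matched-pixel count of each newly
                   arrived column group (dircol+).
    area_width   : number of column groups spanned by the area (assumed).
    Yields the accumulated area vote (dirarea) once enough columns arrived.
    """
    fifo = deque()            # stores dircol+ values until they become dircol-
    area = 0                  # dirarea of the previous overlapping area
    for col_plus in column_votes:
        fifo.append(col_plus)
        area += col_plus      # add the newly entered column group
        if len(fifo) > area_width:
            area -= fifo.popleft()   # subtract the group leaving the area
        if len(fifo) == area_width:
            yield area

# Example: list(area_votes([2, 1, 3, 0, 4])) -> [6, 4, 7]
```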
6. Evaluation Results and Discussion
6.1 Datasets in Various Conditions
6.1.1 Dataset Taken in Japan
The Japan dataset is used to verify speed sign
recognition for moving images under various conditions,
as shown in Fig. 20 and Fig. 21. It includes 125 daytime
scenes on highways and local roads; 25 scenes on a clear
night, and 44 scenes on rainy nights on local roads, as
shown in Table 1. When a high-resolution camera (full
HD) is used, images at both original and down-sampled
sizes are tested. All of the frames are grayscale. Frames
with full HD resolution are down-sized to 1/3 on each axis
for simulation in our test. The algorithm is easily
implemented with few hardware resources by reading the
pixels after every three columns and rows.
Fig. 21. Adjacent frames of the same sign under a backlit condition: left, undetectable; right, detectable.

Fig. 22. Dataset (single frames) taken in Germany; sign sizes 55×55, 64×64, and 75×75 pixels.

Table 1. Datasets for accuracy evaluation with various lighting, weather, and camera conditions taken in Japan.

Condition             Camera             Resolution [pixels]       No. of Frames   No. of Scenes
Daytime 1 (Japan)     Grayscale Cam *1   640×390 (original)        41,120          60
Daytime 2 (Japan)     Video Cam *2       1920×1080 (original)      40,000          65
                                         640×360 (down-sampled)    40,000          65
Clear Night (Japan)   Video Cam *2       1920×1080 (original)      39,136          25
                                         640×360 (down-sampled)    39,136          25
Rainy Night (Japan)   Video Cam *2       1920×1080 (original)      66,880          44
                                         640×360 (down-sampled)    66,880          44
*1: 60 fps grayscale camera (640×390 pixels).
*2: 60 fps interlaced full HD color camera.
A scene is considered to be all the contiguous frames in which the same sign appears in the observable field of the camera
until it disappears, as defined in Section 2 and Fig. 4.
Depending on the speed of the car, the number of frames
in a scene varies. The size of the sign also varies and
gradually increases between adjacent frames as the vehicle
gets closer to the sign. Lighting conditions for the same
sign also change, depending on the distance and angle
between the sign and the camera, as shown in Fig. 21. The
dataset also includes signs with different recognition
difficulties, depending on the weather and light conditions.
6.1.2 Dataset Taken in Germany
The German Traffic Sign Detection Benchmark
(GTSDB) dataset [25], which includes 900 individual
traffic images of 1360×800 pixels, was originally used for sign detection contests with machine learning.
Since we aim for two-digit speed traffic sign recognition,
we created a sub-GTSDB dataset by taking all frames that
contain two-digit speed traffic signs. The sub-GTSDB has
255 frames at 1360×800 pixels, as shown in Figs. 22 and
23. “Scene” is not applicable to this dataset because it
contains individual frames only. Sizes of the speed limit
signs range from 14×16 pixels to 120×120 pixels.
Fig. 23. German frames at day and night times in
grayscale. Both are detectable and recognizable.
Table 2. Number of candidates and effectiveness of Circle Detection.

Conditions             No. of sign candidates   No. of speed sign candidates   Effectiveness of CD
                       Best    Avg.             Best    Avg.                   Best    Avg.
Daytime (640×390)      200     118              168     109                    32      9
Night time (640×360)   403     109              352     96                     51      13
Color images are converted to grayscale images before testing.
6.2 Simulation Results and Discussion
6.2.1 Number of Speed Sign Candidates
Detected in RPM and CD, and
Processing Time for NR
Table 2 shows the number of traffic sign candidates
detected by the RPM module and the number of speed
traffic sign candidates detected by the CD module in one
frame under various conditions. The full HD input image
is down-sampled to match the designed 640×360 pixel
resolution. On average, the number of candidates detected
by the RPM module is 118, and the number of candidates detected after CD is 96 at night time. CD removes up to 51 speed traffic sign candidates. This is not a big number, but it helps to remove complicated cases for NR, in which random noise in QR-code form and Japanese kanji are taken as speed sign candidates by RPM.
The number of SLT sign candidates can also be reduced by applying a region of interest, because the SLT sign appears in a known area of the input image, as shown in Fig. 24 (in Japan). We can concentrate on the SLT sign
candidates detected in this area only to reduce processing
time for NR.
Fig. 24. Appearance area of the SLT sign in Japan (non-traffic-sign area marked).

Fig. 25. Coverage of each scan window size. Red marks the SW sizes picked for the best hardware and accuracy trade-off.

The time needed to raster scan one frame is 640×360 = 230,400 clock cycles. In the worst-case scenario, when the speed limit traffic sign candidate is as big as 50×50 pixels, the NR module needs two clock cycles for
one line of data reading and processing [18], and so 100
clock cycles for number candidate processing. Hence,
during 230,400 clock cycles used in RPM, NR can process
2,304 speed sign candidates, which is more than the
number of speed sign candidates detected by RPM and CD.
The available time (230,400 clock cycles) is enough for all
SLT sign candidate recognition in the pipeline implementation in Fig. 12.
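The timing budget above can be checked with a few lines of arithmetic (values taken from the text; the 100-cycle per-candidate cost assumes the worst-case 50×50 candidate at two cycles per line):

```python
frame_w, frame_h = 640, 360
raster_cycles = frame_w * frame_h            # 230,400 clock cycles per frame
cycles_per_line = 2                          # NR reads/processes one line in 2 cycles [18]
worst_sw = 50
cycles_per_candidate = cycles_per_line * worst_sw   # 100 cycles for a 50x50 candidate

max_candidates = raster_cycles // cycles_per_candidate
print(max_candidates)                        # 2304 candidates fit in one frame time
```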
6.2.2 Optimization of SW Sizes and Scan Step
Fig. 25 shows the coverage of one scan window size over the others. A small SW size, such as 20×20 or 21×21 pixels, can neither cover other sizes nor be covered by them. However, the coverage of an SW size gradually increases: an SW size of 23×23 pixels can cover an SW size of 22×22 pixels, and a scan window size of 34×34 can cover the range from 32×32 to 41×41. The red lines show the coverable area for each SW size, blue points show the SW sizes that are covered by other SW sizes, and yellow points mark sizes for which no data is available in the dataset. This suggests that not all scan window sizes are necessary for implementation. Scan window sizes that add little detectable and recognizable area for the SLT sign can be removed. In our simulation, 14
scan window sizes (20, 21, 23, 24, 26, 28, 30, 32, 34, 36,
38, 42, 46, and 50) are enough for sign detection without
reduction in accuracy.
Fig. 26. Effectiveness of the variable scan step: (a) 112 speed sign candidates detected with scan step = 1; (b) 38 speed sign candidates detected with a variable scan step.
When the scan window moves one pixel, change in the
scan area is significant for small scan window sizes, such
as 20×20 (a 5% change, because one row or one column of
the SW is replaced), but this change is minor for big scan
window sizes, such as 50×50 (a 2% change). Hence, the scan step can be varied depending on the SW size. Making
the scan step variable reduces the number of candidates
that need to be processed in the CD and NR stages, as
shown in Fig. 26, in which the number of speed sign
candidates is reduced from 112 to 38 with no impact on
detection accuracy. The SLT sign recognition accuracy in Fig. 27 shows that when the SLT sign size is bigger than 23×23 pixels, the recognition rate reaches 100%, and so some SW sizes can be removed to save hardware resources with no reduction in accuracy.
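One simple way to realize such a variable scan step, assuming the goal is to keep the relative change of the scanned area roughly constant across SW sizes (the paper does not give its exact step table), is sketched below.

```python
def scan_step(sw_size, base_size=20, base_step=1):
    # A 1-pixel step changes ~5% of a 20x20 window but only ~2% of a 50x50
    # window, so larger windows can afford proportionally larger steps.
    return max(1, round(base_step * sw_size / base_size))

for sw in (20, 26, 34, 42, 50):
    print(sw, scan_step(sw))   # 20->1, 26->1, 34->2, 42->2, 50->2
```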
6.3 Hardware Implementation Results
and Processing Ability
Table 3 shows the hardware implementation resources
and frequency for RPM, CD and related modules in speed
limit traffic sign candidate detection. Two implementations
with different FIFO approaches are given.
The first implementation utilizes flip-flops available
inside slices of the Xilinx FPGA to build FIFO stacks and
shift registers that are required in Figs. 17 and 20. Due to
the small number of registers available inside each slice,
the implementation with all 31 SW sizes in the proposed
range (20×20 to 50×50) utilizes 68,552 slice LUTs (128%)
and cannot fit the target FPGA. Reducing the number of SW sizes to 14 (20, 21, 23, 24, 26, 28, 30, 32, 34, 36, 38, 42, 46, and 50) significantly reduces the required resources, making it implementable on the target FPGA at a frequency of 202.2 MHz. This maximum frequency guarantees that more than 60 full HD frames can be processed for sign detection in one second.

The second implementation utilizes the memory inside the LUTs of SLICEM slices to generate 32-bit shift registers without using the flip-flops available in a slice. Since the FIFO stacks and shift registers required in Figs. 17 and 20 are not too big, shift registers generated from the memory inside the LUTs are applicable. Because the amount of memory inside the LUTs in a slice is much larger than the number of flip-flops in a slice, this approach significantly reduces the required resources to 43,246 (81.3%) slice LUTs when all 31 SW sizes are implemented, and to 18,154 (34.1%) slice LUTs when 14 SW sizes are implemented. The fewer required computational resources also increase the maximum frequency of the detection module to 321.9 MHz, compared with 202.2 MHz achieved by the first implementation.
6.4 Comparison with Related Works
In terms of speed limit traffic sign candidate detection, our method achieves 100% accuracy.
Fig. 27. Speed sign recognition accuracy by SW size: (a) detection accuracy on a local road by SW size; (b) detection accuracy on a highway by SW size.

Table 4. Detection rate and throughput of the proposed system.

Dataset                 Input image                Scene positive recognition rate (%)   Speed
Daytime 1, Daytime 2    640×360 8-bit grayscale    100 (125/125 scenes)                  > 60 fps (Xilinx Zynq 7020)
Night, Rainy night      Full HD 8-bit grayscale    100 (69/69 scenes)                    > 60 fps (Xilinx Zynq 7020)
Table 3. Hardware resources and latency.

                              All SW sizes (20×20~50×50)                        14 SW sizes*
                              # slice reg.     # slice LUTs     Freq.(MHz)      # slice reg.     # slice LUTs     Freq.(MHz)
Reg.-based     RPM            19,763           34,189           202.2           8,289            14,394           202.2
FIFO           CD             26,669           34,363           390.3           11,061           14,437           390.3
               Total          46,432 (43.6%)   68,552 (128.9%)  202.2           19,350 (18.2%)   28,831 (54.2%)   202.2
RAM-based      RPM            13,951           28,391           321.9           5,704            11,808           321.9
shift reg.     CD             4,232            14,885           390.3           1,755            6,346            390.3
FIFO           Total          18,183 (17.1%)   43,246 (81.3%)   321.9           7,459 (7%)       18,154 (34.1%)   321.9
Controller (RPM+CD)           19               350              -               19               350              -

*14 SW sizes: 20, 21, 23, 24, 26, 28, 30, 32, 34, 36, 38, 42, 46, and 50.
Target device: Xilinx Zynq Automotive Z7020 FPGA (106,400 slice registers, 53,200 slice LUTs).
Table 5. Detection rate and throughput of related works.

Method                                     Recognition rate (%)   Thpt. (fps)   Platform              Color / grayscale
Proposed method                            98 (100%*)             > 60          Zynq*                 both
SIFT [15]                                  90.4                   0.7           Pentium D 3.2 GHz     Color
Hough [4]                                  91.4                   6.7           Pentium 4 2.8 GHz     Color
Random forest [20]                         97.2                   18~28         -                     Color
Neural network [14]                        82.0                   62.0          Virtex 4              Color
Fuzzy template [6]                         93.3                   66.0          Pentium 4 3.0 GHz     Color
Hardware/Software [17]                     -                      1.3           Virtex 5              Color
Multi-Core SoC [9]                         -                      2.3           Virtex 4              Color
Hardware/Software [1]                      -                      10.4          Zynq                  Color
Hardware/Software [3]                      90.0                   14.3          Virtex 5              Color
Multi-scale convolutional network [22]     98.6                   NA            -                     Color
Multi-column deep neural network [21]      99.5                   NA            GPU 512 core          Color

100%*: achieved 100% accuracy with contrast adjustment.
NA: not applicable due to traffic sign recognition only.
Zynq*: Xilinx Automotive Zynq 7020 FPGA.
Virtex: Xilinx Virtex FPGA.
Fig. 28. Extendibility to other sign shapes (original rectangle detection; directly applicable for shape detection; extendable traffic sign shapes).
When number recognition is included, the accuracy of the proposed speed limit traffic sign detection and recognition system is 98% if no contrast adjustment is applied to the rainy night scenes. With simple contrast adjustment, accuracy in speed limit sign detection and recognition increases to 100%, even under difficult conditions such as rainy nights, as shown in Table 4.
Table 5 shows the accuracy and throughput of related works. In comparison with other available research, our system achieves a higher recognition rate and higher throughput than the others. It also requires fewer hardware resources, and so is implementable on a low-cost, automotive-oriented Xilinx Zynq 7020 FPGA, whereas the others require a PC or high-end, high-cost FPGAs, such as the Xilinx Virtex 4 or Virtex 5, in their implementations.
6.5 Function Extendibility
In terms of application, the rectangle pattern-matching algorithm can be directly applied to the detection of signs other than rectangular ones, such as circular and octagonal signs, as shown in Fig. 28. These three shapes have the same global location and luminosity features, and so we can roughly recognize the ROI for those signs before classifying them into the correct shape based on their local features. The algorithm can also be extended to other shapes, such as the triangle and hexagon. Depending on the shape of the target sign border, we only need to change the locations of the black and white area computation.
7. Conclusion
This paper introduces our novel algorithm and implementation for speed limit traffic sign candidate detection using a combination of simple grayscale-image features of speed limit traffic signs (area luminosity, local pixel direction, and area histogram). By using grayscale images, our approach overcomes the training and color-dependence problems that reduce recognition accuracy in unlearned conditions or when color has changed, compared to other research. The proposed algorithm is robust under various conditions, such as during the day, at night, and on rainy nights, and is applicable to various platforms, such as color or grayscale cameras, high-resolution (4K) or low-resolution (VGA) cameras, and high-end or low-end FPGAs. The combination of coarse- and fine-grain pipeline architectures using result sharing on overlap, application of a RAM-based shift register, and optimization of scan window sizes provides a small but high-performance implementation. Our proposed system achieves better than 98% recognition accuracy even in difficult situations, such as rainy nights, is able to process more than 60 full HD fps, and is implementable on the low-cost Xilinx automotive-oriented Zynq 7020 FPGA. Therefore, it is applicable in real life. In the future, a full speed limit traffic sign recognition system will be implemented and verified on an FPGA. Extension to LED signs and other countries' speed limit sign recognition, as well as other traffic sign detection, will also be addressed.

Acknowledgement

Part of this work was supported by Grant-in-Aid for Scientific Research (C) and Research (B), JSPS KAKENHI Grant Numbers 2459102 and 26280015, respectively.

References
[1] Han, Y.: Real-time traffic sign recognition based on Zynq FPGA and ARM SoCs, Proc. 2014 IEEE Intl. Conf. Electro/Information Technology (EIT), pp. 373-376, 2014. Article (CrossRef Link)
[2] Hoang, A.T., Yamamoto, M., Koide, T.: Low cost hardware implementation for traffic sign detection system, Proc. 2014 IEEE Asia Pacific Conf. Circuits and Systems (APCCAS2014), pp. 363-366, 2014. Article (CrossRef Link)
[3] Irmak, M.: Real time traffic sign recognition system on FPGA, Master Thesis, The Graduate School of Natural and Applied Sciences of Middle East Technical University, 2010. Article (CrossRef Link)
[4] Ishizuka, Y., and Hirai, Y.: Recognition system of road traffic signs using opponent-color filter, Technical report of IEICE, No. 103, pp. 13-18, 2004 (in Japanese).
[5] Keller, C. G., et al.: Real-time recognition of U.S. speed signs, Proc. Intl. IEEE Intelligent Vehicles Symposium, pp. 518-523, 2008. Article (CrossRef Link)
[6] Liu, W., et al.: Real-time speed limit sign detection and recognition from image sequences, Proc. 2010 IEEE Intl. Conf. Artificial Intelligence and Computational Intelligence (AICI), pp. 262-267, 2010. Article (CrossRef Link)
[7] Miura, J., et al.: An active vision system for real-time traffic sign recognition, Proc. Intl. IEEE Conf. Intelligent Transportation Systems, pp. 52-57, 2000. Article (CrossRef Link)
[8] Moutarde, F., et al.: Robust on-vehicle real-time visual detection of American and European speed limit signs, with a modular traffic signs recognition system, Proc. Intl. IEEE Intelligent Vehicles Symposium, pp. 1122-1126, 2007. Article (CrossRef Link)
[9] Muller, M., et al.: Design of an automotive traffic sign recognition system targeting a multi-core SoC implementation, Proc. Design, Automation & Test in Europe Conference & Exhibition 2010, pp. 532-537, 2010. Article (CrossRef Link)
[10] Ozcelik, P. M., et al.: A template-based approach for real-time speed limit sign recognition on an embedded system using GPU computing, Proc. 32nd DAGM Conf. Pattern Recognition, pp. 162-171, 2010. Article (CrossRef Link)
[11] Raiyn, J., and Toledo, T.: Real-time road traffic anomaly detection, Journal of Transportation Technologies, No. 4, pp. 256-266, 2014. Article (CrossRef Link)
[12] Schewior, G., et al.: A hardware accelerated configurable ASIP architecture for embedded real-time video-based driver assistance applications, Proc. 2011 IEEE Intl. Conf. Embedded Computer Systems (SAMOS), pp. 18-21, 2011. Article (CrossRef Link)
[13] Soendoro, D., and Supriana, I.: Traffic sign recognition with color based method, shape-arc estimation and SVM, Proc. 2011 IEEE Intl. Conf. Electrical Engineering and Informatics (ICEEI), pp. 1-6, 2011. Article (CrossRef Link)
[14] Souani, C., Faiedh, H., and Besbes, K.: Efficient algorithm for automatic road sign recognition and its hardware implementation, Journal of Real-Time Image Processing, Vol. 9, Issue 1, pp. 79-93, 2014. Article (CrossRef Link)
[15] Takagi, M., and Fujiyoshi, H.: Road sign recognition using SIFT feature, Proc. 18th Symposium on Sensing via Image Information, LD2-06, 2007 (in Japanese).
[16] Torresen, J., et al.: Efficient recognition of speed limit signs, Proc. 7th Intl. IEEE Conf. Intelligent Transportation Systems, pp. 652-656, 2004. Article (CrossRef Link)
[17] Waite, S. and Oruklu, E.: FPGA-based traffic sign recognition for Advanced Driver Assistance Systems, Journal of Transportation Technologies, Vol. 3, No. 1, pp. 1-16, 2013. Article (CrossRef Link)
[18] Yamamoto, M., Hoang, A-T., Koide, T.: Speed traffic sign recognition algorithm for real-time driving assistant system, Proc. 18th Workshop on Synthesis And System Integration of Mixed Information Technologies (SASIMI 2013), pp. 195-200, 2013. Article (CrossRef Link)
[19] Zaklouta, F. and Stanciulescu, B.: Segmentation masks for real-time traffic sign recognition using weighted HOG-based trees, Proc. 14th Intl. IEEE Conf. Intelligent Transportation Systems (ITSC), pp. 1954-1959, 2011. Article (CrossRef Link)
[20] Zaklouta, F., Stanciulescu, B.: Real-time traffic sign recognition in three stages, Journal of Robotics and Autonomous Systems, Vol. 62, Issue 1, pp. 16-24, 2014. Article (CrossRef Link)
[21] Ciresan, D., et al.: Multi-column deep neural network for traffic sign classification, Journal of Neural Networks, Vol. 32, pp. 333-338, 2012. Article (CrossRef Link)
[22] Sermanet, P., LeCun, Y.: Traffic sign recognition with multi-scale convolutional networks, Proc. Intl. Joint IEEE Conf. on Neural Networks (IJCNN), pp. 2809-2813, 2011. Article (CrossRef Link)
[23] Zaklouta, F., Stanciulescu, B., and Hamdoun, O.: Traffic sign classification using K-d trees and random forests, Proc. Intl. Joint IEEE Conf. on Neural Networks (IJCNN), pp. 2151-2155, 2011. Article (CrossRef Link)
[24] Article (CrossRef Link), accessed March 15th, 2015.
[25] Article (CrossRef Link), accessed June 10th, 2015.
[26] Hoang, A-T., Yamamoto, M., Koide, T.: Simple yet effective two-stage speed traffic sign recognition for robust vehicle environments, Proc. 30th Intl. Technical Conference on Circuits/Systems, Computers and Communications (ITC-CSCC 2015), pp. 420-423, 2015.
Anh-Tuan Hoang was born in 1976.
He received his Ph.D. from Ritsumeikan
University in 2010. He was a postdoctoral researcher at Ritsumeikan University, Japan, from 2010 to 2012.
Since 2012, he has been a postdoctoral
researcher at the Research Institute for
Nanodevice and Bio System, Hiroshima University, Japan. His research interests include side-channel attacks and tamper-resistant logic design, and real-time image processing and recognition for vehicles and medical image type identification. He is a member of the IEEE and IEICE.
Tetsushi Koide (M’92) was born in
Wakayama, Japan, in 1967. He received
a BEng in Physical Electronics, an
MEng and a PhD in Systems Engineering from Hiroshima University in
1990, 1992, and 1998, respectively.
He was a Research Associate and an
Associate Professor in the Faculty of
Engineering at Hiroshima University from 1992-1999 and
in 1999, respectively. After 1999, he was with the VLSI
Design and Education Center (VDEC), The University of
Tokyo, as an Associate Professor. Since 2001, he has been
an Associate Professor in the Research Center for
Nanodevices and Systems, Hiroshima University. His
research interests include system design and architecture
issues for functional memory-based systems, real-time
image processing, healthcare medical image processing
systems, VLSI CAD/DA, genetic algorithms, and combinatorial optimization. Dr. Koide is a member of the
Institute of Electrical and Electronics Engineers, the
Association for Computing Machinery, the Institute of
Electronics, Information and Communication Engineers of
Japan, and the Information Processing Society of Japan.
Masaharu Yamamoto was born in
1990. He received a BEng in Electrical
Engineering from Hiroshima University, Hiroshima, Japan in 2013. He is
currently at TMEiC, Japan. Since 2012,
he has been involved in the research
and development of hardware-oriented
image processing for advanced driver
assistance systems (ADASs).
http://dx.doi.org/10.5573/IEIESPC.2015.4.4.251
IEIE Transactions on Smart Processing and Computing
Pair-Wise Serial ROIC for Uncooled Microbolometer Array
Syed Irtaza Haider1, Sohaib Majzoub2, Mohammed Alturaigi1 and Mohamed Abdel-Rahman1
1 College of Engineering, King Saud University, Riyadh, KSA {sirtaza, mturaigi, mabdelrahman}@ksu.edu.sa
2 Electrical and Computer Engineering, University of Sharjah, UAE sohaib.majzoub@ieee.org
* Corresponding Author: Syed Irtaza Haider
Received June 20, 2015; Revised July 15, 2015; Accepted August 24, 2015; Published August 31, 2015
* Short Paper
* Extended from a Conference: Preliminary results of this paper were presented at the ICEIC 2015. This present paper has
been accepted by the editorial board through the regular reviewing process that confirms the original contribution.
Abstract: This work presents modelling and simulation of a readout integrated circuit (ROIC)
design considering pair-wise serial configuration along with thermal modeling of an uncooled
microbolometer array. A fully differential approach is used at the input stage in order to reduce
fixed pattern noise due to the process variation and self-heating–related issues. Each pair of
microbolometers is pulse-biased such that they both fall under the same self-heating point along the
self-heating trend line. A ±10% process variation is considered. The proposed design is simulated
with a reference input image consisting of an array of 127x92 pixels. This configuration uses only
one unity gain differential amplifier along with a single 14-bit analog-to-digital converter in order
to minimize the dynamic range requirement of the ROIC.

Keywords: Microbolometer, Process variation, ROIC, Self-heating, Thermal imaging system
1. Introduction
Infrared uncooled thermal imagers have been employed in
a wide range of civilian and military applications, such as
smartphone cameras, industrial process monitoring, driver
night vision enhancement, and military surveillance [1-3].
Micro-electro-mechanical system (MEMS) microbolometer
thermal detectors are the most widely used pixel element
detectors in today's infrared uncooled thermal imaging
cameras.
Microbolometer sensor arrays are fabricated using
MEMS technology, but they suffer from process variation,
which introduces fixed pattern noise (FPN) in detector
arrays [4]. At the early stage of sensor fabrication, a
sensor’s resistance discrepancy of ±10% is expected [5]. A
microbolometer detector changes its resistance when
exposed to IR radiation due to its thermally sensitive layer.
If the target temperature differs from the ambient
temperature, i.e. ΔTscene, by 1K, it results in a temperature increase in the microbolometer membrane on the
order of 4mK [2].
These thermal detectors need to be electrically biased
during the readout in order to monitor the change in
resistance. Electrical biasing generates heat, which results
in self-heating of the microbolometer detector and causes a
change in resistance. Heat generated by self-heating cannot
be quickly dissipated through thermal conduction to the
substrate. This results in a change in temperature due to self-heating that is much higher than the change due to
incident radiation [6, 7].
FPN and self-heating result in major degradation and poor performance of the thermal imaging system, and hence impose a strict requirement on the ROIC for noise
compensation in order to detect the actual change due to
infrared radiation.
Readout topologies extensively discussed in the
literature are pixel-wise [8], column-wise [5] and serial
readout [9]. Pixel-wise readout improves noise performance
of the microbolometer by increasing the integration time
up to the frame rate [10]. Column-wise readout reduces the
number of amplifiers and integrators, which thus serves as
a good compromise between the silicon area and parallel
components [10, 11]. Serial readout architecture is read
pixel by pixel, and therefore, it requires only one amplifier
and integrator, resulting in low power consumption and a
compact layout [10].
This paper focuses on ROIC design considering the
impact of process variation and self-heating on performance.
Pair-wise, time-multiplexed, column-wise configuration is
used, in which one pair of microbolometers is selected at a
time. Readout is performed differentially during the pulse
duration. The focal plane array (FPA) consists of 127x92
normal microbolometers and one row of blind microbolometers to provide a reference for the ROIC. This paper
is organized as follows. Section 2 describes the literature
review. Section 3 covers thermal modelling of an uncooled
microbolometer. Section 4 explains the pair-wise serial
readout architecture in detail, along with an ROIC simulator. Finally, we conclude this paper in the last section.
2. Literature Review
This section summarizes the studies conducted on
different aspects of thermal imaging systems. Some of the
important figures of merit are discussed, which are helpful
in evaluating the performance of infrared detectors. Some
of the commonly used thermal sensing materials that
influence the sensitivity of microbolometers are discussed.
The focal plane array and the readout integrated circuit are
two major building blocks of a thermal imaging system.
The operation of a thermal imaging system starts with
the absorption of the incident infrared radiation by the
uncooled microbolometer detector array. Each microbolometer detector changes its resistance based on the absorbed
infrared radiation. It is important to establish criteria with
which different infrared thermal detectors are compared.
The most important figures of merit are the thermal time
constant (τ), noise equivalent power (NEP), responsivity
( ), noise equivalent temperature difference (NETD) and
detectivity (D*) [2, 10, 12-14]. The performance of a
microbolometer is influenced by the thermal sensing
material. There are three types of material that are suitable
for the bolometer: metal, semiconductor and superconductor. Metal and semiconductor microbolometers operate
at room temperature, whereas superconductor microbolometers require cryogenic coolers. Typical materials used
for microbolometers are titanium, vanadium oxide and
amorphous silicon.
Single input mode and differential input mode ROIC
designs are widely used at the reading stage of the ROIC in
order to detect the resistance value of a microbolometer
and to generate the voltage value. Differential input mode
assumes that the adjacent pixels are subjected to a small
radiation difference, and hence, results in a small
resistance change, causing a similar voltage change for
adjacent microbolometer cell resistances that can be read
and handled by a differential amplifier [15].
In addition, process variation creates resistance discrepancy among the sensors during the wafer process. A
differential input mode ROIC design attempts to cancel
these resistance differences among microbolometers using
the differentiation method. Thus, it suppresses the common
error and amplifies the differential signal. The conventional single input mode, on the other hand, is known to be
inefficient at compensating for fixed pattern noise since it
has low immunity to process variation. A comparison
between single input mode ROIC and differential input
mode ROIC to decrease the error due to process variation
was presented [16].
Pixel-parallel, serial and column-wise readout archi-
tectures have mostly been discussed in the past. Parallel
readout increases power consumption and the complexity
of the readout, whereas serial readout reduces the speed of
a thermal imager due to its time-multiplexed nature. Pixel-parallel readout, also known as frame-simultaneous readout,
is used for very-high-speed thermal imaging systems. Each
pixel in a cell array consists of detector, amplifier and
integrator. The pixel-wise readout architecture is suitable
for very-low-noise applications because it reduces Johnson
noise, one of the largest noise sources. The disadvantage of
the pixel-parallel architecture is the complex readout
resulting in a large pixel area and extensive power
dissipation. Conventional ROIC uses column-wise readout
because this architecture serves as a good compromise
between the speed and complexity of readout due to
parallel components. Finally, the last readout architecture
is serial readout. This approach uses only one amplifier
and one analog-to-digital converter (ADC) to perform the
readout due to the time-multiplexed nature of its readout.
Advantages with serial readout are compact layout and low
power consumption.
3. Thermal Modeling of a Microbolometer
Self-heating is an unavoidable phenomenon which
causes the temperature of a thermal detector to rise, even
though the bias duration is much smaller, compared to the
thermal time constant of the detector. The heat balance
equation of a microbolometer, including self-heating, can
be written as:
H \frac{d\Delta T}{dt} + G \Delta T = P_{BIAS} + P_{IR} \qquad (1)
where H is the thermal capacitance, G is the thermal
conductance of the microbolometer, PBIAS is the bias power,
and PIR is the infrared power absorbed by the
microbolometer detector. For metallic microbolometer
materials, resistance RB has linear dependence on
temperature, and can be expressed as:
R_B(t_{BIAS}) = R_0 \left( 1 + \alpha \Delta T(t_{BIAS}) \right) \qquad (2)
where α is the temperature coefficient of resistance (TCR)
of the detector, R0 is the nominal resistance, and ΔT(tBIAS)
is the temperature change due to self-heating during pulse
biasing, and is given by:
\Delta T(t_{BIAS}) = T - T_0 = \frac{P_{BIAS}(t_{BIAS}) \times t_{BIAS}}{G \cdot \tau} \qquad (3)
where τ is the thermal time constant. Under normal
conditions, tBIAS << τ and ∆T due to self-heating is
independent of thermal conductance. If IBIAS is the constant
bias current applied to the microbolometer during the
readout, self-heating power can be expressed as:
P_{BIAS}(t_{BIAS}) = I_{BIAS}^{2} R_B(t_{BIAS}) \qquad (4)

By solving the above equation using (2) and (3),

P_{BIAS}(t_{BIAS}) = \frac{I_{BIAS}^{2} R_0 H}{H - I_{BIAS}^{2} t_{BIAS} R_0 \alpha} \qquad (5)
When FPA is exposed to incident radiation, the
difference in radiant flux incoming to the microbolometer
can be estimated as
\Delta \phi_{IR} = \frac{A_b \, \Delta T_{scene}}{4F^2} \left( \frac{dP}{dT} \right)_{300K, \Delta\lambda} \qquad (6)
where ΔTscene is the difference in temperature between
target and ambient temperatures, and (dP/dT)300K,Δλ is the
change in power per unit area with respect to temperature
change radiated by a black body at an ambient temperature
in the wavelength interval 8μm-14μm. The temperature
change due to the absorbed infrared power is given by
\Delta T_{IR} = \frac{\Delta \phi_{IR}}{G} \qquad (7)
Table 1. Thermal parameters of the microbolometer.

Parameter                                                  Value
Nominal Resistance, R0 (kΩ)                                100
Pulse Duration, tBIAS (μs)                                 6
Bias Current, IBIAS (μA)                                   20
Ambient Temperature, T0 (K)                                300
Thermal Time Constant, τ (ms)                              11.7
Temperature Coefficient of Resistance, α (%/K)             -2.6
Thermal Conductance, G (W/K)                               3.7e-8
Thermal Capacitance, H (J/K)                               4.34e-10
Optics F/Number                                            1
Area of Microbolometer Pixel, Ab (m2)                      6.25e-10
Fill Factor of Microbolometer, β (%)                       62
Transmission of Infrared Optics, ΦΔλ (%)                   98
Absorption of Microbolometer Membrane, εΔλ (%)             92
Temperature Contrast, (dP/dT)300K,Δλ (W K-1 m-2)           2.624
FPA Size                                                   128 × 92
Frame Rate (frames per second)                             10
Resistance change of a microbolometer detector due to
change in scene temperature can be evaluated as
\Delta R_{IR} = R_0 \left( \alpha \cdot \beta \cdot \Phi_{\Delta\lambda} \cdot \varepsilon_{\Delta\lambda} \cdot \Delta T_{IR} \right) \qquad (8)
where β is the fill factor of the microbolometer, ΦΔλ is the
transmission of optics, and εΔλ is the absorption of
microbolometer membrane in an infrared region. When
both incident and bias power are zero, the temperature of
the microbolometer cools down based on the equation
below:
T_{COOL}(t) = T_B \, e^{-t/\tau} \qquad (9)
where TB is the temperature of the microbolometer at the
end of pulse biasing. Infrared system parameters
mentioned in Table 1 are taken from [5].
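As a rough numerical illustration of Eqs. (2), (3), (5), and (9) with the Table 1 parameters (a Python sketch for orientation only, not the paper's ROIC simulator):

```python
import math

# Table 1 parameters
R0     = 100e3        # nominal resistance (ohm)
t_bias = 6e-6         # pulse duration (s)
I_bias = 20e-6        # bias current (A)
tau    = 11.7e-3      # thermal time constant (s)
alpha  = -2.6e-2      # TCR (1/K)
G      = 3.7e-8       # thermal conductance (W/K)
H      = 4.34e-10     # thermal capacitance (J/K)

# Eq. (5): self-heating power during pulse biasing
P_bias = (I_bias**2 * R0 * H) / (H - I_bias**2 * t_bias * R0 * alpha)

# Eq. (3): temperature rise due to self-heating at the end of the pulse
dT_self = P_bias * t_bias / (G * tau)      # note G*tau ~= H

# Eq. (2): resistance at the end of the bias pulse
R_bias = R0 * (1 + alpha * dT_self)

# Eq. (9): cool-down after the pulse
def T_cool(t, T_B=dT_self):
    return T_B * math.exp(-t / tau)

print(P_bias, dT_self, R_bias, T_cool(500e-6))
```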
4. ROIC Simulator
Fig. 1 demonstrates the flowchart of the ROIC simulator.
Temperature mapping of a thermal image is performed at
the beginning. For each pixel, ΔTscene is calculated based
on the target temperature and ambient temperature. An
infrared radiation model evaluates the difference in
incoming flux ΔΦIR, the change in bolometer temperature
ΔTIR, and the resistance change in the bolometer due to the
absorbed incident power ΔRIR.
R_{TOTAL} = R_0 + R_{PV} + R_{IR} \qquad (10)

Fig. 1. ROIC Simulator Flowchart.
where RPV is resistance due to process variation and is a
±10% deviation from nominal resistance, and RIR is the
change in microbolometer resistance due to incident
radiation. Parameters mentioned in Table 1 and synchronization pulse sequence, as mentioned in Table 2, are
provided to the ROIC simulator. It evaluates self-heating
power, temperature drift and change in resistance due to
pulse biasing.
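To make the per-pixel flow in Fig. 1 concrete, here is a toy Python sketch of Eqs. (6)-(8) and (10) using the Table 1 values; the ±10% process-variation draw and the function name are illustrative assumptions, not part of the paper's simulator.

```python
import random

# Table 1 parameters used by the infrared model
Ab      = 6.25e-10    # pixel area (m^2)
F       = 1.0         # optics f-number
dP_dT   = 2.624       # temperature contrast (W K^-1 m^-2)
G       = 3.7e-8      # thermal conductance (W/K)
R0      = 100e3       # nominal resistance (ohm)
alpha   = -2.6e-2     # TCR (1/K)
beta    = 0.62        # fill factor
phi_opt = 0.98        # optics transmission
eps_mem = 0.92        # membrane absorption

def pixel_resistance(dT_scene):
    """Total pixel resistance for a scene/ambient temperature difference."""
    dPhi_ir = Ab * dT_scene * dP_dT / (4 * F**2)              # Eq. (6)
    dT_ir   = dPhi_ir / G                                     # Eq. (7)
    dR_ir   = R0 * alpha * beta * phi_opt * eps_mem * dT_ir   # Eq. (8)
    R_pv    = R0 * random.uniform(-0.10, 0.10)                # +/-10% process variation
    return R0 + R_pv + dR_ir                                  # Eq. (10)

print(pixel_resistance(dT_scene=1.0))
```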
A 14-bit ADC is used to convert the signal to a digital representation. The blind microbolometer exhibits the same thermal characteristics as a normal microbolometer but remains unaffected by incident radiation, and thus serves as a reference point. Each microbolometer, either blind or normal, is connected to a current source through a p-channel metal-oxide-semiconductor (PMOS) switch. The gate of the PMOS switch is controlled by the digital circuitry that generates the pulse sequence based on the serial readout topology. Absolute values of each pixel are then calculated against the reference blind microbolometer after all values are converted to the digital domain.

Fixed pattern noise correction is achieved by performing a complete readout under the dark condition. This measurement takes place when the focal plane array is not exposed to any incident power, and it is considered the reference value for each pixel. Later, every pixel reading is adjusted based on the pixel's reference point taken during the dark condition.

5. Pair-wise Serial Readout Architecture

The pair-wise serial readout architecture selects and biases one pair of microbolometers at a time, and the readout is performed differentially. A pair of microbolometers is biased twice within a frame period, and the reading is performed differentially. Once the readout of the selected pair is finished, the bias current source is switched to the next pair of microbolometers. The recently biased detector pair is left to cool off until the next pulse (for the next reading within the same frame period) is applied. A blind microbolometer is biased once in a column with the first normal microbolometer, and it exhibits the same temperature drift due to self-heating as the normal microbolometers, which can also be drastically minimized using the differential approach. The normal microbolometers are read twice, once with each adjacent neighbor, except for the last normal microbolometer. Table 2 shows the pulse sequence based on the pair-wise serial configuration, where X is the number of rows and Y is the number of columns.

Table 2. Pulse sequence for pair-wise serial ROIC architecture, where BD is microbolometer detector (i = 1, 2, 3, …, Y).

Pulse        Selected Microbolometer Detector (BD) Pair
1            Blind (1,i) and BD (2,i)
2            BD (3,i) and BD (4,i)
:            :
(X/2)        BD (X-1,i) and BD (X,i)
(X/2)+1      BD (2,i) and BD (3,i)
:            :
(X-1)        BD (X-2,i) and BD (X-1,i)

The serial architecture is time-multiplexed; thus, only one pair of microbolometers is read during a single pulse duration. The minimum time to perform the readout of a single frame of a 127×92 pixel focal plane array is therefore approximately 100 ms for a TBIAS of 6 μs, and approximately 150 ms for a TBIAS of 10 μs. This constraint limits the thermal imager to a maximum frame rate of 10 frames per second and 6 frames per second, respectively.

Consider a case for the serial readout architecture where TWAIT = 2 μs and TBIAS = 6 μs:
TCOLUMN = (TWAIT + TBIAS) × (X − 1)
where X is the number of rows; that is, X − 1 pulses are required to perform the readout of a complete column. The following are the calculations of the pulse times for the microbolometer in the tenth column, second row, i.e., microbolometer (2, 10):
TCOLUMN = (2 μs + 6 μs) × (128 − 1) = 1.016 ms
TFRAME = TCOLUMN × Y
TFRAME = 1.016 ms × 92 = 93.472 ms
For the first pulse for microbolometer (2, 10):
TPULSE1 START = TCOLUMN × 9 + TWAIT
TPULSE1 START = 9.146 ms
TPULSE1 END = TPULSE1 START + TBIAS
TPULSE1 END = 9.146 ms + 6 μs = 9.152 ms
Similarly, the time of the second pulse for microbolometer (2, 10) is calculated as follows:
TPULSE2 START = TPULSE1 END + ((X/2) − 1)TBIAS + (X/2)TWAIT
TPULSE2 START = 9.152 ms + 63 × TBIAS + 64 × TWAIT
TPULSE2 START = 9.658 ms
TPULSE2 END = TPULSE2 START + TBIAS
TPULSE2 END = 9.658 ms + 6 μs = 9.664 ms
Before the second pulse for the same microbolometer arrives, the cooling time of the microbolometer is evaluated as
TCOOL1 = TPULSE2 START − TPULSE1 END
TCOOL1 = 9.658 ms − 9.152 ms = 506 μs
After the second pulse, the cooling time of the microbolometer is evaluated as
TCOOL2 = TFRAME − TPULSE2 END
TCOOL2 ≈ 100 ms − 9.664 ms ≈ 90.3 ms
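The same pulse-timing bookkeeping can be scripted; the short Python sketch below (variable names are mine, not from the paper) reproduces the numbers worked out above:

```python
T_WAIT, T_BIAS = 2e-6, 6e-6
X, Y = 128, 92                                # rows (incl. blind row) and columns

T_COLUMN = (T_WAIT + T_BIAS) * (X - 1)        # 1.016 ms
T_FRAME  = T_COLUMN * Y                       # ~93.5 ms (the paper rounds to a 10 fps limit)

col = 10                                      # tenth column, second row -> pair (2, 10)
t_pulse1_start = T_COLUMN * (col - 1) + T_WAIT
t_pulse1_end   = t_pulse1_start + T_BIAS
t_pulse2_start = t_pulse1_end + (X // 2 - 1) * T_BIAS + (X // 2) * T_WAIT
t_pulse2_end   = t_pulse2_start + T_BIAS

t_cool1 = t_pulse2_start - t_pulse1_end       # ~506 us between the two pulses
print(T_COLUMN, T_FRAME, t_pulse1_start, t_pulse2_start, t_cool1)
```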
6. Simulation Results
Self-heating of a microbolometer causes the resistance
of the microbolometer to drop due to its negative TCR.
The resistance drop due to self-heating is of higher
magnitude than the resistance drop due to incident infrared
radiation, and thus imposes a strict requirement on the
dynamic range of the ROIC. In order to relax this
requirement, the microbolometers are pulse biased during
the readout time, and the voltage drop due to self-heating
is minimized.
Electrical biasing generates Joule heating and causes a
change in the resistance of the microbolometer detector.
Fig. 2. Microbolometer self-heating for tBIAS= 20μs and
tBIAS=10μs.
Even though the bias duration is much shorter than the
thermal time constant of the microbolometer detector, it
still results in a significant rise in the temperature of the
detector. This is due to the fact that applied bias power is
much higher, compared to infrared absorbed power. It
results in a much higher change in temperature due to self-heating, compared to incident radiation. Heat generated by
self-heating cannot be quickly dissipated through thermal
conduction to the substrate. Readout circuits are required
to have complex circuit design and a high dynamic range,
if self-heating not compensated for. Thus, self-heating
must be compensated for in order to improve the
performance of readout circuit, and eventually, the thermal
imaging system.
In this work, microbolometers are pulse biased with a
nominal value of bias current. If the bias current is too
high, or biased for a long duration, it will result in
excessive heating and permanent damage to the thermal
detectors. Similarly, if the bias current is too low, it will
result in low responsivity in the microbolometers. If the
microbolometers are biased for a long duration, it will
result in high temperature drift in the microbolometers due
to self-heating.
Fig. 2 demonstrates the effect of self-heating by
measuring resistance versus bias current for a pulse
duration of 20μs and 10μs. For a pulse duration of 20μs,
increasing the bias current from 1μA to 25μA results in a
decrease in the nominal resistance of the microbolometer
by approximately 7000Ω, which is equivalent to a
temperature rise of 2.5K. In order to minimize the impact
of self-heating, one way is to reduce the bias current to the
lowest practical value, as shown in Fig. 2.
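Plugging the 20 μs, 25 μA operating point of Fig. 2 into Eqs. (3) and (5) reproduces these magnitudes; the few lines below are only a sanity check of the model, not part of the original work.

```python
R0, alpha = 100e3, -2.6e-2       # ohm, 1/K (Table 1)
H = 4.34e-10                     # J/K
I_bias, t_bias = 25e-6, 20e-6    # operating point of Fig. 2

P  = (I_bias**2 * R0 * H) / (H - I_bias**2 * t_bias * R0 * alpha)  # Eq. (5)
dT = P * t_bias / H                                                # Eq. (3)
dR = abs(R0 * alpha * dT)                                          # resistance drop magnitude
print(round(dT, 2), round(dR))   # ~2.7 K and ~7000 ohm, consistent with Fig. 2
```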
The thermal time constant of the microbolometer,
along with time-multiplexed integration, limits the
maximum frame rate of the thermal imaging system. For
an FPA of 128x92 pixels, and by using the proposed pulse
sequence, each pair of microbolometers can be selected for
a pulse duration of 6μs with 10 frames per second. Fig. 3
shows the temperature variation of microbolometer pixel
(2, 1) due to self-heating when given two pulses, one at t =
2μs and the second at t = 82μs. From t = 8μs to t = 82μs
and from t = 88µs to t = tframe, both incident and bias
power are zero, and hence, the bolometer cools down, as
Fig. 3. Variation in microbolometer temperature with
and without IR.
Fig. 4. Voltage variation of pixel (1,1) and pixel (2,1),
and differential readout.
(a)
(b)
Fig. 5. (a) Input thermal image 127 × 92, and (b) ROIC
output.
per (9), where tframe is the total time for the readout.
Fig. 4 shows the voltage variation of reference microbolometer (1, 1) and normal microbolometers (2, 1), (3, 1) and (4, 1) during the readout. A few things can be concluded from the figure. Only one pair of microbolo-
during the readout period is different for each microbolometer, even when there is no input power. This is due
to the fact that during fabrication, microbolometers suffer
from process variation due to process immaturity. Individual
slopes of reference for microbolometer (1, 1) and normal
microbolometer (2, 1) are 5.348mV/μs and 4.2mV/μs,
respectively. Both the microbolometers have different
slopes, because they have different nominal resistances
before the readout.
Fig. 5(a) shows the input thermal image to an ROIC
simulator [17], which is 320 × 240 but cropped to 127 × 92
for simulation purposes. Fig. 5(b) shows the output image
of the simulator using the proposed architecture, mapping
a temperature change of about 12°C. Fixed pattern noise
can be seen in the output image.
7. Conclusion
The ROIC model presented in this paper uses a pulse
bias current scheme to reduce the effect of self-heating.
Simulation results show that the temperature drift due to
self-heating is compensated for by using differential
readout, but it is not completely eliminated due to the
consideration of a very high resistance discrepancy of
±10% due to process variation. A pulse sequence to each
pair of microbolometers is provided, such that they both
fall under the same self-heating point along the self-heating trend line, i.e., the pair picked is such that both are
biased the same number of times. The proposed architecture for the ROIC requires one differential amplifier and
one 14-bit ADC in order to reduce the dynamic range
requirement, power dissipation and area at the expense of a
longer readout time for a large focal plane array.
Acknowledgement
This work was supported by the NSTIP strategic
technologies program number 12-ELE2936-02 in the
Kingdom of Saudi Arabia.
References
[1] M. Perenzoni, D. Mosconi, and D. Stoppa, "A 160×
120-pixel uncooled IR-FPA readout integrated circuit
with on-chip non-uniformity compensation," in
Proceedings of ESSCIRC 2010, pp. 122-125, 2010. Article (CrossRef Link)
[2] B. F. Andresen, B. Mesgarzadeh, M. R. Sadeghifar, P.
Fredriksson, C. Jansson, F. Niklaus, A. Alvandpour,
G. F. Fulop, and P. R. Norton, "A low-noise readout
circuit in 0.35-μm CMOS for low-cost uncooled FPA
infrared network camera," Infrared Technology and
Applications XXXV, vol. 7298, pp. 72982F-72982F-8,
2009. Article (CrossRef Link)
[3] P. Neuzil and T. Mei, "A Method of Suppressing
Self-Heating Signal of Bolometers," IEEE Sensors
Journal, vol. 4, pp. 207-210, 2004. Article (CrossRef
Link)
[4] S. J. Hwang, H. H. Shin, and M. Y. Sung, "High
performance read-out IC design for IR image
sensor applications," Analog Integrated Circuits and
Signal Processing, vol. 64, pp. 147-152, 2009.
Article (CrossRef Link)
[5] D. Svärd, C. Jansson, and A. Alvandpour, "A readout
IC for an uncooled microbolometer infrared FPA
with on-chip self-heating compensation in 0.35 μm
CMOS," Analog Integrated Circuits and Signal
Processing, vol. 77, pp. 29-44, 2013. Article
(CrossRef Link)
[6] X. Gu, G. Karunasiri, J. Yu, G. Chen, U. Sridhar, and W. J. Zeng, "On-chip compensation of self-heating effects in microbolometer infrared detector arrays,"
Sensors and Actuators A: Physical, vol. 69, pp. 92-96,
1998. Article (CrossRef Link)
[7] P. J. Thomas, A. Savchenko, P. M. Sinclair, P.
Goldman, R. I. Hornsey, C. S. Hong, and T. D. Pope,
"Offset and gain compensation in an integrated bolometer array," 1999, pp. 826-836. Article (CrossRef
Link)
[8] C. H. Hwang, C. B. Kim, Y. S. Lee, and H. C. Lee,
"Pixelwise readout circuit with current mirroring
injection for microbolometer FPAs," Electronics
Letters, vol. 44, pp. 732-733, 2008. Article (CrossRef
Link)
[9] S. I. Haider, S. Majzoub, M. Alturaigi, and M. AbdelRahman, "Modeling and Simulation of a Pair-Wise
Serial ROIC for Uncooled Microbolometer Array," in
ICEIC 2015, Singapore, 2015.
[10] D. Jakonis, C. Svensson, and C. Jansson, "Readout
architectures for uncooled IR detector arrays,"
Sensors and Actuators A: Physical, vol. 84, pp. 220-229, 2000. Article (CrossRef Link)
[11] S. I. Haider, S. Majzoub, M. Alturaigi, and M. AbdelRahman, "Column-Wise ROIC Design for Uncooled
Microbolometer Array," presented at the International
Conference on Information and Communication
Technology Research (ICTRC), Abu-Dhabi, 2015.
Article (CrossRef Link)
[12] R. K. Bhan, R. S. Saxena, C. R. Jalwania, and S. K.
Lomash, "Uncooled Infrared Microbolometer Arrays
and their Characterisation Techniques," Defence
Science Journal, vol. Vol. 59, pp. 580-589, 2009.
[13] R. T. R. Kumar, B. Karunagaran, D. Mangalaraj, S. K.
Narayandass, P. Manoravi, M. Joseph, V. Gopal, R.
K. Madaria, and J. P. Singh, "Determination of
Thermal Parameters of Vanadium oxide Uncooled
Microbolometer Infrared Detector," International
Journal of Infrared and Millimeter Waves, vol. 24, pp.
327-334, 2003. Article (CrossRef Link)
[14] J.-C. Chiao, F. Niklaus, C. Vieider, H. Jakobsen, X.
Chen, Z. Zhou, and X. Li, "MEMS-based uncooled
infrared bolometer arrays: a review," MEMS/MOEMS
Technologies and Applications III, vol. 6836, pp.
68360D-68360D-15, 2007. Article (CrossRef Link)
[15] S. J. Hwang, A. Shin, H. H. Shin, and M. Y. Sung,
"A CMOS Readout IC Design for Uncooled Infrared
Bolometer Image Sensor Application," presented at
the IEEE ISIE, 2006. Article (CrossRef Link)
[16] S. J. Hwang, H. H. Shin, and M. Y. Sung, "A New
CMOS Read-out IC for Uncooled Microbolometer
Infrared Image Sensor," International Journal of
Infrared and Millimeter Waves, vol. 29, pp. 953-965,
2008. Article (CrossRef Link)
[17] SPI infrared. Available: Article (CrossRef Link)
Syed Irtaza Haider received his BE
degree in Electronics Engineering
from National University of Sciences
and Technology (NUST), Pakistan, in
2010 and MS degree in Electronics
Engineering
from
King
Saud
University (KSU), Saudi Arabia, in
2015. Currently, he is a researcher at
Embedded Computing and Signal Processing Lab
(ECASP) at the King Saud University. His research
interests include signal processing, mixed-signal design, and image processing. He is a student member of the IEEE.
Sohaib Majzoub completed his BE in
Electrical Engineering, Computer
Section at Beirut Arab University 2000,
Beirut Lebanon, and his ME degree
from American University of Beirut,
2003 Beirut Lebanon. Then he worked
for one year at the Processor
Architecture Lab at the Swiss Federal
Institute of Technology, Lausanne Switzerland. In 2010,
he finished his PhD working at the System-on-Chip
research Lab, University of British Columbia, Canada. He
worked for two years as assistant professor at American
University in Dubai, Dubai, UAE. He then worked for
three years starting in 2012 at King Saud University,
Riyadh, KSA, as a faculty in the electrical engineering
department. In September 2015, he joined the Electrical
and Computer Department at the University of Sharjah,
UAE. His research field is delay and power modeling,
analysis, and design at the system level. He is an IEEE
member.
Dr. Muhammad Turaigi is a professor of Electronics at
the EE department, College of Engineering King Saud
University. He got his PhD from the department of
Electrical Engineering Syracuse University, Syracuse NY,
USA in the year 1983. His PhD thesis is in the topic of
parallel processing. In 1980 he got his MSc from the same
school. His B.Sc. is from King Saud University (formerly
Riyadh University) Riyadh Saudi Arabia. In Nov. 1983 He
joined the department of Electrical Engineering, King
Saud University. He teaches the courses in digital and
analog electronic circuits. He has more than 40 published
papers in refereed journals and international conferences.
His research interest is in the field of parallel processing,
parallel computations, and electronic circuits and
instrumentation. He was the director of the University
Computer Center from July/1987 until Nov/1991. He was
a member of the University Scientific Council from
July/1999 until June/2003.
Dr. Mohamed Ramy is an Assistant Professor in the
Electrical Engineering Department, King Saud University,
Riyadh, Saudi Arabia. He was previously associated with
Prince Sultan Advanced Technologies Research Institute
(PSATRI) as the Director of the Electro-Optics Laboratory
(EOL). He has a PhD degree in Electrical Engineering
from UCF (University of Central Florida), Orlando, USA.
He has over 10 years of research and development
experience in infrared/millimeter wave detectors and focal
plane arrays. He has worked on the design and
development of infrared, millimeter wave sensors, focal
arrays and camera systems.
http://dx.doi.org/10.5573/IEIESPC.2015.4.4.258
IEIE Transactions on Smart Processing and Computing
Implementation of an LFM-FSK Transceiver for
Automotive Radar
HyunGi Yoo1, MyoungYeol Park2, YoungSu Kim2, SangChul Ahn2 and Franklin Bien1*
1 School of Electrical and Computer Engineering, Ulsan National Institute of Science and Technology, Banyeon-ri, Eonyang-eup, Ulju-gun, Ulsan, Korea {bien}@unist.ac.kr
2 Comotech corp. R&D center, 908-1 Apt-bldg, Hyomun dong, Buk-gu, Ulsan, Korea mypark@comotech.com
* Corresponding Author: Franklin Bien
Received June 20, 2015; Revised July 15, 2015; Accepted August 24, 2015; Published August 31, 2015
* Short Paper
Abstract: The first 77 GHz transceiver that applies a heterodyne structure–based linear frequency
modulation–frequency shift keying (LFM-FSK) front-end module (FEM) is presented. An LFM-FSK waveform generator is proposed for the transceiver design to avoid ghost target detection in a
multi-target environment. This FEM consists of three parts: a frequency synthesizer, a 77 GHz
up/down converter, and a baseband block. The purpose of the FEM is to make an appropriate beat
frequency, which will be the key to solving problems in the digital signal processor (DSP). This
paper mainly focuses on the most challenging tasks, including generating and conveying the correct
transmission waveform in the 77 GHz frequency band to the DSP. A synthesizer test confirmed that
the developed module for the signal generator of the LFM-FSK can produce an adequate
transmission signal. Additionally, a loop back test confirmed that the output frequency of this
module works well. This development will contribute to future progress in integrating a radar
module for multi-target detection. By using the LFM-FSK waveform method, this radar transceiver
is expected to provide multi-target detection, in contrast to the existing method.
Keywords: 77-GHz radar module, Front-end module (FEM), LFM-FSK, Multi-target detection, Millimeter wave
transceiver, Patch array antenna, RF, Homodyne structure
1. Introduction
An intelligent transportation system (ITS) is a representative fusion technology that brings vibrant change to
the car itself by using information technology, such as
smart cruise control and blind spot detection (BSD) [1-3].
Over the years, various attempts have been made to
develop and adapt techniques to provide convenience [4-6].
Considering the potential for treacherous driving conditions, automotive assistance systems require a reliable
tracking system. Thus, radar, which has been widely used
in the military and aviation fields, has emerged as an
alternative approach [7-9]. Today, car companies and
suppliers are already working to develop the next generation of long-range radar (LRR) at 77 GHz, which
will improve the maximum and minimum ranges, provide
a wider field of view, as well as improved range, angular
resolution and accuracy, self-alignment, and blockage
detection capability. The most commonly used LRR is the
frequency-modulated continuous wave (FMCW) radar
because of its high performance-to-cost (P/C) ratio
compared to other radar modulation methods. However,
this FMCW radar waveform has some serious limitations
in multiple target situations due to the technically
complicated association step. On the other hand, linear
frequency modulation–frequency shift keying (LFM-FSK),
which combines a frequency shift–keying method with a
linear frequency modulation waveform, is seen as a new
alternative in this situation because of its advanced
structure [10, 11]. This paper presents a transceiver
module for a 77 GHz complementary metal-oxide semiconductor long-range automotive radar module, which was
developed to provide accurate information for consumers,
even in multi-target situations, by using the LFM-FSK
radar method. Section 2 identifies the overall architecture
of this radar module, and Section 3 gives a system
description, which is followed by a presentation of the
measured results in Section 4. The conclusions are presented in Section 5, the final section of this paper.
2. Overall Architecture
2.1 LFM-FSK
In the automotive radar market, the classic radar
module has some limitations due to its structure. The most
common limitation is difficulty detecting multiple targets
in real traffic environments. There have been a variety of
solutions by developing an advanced module [12-14]. In
this paper, the LFM-FSK radar front-end module (FEM) is
proposed for a transceiver design to avoid ghost target
detection in a multi-target environment. The LFM-FSK
waveform is a new waveform designed for automotive
applications, based on continuous wave (CW) transmit
signals, which leads to an extremely short measurement
time. The basic idea is a combination of LFM and FSK
CW waveforms in an intertwined technique. Unambiguous
range and velocity measurement with high resolution and
accuracy can be achieved in this case, even in multi-target
situations.
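To make the intertwined structure concrete, the following minimal sketch (Python/NumPy) generates the frequency-versus-time sequence of two interleaved stepped ramps, A and B, offset by fshift. It is an illustration written for this description, not the authors' implementation; the step count, fstep, fshift, and chirp time are assumed values chosen only so that the resulting sweep roughly matches the 76.5 to 76.6 GHz range mentioned in Section 2.2.

```python
# Minimal sketch of an intertwined LFM-FSK frequency sequence.
# All numeric parameters below are illustrative assumptions.
import numpy as np

def lfm_fsk_frequency_sequence(n_steps=256, f_start=76.5e9,
                               f_step=390.6e3, f_shift=150e3,
                               chirp_time=6e-3):
    """Return (time, frequency) samples of the intertwined LFM-FSK ramp."""
    dwell = chirp_time / (2 * n_steps)           # dwell time per frequency step
    ramp_a = f_start + np.arange(n_steps) * f_step   # stepped LFM sequence A
    ramp_b = ramp_a + f_shift                        # sequence B, offset by f_shift
    freq = np.empty(2 * n_steps)
    freq[0::2] = ramp_a                          # A and B steps are transmitted alternately
    freq[1::2] = ramp_b
    t = np.arange(2 * n_steps) * dwell
    return t, freq

t, f = lfm_fsk_frequency_sequence()
print(f"sweep: {f.min()/1e9:.3f} - {f.max()/1e9:.3f} GHz over {t[-1]*1e3:.1f} ms")
```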
Fig. 1. 77 GHz LFM-FSK radar module.
2.2 Overall Architecture
The main purpose of the radar module is gathering and
supplying correct results to a user. As demonstrated in Fig.
1, the radar module commonly consists of three parts: an
FEM, an antenna, and a digital signal processing (DSP)
module. Among these, constructing an efficient radar FEM
with the correct waveform generator was commonly
considered to be the most challenging task. This is an
FMCW signal waveform to which the classic frequency
shift keying (FSK) modulation method has been applied.
This module consists of the radar FEM, microstrip patch
array antenna, up/down converter, baseband block, and
DSP module. Every part is covered in this paper, except
the DSP and antenna. To implement the LFM-FSK radar
FEM, a heterodyne structure was considered, which is one
of the most stable ways to generate a signal waveform
using radar technology. In this structure, the signal was
generated by the frequency synthesizer block through the
phase locked loop (PLL) block with a crystal oscillator.
The 77 GHz up/down converter down-converted the
received signal, which was up-converted from the
baseband block to 77 GHz in the transmission module
before reaching the antenna array with an up-and-down
mixer. The baseband block located after the converter
handles the signal filtering process to efficiently supply
information to the next module, the DSP. The hybrid
coupler in the up/down converter is employed to supply
this beat frequency by simultaneously distributing the
generated signal to the transmit antenna and baseband
block. In the LFM-FSK radar waveform method, the beat
frequency needs to be calculated using special equations.
As demonstrated in Fig. 2, there are key concepts that
allow this module to be implemented successfully. First,
the frequency range of the synthesizer is 76.5 to 76.6 GHz,
with enough chirp time and delay time. Chirp time is the
overall elapsed time in one period.
Fig. 2. LFM-FSK waveform: (a) with fshift/fstep; (b) with chirp/delay time.
The delay time mentioned here is the float time during which the DSP
implements the algorithm, which gives the information to
the user. This module also has the goal of generating a
signal with enough fstep (the difference value between two
transmitting signals). A crystal oscillator was employed in
the synthesizer block to generate the signal. The receiver
translates the channel of interest directly from 77 GHz to
the baseband block in a single stage. This structure needs
less hardware compared to a heterodyne structure, which is
currently the most widely used structure in wireless
transceivers.
Fig. 3. Block diagram of the synthesizer.
Fig. 4. Synthesizer of 77 GHz FEM.
In this structure, the integrated circuit's low
noise amplifier (LNA) does not need to match 50 Ω
because there is no image reject filter between the LNA
and mixer. Another advantage of this structure is the
amplification at the baseband block, which results in
power savings.
3. System Description
3.1 Front-End Module (FEM)
The implemented radar module includes the synthesizer,
77 GHz up/down converter, and baseband block. Fig. 3
shows the implemented block synthesizer. The PLL in the
radar FEM generates a baseband signal at 3 GHz. The 77
GHz up/down converter processes the generated signal
passed from the frequency synthesizer block. In the
transmit stage, this block converts the signal to the 77 GHz
frequency band to radiate the RF signal, while the reflected
signal from the target is down-converted in the receiver
stage.
3.2 Synthesizer Block
As demonstrated in Fig. 3, this block is divided into the
synthesizer, which generates the LFM-FSK waveform, and
the PLL, which generates the local oscillator (LO)
frequency, along with the power divider (PDV), which
sends the phase information to the voltage-controlled
oscillator (VCO) to generate the correct signal in the PLL.
This module requires the signal to have an LFM-FSK
shape for multi-target detection. In the synthesizer, the
VCO generates a 3 GHz signal. The PLL creates a
specialized shape waveform according to a command from
the PIC block, which generates the programming code
from the J1 outside the synthesizer block. The lock
detector controls the generated signal through J2, which
verifies the operation of the PLL.
The conventional PLL is a chip that is used for the
purpose of locking a fixed frequency. On the other hand,
this module uses a fractional synthesizer that locks the
Fig. 5. Block diagram of a 77 GHz up/down converter.
PLL by controlling the VCO. This module includes the
VCO block in the synthesizer block to generate a stable
waveform, while the primary design concept is ensuring
immunity to noise when each frequency step is changed.
Therefore, this synthesizer generates a variety of
modulated waveforms, such as single, sawtooth, and
triangular ramp. In particular, this structure can generate a
waveform that has regular intervals and time delay. By
using this, we can make a specialized ramp that is similar
to the LFM-FSK waveform in transmission output. Here,
the output frequency of the LO is 3 GHz with the VCO
output frequency shift. Thus, we can generate a very
similar LFM-FSK signal in the synthesizer block by using
this PLL. The PDV element controls the VCO by
comparing the phases of the two signals, which are the
input signal and reference signal in the PLL. To operate
this synthesizer block, this module uses a 5 V/2000 mA
bias source, which comes from outside the module. Fig. 4
shows the implemented synthesizer of a 77 GHz FEM.
This block can minimize the interference from another
block’s signal due to the metal wall.
3.3 77 GHz Up/down Converter Block &
Baseband Block
As demonstrated in Fig. 5, the up/down converter
consists of a transmitter and receiver pair, and the signal is
raised to 77 GHz through the LO. After mixing with the
carrier frequency in the receiver, there is a phase
difference that leads to an ambiguous measurement for
distance and relative speed.
Fig. 6. Block diagram of the baseband block.
The synthesizer signal is
conveyed for transmission through the amplifier, a band
pass filter (BPF), and an up-mixer. The LO signal is
applied where the 3 GHz frequency band can generate a 70
GHz signal through the 4 GHz and 6 GHz frequency bands.
The RF signal that is sent through the up-mixer is elevated
to 76.5 GHz using a drive amplifier (DRA). The
transmitter outputs this signal through the antenna. The
received signal is amplified by the LNA and converted to
an intermediate frequency (IF) signal through the down-mixer. This signal is dropped to a baseband signal through
the I/Q mixer. To split the generated signal and make the
beat frequency, which represents the phase difference
between the transmitted and received signals, the mixer
employs a hybrid coupler. This module uses a mixer for
the high-frequency E-band. In this system, the LO is used
to generate a signal to convert the signal of interest to a
different frequency. The receiver converts the received
signal frequency to the IF block through the mixer.
Fig. 6 shows the implemented block of the baseband
block. As shown in this figure, the baseband block down-converts the high-frequency signal to the baseband
frequency for accurate signal processing. It needs an
automatic gain control (AGC) block because of the level
difference between the received signals. It can be
substituted for an op-amp, which can regulate the signal
output level by controlling the gain slope of a high pass
filter (HPF). By doing this, it can reduce the level dynamic
range of the received signal according to the target
distance. This module uses an op-amp that was used to
implement the filter and an AGC function to reduce the
output change in accordance with the power of the
received signal and generate a constant output level. The
module uses a band pass filter that consists of an HPF and
a low-pass filter (LPF) pair, to reduce the noise from the
adjacent frequency band. To maintain constant output
power for the signal, this module employs the AGC in the
system instead of a general amplifier. If the target is far
from the transceiver, there might be low signal power
compared to when the target is close to the system. The
gain of the processing amplifier can be adjusted by using
the AGC elements to maintain a constant signal strength
and send the correct information to the DSP through the J3
and J4 ports. If the mixed received signal is digitized and
subjected to Fourier transformation within a single period
in the DSP, the ambiguities for distance and speed can be
resolved by combining the measurement results in
accordance with the special equations. The purpose of the
radar FEM is to generate an appropriate beat frequency,
which will be the key to solving the problem in the DSP.
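The first step of that processing chain can be illustrated with a short sketch: the digitized beat signal of one period is windowed, transformed with an FFT, and the dominant spectral peak is taken as the beat frequency. This is an illustrative example with assumed sampling parameters, not the authors' DSP code, and the subsequent LFM-FSK range/velocity equations referenced in the text are not reproduced here.

```python
# Minimal sketch: locate the beat frequency of one digitized period with an FFT.
import numpy as np

def beat_frequency(samples, sample_rate):
    """Return the dominant beat frequency (Hz) of one digitized period."""
    windowed = samples * np.hanning(len(samples))   # reduce spectral leakage
    spectrum = np.abs(np.fft.rfft(windowed))
    spectrum[0] = 0.0                               # ignore the DC bin
    peak_bin = int(np.argmax(spectrum))
    return peak_bin * sample_rate / len(samples)

# Synthetic check: a 40 kHz beat tone sampled at 1 MHz for one 6 ms period.
fs, true_fb = 1.0e6, 40.0e3
t = np.arange(int(fs * 6e-3)) / fs
print(beat_frequency(np.cos(2 * np.pi * true_fb * t), fs))   # ~40 kHz
```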
Fig. 7. 77 GHz up/down converter & baseband block.
Fig. 7 shows the implemented 77 GHz up/down converter
and baseband block.
4. Measurement Results
4.1 Block Measurement Result
4.1.1 Synthesizer Test
The most important part in this LFM-FSK radar
module implementation is the synthesizer. As shown in Fig.
8, the synthesizer was measured by using this PCB. Before
frequency multiplication, the test frequency was 3 GHz in
the synthesizer. In the synthesizer test, the Vctrl signal
waveform controlled the VCO. The LFM-FSK waveform
was generated correctly, which shows the output signal of
the synthesizer. Fig. 9 represents a transmitted signal’s
frequency range of 76.45 to 76.55 GHz. According to the
spectrum analyzer measurement in Fig. 10, the bandwidth of
the transmitter is 100 MHz. Finally, according to Fig. 11,
the chirp time was around 6 ms, which means the
measurement resolution time of this block is fast enough.
4.1.2 Loop Back Test
Fig. 12 shows the conducted loop back test for
simulation. The beat frequency is the difference between
the transmitted and reflected echo signals. The cable used
in the transmission line represented a time delay in the
actual driving environment. By changing the cable length,
the reflected signal was measured with a variety of time
delays. Each cable length represented a different driving
situation; a cable length of 35 m represented a target that
was far away from the observer, compared to the 1.4 m
length. The loop back test created an artificial delay to
check the beat frequency using cable length variation. An
attenuator (ATTN) was employed to reduce the tested
output power to create conditions similar to actual
situations. Fig. 13(a) shows the results of the loop back
test when using a 1.4 m cable length. Unlike this result,
which assumes that the distance between the target and
observer is small, Fig. 13(b) shows that the beat frequency
was lower with a longer cable length. This is an important
finding: the shorter the distance between the target and the
observer, the more likely it is that the module will hand
over the beat frequency. Based on this simulation, we can
conclude that the implemented radar FEM can operate well
with the correct signal information, even in a real traffic
environment. A long-length cable represented a target that
was far away from the transceiver module, and a short-
length cable represented the opposite situation.
Fig. 8. Synthesizer test: (a) test block diagram; (b) synthesizer test.
Fig. 9. Vctrl signal waveform of the synthesizer.
Fig. 10. RF sweep of the synthesizer.
Fig. 11. Chirp time of the synthesizer.
Fig. 12. 77 GHz LFM-FSK radar module.
4.2 System Measurement Result
This paper mainly focuses on the most challenging
tasks: generating and conveying the correct transmission
waveform in the 77 GHz frequency band to the DSP. The
77 GHz radar FEM was designed using the LFM-FSK
method, unlike the conventional FMCW radar. This
implementation emphasizes generating the appropriate
waveform at a high frequency. The synthesizer test of the
developed module confirmed that the LFM-FSK signal
generator could produce an adequate signal to transmit.
Additionally, the loop back test confirmed that the output
frequency of this module works well. Using these methods,
the performance of the radar module could be verified in a
simulation. The measurement results for the LFM-FSK
radar transceiver are summarized in Table 1.
Fig. 13. Loop back test result: (a) close situation, 1.4 m cable; (b) far away situation, 35 m cable.
Table 1. Performance of the LFM-FSK radar transceiver.
Module        Parameter          Value
Transmitter   Tx (RF)            76.5 ± 0.05 GHz
              Tx (IF)            2.97 ± 0.05 GHz
              Bandwidth          100 MHz
              Output             +10 dBm
              LO                 73.53 GHz
Receiver      Dynamic range      -23 to -112 dBm
              Conversion gain    88 dB
              Rx input P1dB      -22 dBm
              Noise figure       10 dB
5. Conclusion
This paper presented the first 77 GHz transceiver that
applies an LFM-FSK FEM with a frequency synthesizer
block, an up/down converter block and a baseband block.
The performance of the implemented module was
experimentally evaluated twice using a synthesizer and a
loop back test. The results of the experiment demonstrated
the advantages of the proposed system. Using this
implementation for an automotive radar module will
promote its commercialization for multi-target detection.
References
[1] C.T. Chen, Y.S. Chen: 'Real-time approaching vehicle detection in blind-spot area', Proceedings of the 12th International IEEE Conference on Intelligent Transportation Systems, 2009. Article (CrossRef Link)
[2] H. Sawant, Jindong Tan, Qingyan Yang, Qizhi Wang: 'Using Bluetooth and sensor networks for intelligent transportation systems', Proceedings of the 7th International IEEE Conference on Intelligent Transportation Systems, 2004, pp. 767-772. Article (CrossRef Link)
[3] P. Papadimitratos, A. La Fortelle, K. Evenssen, R. Brignolo, S. Cosenza: 'Vehicular communication systems: Enabling technologies, applications, and future outlook on intelligent transportation', IEEE Communications Magazine, 2009, 47, (11), pp. 84-95. Article (CrossRef Link)
[4] Kan Zheng, Fei Liu, Qiang Zheng, Wei Xiang, Wenbo Wang: 'A Graph-Based Cooperative Scheduling Scheme for Vehicular Networks', IEEE Transactions on Vehicular Technology, 2013, 62, (4), pp. 1450-1458.
[5] N. Alam, A.G. Dempster: 'Cooperative Positioning for Vehicular Networks: Facts and Future', IEEE Transactions on Intelligent Transportation Systems, 2013, 14, (4), pp. 1708-1717.
[6] M.X. Punithan, Seung-Woo Seo: 'King's Graph-Based Neighbor-Vehicle Mapping Framework', IEEE Transactions on Intelligent Transportation Systems, 2013, 14, (3), pp. 1313-1330.
[7] P.V. Brennan, L.B. Lok, K. Nicholls, H. Corr: 'Phase-sensitive FMCW radar system for high-precision Antarctic ice shelf profile monitoring', IET Radar, Sonar & Navigation, 2014, 8, (7), pp. 776-786. Article (CrossRef Link)
[8] M.-S. Lee, Y.-H. Kim: 'New data association method for automotive radar tracking', IEE Proceedings - Radar, Sonar and Navigation, 2001, 148, (5), pp. 297-301. Article (CrossRef Link)
[9] A. Polychronopoulos, A. Amditis, N. Floudas, H. Lind: 'Integrated object and road border tracking using 77 GHz automotive radars', IEE Proceedings - Radar, Sonar and Navigation, 2004, 151, (6), pp. 375-381. Article (CrossRef Link)
[10] Marc-Michael Meinecke, Hermann Rohling: 'Combination of LFMCW and FSK Modulation Principles for Automotive Radar Systems', German Radar Symposium GRS2000, 2000.
[11] Hermann Rohling, Christof Möller: 'Radar waveform for automotive radar systems and applications', Radar Conference 2008 '08 IEEE, 2008, pp. 1-4.
[12] Bi Xin, Du Jinsong: 'A New Waveform for Range-Velocity Decoupling in Automotive Radar', 2010 2nd International Conference on Signal Processing Systems (ICSPS), 2010.
[13] M. Musa, S. Salous: 'Ambiguity elimination in HF FMCW radar systems', IEE Proceedings - Radar, Sonar and Navigation, 2012, 147, (4), pp. 182-188. Article (CrossRef Link)
[14] Eugin Hyun, Woojin Oh, Jong-Hun Lee: 'Multi-target detection algorithm for FMCW radar', Radar Conference (RADAR), 2012 IEEE, 2012, pp. 338-341.
HyunGi Yoo received his B.S. degree in Electronic
Engineering from Korea Polytechnic University, Korea, in
2010. In 2013, he joined the M.S. program at Ulsan
National Institute of Science and Technology (UNIST),
Ulsan, Korea. His research interests are electronics for
electric vehicles, control systems for crack detection, and analog/RF IC
design for automotive radar technology.
MyoungYeol Park received the B.S. degree and the M.S.
degree from the University of Ulsan, Korea, in 1998.
Since 1999, he has worked for Comotech Corp. as a chief
researcher. Comotech Corp. is a leading company developing
the world's best point-to-point ultra-broad-bandwidth
wireless links, with data rates up to 1.25 Gbps, using
millimeter waves in the 60 GHz and 70/80 GHz frequency
bands. The company also produces high-performance
components from the 18 GHz K-band to the 110 GHz W-band, including 77 GHz automotive
radar front-end modules.
YoungSu Kim received the Ph.D.
degree in the School of Electrical and
Computer Engineering at Ulsan National Institute of Science and Technology
(UNIST) in 2014. He now works with
Comotech Corp. as a Senior Field
Application Manager working on E-band radio links and mmW transceivers.
Back in 2004, he was with LG-Innotek as a Junior
Research Engineer working on 77 GHz radar system and
10 GHz X-band military radars. His research interests
include E-band radiolink, RF front-end module & devices
in microwave and millimeter-wave frequency ranges for
wireless communication systems.
SangChul Ahn received the M.S.
degree from University of Ulsan,
Korea, in 2004. Since 2009, he has worked
for Comotech Corp. as a Manager in the
Development Team. His research
interests include millimeter-wave radio,
radar systems and antenna design.
Franklin Bien is currently an Associate
professor in the School of Electrical
and Computer Engineering at Ulsan
National Institute of Science and
Technology (UNIST), Ulsan, Korea,
since March, 2009. Prior to joining
UNIST, Dr. Bien was with Staccato
Communications, San Diego, CA as a
Senior IC Design Engineer working on analog/mixed-signal IC and RF front-end blocks for Ultra-Wideband
(UWB) products such as Wireless-USB in 65 nm CMOS
technologies. Prior to working at Staccato, he was with
Agilent Technologies and Quellan Inc., developing transceiver ICs for enterprise segments that improve the speed
and reach of communications channels in consumer,
broadcast, enterprise and computing markets. His research
interests include analog/RF IC design for wireless communications, signal integrity improvement for 10+Gb/sec
broadband communication applications, circuits for wireless
power transfer technologies, and electronics for electric
vehicles.
IEIE Transactions on Smart Processing and Computing, vol. 4, no. 4, August 2015
http://dx.doi.org/10.5573/IEIESPC.2015.4.4.265
A Novel Red Apple Detection Algorithm Based on
AdaBoost Learning
Donggi Kim, Hongchul Choi, Jaehoon Choi, Seong Joon Yoo and Dongil Han
Department of Computer Engineering, Sejong University
{kdg1016,maltessfox,s041735}@sju.ac.kr, {sjyoo,dihan}@sejong.ac.kr
* Corresponding Author: Dongil Han
Received June 20, 2015; Revised July 15, 2015; Accepted August 24, 2015; Published August 31, 2015
* Short Paper
Abstract: This study proposes an algorithm for recognizing apple trees in images and detecting
apples to measure the number of apples on the trees. The proposed algorithm explores whether
there are apple trees or not based on the number of image block-unit edges, and then it detects apple
areas. In order to extract colors appropriate for apple areas, the CIE L*a*b* color space is used. In
order to extract apple characteristics strong against illumination changes, modified census
transform (MCT) is used. Then, using the AdaBoost learning algorithm, characteristics data on the
apples are learned and generated. With the generated data, the detection of apple areas is made. The
proposed algorithm has a higher detection rate than existing pixel-based image processing
algorithms and minimizes false detection.
Keywords: Crop yield estimation, Image segmentation, Apple tree detection, Apple detection, Object detection
1. Introduction
Generally, a crop disaster evaluation procedure entails
a sample survey that measures the yield before and after a
natural disaster through visual inspection and manual work
to judge the size of the crops and the disaster damage. This
consumes a lot of time and money, and it is possible to
compromise fairness depending on the accuracy of the
inspector. However, developments in image processing
and machine vision technologies have been proposed as a
solution to these problems. Most existing studies on fruit
detection have determined the area with the pixel-based
image processing method and counted the number of
detected areas [1-3].
This study analyzes the shape of the apple tree to detect
its existence and recognize the apple tree in the first stage.
Then, apples on the trees are detected using learned apple
data through AdaBoost Learning as an edge-based preprocess. Various detection errors occur when detecting
apples on a tree. Examples of errors in apple detection
include detecting other objects with similar colors to the
fruit, errors due to reflection or the shade of the apples, and
not detecting fruit hidden behind objects like leaves. This
method extracts colors suitable to an apple area using the
CIE L*a*b color space to minimize detection errors with
colors similar to an apple.
In addition, this study utilizes modified census
transform (MCT), which reduces the influence from
illumination changes by extracting structural information
about the apple region. Also, in the post-processing stage,
outliers are eliminated by using normal distribution
characteristics, which finally reduce the fault detection
area of the apple.
2. Related Work
Existing studies that detected and measured the fruit
area with an image processing technique generally utilized
external or structural information of the pixel itself. They
proposed a method that implemented labeling and
boundary detection after removing the background of the
input image and extracting the area of the fruit, finally
counting and calculating the amount of fruit. Wang et al.
[2] used the HSV color space to extract the pixel area of
the red of the apple and labelled nearby pixels. Patel et al.
[5] implemented a noise reduction algorithm on the
extracted area based on color to detect the orange color
and judged circular forms as the fruit through boundary
detection.
Fig. 1. Entire system block diagram.
Other studies have shown common detection
errors including missing the fruit area owing to the light
source, missing fruit hidden by other fruit or by a leaf, and
counting as fruit items that have colors similar to the fruit.
3. The Proposed Scheme
This paper proposes a system that dramatically reduces
the fault detection rate while minimizing the effect from
the existing light source by introducing AdaBoost
Learning and MCT algorithms. The proposed apple
detection system consists of apple tree recognition and
apple area detection modules. The pre-process, or apple
tree recognition module, analyzes the external shape of the
tree on the input image based on the edge to judge the
existence of the apple tree, and then goes to the apple area
detection stage.
The block diagram for the entire system is shown in Fig. 1.
Fig. 2. Apple tree region extraction algorithm.
3.1 Apple Tree Recognition
Most existing apple detection systems generate false
detection results during the detection process, even when
the input image does not contain an apple tree. An
unfocused camera image also generates the same results.
To solve such recognition errors, this paper segments the
input image into blocks as a pre-process of the system.
Then, the number of edges in each block is extracted to
judge whether the block is included in the tree candidate
area. The apple area contains a smaller number of edges
than that of the tree area, causing empty space in the tree
area, and the empty space is filled. Finally, the tree block
candidate is judged as a tree if the block meets a certain
rate. The study tested hundreds of apple tree images to
analyze the edge information on the tree shape and finally
extract them to more precisely apply the tree shape
information to the algorithm.
Th1 is the parameter indicating the number of pixels on
the edge for each block area, and Th2 is the parameter
indicating the number of blocks corresponding to the tree
block in each block area. Th1 and Th2 had values of 50
and 120, respectively, on an experimental basis. The edge
extraction resulting images from the tree shape pre-process
and the proposed algorithm are as follows.
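The block-level decision described above can be sketched in a few lines. The sketch below is not the authors' implementation: it substitutes a simple gradient-magnitude edge detector for the paper's edge extraction and assumes a block size of 24 x 24 pixels, while Th1 = 50 and Th2 = 120 are the thresholds given in the text.

```python
# Minimal sketch of block-based apple-tree recognition (assumed edge detector
# and block size; Th1/Th2 taken from the text).
import numpy as np

def is_apple_tree(gray, block=24, edge_thresh=40, th1=50, th2=120):
    gy, gx = np.gradient(gray.astype(float))
    edges = np.hypot(gx, gy) > edge_thresh          # crude edge map
    h, w = edges.shape
    tree_blocks = 0
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            # Th1: a block becomes a tree-candidate block if it holds enough edge pixels
            if edges[y:y + block, x:x + block].sum() >= th1:
                tree_blocks += 1
    # Th2: the image is judged to contain a tree if enough candidate blocks exist
    return tree_blocks >= th2

print(is_apple_tree(np.random.randint(0, 255, (288, 384))))
```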
3.2 CIE LAB Color Space Analysis
It is important to clearly separate the apple region from
the background to precisely detect the apples in the image.
Therefore, this study compared and analyzed various color
Fig. 3. An example of the apple tree region extraction result: (a) input image; (b) apple tree region extraction.
space models (RGB, HSI, CIE L*a*b) to find out the
proper color range of the apple. The CIE L*a*b space [11],
similar to the human visualization model, showed the most
remarkable separation of the apple from the background.
Therefore, this study used the L*, a* and b* color space to
define the range of the red apple area. The color range in
the defined condition and the extracted apple area are as
follows.
0 ≤ L* ≤ 100,  15 ≤ a* ≤ 80,  0 ≤ b* ≤ 60          (1)
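A minimal sketch of applying the range in Eq. (1) is given below, assuming scikit-image is available for the RGB-to-CIE L*a*b* conversion (rgb2lab returns L* in 0-100 and a*, b* in roughly -128 to 127, matching the ranges above). It is an illustration, not the authors' code.

```python
# Minimal sketch: extract apple-colored pixels with the L*a*b* range of Eq. (1).
import numpy as np
from skimage.color import rgb2lab

def apple_color_mask(rgb_image):
    """Boolean mask of pixels whose L*a*b* values fall inside Eq. (1)."""
    lab = rgb2lab(rgb_image)                       # expects an RGB image
    L, a, b = lab[..., 0], lab[..., 1], lab[..., 2]
    return (0 <= L) & (L <= 100) & (15 <= a) & (a <= 80) & (0 <= b) & (b <= 60)

# Example: an apple-like red pixel satisfies the condition, a green one does not.
sample = np.array([[[200, 30, 40], [0, 128, 0]]], dtype=np.uint8)
print(apple_color_mask(sample))                    # [[ True False]]
```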
The three coordinates of CIE L*a*b* represent the
lightness of the color (L* = 0 yields black and L* = 100
indicates diffuse white), its position between red/magenta
and green (a*; negative values indicate green, while
positive values indicate magenta), and its position between
yellow and blue (b*; negative values indicate blue and
positive values indicate yellow).
Figures with various color space models (RGB, HSI,
CIE L*a*b*) used to find the proper color range of the
apple and background areas are shown in Fig. 6.
Fig. 4. Apple tree shape analysis examples.
Fig. 5. Apple area extraction from the CIE L*a*b* color space: (a) input image; (b) apple color area extraction.
3.3 MCT
The MCT of the apple area extracted from the input
image may minimize the effect from the light source and
extract only texture information in the apple area, which
minimizes fault detection due to reflection from the light
source or items hidden by shade on the apple area. The
MCT calculates the average brightness based on a 3×3
mask, and then compares the brightness of the nearby
pixels to that average.
Eq. (2) and the MCT calculation process are as follows.
X = (x, y) means the location of each pixel in the image,
and the image brightness corresponding to each position is
defined as I(X). The 3×3 window of which X is the center
is W(X); N′ is the set of pixels in W(X), and Y represents
each of the nine pixels in the window. In addition, Ī(X) is
the average value of the pixels in the window, and I(Y) is
the brightness value of each pixel in the window. As a
comparison function, ζ() becomes 1 if Ī(X) < I(Y);
otherwise, it is 0. As a concatenation operator, ⊗ connects
the binary comparison patterns, and the nine binary values
are connected through this operation. The MCT was applied
to apple and non-apple areas in a 20 × 20 window.

Γ(X) = ⊗(Y ∈ N′) ζ(Ī(X), I(Y))          (2)

The extracted features effectively distinguish apple and
non-apple areas through the AdaBoost learning algorithm
proposed by Viola and Jones [12].
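The transform in Eq. (2) can be sketched directly in NumPy: for every interior pixel, the 3 x 3 neighborhood is compared against its own mean and the nine comparison bits are concatenated into one census code. This is an illustrative re-implementation under our own assumptions (border pixels are simply left at zero); the resulting codes are the kind of features that would then be fed to the AdaBoost learning stage.

```python
# Minimal sketch of the modified census transform in Eq. (2).
import numpy as np

def mct(gray):
    gray = gray.astype(float)
    h, w = gray.shape
    out = np.zeros((h, w), dtype=np.uint16)
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            window = gray[y - 1:y + 2, x - 1:x + 2]        # W(X): 3x3 window
            mean = window.mean()                           # average brightness Ī(X)
            bits = (window > mean).astype(np.uint16)       # ζ(Ī(X), I(Y)) for each Y
            out[y, x] = int("".join(map(str, bits.ravel())), 2)   # ⊗: concatenate 9 bits
    return out

patch = np.random.randint(0, 255, (20, 20))                # a 20 x 20 training window
codes = mct(patch)                                         # MCT features for learning
print(codes.shape, codes.max())
```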
3.4 Removal of Fault Detection Area
There may exist falsely detected areas due to colors and
patterns similar to the apple area during apple area
detection. To eliminate such errors, the study calculated
the average (μ) and standard deviation (σ) of the areas of
the apple regions detected during apple image extraction.
Following the normal distribution, 99.7% of the detected
apple areas fall within the 3σ range (μ - 3σ, μ + 3σ).
Therefore, the study judged a red object whose area fell
outside this range to be a false detection rather than an
apple and eliminated it from the detection area.
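A minimal sketch of this post-processing rule is shown below; the area values are synthetic and serve only to illustrate how a region whose area lies outside μ ± 3σ is discarded.

```python
# Minimal sketch of 3-sigma removal of falsely detected regions.
import numpy as np

def remove_area_outliers(areas):
    areas = np.asarray(areas, dtype=float)
    mu, sigma = areas.mean(), areas.std()
    keep = (areas >= mu - 3 * sigma) & (areas <= mu + 3 * sigma)
    return areas[keep], keep

rng = np.random.default_rng(0)
areas = np.concatenate([rng.normal(400, 25, size=30),   # plausible apple areas (pixels)
                        [5200.0]])                       # one oversized red region
kept, mask = remove_area_outliers(areas)
print(len(areas), len(kept))                             # 31 30: the outlier is discarded
```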
Fig. 6. Apple and background areas in the RGB, HSI and CIE L*a*b* color spaces (red: apple region, green: background).
Fig. 7. MCT calculation process.
Fig. 8. MCT of two images with different objects: (a) experiment image 1; (b) experiment image 2; (c) MCT conversion image of (a); (d) MCT conversion image of (b).
Table 1. Processing time and accuracy of the proposed method.
Image index    Pieces of fruit    Detection rate (%)    False detections
1              15                 86.70                 2
2              15                 80.00                 3
3              16                 68.80                 5
4              17                 70.60                 5
5              18                 77.80                 4
6              16                 68.80                 5
7              16                 87.50                 2
8              15                 80.00                 3
9              15                 86.70                 2
10             15                 80.00                 1
Average (30)   502                80.68                 18

Module name              Image size    Processing timing    Cumulative timing
Image resize             864 x 648     -                    -
Apple tree recognition   384 x 288     0.370 sec            0.370 sec
Apple detection          384 x 288     0.440 sec            0.810 sec
4. Performance Evaluation
Apple detection was verified with about 30 apple
images under environments with various apple sizes,
colors and light sources.
Table 2. Comparison of detection performance (coefficient of determination, R² value).
Proposed method            0.8402
Chinhhuluun and Lee [6]    0.7621
Aggelopoulou et al. [8]    0.8006
Yeon et al. [1]            0.8300
Linker et al. [3]          0.7225
A comparison of the proposed apple detection method
against the study by Yeon et al. [1] shows that the
proposed method recorded a 3.5% higher detection rate,
and faulty detection was dramatically decreased.
Furthermore, we have compared the coefficient of
determination values of existing detection systems. The
proposed method shows the best results, as shown in Table
2. Fig. 10 shows the output of our apple detection system
for color images.
Fig. 9. Comparison between ground truth and the proposed method.
Fig. 10. Results of apple area detection.
5. Conclusion
The analysis of existing apple detection methods shows
that a lot of the faulty detection was due to the color, the
light source, and shades being similar to the apple.
This paper recognizes the apple tree first and extracts
proper colors for the apple area through various color
space analyses as a pre-process to solving various
problems in apple detection. In addition, the study applied
MCT to minimize problems in reflection and shade, and
conducts an AdaBoost machine learning process on the
applied features to learn the shape information (pattern)
corresponding to the apple. The study developed an apple
detection algorithm that dramatically decreases faulty
detection and improves the detection rate, compared to
existing studies, and verified that it operates in real time.
Acknowledgement
This work was supported by a National Research
Foundation of Korea Grant funded by the Korean
Government (No. 2012-007498), (NRF-2014R1A1A2058592),
and was also supported by the ICT R&D Program of MSIP
[I0114-14-1016, Development of Prediction and Response
Technology for Agricultural Disasters Based on ICT] and
the Creative Vitamin Project.
References
[1] Hanbyul Yeon, SeongJoon Yoo, Dongil Han, Jinhee Lim, “Automatic Detection and Count of Apples Using N-adic Overlapped Object Separation Technology”, Proceedings of the International Conference on Information Technology and Management, November 2013. Article (CrossRef Link)
[2] Qi Wang, Stephen Nuske, Marcel Bergerman, et al., “Design of Crop Yield Estimation System for Apple Orchards Using Computer Vision”, in Proceedings of ASABE, July 2012. Article (CrossRef Link)
[3] Raphael Linker, Oded Cohen, Amos Naor, “Determination of the number of green apples in RGB images recorded in orchards”, Journal of Computers and Electronics in Agriculture, pp. 45-57, February 2012. Article (CrossRef Link)
[4] Y. Song, C.A. Glasbey, G.W. Horgan, G. Polder, J.A. Dieleman, G.W.A.M. van der Heijden, “Automatic fruit recognition and counting from multiple images”, Journal of Biosystems Engineering, pp. 203-215, February 2013. Article (CrossRef Link)
[5] H. N. Patel, R. K. Jain, M. V. Joshi, “Automatic Segmentation and Yield Measurement of Fruit using Shape Analysis”, Journal of Computer Applications, 45(7), pp. 19-24, 2012. Article (CrossRef Link)
[6] Radnaabazer Chinhhuluun, Won Suk Lee, “Citrus Yield Mapping System in Natural Outdoor Scenes using the Watershed Transform”, in Proceedings of the American Society of Agricultural and Biological Engineers, 2006. Article (CrossRef Link)
[7] M.W. Hannan, T.F. Burks, D.M. Bulano, “A Machine Vision Algorithm for Orange Fruit Detection”, Agricultural Engineering International: the CIGR Ejournal, Manuscript 1281, Vol. XI, December 2009. Article (CrossRef Link)
[8] A. D. Aggelopoulou, D. Bochtis, S. Fountas, K. C. Swain, T. A. Gemtos, G. D. Nanos, “Yield Prediction in apple orchards based on image processing”, Journal of Precision Agriculture, 2011. Article (CrossRef Link)
[9] Ulzii-Orshikh Dorj, Malrey Lee, Sangsub Han, “A Comparative Study on Tangerine Detection Counting and Yield Estimation Algorithm”, Journal of Security and Its Applications, 7(3), pp. 405-412, May 2013. Article (CrossRef Link)
[10] Bernhard Fröba and Andreas Ernst, “Face detection with the Modified Census Transform”, IEEE International Conference on Automatic Face and Gesture Recognition (AFGR), pp. 91-96, Seoul, Korea, May 2004. Article (CrossRef Link)
[11] CIE Color Space. [Online]. Available: Article (CrossRef Link)
[12] P. Viola and M. Jones, “Fast and robust classification using asymmetric AdaBoost and a detector cascade”, in NIPS 14, 2002, pp. 1311-1318. Article (CrossRef Link)
Donggi Kim received his BSc in
Electronic Engineering from Sunmoon
University, Asan, Korea, in 2014. He
is currently in the Master’s course in
the Vision & Image Processing
Laboratory at Sejong University. His
research interest is image processing.
Hongchul Choi received his BSc in
Computer Engineering from Sejong
University, Seoul, Korea, in 2013. He
is currently in the Master’s course in
the Vision & Image Processing
Laboratory at Sejong University. His
research interest is image processing.
Jaehoon Choi received his BSc in
Computer Engineering from Sejong
University, Seoul, Korea, in 2013. He
is currently in the Master’s course in
the Vision & Image Processing
Laboratory at Sejong University. His
research interest is image processing.
Seong Joon Yoo From 1982 to 2000,
he was Information Search and Research
Team Leader at the Electronics and
Communications Research Institute and
was head of the research institute of
Search Cast Co., Ltd. Since 2002, he
has been with the Department of
Computer Engineering, Sejong University, Korea, where he is currently a professor. His
research interests include data mining.
Dongil Han received his BSc in
Computer Engineering from Korea
University, Seoul, Korea, in 1988, and
an MSc in 1990 from the Department
of Electric and Electronic Engineering,
KAIST, Daejeon, Korea. He received
his PhD in 1995 from the Department
of Electric and Electronic Engineering
at KAIST, Daejeon, Korea. From 1995 to 2003, he was
senior researcher in the Digital TV Lab of LG Electronics.
Since 2003, he has been with the Department of Computer
Engineering, Sejong University, Korea, where he is
currently a professor. His research interests include image
processing.
IEIE Transactions on Smart Processing and Computing, vol. 4, no. 4, August 2015
http://dx.doi.org/10.5573/IEIESPC.2015.4.4.272
Construction of Confusion Lines for Color Vision
Deficiency and Verification by Ishihara Chart
Keuyhong Cho, Jusun Lee, Sanghoon Song, and Dongil Han*
Department of Computer Engineering, Sejong University
isolat09@naver.com, jusunleeme@nate.com, song@sejong.ac.kr, dihan@sejong.ac.kr
* Corresponding Author: Dongil Han
Received June 20, 2015; Revised July 15, 2015; Accepted August 24, 2015; Published August 31, 2015
* Regular Paper
Abstract: This paper proposes color databases that can be used for various purposes for people with
a color vision deficiency (CVD). The purpose of this paper is to group colors within the sRGB
gamut into the CIE L*a*b* color space using the Brettel algorithm to simulate the representative
colors of each group into colors visible to people with a CVD, and to establish a confusion line
database by comparing colors that might cause confusion for people with different types of color
vision deficiency. The validity of the established confusion lines were verified by using an Ishihara
chart. The different colors that confuse those with a CVD in an Ishihara chart are located in the
same confusion line database for both protanopia and deutanopia. Instead of the 3D RGB color
space, we have grouped confusion colors to the CIE L*a*b* space coordinates in a more distinctive
and intuitive manner, and can establish a database of colors that can be perceived by people with a
CVD more accurately.
Keywords: Color vision deficiency, Ishihara chart, Confusion line, Color vision deficiency simulation
1. Introduction
The recent remarkable developments in display technology have ushered in an environment where color
information display devices are ubiquitous, and where
people can enjoy more color information than at any time
in history. Although about 92% of the world's people can
enjoy the benefits of rich color, for the remaining 8% who
have a color vision deficiency (CVD), it is not possible.
People with a CVD are usually classified into three groups:
protan (red color-blind), deutan (green color-blind), and
tritan (blue color-blind). The protan population has
abnormal L-cone cells, which are sensitive to red light,
whereas the deutan population has abnormal M-cone cells,
which perceive green; both types account for approximately 95% of the people with a CVD. The remaining 5%,
a small percentage, belong to the tritan group, characterized by an absence of the S-cone cells sensitive to blue
light [1].
In this paper, we proposed an algorithm to construct a
database of confusion lines in order to identify colors that
cause confusion for people with a CVD. To solve the
problem, we used the protanopia and deuteranopia
simulation [2, 3, 4] algorithms proposed by Brettel to
construct a database of confusion lines within the sRGB
gamut, but in the CIE L*a*b* color space [5]. However,
because the definition of color differences within the
sRGB color gamut is ambiguous, we selected the
representative values from the sRGB color gamut in the
CIE L*a*b* color space, which is more similar to real
color perception in humans, in order to produce more
effective results. By using colors that are grouped
according to different types of CVD, we could establish a
database of confusion lines that cause confusion for people
with each type of CVD.
2. Related Work
Some previous research [6] proposed a way of
changing colors from the color pallet that cause confusion
for people with a CVD in order to prevent that confusion,
and other research [7, 8] proposed establishing a database
of confusion lines by using the sRGB color space. But this
previous research did not consider the characteristics of
human color perception. As a result, some color data might
be lost in the process of grouping representative values.
Given this, we used colors within the sRGB gamut in the
CIE L*a*b* color space to reflect the features of human
color perception, and that color gamut is shown in Fig. 1 [9].
Fig. 1. sRGB color gamut in the CIE L*a*b* color space.
Fig. 2. The four steps of the confusion line database construction procedure.
3. The Proposed Scheme
Fig. 2 diagrams how to construct a confusion line
database (DB) proposed by this research. In the preliminary
phase, we created color boxes in the CIE L*a*b* color
space by grouping RGB color values to CIE L*a*b*
coordinates, and selected their central points as their
representative values. We selected the representative
values and simulated them by using the protanopia and
deutanopia simulation algorithms proposed by Brettel, and
constructed a database of confusion lines based on the
given conditions by comparing the simulated representative values.
3.1 Grouping Phase of RGB Values in CIE L*a*b*
The gamut of the CIE L*a*b* color space [10] is L*:
0 ~ 100, a*: -128 ~ 128, b*: -128 ~ 128. This phase groups
all L*a*b* color values to the CIE L*a*b* 3D color
coordinates. In the CIE L*a*b* color space, five units in
L* are grouped together, while 13 units in each of a* and
b* are grouped together. Therefore, virtual boxes such as
the L* box, a* box and b* box were created by dividing L*, a*
and b*, respectively, into 20 equal parts. We can create
these virtual boxes by using Eqs. (1) to (3). Although the
number of virtual boxes in the CIE L*a*b* color space can
be 8,000 (20x20x20), we removed colors that exist in the
CIE L*a*b* color space but not in the sRGB space. The
total number of virtual boxes used by this research is 982.
After creating virtual boxes from (0, 0, -2) to (19, -2, 5),
we calculated the central points of the virtual boxes by
using Eqs. (4) to (6), and then assigned the representative
values of L*pri, a*pri and b*pri to their respective virtual
boxes.

L*box = L*/5;            (1)
a*box = a*/13;           (2)
b*box = b*/13;           (3)
L*pri = L*box + 2.5;     (4)
a*pri = a*box + 6.5;     (5)
b*pri = b*box + 6.5;     (6)
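Under our reading of Eqs. (1) to (6), the grouping phase can be sketched as follows. Floor division is assumed for the box indices, which is consistent with boxes such as (0, 0, -2) in Table 1 having negative a*/b* indices; this is an illustration, not the authors' code.

```python
# Minimal sketch of the virtual-box grouping and representative values.
import math

def lab_to_box(L, a, b):
    """Eqs. (1)-(3): map an L*a*b* color to its virtual-box index."""
    return (math.floor(L / 5), math.floor(a / 13), math.floor(b / 13))

def box_representative(box):
    """Eqs. (4)-(6): representative (central) value assigned to a box."""
    L_box, a_box, b_box = box
    return (L_box + 2.5, a_box + 6.5, b_box + 6.5)

box = lab_to_box(50.0, 20.0, -20.0)
print(box, box_representative(box))    # (10, 1, -2) (12.5, 7.5, 4.5)
```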
3.2 Simulation Phase of Representative Color Values
In the simulation phase of representative color values,
we conducted the simulation of color appearance by
comparing the representative color value of the central
point of a rectangular L*a*b* virtual box, which was
created in the RGB value grouping process, to the CIE
L*a*b* space. We also implemented the protanopia and
deutanopia simulation algorithms proposed by Vienot et al.
[2]. Fig. 3 shows the confusion lines of protanopia in a CIE
1931 chromaticity diagram and the simulation image
perceived by someone with protanopia.
3.3 Comparison Phase of Representative
Values of Confusion Lines
The representative values of the existing confusion
lines were calculated using Eqs. (4) to (6). Eq. (7)
calculates the color difference between
( L*1 , a1* , b1* )
and
(L2*, a2*, b2*) in the CIE L*a*b* space.
(a) Yellow confusion line group (b) Blue confusion line group
Fig. 3. Before and after Brettel color simulation in a CIE 1931 chromaticity diagram.
If condition (8) is met,
each representative color pair, which looked different
before color simulation, looks like the same color after the
simulation. And this color pair causes confusion to people
who have a CVD, but not to people who do not. In
addition, whenever a color is added to a confusion line, the
representative values PL*( x ) , Pa*( x ) , Pb*(x ) of each
confusion line were recalculated using Eqs. (9) - (11) to
provide a more reliable algorithm to construct a database
of confusion lines.
A conventional color difference equation uses equation
(12), which was used to calculate the existing color
differences [10, 11] with two color pairs: (L1*, a1*, b1*),
(L2*, a2*, b2*). There are some cases where people with a
CVD can distinguish similar colors, depending on the level
of brightness. To avoid this, we used condition (8) instead
of condition (12).
ΔL* = L1* - L2*,  Δa*b* = sqrt((a1* - a2*)² + (b1* - b2*)²)        (7)

ΔL* <= 3 and Δa*b* <= 15                                           (8)

PL*(x) = (1/nx) · Σ(i=0..nx) Li*                                   (9)

Pa*(x) = (1/nx) · Σ(i=0..nx) ai*                                   (10)

Pb*(x) = (1/nx) · Σ(i=0..nx) bi*                                   (11)

ΔE* = sqrt((a1* - a2*)² + (b1* - b2*)² + (L1* - L2*)²)             (12)
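The comparison step based on Eqs. (7) and (8) can be sketched as follows; it is an illustrative re-implementation, not the authors' code, and the absolute value |ΔL*| is used for the lightness difference.

```python
# Minimal sketch: decide whether two simulated representative colors
# belong on the same confusion line according to condition (8).
import math

def same_confusion_line(c1, c2, dL_max=3.0, dab_max=15.0):
    L1, a1, b1 = c1
    L2, a2, b2 = c2
    dL = abs(L1 - L2)                                   # ΔL*, Eq. (7)
    dab = math.hypot(a1 - a2, b1 - b2)                  # Δa*b*, Eq. (7)
    return dL <= dL_max and dab <= dab_max              # condition (8)

# Colors that differ mainly in brightness are not merged, while colors with
# nearly identical L* and close a*b* values are placed on the same line.
print(same_confusion_line((52.5, 6.5, -19.5), (67.5, 6.5, -19.5)))   # False
print(same_confusion_line((52.5, 6.5, -19.5), (54.0, 10.0, -12.0)))  # True
```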
3.4 Discrimination Phase of Confusion Lines by Major Colors
When we looked into the simulated color DB after we
finished the construction of confusion lines using the
algorithm proposed by this research, the simulated colors
were divided into two dominant colors. The major colors
that can be perceived by people with protanopia or
deutanopia are divided into yellow and blue regions. In
this research, we classified each confusion line either into
the yellow DB or into the blue DB. By dividing the
confusion line DB by major colors, confusing colors can
be clearly distinguished. Given this, we converted the
coordinates in the CIE L*a*b* color space into spherical
coordinates, and distinguished confusion lines by major
colors using condition (13) [12], which was designed to
discriminate colors using the angle φ in order to make color
compensation more useful for people with a CVD.

Blue:   φ >= 270° || φ <= 90°                           (13)
Yellow: φ < 270° && φ > 90°

where φ is the azimuth angle component in a spherical
coordinate system. The result of this algorithm is shown in
Fig. 4. The x-axis represents the number of colors in each
confusion line, and the y-axis represents the confusion line
type in both groups.
Fig. 4. Yellow confusion line group and blue confusion line group for people with protanopia.
4. Experimental Results
To verify the confusion lines created based on the
algorithm proposed by this research, we compared them
with the existing confusion lines. Fig. 5(a) shows the
theoretical confusion lines for people with protanopia. Fig.
5(b) shows the generated confusion lines from using the
proposed algorithm. The theoretical confusion lines ignore
brightness. Thus, the colors on the same confusion lines
can be differentiated by someone with protanopia because
of the difference in brightness. But the confusion lines
established by the suggested algorithm can distinguish
brightness, thus creating more effective confusion lines.
Fig. 6 shows the generated confusion lines by using the
proposed algorithm in the CIE L*a*b* color space. Fig. 7
shows all confusion lines for those with protanopia and
deutanopia. The vertical axis indicates the number of
confusion lines for people with each type of CVD, while
the horizontal axis shows the different colors that exist on
the same confusion line. For example, Fig. 7(c) lists the
representative color values of each confusion line for
people with deutanopia. We can see that red and green
exist on the same confusion lines. Because each CVD
patient can discern similar colors depending on brightness,
we can see that they do not exist on the same horizontal
axis in Fig. 7(c). For Fig. 7(a), the results are the same.
Fig. 8 shows three confusion lines, which were magnified
to help explain Fig. 7. After the construction of confusion
lines for people with protanopia or deutanopia, we found
that the number of confusion lines for those with
protanopia amounted to 104, ranging from P1 to P104,
while the number for those with deutanopia was 95,
ranging from D1 to D95, as shown in Table 1.
Fig. 5. Proposed confusion lines for people with protanopia within the CIE xy color space: (a) confusion lines in the CIE xy color space [5]; (b) proposed confusion lines.
Fig. 6. Proposed confusion lines for people with protanopia within the CIE L*a*b* color space: (a) CIE L*a*b* color space; (b) proposed confusion lines.
Fig. 7. Arrangement of representative colors for created confusion lines: (a) protanopia confusion lines (104 lines); (b) simulation image of (a) perceived by someone with protanopia; (c) deutanopia confusion lines (94 lines); (d) simulation image of (c) perceived by someone with deutanopia.
Fig. 8. Magnified images of the arrangement of representative colors: (a) deutanopia confusion lines D60, D61, and D62; (b) simulation image of (a) perceived by someone with deutanopia.
Fig. 9. Ishihara chart test results: (a) Ishihara chart perceived by people without a CVD; (b) Ishihara chart perceived by people with a CVD; (c) colors on the P14 confusion line within an Ishihara chart; (d) colors on the P17 confusion line within an Ishihara chart.
Table 1. Constructed confusion lines.
Deficiency     Confusion lines   Confusion boxes
Protanopia     P1                (0 0 -2)
               ...               ...
               P27               (5 3 -6), (5 4 -7), (5 4 -6)
               ...               ...
               P104              (19 -1 0), (19 -1 1)
Deuteranopia   D1                (0 0 -2)
               ...               ...
               D27               (5 3 -4), (6 0 -3), (6 1 -4), (6 2 -4), (6 2 -3)
               ...               ...
               D95               (19 -1 0), (19 -1 1)
To verify these results, we classified colors based on
the databases of confusion lines created in an Ishihara
chart. We checked how confusion lines were distributed by
analyzing colors within an Ishihara chart using the
confusion line DB. As shown in Fig. 9(a), numbers 6 and 2
are visible for people without a CVD. Other examples are
Figs. 9(c) and (d). Figs. 9(c) and (d) show the colors that
exist on the P14 confusion line and the P17 confusion line,
respectively, within each Ishihara chart. As seen below, we
found that the color for numbers 6 and 2 in the Ishihara
chart and the background colors exist on the same
confusion lines. Therefore, people with a CVD could not
see the numbers within Ishihara charts like Fig. 9(b).
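This verification step can be sketched as a simple lookup, assuming the constructed confusion-line DB is available as a mapping from a virtual-box index to its line label. The tiny DB fragment below uses boxes taken from Table 2 in the Appendix and is for illustration only.

```python
# Minimal sketch: two colors are predicted to be indistinguishable for the
# corresponding CVD type when their virtual boxes map to the same confusion line.
def confusable(line_db, box_foreground, box_background):
    line_fg = line_db.get(box_foreground)
    line_bg = line_db.get(box_background)
    return line_fg is not None and line_fg == line_bg

# Illustrative fragment of the protanopia DB (box index -> line label),
# taken from Table 2 in the Appendix.
protan_db = {(3, 0, -2): "P14", (3, 1, -2): "P14", (4, -1, -2): "P17"}
print(confusable(protan_db, (3, 0, -2), (3, 1, -2)))   # True: both on line P14
print(confusable(protan_db, (3, 0, -2), (4, -1, -2)))  # False: different lines
```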
5. Conclusion
Previous research was conducted by grouping
confusion colors in the 3D RGB color space, and some
color data might be lost in the grouping process. However,
if we use the confusion line construction method proposed
by this research, we can group confusion colors on the CIE
L*a*b* color space coordinates in a more distinctive and
intuitive manner and can establish a database of colors that
can be perceived by people with a CVD in a more accurate
manner. Therefore, we were able to group a wider color
range and more varied colors that can be perceived by
people with a CVD than the existing confusion line
databases. The generated confusion line DB for protanopia
and deutanopia will contribute greatly to the development
of other research on people with a CVD.
Acknowledgement
This work was supported by a National Research
Foundation of Korea grant funded by the Korean
Government (NRF-2014R1A1A2058592), and was also
supported by a National Research Foundation of Korea
grant funded by the Korean Government (No. 2012-007498).
References
[1] M. P. Simunovic, "Colour vision deficiency," Eye,
24.5, pp. 747-755, 2009. Article (CrossRef Link)
[2] F. Vienot, H. Brettel, J. D. Mollon, "Digital Video
Colourmaps for Checking the Legibility of Displays
by Dichromats," Color Research and Application,
24.4, pp. 243-252, August, 1999. Article (CrossRef
Link)
[3] G. M. Machado, et al., "A physiologically-based
model for simulation of color vision deficiency,"
Visualization and Computer Graphics, IEEE Transactions on 15.6, pp. 1291-1298, 2009. Article (CrossRef
Link)
[4] C. Rigden, "'The Eye of the Beholder'-Designing for
Colour-Blind Users," British Telecommunications
Engineering 17, pp. 291-295, 1999. Article (CrossRef Link)
[5] CIE ColorSpace. [Online]. Available: Article
(CrossRef Link)
[6] D. Han, S. Yoo and B. Kim, "A Novel ConfusionLine Separation Algorithm Based on Color
Segmentation for Color Vision Deficiency," Journal
of Imaging Science & Technology, 56.3, 2012.
Article (CrossRef Link)
[7] S. Park, Y. Kim, “The Confusing Color line of the
Color deficiency in Panel D-15 using CIELab Color
Space” Journal of Korean Ophthalmic Optics Society
6 pp. 139-144, 2001 Article (CrossRef Link)
[8] P. Doliotis, et al., "Intelligent modification of colors
in digitized paintings for enhancing the visual
perception of color-blind viewers," Artificial
Intelligence Applications and Innovations III, pp.
293-301, 2009. Article (CrossRef Link)
[9] CIE 1931. [Online].Available: Article (CrossRef
Link)
[10] CIE. Technical report: Industrial colour-difference
evaluation. CIE Pub. No. 116. Vienna: Central
Bureau of the CIE; 1995. Article (CrossRef Link)
[11] Manuel Melgosa, "Testing CIELAB-based colordifference formulas," Color Research & Application,
25.1, pp. 49-55, 2000. Article (CrossRef Link)
[12] H. Brettel, F. Viénot, and J. D. Mollon, "Computerized simulation of color appearance for dichromats,"
JOSA A, 14.10, pp. 2647-2655, 1997. Article
(CrossRef Link)
Appendix
See Table 2
Table 2. Confusion line map for protanopia.
Type of confusion line | Box positions on same confusion line ( L* a* b* )
P1
( 0 0 -2)
P2
( 0 0 -1)
P3
( 1 -1 -1)
P4
( 1 -1 0), ( 1 0 0), ( 1 1 0)
P5
( 1 0 -2), ( 1 0 -1), ( 1 1 -2), ( 1 1 -1)
P6
( 1 1 -3), ( 1 2 -4), ( 1 2 -3)
P7
( 2 -1 -1), ( 2 -1 0), ( 2 0 -1), ( 2 0 0), ( 2 1 -1), ( 2 1 0), ( 2 2 -1), ( 2 2 0)
P8
( 2 0 -2), ( 2 1 -2), ( 2 2 -2)
P9
( 2 1 -3), ( 2 2 -4), ( 2 2 -3)
P10
( 2 1 1), ( 2 2 1)
P11
( 2 3 -5), ( 3 2 -4), ( 3 3 -5), ( 3 3 -4)
P12
( 3 -2 0), ( 3 -1 -1), ( 3 -1 0), ( 3 0 -1), ( 3 0 0), ( 3 1 -1), ( 3 1 0), ( 3 2 -1), ( 3 2 0), ( 4 3 -1), ( 4 3 0)
P13
( 3 -2 1), ( 3 -1 1), ( 3 0 1), ( 3 1 1), ( 3 2 1)
P14
( 3 0 -2), ( 3 1 -2), ( 3 2 -2), ( 4 3 -2)
P15
( 3 1 -3), ( 3 2 -3)
P16
( 4 -2 0), ( 4 -2 1), ( 4 -1 0), ( 4 -1 1), ( 4 0 0), ( 4 0 1), ( 4 1 0), ( 4 1 1), ( 4 2 0), ( 4 2 1), ( 5 3 0), ( 5 3 1),( 5 3 2)
P17
( 4 -1 -2), ( 4 -1 -1), ( 4 0 -2), ( 4 0 -1), ( 4 1 -2), ( 4 1 -1), ( 4 2 -2), ( 4 2 -1), ( 5 3 -2), ( 5 3 -1)
278
Cho et al.: Construction of Confusion Lines for Color Vision Deficiency and Verification by Ishihara Chart
P18
( 4 0 -3), ( 4 1 -4), ( 4 1 -3), ( 4 2 -4), ( 4 2 -3), ( 4 3 -4), ( 4 3 -3), ( 5 3 -3)
P19
( 4 1 2), ( 4 2 2)
P20
( 4 2 -5), ( 4 3 -5), ( 4 4 -6), ( 5 4 -5)
P21
( 5 -3 2), ( 5 -2 2)
P22
( 5 -2 0), ( 5 -2 1), ( 5 -1 0), ( 5 -1 1), ( 5 0 0), ( 5 0 1), ( 5 1 0), ( 5 1 1), ( 5 2 0), ( 5 2 1), ( 5 2 2),
( 6 3 0), ( 6 3 1), ( 6 3 2)
P23
( 5 -1 -2), ( 5 -1 -1), ( 5 0 -2), ( 5 0 -1), ( 5 1 -2), ( 5 1 -1), ( 5 2 -2), ( 5 2 -1), ( 6 3 -2), ( 6 3 -1), ( 6 4 -2), ( 6 4 -1)
P24
( 5 -1 2), ( 5 0 2), ( 5 1 2), ( 6 2 2), ( 6 3 3)
P25
( 5 0 -3), ( 5 1 -4), ( 5 1 -3), ( 5 2 -4), ( 5 2 -3), ( 5 3 -4), ( 6 4 -4), ( 6 4 -3)
P26
( 5 2 -5), ( 5 3 -5), ( 6 3 -4), ( 6 4 -5)
P27
( 5 3 -6), ( 5 4 -7), ( 5 4 -6)
P28
( 6 -3 1), ( 6 -3 2), ( 6 -2 1), ( 6 -2 2), ( 6 -1 1), ( 6 -1 2), ( 6 0 1), ( 6 0 2), ( 6 1 1), ( 6 1 2), ( 6 2 1), ( 7 2 1), ( 7 2 2),
( 7 2 3), ( 7 3 1), ( 7 3 2), ( 7 3 3), ( 7 4 2), ( 7 4 3)
P29
( 6 -2 -1), ( 6 -2 0), ( 6 -1 -1), ( 6 -1 0), ( 6 0 -1), ( 6 0 0), ( 6 1 -1), ( 6 1 0), ( 6 2 -1), ( 6 2 0),
( 7 3 -1), ( 7 3 0), ( 7 4 -1), ( 7 4 0), ( 7 4 1)
P30
( 6 -1 -2), ( 6 0 -3), ( 6 0 -2), ( 6 1 -3), ( 6 1 -2), ( 6 2 -3), ( 6 2 -2), ( 6 3 -3), ( 7 3 -2), ( 7 4 -3), ( 7 4 -2)
P31
( 6 1 -4), ( 6 2 -4), ( 7 4 -4)
P32
( 6 2 -5), ( 6 3 -5), ( 7 3 -4), ( 7 4 -5), ( 7 5 -5), ( 8 5 -4)
P33
( 6 3 -6), ( 6 4 -6), ( 7 5 -6)
P34
( 6 4 -7), ( 6 5 -7)
P35
( 6 5 -8), ( 7 5 -7)
P36
( 7 -3 1), ( 7 -3 2), ( 7 -2 1), ( 7 -2 2), ( 7 -1 1), ( 7 -1 2), ( 7 0 1), ( 7 0 2), ( 7 1 1), ( 7 1 2), ( 7 1 3),
( 8 2 1), ( 8 2 2), ( 8 2 3), ( 8 3 2), ( 8 3 3), ( 8 4 2), ( 8 4 3), ( 9 5 2), ( 9 5 3), ( 9 5 4)
P37
( 7 -2 -1), ( 7 -2 0), ( 7 -1 -1), ( 7 -1 0), ( 7 0 -1), ( 7 0 0), ( 7 1 -1), ( 7 1 0), ( 7 2 -1), ( 7 2 0),
( 8 3 -1), ( 8 3 0), ( 8 3 1), ( 8 4 0), ( 8 4 1),( 9 5 0), ( 9 5 1)
P38
( 7 -1 -2), ( 7 0 -3), ( 7 0 -2), ( 7 1 -3), ( 7 1 -2), ( 7 2 -3), ( 7 2 -2), ( 7 3 -3), ( 8 4 -3), ( 8 4 -2), ( 8 4 -1), ( 8 5 -2)
P39
( 7 0 -4), ( 7 1 -5), ( 7 1 -4), ( 7 2 -5), ( 7 2 -4), ( 7 3 -5), ( 8 4 -5), ( 8 4 -4), ( 8 5 -5), ( 8 5 -3)
P40
( 7 0 3)
P41
( 7 2 -6), ( 7 3 -6), ( 7 4 -6), ( 8 5 -6)
P42
( 7 4 -7), ( 8 4 -6), ( 8 5 -7), ( 9 6 -6)
P43
( 7 5 -8)
P44
( 8 -4 3), ( 8 -3 3), ( 8 -2 3), ( 8 -1 3), ( 8 0 3), ( 8 1 3), ( 9 2 3), ( 9 3 3), ( 9 4 3), ( 9 4 4), (10 5 3), (10 5 4)
P45
( 8 -3 0), ( 8 -3 1), ( 8 -2 0), ( 8 -2 1), ( 8 -1 0), ( 8 -1 1), ( 8 0 0), ( 8 0 1), ( 8 1 0), ( 8 1 1), ( 8 2 0),
( 9 2 1), ( 9 2 2), ( 9 3 0), ( 9 3 1), ( 9 3 2), ( 9 4 1), ( 9 4 2), (10 5 1), (10 5 2)
P46
( 8 -3 2), ( 8 -2 2), ( 8 -1 2), ( 8 0 2), ( 8 1 2)
P47
( 8 -2 -1), ( 8 -1 -2), ( 8 -1 -1), ( 8 0 -2), ( 8 0 -1), ( 8 1 -2), ( 8 1 -1), ( 8 2 -2), ( 8 2 -1), ( 8 3 -2),
( 9 3 -2), ( 9 3 -1), ( 9 4 -2), ( 9 4 -1), ( 9 4 0), ( 9 5 -1)
P48
( 8 -1 -3), ( 8 0 -4), ( 8 0 -3), ( 8 1 -4), ( 8 1 -3), ( 8 2 -4), ( 8 2 -3), ( 8 3 -4), ( 8 3 -3),
( 9 4 -4), ( 9 4 -3), ( 9 5 -3), ( 9 5 -2), (10 6 -3)
P49
( 8 1 -5), ( 8 2 -5), ( 8 3 -5), ( 9 4 -5), ( 9 5 -5), ( 9 5 -4), (10 6 -4)
P50
( 8 2 -6), ( 8 3 -6), ( 9 5 -6), (10 6 -5)
P51
( 8 3 -7), ( 8 4 -7), ( 9 4 -6)
P52
( 9 -4 2), ( 9 -4 3), ( 9 -3 2), ( 9 -3 3), ( 9 -2 2), ( 9 -2 3), ( 9 -1 2), ( 9 -1 3), ( 9 0 2), ( 9 0 3), ( 9 1 2), ( 9 1 3),
(10 2 2), (10 2 3), (10 2 4), (10 3 3), (10 3 4), (10 4 3), (10 4 4)
P53
( 9 -3 0), ( 9 -3 1), ( 9 -2 0), ( 9 -2 1), ( 9 -1 0), ( 9 -1 1), ( 9 0 0), ( 9 0 1), ( 9 1 0), ( 9 1 1), ( 9 2 0),
(10 2 0), (10 2 1), (10 3 0), (10 3 1), (10 3 2), (10 4 1), (10 4 2), (11 5 1), (11 5 2)
P54
( 9 -2 -2), ( 9 -2 -1), ( 9 -1 -2), ( 9 -1 -1), ( 9 0 -2), ( 9 0 -1), ( 9 1 -2), ( 9 1 -1), ( 9 2 -2), ( 9 2 -1),
(10 3 -2), (10 3 -1), (10 4 -2), (10 4 -1), (10 4 0), (10 5 -1), (10 5 0)
P55
( 9 -1 -3), ( 9 0 -4), ( 9 0 -3), ( 9 1 -4), ( 9 1 -3), ( 9 2 -4), ( 9 2 -3), ( 9 3 -4), ( 9 3 -3),
(10 4 -4), (10 4 -3), (10 5 -3), (10 5 -2), (11 6 -3), (11 6 -2)
P56
( 9 1 -5), ( 9 2 -6), ( 9 2 -5), ( 9 3 -6), ( 9 3 -5), (10 4 -5), (10 5 -6), (10 5 -5), (10 5 -4), (11 6 -4)
P57
(10 -4 1), (10 -4 2), (10 -3 1), (10 -3 2), (10 -2 1), (10 -2 2), (10 -1 1), (10 -1 2), (10 0 1), (10 0 2),
(10 1 1), (10 1 2), (10 1 3), (11 2 1), (11 2 2), (11 2 3), (11 3 2), (11 3 3), (11 4 2), (11 4 3)
P58
(10 -4 3), (10 -3 3), (10 -2 3), (10 -1 3), (10 0 3), (10 0 4), (10 1 4), (11 2 4), (11 3 4), (11 4 4)
P59
(10 -3 0), (10 -2 -1), (10 -2 0), (10 -1 -1), (10 -1 0), (10 0 -1), (10 0 0), (10 1 -1), (10 1 0), (10 2 -1),
(11 2 0), (11 3 -1), (11 3 0), (11 3 1), (11 4 0), (11 4 1), (11 5 0)
P60
(10 -2 -2), (10 -1 -3), (10 -1 -2), (10 0 -3), (10 0 -2), (10 1 -3), (10 1 -2), (10 2 -3), (10 2 -2), (10 3 -3),
(11 3 -3), (11 3 -2), (11 4 -3), (11 4 -2), (11 4 -1), (11 5 -2), (11 5 -1)
P61
(10 0 -4), (10 1 -5), (10 1 -4), (10 2 -5), (10 2 -4), (10 3 -5), (10 3 -4), (11 4 -4), (11 5 -5), (11 5 -4), (11 5 -3)
P62
(10 2 -6), (10 3 -6), (10 4 -6), (11 6 -5)
P63
(10 6 -6)
P64
(11 -5 4), (11 -4 3), (11 -4 4), (11 -3 3), (11 -3 4), (11 -2 3), (11 -2 4), (11 -1 3), (11 -1 4),
(11 0 3), (11 0 4), (11 1 3), (11 1 4), (12 2 3), (12 2 4), (12 3 4)
P65
(11 -4 1), (11 -4 2), (11 -3 1), (11 -3 2), (11 -2 1), (11 -2 2), (11 -1 1), (11 -1 2), (11 0 1), (11 0 2),
(11 1 1), (11 1 2), (12 2 1), (12 2 2), (12 3 1), (12 3 2), (12 3 3), (12 4 2)
P66
(11 -3 -1), (11 -3 0), (11 -2 -1), (11 -2 0), (11 -1 -1), (11 -1 0), (11 0 -1), (11 0 0), (11 1 -1), (11 1 0),
(11 2 -1), (12 2 0), (12 3 -1), (12 3 0), (12 4 -1), (12 4 0), (12 4 1)
P67
(11 -2 -2), (11 -1 -3), (11 -1 -2), (11 0 -3), (11 0 -2), (11 1 -3), (11 1 -2), (11 2 -3), (11 2 -2),
(12 3 -3), (12 3 -2), (12 4 -3), (12 4 -2), (12 5 -2)
P68
(11 0 -5), (11 0 -4), (11 1 -5), (11 1 -4), (11 2 -5), (11 2 -4), (11 3 -5), (11 3 -4), (11 4 -5),
(12 4 -4), (12 5 -4), (12 5 -3), (12 6 -4)
P69
(12 -5 3), (12 -5 4), (12 -4 3), (12 -4 4), (12 -3 3), (12 -3 4), (12 -2 3), (12 -2 4), (12 -1 3), (12 -1 4),
(12 0 3), (12 0 4), (12 1 3), (12 1 4), (13 1 5), (13 2 3), (13 2 4), (13 2 5)
P70
(12 -4 1), (12 -4 2), (12 -3 1), (12 -3 2), (12 -2 1), (12 -2 2), (12 -1 1), (12 -1 2), (12 0 1), (12 0 2),
(12 1 1), (12 1 2), (13 2 1), (13 2 2), (13 3 1), (13 3 2), (13 3 3)
P71
(12 -3 -1), (12 -3 0), (12 -2 -1), (12 -2 0), (12 -1 -1), (12 -1 0), (12 0 -1), (12 0 0), (12 1 -1), (12 1 0), (12 2 -1),
(13 2 -1), (13 2 0), (13 3 -1), (13 3 0), (13 4 -1)
P72
(12 -2 -3), (12 -2 -2), (12 -1 -3), (12 -1 -2), (12 0 -3), (12 0 -2), (12 1 -3), (12 1 -2), (12 2 -3), (12 2 -2),
(13 3 -3), (13 3 -2), (13 4 -3), (13 4 -2)
P73
(12 -1 -4), (12 0 -5), (12 0 -4), (12 1 -5), (12 1 -4), (12 2 -5), (12 2 -4), (12 3 -5), (12 3 -4), (13 4 -4), (13 5 -4)
P74
(13 -5 2), (13 -5 3), (13 -4 2), (13 -4 3), (13 -3 2), (13 -3 3), (13 -2 2), (13 -2 3), (13 -1 2), (13 -1 3),
(13 0 2), (13 0 3), (13 1 2), (13 1 3), (13 1 4), (14 2 2), (14 2 3), (14 2 4)
P75
(13 -5 4), (13 -4 4), (13 -3 4), (13 -2 4), (13 -1 4), (13 0 4), (14 1 4), (14 1 5)
P76
(13 -4 0), (13 -4 1), (13 -3 0), (13 -3 1), (13 -2 0), (13 -2 1), (13 -1 0), (13 -1 1), (13 0 0), (13 0 1),
(13 1 0), (13 1 1), (14 2 0), (14 2 1)
P77
(13 -3 -1), (13 -2 -2), (13 -2 -1), (13 -1 -2), (13 -1 -1), (13 0 -2), (13 0 -1), (13 1 -2), (13 1 -1),
(13 2 -2), (14 3 -2), (14 3 -1)
P78
(13 -2 -3), (13 -1 -4), (13 -1 -3), (13 0 -4), (13 0 -3), (13 1 -4), (13 1 -3),
(13 2 -4), (13 2 -3), (13 3 -4), (14 3 -3), (14 4 -3)
P79
(14 -5 2), (14 -5 3), (14 -4 2), (14 -4 3), (14 -3 2), (14 -3 3), (14 -2 2), (14 -2 3), (14 -1 2), (14 -1 3),
(14 0 2), (14 0 3), (14 1 2), (14 1 3), (15 1 4)
P80
(14 -5 4), (14 -5 5), (14 -4 4), (14 -4 5), (14 -3 4), (14 -3 5), (14 -2 4), (14 -2 5), (14 -1 4), (14 -1 5),
(14 0 4), (14 0 5), (15 1 5)
P81
(14 -4 0), (14 -4 1), (14 -3 0), (14 -3 1), (14 -2 0), (14 -2 1), (14 -1 0), (14 -1 1),
(14 0 0), (14 0 1), (14 1 0), (14 1 1), (15 2 0)
P82
(14 -3 -2), (14 -3 -1), (14 -2 -2), (14 -2 -1), (14 -1 -2), (14 -1 -1), (14 0 -2),
(14 0 -1), (14 1 -2), (14 1 -1), (14 2 -2), (14 2 -1)
P83
(14 -2 -3), (14 -1 -3), (14 0 -3), (14 1 -3), (14 2 -3), (15 3 -3)
P84
(15 -6 4), (15 -6 5), (15 -5 4), (15 -5 5), (15 -4 4), (15 -4 5), (15 -3 4), (15 -3 5), (15 -2 4), (15 -2 5),
(15 -1 4), (15 -1 5), (15 0 4), (15 0 5)
P85
(15 -5 1), (15 -5 2), (15 -4 1), (15 -4 2), (15 -3 1), (15 -3 2) (15 -2 1), (15 -2 2), (15 -1 1), (15 -1 2),
(15 0 1), (15 0 2), (15 1 1), (15 1 2)
P86
(15 -5 3), (15 -4 3), (15 -3 3), (15 -2 3), (15 -1 3), (15 0 3), (15 1 3)
P87
(15 -4 -1), (15 -4 0), (15 -3 -1), (15 -3 0), (15 -2 -1), (15 -2 0), (15 -1 -1), (15 -1 0),
(15 0 -1), (15 0 0), (15 1 -1), (15 1 0), (15 2 -1)
P88
(15 -3 -2), (15 -2 -3), (15 -2 -2), (15 -1 -3), (15 -1 -2), (15 0 -3), (15 0 -2),
(15 1 -3), (15 1 -2), (15 2 -3), (15 2 -2)
P89
(16 -6 3), (16 -6 4), (16 -5 3), (16 -5 4), (16 -4 3), (16 -4 4), (16 -3 3), (16 -3 4), (16 -2 3), (16 -2 4),
(16 -1 3), (16 -1 4), (16 0 3), (16 0 4), (16 0 5)
P90
(16 -6 5), (16 -5 5), (16 -4 5), (16 -3 5), (16 -2 5), (16 -1 5)
P91
(16 -5 1), (16 -5 2), (16 -4 1), (16 -4 2), (16 -3 1), (16 -3 2), (16 -2 1), (16 -2 2), (16 -1 1), (16 -1 2),
(16 0 1), (16 0 2), (16 1 1)
P92
(16 -4 -1), (16 -4 0), (16 -3 -1), (16 -3 0), (16 -2 -1), (16 -2 0), (16 -1 -1),
(16 -1 0), (16 0 -1), (16 0 0), (16 1 -1), (16 1 0)
P93
(16 -3 -2), (16 -2 -2), (16 -1 -2), (16 0 -2), (16 1 -2), (16 2 -2)
P94
(17 -6 3), (17 -6 4), (17 -5 3), (17 -5 4), (17 -4 3), (17 -4 4), (17 -3 3), (17 -3 4),
(17 -2 3), (17 -2 4), (17 -1 3), (17 -1 4)
P95
(17 -6 5), (17 -5 5), (17 -4 5), (17 -4 6), (17 -3 5), (17 -3 6), (17 -2 5), (17 -2 6), (17 -1 5), (17 -1 6)
P96
(17 -5 1), (17 -5 2), (17 -4 1), (17 -4 2), (17 -3 1), (17 -3 2), (17 -2 1), (17 -2 2), (17 -1 1), (17 -1 2), (17 0 1), (17 0 2)
P97
(17 -4 -1), (17 -4 0), (17 -3 -1), (17 -3 0), (17 -2 -1), (17 -2 0), (17 -1 -1), (17 -1 0), (17 0 -1), (17 0 0), (17 1 -1)
P98
(17 -3 -2), (17 -2 -2), (17 -1 -2)
P99
(18 -4 1), (18 -4 2), (18 -3 1), (18 -3 2), (18 -2 1), (18 -2 2), (18 -1 1), (18 -1 2)
P100
(18 -4 3), (18 -4 4), (18 -3 3), (18 -3 4), (18 -2 3), (18 -2 4), (18 -1 3)
P101
(18 -4 5), (18 -4 6), (18 -3 5), (18 -3 6), (18 -2 5), (18 -2 6)
P102
(18 -3 -1), (18 -3 0), (18 -2 -1), (18 -2 0), (18 -1 -1), (18 -1 0), (18 0 -1), (18 0 0)
P103
(19 -2 4), (19 -2 5)
P104
(19 -1 0), (19 -1 1)
Keuyhong Cho received his BSc in
Physics from Sejong University,
Seoul, Korea, in 2015. He is currently
a Master’s student in the Vision &
Image Processing Laboratory at
Sejong University. His research
interest is image processing.
Jusun Lee received his BSc in
Computer Engineering from Sejong
University, Seoul, Korea, in 2015. He
is currently a Master’s student in the
Vision & Image Processing Laboratory at Sejong University. His research
interest is image processing.
Sanghoon Song received his BSc in
Electronics Engineering from Yonsei
University, Seoul, Korea, in 1977, and
an MSc in 1979 from the Department
of Computer Science, KAIST, Korea.
He received his PhD in 1992 from the
Department of Computer Science at
the University of Minnesota, Minneapolis, U.S.A. Since 1992, he has been with the Department
of Computer Engineering, Sejong University, Korea,
where he is currently a professor. His research interests are
embedded computing systems, computer arithmetic, and
distributed systems.
Dongil Han received his BSc in
Computer Engineering from Korea
University, Seoul, Korea, in 1988 and
an MSc in 1990 from the Department
of Electric and Electronic Engineering,
KAIST, Daejeon, Korea. He received
his PhD in 1995 from the Department
of Electric and Electronic Engineering
at KAIST, Daejeon, Korea. From 1995 to 2003, he was
Senior Researcher in the Digital TV Lab of LG Electronics.
Since 2003, he has been with the Department of Computer
Engineering, Sejong University, Korea, where he is
currently a professor. His research interests include image
processing.
http://dx.doi.org/10.5573/IEIESPC.2015.4.4.281
IEIE Transactions on Smart Processing and Computing
A Survey of Human Action Recognition Approaches that
use an RGB-D Sensor
Adnan Farooq and Chee Sun Won*
Department of Electronics and Electrical Engineering, Dongguk University -Seoul, South Korea
{aadnan, cswon}@dongguk.edu
* Corresponding Author: Chee Sun Won
Received June 20, 2015; Revised July 15, 2015; Accepted August 24, 2015; Published August 31, 2015
* Regular Paper: This paper reviews the recent progress possibly including previous works in a particular research topic, and
has been accepted by the editorial board through the regular reviewing process.
Abstract: Human action recognition from a video scene has remained a challenging problem in the
area of computer vision and pattern recognition. The development of low-cost RGB-depth
(RGB-D) cameras has opened new opportunities to address the problem of human action recognition. In
this paper, we present a comprehensive review of recent approaches to human action recognition
based on depth maps, skeleton joints, and other hybrid approaches. In particular, we focus on the
advantages and limitations of the existing approaches and on future directions.
Keywords: Human action recognition, Depth maps, Skeleton joints, Kinect
1. Introduction
The study of human action recognition introduces
various new methods for understanding actions and
activities from video data. The main concern in human
action recognition systems is how to identify the type of
action from a set of video sequences. Systems such as interactive consumer entertainment,
gaming, surveillance, smart homes, and life care include many feasible applications [1, 2],
which has strongly motivated researchers to develop algorithms for human action recognition.
Previously, studies of action recognition focused on image
sequences captured by conventional RGB cameras [3, 4]. These 2D cameras suffer from
constraints such as sensitivity to illumination changes, surrounding clutter, and background
disorder [3, 4], which make it difficult to precisely
recognize human actions. However, with the development
of cost-effective RGB depth (RGB-D) camera sensors (e.g.,
the Microsoft Kinect), the results from action recognition
have improved, and they have become a point of
consideration for many researchers [5]. Depth camera
sensors provide more discriminating and clear information
by giving a 3D structural view from which to recognize
action, compared to visible light cameras. Furthermore,
depth sensors also help lessen and ease the low-level
complications found in RGB images, such as background
subtraction and light variations. Also, depth cameras can
be beneficial for the entire range of day-to-day work, even
at night, like patient monitoring systems. Depth images
enable us to view and assess human skeleton joints in a 3D
coordinate system. These 3D skeleton joints provide
additional information to examine for recognition of action,
which in turn increases the accuracy of the human–
computer interface [5]. Depth sensors, like the Kinect,
usually provide three types of data: depth images, 3D
skeleton joints, and color (RGB) images. So, it has been a
big challenge to utilize these data, together or
independently, to present human behavior and to improve
the accuracy of action recognition.
Human movement is classified into four levels (motion,
action, activity, and behavior), where motion is a small
movement of a body part for a very short time span.
Motion is nevertheless the key building block of the higher
levels, which are defined as follows [6].
An action is a collection of recurring different motions,
which show what a person is doing, like running, sitting,
etc., or interaction of the person with certain things. The
duration of the action lasts no more than a few seconds.
An activity is also an assortment of various actions that
help in perceiving and understanding human behavior
while performing designated tasks, like cooking, cleaning,
studying, etc., which are activities that can continue for
much longer times.
Behavior is extremely meaningful in understanding
human motion that can last for hours (or even days) and
that can be considered either normal or abnormal.
Action, activity, and behavior can thus be differentiated
mainly by their time scales. In this study, we focus on short and medium time
period actions, such as raising a hand or sitting down.
Human action recognition has been mainly focused on
three leading applications: 1) surveillance systems, 2)
entertainment environments, and 3) healthcare systems,
which comprise systems to track or follow individuals
automatically [2, 7-14]. In a surveillance system, the
authorities need to monitor and detect all kinds of criminal
and suspicious activities [2, 7, 8]. Most surveillance
systems, equipped with several cameras, require well-trained staff to monitor human actions on screen. However,
using automatic human action recognition algorithms, it is
possible to reduce the number of staff and immediately
create a security alert in order to prevent dangerous
situations. Furthermore, human action recognition systems
can also be used to identify entertainment actions,
including sports, dance and gaming. For entertainment
actions, response time to interact with a game is very
important. Thus, a number of techniques have been
developed to address this issue using depth sensors [9, 10].
In healthcare systems, it is important to monitor the
activities of a patient [11, 12]. The aim of using such
healthcare systems is to assist the health workers to care
for, treat, and diagnose patients, hence, improving the
reliability of diagnosis. These medical healthcare systems
can also help decrease the work load on medical staff and
provide better facilities to patients.
Generally, human action–recognition approaches involve
several steps, as shown in Fig. 1, where feature extraction
is one of the most important blocks and plays a vital role
in the overall action recognition system. The performance of a
feature extraction method is evaluated on the basis of the
classification accuracy of the resulting system.
Several datasets recorded with depth sensors are publicly
available for developing and evaluating new recognition systems.
Fig. 1. Flow of an action recognition system.
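As a rough illustration of the pipeline in Fig. 1, the following minimal sketch extracts placeholder statistics from depth clips and reports classification accuracy with an SVM; the feature function, split ratio, and classifier choice are illustrative assumptions and do not correspond to any specific reviewed method.

```python
# Minimal sketch of the Fig. 1 pipeline: feature extraction followed by
# classification, evaluated by accuracy. The feature extractor below is a
# placeholder (simple clip statistics), not a method from the survey.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def extract_features(depth_clip):
    # depth_clip: (T, H, W) array of depth frames.
    diffs = np.abs(np.diff(depth_clip.astype(float), axis=0))
    return np.array([depth_clip.mean(), depth_clip.std(),
                     diffs.mean(), diffs.std()])

def evaluate(clips, labels):
    X = np.stack([extract_features(c) for c in clips])
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.3,
                                              random_state=0)
    clf = SVC(kernel="linear").fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))
```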
Every dataset includes different actions and activities
performed by different volunteer subjects, and each dataset
is designed to resolve a particular challenge. Table 1
provides a summary of the most popular datasets. Most of
the methods reviewed in this paper are evaluated on one or
more of these datasets. In this survey, we review human
action recognition systems that have been proposed to
recognize human actions. This review paper is organized
as follows. In Section 2, we review human action systems
based on depth maps, skeleton joints, and hybrid methods
Table 1. Publicly available RGB-D datasets for evaluating action recognition systems.

Microsoft Research Action3D [13]: 10 subjects / 20 actions / 2-3 repetitions.
Remarks: There are a total of 567 depth map sequences with a resolution of 320x240. The dataset was recorded using the Kinect sensor. All are interactions with game consoles (i.e., draw a circle, two-hand wave, forward kick, etc.).

Microsoft Research Daily Activity 3D [14]: 10 subjects / 16 activities / 2 repetitions.
Remarks: 16 indoor activities were done by 10 subjects. Each subject performed each activity once in a standing position and once in a sitting position. Three channels are recorded using the Kinect sensor: (i) depth maps, (ii) RGB video, (iii) skeleton joint positions.

UT-Kinect Action [15]: 10 subjects / 10 actions / 2 repetitions.
Remarks: In the UT-Kinect Action dataset, there are 10 different actions with three channels: (i) RGB, (ii) depth, and (iii) 3D skeleton joints.

UCF-Kinect [16]: 16 subjects / 16 activities / 5 repetitions.
Remarks: The UCF-Kinect dataset is a long-sequence dataset that is used to test latency.

Kitchen scene action [17]: 9 activities.
Remarks: Different activities in the kitchen have been performed to recognize cooking motions.
(i.e., depth and color, depth and skeleton). A summary of
all the reviewed work is presented in Section 3, which
includes the advantages and disadvantages of each
reviewed method. The conclusion is presented in Section 4.
2. Human Action Recognition
2.1 Human Action Recognition Using
Depth Maps
Li et al. [18] introduced a method that recognizes
human actions from depth sequences. The motivation of
this work was to develop a method that does not require
joint tracking. It also uses 3D contour points instead of 2D
points. Depth maps are projected on three orthogonal
Cartesian planes, and a specified number of points along
the contours of all three projections are sampled for each
frame. These sampled points are then used as a “bag of points” to represent a set of salient postures that correspond
to the nodes of an action graph used to model the dynamics
of the actions. The authors used their own dataset for the
experiments, which later became known as the Microsoft
Research (MSR) Action3D dataset, and they achieved a
74.4% recognition accuracy. The limitation of this
approach is that the sampling of 3D points from the whole
body requires a large dataset. Also, due to noise and
occlusion in the depth maps, YZ and XZ views may not be
reliable for extracting 3D points.
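A minimal sketch of the projection and contour-sampling step just described is shown below; the depth-binning resolution, boundary test, and sampling density are assumptions chosen only to illustrate the idea of [13], not to reproduce it.

```python
# Sketch of projecting a depth frame onto the XY, XZ, and YZ planes and
# sampling contour points from each projection (after the idea of [13]).
# Binning and sampling parameters are illustrative assumptions.
import numpy as np
from scipy.ndimage import binary_erosion

def project_views(depth, z_bins=64):
    # depth: (H, W) array with 0 where there is no measurement.
    h, w = depth.shape
    zmax = float(depth.max()) or 1.0
    ys, xs = np.nonzero(depth)
    z = (depth[ys, xs] / zmax * (z_bins - 1)).astype(int)
    xy = depth > 0                                  # front-view occupancy
    xz = np.zeros((z_bins, w), bool)
    yz = np.zeros((h, z_bins), bool)
    xz[z, xs] = True                                # top view
    yz[ys, z] = True                                # side view
    return xy, xz, yz

def sample_contour(mask, n_points=100):
    # Boundary pixels = mask minus its erosion; sample n_points of them.
    boundary = mask & ~binary_erosion(mask)
    pts = np.argwhere(boundary)
    idx = np.linspace(0, len(pts) - 1, num=min(n_points, len(pts))).astype(int)
    return pts[idx]
```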
To overcome some of the issues of the method in [13], Vieira
et al. [18] proposed space–time occupancy patterns
(STOP) to represent the sequence of depth maps, where
the space and time axes are divided into multiple segments
so that each action sequence is embedded in a multiple 4D
grid. In order to enhance the role of sparse cells, which typically consist
of points on a silhouette or on moving parts of the body, a
saturation scheme was proposed. To
recognize the actions, a nearest neighbor classifier based
on cosine distance was employed. Experimental results on
the MSR Action3D dataset show that STOP features for
action classification yield better accuracy than that of
Rougier et al. [12]. The major drawback to this approach is
that they empirically set the parameter for dividing
sequences into cells.
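The sketch below illustrates a space-time occupancy feature of this kind; the grid resolution and the saturation threshold are assumptions, and the depth value is simply used as the fourth coordinate.

```python
# Sketch of a space-time occupancy feature in the spirit of STOP [18]:
# the (T, H, W) depth sequence is converted to a 4D point set (x, y, z, t),
# partitioned into a fixed grid, and each cell stores a saturated count.
# Assumes a non-empty foreground; grid and saturation values are illustrative.
import numpy as np

def stop_features(depth_seq, grid=(4, 4, 4, 4), saturate=20):
    t_idx, ys, xs = np.nonzero(depth_seq)            # foreground samples
    zs = depth_seq[t_idx, ys, xs]
    coords = np.stack([xs, ys, zs, t_idx], axis=1).astype(float)
    # Normalize every axis to [0, 1) and quantize into the 4D grid.
    coords -= coords.min(axis=0)
    coords /= (coords.max(axis=0) + 1e-9)
    bins = np.minimum((coords * np.array(grid)).astype(int),
                      np.array(grid) - 1)
    hist = np.zeros(grid)
    np.add.at(hist, tuple(bins.T), 1)
    return np.minimum(hist, saturate).ravel() / saturate
```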
A method that addresses the noise and occlusion issues
in action recognition systems using depth images was
proposed by Wang et al. [19]. The authors considered a 3D
action sequence as a 4D shape and proposed random
occupancy pattern (ROP) features extracted from
randomly sampled 4D sub-volumes of different sizes and
at different locations using a weighted sampling scheme.
An elastic-net regularized classification is then modeled to
further select the most discriminative features, which are
robust to noise and less sensitive to occlusions. Finally,
support vector machine (SVM) is used to recognize the
actions. Experimental results on the MSR Action3D
dataset show that the proposed method outperforms
previous methods by Li et al. [13] and Vieira et al. [18].
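A hedged sketch of the random-occupancy idea follows; for brevity the random boxes here are 3D spatio-temporal sub-volumes over the occupancy grid rather than full 4D sub-volumes, and the feature count and sigmoid scale are assumptions.

```python
# Sketch of random occupancy pattern (ROP)-style features in the spirit of [19]:
# each feature is the sigmoid-normalized number of foreground points inside a
# randomly placed sub-volume of the depth sequence.
import numpy as np

def rop_features(depth_seq, n_features=128, rng=None):
    rng = np.random.default_rng(rng)
    occ = depth_seq > 0                               # (T, H, W) occupancy
    T, H, W = occ.shape
    feats = np.empty(n_features)
    for i in range(n_features):
        # Random spatio-temporal box (a simplification of the 4D sub-volumes).
        t0, h0, w0 = (rng.integers(0, d) for d in (T, H, W))
        t1 = rng.integers(t0 + 1, T + 1)
        h1 = rng.integers(h0 + 1, H + 1)
        w1 = rng.integers(w0 + 1, W + 1)
        count = occ[t0:t1, h0:h1, w0:w1].sum()
        feats[i] = 1.0 / (1.0 + np.exp(-count / 100.0))  # sigmoid normalization
    return feats
```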
An action recognition system that is capable of
extracting additional shape and motion information using
3D depth maps was proposed by Yang et al. [20]. In this
system, each 3D depth map is first projected onto three
orthogonal Cartesian planes. Each projected view is
generated by thresholding the difference of consecutive
depth frames and stacks to obtain a depth motion map
(DMM) for each projected view. A histogram of oriented
gradients (HOG) [21] is then applied to each 2D projected
view to extract the features. Furthermore, the features from
all three views are then concatenated to form a DMM-HOG descriptor. An SVM classifier is used to recognize
the actions. Steps for extracting the HOG from the DMM
are shown in Fig. 2. The drawback of this system is that
their approach does not show the direction of the variation.
Also, the authors explored the number of frames required
to generate satisfactory results, which showed that roughly
35 frames are enough to generate acceptable results.
Nonetheless, it cannot be applied to complex actions to get
satisfactory results.
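The following sketch illustrates the DMM-HOG construction described above; the motion threshold and the HOG parameters are assumptions, and scikit-image's HOG is used as a stand-in for the descriptor of [21].

```python
# Sketch of a depth motion map (DMM) with an HOG descriptor, following the
# description of Yang et al. [20]: absolute differences of consecutive
# projected depth frames are thresholded and accumulated, then HOG is applied
# to each accumulated map and the three results are concatenated.
import numpy as np
from skimage.feature import hog

def depth_motion_map(frames, threshold=10):
    # frames: (T, H, W) projected depth view (e.g., the front view).
    diffs = np.abs(np.diff(frames.astype(np.int32), axis=0))
    return (diffs > threshold).sum(axis=0).astype(float)  # accumulated motion

def dmm_hog(front, side, top):
    maps = [depth_motion_map(v) for v in (front, side, top)]
    descs = [hog(m, orientations=9, pixels_per_cell=(8, 8),
                 cells_per_block=(2, 2)) for m in maps]
    return np.concatenate(descs)                          # DMM-HOG descriptor
```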
Jalal et al. [22] employed an R transform [23] to
compute a 2D angular projection map of an activity
silhouette via the Radon transform and compared the
proposed method with other feature extraction methods (i.e.,
PCA and ICA) [24, 25]. The authors argue that PCA and
ICA are sensitive to scale and translation using depth
silhouettes. Therefore, the 2D Radon transform is converted into
a 1D R transform profile to provide a highly compact
shape representation for each depth silhouette.
Fig. 2. Histogram of oriented gradients descriptor on motion maps.
Fig. 3. Framework of the human activity recognition system using R transform [22].
That is, to
extract suitable features from the 1D R transformed
profiles of depth silhouettes, linear discriminant analysis
(LDA) is used to make the features more discriminative.
Finally, the features are trained and tested using hidden
Markov models (HMMs) [26] on the codebook of vectors
generated using the Linde-Buzo-Gray (LBG) clustering
algorithm [27] for recognition. Fig. 3 shows the overall
flow of the proposed method. Experimental results show
that their feature extraction method is robust on the 10
human activities collected by the authors. Using this
dataset, they achieved an accuracy of 96.55%. The
limitation to this system is that the proposed method is
view-dependent.
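A minimal sketch of the R transform profile described above is given below, assuming a binary silhouette and using scikit-image's Radon transform; the angular resolution and the scale normalization are assumptions.

```python
# Sketch of the 1D R transform profile (after [22, 23]): the Radon transform
# of a binary depth silhouette is squared and integrated over the radial
# coordinate, giving one value per projection angle.
import numpy as np
from skimage.transform import radon

def r_transform(silhouette, n_angles=180):
    # silhouette: 2D binary array (foreground = 1).
    theta = np.linspace(0.0, 180.0, n_angles, endpoint=False)
    sinogram = radon(silhouette.astype(float), theta=theta, circle=False)
    profile = (sinogram ** 2).sum(axis=0)          # integrate over rho
    return profile / (profile.max() + 1e-9)        # scale normalization
```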
Using depth sequences, a new feature descriptor named
histogram of oriented 4D surface normal (HON4D) was
proposed by Oreifej and Liu [28]. The proposed feature
descriptor is more discriminative than the average 4D
occupancy [18] and is robust against noise and occlusion
[18]. HON4D considers the depth sequence as a surface in the
4D space of time, depth, and
spatial coordinates. In order to construct HON4D, the 4D
space is quantized using the 120 vertices of a 600-cell
polychoron. Then, the quantization is refined using a
discriminative density measure by adding projectors in
the directions where the 4D normals are denser
and more discriminative. An SVM classifier is used to
recognize the actions. Experimental results show that
HON4D achieves high accuracy compared to state-of-the-art methods. The limitation of this system is that HON4D
can only roughly characterize the local spatial shape
around each joint to represent human–object interaction.
Also, differential operation on depth images can enhance
noise.
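The sketch below illustrates the 4D surface-normal computation underlying this idea; note that, as an assumption, a random set of unit 4D projectors is used here in place of the 600-cell vertices of the original quantization.

```python
# Sketch of 4D surface normals for a depth sequence z(x, y, t), treated as a
# surface in 4D with (unnormalized) normal (-dz/dx, -dz/dy, -dz/dt, 1); a
# histogram over a fixed set of 4D directions summarizes the sequence
# (after the description of HON4D [28]).
import numpy as np

def hon4d_like(depth_seq, n_projectors=120, rng=0):
    dz_dt, dz_dy, dz_dx = np.gradient(depth_seq.astype(float))
    normals = np.stack([-dz_dx, -dz_dy, -dz_dt,
                        np.ones_like(dz_dx)], axis=-1)
    normals /= np.linalg.norm(normals, axis=-1, keepdims=True)
    projectors = np.random.default_rng(rng).normal(size=(n_projectors, 4))
    projectors /= np.linalg.norm(projectors, axis=1, keepdims=True)
    votes = np.maximum(normals.reshape(-1, 4) @ projectors.T, 0)  # rectified
    hist = votes.sum(axis=0)
    return hist / (hist.sum() + 1e-9)
```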
Xia et al. [29] proposed an algorithm for extracting
local spatio-temporal interest points (STIPs) from depth
videos (DSTIPs) and described a local 3D depth cuboid
using the depth cuboid similarity feature (DCSF). The
DSTIPs deal with the noise in the depth videos, and DCSF
was presented as a descriptor for the 3D local cuboid in
depth videos. The authors claimed that the DSTIPs+DCSF
pipeline recognizes activities from the depth videos
without depending on skeleton joint information, motion
segmentation, tracking or de-noising procedures. The
experimental results reported for the MSR Daily Activity
3D dataset show that it is possible to recognize human
activities using the DSTIPs and DCSF with an accuracy of
88.20% by using 12 out of 16 activities. Four activities that
have less motion (i.e., sitting still, reading a book, writing
on paper, and using a laptop) were removed from the
experiments because most of the recognition errors come
from these relatively motionless activities. Furthermore,
the accuracy of the proposed system is highly dependent
on the noise level of the depth images.
Recent work by Song et al. [30] focuses on the use of
depth information to describe human actions in videos that
seem to be of essential concern and can greatly influence
the performance of human action recognition. The 3D
point cloud is exercised because it holds points in the 3D
real-world coordinate system to symbolize the human
body’s outer surface. An attribute named body surface
context (BSC) is presented to explain the sharing of
relative locations of neighbors for a reference point in the
point cloud. Tests using the Kinect Human Action Dataset
resulted in an accuracy of 91.32%. Using the BSC feature,
experiments on the MSR Action3D dataset yielded an
average accuracy of 90.36% and an accuracy of 77.8%
with the MSR Daily Activity 3D dataset. The experiments
showed that the BSC feature attains superior performance
and is robust to observation variations (i.e., translation and rotation).
2.2 Human Action Recognition Using
Skeleton Joints
Xia et al. [31] showed the advantages of using 3D
skeleton joints and represented 3D human postures using a
histogram of 3D joint locations (HOJ3D). In their
representation, 3D space is partitioned into bins using a
modified spherical coordinate system. That is, 12 manually
selected joints were used to build a compact representation
of the human posture. To make the representation more
robust, votes of 3D skeleton joints were cast into
neighboring bins using a Gaussian weight function. To
extract most dominant and discriminative features, LDA
was applied to reduce the dimensionality. These discriminative features were then clustered into a fixed number
of posture vocabularies which represent the prototypical
poses of actions. Finally, these visual words were trained
and tested using a discrete HMM. According to reported
experimental results on the MSR Action3D dataset, and by
using their own proposed dataset, their proposed method
has the salient advantage of using 3D skeleton joints of the
human posture. However, the drawback to their method is
its reliance on the hip joint, which might potentially
compromise recognition accuracy due to the noise
embedded in the estimation of hip joint location.
Fig. 4. Steps for computing eigen-joint features proposed by Yang et al. [32].
In a similar way, Yang et al. [32] illustrated that
skeleton joints are computationally inexpensive, more
compact, and distinctive compared to depth maps. Based
on that, the authors proposed an eigen joints–based action
recognition system, which extracts three different kinds of
features using skeleton joints. These features include
posture (Fcc), motion features (Fcp) that encode spatial
and temporal characteristics of skeleton joints, and offset
features (Fci), which calculate the difference between a
current pose and the initial one. PCA is then applied to
these joint differences to obtain eigen joints by reducing
redundancy and noise, and the Naive-Bayes-Nearest-Neighbor (NBNN) classifier [33] is used to recognize
multiple action categories. Fig. 4 shows the process of
extracting eigen joints. Also, they further explore the
number of frames that are sufficient to recognize the action
for their system. Experimental results on the MSR
Action3D dataset show that a short sequence of 15-20
frames is sufficient for action recognition. The drawback
to this approach is, while calculating the offset feature, the
authors assume that the initial skeleton pose is neutral,
which is not always correct.
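The following sketch illustrates the three joint-difference feature sets described above; the joint ordering, the way the three sets are aligned in time, and the PCA dimensionality are assumptions.

```python
# Sketch of eigen-joint style features (after Yang et al. [32]): within-frame
# pairwise joint differences (Fcc), differences to the previous frame (Fcp),
# and differences to the initial frame (Fci), followed by PCA.
import numpy as np
from itertools import combinations
from sklearn.decomposition import PCA

def eigen_joint_features(joints, n_components=32):
    # joints: (T, J, 3) array of 3D joint positions over T frames.
    T, J, _ = joints.shape
    pairs = list(combinations(range(J), 2))
    fcc = np.stack([joints[t, i] - joints[t, j]
                    for t in range(T) for i, j in pairs]).reshape(T, -1)
    fcp = (joints[1:] - joints[:-1]).reshape(T - 1, -1)
    fci = (joints - joints[0]).reshape(T, -1)
    feats = np.concatenate([fcc[1:], fcp, fci[1:]], axis=1)  # align lengths
    n = min(n_components, *feats.shape)
    return PCA(n_components=n).fit_transform(feats)
```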
Using the advantages of 3D joints, Evangelidis et al. [34]
proposed a compact but effective local skeleton descriptor
that creates a pose representation invariant to any
similarity transformation and is, hence, view-invariant. The
new skeletal feature, which is called skeletal quad [34],
locally encodes the relation of joint quadruples so that 3D
similarity invariance is assured. Experimental results of the
proposed method verify its state-of-the-art performance in
human action recognition using 3D joint positions. The
proposed action recognition method was tested on broadly
used datasets, such as the MSR Action3D dataset and
HDM05. Experimental results with MSR Action3D using
skeleton joints showed an average accuracy of 89.86%,
and showed 93.89% accuracy with HDM05.
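A minimal sketch of the similarity-invariant quadruple coding described above is shown below; it assumes a Rodrigues-style rotation and ignores the degenerate case of opposite directions, so it is an illustration rather than the exact construction of [34].

```python
# Sketch of a skeletal quad code: four joints are mapped by a similarity
# transform that sends the first joint to the origin and the second to
# (1, 1, 1); the transformed third and fourth joints give a 6-D descriptor.
import numpy as np

def skeletal_quad(j1, j2, j3, j4):
    # j1..j4: 3D joint positions as numpy arrays of shape (3,).
    axis = j2 - j1
    scale = np.linalg.norm(axis)
    if scale < 1e-9:
        return np.zeros(6)
    target = np.ones(3) / np.sqrt(3.0)                 # direction of (1, 1, 1)
    v = np.cross(axis / scale, target)
    c = float(np.dot(axis / scale, target))
    vx = np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])
    R = np.eye(3) + vx + vx @ vx / (1.0 + c + 1e-9)    # Rodrigues formula
    s = np.sqrt(3.0) / scale                           # sends j2 to (1, 1, 1)
    return np.concatenate([s * R @ (j3 - j1), s * R @ (j4 - j1)])
```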
2.3 Human Action Recognition Using
Hybrid Methods
The work done by Wang et al. [14] utilizes the
advantages of both skeleton joints and point cloud
information. Many actions differ mainly in the objects involved
in the interaction, and in such cases using only
skeleton information is not sufficient. Moreover, to capture
the intra-class variance via occupancy information, the
authors proposed a novel actionlet ensemble model. An
important observation made by them in terms of skeleton
joints is that the pairwise relative positions of joints are
more discriminative than the joint positions themselves.
Interaction between human and environmental objects is
characterized by a local occupancy pattern (LOP) at each
joint. Furthermore, the proposed method is evaluated using
the CMU MoCap dataset, the MSR Action3D dataset,
and a new dataset called the MSR Daily Activity 3D
dataset. Experimental results showed that their method has
superior performance compared to previous methods. The
drawback of their system is that it relies on skeleton
localization, which is unreliable for postures with self-occlusion.
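The sketch below illustrates a local occupancy pattern around one joint as described above; the grid extent, resolution, and sigmoid scale are assumptions.

```python
# Sketch of a local occupancy pattern (LOP) around a joint, after the
# description of the actionlet ensemble [14]: depth points falling into a
# small 3D grid centred at the joint are counted per cell and passed
# through a sigmoid.
import numpy as np

def local_occupancy_pattern(points, joint, extent=0.3, grid=4, beta=5.0):
    # points: (N, 3) 3D point cloud of the frame; joint: (3,) joint position.
    local = (points - joint) / extent           # normalize to roughly [-1, 1]
    inside = np.all(np.abs(local) < 1.0, axis=1)
    cells = ((local[inside] + 1.0) / 2.0 * grid).astype(int)
    cells = np.clip(cells, 0, grid - 1)
    hist = np.zeros((grid, grid, grid))
    np.add.at(hist, tuple(cells.T), 1)
    return (1.0 / (1.0 + np.exp(-beta * hist / (hist.max() + 1e-9)))).ravel()
```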
Lei et al. [35] combined depth and color features to
recognize kitchen activities. Their method successfully
demonstrated tracking the interactions between hands and
objects during kitchen activities, such as mixing flour with
water and chopping vegetables. For object recognition, the
reported system uses a gradient kernel descriptor on both
color and depth data. The global features are extracted by
applying PCA on the gradient of the hand trajectories,
which are extracted by tracking the skin characteristics,
and local features are defined using a bag-of-words for
snippets of trajectory gradients. All the features are then
fed into an SVM classifier for training. The overall
reported accuracy is 82% for combined action and object
recognition. This work shows the initial concept of
recognizing the object and actions in a real-world kitchen
environment. However, using such a system in real time
requires a large dataset to train the system.
Recently, Althloothi et al. [36] proposed a human
activity recognition system using multi-features and
multiple kernel learning (MKL) [37]. In order to recognize
human actions from a sequence of RGB-D data, their
method utilizes surface representation and a kinematics
structure of the human body. It extracts shape features
from a depth map using a spherical harmonics representation that describes the 3D silhouette structure, whereas the
motion features are extracted using 3D joints that describe
the movement of the human body. The authors believe that
segments such as forearms and the shin provide sufficient
and compact information to recognize human activities.
Therefore, each distal limb segment is described by
orientation and translation with respect to the initial frame
to create temporal features. Then, both feature sets are
combined using an MKL technique to produce an
optimally combined kernel matrix within the SVM for
activity classification. The drawback to their system is that
the shape features extracted using spherical harmonics are
large in size. Also, at the beginning and at the end of each
depth sequence in the MSR Action3D and MSR Daily
Activity 3D datasets, the subject is in a stand-still position
with small body movements. However, while generating
the motion characteristics of an action, these small movements at the beginning and at the end generate large pixel
values, which ultimately contribute to large reconstruction
error.
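As a concrete illustration of kernel-level fusion of this kind, the short sketch below combines a shape kernel and a motion kernel into one Gram matrix for an SVM; the fixed weights and RBF kernels are assumptions, whereas a real MKL solver such as those surveyed in [37] would learn the weights.

```python
# Sketch of fusing two feature sets through a weighted sum of kernels and an
# SVM with a precomputed kernel, in the spirit of the MKL fusion in [36, 37].
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

def combined_kernel(shape_feats, motion_feats, weights=(0.5, 0.5), gamma=0.1):
    k_shape = rbf_kernel(shape_feats, gamma=gamma)
    k_motion = rbf_kernel(motion_feats, gamma=gamma)
    return weights[0] * k_shape + weights[1] * k_motion

def train_fused_svm(shape_feats, motion_feats, labels):
    K = combined_kernel(shape_feats, motion_feats)
    return SVC(kernel="precomputed").fit(K, labels)
```

For prediction, the same weighted kernel must be evaluated between the test and training samples before calling the trained classifier.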
3. Summary
The advantages and disadvantages of the above
reviewed methods, based on depth maps, skeleton joints,
and hybrid approaches, are presented in Table 2.
Table 2. Advantages and disadvantages of the existing methods.

3D sampled points [13]
General comments: Using depth silhouettes, 3D points have been extracted on the contour of the depth map.
Pros: They extend RGB approaches to extract contour points on depth images, and the method can recognize an action performed by single or multiple parts of the human body without tracking the skeleton joints.
Cons: Due to noise and occlusion, contours of multiple views are not reliable, and the current sampling scheme is view-dependent.

STOP: Space–Time Occupancy Patterns [18]
General comments: Space–time occupancy patterns are presented by dividing the depth sequence into a 4D space–time grid. All the cells in the grid have the same size.
Pros: Spatial and temporal contextual information has been used to recognize the actions, which is robust against noise and occlusion.
Cons: There is no method defined to set the parameter for dividing the sequence into cells.

Random Occupancy Patterns (ROP) [19]
General comments: ROP features are extracted from randomly sampled 4D sub-volumes with different sizes and at different locations. Then, all the points in the sub-volumes are accumulated and normalized with a sigmoid function.
Pros: The proposed feature extraction method is robust to noise and less sensitive to occlusion.
Cons: Feature patterns are highly complex and need more time during processing.

Motion maps [20]
General comments: Motion maps provide shape as well as motion information, and HOG is used to extract the local appearance and shape of the motion maps.
Pros: A computationally efficient depth map-based action recognition system that extracts additional shape and motion information.
Cons: Motion maps do not provide directional velocity information between the frames.

R Transform [22]
General comments: The R transform has been used to extract features from depth silhouettes, and the proposed method is compared with PCA and ICA.
Pros: The R transform-based, translation- and scale-invariant feature extraction method can be used for human activity recognition systems.
Cons: The R transform-based feature extraction method is not view-invariant.

HON4D [28]
General comments: Captures the histogram distribution of the surface normal orientation in the 4D volume of time, depth, and spatial coordinates.
Pros: The proposed feature extraction method is robust against noise and occlusion and more discriminative than other 4D occupancy methods. Also, it captures the distribution of changing shape and motion cues together.
Cons: The method can only roughly characterize the local spatial shape around each joint. Differential operation on a depth image can enhance noise.

DCSF [29]
General comments: Extracting STIPs from depth videos and describing a local 3D DCSF around interest points can be efficiently used to recognize actions.
Pros: Uses DSTIPs and DCSF to recognize the activities from depth videos without depending on skeleton joints, motion segmentation, tracking, or de-noising procedures.
Cons: It is difficult to analyze the method for low-motion activities, and most of the recognition errors come from those activities.

Body surface context (BSC) [30]
General comments: 3D point clouds have been used to represent the 3D surface of the body, which contains rich information to recognize human actions.
Pros: 3D point clouds of the body's surface can avoid perspective distortion in depth images.
Cons: It is based on different combinations of features for each dataset, but it is not feasible for an automatic system to select the combination for high accuracy.

HOJ3D [31]
General comments: Twelve manually selected skeleton joints are converted to a spherical coordinate system to make a compact representation of the human posture.
Pros: Skeleton joints are more informative and can achieve high accuracy with a smaller number of joints.
Cons: Relying only on the hip joint might potentially compromise recognition accuracy.

Eigen joints [32]
General comments: An action recognition system that extracts the spatiotemporal change between the joints. Then, PCA is used to obtain eigen joints by reducing redundancy and noise.
Pros: A skeleton joint-based feature extraction method that extracts features in both spatial and temporal domains. It is more accurate and informative than trajectory-based methods.
Cons: Offset feature computation depends on the assumption that the initial skeleton pose is neutral, which is not always correct.

Quadruples [34]
General comments: A skeleton joint-based feature extraction method called skeletal quad ensures 3D similarity invariance of joint quadruples by local encoding using a Fisher kernel.
Pros: A view-invariant descriptor using joint quadruples encodes Fisher kernel representations.
Cons: It is not a good choice to completely rely on skeleton joints, because these 3D joints are noisy and fail when there are occlusions.

Hybrid method (3D point cloud + skeleton) [14]
General comments: Local occupancy pattern (LOP) features are calculated from depth maps around the joints' locations.
Pros: A highly discriminative and translation-invariant feature extraction method that captures relations between the human body parts and the environmental objects in the interaction. Also, it represents the temporal structure of an individual joint.
Cons: Heavily relying on skeleton localization becomes unreliable for postures with self-occlusion.

Kitchen activities (depth + RGB) [35]
General comments: Fine-grained kitchen activities are recognized using depth and color cues.
Pros: An efficient feature extraction method that takes advantage of both RGB and depth images to recognize objects and fine-grained kitchen activities.
Cons: Requires a large dataset to train the system.

Multi-feature (3D point cloud and skeleton joints) [36]
General comments: This human activity recognition system combines spherical harmonics features from depth maps and motion features using 3D joints.
Pros: A view-invariant feature extraction method based on shape representation and the kinematics structure of the human body. Both features are fused using MKL to produce an optimal combined kernel matrix.
Cons: Shape features are large in size, which may be unreliable for postures with self-occlusion, and the motion features are extracted on the assumption that the initial pose is in a neutral state, which is not always the case.
Table 3. Summary of feature selection, classification and recognition methods
Paper
Extracted Features
[13]
3D points at the contour of
a depth map
[18]
Depth values
Feature Selection/
Dimension reduction
Clustering
Classification
Action graph
PCA
K-means
HMM
LDA
Elastic net regularized
classifier
SVM
PCA, LDA
LBG
HMM
[19]
3D point cloud
[20]
Histogram of gradients
[22]
Depth values
[28]
Histogram of surface normal
[29]
Histogram of depth pixels
PCA
K-means
DS-SRC
[30]
3D point cloud
PCA
K-means
SVM
[31]
Histogram of 3D joints in
spherical coordinates
LDA
K-means
HMM
[32]
Skeleton joints
PCA
[34]
Gradient values
[14]
Low-frequency Fourier
coefficients
SVM
SVM
NBNN
SVM
Actionlet ensemble
SVM
[35]
Gradient values
SVM
[36]
3D point cloud and
skeleton joints
SVM
Table 4. Recognition accuracies of reviewed action recognition systems on benchmark datasets.

Paper   MSR Action3D   MSR Daily Activity 3D   UCF Kinect dataset   Kitchen scene action
[13]    74.7%          -                       -                    -
[18]    84.80%         -                       -                    -
[19]    86.50%         -                       -                    -
[20]    91.63%         -                       -                    -
[22]    -              -                       -                    -
[28]    88.89%         80%                     -                    -
[29]    89.3%          83.6%                   -                    -
[30]    90.36%         77.8%                   -                    -
[31]    78.97%         -                       -                    -
[32]    82.33%         -                       97.1%                -
[34]    89.86%         -                       -                    -
[14]    88.2%          85.75%                  -                    -
[35]    -              -                       -                    82%
[36]    79.7%          93.1%                   -                    -
Although all the above methods are capable of dealing with the
actions and activities of daily life, there are also drawbacks
and limitations to using depth map–based, skeleton joint–
based and hybrid methods for action recognition systems.
Depth maps fail to recognize human actions when fine-grained motion is required, and extracting 3D points at
the contours may incur a loss of inner information from the
depth silhouettes. Furthermore, shape-based features do
not provide any information for calculating the directional
velocity of the action between the frames, and it is an
important parameter for differentiating the actions. Hence,
depth-based features are neither very efficient for, nor
sufficient for, certain applications, such as entertainment,
human–computer interaction, and smart healthcare systems.
The 3D skeleton joints estimated using the depth maps are
often noisy and may have large errors when there are
occlusions (e.g., legs or hands crossing over each other).
Moreover, motion information extracted using 3D joints
alone is not sufficient to differentiate similar activities,
such as drinking water and eating. Therefore, there is a
need to include extra information in the feature vector to
improve classification performance. Thus, a hybrid method
can be helpful by taking full advantage of using depth
maps and 3D skeleton joints to enhance the classification
performance of human action recognition.
A summary of all the feature-selection, clustering and
recognition methods used in the above reviewed papers is
in Table 3. Most of the studied action recognition
systems select dominant and discriminative features using
PCA or LDA; these features are then represented by a codebook
generated with a k-means algorithm. Finally,
after training, the system recognizes the learned actions
via the trained classifier, most commonly an SVM.
The recognition accuracy of the reviewed methods on
the datasets mentioned in Table 1 is summarized in Table
4. The assessment method adopted by the mentioned
works for the MSR Action3D dataset is a cross-subject test.
This method was originally proposed by Li et al. [13] by
dividing the 20 actions into three subsets, with each subset
containing eight actions. For the MSR Daily Activity 3D
dataset, all the authors verified the performance of their
method using a leave-one-subject-out (LOSO) test. For the
UCF Kinect dataset, 70% of the actions were used for
training and 30% for testing. Jalal et al. [22] proposed their
own human activity dataset and evaluated the performance
of their proposed method using 30% video clips for
training and 70% for testing.
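A minimal sketch of such a cross-subject split is given below; the subject lists are only an example of the convention (half of the subjects for training, the rest for testing) and not the exact partition used by any particular paper.

```python
# Sketch of a cross-subject evaluation split: clips from one group of
# subjects train the model and clips from the remaining subjects test it,
# so no subject appears in both sets.
import numpy as np

def cross_subject_split(subject_ids, train_subjects):
    subject_ids = np.asarray(subject_ids)
    train_mask = np.isin(subject_ids, train_subjects)
    return np.where(train_mask)[0], np.where(~train_mask)[0]

# Example: subjects 1, 3, 5, 7, 9 for training; 2, 4, 6, 8, 10 for testing.
train_idx, test_idx = cross_subject_split(
    subject_ids=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10] * 3,
    train_subjects=[1, 3, 5, 7, 9])
```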
4. Conclusion
Over the last few years, there has been a lot of work by
researchers in the field of human action recognition using
the low-cost depth sensor. The success of these works is
demonstrated in entertainment systems that estimate the
body poses and recognize facial and hand gestures, by
smart healthcare systems to care for patients and monitor
their activities, and in the security systems that recognize
suspicious activities and create an alert to prevent
dangerous situations. Different databases have been used
by the authors to test the performance of their algorithms.
For the MSR Action3D dataset, Yang et al. [20] achieved
91.63% accuracy, whereas for the MSR Daily Activity 3D
dataset, Althloothi et al. [36] achieved 93.1% accuracy.
Moreover, Yang et al. [32] achieved 97.1% accuracy for
the UCF Kinect dataset. Currently, many human action recognition
systems focus only on extracting boundary information from depth
silhouettes, and relying only on skeleton information may
not be feasible because the skeleton joints are not always
accurate. Furthermore, to overcome the limitations and
drawbacks of the current human action recognition systems, it is necessary to extract valuable information from
inside the depth silhouettes. Also, it is necessary to use the
joint points with the depth silhouettes for an accurate and
stable human action recognition system.
Acknowledgement
This work was supported by Basic Science Research
Program through the National Research Foundation of
Korea (NRF) funded by the Ministry of Education (NRF-
2013R1A1A2005024).
References
[1] A. Veeraraghavan et al., "Matching shape sequences in video with applications in human movement analysis," Pattern Analysis and Machine Intelligence, IEEE Transactions on, pp. 1896-1909, Jun. 2004.
[2] W. Lin et al., "Human activity recognition for video surveillance," in Circuits and Systems, IEEE International Symposium on, pp. 2737-2740, May 2008.
[3] H. S. Mojidra et al., "A Literature Survey on Human Activity Recognition via Hidden Markov Model," IJCA Proc. on International Conference on Recent Trends in Information Technology and Computer Science 2012 (ICRTITCS), pp. 1-5, Feb. 2013.
[4] R. Gupta et al., "Human activities recognition using depth images," in Proc. of the 21st ACM International Conference on Multimedia, pp. 283-292, Oct. 2013.
[5] Z. Zhang et al., "Microsoft Kinect sensor and its effect," MultiMedia, IEEE, Vol. 19, No. 2, pp. 4-10, Feb. 2012.
[6] A. A. Chaaraoui, "Vision-based Recognition of Human Behaviour for Intelligent Environments," Director: Francisco Florez Revuelta, Jan. 2014.
[7] M. Valera et al., "Intelligent distributed surveillance systems: a review," Vision, Image and Signal Processing, IEE Proceedings, Vol. 152, No. 2, pp. 192-204, Apr. 2005.
[8] J. W. Hsieh et al., "Video-based human movement analysis and its application to surveillance systems," Multimedia, IEEE Transactions on, Vol. 10, No. 3, pp. 372-384, Apr. 2008.
[9] V. Bloom et al., "G3D: A gaming action dataset and real time action recognition evaluation framework," Computer Vision and Pattern Recognition Workshops (CVPRW), 2012 IEEE Computer Society Conference on, pp. 7-12, Jul. 2012.
[10] A. Fossati et al., "Consumer Depth Cameras for Computer Vision: Research Topics and Applications," Springer Science & Business Media.
[11] M. Parajuli et al., "Senior health monitoring using Kinect," Communications and Electronics (ICCE), Fourth International Conference on, pp. 309-312, Aug. 2012.
[12] C. Rougier et al., "Fall detection from depth map video sequences," Toward Useful Services for Elderly and People with Disabilities, Vol. 6719, pp. 121-128, Jun. 2011.
[13] W. Li et al., "Action recognition based on a bag of 3D points," Computer Vision and Pattern Recognition Workshops (CVPRW), IEEE Computer Society Conference on, pp. 9-14, Jun. 2010.
[14] J. Wang et al., "Mining actionlet ensemble for action recognition with depth cameras," Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, pp. 1290-1297, Jun. 2012.
[15] L. Xia et al., "View invariant human action recognition using histograms of 3D joints," Computer Vision and Pattern Recognition Workshops (CVPRW), IEEE Computer Society Conference on, pp. 20-27, Jun. 2012.
[16] C. Ellis et al., "Exploring the trade-off between accuracy and observational latency in action recognition," International Journal of Computer Vision, Vol. 101, No. 3, pp. 420-436, Aug. 2012.
[17] A. Shimada et al., "Kitchen scene context based gesture recognition: A contest in ICPR2012," Advances in Depth Image Analysis and Applications, Vol. 7854, pp. 168-185, Nov. 2011.
[18] A. W. Vieira et al., "STOP: Space-time occupancy patterns for 3D action recognition from depth map sequences," Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, Vol. 7441, pp. 252-259, Sep. 2012.
[19] J. Wang et al., "Robust 3D action recognition with random occupancy patterns," 12th European Conference on Computer Vision, pp. 872-885, Oct. 2012.
[20] X. Yang et al., "Recognizing actions using depth motion maps-based histograms of oriented gradients," in Proc. of the 20th ACM International Conference on Multimedia, pp. 1057-1060, Nov. 2012.
[21] N. Dalal et al., "Histograms of oriented gradients for human detection," Computer Vision and Pattern Recognition, IEEE Computer Society Conference on, Vol. 1, pp. 886-893, Jun. 2005.
[22] A. Jalal et al., "Depth video-based human activity recognition system using translation and scaling invariant features for life logging at smart home," Consumer Electronics, IEEE Transactions on, Vol. 58, No. 3, pp. 863-871, Aug. 2012.
[23] Y. Wang, K. Huang, and T. Tan, "Human activity recognition based on R transform," Computer Vision and Pattern Recognition, IEEE Conference on, pp. 1-8, Jun. 2007.
[24] M. Z. Uddin et al., "Independent shape component-based human activity recognition via Hidden Markov Model," Applied Intelligence, Vol. 33, No. 2, pp. 193-206, Jan. 2010.
[25] J. Han et al., "Human activity recognition in thermal infrared imagery," Computer Vision and Pattern Recognition, IEEE Computer Society Conference on, p. 17, Jun. 2005.
[26] H. Othman et al., "A separable low complexity 2D HMM with application to face recognition," Pattern Analysis and Machine Intelligence, IEEE Transactions on, Vol. 25, No. 10, pp. 1229-1238, Oct. 2003.
[27] Y. Linde et al., "An algorithm for vector quantizer design," Communications, IEEE Transactions on, Vol. 28, No. 1, pp. 84-95, Jan. 1980.
[28] O. Oreifej and Z. Liu, "HON4D: Histogram of oriented 4D normals for activity recognition from depth sequences," Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, 2013.
[29] L. Xia et al., "Spatio-temporal depth cuboid similarity feature for activity recognition using depth camera," Computer Vision and Pattern Recognition, IEEE Conference on, pp. 2834-2841, Jun. 2013.
[30] Y. Song et al., "Body Surface Context: A New Robust Feature for Action Recognition from Depth Videos," Circuits and Systems for Video Technology, IEEE Transactions on, Vol. 24, No. 6, pp. 952-964, Jan. 2014.
[31] L. Xia et al., "View invariant human action recognition using histograms of 3D joints," Computer Vision and Pattern Recognition, IEEE Computer Society Conference on, pp. 20-27, Jun. 2012.
[32] X. Yang et al., "Effective 3D action recognition using eigenjoints," Journal of Visual Communication and Image Representation, Vol. 25, No. 1, pp. 2-11, Jan. 2014.
[33] O. Boiman et al., "In defense of nearest-neighbor based image classification," Computer Vision and Pattern Recognition, IEEE Conference on, pp. 1-8, Jun. 2008.
[34] G. Evangelidis et al., "Skeletal quads: Human action recognition using joint quadruples," Pattern Recognition (ICPR), 22nd International Conference on, pp. 4513-4518, Aug. 2014.
[35] J. Lei et al., "Fine-grained kitchen activity recognition using RGB-D," in Proc. of the ACM Conference on Ubiquitous Computing, pp. 208-211, Sep. 2012.
[36] S. Althloothi et al., "Human activity recognition using multi-features and multiple kernel learning," Pattern Recognition, Vol. 47, No. 5, pp. 1800-1812, May 2014.
[37] M. Gönen and E. Alpaydin, "Multiple kernel learning algorithms," The Journal of Machine Learning Research, Vol. 12, pp. 2211-2268, Jan. 2011.
Adnan Farooq is a Ph.D. student in the Department of Electrical and Electronics Engineering at Dongguk University, Seoul, South Korea. He received his B.S. degree in Computer Engineering from the COMSATS Institute of Science and Technology, Abbottabad, Pakistan, and his M.S. degree in Biomedical Engineering from Kyung Hee University, Republic of Korea. His research interests include image processing and computer vision.
Chee Sun Won received the B.S.
degree in electronics engineering from
Korea University, Seoul, in 1982, and
the M.S. and Ph.D. degrees in electrical and computer engineering from
the University of Massachusetts, Amherst,
in 1986 and 1990, respectively. From
1989 to 1992, he was a Senior
Engineer with GoldStar Co., Ltd. (LG Electronics), Seoul,
Korea. In 1992, he joined Dongguk University, Seoul,
Korea, where he is currently a Professor in the Division of
Electronics and Electrical Engineering. He was a Visiting
Professor at Stanford University, Stanford, CA, and at
McMaster University, Hamilton, ON, Canada. His
research interests include MRF image modeling, image
segmentation, robot vision, image retrieval, image/video
compression, video condensation, stereoscopic 3D video
signal processing, and image watermarking.
Copyrights © 2015 The Institute of Electronics and Information Engineers
http://dx.doi.org/10.5573/IEIESPC.2015.4.4.291
IEIE Transactions on Smart Processing and Computing
Design of High-Speed Comparators for High-Speed
Automatic Test Equipment
Byunghun Yoon and Shin-Il Lim*
Department of Electronics Engineering, Seokyeong University / Seoul, South Korea {bhyoon, silim}@skuniv.ac.kr
* Corresponding Author: Shin-Il Lim
Received June 20, 2015; Revised July 15, 2015; Accepted August 24, 2015; Published August 31, 2015
* Short Paper
Abstract: This paper describes the design of a high-speed comparator for high-speed automatic test
equipment (ATE). The normal comparator block, which compares the detected signal from the
device under test (DUT) to the reference signal from an internal digital-to-analog converter (DAC),
is composed of a rail-to-rail first pre-amplifier, a hysteresis amplifier, and a third pre-amplifier and
latch for high-speed operation. The proposed continuous comparator handles high-frequency
signals up to 800MHz and a wide range of input signals (0~5V). Also, to compare the differences
of both common signals and differential signals between two DUTs, the proposed differential mode
comparator exploits one differential difference amplifier (DDA) as a pre-amplifier in the
comparator, while a conventional differential comparator uses three op-amps as a pre-amplifier.
The chip was implemented with 0.18μm Bipolar CMOS DMOS (BCDMOS) technology, can
compare signal differences of 5mV, and operates in a frequency range up to 800MHz. The chip
area is 0.514mm2.
Keywords: ATE, High-speed continuous comparator, Hysteresis, Differential difference amplifier
1. Introduction
For testing and characterizing an application processor (AP) and a system on chip (SoC), the pin card in high-speed automatic test equipment (ATE) includes a driver integrated circuit (IC) to force the signal to the device under test (DUT) and to detect the signal from the DUT. The driver IC contains a parametric measurement unit (PMU), a digital-to-analog converter (DAC), comparators, an active load, and serial peripheral interface (SPI) memory registers [1], [2]. This paper describes new techniques for designing a high-speed comparator for this driver IC in ATE with Bipolar CMOS DMOS (BCDMOS) technology.
The comparator compares the detected signal values from the DUTs to the DAC outputs, which serve as the reference values. The comparator must be able to handle a wide input signal range (0V to 5V) and operate with an input signal up to 800MHz. Also, it should have sufficient accuracy to compare a signal difference of 5mV. Moreover, considering unexpected noise, the comparator must have a hysteresis function. The implementation of a high-speed comparator with 0.18μm BCDMOS technology is challenging because BCDMOS devices have large parasitic capacitances and resistances.
2. Architecture of the Comparator
A block diagram of the high-speed comparator is
depicted in Fig. 1. Because this comparator is continuous, it does not need a clock signal.
Fig. 1. Block diagram of the comparator.
Fig. 2. Block diagram of continuous comparator.
Fig. 3. First stage rail-to-rail pre-amplifier.
This comparator has two operating modes. The first mode is normal comparator mode (NCM), which compares the detected values from the DUT (DUT0) to the DAC reference values (DACVOH, DACVOL).
The second mode is differential comparator mode
(DCM), which also compares the difference of both
common mode signals and differential signals from the
two DUTs (DUT0, DUT1) to the output of the DAC
reference signals (DACVOH, DACVOL). The comparator decisions are transferred to the control blocks through the high-speed driver circuits.
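As a rough illustration of the two operating modes, the following Python sketch models idealized NCM and DCM decisions. The function names, the direct threshold comparisons, and the example voltages are illustrative assumptions, not part of the actual driver IC.

def ncm_decision(v_dut0, dac_voh, dac_vol):
    # Normal comparator mode: compare the detected DUT0 value against the
    # DAC reference levels DACVOH and DACVOL.
    return {"above_voh": v_dut0 > dac_voh, "below_vol": v_dut0 < dac_vol}

def dcm_decision(v_dut0, v_dut1, dac_voh, dac_vol):
    # Differential comparator mode: compare the DUT0-DUT1 difference, which
    # carries both the common mode and the differential differences, against
    # the DAC reference levels.
    diff = v_dut0 - v_dut1
    return {"above_voh": diff > dac_voh, "below_vol": diff < dac_vol}

print(ncm_decision(2.505, dac_voh=2.500, dac_vol=1.000))
print(dcm_decision(2.505, 2.500, dac_voh=0.004, dac_vol=-0.004))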
3. The Proposed Comparator Design
3.1 High-Speed Comparator in Normal Mode
Fig. 2 shows the block diagram of the proposed high-speed continuous comparator. For high accuracy and high-speed operation with BCDMOS technology, three cascaded pre-amplifier stages and a latch are used.
The first-stage pre-amplifier is a rail-to-rail amplifier that handles the wide input range of 0V to 5V. The second-stage amplifier has hysteresis circuits controlled by 2-bit SPI signals. Finally, the high-speed third pre-amplifier and latch are designed to meet the high-speed and high-accuracy requirements.
Fig. 3 shows the first-stage rail-to-rail pre-amplifier that accepts the wide range of input signals (0V to 5V). Because the active load devices in BCDMOS technology have large parasitic capacitances and large output resistances, small passive resistors are used instead of active loads for high-bandwidth operation.
The second pre-amplifier with the hysteresis function is shown in Fig. 4 [3]. This hysteresis circuit provides hysteresis voltages in a continuous comparator to overcome unexpected noise. Hysteresis is achieved through the difference in device ratios between the diode-connected active transistors and the switch-controlled active transistors. The 2-bit signals from the SPI register control the S1, S2, S3, and S4 switches and achieve four hysteresis voltage steps from a minimum of 0mV to a maximum of 96mV. The relation between the hysteresis voltage and the device size of the switch-connected PMOSFET (PMOS) transistors is shown in Eq. (1).
Fig. 4. Second pre-amplifier with hysteresis function.
Vhys = (1 − √M) · √( 2I / ((1 + M) μnCox (W/L)) )    (1)
M is the device ratio between the diode-connected
active transistors and switch-controlled active transistors. I
is the tail current of the second pre-amplifier.
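For a quick numeric sense of Eq. (1), the short Python sketch below evaluates the hysteresis magnitude for a few device ratios M. The tail current I, the process parameter μnCox, and the W/L value are placeholder assumptions rather than the actual design values, so the printed voltages only show the trend from 0mV upward, not the designed 0mV to 96mV steps.

from math import sqrt

I_TAIL = 100e-6    # assumed tail current of the second pre-amplifier, in A
UN_COX = 200e-6    # assumed process transconductance parameter, in A/V^2
W_OVER_L = 20.0    # assumed W/L of the switch-controlled PMOS devices

def vhys(m):
    # Magnitude of Eq. (1): |1 - sqrt(M)| * sqrt(2I / ((1 + M) * unCox * (W/L))).
    # M = 1 (equal device sizes) gives zero hysteresis, matching code 00.
    return abs(1.0 - sqrt(m)) * sqrt(2.0 * I_TAIL / ((1.0 + m) * UN_COX * W_OVER_L))

for m in (1.0, 1.5, 2.0, 3.0):   # hypothetical ratios selected by the 2-bit SPI code
    print(f"M = {m:.1f}: Vhys = {vhys(m) * 1e3:.1f} mV")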
Figs. 5 and 6 show the third pre-amplifier and the latch, respectively. The fully differential third pre-amplifier uses simple common mode feedback (CMFB) with two resistors to stabilize the output common mode voltage. The latch is composed of two inverters that are cross-coupled to form positive feedback. Inverter chains are added to resolve the output to a clean high or low level with minimum delay.
Fig. 5. Third pre-amplifier.
Fig. 6. Latch with inverter chain.
3.2 Comparator in Differential Mode
To implement DCM, a differential difference amplifier (DDA) [4] is used as a pre-amplifier, as shown in Fig. 7. A conventional DCM design needs three amplifiers: one for differential signal generation, another for common mode signal generation, and a final one for summing these two signals. With a DDA, a single DDA pre-amplifier is enough to process both the difference in common mode signals and the difference in differential mode signals. Since the proposed DCM uses only one op-amp, it does not suffer from the mismatches, accumulated offsets, and noise of multiple op-amps. Moreover, the proposed DCM consumes less power and requires less hardware area, and hence offers a low-cost implementation.
Fig. 7. Block Diagram of the DDA.
Fig. 8. The Circuits of DDA.
Fig. 8 shows a transistor-level schematic diagram of
the DDA. For high gain, a folded cascode structure was
used. This folded cascode structure inherently guarantees
stable high-frequency operation and does not require additional compensation capacitors. The input stage is designed
for rail-to-rail operation by paralleling the PMOS input
pair and the NMOSFET (NMOS) input pair.
The output voltage, VOUT, includes both the difference
of common mode signals and the difference of differential
signals from DUT0 (VIP) and DUT1 (VIN), as expressed
in Eq. (2):
VOUT = VIP − VIN + VCM = (Vp,di − Vn,di) + (Vp,cm − Vn,cm) + VCM    (2)
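The Python sketch below is a minimal behavioral check of Eq. (2), assuming an ideal DDA and an arbitrary output common mode level VCM; the signal values and decomposition are illustrative only.

def dda_output(vp_di, vn_di, vp_cm, vn_cm, vcm=2.5):
    # Ideal DDA per Eq. (2): the output carries the difference of the
    # differential parts plus the difference of the common mode parts,
    # shifted to the output common mode level VCM.
    return (vp_di - vn_di) + (vp_cm - vn_cm) + vcm

# Decompose the DUT0 (VIP) and DUT1 (VIN) signals into common mode and
# differential parts (example values: a 5 mV difference at a 2.5 V level).
vip, vin = 2.505, 2.500
vp_cm, vn_cm = 2.5, 2.5
vp_di, vn_di = vip - vp_cm, vin - vn_cm

print(dda_output(vp_di, vn_di, vp_cm, vn_cm))   # 2.505 V, i.e. 5 mV above VCM
print(vip - vin + 2.5)                          # same result from VOUT = VIP - VIN + VCM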
4. Simulated and Measured Results
The proposed comparator was implemented with 0.18μm BCDMOS technology. The layout size of the proposed comparator is 620μm × 830μm, as shown in Fig. 9. It contains four continuous comparators without clock signals, a DDA, and two output stages. All the comparators operate from a 5V supply. The first rail-to-rail pre-amplifier has a gain of 10dB and a unity-gain frequency of 2.5GHz, while the third pre-amplifier has a gain of 14dB and a unity-gain frequency of 1.53GHz.
Fig. 10 shows the simulation results of normal mode
continuous comparators. The proposed comparator can
compare a difference of 5mV at a frequency of 800MHz.
Fig. 11 shows the simulation results for the hysteresis voltages. Hysteresis voltages of 0mV, 37.9mV, 68mV, and 96mV are achieved for the codes 00, 01, 10, and 11, respectively.
The AC simulation results of the DDA in DCM are shown in Fig. 12. The gain is 32dB, the unity-gain frequency is 1.33GHz, and the phase margin is 65°.
Fig. 9. Comparator Layout (0.5146mm2).
Fig. 10. Transient simulation results of comparator.
Fig. 11. Simulation results of hysteresis.
Fig. 12. AC simulation results of DDA.
Fig. 13. Simulated and measured results of normal comparator.
Fig. 14. Simulated & measured results of DDA (Vout).
Table 1. Performance Comparison.
                          This Work           [2]
Process                   0.18μm BCDMOS       BiCMOS
Supply Voltage            5V                  5V
Input Range               0 ~ 5V              0 ~ 5V
Maximum Input Freq.       800MHz              1.2GHz
Hysteresis (00 ~ 11)      0mV ~ 96mV          0mV ~ 100mV
Power Consumption         300mW               1.1W (per channel)
Resolution                5mV                 N/A
Chip Area (w/o pad)       620μm × 830μm       N/A
Fig. 13 shows the simulated results of the proposed continuous comparator on the left side and the measured results on the right side. To verify operation over the wide input range, a difference voltage of 5mV around common mode voltages of 2.5V, 4.9V, and 0.1V was tested. The comparator outputs were correctly detected as high or low levels at a signal frequency of 800MHz, as shown in Fig. 13.
Fig. 14 shows the simulated results of the proposed DDA on the left side and the measured results on the right side. As shown in Fig. 14, both the differences in the common mode voltages and the difference in the differential mode voltages are measured correctly. Three input signal cases were tested to verify correct operation of the DDA. In the simulated results on the left, the two input signals are at the top and the simulated comparator output is at the bottom, while in the measured results on the right, the two input signals are shown at the bottom.
The performance summary of this comparator is in
Table 1.
5. Conclusion
This paper describes the design of a high-speed comparator in a driver IC for automatic test equipment with BCDMOS technology. To accept a wide input signal range from 0V to 5V and to handle a frequency range up to 800MHz, a cascaded amplifier structure for the continuous comparator is proposed. Also, to measure the difference in output signals between two DUTs, a DDA is exploited as the pre-amplifier, detecting the differences of both common mode signals and differential signals with minimum hardware, lower power consumption, and lower noise.
Acknowledgment
This research was supported by the Ministry of Science,
ICT and Future Planning (MSIP), Korea, under the
Information Technology Research Center (ITRC) support
program (IITP-2015-H8501-15-1010) supervised by the
Institute for Information & Communication Technology
Promotion (IITP) and also supported by the Industrial Core
Technology Development Program (10049009) funded by
the Ministry of Trade, Industry & Energy (MOTIE), Korea.
References
[1] In-Seok Jung and Yong-Bin Kim, "Cost Effective Test Methodology Using PMU for Automated Test Equipment Systems," International Journal of VLSI design & Communication Systems (VLSICS), vol. 5, no. 1, pp. 15-28, February 2014. Article (CrossRef Link)
[2] ADATE318 data sheet, Analog Devices. Article (CrossRef Link)
[3] Xinbo Qian, "A Low-power Comparator with Programmable Hysteresis Level for Blood Pressure Peak Detection," TENCON 2009, Singapore, pp. 1-4, Jan. 2009. Article (CrossRef Link)
[4] E. Säckinger and W. Guggenbühl, "A Versatile Building Block: The CMOS Differential Difference Amplifier," IEEE Journal of Solid-State Circuits, pp. 287-294, April 1987. Article (CrossRef Link)
[5] G. Nicollini and C. Guadiani, "A 3.3-V 800-nV noise, gain-programmable CMOS microphone preamplifier design using yield modeling technique," IEEE J. Solid-State Circuits, vol. 28, no. 8, pp. 915-920, Aug. 1993. Article (CrossRef Link)
[6] V. Milovanović and H. Zimmermann, "A 40nm LP CMOS Self-Biased Continuous-Time Comparator with Sub-100ps Delay at 1.1V & 1.2mW," ESSCIRC, pp. 101-104, 2013. Article (CrossRef Link)
[7] Hong-Wei Huang, Chia-Hsiang Lin and Ke-Horng Chen, "A Programmable Dual Hysteretic Window Comparator," ISCAS, pp. 1930-1933, 2008. Article (CrossRef Link)
[8] R. Jacob Baker, "CMOS: Circuit Design, Layout and Simulation, 3rd Edition," IEEE Press Series on Microelectronic Systems, August 2010. Article (CrossRef Link)
[9] Behzad Razavi, "Design of Analog CMOS Integrated Circuits," McGraw Hill, 2001. Article (CrossRef Link)
Byung-Hun Yoon received a BSc
from the Department of Electronic
Engineering at Seokyeong University,
Seoul, Korea, in 2014. Since 2014, he
has been in the master’s course at
Seokyeong University. His research
interests include analog and mixed
mode IC design and PMIC.
Shin-Il Lim received his BSc, MSc
and PhD in electronic engineering
from Sogang University, Seoul, Korea,
in 1980, 1983, and 1995, respectively.
He was with the Electronics and Telecommunication Research Institute (ETRI)
from 1982 to 1991 as senior technical
staff. He was also with the Korea
Electronics Technology Institute (KETI) from 1991 to 1995
as a senior engineer. Since 1995, he has been with Seokyeong
University, Seoul, Korea, as a professor. His research areas
are in analog and mixed mode IC design for communication, consumer, biomedical, and sensor applications. He was the TPC chair of ISOCC 2009 and the general chair of ISOCC 2011.