IEIE TRANSACTIONS ON SMART PROCESSING AND COMPUTING (www.ieiespc.org, ISSN 2287-5255)
IEIE Transactions on Smart Processing and Computing, vol. 4, no. 4, August 2015
http://dx.doi.org/10.5573/IEIESPC.2015.4.4.195

Median Filtering Detection of Digital Images Using Pixel Gradients

Kang Hyeon RHEE
Dept. of Electronics Eng. and School of Design and Creative Eng., Chosun University / Gwangju 501-759, Korea, khrhee@chosun.ac.kr

* Corresponding Author: Kang Hyeon Rhee
Received July 15, 2015; Revised August 5, 2015; Accepted August 24, 2015; Published August 31, 2015
* Short Paper: This paper was invited by Seung-Won Jung, the associate editor.
* Extended from a Conference: Preliminary results of this paper were presented at ITC-CSCC 2015. The present paper has been accepted by the editorial board through the regular reviewing process that confirms the original contribution.

Abstract: For median filtering (MF) detection in altered digital images, this paper presents a new feature vector that is formed from autoregressive (AR) coefficients via an AR model of the gradients between the neighboring row and column lines in an image. Subsequently, the defined 10-D feature vector is trained in a support vector machine (SVM) for MF detection among forged images. The MF classification is compared to the median filter residual (MFR) scheme that uses the same 10-D feature vector. In the experiment, the three test items are the area under the receiver operating characteristic (ROC) curve (AUC), the classification ratio, and the minimal average decision error. The performance is excellent for unaltered (ORI) or once-altered images, such as 3x3 average filtering (AVE3), QF=90 JPEG (JPG90), and 90% down- and 110% up-scaled (DN0.9 and UP1.1) images, versus 3x3 and 5x5 median filtering (MF3 and MF5, respectively) and MF3 and MF5 composite images (MF35). When the forged image was post-altered with AVE3, DN0.9, UP1.1 and JPG70 after MF3, MF5 and MF35, the performance of the proposed scheme is lower than the MFR scheme. In particular, the feature vector in this paper has a superior classification ratio against AVE3. However, in the measured performances with unaltered, once-altered and post-altered images versus MF3, MF5 and MF35, the AUC obtained from 'sensitivity' (TP: true positive rate) and '1-specificity' (FP: false positive rate) approaches 1. Thus, it is confirmed that the grade evaluation of the proposed scheme can be rated as 'Excellent (A)'.

Keywords: Forgery image, Median filtering (MF), Median filtering detection, Median filter residual (MFR), Median filtering forensic, Autoregressive (AR) model, Pixel gradient

1. Introduction

In image alteration, content-preserving manipulation uses compression, filtering, averaging, rotation, mosaic editing, scaling, and so on as the forgery method [1-4]. Median filtering (MF) is especially preferred among some forgers because it is a non-linear filter based on order statistics. An MF detection technique, in turn, can classify images altered by MF; the state of the art is well documented [5-9]. Consequently, an MF detector becomes a significant forensic tool for recovering the processing history of a forged image. To detect MF in a forged image, Cao et al. [10] analyzed the probability that an image's first-order pixel difference is zero in textured regions. In this regard, Stamm et al. [2] described a method that is highly accurate with unaltered or uncompressed images. Meanwhile, to extract the feature vector for median filtering detection, Kang et al.
[6] obtained autoregressive (AR) coefficients as feature vectors via an AR model of the median filter residual (MFR), which is the difference between the original image and its median-filtered version.

In this paper, a new MF detection algorithm is proposed in which the feature vector is formed from AR coefficients via an AR model of the gradients of the neighboring horizontal and vertical lines in an image.

The rest of the paper is organized as follows. Section 2 briefly introduces the theoretical background of the MFR and the gradient of neighboring lines in an image. Section 3 describes the extraction method of the new feature vector in the proposed MF detection algorithm. The experimental results of the proposed algorithm are shown in Section 4, where the performance evaluation is compared to a previous scheme and followed by some discussion. Finally, the conclusion is drawn and future work is presented in Section 5.

2. Theoretical Background

In this section, the MFR and the gradient of neighboring lines in an image are briefly introduced.

2.1 MFR

Kang et al. proposed the MFR [6], which uses a 10-D feature vector computed from the AR coefficients of the difference values between the original image (y) and its median-filtered image (med(y)). Yuan [7] attempted to reduce interference from an image's edge content and block artifacts from JPEG compression, proposing to gather detection features from an image's MFR. The difference values (d) between the original image and its median-filtered image are AR-modeled. The difference is referred to as the median filter residual, which is formally defined as

d(i, j) = \mathrm{med}_w(y(i, j)) - y(i, j) = z(i, j) - y(i, j),    (1)

where (i, j) is a pixel coordinate and w is the MF window size (w ∈ {3, 5}). The AR coefficients a_k are computed as

a_k^{(r)} = AR(\mathrm{mean}(d^{(r)})),    (2)
a_k^{(c)} = AR(\mathrm{mean}(d^{(c)})),    (3)
a_k = (a_k^{(r)} + a_k^{(c)}) / 2,    (4)

where r and c are the row and column directions, k is the AR order number, and a_k^{(r)} and a_k^{(c)} are the AR coefficients in the row and column directions, respectively. A single a_k for a one-dimensional AR model is obtained from Eq. (4) by averaging the AR coefficients of Eqs. (2) and (3) over the two directions. To reduce the dimensionality of the feature vector according to an image's statistical properties, the MFR is fitted to a one-dimensional AR model in the row direction:

d(i, j) = -\sum_{k=1}^{p} a_k^{(r)} d(i, j - k) + \varepsilon^{(r)}(i, j),    (5)

and in the column direction:

d(i, j) = -\sum_{k=1}^{p} a_k^{(c)} d(i - k, j) + \varepsilon^{(c)}(i, j),    (6)

where \varepsilon^{(r)}(i, j) and \varepsilon^{(c)}(i, j) are the prediction errors [11] and p is the order of the AR model. Again, the AR coefficients are computed from the difference image (d).

2.2 Gradient of Neighboring Lines in an Image

The gradients of the neighboring row- and column-direction lines in an image x are defined as G^{(r)} and G^{(c)}, respectively:

G^{(r)}(i, j) = x(i, j + 1) - x(i, j),    (7)
G^{(c)}(i, j) = x(i + 1, j) - x(i, j).    (8)

3. The Proposed MF Detection Algorithm

For the proposed MF detection algorithm, AR coefficients are computed via an AR model applied to Eqs. (7) and (8), as in Eqs. (9)-(11):

g_k^{(r)} = AR(\mathrm{mean}(G^{(r)})),    (9)
g_k^{(c)} = AR(\mathrm{mean}(G^{(c)})),    (10)
g_k = (g_k^{(r)} + g_k^{(c)}) / 2.    (11)

In Eq. (11), g_k [1st:10th] is formed as a 10-D feature vector.
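As a concrete illustration of Eqs. (7)-(11), the following sketch computes the 10-D feature vector from an image. It is a minimal example rather than the author's implementation: the reduction of the 2-D gradient fields to 1-D sequences by averaging over rows and columns, and the Yule-Walker estimator used for the AR(·) operator, are assumptions about how mean(·) and AR(·) are realized.

```python
import numpy as np

def ar_coefficients(signal, order=10):
    """Estimate AR coefficients of a 1-D signal via the Yule-Walker equations."""
    s = np.asarray(signal, dtype=float)
    s = s - s.mean()
    n = len(s)
    # Biased autocorrelation estimates r[0..order]
    r = np.array([np.dot(s[:n - k], s[k:]) / n for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])

def gradient_feature_vector(image, order=10):
    """10-D feature vector: averaged AR coefficients of row/column gradients (Eqs. (7)-(11))."""
    x = np.asarray(image, dtype=float)
    g_row = x[:, 1:] - x[:, :-1]          # row-direction gradients, Eq. (7)
    g_col = x[1:, :] - x[:-1, :]          # column-direction gradients, Eq. (8)
    g_r = ar_coefficients(g_row.mean(axis=0), order)   # assumed mean(): average over rows
    g_c = ar_coefficients(g_col.mean(axis=1), order)   # assumed mean(): average over columns
    return (g_r + g_c) / 2.0              # Eq. (11)

# Example on a random placeholder "image"
rng = np.random.default_rng(0)
feat = gradient_feature_vector(rng.integers(0, 256, size=(128, 128)))
print(feat.shape)    # (10,)
```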
The flow diagram of the proposed algorithm for MF detection is shown in Fig. 1. The MF detection algorithm is described in the following steps and is presented in Fig. 2.

[Step 1] Compute the neighboring row and column line gradients in the image.
[Step 2] Build the AR model of Step 1's gradients.
[Step 3] From Step 2, the AR coefficients [1st:10th] of the gradients are formed as a 10-D feature vector.
[Step 4] The feature vector is trained in an SVM classifier.
[Step 5] Implement the MF detector via the trained SVM.

Fig. 1. The flow diagram of the proposed MF detection algorithm.

Fig. 2. Proposed MF detection.
  Main Median Filtering Detection
    Begin Feature Vector Extraction
      Gradients ← neighboring row and column lines in the image
      Feature Vector ← AR_Model(Gradients)
    End Feature Vector Extraction
    Begin Training Feature Vector
      SVM Classifier(Feature Vector)
    End Training Feature Vector
    Begin Test Images
      Feature vectors of test images → trained SVM classifier
    End Test Images
    Begin Classification and Analysis
      Score, classification and confusion table by the trained SVM classifier
      Median filtering decision ← analyze the confusion table
    End Classification and Analysis
  Leave Median Filtering Detection

4. Performance Evaluation

The proposed scheme uses a C-SVM with the Gaussian kernel of Eq. (12) on the 10-D feature vector:

K(x_i, x_j) = \exp(-\gamma \| x_i - x_j \|^2).    (12)

The 10-D feature vector is fed to the SVM classifier, which is trained with five-fold cross-validation in conjunction with a grid search for the best parameters C and γ over the multiplicative grid

(C, γ) ∈ { (2^i, 2^j) | 4i, 4j ∈ ℤ }.    (13)

The search step size for i and j is 0.25, and these parameters are used to obtain the classifier model on the training set. The 1,388-image UCID (Uncompressed Color Image Database) [12] is used for MF detection, and the test image types prepared were MF3, MF5, unaltered (ORI), 3x3 average filtering (AVE3), JPEG (QF=90), and 90% down- and 110% up-scaled (DN0.9 and UP1.1) images. Subsequently, the trained classifier model was used to perform classification on the testing set. From among the 1,388 UCID images, 1,000 images were randomly selected for training, and the other 388 images were allocated to testing. In Figs. 3 and 4, the feature vector distributions of the MFR and the proposed MF detection scheme, respectively, are presented.

Fig. 3. The feature vector distribution of the MFR.
Fig. 4. The feature vector distribution of the proposed MF detection.

The test image groups were prepared in three kinds:
• Group A: the unaltered and once-altered images: ORI, AVE3, JPG90, DN0.9, UP1.1.
• Group B: post-altered two more times, after MF3: MF3+AVE3+JPG70, MF3+DN0.9+JPG70, MF3+UP1.1+JPG70.
• Group C: post-altered two more times, after MF5: MF5+AVE3+JPG70, MF5+DN0.9+JPG70, MF5+UP1.1+JPG70.

Fig. 5 presents the average AR coefficients of the 'A' group images.

Fig. 5. Average AR coefficients of 'A' group images of the proposed MF detection scheme.

In Fig. 6, ROC curves show the performance on MFw versus the test images under the MFR [6] scheme, and in Fig. 7, ROC curves show the performance on the MFs versus the test images under the proposed MF detection scheme.

Fig. 6. ROC curves of the MFR [6] scheme.
Fig. 7. ROC curves of the proposed MF detection scheme.
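A sketch of the classifier training described above using scikit-learn. The 0.25 exponent step and five-fold cross-validation follow the text; the exponent range of the (C, γ) grid, the ROC-AUC scoring choice, and the placeholder feature matrix and labels are assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Placeholder data: rows would be the 10-D gradient-AR feature vectors of
# median-filtered (label 1) and non-median-filtered (label 0) images.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = rng.integers(0, 2, size=200)

exponents = np.arange(-5, 5.25, 0.25)          # search step of 0.25 in the exponent (range assumed)
param_grid = {"C": 2.0 ** exponents, "gamma": 2.0 ** exponents}

grid = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="roc_auc", n_jobs=-1)
grid.fit(X, y)
print(grid.best_params_)
detector = grid.best_estimator_                 # trained MF detector
```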
Table 1 shows the experimental results for MFw and the test image types in terms of AUC, Pe (the minimal average decision error under the assumption of equal priors and equal costs [13]), and classification ratio:

P_e = \min \frac{P_{FP} + 1 - P_{TP}}{2}.    (14)

The above procedure was repeated 30 times to reduce performance variations caused by different selections of the training samples. The detection accuracy, which is the arithmetic average of the true positive (TP) rate and true negative (TN) rate, was averaged over the 30 random experiments [9].

Table 1. Performance comparison between the MFR and the proposed MF detection scheme.
MFw: median filtering window size, w ∈ {3, 5, 35}.
RI (Result Item) 1: AUC, 2: Pe, 3: Classification ratio.
Group A: (a) ORI, (b) AVE3, (c) JPG90, (d) DN0.9, (e) UP1.1.
Group B: (a) MF3+AVE3+JPG70, (b) MF3+DN0.9+JPG70, (c) MF3+UP1.1+JPG70.
Group C: (a) MF5+AVE3+JPG70, (b) MF5+DN0.9+JPG70, (c) MF5+UP1.1+JPG70.

From Table 1, it can be seen that the performance of the proposed scheme is excellent with MF3, MF5 and MF35 versus the ORI, AVE3, JPG90, DN0.9 and UP1.1 images compared to the MFR scheme. For the forged images that were post-altered two more times with AVE3, DN0.9, UP1.1 and JPG70 after MF3 and MF5, the performance of the proposed scheme is lower than that of the MFR scheme. In particular, the feature vector in this paper has a superior classification ratio against AVE3. However, in the measured performances of all items, the AUC obtained from 'sensitivity' (TP rate) and '1-specificity' (FP rate) was close to 1. Thus, it was confirmed that the grade evaluation of the proposed algorithm can be rated as 'Excellent (A)'. In all the above experiments, the proposed MF detection considered only the AR coefficients of the image line gradients to form the feature vector in the spatial domain.
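The two test quantities above can be computed directly from detector scores. The following is a small sketch under the assumption that the SVM decision values and ground-truth labels of the 388 test images are available; placeholder arrays are used here.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# scores: decision values of the trained SVM on the test set; labels: 1 = median filtered
rng = np.random.default_rng(2)
labels = rng.integers(0, 2, size=388)
scores = labels + rng.normal(scale=0.8, size=388)   # placeholder scores

fpr, tpr, _ = roc_curve(labels, scores)
print("AUC =", auc(fpr, tpr))

# Minimal average decision error of Eq. (14) under equal priors and equal costs:
# Pe = min over thresholds of (P_FP + 1 - P_TP) / 2
pe = np.min((fpr + 1.0 - tpr) / 2.0)
print("Pe  =", pe)
```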
5. Conclusion

This paper proposes a new robust MF detection scheme. From an AR model of an image pixel's gradients, the scheme uses the AR coefficients as MF detection feature vectors. The proposed MF detection scheme is compared to the MFR [6], so these results will serve as further research content on MF detection. This appears to be a complete solution of an AR model built from the gradients of the neighboring row and column lines in a variety of images. Despite the short length of the proposed feature vector, the performance results are excellent: the AUC and the classification ratio exceed 0.9, and Pe is close to 0. Future work should consider a performance evaluation on smaller sizes of altered images, such as 64x64 or 32x32, which were not considered in this paper. Finally, the proposed approach can also be applied to solve different forensic problems, like previous MF detection techniques.

Acknowledgement

This research was supported by the Ministry of Trade, Industry and Energy (MOTIE), Korea, through the Education Program for Creative and Industrial Convergence (Grant Number N0000717).

References

[1] Kang Hyeon Rhee, "Image Forensic Decision Algorithm using Edge Energy Information of Forgery Image," Journal of IEIE, Vol. 51, No. 3, pp. 75-81, March 2014. Article (CrossRef Link)
[2] M. C. Stamm, Min Wu, and K. J. R. Liu, "Information Forensics: An Overview of the First Decade," IEEE Access, pp. 167-200, 2013. Article (CrossRef Link)
[3] Kang Hyeon Rhee, "Forensic Decision of Median Filtering by Pixel Value's Gradients of Digital Image," Journal of IEIE, Vol. 52, No. 6, pp. 79-84, June 2015. Article (CrossRef Link)
[4] Kang Hyeon Rhee, "Framework of multimedia forensic system," in Proc. 7th International Conference on Computing and Convergence Technology (ICCCT), pp. 1084-1087, 2012. Article (CrossRef Link)
[5] Chenglong Chen, Jiangqun Ni, and Jiwu Huang, "Blind Detection of Median Filtering in Digital Images: A Difference Domain Based Approach," IEEE Transactions on Image Processing, Vol. 22, pp. 4699-4710, 2013. Article (CrossRef Link)
[6] Xiangui Kang, Matthew C. Stamm, Anjie Peng, and K. J. Ray Liu, "Robust Median Filtering Forensics Using an Autoregressive Model," IEEE Transactions on Information Forensics and Security, Vol. 8, No. 9, pp. 1456-1468, Sept. 2013. Article (CrossRef Link)
[7] H. Yuan, "Blind forensics of median filtering in digital images," IEEE Transactions on Information Forensics and Security, Vol. 6, No. 4, pp. 1335-1345, Dec. 2011. Article (CrossRef Link)
[8] Tomáš Pevný, "Steganalysis by Subtractive Pixel Adjacency Matrix," IEEE Transactions on Information Forensics and Security, Vol. 5, pp. 215-224, 2010. Article (CrossRef Link)
[9] Yujin Zhang, Shenghong Li, Shilin Wang, and Yun Qing Shi, "Revealing the Traces of Median Filtering Using High-Order Local Ternary Patterns," IEEE Signal Processing Letters, Vol. 21, pp. 275-279, 2014. Article (CrossRef Link)
[10] G. Cao, Y. Zhao, R. Ni, L. Yu, and H. Tian, "Forensic detection of median filtering in digital images," in Proc. IEEE International Conference on Multimedia and Expo (ICME), pp. 89-94, Jul. 2010. Article (CrossRef Link)
[11] S. M. Kay, Modern Spectral Estimation: Theory and Application, Englewood Cliffs, NJ, USA: Prentice-Hall, 1998.
[12] UCID (Uncompressed Color Image Database). Article (CrossRef Link) (accessed 2015.4.1)
[13] M. Kirchner and J. Fridrich, "On detection of median filtering in digital images," in Proc. SPIE, Electronic Imaging, Media Forensics and Security II, Vol. 7541, pp. 1-12, 2010. Article (CrossRef Link)

Copyrights © 2015 The Institute of Electronics and Information Engineers

Kang Hyeon RHEE received a BSc and an MSc in Electronics Engineering from Chosun University, Korea, in 1977 and 1981, respectively. In 1991, he was awarded a PhD in Electronics Engineering from Ajou University, Korea. Since 1977, Dr. Rhee has been with the Dept. of Electronics Eng. and School of Design and Creative Engineering, Chosun University, Gwangju, Korea. His current research interests include embedded system design related to multimedia fingerprinting/forensics. He is on the committee of the LSI Design Contest in Okinawa, Japan. Dr. Rhee is also the recipient of awards such as the Haedong Prize from the Haedong Science and Culture Juridical Foundation, Korea, in 2002 and 2009.
IEIE Transactions on Smart Processing and Computing, vol. 4, no. 4, August 2015
http://dx.doi.org/10.5573/IEIESPC.2015.4.4.202

New Inference for a Multiclass Gaussian Process Classification Model using a Variational Bayesian EM Algorithm and Laplace Approximation

Wanhyun Cho(1), Sangkyoon Kim(2) and Soonyoung Park(3)
(1) Department of Statistics, Chonnam National University / Gwangju 500-757, South Korea, whcho@chonnam.ac.kr
(2) Department of Electronics Engineering, Mokpo National University / Chonnam, South Korea, narciss76@mokpo.ac.kr
(3) Department of Electronics Engineering, Mokpo National University / Chonnam, South Korea, sypark@mokpo.ac.kr

* Corresponding Author: Sangkyoon Kim
Received July 15, 2015; Revised August 5, 2015; Accepted August 24, 2015; Published August 31, 2015
* Short Paper
* Extended from a Conference: Preliminary results of this paper were presented at ITC-CSCC 2015. The present paper has been accepted by the editorial board through the regular reviewing process that confirms the original contribution.

Abstract: In this study, we propose a new inference algorithm for a multiclass Gaussian process classification model using a variational EM framework and the Laplace approximation (LA) technique. This is performed in two steps, called expectation and maximization. First, in the expectation step (E-step), using Bayes' theorem and the LA technique, we derive the approximate posterior distribution of the latent function, which indicates the possibility that each observation belongs to a certain class in the Gaussian process classification model. In the maximization step, we compute the maximum likelihood estimators for the hyper-parameters of the covariance matrix necessary to define the prior distribution of the latent function, using the posterior distribution derived in the E-step. These steps are repeated iteratively until a convergence condition is satisfied. Moreover, we conducted experiments using synthetic data and Iris data in order to verify the performance of the proposed algorithm. Experimental results reveal that the proposed algorithm shows good performance on these datasets.

Keywords: Multiclass Gaussian process classification model, Variational Bayesian EM algorithm, Laplace approximation technique, Latent function, Softmax function, Synthetic data, Iris data

1. Introduction

A Gaussian process (GP) can be conveniently used to specify prior distributions of hidden functions for Bayesian inference. In the case of regression with Gaussian noise, inference can be done simply in closed form, since the posterior is also a GP. But in the case of classification, exact inference is analytically intractable because the likelihood function has a non-Gaussian form. One prolific line of attack is based on approximating the non-Gaussian posterior with a tractable Gaussian distribution. Three different types of solutions have been suggested in the recent literature [1]: the Laplace approximation (LA) and expectation propagation (EP), Kullback-Leibler divergence minimization (comprising variational bounding as a special case), and factorial approximation. First, Williams et al. proposed a second-order Taylor expansion around the posterior mode as a natural way of constructing a Gaussian approximation to the log-posterior distribution [2]. The mode is taken as the mean of the approximate Gaussian. Linear terms of the log-posterior vanish because the gradient at the mode is zero, and the quadratic term of the log-posterior is given by the negative Hessian matrix.
Minka presented a new approximation technique (EP) for Bayesian networks [3]. This is an iterative method to find approximations based on approximate marginal moments, which can be applied to Gaussian processes. Second, Opper et al. discussed the relationship between the Laplace and variational approximations, and they showed that for models with Gaussian priors and factoring likelihoods, the number of variational parameters is actually O(N) [4]. They also considered a problem that minimizes the KL-divergence measure between the approximated posterior and the exact posterior. Gibbs et al. showed that the variational methods of Jaakkola and Jordan can be applied to Gaussian processes to produce an efficient Bayesian binary classifier [5]. They obtained tractable upper and lower bounds for the unnormalized posterior density. These bounds are parameterized by variational parameters that are adjusted to obtain the tightest possible fit. Using the normalized versions of the optimized bounds, they then compute approximations to the predictive distributions. Third, Csato et al. presented three simple approximations for the calculation of the posterior mean in Gaussian process classification [6]. The first two methods are related to mean-field ideas known in statistical physics; the third approach is based on a Bayesian online approach. Finally, Kim et al. presented an approximate expectation-maximization (EM) algorithm and the EM-EP algorithm to learn both the latent function and the hyper-parameters in a Gaussian process classification model [7].

We propose a new inference algorithm that can simultaneously derive both a posterior distribution of a latent function and maximum likelihood estimators of the hyper-parameters in a Gaussian process classification model. The proposed algorithm is performed in two steps, called the expectation step (E-step) and the maximization step (M-step). First, in the expectation step, using the Bayesian formula and LA, we derive the approximate posterior distribution of the latent function based on the learning data. Furthermore, we calculate a mean vector and covariance matrix of the latent function. Second, in the maximization step, using the derived posterior distribution of the latent function, we derive the maximum likelihood estimator for the hyper-parameters necessary to define a covariance matrix. Moreover, we conducted experiments using synthetic data and Iris data in order to verify the performance of the proposed algorithm.

The rest of this paper is organized as follows. The next section describes a multiclass Gaussian process classification model. Sections 3 and 4 propose a new inference scheme that derives the approximate posterior distribution of the latent variables and estimates the hyper-parameters of the covariance function for the prior distribution of the latent function. Section 5 includes performance evaluations and a discussion of the effects of the proposed model. Finally, we conclude this paper in the last section.

2. Multiclass Gaussian Process Classification Model

We first consider a multiclass Gaussian process classification model (MGPCM). The model consists of three components: a latent function with a Gaussian process prior distribution, a multiclass response, and a link function that relates the latent function to the response mean. First, we consider the multivariate latent function.
Here, we define the latent function f(x) for Gaussian process classification with C classes at a set of observations x_1, ..., x_n as

f(x | Θ) = (f_1^1(x), ..., f_n^1(x), ..., f_1^c(x), ..., f_n^c(x), ..., f_1^C(x), ..., f_n^C(x))^T.    (1)

Then, we assume a GP prior for the latent function f(x), defined by

f(x | Θ) ~ GP(0, K(x_i, x_j | Θ)),    (2)

where K(x_i, x_j) is the covariance matrix. In this paper, we assume that the latent function f(x) represents the C classes and that the individual variables of the c-th component vector f^c(x) of the latent function f(x) are uncorrelated. Therefore, the GP covariance matrix K(x_i, x_j) can be assumed to have the following block-diagonal form:

K(x_i, x_j | Θ) = \mathrm{diag}(K^1(x_i, x_j | Θ_1), ..., K^c(x_i, x_j | Θ_c), ..., K^C(x_i, x_j | Θ_C)),    (3)

where K^c(x_i, x_j | Θ_c) = (k^c(x_i, x_j | θ_1^c, θ_2^c))_{(n×n)}, i, j = 1, ..., n, is the covariance matrix for the c-th component vector of the latent function.

Second, the response vector Y is constituted by identically and independently distributed multinomial random variables, where each component variable represents a class c. That is, let us define the response vector Y as

Y = (y_1^1, ..., y_n^1, ..., y_1^c, ..., y_n^c, ..., y_1^C, ..., y_n^C)^T,    (4)

where the response vector Y has the same length as f(x), and each component y_k^c of the c-th response vector y^c = (y_1^c, ..., y_n^c)^T, c = 1, ..., C, is 1 for the class that is the label of the observation and 0 for the other C - 1 classes. Here, the multinomial density function p(Y | π) of the response vector Y is given in the following form:

p(Y | π) = \prod_{c=1}^{C} \prod_{k=1}^{n} (π_k^c)^{y_k^c},    (5)

where the indicator variable y_k^c takes one or zero with probability π_k^c and 1 - π_k^c, and π_k^c denotes the probability that the k-th observation vector belongs to the particular class c.

Third, we consider the link function that specifies the relation between the latent function f(x) and the response mean vector E(Y | f). Here, the link function can be defined as

E(Y | f) = (E(y^1 | f), ..., E(y^c | f), ..., E(y^C | f))^T,

where E(y^c | f) = (E(y_1^c | f), ..., E(y_k^c | f), ..., E(y_n^c | f)), c = 1, ..., C, and

E(y_k^c | f) = π_k^c = \frac{\exp(f_k^c)}{\sum_{c'=1}^{C} \exp(f_k^{c'})}, k = 1, ..., n.    (6)
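A short sketch of the model in Eqs. (1)-(6): a block-diagonal GP prior built from one covariance block per class, and the softmax link. The squared-exponential form k(x_i, x_j) = θ1 exp(-θ2 ||x_i - x_j||^2) is an assumed parameterization of k^c(·, · | θ_1^c, θ_2^c), since the paper does not state the kernel explicitly; the data and parameter values are placeholders.

```python
import numpy as np
from scipy.linalg import block_diag

def rbf_kernel(X, theta1, theta2):
    """Assumed covariance k(x_i, x_j) = theta1 * exp(-theta2 * ||x_i - x_j||^2)."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return theta1 * np.exp(-theta2 * sq)

rng = np.random.default_rng(3)
n, C = 20, 3
X = rng.normal(size=(n, 2))                        # n observations, 2 features

# Block-diagonal prior covariance: one n x n block per class (Eq. (3))
blocks = [rbf_kernel(X, theta1=1.0, theta2=0.5) for _ in range(C)]
K = block_diag(*blocks) + 1e-8 * np.eye(n * C)     # jitter for numerical stability

# One sample of the stacked latent vector f = (f_1^1..f_n^1, ..., f_1^C..f_n^C)
f = rng.multivariate_normal(np.zeros(n * C), K).reshape(C, n)

# Softmax link (Eq. (6)): class probabilities for each observation
pi = np.exp(f) / np.exp(f).sum(axis=0, keepdims=True)
print(pi.sum(axis=0))    # each column sums to 1
```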
3. Variational EM Framework and Laplace Approximation Method

One important issue in the Gaussian process classification model is to both derive the approximate distribution for the posterior distribution of the latent variables and estimate the hyper-parameters of the covariance function for the prior distribution of the latent function. One possible approach is the variational EM algorithm, which is widely used with incomplete data. In the E-step of the variational EM algorithm, we derive the approximate Gaussian posterior q(f | X, Y, Θ) for the latent function value f using the Laplace approximation. In the M-step, we seek an estimator of the hyper-parameter Θ that maximizes a lower bound on the logarithm of the marginal likelihood q(Y | X, Θ), using the approximate posterior q(f | X, Y, Θ) obtained in the E-step. The E-step and M-step are iteratively repeated until a convergence condition is satisfied. Our algorithm is given in detail in the following sections.

3.1 Variational E-step and Laplace Approximation

First, using Bayes' rule at the variational E-step, the posterior over the latent variable f is given by

p(f | X, y, Θ) = p(y | f) p(f | X, Θ) / p(y | X, Θ),    (7)

but because the denominator p(y | X, Θ) is independent of the latent function f, we need only consider the unnormalized posterior when maximizing with respect to f. Taking the logarithm of the unnormalized posterior of the latent function f, it can be given as

Ψ(f) = \ln p(f | Y, X, Θ) = \ln p(f | X, Θ) + \ln p(Y | f)
     = \ln p(Y | f) - \frac{1}{2} f^T K^{-1} f - \frac{1}{2} \ln|K| - \frac{nC}{2} \ln 2π.    (8)

Here, taking the first and second derivatives of Eq. (8) with respect to f, we obtain

∇Ψ(f) = ∇ \ln p(Y | f) - K^{-1} f,    (9)
∇∇Ψ(f) = ∇∇ \ln p(Y | f) - K^{-1} = -W - K^{-1},    (10)

where W ≡ -∇∇ \ln p(Y | f), since the likelihood factorizes over the cases. A natural way of constructing a Gaussian approximation to the log-posterior Ψ(f) = \ln p(f | Y, X, Θ) is to perform a second-order Taylor expansion at the mode m_f of the posterior, i.e.,

m_f = \arg\max_f Ψ(f) = \arg\max_f \ln p(Y | f) p(f | X, Θ).

This gives the following expansion:

Ψ(f) ≅ Ψ(m_f) + ∇Ψ(f)|_{f=m_f} (f - m_f) + \frac{1}{2} (f - m_f)^T (∇∇Ψ(f)|_{f=m_f}) (f - m_f)
     = Ψ(m_f) - \frac{1}{2} (f - m_f)^T (W + K^{-1}) (f - m_f)
     ≅ \ln N(f | m_f, (K^{-1} + W)^{-1}).    (11)

Thus, we have obtained a Gaussian approximation q(f | Y, X, Θ) to the true posterior p(f | Y, X, Θ) with mean vector m_f and covariance matrix V = (K^{-1} + W)^{-1}. That is, using the Laplace approximation, the true posterior p(f | X, Y, Θ) of the latent function f is approximated by the Gaussian posterior

q(f | X, Y, Θ) ~ N(m_f, V = (K^{-1} + W)^{-1}).    (12)

Here, the mode (maximum) m_f of the log-posterior Ψ(f) can be found iteratively using the Newton-Raphson algorithm. That is, given an initial estimate m_f, a new estimate is iteratively found as follows:

m_f^{new} = m_f - (∇∇Ψ(f)|_{f=m_f})^{-1} ∇Ψ(f)|_{f=m_f}
          = m_f + (K^{-1} + W)^{-1} (∇ \ln p(Y | f)|_{f=m_f} - K^{-1} m_f)
          = (K^{-1} + W)^{-1} (W m_f + ∇ \ln p(Y | f)|_{f=m_f}).    (13)

Moreover, since the log-likelihood function \ln p(Y | f) can be expressed as \sum_{k=1}^{n} \ln p(y_k^1, ..., y_k^C | f_k), we obtain the following equation by differentiating the log-likelihood function \ln p(Y | f) with respect to f:

∇_f \ln p(Y | f) = ∇_f \left( \sum_{k=1}^{n} \ln p(y_k^1, ..., y_k^C | f_k) \right)
                = ∇_f \left( \sum_{k=1}^{n} \sum_{c=1}^{C} y_k^c f_k^c - \sum_{k=1}^{n} \ln \sum_{c=1}^{C} \exp(f_k^c) \right)
                = Y - π,    (14)

where the vector π_{(nC×1)} = (π_1^1, ..., π_i^c, ..., π_n^C)^T is defined by

π_i^c = \frac{\exp(f_i^c)}{\sum_{c^*=1}^{C} \exp(f_i^{c^*})}, i = 1, ..., n, c = 1, ..., C.    (15)

Second, the matrix W can be given as

W = -∇∇ \ln p(Y | f) = -\frac{∂^2 \ln p(Y | f)}{∂f ∂f^T} = \mathrm{diag}(π) - Π^T Π,    (16)

where Π is the (n × nC) matrix obtained by horizontally stacking the diagonal matrices \mathrm{diag}(π^c), c = 1, ..., C. This is given in the following form:

Π = [ \mathrm{diag}(π^1)  \mathrm{diag}(π^2)  ...  \mathrm{diag}(π^C) ].    (17)
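A minimal sketch of the E-step: the Newton iteration of Eq. (13) with the gradient Y - π of Eq. (14) and W = diag(π) - Π^T Π of Eqs. (16)-(17). The class-major stacking order of f, the convergence tolerance, and the placeholder covariance and labels in the usage lines are assumptions.

```python
import numpy as np

def softmax_pi(f, n, C):
    """Class probabilities (Eq. (15)) from the stacked latent vector f of length n*C."""
    F = f.reshape(C, n)
    P = np.exp(F - F.max(axis=0))            # numerically stabilized softmax
    return (P / P.sum(axis=0)).reshape(-1)

def laplace_mode(K, Y, n, C, iters=50, tol=1e-8):
    """Newton iteration m <- (K^-1 + W)^-1 (W m + (Y - pi)) of Eq. (13)."""
    m = np.zeros(n * C)
    for _ in range(iters):
        pi = softmax_pi(m, n, C)
        Pi_T = np.vstack([np.diag(pi.reshape(C, n)[c]) for c in range(C)])   # (nC x n) = Pi^T
        W = np.diag(pi) - Pi_T @ Pi_T.T                                      # Eq. (16)
        m_new = np.linalg.solve(np.linalg.inv(K) + W, W @ m + (Y - pi))
        if np.max(np.abs(m_new - m)) < tol:
            return m_new, W
        m = m_new
    return m, W

# Tiny usage with a random SPD covariance and one-hot targets (placeholder data)
rng = np.random.default_rng(4)
n, C = 10, 3
A = rng.normal(size=(n * C, n * C))
K = A @ A.T + n * C * np.eye(n * C)
Y = np.eye(C)[rng.integers(0, C, n)].T.reshape(-1)   # stacked one-hot labels, class-major
m_f, W = laplace_mode(K, Y, n, C)
```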
3.2 Variational M-Step

Assuming the derived approximate Gaussian posterior q(f | X, Y, Θ) is held fixed, we seek new parameter values Θ_new that maximize the lower bound F(q, Θ) given in Eq. (18) with respect to Θ:

\ln p(Y | X, Θ) = \ln \int p(f | X, Θ) p(Y | f) \, df
               = \int q(f) \ln \frac{p(f, Y | X, Θ)}{q(f)} \, df + \int q(f) \ln \frac{q(f)}{p(f | Y, X, Θ)} \, df
               ≥ F(q, Θ) = \int q(f) \ln \frac{p(f, Y | X, Θ)}{q(f)} \, df.    (18)

Here, the lower bound F(q, Θ) can be written as

F(q, Θ) = \int q(f) \ln \frac{p(f | X, Θ) p(Y | f)}{q(f)} \, df
        = \int q(f) \ln p(f | X, Θ) \, df + \int q(f) \ln p(Y | f) \, df - \int q(f) \ln q(f) \, df
        = E_{q(f)}(\ln p(f | X, Θ)) + E_{q(f)}(\ln p(Y | f)) + H(q(f)).    (19)

Moreover, since the second and third terms are independent of the hyper-parameters Θ, we only need to maximize the first term, E_{q(f)}(\ln p(f | X, Θ)), with respect to Θ. By computing E_{q(f)}(\ln p(f | X, Θ)) using the Gaussian posterior, we obtain

E_{q(f)}(\ln p(f | X, Θ)) = -\frac{nC}{2} \ln 2π - \frac{1}{2} \ln|K(Θ)| - \frac{1}{2} E_{q(f)}(f^T K(Θ)^{-1} f)
                          = -\frac{nC}{2} \ln 2π - \frac{1}{2} \ln|K(Θ)| - \frac{1}{2} m_f^T K(Θ)^{-1} m_f - \frac{1}{2} \mathrm{tr}(K(Θ)^{-1} \mathrm{Cov}(f)).    (20)

Here, by differentiating E_{q(f)}(\ln p(f | X, Θ)) with respect to Θ using the E-step result, we obtain

\frac{∂ E_{q(f)}(\ln p(f | X, Θ))}{∂Θ} = -\frac{1}{2} \mathrm{tr}\left( K(Θ)^{-1} \frac{∂K(Θ)}{∂Θ} \right)
 + \frac{1}{2} m_f^T K(Θ)^{-1} \frac{∂K(Θ)}{∂Θ} K(Θ)^{-1} m_f
 + \frac{1}{2} \mathrm{tr}\left( K(Θ)^{-1} \frac{∂K(Θ)}{∂Θ} K(Θ)^{-1} \mathrm{Cov}(f) \right).    (21)

Therefore, we can obtain the hyper-parameters maximizing the free energy by the following gradient update rule:

Θ_new = Θ_old + η \left( \frac{∂ E_{q(f)}(\ln p(f | X, Θ))}{∂Θ} \right)_{Θ=Θ_old}.    (22)
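A sketch of Eqs. (21)-(22) for a single shared hyper-parameter, assuming the squared-exponential block kernel used earlier; m_f and Cov(f) would normally come from the E-step, and placeholder values are used here just to make the snippet run.

```python
import numpy as np
from scipy.linalg import block_diag

def dF_dtheta2(X, theta1, theta2, m_f, cov_f, C):
    """Gradient of E_q[ln p(f | X, Theta)] w.r.t. theta2 of k = theta1*exp(-theta2*d^2) (Eq. (21))."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    Kb = theta1 * np.exp(-theta2 * sq)                      # one class block
    K = block_diag(*[Kb] * C) + 1e-8 * np.eye(len(X) * C)
    dK = block_diag(*[-sq * Kb] * C)                        # elementwise dk/dtheta2 per block
    Kinv = np.linalg.inv(K)
    return (-0.5 * np.trace(Kinv @ dK)
            + 0.5 * m_f @ Kinv @ dK @ Kinv @ m_f
            + 0.5 * np.trace(Kinv @ dK @ Kinv @ cov_f))

# Placeholder E-step quantities, just to exercise the gradient update of Eq. (22)
rng = np.random.default_rng(5)
n, C = 10, 3
X = rng.normal(size=(n, 2))
m_f = rng.normal(size=n * C)
cov_f = 0.1 * np.eye(n * C)

theta1, theta2, eta = 1.0, 0.5, 0.01
theta2 = theta2 + eta * dF_dtheta2(X, theta1, theta2, m_f, cov_f, C)   # one gradient step
print(theta2)
```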
One hundred sixty data points were generated by the four bivariate normal distributions with the mean vectors and covariance matrices described in Table 1. Fig. 1(a) plots these data points in a twodimensional space. (25) Table 1. Mean Vector and Covariance Matrix for Each Class. (K ) m = (y − π ) . Moreover, if these are put into vector form, then the expectation of latent function f* under the Laplace approximation is given as c F c c μ* = Ε q (f*c | x* , X, Y, Θ) = Q*T (y − π), (29) 5. Performance Evaluation where the last equality comes from K −1m F = Y − π , and c −1 (28) Therefore, we have obtained the approximate Gaussian posterior distribution G (μ* , Σ* ) of the latent function f* . Finally, in order to classify input vector X* into its proper class, we first extract the n random samples f*1 ,", f*n from the predictive distribution of latent function f* corresponding to the input vector. Further, using the Eq. (2), we calculate the estimate of the classification probability (π*1c ,", π*n c ), c = 1,", C ,and compute a mean (24) But the posterior distribution of the latent function is unfortunately not Gaussian due to the non-Gaussian likelihood, as mentioned above. Hence, the approximate posterior distribution of the latent function is necessary. Here, if we use the Laplace approximation posterior q(f | X, Y, Θ) to a true posterior p(f | X, Y, Θ) , we have obtained the approximate posterior distribution q(f* | x* , X, Y, Θ) of latent function f* . It is obviously Ε q (f*c | x* , X, Y, Θ) = K *c (x, x* )T (K c ) −1 m Fc Σ* = Covq (f* | x* , X, Y, Θ) Mean vector Covariance matrix Class 1 Class 2 Class 3 Class 4 (1.75,-1.0) (-1.75,1.0) (2,2) (-2,-2) ⎛ 1 0.5 ⎞ ⎜ ⎟ ⎝ 0.5 1 ⎠ ⎛ 1 −0.5 ⎞ ⎜ ⎟ ⎝ −0.5 1 ⎠ ⎛ 1 −0.5 ⎞ ⎜ ⎟ ⎝ −0.5 1 ⎠ ⎛1 ⎜ ⎝0 0⎞ ⎟ 1⎠ (26) where a matrix QT* is defined as the (nC × C ) matrix ⎛ K 1* (x, x* ) 0 ⎜ 2 0 K * ( x, x* ) Q* = ⎜ ⎜ M M ⎜⎜ 0 0 ⎝ ⎞ ⎟ K M ⎟ ⎟ O M ⎟ K K *C (x, x* ) ⎟⎠ K 0 (27) Fig. 1. (a) Training data, (b) testing data, and (c) class region and misclassification observations. IEIE Transactions on Smart Processing and Computing, vol. 4, no. 4, August 2015 Fig. 2. Iris dataset. Table 2. Classification of Iris Species. setosa versicolor virginica Setosa 1 0 0 Versicolor 0 0.96 0.04 Virginica 0 0.01 0.99 Second, in order to verify the performance of the model, we generate four different classes of bivariate Gaussian random samples. Four hundred data points were generated by the bivariate normal distribution. Fig. 1(b) plots the testing data points. Fig. 1(c) shows each region and misclassification data points. We can see that it totals about 7-8% misclassification. Therefore, we know that the proposed method can completely classify the data points well. 5.2 Iris Dataset Here, we considered real data called an Iris dataset. This dataset consists of 50 samples from each of three species of Iris flowers: setosa, versicolor and virginica. Four features were measured from each sample (length and width of sepal and petal) in centimeters. Based on the combination of the four features, we developed a GP classifier model to distinguish one species from another. Fig. 2 shows the Iris dataset from different viewpoints. First, in order to train a model, we used a total of 90 observations from three classes. And in order to verify the performance of the model, we selected 60 samples, except for ones used in the training set. Next, we want to measure the performance of our proposed model when classifying the Iris species. 
5.2 Iris Dataset

Here, we consider real data, the Iris dataset. This dataset consists of 50 samples from each of three species of Iris flowers: setosa, versicolor and virginica. Four features were measured for each sample (length and width of sepal and petal) in centimeters. Based on the combination of the four features, we developed a GP classifier model to distinguish one species from another. Fig. 2 shows the Iris dataset from different viewpoints.

Fig. 2. Iris dataset.

First, in order to train the model, we used a total of 90 observations from the three classes. In order to verify the performance of the model, we then selected 60 samples, excluding the ones used in the training set. Next, we measured the performance of the proposed model when classifying the Iris species. To find the best performance, we chose the optimal hyper-parameters at the point where the marginal likelihood has a maximum, using the EM algorithm. Table 2 shows the results of the Iris species classification. To calculate the rates, we estimate the number of correctly classified negatives and positives and divide by the total number of each species. We ran many experiments with randomly selected samples to obtain meaningful results. Experimental results reveal that the average successful classification rate is about 98%.

Table 2. Classification of Iris species.
              setosa   versicolor   virginica
Setosa          1        0            0
Versicolor      0        0.96         0.04
Virginica       0        0.01         0.99

6. Conclusion

This paper proposed a new inference algorithm that can simultaneously derive both the posterior distribution of the latent function and estimators of the hyper-parameters in the Gaussian process classification model. The proposed algorithm is performed in two steps: the expectation step and the maximization step. In the expectation step, using a Bayesian formula and the Laplace approximation, we derived the approximate posterior distribution of the latent function on the basis of the learning data. Furthermore, we considered a method of calculating the mean vector and covariance matrix of the latent function. In the maximization step, using the derived posterior distribution of the latent function, we derived the maximum likelihood estimator for the hyper-parameters necessary to define the covariance matrix. Finally, we conducted experiments using synthetic data and Iris data to verify the performance of the proposed algorithm. Experimental results reveal that the proposed algorithm shows good performance on these datasets. Our future work will extend the proposed method to other video recognition problems, such as 3D human action recognition, gesture recognition, and surveillance systems.

Acknowledgement

This work was jointly supported by the National Research Foundation of the Korea Government (2014R1A1A4A0109398) and the research fund of Chonnam National University (2014-2256).

References

[1] H. Nickisch et al., "Approximations for Binary Gaussian Process Classification," Journal of Machine Learning Research, Vol. 9, pp. 2035-2078, 2008. Article (CrossRef Link)
[2] C. K. I. Williams et al., "Bayesian Classification with Gaussian Processes," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 12, pp. 1103-1118, 1998. Article (CrossRef Link)
[3] T. P. Minka, "Expectation Propagation for Approximate Bayesian Inference," Technical Report, Dept. of Statistics, Carnegie Mellon University, Pittsburgh, PA, 2001. Article (CrossRef Link)
[4] M. Opper et al., "The Variational Gaussian Approximation Revisited," Neural Computation, Vol. 21, No. 3, pp. 786-792, 2009. Article (CrossRef Link)
[5] M. N. Gibbs et al., "Variational Gaussian Process Classifiers," IEEE Transactions on Neural Networks, Vol. 11, No. 6, pp. 1458-1464, 2000. Article (CrossRef Link)
[6] L. Csato et al., "Efficient Approaches to Gaussian Process Classification," in Neural Information Processing Systems, Vol. 12, pp. 251-257, MIT Press, 2000. Article (CrossRef Link)
[7] H. Kim et al., "Bayesian Gaussian Process Classification with the EM-EP Algorithm," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 28, No. 12, pp. 1948-1959, 2006. Article (CrossRef Link)
Wanhyun Cho received both a BSc and an MSc from the Department of Mathematics, Chonnam National University, Korea, in 1977 and 1981, respectively, and a PhD from the Department of Statistics, Korea University, Korea, in 1988. He is now teaching at Chonnam National University. His research interests are statistical modeling, pattern recognition, image processing, and medical image processing.

Sangkyoon Kim received a BSc, an MSc and a PhD in Electronics Engineering from Mokpo National University, Korea, in 1998, 2000 and 2015, respectively. From 2011 to 2015, he was a Visiting Professor in the Department of Information & Electronics Engineering, Mokpo National University, Korea. His research interests include image processing, pattern recognition and computer vision.

Soonyoung Park received a BSc in Electronics Engineering from Yonsei University, Korea, in 1982, and an MSc and PhD in Electrical and Computer Engineering from the State University of New York at Buffalo in 1986 and 1989, respectively. From 1989 to 1990 he was a Postdoctoral Research Fellow in the Department of Electrical and Computer Engineering at the State University of New York at Buffalo. Since 1990, he has been a Professor with the Department of Electronics Engineering, Mokpo National University, Korea. His research interests include image and video processing, image protection and authentication, and image retrieval techniques.

Copyrights © 2015 The Institute of Electronics and Information Engineers

IEIE Transactions on Smart Processing and Computing, vol. 4, no. 4, August 2015
http://dx.doi.org/10.5573/IEIESPC.2015.4.4.209

A Simulation Study on The Behavior Analysis of The Degree of Membership in Fuzzy c-means Method

Takeo Okazaki(1), Ukyo Aibara(1) and Lina Setiyani(2)
(1) Department of Information Engineering, University of the Ryukyus / Okinawa, Japan, okazaki@ie.u-ryukyu.ac.jp, ukyo@ms.ie.u-ryukyu.ac.jp
(2) Department of Information Engineering, Graduate School of Engineering and Science, University of the Ryukyus / Okinawa, Japan, tya.sachi@ms.ie.u-ryukyu.ac.jp

* Corresponding Author: Takeo Okazaki
Received July 15, 2015; Revised August 5, 2015; Accepted August 24, 2015; Published August 31, 2015
* Short Paper
* Extended from a Conference: Preliminary results of this paper were presented at ITC-CSCC 2015. This paper has been accepted by the editorial board through the regular reviewing process that confirms the original contribution.

Abstract: The fuzzy c-means method is a typical soft clustering method, and it requires a degree of membership that indicates the degree of belonging to each cluster at the time of clustering. Parameter values greater than 1 and less than 2 have been used by convention. Based on the proposed data-generation scheme and the simulation results, some behaviors of the degree of "fuzziness" were derived.

Keywords: Fuzzy c-means, Degree of membership, Numerical simulation, Correct ratio, Incorrect ratio

1. Introduction

Soft clustering is clustering that permits belonging to more than one cluster, whereas hard clustering requires belonging to just one cluster to provide a crisp classification. The fuzzy c-means (FCM) method [1, 2] is a typical soft clustering method, which estimates a membership value that indicates the degree of belonging to each cluster.
Since the parameter for the degree of "fuzziness" is included, it is necessary to provide its value at the time of clustering. In most of the traditional research, parameter values greater than 1 and less than 2 have been used with little theoretical explanation. In this study, we analyzed some behaviors of the degree of fuzziness by numerical simulations.

2. Fuzzy c-means Method

Given a finite set of n objects X = {x_1, ..., x_n} and the number of clusters c, we consider partitioning X into c clusters while allowing duplicate belonging. With the belonging coefficient u_{ki} (k: cluster id, i: object id), FCM aims to minimize the objective function Err(u, μ):

Err(u, μ) = \sum_{k=1}^{c} \sum_{i=1}^{n} (u_{ki})^m \| x_i - μ_k \|^2,    (1)

u_{ki} = \frac{1}{\sum_{j} \left( \| x_i - μ_k \| / \| x_i - μ_j \| \right)^{2/(m-1)}},    (2)

μ_k = \frac{\sum_{i=1}^{n} (u_{ki})^m x_i}{\sum_{i=1}^{n} (u_{ki})^m}.    (3)

Here, μ_k denotes each cluster center and m is the degree of fuzziness, with m > 1. The degree m corresponds to the level of cluster fuzziness: a larger m causes fuzzier clusters, whereas m = 1 indicates crisp classification. We need to determine the value of m at the time of clustering, and m = 2 has been applied by convention in the absence of domain knowledge.

3. Approach to Finding Properties of The Degree of Membership

Although we would like to find universal or mathematical properties of m, it is difficult to avoid the relation to data-specific characteristics. They may be found by considering a suitable m applied to various constructed data, that is, by designing a generation model to create various data and values with which to estimate an optimal m. In order to design the data model, the following indexes, which concern cluster relationships, were picked up:

▪ Distance between clusters (cluster placement)
▪ Number of clusters
▪ Shape of clusters
▪ Number of objects in each cluster

The meaning of "cluster" in distance and shape is the set of objects that gives the initial object placement for our experiments; it is not the target cluster. For distance, regular intervals give the typical placement, and we can arrange the number of cluster overlaps. For shape, the circle type is easy to handle because of density; however, the oval type requires consideration of bias. The procedure for data generation and cluster assignment is as follows.

[Step 1] Decide the center vector v_i for each cluster:

v_i = \left( d \cos\frac{2π i}{c}, \ d \sin\frac{2π i}{c} \right),    (4)

where d is the distance from the origin.

[Step 2] Generate normal random numbers for each cluster with mean vector v_i and covariance matrix E.

[Step 3] Calculate the coefficient p_{ik} that indicates that object x_k belongs to cluster i:

p_{ik} = \frac{1/(d_{ik} + 1)}{\sum_{j=1}^{c} 1/(d_{jk} + 1)},    (5)

where d_{ik} is the distance between the object and the cluster.

[Step 4] Calculate the mean \bar{p}_{ik} of normal random numbers with mean p_{ik} and standard deviation 1/(10c) for all objects. If \bar{p}_{ik} ≥ 1/c, then object x_k is deemed to belong to cluster i.

To obtain an optimal m, we need some indexes with which to evaluate the FCM results. Assume that Cn*_i is the number of objects belonging to cluster i in the input data, Cn_i is the number of objects belonging to cluster i in the results, and Cn+_i is the number of correct objects belonging to cluster i in the results. The correct ratio is used for overall suitability:

CR = \frac{\sum_{i=1}^{c} Cn^{+}_i}{\sum_{i=1}^{c} Cn^{*}_i}.    (6)

In order to analyze the accuracy of each cluster, the correct ratio inside a cluster denotes the rate at which objects that should belong to it are actually assigned to it:

CR_i^{inside} = \frac{Cn^{+}_i}{Cn^{*}_i}.    (7)

The correct ratio outside a cluster denotes the rate at which objects that do not belong to that cluster are not assigned to it:

CR_i^{outside} = \frac{n - Cn_i}{n - Cn^{*}_i}.    (8)

On the other hand, the incorrect ratios for each cluster can be used as evaluation indexes in a similar manner:

IR_i^{inside} = \frac{Cn_i - Cn^{+}_i}{Cn_i},    (9)

IR_i^{outside} = \frac{Cn^{*}_i - Cn^{+}_i}{Cn^{*}_i}.    (10)
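A compact reference implementation of Eqs. (1)-(3), useful for reproducing the role of m; the random initialization of the memberships, the stopping tolerance, and the toy two-blob data in the usage lines are assumptions, not part of the paper's experimental scheme.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, iters=100, tol=1e-6, seed=0):
    """Plain FCM: alternate the membership update (Eq. (2)) and the center update (Eq. (3))."""
    rng = np.random.default_rng(seed)
    n = len(X)
    U = rng.random((c, n))
    U /= U.sum(axis=0)                                   # random initial memberships
    for _ in range(iters):
        centers = (U ** m) @ X / (U ** m).sum(axis=1, keepdims=True)       # Eq. (3)
        dist = np.linalg.norm(X[None, :, :] - centers[:, None, :], axis=-1) + 1e-12
        U_new = 1.0 / (dist ** (2.0 / (m - 1.0)) *
                       np.sum(dist ** (-2.0 / (m - 1.0)), axis=0))          # Eq. (2)
        if np.max(np.abs(U_new - U)) < tol:
            return U_new, centers
        U = U_new
    return U, centers

# Example: two well-separated blobs; a larger m gives fuzzier memberships (smaller maxima)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal((-3, 0), 1, (50, 2)), rng.normal((3, 0), 1, (50, 2))])
for m in (2.0, 4.0):
    U, _ = fuzzy_c_means(X, c=2, m=m)
    print(m, U.max(axis=0).mean())
```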
4. Evaluation Experiments

According to the strategy in Section 3, we designed the evaluation experiment scheme with the conditions shown in Table 1.

Table 1. Experimental conditions.
m: 1.1 ~ 7
c (number of clusters): 5 ~ 7
Cn*_i: 50 ~ 100
Distance between clusters: regular interval or biased placement
Shape of clusters: circle or oval

The results of the basic case with a regular interval and a circle shape at c = 5, Cn*_i = 50 are shown in Figs. 1 to 6.

Fig. 1. A case of input data with regular placement, c = 5, Cn*_i = 50.
Fig. 2. CR: correct ratio overall with regular placement, c = 5, Cn*_i = 50.
Fig. 3. CR_i^inside: correct ratio inside a cluster with regular placement, c = 5, Cn*_i = 50.
Fig. 4. CR_i^outside: correct ratio outside a cluster with regular placement, c = 5, Cn*_i = 50.
Fig. 5. IR_i^inside: incorrect ratio inside a cluster with regular placement, c = 5, Cn*_i = 50.
Fig. 6. IR_i^outside: incorrect ratio outside a cluster with regular placement, c = 5, Cn*_i = 50.

The correct ratio overall had a peak from m = 3.5 to m = 5.5, and a clear inflection point could be seen around m = 4 for each cluster evaluation index. The results of the basic case with a regular interval and a circle shape at c = 7, Cn*_i = 50 are shown in Figs. 7 to 12.

Fig. 7. A case of input data with regular placement, c = 7, Cn*_i = 50.
Fig. 8. CR: correct ratio overall with regular placement, c = 7, Cn*_i = 50.
Fig. 9. CR_i^inside: correct ratio inside a cluster with regular placement, c = 7, Cn*_i = 50.
Fig. 10. CR_i^outside: correct ratio outside a cluster with regular placement, c = 7, Cn*_i = 50.
Fig. 11. IR_i^inside: incorrect ratio inside a cluster with regular placement, c = 7, Cn*_i = 50.
Fig. 12. IR_i^outside: incorrect ratio outside a cluster with regular placement, c = 7, Cn*_i = 50.

The correct ratio overall had a peak at m = 3, and a clear inflection point could be seen around m = 3 for each cluster evaluation index. Cluster C7 was located at the center of the objects; therefore, C7 was error prone, and its evaluation values were bad compared with the other six clusters. The results of the modified case with biased placement and a circle shape at c = 5, Cn*_i = 100 are shown in Figs. 13 to 18.

Fig. 13. A case of input data with biased placement, c = 5, Cn*_i = 100.
Fig. 14. CR: correct ratio overall with biased placement, c = 5, Cn*_i = 100.
Fig. 15. CR_i^inside: correct ratio inside a cluster with biased placement, c = 5, Cn*_i = 100.
Fig. 16. CR_i^outside: correct ratio outside a cluster with biased placement, c = 5, Cn*_i = 100.
Fig. 17. IR_i^inside: incorrect ratio inside a cluster with biased placement, c = 5, Cn*_i = 100.
Fig. 18. IR_i^outside: incorrect ratio outside a cluster with biased placement, c = 5, Cn*_i = 100.
CRiinside : correct ratio inside a cluster with biased placement, c = 7, Cn*i = 100 Fig. 18. IRioutside : incorrect ratio outside a cluster with biased placement, c = 5, Cn*i = 100 . The correct ratio overall had a peak at m = 4.25, and a inflection area could be seen from m = 3 to m = 4 for each cluster evaluation index. Cluster C3 was located in isolation, therefore it could be distinguished stably. The results of the modified case with biased placement and a circle shape with c = 7, Cn*i = 100 are shown in Fig. Fig. 22. CRioutside : correct ratio outside a cluster with 19 to Fig. 24. biased placement, c = 7, Cn*i = 100 . 214 Okazaki et al.: A Simulation Study on The Behavior Analysis of The Degree of Membership in Fuzzy c-means Method Table 2. Comparison of clustering results. Category C1 Fig. 23. IRiinside : incorrect ratio inside a cluster with biased placement, c = 7, Cn*i = 100 . C2 C3 Fig. 24. IRioutside : incorrect ratio outside a cluster with biased placement, c = 7, Cn*i = 100 . The correct ratio overall had a peak at m = 4.25, and a clear inflection point could be seen around m = 4 for each cluster evaluation index. These values were larger than those from regular placement. Cluster C4 and C7 were located at the center of the objects; therefore, these were error prone. Cluster C1 was located apart from other objects, and could be distinguished stably. In a limited number of clusters, both regular and biased placement cases showed that the optimal m was larger. The optimal m for biased placement was larger than those for regular placement. A value of 3 or more for m was valid when the number of clusters was 7 or less. 5. Application to Motor Car Type Classification We confirmed the validity of the experimental results through application of actual data from motor car road tests [4]. The 32 cars had 5 variables, such as fuel consumption, amount of emissions, horsepower, vehicle weight and 1/4 mile time. Because of the data description, we assumed four clusters: big sedan, midsize sedan, small sedan and sports car. The results of FCM for the conventional m = 2 and proposal m = 4 are shown in Table 2 and Fig. 25. Blue lines correspond to m = 2, and red lines correspond to m = 4. Black line categories have no difference between m = 2 and m = 4. 
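The comparison reported in Table 2 amounts to running the alternating updates of Eqs. (2) and (3) with two values of m and then thresholding the resulting memberships, so a minimal NumPy sketch of that procedure is given here. The function names, the random placeholder data, and the 1/c membership threshold (borrowed from the assignment rule of Section 3) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def fcm(X, c, m, n_iter=200, tol=1e-6, seed=0):
    """Fuzzy c-means: returns memberships u (c x n) and centers mu (c x d).
    Alternates the updates of Eqs. (2) and (3) that minimize Eq. (1)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    u = rng.random((c, n))
    u /= u.sum(axis=0, keepdims=True)            # memberships of each object sum to 1
    for _ in range(n_iter):
        um = u ** m
        mu = (um @ X) / um.sum(axis=1, keepdims=True)                 # Eq. (3)
        d = np.linalg.norm(X[None, :, :] - mu[:, None, :], axis=2)    # c x n distances
        d = np.fmax(d, 1e-12)                                         # guard against zero distance
        u_new = 1.0 / np.sum((d[:, None, :] / d[None, :, :]) ** (2.0 / (m - 1.0)), axis=1)  # Eq. (2)
        if np.max(np.abs(u_new - u)) < tol:
            u = u_new
            break
        u = u_new
    return u, mu

def crisp_assignment(X, c, m):
    """Threshold memberships at 1/c, allowing one object to join several clusters
    (an assumed rule mirroring Step 4 of Section 3)."""
    u, _ = fcm(X, c, m)
    return u >= (1.0 / c)      # boolean c x n matrix

# Placeholder standing in for the 32 x 5 car feature matrix of Section 5.
X = np.random.default_rng(1).normal(size=(32, 5))
for m in (2.0, 4.0):
    belong = crisp_assignment(X, c=4, m=m)
    print(f"m = {m}: objects per cluster =", belong.sum(axis=1))
```

With a larger m the memberships flatten, so more objects clear the 1/c threshold in several clusters at once; that is the kind of increased overlap visible when comparing the m = 2 and m = 4 columns of Table 2.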
Table 2. Comparison of clustering results.
C1, m = 2: Datsun 710, Merc 240D, Merc 230, Fiat 128, Honda Civic, Toyota Corolla, Toyota Corona, Fiat X1-9, Porsche 914-2, Lotus Europa, Volvo 142E
C1, m = 4: Datsun 710, Merc 240D, Merc 230, Fiat 128, Honda Civic, Toyota Corolla, Toyota Corona, Fiat X1-9, Porsche 914-2, Lotus Europa, Volvo 142E
C2, m = 2: Hornet Sportabout, Duster 360, Cadillac Fleetwood, Lincoln Continental, Chrysler Imperial, Camaro Z28, Pontiac Firebird, Ford Pantera L, Maserati Bora
C2, m = 4: Hornet Sportabout, Duster 360, Cadillac Fleetwood, Lincoln Continental, Chrysler Imperial, Camaro Z28, Pontiac Firebird, Ford Pantera L, Maserati Bora
C3, m = 2: Hornet 4 Drive, Hornet Sportabout, Merc 450SE, Merc 450SL, Merc 450SLC, Dodge Challenger, AMC Javelin, Camaro Z28, Ford Pantera L, Maserati Bora
C3, m = 4: Hornet 4 Drive, Hornet Sportabout, Valiant, Duster 360, Merc 450SE, Merc 450SL, Merc 450SLC, Dodge Challenger, AMC Javelin, Camaro Z28, Pontiac Firebird, Ford Pantera L, Maserati Bora
C4, m = 2: Mazda RX4, Mazda RX4 Wag, Hornet 4 Drive, Valiant, Merc 240D, Merc 230, Merc 280, Merc 280C, Toyota Corona, Ferrari Dino, Volvo 142E
C4, m = 4: Mazda RX4, Mazda RX4 Wag, Datsun 710, Hornet 4 Drive, Valiant, Merc 240D, Merc 230, Merc 280, Merc 280C, Toyota Corona, Porsche 914-2, Lotus Europa, Ferrari Dino, Volvo 142E

Fig. 25. Mapping of clustering results.

The cluster placement was biased, and the results for m = 4 gave a more appropriate classification with respect to the original car descriptions.

6. Conclusion

For typical soft clustering (FCM), the degree of membership plays an important role. Parameter values greater than 1 and less than 2 have been used by convention with little theoretical explanation. We analyzed the behavior of the parameter with simulation studies. The results showed the relations between the optimum value and the cluster placements or the number of clusters. We showed that at least a larger value than the conventional one is suitable. It is clear that a lower m provides a conservative decision that does not allow much overlap among the clusters. For the correct ratio inside a cluster and the incorrect ratio outside a cluster, a smaller m is desirable; however, a larger m is desirable for the correct ratio outside a cluster and the incorrect ratio inside a cluster. Judging from a multi-faceted perspective, the optimal m should be larger than the conventional value. As a future issue for research, we need to investigate more sophisticated features of the parameter by using a greater variety of data generation.
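As a reference point for such follow-up data-generation experiments, a compact NumPy sketch of the generation model of Section 3 (Eqs. (4) and (5) with the 1/c assignment rule) and of the evaluation indexes of Eqs. (6)-(10) is given below. The function names, the isotropic unit covariance, and the single random draw standing in for the averaged \bar{p}_{ik} are illustrative assumptions rather than the authors' exact setup.

```python
import numpy as np

def generate_clusters(c, n_per_cluster, d=5.0, sigma=1.0, seed=0):
    """Steps 1-2: place cluster centers on a circle of radius d (Eq. (4)),
    then draw Gaussian objects around each center (identity covariance assumed)."""
    rng = np.random.default_rng(seed)
    centers = np.array([[d * np.cos(2 * np.pi * i / c),
                         d * np.sin(2 * np.pi * i / c)] for i in range(c)])
    X = np.vstack([rng.normal(loc=centers[i], scale=sigma, size=(n_per_cluster, 2))
                   for i in range(c)])
    return X, centers

def true_membership(X, centers, c, seed=0):
    """Steps 3-4: p_ik from inverse distances (Eq. (5)), perturbed with
    standard deviation 1/(10c); a single draw stands in for the averaged value.
    Object k is deemed to belong to cluster i when the result exceeds 1/c."""
    rng = np.random.default_rng(seed)
    dist = np.linalg.norm(centers[:, None, :] - X[None, :, :], axis=2)   # c x n
    p = (1.0 / (dist + 1.0)) / np.sum(1.0 / (dist + 1.0), axis=0, keepdims=True)
    p_bar = rng.normal(loc=p, scale=1.0 / (10 * c))
    return p_bar >= 1.0 / c                                              # boolean c x n

def cluster_ratios(truth, result):
    """Eqs. (6)-(10): overall correct ratio plus per-cluster (in)correct ratios,
    for boolean membership matrices 'truth' (input data) and 'result' (FCM output)."""
    n = truth.shape[1]
    cn_star = truth.sum(axis=1)                 # Cn*_i
    cn = result.sum(axis=1)                     # Cn_i
    cn_plus = (truth & result).sum(axis=1)      # Cn+_i
    cr = cn_plus.sum() / cn_star.sum()                      # Eq. (6)
    cr_in = cn_plus / cn_star                               # Eq. (7)
    cr_out = (n - cn) / (n - cn_star)                       # Eq. (8)
    ir_in = (cn - cn_plus) / cn                             # Eq. (9)
    ir_out = (cn_star - cn_plus) / cn_star                  # Eq. (10)
    return cr, cr_in, cr_out, ir_in, ir_out

X, centers = generate_clusters(c=5, n_per_cluster=50)
truth = true_membership(X, centers, c=5)
# 'result' would come from thresholding the FCM memberships for each candidate m.
```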
References

[1] J. C. Dunn, "A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters", Journal of Cybernetics, Vol.3, pp.32-57, 1974. Article (CrossRef Link)
[2] J. C. Bezdek, "Pattern Recognition with Fuzzy Objective Function Algorithms", Plenum Press, New York, 1981. Article (CrossRef Link)
[3] S. Miyamoto, K. Umayahara and M. Mukaidono, "Fuzzy Classification Functions in the Methods of Fuzzy c-Means and Regularization by Entropy", Journal of Japan Society for Fuzzy Theory and Intelligent Informatics, Vol.10, No.3, pp.548-557, 1998. Article (CrossRef Link)
[4] H. Henderson and P. Velleman, "Building multiple regression models interactively", Biometrics, Vol.37, pp.391-411, 1981. Article (CrossRef Link)
[5] S. Hotta and K. Urahama, "Retrieval of Videos by Fuzzy Clustering", Image Information and Television Engineers Journal, Vol.53, No.12, pp.1750-1755, 1999. Article (CrossRef Link)
[6] L. Bobrowski and J. C. Bezdek, "c-means clustering with the L1 and L-infinity norms", IEEE Transactions on Systems, Man and Cybernetics, Vol.21, No.3, pp.545-554, 1991. Article (CrossRef Link)
[7] R. J. Hathaway, J. C. Bezdek and W. Pedrycz, "A parametric model for fusing heterogeneous fuzzy data", IEEE Transactions on Fuzzy Systems, Vol.4, No.3, pp.270-281, 1996. Article (CrossRef Link)

Copyrights © 2015 The Institute of Electronics and Information Engineers

Takeo Okazaki is Associate Professor of Information Engineering at University of the Ryukyus, Japan. He received his B.Sci. and M.Sci. degrees in Algebra and Mathematical Statistics from Kyushu University, Japan, in 1987 and 1989, respectively, and his Ph.D. in Information Engineering from University of the Ryukyus in 2014. He was a research assistant at Kyushu University from 1989 to 1995. He has been an assistant professor at University of the Ryukyus since 1995. His research interests are statistical data normalization for analysis and statistical causal relationship analysis. He is a member of JSCS, IEICE, JSS, GISA, and BSJ Japan.

Ukyo Aibara received his B.Eng. degree in Information Engineering from University of the Ryukyus, Japan, in 2015. In 2013, he graduated from National Institute of Technology, Kumamoto College. In his graduation research and thesis he discussed the evaluation of the performance and characteristics of soft clustering. In particular, he investigated a variety of applications of the fuzzy c-means method and developed an enhancement package for the statistical language R.

Lina Setiyani is a Master's course student of Information Engineering at University of the Ryukyus, Japan. She received her B.Comp. degree in Information Engineering from Janabadra University, Yogyakarta, Indonesia, in 2007. Her graduation research theme was an SMS-based information system for student admission at Janabadra University. Her research interest is finding optimal solutions to problems with genetic algorithms using the Java programming language.

IEIE Transactions on Smart Processing and Computing, vol. 4, no. 4, August 2015
http://dx.doi.org/10.5573/IEIESPC.2015.4.4.216

Development of Visual Odometry Estimation for an Underwater Robot Navigation System

Kandith Wongsuwan and Kanjanapan Sukvichai
Department of Electrical Engineering, Kasetsart University / Chatuchak, Bangkok, Thailand
kandithws@yahoo.com, fengkpsc@ku.ac.th
* Corresponding Author: Kandith Wongsuwan
Received July 15, 2015; Revised August 5, 2015; Accepted August 24, 2015; Published August 31, 2015
* Regular Paper: This paper reviews the recent progress, possibly including previous works in a particular research topic, and has been accepted by the editorial board through the regular reviewing process.
* Extended from a Conference: Preliminary results of this paper were presented at the ITC-CSCC 2015. The present paper has been accepted by the editorial board through the regular reviewing process that confirms the original contribution.

Abstract: The autonomous underwater vehicle (AUV) is being widely researched in order to achieve superior performance when working in hazardous environments. This research focuses on using image processing techniques to estimate the AUV's egomotion and the changes in orientation, based on image frames from different times captured from a single high-definition web camera attached to the bottom of the AUV. A visual odometry application is integrated with other sensors. An inertial measurement unit (IMU) sensor is used to determine a correct set of answers corresponding to a homography motion equation.
A pressure sensor is used to resolve image scale ambiguity. Uncertainty estimation is computed to correct drift that occurs in the system by using a Jacobian method, singular value decomposition, and backward and forward error propagation.

Keywords: Underwater robot, Visual odometry, Monocular odometry, AUVs, Robot navigation

1. Introduction

The underwater autonomous vehicle (AUV) is still in development but aims to be effective when working in the industrial field. To create an autonomous robot, one of the important things is a strategy to autonomously navigate the robot to desired destinations. Several techniques are used to estimate its motion by using imaging sonar or a Doppler velocity log (DVL). Because the cost per sensor device is extremely high, an alternative for AUV navigation is implemented in this research by using a visual odometry concept, which is normally used in mobile robots.

In our design procedure, the monocular visual odometry estimation was done by using a single high-definition camera attached to the bottom of the robot, grabbing image sequences at different times and calculating the robot's movement from the changes between two images. Assuming that the roll, pitch and depth of the robot in relation to the floor of the testing field are known, the monocular visual odometry concept is designed as seen in Fig. 1.

Fig. 1. The monocular visual odometry concept.

2. Odometry Estimation via Homography

The implementation is based on a single pin-hole camera. The Shi-Tomasi method [1] and Lucas-Kanade pyramidal optical flow [2] are used in order to estimate the homography between images taken at different times. Optical flow is implemented in OpenCV for the feature matching algorithm. A random sample consensus (RANSAC) method is used to eliminate any feature outliers. Let the estimated projective homography between frames be H_{12}, and let the camera intrinsic parameter matrix be A_1. The calibrated homography is then given by Eq. (1):

H^{c}_{12} = A_1^{-1} H_{12} A_1 = R_2 \left( I - \frac{t_2}{d} n_1^{T} \right)    (1)

where
- R_2 is the camera's rotation matrix,
- t_2 is the camera's translation vector,
- n_1 is a normal vector to the object (ground) plane, and
- d is the distance to the object (ground) plane in meters.

In advance, every image frame is adjusted for its rotation about the x and y axes; since roll (\varphi) and pitch (\theta) are known, a compensating homography is applied to the image. Setting t_2 = [0\; 0\; 0]^T in Eq. (1), the compensating homography can be written as Eq. (2):

H = A_1 R_{\varphi} R_{\theta} A_1^{-1}    (2)

From Eq. (1), two sets of solutions can be obtained: S_1 = \{R_2^1, t_2^1, n_1^1\} and S_2 = \{R_2^2, t_2^2, n_1^2\}. The criterion for choosing the correct solution is that, as both frames are compensated onto the same plane (plane normal vector directed toward the camera), the roll (\varphi) and pitch (\theta) of the correct rotation matrix (either R_2^1 or R_2^2) must be approximately zero, which identifies the correct set of answers, as shown in Eq. (3):

S_i = \underset{i}{\mathrm{argmin}} \left( \varphi_i^2 + \theta_i^2 \right), \quad i = 1, 2    (3)

3. Covariance Matrix Estimation

An estimate of the odometry covariance matrix is needed in order to determine the uncertainty occurring in the system. To estimate the uncertainty of the rotation about the z-axis, or yaw (\psi), and of the horizontal translation (x, y), there are two steps, as described in Sections 3.1 and 3.2.

3.1 Homography Covariance Matrix

First, backward error propagation is used to find the homography error covariance matrix [3]:

\Sigma_{p'} = \left( \left( \frac{\partial p'}{\partial (h_{ij}, x_i)} \right)^{T} \Sigma_{x'}^{-1} \frac{\partial p'}{\partial (h_{ij}, x_i)} \right)^{\dagger}    (4)

where
- p' is the vector of matched feature points in the second image,
- \partial p' / \partial (h_{ij}, x_i) is the Jacobian matrix of p',
- \dagger denotes the pseudo-inverse, and
- \Sigma is a covariance matrix.

For p'_i and each of its elements in the Jacobian \partial p' / \partial (h_{ij}, x_i), if the estimated point is normalized in the image plane, the estimated homography is affine in all cases (h_{31} = h_{32} = 0 and h_{33} = 1). So we have

\hat{x}'_i = h_{11} \hat{x}_i + h_{12} \hat{y}_i + h_{13}, \quad \hat{y}'_i = h_{21} \hat{x}_i + h_{22} \hat{y}_i + h_{23}    (5)

By taking partial derivatives of Eq. (5), the Jacobian elements are:

\partial \hat{x}'_i / \partial h_{11} = \hat{x}_i, \quad \partial \hat{x}'_i / \partial h_{12} = \hat{y}_i, \quad \partial \hat{x}'_i / \partial h_{13} = 1
\partial \hat{y}'_i / \partial h_{21} = \hat{x}_i, \quad \partial \hat{y}'_i / \partial h_{22} = \hat{y}_i, \quad \partial \hat{y}'_i / \partial h_{23} = 1
\partial \hat{x}'_i / \partial \hat{x}_i = h_{11}, \quad \partial \hat{x}'_i / \partial \hat{y}_i = h_{12}, \quad \partial \hat{y}'_i / \partial \hat{x}_i = h_{21}, \quad \partial \hat{y}'_i / \partial \hat{y}_i = h_{22}    (6)

All other elements of the Jacobian are 0, since we assume that each pixel feature is independent. In this case, the variance of the error that occurs in the matching algorithm is assumed to be less than 1 pixel, and there is no error in first-image acquisition. A 6-sigma rule is used to determine the variance of every single pixel when all of them are independent.

Then, the Jacobian of the SVD is computed by using Eq. (7) in order to solve another layer of backward error propagation:

\frac{\partial H^{c}_{12}}{\partial h_{ij}} = \frac{\partial U}{\partial h_{ij}} D V^{T} + U \frac{\partial D}{\partial h_{ij}} V^{T} + U D \frac{\partial V^{T}}{\partial h_{ij}}    (7)

From Eq. (7), the equation is solved as described by Papadopoulo and Lourakis [4], as follows:

\frac{\partial \lambda_k}{\partial h_{ij}} = u_{ik} v_{jk}, \quad k = 1, 2, 3    (8)

\lambda_l \Omega^{U}_{ij,kl} + \lambda_k \Omega^{V}_{ij,kl} = u_{ik} v_{jl}
\lambda_k \Omega^{U}_{ij,kl} + \lambda_l \Omega^{V}_{ij,kl} = - u_{il} v_{jk}    (9)

where the indexes run over every pair (k, l) with 1 \le k < l \le 3. Since \lambda_k \neq \lambda_l, the 2x2 equation system (9) has a unique solution that is practically obtained with Cramer's rule:

\Omega^{U}_{ij,kl} = \frac{\lambda_l u_{ik} v_{jl} + \lambda_k u_{il} v_{jk}}{\lambda_l^2 - \lambda_k^2}    (10)

\Omega^{V}_{ij,kl} = - \frac{\lambda_k u_{ik} v_{jl} + \lambda_l u_{il} v_{jk}}{\lambda_l^2 - \lambda_k^2}    (11)

And finally, we obtain the Jacobians of U and V from

\frac{\partial U}{\partial h_{ij}} = U \Omega^{U}_{ij}    (12)

\frac{\partial V}{\partial h_{ij}} = - V \Omega^{V}_{ij}    (13)

From Eqs. (8), (12), and (13), \partial U / \partial h_{ij}, \partial D / \partial h_{ij} and \partial V^{T} / \partial h_{ij} can be obtained, where U, D, V are the results of applying the SVD to H^{c}_{12}. Their covariance matrices \Sigma_U, \Sigma_D, \Sigma_V can be computed via forward error propagation as follows:

\Sigma_U = J_U \Sigma_H J_U^{T}, \quad \Sigma_D = J_D \Sigma_H J_D^{T}, \quad \Sigma_V = J_V \Sigma_H J_V^{T}    (14)

where J_U, J_D, J_V are the Jacobians of U, D, V, respectively. From Eq. (14), the covariance matrix of U, V, D is \Sigma_{U,V,D}, concatenated together as

diag(\Sigma_{U,V,D}) = \left[ \sigma^2_{u_{11}} \cdots \sigma^2_{u_{33}} \;\; \sigma^2_{v_{11}} \cdots \sigma^2_{v_{33}} \;\; \sigma^2_{\lambda_1} \; \sigma^2_{\lambda_2} \; \sigma^2_{\lambda_3} \right]    (15)

In this case, we assume that the parameters in the U, V, D matrices are independent, so when we propagate the values from Eq. (14), we force the other elements to be 0.

3.2 Rotation Matrix and Translation Vector Covariance Matrix Estimation

For the rotation matrix, recall that in this paper we are only interested in yaw (\psi), which can be determined from the camera's frame rotation matrix R_2, computed from

R_2 = U M V = U \begin{bmatrix} \alpha & 0 & \beta \\ 0 & 1 & 0 \\ -s\beta & 0 & s\alpha \end{bmatrix} V

where
- s = det(U) det(V),
- \delta = \pm \sqrt{ (\lambda_1^2 - \lambda_2^2) / (\lambda_2^2 - \lambda_3^2) },
- \alpha = (\lambda_1 + s \lambda_3 \delta^2) / (\lambda_2 (1 + \delta^2)), and
- \beta = \pm \sqrt{1 - \alpha^2}, with the sign chosen such that sgn(\beta) = -sgn(\delta).

The yaw of R_2 can be obtained as

\psi = \arctan\left( \frac{r_{21}}{r_{11}} \right)    (16)

where the rotation matrix elements r_{11}, r_{21} can be derived from Eq. (17):

r_{11} = v_{11} (\alpha u_{11} - s\beta u_{13}) + v_{12} u_{12} + v_{13} (\beta u_{11} + s\alpha u_{13})
r_{21} = v_{21} (\alpha u_{11} - s\beta u_{13}) + v_{22} u_{12} + v_{23} (\beta u_{11} + s\alpha u_{13})    (17)

To determine the yaw uncertainty, we apply forward propagation to Eq. (17), where the Jacobian is retrieved by taking the partial derivative of the yaw with respect to the U, V, D parameters. Therefore,

J_{\psi} = \frac{\partial \psi}{\partial (U, V, D)} = \frac{1}{1 + a^2} \frac{\partial a}{\partial (U, V, D)}    (18)

where a = r_{21} / r_{11}. Finally, the covariance of the yaw is obtained from forward error propagation:

\sigma_{\psi}^2 = J_{\psi} \Sigma_{U,V,D} J_{\psi}^{T}    (19)

For the translation covariance, from Eq. (1) the translation vector can be obtained as

t_2 = \frac{1}{\omega} \left( -\beta u_1 + \left( \frac{\lambda_3}{\lambda_2} - s\alpha \right) u_3 \right)    (20)

where t_2 = [x\; y\; z]^T and \omega is a scale factor of the normal vector. Since \omega is the factor that scales the normal vector so that \| n_1 \| = 1, we have

\omega = \frac{1}{\sqrt{n_x^2 + n_y^2 + n_z^2}}    (21)

By substituting Eq. (21) into Eq. (20), we get the full camera translation vector:

t_2 = \sqrt{n_x^2 + n_y^2 + n_z^2} \left( -\beta u_1 + \left( \frac{\lambda_3}{\lambda_2} - s\alpha \right) u_3 \right)    (22)

The translation vector Jacobian matrix J_t can be derived by taking partial derivatives with respect to all 21 parameters (all elements of U, diag(D), and V) that have a dependency in Eq. (22). Finally, we apply forward propagation and obtain

\Sigma_t = J_t \Sigma_{U,V,D} J_t^{T}    (23)

4. Experimental Procedure and Results

In order to implement our visual odometry algorithm, the OpenCV library (in C++) is used for image processing. The visual odometry component was first tested on a prototype frame to which the selected camera (Logitech C920) is attached, as shown in Fig. 2. A Vectornav VN-100 (Fig. 3) inertial measurement unit was chosen to measure the prototype frame orientation. The IMU specification is described in Table 1; we used fused data from a gyroscope, an accelerometer and a magnetometer.

Fig. 2. Prototype frame for testing visual odometry.
Fig. 3. Vectornav VN-100 IMU.

Table 1. Vectornav VN-100 IMU Specification.
Specification                            Value
Range: Heading, Roll                     ±180°
Range: Pitch                             ±90°
Static Accuracy (Heading, Magnetic)      2.0° RMS
Static Accuracy (Pitch/Roll)             0.5° RMS
Dynamic Accuracy (Heading, Magnetic)     2.0° RMS
Dynamic Accuracy (Pitch/Roll)            1.0° RMS
Angular Resolution                       < 0.05°
Repeatability                            < 0.2°

Leaves and rocks were selected as the objects in order to simulate a real underwater environment. They are well suited to real-time feature tracking and give good tracking results. The tracked features on leaves and rocks are displayed in Fig. 4. The overall implementation procedure for the monocular visual odometry is summarized in Fig. 5.

Fig. 4. Real-time feature tracking using Lucas-Kanade pyramidal optical flow on a prototype frame.
Fig. 5. Implementation procedure for visual odometry.
Fig. 6. Visual odometry translation error.
Fig. 7. Visual odometry experimental results (trajectory).

For system integration, all of the software components are run on the robot operating system (ROS). ROS is a middleware, or framework, for robot software development. Instead of programming every single module in one project or process, ROS provides tools and libraries for inter-process communication using a publish-and-subscribe mechanism with socket servers that handle all messages, parameters and services that occur in the system, which supports Python and C++.
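In these terms, the quantities derived in Section 3 (the planar translation, the yaw, and their covariances from Eqs. (19) and (23)) are exactly what a navigation consumer would subscribe to, so a minimal rospy sketch of such a publisher is given below. The node and topic names, the message rate, and the placeholder values are illustrative assumptions, not the authors' integration code.

```python
#!/usr/bin/env python
# Minimal sketch: publishing the visual-odometry pose and covariance over ROS.
import rospy
import numpy as np
from geometry_msgs.msg import PoseWithCovarianceStamped
from tf.transformations import quaternion_from_euler

def publish_pose(pub, x, y, z, yaw, sigma_t, sigma_yaw_sq):
    """sigma_t: 3x3 translation covariance (Eq. (23));
    sigma_yaw_sq: yaw variance (Eq. (19))."""
    msg = PoseWithCovarianceStamped()
    msg.header.stamp = rospy.Time.now()
    msg.header.frame_id = "odom"                      # assumed frame name
    msg.pose.pose.position.x = x
    msg.pose.pose.position.y = y
    msg.pose.pose.position.z = z
    qx, qy, qz, qw = quaternion_from_euler(0.0, 0.0, yaw)   # roll/pitch already compensated
    msg.pose.pose.orientation.x = qx
    msg.pose.pose.orientation.y = qy
    msg.pose.pose.orientation.z = qz
    msg.pose.pose.orientation.w = qw
    cov = np.zeros((6, 6))                            # ordering: x, y, z, roll, pitch, yaw
    cov[:3, :3] = sigma_t                             # translation block
    cov[5, 5] = sigma_yaw_sq                          # yaw block
    msg.pose.covariance = cov.ravel().tolist()
    pub.publish(msg)

if __name__ == "__main__":
    rospy.init_node("visual_odometry_publisher")      # assumed node name
    pub = rospy.Publisher("vo/pose", PoseWithCovarianceStamped, queue_size=10)
    rate = rospy.Rate(20)                             # roughly one estimate per frame pair
    while not rospy.is_shutdown():
        # Placeholder values; in the real system these come from the homography
        # decomposition and the error propagation of Section 3.
        publish_pose(pub, 0.0, 0.0, 0.0, 0.0, np.eye(3) * 1e-4, 1e-4)
        rate.sleep()
```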
Moreover, ROS also links many useful libraries for robotics programming, such as OpenCV and OpenNI, and it provides some hardware driver packages (e.g., Dynamixel Servo) and a visualization package to use with ROS messages. Furthermore, ROS handles sensors, images and data flow in the system, which makes system integration easier. 4.1 Error Evaluation The system was tested over several iterations. In our experiment, bias of 15.6344% of translation was added to the system in order to compensate for the translation error. Percent error of translation, by using the LK optical flow algorithm, in the experiments is packed enough to compensate, as shown in Fig. 6. 4.2 Results The visual odometry algorithm was subjected to experimentation. The prototype frame was driven along the ground in order to create translation motion as a fixed trajectory. In the experiment, the prototype frame was turned clockwise 90 degrees, and then sent straight for a while; after that, we turned it back by 90 degrees. The real Fig. 8. Visual odometry experimental result (ψ ) . experimental results compared with the ground truth are shown in Fig. 7. The experimental results show that the estimated trajectory is close to the real translation trajectory. The second experiment was conducted in order to obtain the estimated yaw angle by using the proposed algorithm. The Vectornav VN-100 internal measurement unit (IMU) was used in the second experiment in order to obtain the yaw angle when the prototype frame is rotated. With the same trajectory as in Fig. 7, the results from the visual odometry estimation algorithm and from the real yaw angle from the IMU were compared and are shown in Fig. 8. The experimental results show that the estimated trajectory is also close to the real translation trajectory, even when the frame is rotated. The covariance matrix estimation of the visual odometry algorithm was obtained from a real experiment in real time. As we calculated them frame by frame, each output parameter variance is shown compared with x, y, yaw output from visual odometry estimation in Fig. 9, Figs. 10 and 11, respectively. Note that we scale the value of yaw so that it can be seen in relation to the value transition and its covariance value. In addition, we applied our monocular visual odometry algorithm to a video data log of a real underwater robot. IEIE Transactions on Smart Processing and Computing, vol. 4, no. 4, August 2015 Fig. 9. X variance estimation from visual odometry results. Fig. 10. Y variance estimation from visual odometry results. Fig. 11. Yaw variance estimation from visual odometry results. The data log was collected from a Robosub 2014 competition organized by the Association for Unmanned Vehicle Systems International (AUVSI), which was an international competition. By using the ROS, IMU data and barometer data was also collected with time synchronization with the video data log so that we could apply our algorithm. Results are shown in Fig. 12, which demonstrates that our algorithm can be used in a hazardous underwater environment with good performance. Despite unavailability of the real ground truth of the competition field, we could still estimate the robot displacement using a GPS and the competition field plan given by AUVSI. The displacement from the AUV deployment point to where it stopped is about 8.5 meters, which is shown in Fig. 12, such that our algorithm could estimate AUV trajectory. All of the experiments were well tested in a prototype Fig. 12. 
Final trajectory result of the data log. 221 222 Wongsuwan et al.: Development of Visual Odometry Estimation for an Underwater Robot Navigation System Fig. 13. Bird’s eye view of testing the AUV in the Robosub 2014 competition. Fig. 14. Visual odometry from the Robosub 2014 competition data log. IEIE Transactions on Smart Processing and Computing, vol. 4, no. 4, August 2015 Fig. 15. Autonomous underwater robot. frame. In practice, it will be implemented and tested “in real time” in our designed autonomous underwater robot, as shown in Fig. 15. 5. Conclusion A monocular visual odometry estimation was implemented and tested in a prototype frame by looking downward to the ground plane and compensating every input image using pitch and roll from an IMU to guarantee that the input features were not distorted by the camera’s direction. With Lucas-Kanade pyramidal optical flow, the tracked Shi-Tomasi method on the ground can be calculated between frame homography. After the homography is decomposed and the two sets of answers are obtained, the criteria for choosing them was explained. In order to control robot navigation, covariance of the visual odometry output, x y and yaw, is computed by using back and forward error propagation and Jacobian matrix and all mathematical derivatives are explained. The experimental visual odometry algorithm was tested, showed good results, and will later be implemented on a real underwater robot. 223 1981. Article (CrossRef Link) [3] R. Hartley, A. Zisserman, “Multiple View Geometry in Computer Vision Second Edition”. Cambridge University Press, March 2004. Article (CrossRef Link) [4] Papadopoulo, T., Lourakis, M.I.A., “Estimating the jacobian of the singular value decomposition:theory and applications”. In: Proceedings of the 2000 European Conference on Computer Vision,vol. 1, pp. 554–570 (2000) Article (CrossRef Link) [5] F. Caballero, L. Merino, J. Ferruz, A. Ollero. “Vision-Based Odometry and SLAM for Medium and High Altitude Flying UAVs”. J Intell Robot Syst, June 2008. Article (CrossRef Link) [6] P. Drews, G.L. Oliveira and M. da Silva Figueiredo. Visual odometry and mapping for Underwater Autonomous Vehicles In Robotics Symposium (LARS), 2009 6th Latin American Article (CrossRef Link) [7] D. Scaramuzzaand and F. Fraundorfer. Visual Odometry [Tutorial] In Robotics & Automation Magazine, IEEE (Volume:18, Issue: 4) Article (CrossRef Link) [8] E. Malis and M. Vargas. Deeper understanding of the homography decomposition for vision-based control. [Research Report] RR-6303, 2007, pp.90. [9] Article (CrossRef Link) Kandith Wongsuwan received his BSc in Electrical Engineering from Kasetsart University (KU), Bangkok, Thailand, in 2015. He is now a team leader of the SKUBA robot team, which was a fiveyear world champion in world robocup small-size robot competition. Furthermore, he is experienced in autonomous robot competition, as well, such as the @Home robot, which is an autonomous servant robot, and the autonomous underwater vehicle (AUV). His current research interests include computer vision, robotics, signal processing, image processing and embedded systems application. Acknowledgement This research was supported by the Department of Electrical Engineering in the Faculty of Engineering, Kasetsart University, as part of courses 01205491 and 01205499, Electrical Engineering Project I & II. Finally, the authors want to express sincere gratitude to Mr. 
Somphop Limsoonthrakul, a doctoral student from the Asian Institute of Technology, Prathum-Thani, Thailand, who has always been a supporter of this research. References [1] Shi, Jianbo, and Carlo Tomasi. "Good features to track." Computer Vision and Pattern Recognition, 1994. Proceedings CVPR'94., 1994 IEEE Computer Society Conference on. IEEE, 1994 Article (CrossRef Link) [2] B. D. Lucas, T. Kanade. An iterative image registration technique with an application to stereo vision. IJCAI, Copyrights © 2015 The Institute of Electronics and Information Engineers Kanjanapan Sukvichai received a BSc with first class honours in electrical engineering from Kasetsart University, Thailand, an MSc in electrical and computer engineering from University of New Haven, CT, U.S.A., in 2007 and a DEng in Mechatronics from the Asian Institute of Technology, Thailand, in 2014. He worked as a Professor in the Department of Electrical Engineering, Faculty of Engineering, Kasetsart University, Thailand. He is the advisor to the SKUBA robot team. He served on the Executive Committee, Organizing Committee and Technical Committee of the Small Size Robot Soccer League, which is one division in the RoboCup organization, from 2009 to 2014. His current research interests include multi-agent autonomous mobile robot cooperation, underwater robots, robot AI, machine vision, machine learning, robot system design, robot system integration and control of an unstable robot system. IEIE Transactions on Smart Processing and Computing, vol. 4, no. 4, August 2015 http://dx.doi.org/10.5573/IEIESPC.2015.4.4.224 224 IEIE Transactions on Smart Processing and Computing Real-time Full-view 3D Human Reconstruction using Multiple RGB-D Cameras Bumsik Yoon1, Kunwoo Choi2, Moonsu Ra2, and Whoi-Yul Kim2 1 2 Department of Visual Display, Samsung Electronics / Suwon, South Korea bsyoon@samsung.com Department of Electronics and Computer Engineering, Hanyang University / Seoul, South Korea {kwchoi, msna}@vision.hanyang.ac.kr, wykim@hanyang.ac.kr * Corresponding Author: Whoi-Yul Kim Received July 15, 2015; Revised August 5, 2015; Accepted August 24, 2015; Published August 31, 2015 * Short Paper * Extended from a Conference: Preliminary results of this paper were presented at the ITC-CSCC, Summer 2015. The present paper has been accepted by the editorial board through the regular reviewing process that confirms the original contribution. Abstract: This manuscript presents a real-time solution for 3D human body reconstruction with multiple RGB-D cameras. The proposed system uses four consumer RGB/Depth (RGB-D) cameras, each located at approximately 90° from the next camera around a freely moving human body. A single mesh is constructed from the captured point clouds by iteratively removing the estimated overlapping regions from the boundary. A cell-based mesh construction algorithm is developed, recovering the 3D shape from various conditions, considering the direction of the camera and the mesh boundary. The proposed algorithm also allows problematic holes and/or occluded regions to be recovered from another view. Finally, calibrated RGB data is merged with the constructed mesh so it can be viewed from an arbitrary direction. The proposed algorithm is implemented with general-purpose computation on graphics processing unit (GPGPU) for real-time processing owing to its suitability for parallel processing. Keywords: 3D reconstruction, RGB-D, Mesh construction, Virtual mirror 1. 
Introduction Current advances in technology require three-dimensional (3D) information in many daily life applications, including multimedia, games, shopping, augmented-reality, and many other areas. These applications analyze 3D information and provide a more realistic experience for users. Rapid growth of the 3D printer market also directs aspects of practical use for 3D reconstruction data. However, 3D reconstruction is still a challenging task. The 3D reconstruction algorithms for given point clouds can be classified according to spatial subdivision [1]: surface-oriented algorithms [2, 3], which do not distinguish between open and closed surfaces; and volumeoriented algorithms [4, 5], which work in particular with closed surfaces and are generally based on Delaunay tetrahedralization of the given set of sample points. Surface-oriented methods have advantages, such as the ability to reuse the untouched depth map and to rapidly reconstruct the fused mesh. In this paper, 3D reconstruction is proposed by fusing multiple 2.5D data, captured by multiple RGB/Depth (RGB-D) cameras, specifically with the Microsoft Kinect [6] device. The use of multiple capturing devices for various applications means they can concurrently acquire the image from various points of view. Examples of these applications are motion capture systems, virtual mirrors, and 3D telecommunications. The approach proposed in this manuscript constructs a mesh by removing the overlapping surfaces from the boundaries. A similar approach was proposed by Alexiadis et al. [7]. Meshes generated from the multiple RGB-D cameras can introduce various noise problems, including depth fluctuations during measurement, and holes caused by the interference of infrared (IR) projections from the multiple cameras. The proposed algorithm reduces these issues, by considering the direction of the camera pose and by analyzing various conditions of the captured point clouds. IEIE Transactions on Smart Processing and Computing, vol. 4, no. 4, August 2015 225 Fig. 1. 3D reconstruction block diagram. The paper is organized as follows. Section 2 explains the proposed algorithm for 3D reconstruction. Section 3 presents the implementation method of the algorithm. Section 4 discusses the results of the experiment. Finally, Section 5 concludes this manuscript. 2. 3D Reconstruction Algorithm In the proposed scheme, RGB-D cameras are installed at 90° angles from the adjacent cameras and at a distance of 1 to 2m from the target. The camera parameters and their initial positions are estimated beforehand. If any subtle adjustment to the camera positions is required, optional online calibration can be performed. At the beginning, depth and RGB data from each camera are captured concurrently in each thread. The captured data are synchronized for subsequent processing. The depth data go through a bilateral filter [8] and are transformed into the point clouds using the calibrated intrinsic depth camera parameters. Then, the point clouds are used to generate cell-based meshes, following the removal of background points. After the cell generation, each point cloud is transformed to global coordinates with calibrated extrinsic parameters. The redundancies between the point clouds are removed after the transformation by the iterative boundary removal algorithm, and the resultant meshes are clipped together. The RGB data is transformed to depth coordinates, and the brightness level is adjusted by investigating the intensity of the overlapped regions. 
Finally, the calibrated RGB data are rendered with the triangulated mesh. Fig. 1 shows a block diagram of the overall system. Every color of the module represents a CPU thread, and the bold and thin lines indicated in the figure show the flow of data and parameters, respectively. 2.1 Calibrations A set of checkerboard images is captured from RGB/Depth cameras to estimate the intrinsic and extrinsic parameters, for each camera, using a Matlab camera calibration toolbox. For the depth camera calibration, IR images are used instead of the depth images, because the corner points of the checkerboard cannot be detected in a depth image. In addition to the depth camera parameters, the shifting error between the IR and depth [9] is considered, because the mapped color usually does not match the point cloud, as shown in Fig. 2(c). Vertices of a colored cube Fig. 2. Calibration process (a) RGB image, (b) Depth image, (c, d) RGB-D mapped image before and after IRdepth shift correction, (e) Edge vectors from the point cloud of a cube, (f) Multi-Kinect registration result, (g) Point cloud aggregation. (50×50×50cm) from the IR and depth images are found to estimate the shifting value. The intersection point of the three edges in the IR image corresponds to the vertex of the cube in the depth image. The vertex can be found via the intersection of the estimated three planes. The found offset is applied in the color-to-depth transformation module. Usually, the extrinsic parameters between two cameras can be estimated by finding the corresponding corners of the checkerboard images from the cameras. However, if the angle between the two cameras is too large, this method is difficult to use due to the narrow common view angle. Therefore, a multi-Kinect registration method is proposed that uses a cube for the calibration object. It needs only one pair of RGB/depth images per RGB-D camera in one scene. Fig. 2(e) shows the edge vectors and the vertex, identified by the colors of the intersecting three planes for one camera. The found edge vectors are transformed to the coordinates of a virtual cube, which has the same size as the real cube so as to minimize the mean square error of the distances for four vertices viewable from each camera. The registered cube and the estimated pose of the depth cameras are shown in Fig. 2(f), and the aggregated point cloud is given in Fig. 2(g). Online calibration for the extrinsic parameters can be performed if a slight change in the camera positions occurs by some accidental movement. An iterative closest point (ICP) [10] algorithm could be applied for this purpose. However, there are two kinds of difficulty with traditional ICP aligning all the point clouds in the proposed system. First, traditional ICP works only in a pairwise manner. Second, the point clouds do not have sufficient overlapping regions to estimate the alignment parameters. To resolve these problems, a combined solution of generalized Procrustes analysis (GPA) [11] and sparse ICP (S-ICP) [12] is adopted. The basic concept is as follows. 1. Extract the common centroid set that would become the target of S-ICP for all the point clouds. 2. Apply S-ICP on the centroid set for each point cloud. The difference between GPA presented by Toldo et al. [11] and our proposed method is that only left and right point clouds are used for centroid extraction, as seen in Fig. 226 Yoon et al.: Real-time Full-view 3D Human Reconstruction using Multiple RGB-D Cameras Fig. 3. 
Mutual neighboring closest points (a, b, c) Valid cases, (d, e) Invalid cases. Fig. 6. Camera positions. cross product of two vectors of the three vertices around the cell. The boundary cell is simply defined if the cell does not have any surrounding cells sharing an edge. The direction of the boundary cell is defined as the outward direction from the center to the boundary. For horizontal/vertical boundary cells, the direction is calculated as the weighted sum of vectors from the center to the vertices of the boundary edge (Figs. 5(b) and (c)): Fig. 4. Registration errors. ( ) v j3 − c j ( v j 2 − c j ) . b j = v j 2 − c j v j3 − c j − (1) For the diagonal boundary cell, the direction is calculated as the weighted sum of vectors from the center to the diagonal vertices (Fig. 5(d)): Fig. 5. Various cell types (a) No boundary cells, (b, c, d) Examples of directional boundary cells. 3. The direction of the arrow indicates its closest vertex in the neighboring point cloud, and the black dot indicates its centroid. S-ICP is repeatedly performed until all of the maximum square residual errors of the pairwise registration become less than a sufficiently low value. Fig. 4 shows the transition of the errors when three of four cameras are initially misaligned by about 10cm and at 5° to the arbitrary direction. 2.2 Cell Based Mesh A cell-based mesh (CBM) is used for redundancy removal, rather than the unordered triangle-based mesh, because CBM is quick to generate, and it is also feasible to utilize the grid property of the depth map. The projected location of the cell and its boundary condition can be examined rapidly, and this is used frequently in the proposed algorithms. A cell is constructed if all four edges surrounding the rectangular area in the depth map grid are within the Euclidean distance threshold Dm . During the boundary removal stage, the center of the cell is used, which is calculated by averaging the positions of the four neighboring vertices around the cell (Fig. 5(a)). The normal of each cell is also generated by calculating the ( ) v j1 − v j 2 ( v j 2 − v j 3 ) . b j = v j 2 − v j 3 v j1 − v j 2 − (2) There are undecidable one-way directional boundary cells, such as a thin line or a point cell. These cells are categorized as non-directional boundary cells and are dealt with accordingly. 2.3 Redundant Boundary Cell Removal The transformed cells may have redundant surfaces that overlap surfaces from other camera views. The redundant cells are removed by the redundant boundary cell removal (RBCR) algorithm. RBCR utilizes the direction of the virtual camera ev (Fig. 6), which is the middle view of its neighboring camera. Using this direction, we can effectively estimate the redundant surfaces, minimizing the clipping area. It is also used as the projection axis for 2D Delaunay triangulation. Let Mk be the cell mesh generated by camera k, let Ck , j be the jth cell in Mk , and let ck , j be the center of cell Ck , j . The index k is labeled in the order of circular direction. Assuming that Ck , j in Mk is a boundary cell, it is deemed redundant if a corresponding Ck +1,m can be found that minimizes the projective distance, d p , with the IEIE Transactions on Smart Processing and Computing, vol. 4, no. 
4, August 2015 227 constraint that the Euclidean distance ( d a ) between the center of the cells should be smaller than the maximum distance, Da : Ck*+1,m = argmin C∈{ C |d a < Da , C∈\ M k +1} (d p ) (3) The projective distance d p is defined as follows: d p = d a − | (ck +1,m − ck , j ) ⋅ ev | (4) where ev is found by spherical linear interpolation, or “slerp” [13], with angle Ω between camera direction e k and e k +1 : ev = sin ( Ω / 2 ) sin ( Ω ) Fig. 7. CUDA kernel composition. ( ek + ek +1 ) . (5) 2.5 Triangulation To find C * , a projection search method is adopted, i.e., ck , j is projected to the target view of Mk +1 , and the cells of Mk +1 , in the fixed-size window centered on the projected ck , j , are tested for the conditions. Once C * is found, the corresponding Ck , j is considered a potentially redundant cell. The additional conditions are tested to decide if the cell is truly redundant, and hence removable. If the found C * is not a boundary cell and the normal is in the same direction, it is redundant because Ck , j is on or under the surface, not the cell of a thin object. Or, if C * is a directional boundary cell, Ck , j is redundant when Ck , j is close enough to C * so that d p is smaller than the Except for the triangulated cells in the previous boundary clipping stage, all the other cells are simply triangulated by dividing the cell into the two triangles. The shorter diagonal edge is selected for triangulation. 2.6 Brightness Adjustment Although the Kinect device provides an auto-exposure functionality, it is not sufficient to tune the multiple RGB cameras. The brightness is tuned online by multiplying the correction factor. Each factor is calculated by comparing the intensity of the overlapped region with the mean intensity of all overlapped regions. The overlapped regions can be directly extracted from the RBCR operation. The propagation error from all the cameras is distributed to each correction factor so that the overall gain is 1. maximum projective distance ( D p ) , and the boundary directions are not in the same direction. This could be regarded as the depth test in ray-tracing for the direction of ev of the boundary cell. The way mutual directionality is decided is by the sign of the inner product for the two directions. In one loop, RBCR is performed through all ks, for the outermost boundary cells in Mk w.r.t. Mk +1 , and vice versa, and is applied iteratively until no cells are removed. 2.4 Boundary Clipping In this stage, any boundary cell in Mk within distance Da from the boundary of Mk +1 is collected with the same search method of RBCR. The collected cells are disintegrated to the vertices, and are orthogonally projected to the plane of the virtual camera. Then, the projected points are triangulated via 2D Delaunay algorithm. 3. Implementation Among the modules of Fig. 1, the bilateral filter through the position transform, the redundancy removal, and color-to-depth transform modules are implemented under the Compute Unified Device Architecture (CUDA) [14]. The rendering module is implemented with OpenGL and all other modules with the CPU. Fig. 7 shows all of the implemented CUDA kernels that correspond to the logical modules in Fig. 1. bilateralKernel is configured to filter one depth with one thread each. The radius and the standard deviation of the spatial Gaussian kernel were set to 5 pixels and 4.5 pixels, respectively. The standard deviation of the intensity (i.e. depth value) Gaussian kernel was set to 60mm. 
pcdKernel generates point cloud back-projecting of the depth pixels with the calibrated intrinsic parameters. The kernel also eliminates the background depth pixels with a frustum of near 0.5m and far 2.5m. The cell generation module consists of three kernels. 228 Yoon et al.: Real-time Full-view 3D Human Reconstruction using Multiple RGB-D Cameras gridKernel marks the valid vertices within distance Dm . As the neighboring relationship is needed to check the validity, four vertices are marked with an atomicOr barrier function if they turn out to be valid. reduceKernel reduces the grid vertices to a reduced stream, generating the indices for the marked vertices. cellKernel constructs the cell information if all of the neighboring vertices are valid. The constructed cell information includes both the normal and the center of the cell. The positional transformation is done in taKernel. It includes vertex, normal, center transform and the cell projection. Although the kernel could be implemented with a general transform kernel, as the transforms use the same parameter, it is more efficient to process them all at once by reducing the kernel launch time, rather than by calling the general purpose kernel multiple times. The RBCR algorithm is designed to run concurrently for the four pairs of the mesh by using the CUDA stream feature, not using the CPU thread, because the status of the cells needs to be synchronized for every loop. rbKernel just removes the first outermost boundary cells because the measured boundary depth tends to be considerably inaccurate. The RBCR loop runs with the two CUDA kernels. • rrKernel: Searches (3) and marks the flag for the cells to be removed. • updateKernel: Removes all the marked cells and returns the number of removed cells. The two-kernel approach makes the mesh maintain the consistency of the boundary condition in a loop. The search function in rrKernel is designed to use 32 concurrent threads per cell for a 16×16 search window. It leads to loop 8 times for one complete search, and to use 32 elements of shared memory for intermediate storage of the partial reduction. The grid size is defined as the number of cells ( N c ) divided by the number of cells per block ( N cpb ). N cpb is tuned to 16 as a result of performance tuning that maximizes the speed. The synchronization of the cell status is done automatically when the remove counter is copied from the device to the host with the default stream. To accelerate RBCR and keep a constant speed, the loop is terminated after the eighth iteration (max_loop=8) and one more search is done for all the remaining cells, including the non-boundary cells. The boundary cells of the RBCR results are collected with collectKernel by a method similar to rrKernel but without the iterative loop. colorKernel maps the color coordinates to the depth coordinates followed by correction of the radial, tangential distortions, and IR-depth shifting error using the calibrated parameters. The operation is performed only for the reduced cells. The boundary clipping module runs on CPU threads other than the RBCR thread to reduce the waiting time for RBCR. The Delaunay triangulation (DT) algorithm is implemented with the Computational Geometry Algorithms Library (CGAL) [15]. As DT generates the convex set of triangles, long-edged triangles ( > Dt ) are Table 1. Experiment Parameters. Parameters Values Dm (max cell edge) 2cm Da (max Euclidean distance) 3cm D p (max projected distance) 0.5cm Dt (max triangle edge length) 3cm Table 2. 
Latencies. Modules Sync Latencies 16.2ms Cell Gen. 11.2ms Pos. Trfm. 8.1ms Redun. Rem. 26.5ms Triangulation 24.3ms Total 86.3ms Fig. 8. Performance analysis. eliminated after the triangulation. We adapt the sparse ICP module [16] using an external kd-tree for mutual neighboring of closest points. The point-to-point A 0.4 -ICP is used for optimization, with max inner and outer iterations of 1 and 20, respectively. The parameter values used in this paper are given in Table 1. The resolutions for input depth and RGB are both 640×480, and the equipment used for the implementation was a desktop PC with an Intel i7 3.6GHz core and an NVidia GTX970 graphics card. 4. Results Fig. 8 gives the performance analysis results of NVidia Nsight for the implemented kernels. As expected, it shows that rrKernel is computationally the most expensive kernel, as expected. The timeline shows that the speed of the overall system is approximately 21fps. The latencies measured at the end of each module are described in Table 2. Fig. 9(a) shows various views of the reconstructed human mesh that can be seen on the run. The bottom row is the color map of the reconstructed mesh, where color represents the mesh from the corresponding camera. The thin purple line indicates the clipped area. Compared to the original unprocessed mesh in Fig. 9(c), we can see that the IEIE Transactions on Smart Processing and Computing, vol. 4, no. 4, August 2015 229 Fig. 9. Result of various views (a) max_loop = 8, (b) max_loop = 24, (c) The original. Fig. 10. Result of brightness adjustment (a) before, (b) after. resultant mesh has no redundancies and is clipped cleanly. Fig. 9(b) is the result of RBCR when max_loop is equal to 24, showing almost no difference when max_loop is equal to 8. Fig. 10 gives the result of the brightness adjustment showing that the mismatched color in the cloth is effectively corrected. 5. Conclusion In this paper, it is shown that the proposed algorithm and the implementation method could reconstruct a 3D mesh effectively, supporting a 360-degree viewing direction with multiple consumer RGB-D cameras. The proposed calibration method, which uses a cube as a calibration object, could estimate the color/depth camera parameters and the global position of the cameras effectively, accommodated by the online calibration method that exploits mutual neighboring closest points, and a sparse ICP algorithm. The constructed mesh had no redundancies after application of the proposed algorithm, which iteratively removes the estimated redundant regions from the boundary of the mesh. In addition, the proposed 3D reconstruction system could adjust the mismatched brightness between the RGB-D cameras by using the collateral overlapping region of the redundancy removal algorithm. The overall speed for implementation was 21fps with a latency of 86.3ms, which is sufficient for real-time processing. References [1] M. Botsch, et al., Polygon Mesh Processing, AK Peters, London, 2010. Article (CrossRef Link) [2] H. Hoppe, et al., “Surface reconstruction from unorganized points,” SIGGRAPH ’92, 1992. Article (CrossRef Link) [3] R. Mencl and H. Müller. Graph–based surface reconstruction using structures in scattered point sets. In Proceedings of CGI ’98 (Computer Graphics International), pp. 298–311, June, 1998. Article (CrossRef Link) [4] B. Curless and M. Levoy, “A volumetric method for building complex models from range images,” SIGGRAPH ’96, 1996. Article (CrossRef Link) [5] M. Kazhdan, M. Bolitho, and H. 
Hoppe, “Poisson Surface Reconstruction,” Proc. Symp. Geometry Processing, 2006. Article (CrossRef Link) [6] Microsoft Kinect Article (CrossRef Link) [7] D. Alexiadis, D. Zarpalas, and P. Daras, “Real-time, full 3-D reconstruction of moving foreground objects from mul-tiple consumer depth cameras,” IEEE Trans on Multimedia, vol. 15, pp. 339 – 358, Feb. 2013. Article (CrossRef Link) [8] C. Tomasi and R. Manduchi, “Bilateral filtering for 230 [9] [10] [11] [12] [13] [14] [15] [16] Yoon et al.: Real-time Full-view 3D Human Reconstruction using Multiple RGB-D Cameras gray and color images,” in Proc. of the ICCV, 1998. Article (CrossRef Link) J. Smisek, M. Jancosek, and T. Pajdla, “3D with Kinect,” 2011 ICCV Workshops, pp. 1154-1160, Nov. 2011. Article (CrossRef Link) P. Besl and N. McKay, “A Method for Registration of 3-D Shapes,” IEEE Trans. Patten Analysis and Machine Intelligence, vol. 14, pp. 239-256, 1992. Article (CrossRef Link) R. Toldo, A. Beinat, and F. Crosilla, “Global registration of multiple point clouds embedding the generalized procrustes analysis into an ICP framework,” in 3DPVT 2010 Conf., Paris, May 1720, 2010. Article (CrossRef Link) S. Bouaziz, A. Tagliasacchi, and M. Pauly, “Sparse Iterative Closest Point,” Computer Graphics Forum, vol. 32, no. 5, pp. 1–11, 2013. Article (CrossRef Link) K. Shoemake, “Animating rotation with quaternion curves,” in Proc. of the SIGGRAPH ’85, 1985, pp. 245-254. Article (CrossRef Link) NVidia CUDA Article (CrossRef Link) CGAL Article (CrossRef Link) Sparse ICP Article (CrossRef Link) Bumsik Yoon received his BSc and MSc in Electrical Engineering from Yonsei University, Korea, in 1997 and 2000, respectively. He is a senior researcher at Samsung Electronics. Currently, he is pursuing his PhD at Hanyang University, Korea. His research interests include 3D reconstruction, pedestrian detection, time-of-flight and computer graphics. Kunwoo Choi received his BSc in Electronics Engineering at Konkuk University, Korea, in 2014. He is currently pursuing an MSc in Electronics and Computer Engineering at Hanyang University. His research interests include depth acquisition and vehicle vision systems. Copyrights © 2015 The Institute of Electronics and Information Engineers Moonsu Ra received his BSc and MSc at Hanyang University, Korea, in 2011 and 2013, respectively. He is currently pursuing his PhD at the same university. His research interests include visual surveillance, face tracking and identification, and video synopsis. Whoi-Yul Kim received his PhD in Electronics Engineering from Purdue University, West Lafayette, Indiana, in 1989. From 1989 to 1994, he was with the Erick Johansson School of Engineering and Computer Science at the University of Texas at Dallas. He joined Hanyang University in 1994, where he is now a professor in the Department of Electronics and Computer Engineering. His research interests include visual surveillance, face tracking and identification, and video synopsis. IEIE Transactions on Smart Processing and Computing, vol. 4, no. 4, August 2015 http://dx.doi.org/10.5573/IEIESPC.2015.4.4.231 231 IEIE Transactions on Smart Processing and Computing Analysis of Screen Content Coding Based on HEVC Yong-Jo Ahn1, Hochan Ryu2, Donggyu Sim1 and Jung-Won Kang3 1 Department of Computer Engineering, Kwangwoon University / Seoul, Rep. of Korea {yongjoahn, dgsim}@kw.ac.kr Digital Insights Co. / Seoul, Rep. of Korea hc.ryu@digitalinsights.co.kr 3 Broadcasting & Telecommunications Media Research Laboratory, ETRI / Daejeon, Rep. 
of Korea jungwon@etri.re.kr 2 * Corresponding Author: Donggyu Sim Received June 20, 2015; Revised July 15, 2015; Accepted August 24, 2015; Published August 31, 2015 * Short Paper: This paper reviews the recent progress, possibly including previous works in a particular research topic, and has been accepted by the editorial board through the regular reviewing process. Abstract: In this paper, the technical analysis and characteristics of screen content coding (SCC) based on High efficiency video coding (HEVC) are presented. For SCC, which is increasingly used these days, HEVC SCC standardization has been proceeded. Technologies such as intra block copy (IBC), palette coding, and adaptive color transform are developed and adopted to the HEVC SCC standard. This paper examines IBC and palette coding that significantly impacts RD performance of SCC for screen content. The HEVC SCC reference model (SCM) 4.0 was used to comparatively analyze the coding performance of HEVC SCC based on the HEVC range extension (RExt) model for screen content. Keywords: HEVC SCC, Screen content coding 1. Introduction High efficiency video coding (HEVC) is the latest video compression standard of the Joint Collaborative Team on Video Coding (JCT-VC), which was established by the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG). Several extensions and profiles of HEVC have been developed according to application areas and objects to be coded [1]. The standardization of HEVC version 1 for major applications such as ultra-high definition (UHD) content was completed in January 2013. Furthermore, the standardization of HEVC version 2 for additional applications, such as high-quality, scalable, and 3D video services, was released in October 2014. HEVC version 2 includes 21 range extension profiles, two scalable extension profiles, and one multi-view extension profile [2]. First, the HEVC range extension (RExt) standard aims to support the extended color formats and high bit depths equal to or higher than 12 bits, which HEVC version 1 does not support. In addition, the HEVC scalable extension (SHVC) standard supports multi-layer video coding according to consumer communications and market environments, and the HEVC multi-view extension (MVHEVC) aims to support multi-view video services. Recently, emerging requirements for screen content coding have been issued, and the extension for SCC was kicked off based on HEVC [3]. In addition, the HEVC HDR/WCG extension for high-dynamic-range (HDR) and wide-colorgamut (WCG) coding is being discussed [4]. The HEVC SCC extension is designed for mixed content that consists of natural videos, computer-generated graphics, text, and animation, which are increasingly being used. The HEVC SCC extension has been discussed since the 17th JCT-VC meeting in March 2014, and it is being standardized with the goal of completion in February 2016. HEVC is known to be efficient for natural video but not for computer-generated graphics, text, and so on. That content has high-frequency components due to sharp edges and lines. Conventional video coders, in general, remove high-frequency components for compression purposes. However, HEVC SCC includes all coding techniques supported by HEVC RExt, and additionally, has IBC, palette coding, and adaptive color space transform. In this study, IBC and palette coding are explained in detail in relation to screen content, among the newly added coding techniques. 
In addition, the HEVC SCC reference model (SCM) is used to present and analyze the coding performance for screen content. The result of the formal subjective assessment and objective testing showed a clear improvement in comparison to HEVC RExt [5]. 232 Ahn et al.: Analysis of Screen Content Coding Based on HEVC Fig. 2. Intra block copy and block vector. Fig. 1. HEVC SCC decoder block diagram. This paper is organized as follows. Chapter 2 explains IBC and palette coding, which are the coding techniques added to HEVC SCC. Chapter 3 presents and analyzes the coding performance of HEVC SCC for screen content. Finally, Chapter 4 concludes this study. 2. HEVC Screen Content Coding Techniques HEVC SCC employs new coding techniques in addition to all the techniques adopted to HEVC RExt. Fig. 1 shows the block diagram for the HEVC SCC decoder, which includes newly adopted IBC, palette mode, and adaptive color transform (ACT). This chapter explains IBC mode coding and palette mode coding, considering the characteristics of screen content. 2.1 IBC Mode As mentioned in Chapter 1, screen content is likely to have similar patterns on one screen, unlike natural images. Such a spatial redundancy is different from the removable spatial redundancy under the existing intra prediction schemes. The most significant difference is the distance and shapes from neighboring objects. Removable spatial redundancy with the intra prediction schemes refers to the similarity between the boundary pixels of the block to be coded and the adjacent pixels located spatially within one pixel. The removable spatial redundancy in IBC mode refers to the similarity between the area in the reconstructed picture and the block to be coded [6]. Unlike the conventional intra prediction schemes, a target 2D block is predicted from a reconstructed 2D block that is more than one pixel distant from it. Inter prediction should have the motion information, the so-called motion vector, with respect to the reference block to remove the temporal redundancy in the previously decoded frames. In the same manner, IBC mode also needs the location information for the reference block in the form of a vector in the same frame. Fig. 2 shows the concept of IBC and the block vector. Fig. 3. Candidates of spatial motion vector predictor and block vector predictor. In HEVC SCC, a vector that locates the reference block is called the block vector (BV) [7]. Conceptually, this is considered to be similar to the motion vector (MV) of inter prediction. The block vector and motion vector have differences and similarities. In terms of the accuracy of the vectors, MV has quarter-pel accuracy to ensure improved prediction accuracy, whereas BV has integer-pel accuracy. This is because of the characteristics of the screen content in IBC mode. In computer-generated graphics, objects are generated pixel by pixel. The key feature of IBC is conducted on the reconstructed area of the current frame not previously coded, or frames decoded different. In addition, the BV should be sent to the decoder side, but the block vector is predicted to reduce the amount of data in a manner similar to the motion vector. During the HEVC SCC standardization process, various algorithms of BV prediction techniques were discussed. They can be classified into a BV prediction method independent from the MV prediction method, and a prediction method working in the same way as the MV prediction method. 
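To make the copy operation behind IBC concrete before the two block-vector prediction approaches are detailed, the sketch below forms an IBC prediction by copying an already-reconstructed block of the current picture at an integer-pel block vector, and finds that vector with a toy SAD search. This is only an illustration of the idea under simplifying assumptions (the candidate block is restricted to rows strictly above the current block); the function names and the brute-force search are not taken from the SCM reference software.

```python
import numpy as np

def ibc_predict(recon, x, y, bv, w, h):
    """Copy the w x h prediction block pointed to by the integer-pel
    block vector bv = (bvx, bvy) from the reconstructed picture."""
    bvx, bvy = bv
    rx, ry = x + bvx, y + bvy
    return recon[ry:ry + h, rx:rx + w].copy()

def search_block_vector(recon, cur_block, x, y):
    """Toy full search for the best block vector.  For simplicity, candidate
    blocks are taken only from rows strictly above the current block, i.e.
    an area that is guaranteed to be reconstructed already."""
    h, w = cur_block.shape
    best_bv, best_sad = None, None
    for ry in range(0, y - h + 1):                 # candidate top-left rows
        for rx in range(0, recon.shape[1] - w + 1):
            sad = np.abs(recon[ry:ry + h, rx:rx + w].astype(int)
                         - cur_block.astype(int)).sum()
            if best_sad is None or sad < best_sad:
                best_sad, best_bv = sad, (rx - x, ry - y)
    return best_bv, best_sad

# Tiny demo: a frame with a repeated pattern, typical of screen content.
recon = np.tile(np.arange(16, dtype=np.uint8).reshape(4, 4), (8, 8))   # 32x32
cur = recon[16:20, 16:20]                          # current 4x4 block
bv, sad = search_block_vector(recon, cur, 16, 16)
pred = ibc_predict(recon, 16, 16, bv, 4, 4)
print(bv, sad, np.array_equal(pred, cur))          # exact match, SAD = 0
```

Because screen content repeats glyphs and UI elements exactly, such a copy can reproduce the block with zero residual, which is the intuition behind IBC's gains on text and graphics.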
In the BV prediction method that is independent of the MV prediction method, the BV of the adjacent block and the BV that is not adjacent, but IBC-coded, is taken for the selection of BV predictor (BVP) candidates, and one of them is selected. Fig. 3 shows the spatial candidates under MV and BV. In the case of MV, the MVs of all prediction units (PUs), A0, A1, B0, B1, and B2, are used as spatial candidates. With the BV, however, only A1 and B1 are used as spatial candidates. When the left adjacent block, A1, of the PU is coded with IBC mode, the BV of A1 is IEIE Transactions on Smart Processing and Computing, vol. 4, no. 4, August 2015 selected as the BVP candidate. Then, when the aforementioned block B1 of the PU is coded with IBC mode, the BV of B1 is also selected as the BVP candidate. As in the MV prediction method, the BV prediction technique adds the partly reconstructed current picture to the reference picture list, and IBC mode of the current PU could refer the reconstructed area of the current picture. The second method was proposed in the 20th JCT-VC meeting and it was adopted. Although the BV can be predicted, as in the existing MV prediction method, because the current picture is added to the reference list, predictors may exist that do not conform to the characteristics of IBC during the advanced motion vector prediction (AMVP) and merge candidate list creation processes. However, it still has several problems, such as the difference in vector accuracy between MV and BV, optimal BVP candidates, and so on. To address this problem, studies are being conducted to change the zero candidate considering the IBC characteristics and to change the candidate list. 2.2 Palette Mode HEVC SCC has palette mode in addition to IBC mode. In palette mode, pixel values that frequently appear in the block are expressed in indices and then coded [8]. In HEVC SCC, the index for palette mode coding is defined as the color index, and the mapping information between the pixel value and the color index is defined in the palette table. Palette mode can improve coding efficiency when the prediction does not work due to low redundancy and when the number of pixel values for the block is small. Unlike the existing coding mode, which uses intra-inter prediction to remove the spatial-temporal redundancy, palette mode expresses the pixel values that form the current block with the color index, so coding is possible independent of the already restored adjacent information. In addition, fewer color indexes are required than the total number of pixels in a current block, which improves coding efficiency. The coding process in HEVC SCC palette mode consists of two steps: the palette table coding process and the color index map coding process. The palette table coding process is conducted as follows. First, assuming that N peaks exist when pixel values are shown in a histogram, N peak pixel values are defined as major colors. N major colors are mapped as the color indices via quantization, with the colors in the quantization zone as the major colors. The colors that exist outside the quantization zone, which are not expressed as major colors, are defined as escape colors, which are coded not using the color index but the quantization of the pixel values. The table with the generated color indices is defined as the palette table for each coding unit (CU). The palette table conducts prediction coding by referring to the previous CU coded by palette mode. 
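Before turning to how the palette itself is signalled, the following sketch illustrates the palette-table idea described above: the dominant pixel values of a block act as major colors, each pixel is mapped to the index of the nearest major color if it lies inside the quantization zone, and the remaining pixels are treated as escape colors. The number of major colors, the quantization range, and the function name are illustrative assumptions, not the SCM derivation.

```python
import numpy as np

def build_palette(block, num_major=4, quant_range=2):
    """Derive a palette table (major colors) from the block histogram and map
    each pixel to a color index; pixels outside the quantization zone of
    every major color are marked as escape pixels (index -1)."""
    values, counts = np.unique(block, return_counts=True)
    # Histogram peaks are approximated here by the most frequent pixel values.
    palette = np.sort(values[np.argsort(counts)[::-1][:num_major]]).astype(int)

    dist = np.abs(block.astype(int)[..., None] - palette[None, None, :])
    index_map = dist.argmin(axis=-1)                    # nearest major color
    index_map[dist.min(axis=-1) > quant_range] = -1     # escape colors
    return palette, index_map

# Demo on a block that mixes a few flat colors with one outlier value,
# which is typical of text/graphics screen content.
block = np.array([[ 20,  20,  20, 200],
                  [ 20, 200, 200, 200],
                  [ 20,  20,  90, 200],   # 90 has no nearby major color
                  [255,  20, 200, 200]], dtype=np.uint8)
palette, index_map = build_palette(block, num_major=3)
print("palette table:", palette)          # e.g. [ 20 200 255]
print(index_map)                          # the 90 pixel is mapped to -1 (escape)
```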
Whether or not the prediction coding is conducted is coded using the previous_palette_entry_flag. Prediction coding uses palette stuffing, which utilizes the predictor of the previous CU [9]. Fig. 4 shows the histogram of the pixel values and the resulting major/escape colors of palette mode.

Fig. 4. Histogram of the pixel values and major/escape colors of palette mode.

Then, the current CU is coded through the color index map coding process. The color index map refers to the block expressed by replacing the pixel values of the current CU with color indices. The information coded during the color index map coding process includes the color index scanning direction and the color index map. The scanning direction of the color index is coded per CU using the palette_transpose_flag, and three types of color index map coding, INDEX mode, COPY_ABOVE mode, and ESCAPE mode, are used. In INDEX mode, run-length coding is conducted for the color index value, and the mode index, color index, and run-length are coded. In COPY_ABOVE mode, which copies the color indices of the row above, the mode index and run-length are coded. Finally, in ESCAPE mode, which uses the pixel value as it is, the mode index and the quantized pixel value are coded.

3. Analysis and Performance Evaluation of HEVC SCC

In this chapter, the HEVC SCC reference model (SCM) 4.0 [10] is used to analyze the coding performance of HEVC SCC for screen content against the reference model for HEVC RExt. All the tests were conducted under the HEVC SCC common test condition (CTC) [11]. In addition, the HEVC SCC common test sequences were used in the test; they are classified into four categories. Text and graphics with motion (TGM) contains images with text and graphics combined and best shows the characteristics of screen content. Mixed contents (M) contains mixed characteristics of screen content and natural images. In addition, there are the animation (A) and natural, camera-captured content (CC) categories. Fig. 5 shows the four categories of the HEVC SCC common test sequences.

Fig. 5. Four categories of common test sequences: text and graphics with motion (TGM), mixed contents (M), animation (A), and camera-captured content (CC).

The coding performance and speed were measured in the same test environment. Table 1 lists the details of the test environment. In addition, the Bjontegaard distortion-bitrate (BD-BR) was used to compare the coding performance, and the time ratio was used to measure the coding speed.

Table 1. Test environments.
  Item       Specification
  CPU        Intel Core i7-3960X 3.30 GHz
  Memory     16 GB
  OS         Windows 7
  Compiler   VS 2012

Tables 2 to 4 show the coding performance when the HEVC SCC common test sequences were coded with HM 16.0 and SCM 4.0.

Table 2. BD-BR performance evaluation of SCM 4.0 compared to HM 16.0 in All Intra.
  Category              Y         U         V
  TGM, 1080p & 720p    -57.4%    -61.2%    -62.7%
  M, 1440p & 720p      -44.9%    -50.3%    -50.4%
  A, 720p                0.0%     -8.5%     -5.2%
  CC, 1080p              5.4%      8.6%     12.5%
  Encoding time (%)     347
  Decoding time (%)     121

Table 3. BD-BR performance evaluation of SCM 4.0 compared to HM 16.0 in Random Access.
  Category              Y         U         V
  TGM, 1080p & 720p    -48.0%    -52.4%    -55.0%
  M, 1440p & 720p      -36.3%    -43.6%    -43.7%
  A, 720p                1.8%     -5.3%     -2.2%
  CC, 1080p              6.0%     15.8%     20.1%
  Encoding time (%)     139
  Decoding time (%)     147

Table 4. BD-BR performance evaluation of SCM 4.0 compared to HM 16.0 in Low delay B.
  Category              Y         U         V
  TGM, 1080p & 720p    -41.4%    -45.3%    -48.0%
  M, 1440p & 720p      -23.7%    -32.1%    -32.2%
  A, 720p                2.7%     -2.0%      0.6%
  CC, 1080p              6.0%     14.2%     17.6%
  Encoding time (%)     141
  Decoding time (%)     145

For screen content overall, SCM 4.0 showed 19.1% Y BD-BR gain, 21.8% U BD-BR gain, and 20.7% V BD-BR gain compared with HM 16.0. In the TGM category, which has strong screen content characteristics, 48.9% Y BD-BR gain, 53.0% U BD-BR gain, and 55.2% V BD-BR gain were obtained, and the results show that the newly added IBC mode and palette mode performed well. On the other hand, 5.8% Y BD-BR loss, 12.9% U BD-BR loss, and 16.7% V BD-BR loss were obtained in the CC category, which has strong natural content characteristics. As shown in Tables 2 to 4, the newly added coding tools, IBC and palette mode, increased the encoding time to 347%, 139%, and 141% of that of HM 16.0 for All Intra, Random Access, and Low delay B, respectively. Given these results, fast encoding algorithms for IBC and palette mode are expected to be widely studied in the future.

4. Conclusion

In this paper, the newly added algorithms of HEVC SCC were introduced, and the HEVC SCC coding performance was analyzed for screen content. The coding characteristics of the new coding tools designed for screen content, intra block copy (IBC) mode and palette mode, were introduced in detail. In terms of coding performance for screen content, HEVC SCC achieved 19.1% BD-BR gain compared with HEVC RExt. The HEVC SCC standardization will be completed in October 2015, and the standard is expected to be widely used in various screen content applications.

Acknowledgement

This work was partly supported by an Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIP) (No. R010-14-283, Cloud-based Streaming Service Development for UHD Broadcasting Contents) and by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (NRF-2014R1A2A1A11052210).

References

[1] G. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, “Overview of the high efficiency video coding (HEVC) standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649-1668, 2012. Article (CrossRef Link)
[2] G. Sullivan, J. Boyce, Y. Chen, J.-R. Ohm, C. Segall, and A. Vetro, “Standardized extensions of high efficiency video coding (HEVC),” IEEE Journal of Selected Topics in Signal Processing, vol. 7, pp. 1001-1016, Dec. 2013. Article (CrossRef Link)
[3] ITU-T Q6/16 Visual Coding and ISO/IEC JTC1/SC29/WG11 Coding of Moving Pictures and Audio, “Joint call for proposals for coding of screen content,” ISO/IEC JTC1/SC29/WG11 MPEG2014/N14175, Jan. 2014.
[4] ISO/IEC JTC1/SC29/WG11 Coding of Moving Pictures and Audio, “Call for evidence (CfE) for HDR and WCG video coding,” ISO/IEC JTC1/SC29/WG11 MPEG2014/N15083, Feb. 2015.
[5] ISO/IEC JTC1/SC29/WG11 Coding of Moving Pictures and Audio, “Results of CfP on screen content coding tools for HEVC,” ISO/IEC JTC1/SC29/WG11 MPEG2014/N14399, April 2014.
[6] S.-L. Yu and C. Chrysafis, “New intra prediction using intra-macroblock motion compensation,” JVT-C151, ISO/IEC JTC1/SC29/WG11 and ITU-T SG16 Q.6, May 2002.
[7] M. Budagavi and D.-K.
Kwon, “AHG8: Video coding using intra motion compensation,” JCTVC-M0350, Incheon, KR, Apr. 2013.
[8] Y.-W. Huang, P. Onno, R. Joshi, R. Cohen, X. Xiu, and Z. Ma, “HEVC screen content core experiment 3 (SCCE3): palette mode,” JCTVC-Q1123, Valencia, ES, March 2014.
[9] C. Gisquet, G. Laroche, and P. Onno, “SCCE3 Test A.3: palette stuffing,” JCTVC-R0082, Sapporo, JP, June 2014.
[10] HEVC screen contents coding reference model (SCM): ‘HM-16.4+SCM-4.0’, Article (CrossRef Link), accessed July 2015.
[11] H. Yu, R. Cohen, K. Rapaka, and J. Xu, “Common test conditions for screen content coding,” JCTVC-T1015, Geneva, CH, Feb. 2015.

Yong-Jo Ahn received the B.S. and M.S. degrees in Computer Engineering from Kwangwoon University, Seoul, Korea, in 2010 and 2012, respectively. He is currently working toward a Ph.D. degree at the same university. His current research interests are high-efficiency video compression, parallel processing for video coding, and multimedia systems.

Hochan Ryu received the B.S. and M.S. degrees in Computer Engineering from Kwangwoon University, Seoul, Rep. of Korea, in 2013 and 2015, respectively. Since 2015, he has been an associate research engineer at Digital Insights Co., Rep. of Korea. His current research interests are video coding, video processing, parallel processing for video coding, and multimedia systems.

Donggyu Sim received the B.S. and M.S. degrees in Electronic Engineering from Sogang University, Seoul, Korea, in 1993 and 1995, respectively. He also received the Ph.D. degree from the same university in 1999. He was with Hyundai Electronics Co., Ltd. from 1999 to 2000, where he was involved in MPEG-7 standardization. He was a senior research engineer at Varo Vision Co., Ltd., working on MPEG-4 wireless applications from 2000 to 2002. He worked for the Image Computing Systems Lab. (ICSL) at the University of Washington as a senior research engineer from 2002 to 2005, where he researched ultrasound image analysis and parametric video coding. Since 2005, he has been with the Department of Computer Engineering at Kwangwoon University, Seoul, Korea. In 2011, he joined Simon Fraser University as a visiting scholar. He was elevated to IEEE Senior Member in 2004. He is one of the main inventors of many essential patents licensed to MPEG-LA for the HEVC standard. His current research interests are video coding, video processing, computer vision, and video communication.

Jung-Won Kang received her BS and MS degrees in electrical engineering in 1993 and 1995, respectively, from Korea Aerospace University, Seoul, Rep. of Korea. She received her PhD degree in electrical and computer engineering in 2003 from the Georgia Institute of Technology, Atlanta, GA, US. Since 2003, she has been a senior member of the research staff in the Broadcasting & Telecommunications Media Research Laboratory, ETRI, Rep. of Korea. Her research interests are in the areas of video signal processing, video coding, and video adaptation.

Copyrights © 2015 The Institute of Electronics and Information Engineers

IEIE Transactions on Smart Processing and Computing, vol. 4, no.
4, August 2015 http://dx.doi.org/10.5573/IEIESPC.2015.4.4.237 237 IEIE Transactions on Smart Processing and Computing Real-time Speed Limit Traffic Sign Detection System for Robust Automotive Environments Anh-Tuan Hoang, Tetsushi Koide, and Masaharu Yamamoto Research Institute for Nanodevice and Bio Systems, Hiroshima University, 1-4-2, Kagamiyama, Higashi-Hiroshima, 739-8527, Japan {anhtuan, koide}@hiroshima-u.ac.jp * Corresponding Author: Anh-Tuan Hoang Received July 15, 2015; Revised August 5, 2015; Accepted August 24, 2015; Published August 31, 2015 * Regular Paper * Extended from a Conference: Preliminary results of this paper were presented at the ITC-CSCC 2015, Summer 2015. This present paper has been accepted by the editorial board through the regular reviewing process that confirms the original contribution. Abstract: This paper describes a hardware-oriented algorithm and its conceptual implementation in a real-time speed limit traffic sign detection system on an automotive-oriented field-programmable gate array (FPGA). It solves the training and color dependence problems found in other research, which saw reduced recognition accuracy under unlearned conditions when color has changed. The algorithm is applicable to various platforms, such as color or grayscale cameras, high-resolution (4K) or low-resolution (VGA) cameras, and high-end or low-end FPGAs. It is also robust under various conditions, such as daytime, night time, and on rainy nights, and is adaptable to various countries’ speed limit traffic sign systems. The speed limit traffic sign candidates on each grayscale video frame are detected through two simple computational stages using global luminosity and local pixel direction. Pipeline implementation using results-sharing on overlap, application of a RAM-based shift register, and optimization of scan window sizes results in a small but highperformance implementation. The proposed system matches the processing speed requirement for a 60 fps system. The speed limit traffic sign recognition system achieves better than 98% accuracy in detection and recognition, even under difficult conditions such as rainy nights, and is implementable on the low-end, low-cost Xilinx Zynq automotive Z7020 FPGA. Keywords: Advanced driver assistance systems (ADAS), Speed limit traffic sign detection, Rectangle pattern matching, Circle detection, FPGA implementation 1. Introduction Speed limit traffic sign recognition is very important for the fast-growing advanced driver assistance systems (ADAS). Under continual pressure for greater road safety from governments, traffic sign recognition and active speed limitation become urgent issues for an ADAS. Important traffic sign information is provided in the driver’s field of vision via road signs, which are designed to assist drivers in terms of destination navigation and safety. Most important for a camera-based ADAS is to improve the driver’s safety and comfort. Detecting a traffic sign can be used in warning drivers about current traffic situations, dangerous crossings, and children’s paths, as shown in Fig. 1. Although the navigation system is available, it cannot be applied to new roads or places that the navigation signal cannot reach, or with electronic speed limit traffic signs where the sign changes depending on the traffic conditions. An assistant system with speed limitation recognition ability can inform drivers about the change in speed limit, as well as notify them if they drive over the speed limit. 
Hence, the driver’s cognitive tasks can be reduced, and safe driving is supported. Speed limit traffic (SLT) sign recognition systems face several problems in real-life usage, as shown in Fig. 2. (1) Color: the color of an SLT sign will change depending on the light, weather conditions, and age of the sign. (2) Sign construction: light-emitting diode (LED) SLT signs appear different in color, shape, and luminosity 238 Hoang et al.: Real-time Speed Limit Traffic Sign Detection System for Robust Automotive Environments Advanced Driver Assistance System (ADAS) Driving Assistant System Traffic sign recognition (until 2017) Car in front Lane detection Pedestrian Fig. 1. Chalenges of ADAS system toward active safety. (a) Painted speed signs in Japan. (b) Sign in Australia. (c) Sign (d) Sign in in Swiss. German. (e) LED. (f) After years. (g) Distortion (h) Shined (i) At night. (Australia). and not. Fig. 2. The appearance differences of speed limit traffic sign in various conditions. 2. Image Size and Scan Window Size Requirement for SLT Sign Detection The right side of Fig. 3 shows the differences in image sizes of an SLT sign at different distances and angles in real life. We define a scene as a sequence of all frames in which the same sign appears in, and disappears from the observation field of the camera. In one scene, when a 640×360 pixel camera is located at more than 30 meters in front of the sign, the 60 cm diameter SLT sign in Japan will appear as small as 10×10 pixels, as shown on the left of Fig. 3. In that case, the size of the number in the sign is Local road in daytime Size = Distance > 30 meters 20×20 80 14 m Sign size = 10 ×10 pixels Number size = 7 ×6 pixels Same sign at closer distance 3.0 m Size = 80 50×50 20° 16 m 3.6 m 3.6 m Median strip from printed signs, depending on the angle between the camera and the sign. (3) Light conditions: the presence of the sun and of some types of lights, in daytime and at night, makes the sign look different. (4) Font: fonts on traffic signs are decided by governments, and they are different in various countries. The difference can be seen in the thickness and shape of the number, as shown in Figs. 2 (a), (b), (c), and (d). (5) Distortion: the image of an SLT sign has distortion along three axes (x, y, and z), which depend on the angle between camera and sign. (6) Accuracy: high accuracy in recognition rate is required, which means that the same sign should be correctly recognized at least once in a sequence scene. (7) Real-time processing: the automotive system must be able to process 30 to 15 fps under various platforms. (8) Stability: under various platforms, the system must not need retraining when users change devices, such as the camera. A lot of research on SLT sign recognition (SLTSR) for the ADAS has been done, but those SLTSR algorithms have difficulty recognizing when color has changed due to light conditions, such as the presence of sunshine (Fig. 2(h)), illumination at night (Fig. 2(i)), LED signs (Fig. 2(e)), and in recognizing signs in different countries. They also have difficulty with high-accuracy, real-time processing using few computational resources on available low-price devices. In this study, we aim to solve these color and environment problems with a non-color–based recognition approach, in which grayscale images are used in both speed limit sign candidate detection and number recognition. SLT sign candidates are detected from each input frame before recognizing the limit speed in real time. 
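Because the proposed approach is deliberately non-color-based, a frame coming from a color camera would first be reduced to a single luminance channel before candidate detection and number recognition. The paper does not spell out a particular conversion (part of the evaluation uses a camera that already outputs grayscale), so the standard ITU-R BT.601 weighting below is only an illustrative assumption.

```python
import numpy as np

def to_grayscale(rgb_frame):
    """Reduce an RGB frame to 8-bit luminance (ITU-R BT.601 weights) so that
    the detection stages never depend on color information."""
    weights = np.array([0.299, 0.587, 0.114])
    gray = rgb_frame.astype(np.float32) @ weights
    return np.clip(gray + 0.5, 0, 255).astype(np.uint8)

# A color frame from the camera (random data standing in for video here).
frame = np.random.randint(0, 256, size=(360, 640, 3), dtype=np.uint8)
gray = to_grayscale(frame)
print(gray.shape, gray.dtype)   # (360, 640) uint8
```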
Our system combines many simple and easy computation features of SLT signs, such as area luminosity, pixel direction, and block histogram, into a real-time, highaccuracy, and low-computational–cost design. Hence, it is implementable on a low-cost and resource-limited automotive-oriented field-programmable gate array (FPGA). The target platform is the Xilinx Zynq 7020, which has 85K logic cells (1.3 M application-specific integrated circuit gates), 53.2K lookup tables (LUTs), 106.4K registers, and 506KB block random access memory. Its price is about $15 per unit [24]. Related works and an overview of our approach to SLT sign recognition is presented in Section 2. The available SLT sign recognition system architectures and related algorithms for SLT sign candidate detection are discussed in Section 3. Section 4 describes how to combine simple features of non-color–based SLT signs for a real-time recognition system. Section 5 offers an overview of the hardware implementation of the proposed architecture. Discussions on accuracy, hardware size, and throughput of the proposed algorithm for SLT sign detection are given in Section 6. Section 7 concludes this paper 5.2 m Sign size = 38 ×38 pixels Number size = 20 ×16 pixels verge Fig. 3.Difference of the image size of the SLT sign on different distance, speed, lane (angle) with 640×360 pixel camera [18]. IEIE Transactions on Smart Processing and Computing, vol. 4, no. 4, August 2015 Frame 4, size = 18 Frame 6, size = 21 Frame 8, size = 23 Frame 10, size = 26 Frame 12, size = 28 Frame 16, size = 38 239 tion accuracy, but required a huge training data set, as well as huge computational resources (four 512-core GPUs) for traffic sign recognition. Recognition time for a full HD image will be significantly increased to an unacceptable level for real-life usage. In addition, they use color features in their recognition, and so face accuracy problems when recognizing signs under unknown light conditions. 3.1.2 Machine Learning–Based Recognition Fig. 4. A scene in real situation with image of SLT sign gradually increase. as small as 7×6 pixels, which is really hard to recognize, even with the human eye. This size increases to 20×20 pixels when the camera gets closer to the sign at a distance of 30 meters (with a 10×10-pixel number); then, the size gradually increases to 50×50 pixels at a distance of 16 meters (on the highway) before disappearing from the observation field of the camera. At sizes bigger than 20×20, appearance of the speed sign becomes recognizable, as shown at the bottom of Fig. 3. When a 1920×1080-pixel (full HD) camera, or a camera with higher resolution, is used, the size of an SLT sign becomes bigger. Although the image size of an SLT sign and the processing size are a trade-off, the range of 20×20 to 50×50 is always available in a scene. An example of a scene in a real situation is shown in Fig. 4. From that real situation, SLT sign detection should not use up too many computational resources to recognize the SLT sign at size smaller than 20×20 or bigger than 50×50. Since the SLT sign appears in a range of sizes, the system must process each input frame with scan windows in a range of sizes for SLT sign detection and recognition at the proposed distance. Our proposed SLT sign detection algorithm is designed to detect an SLT sign in the range of 20×20 to 50×50 pixels, which will appear in the observation field of cameras with resolution higher than 640×360. 
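The frame budget relied on in the next paragraph, i.e., how many frames a 60 fps camera captures while the sign stays between roughly 30 m and 16 m ahead of the vehicle, can be checked with a few lines. The helper below is only a back-of-the-envelope calculation using the distances, speed, and frame rate quoted in the text.

```python
def usable_frames(speed_kmh, frame_rate_fps, near_m=16.0, far_m=30.0):
    """Number of frames captured while the sign stays inside the usable
    distance window (apparent size roughly 20x20 to 50x50 pixels)."""
    speed_ms = speed_kmh / 3.6                  # km/h -> m/s
    window_s = (far_m - near_m) / speed_ms      # time spent in the window
    return window_s * frame_rate_fps

# Worst case considered in the paper: 200 km/h with a 60 fps camera.
print(round(usable_frames(200, 60), 1))         # ~15.1 frames, matching the text
```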
If the vehicle moves at 200 km/h, a 60 fps camera can takes 15 frames for SLT sign detection during the 14 meter distance between 30 meters away and 16 meters away from the sign. If the SLT sign can be recognized from those frames, a detection and recognition system will work well, even if the vehicle moves at 200 km/h. Our system aims for this goal. 3. Related Works 3.1 Software-Oriented Implementations 3.1.1 Neural Network–Based Sign Recognition The multi-column deep neural networks and the multiscale convolutional network were introduced by Ciresan et al. [21] and Sermanet and LeCun [22] in the Neural Networks contest. They achieved as high as 99% recogni- Machine learning was used by Zaklouta and Stanciulescu [20] and Zaklouta et al. [23] for traffic sign recognition. They used support vector machine (SVM) with a histogram of oriented gradient (HOG) feature for traffic sign detection, and tree classifiers (K-d tree or random forest) to identify the content of the traffic signs. They achieved accuracy of 90% with a processing rate of 10-28 fps. Again, they faced the color problem in their implementation, and so it is difficult to apply to other situations (night, rain, and so on). 3.1.3 Color-Based, Shape-Based, and Template-Based Recognition A general feature of traffic signs is color, which is predetermined to ensure they get the driver’s attention. Hence, color is used as a feature in a lot of image segmentation research. Torresen et al. [16] detected the red circle of a traffic sign by utilizing a red–white–black filter before applying a detection algorithm. Miura et al. [7] detected traffic sign candidate regions by focusing on the white circular region within some thresholds. Zaklouta and Stanciulescu [20] also used color information within a threshold for traffic sign image segmentation. This detection method required a color camera and more computation resources, such as memory, for color image storage and detection. Similar to the neural network–based and machine learning–based recognition approaches, color-based segmentation relies on the red color, and so, has difficulty with recognition when the color of the traffic sign has changed due to age and lighting conditions. Another method for traffic sign candidate detection is based on the shape of the signs. In this approach [5], a feature where a rectangular structure yields gradients with high magnitudes at its borders is used in traffic sign candidate detection. Moutarde et al. [8] used edge detection for rectangle detection and a Hough transform for circle detection. This method is robust to changes in illumination, but it requires complex computation, such as the Hough transforms. Processing the transformation and extracting matching peaks from big image are computationally complex for real-time processing systems. Template matching [16] uses a prepared template for area comparison with various traffic sign sizes. The approach simply takes the specific color information of an area, and compares it with a prepared template for matches. Because the sizes of the traffic signs vary from 32x32 to 78x78 pixels, a lot of hardware resources and computation time are required. 
240 Hoang et al.: Real-time Speed Limit Traffic Sign Detection System for Robust Automotive Environments 3.2 Hardware Oriented Implementations 3.2.1 Hardware/Software Co-design Implemen-tations Hardware/software co-design on a low-cost Xilinx Zynq FPGA system was presented by Han [1], in which an input color image of the traffic sign is processed by software on a PC before sending it to the Zynq 7000 system on chip (SoC). The traffic sign candidates are detected with hardware using color information before refining and performing template-based matching on an ARM core. This hardware processing requires a lot of memory access, as well as software processing on both the ARM core and PC, so latency is high and throughput is low. Big templates of 80×80 pixels are required for improvement of detection accuracy. Muller et al. [9] applied software and hardware design flows on a Xilinx Virtex-4 FPGA to implement a traffic sign recognition application. It combines multiple embedded LEON3 processors for preprocessing, shape detection, segmentation, and extraction with hardware IPs for classification in parallel to achieve latency not longer than 600 ms to process one frame. However, this latency is not fast enough to apply to real-time detection and recognition. Irmak [3] also utilized an embedded processor approach with minimal hardware acceleration on a Xilinx Virtex 5 FPGA. Color segmentation, shape extraction, morphological processing, and template matching are performed on a Power PC processor with software, and edge detection is performed on a dedicated hardware block. Waite and Oruklu [17] used a Xilinx Virtex 5 FPGA device in a hardware implementation. Hardware IPs are used for hue calculation, morphological filtering, and scanning and labeling. The MicroBlaze embedded core is used for data communication, filtering, scaling, and traffic sign matching. 3.2.2 Neural Network on an FPGA A neuron network implementation on a Xilinx Vitex 4 FPGA for traffic sign detection was presented by Souani et al. [14]. The system works with low-resolution images (640×480 pixels) with two predefined regions of interest (ROIs) of 200×280 pixels. The small ROI results in high recognition speed. However, the accuracy is as low as 82%, and the traffic signs must be well lit. Hence, this system has difficulty when applied at night or to back-lit signs. 3.3 A Proposed Approach for Robust Automotive Environments All the related speed limit traffic sign recognition systems use color as an important feature for detection and recognition in their implementations, and so they have difficulty in recognizing long-term deterioration in traffic signs, traffic signs under different lighting conditions, and with LED traffic signs. They also use processing with high computational complexity, such as the Hough transform, for the traffic sign shape detection, and tree classifiers, SVM, and complex multi-layer neural networks for traffic sign identification. So the implementation resources become large, and the hardware costs are also high. The processing time of the available implementations also presents problems when applied in real life to low-end vehicles. In addition, the recognition approaches using neural networks and machine learning require learning processes with a huge dataset. For each user with a different platform, such as camera type, a specific dataset must be prepared and a relearning process is necessary to guarantee high accuracy. 
Different from the above works, which rely on color features and high complexity processing in traffic sign recognition, our approach utilizes simple yet effective speed limit traffic sign features from grayscale images. The proposed approach utilizes multiple rough, simple, and easily computable features in three-step processing to achieve a robust speed limit traffic sign detection system. Using simple features makes the proposed system applicable to various platforms, such as the type of camera (color or grayscale, high- or low-resolution), and in the line-up of FPGAs (low-end, automotive, high-end). It is also robust under various conditions in Japan scenarios, such as illumination conditions (daytime, night time, and rainy nights), and types of road (local roads and highways). The simple yet effective features enable it to easily optimize a parameter set so as to meet features of the speed limit traffic sign systems in other countries. The proposed features include area luminosity difference, pixel direction, and an area histogram. The computation of these simple features only requires simple and low-cost hardware resources such as adders, comparators, and first-in, firstout (FIFO) stacks. So, it is possible to implement on any line-up of FPGAs. The proposed speed limit traffic sign recognition system can be also extended to other traffic sign recognition. 4. Rough and Simple Feature Combination for Real-Time SLTSR Systems 4.1 Multiple Simple Features of SLT Signs 4.1.1 Shared Luminosity Feature Between Rectangle and Circle Fig. 5 shows the area luminosity feature of a scan window with a circular speed limit traffic sign inside, in which the circle line of the sign is much darker than the adjacent white area in a grayscale image. It means that the luminosity of the areas that contain the circle is much lower than the adjacent ones. If the circle fits within the scan window, the location of the dark and the white areas can be predefined as B1 to B8 (say, for black) and W1 to W8 (say, for white) in Fig. 5. The luminosity differences between those corresponding black and white areas exceed a predefined threshold. Depending on the view angle, the brightness of the black and white areas is different, and so a variable threshold is used in our detection algorithm. This luminosity feature is simple but effective because it is extendable to the detection of other shape types, such as a 241 IEIE Transactions on Smart Processing and Computing, vol. 4, no. 4, August 2015 “0” in Japan along two axes. Locations of the maximum and minimum in the histogram, as well as the ratio between those rows/columns and others, are different. The maximum and minimum in the histogram for each axis and area (the total number of pixels) are the features of the numbers and can be used to recognize the speed sign number [18]. Fig. 5. Shared luminosity feature between rectangle and circle. 3×3 template for inner-circle detection Inner-circle line detection Direction 3×3 template for outer-circle detection Outer-circle line detection Direction Feature: - Local direction of pixel can be detected by simple pattern matching. - Pixels at different locations of circle match different directions. Fig. 6. Local pixel direction feature of circle. Fig. 7. Difference in histogram between numbers. hexagon. In addition, this feature is applicable to any image size, from VGA to 4K. It is also stable if a highresolution image is down-sampled to a lower resolution. 
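A minimal software sketch of the shared luminosity feature of Section 4.1.1 follows: the mean luminosity of small dark areas expected on the circle line (B1 to B8 in Fig. 5) is compared with that of adjacent bright areas (W1 to W8), and the window is accepted only if every pair differs by more than a threshold. The exact positions of the B/W areas are not given numerically in the text, so the geometry used here (patches sampled on the expected circle line and closer to the centre at eight compass points) and the fixed threshold are assumptions for illustration; the detector described above actually uses a variable threshold.

```python
import numpy as np

def rectangle_pattern_match(sw, threshold=40, patch=3):
    """Rough luminosity test for a circular sign inside a square scan window.
    At eight compass points, a patch on the expected circle line (dark, 'B')
    is compared with a patch closer to the centre (bright, 'W'); the window
    passes if every W-B mean difference exceeds the threshold."""
    s = sw.shape[0]
    c = (s - 1) / 2.0
    r_line, r_in = 0.45 * s, 0.28 * s           # assumed radii of B / W areas
    half = patch // 2

    def mean_at(radius, angle):
        y = int(round(c + radius * np.sin(angle)))
        x = int(round(c + radius * np.cos(angle)))
        y0, y1 = max(0, y - half), min(s, y + half + 1)
        x0, x1 = max(0, x - half), min(s, x + half + 1)
        return sw[y0:y1, x0:x1].astype(float).mean()

    angles = np.arange(8) * (np.pi / 4.0)
    diffs = [mean_at(r_in, a) - mean_at(r_line, a) for a in angles]
    return all(d > threshold for d in diffs), diffs

# Synthetic 40x40 window: bright background with a dark ring (the red circle
# of a speed sign becomes dark in a grayscale image).
s = 40
yy, xx = np.mgrid[0:s, 0:s]
radius = np.hypot(yy - (s - 1) / 2, xx - (s - 1) / 2)
sw = np.full((s, s), 220, dtype=np.uint8)
sw[np.abs(radius - 0.45 * s) < 2.5] = 60        # dark circle line
matched, diffs = rectangle_pattern_match(sw)
print(matched, [round(d) for d in diffs])       # True, differences well above threshold
```

The same pairwise-difference test, with the comparison inverted (or an absolute value taken), carries over to LED signs, as discussed later in the paper.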
Hence, down-sampling an image can be used in detection, instead of a high-resolution image to save computational resources without sacrificing accuracy. 4.2 Multiple Simple Feature-Based Speed Limit Sign Recognition System 4.2.1 System Overview Fig. 8 shows the speed limit recognition system overview. The input grayscale image is raster scanned with a scan window (SW) in the rectangle pattern matching (RPM) [26] module. It computes the luminosity of rectangular and circular traffic signs, as shown in Fig. 5. The luminosity differences of those areas are then compared with a dynamic threshold to roughly determine if the SW contains a rectangle/circle shape as a traffic sign candidate with the same size. Since the RPM algorithm utilizes common features of the circle and rectangle, it can be used to recognize both circular and rectangular signs. The sign enhancement filter developed by our group includes a hardware-oriented convolution filter and image binarization, and is applied to the 8-bit grayscale pixels of traffic sign candidates, changing them to 1-bit black-andwhite (binary) pixels for circle detection and speed number recognition. The sign enhancement process helps increase the features of the speed numbers and reduces the amount of data being processed. Circle detection uses a local direction feature at the circle’s edge to decide if the detected rectangle/circle candidates are really a circle mark or not, using a binary image. Direction of pixels in different areas in the SW and patterns for pixel local direction confirmation are shown in Fig. 6. Finally, the speed number recognition (NR) module analyzes block histogram features of the regions of interest and compares them with the predefined features of speed numbers for the NR module [18]. Speed sign candidates (ROIs) Global luminosity difference 4.1.2 Pixel Local Direction Feature of Circle Rectangle Pattern Matching Fig. 6 shows a local feature of a circle in a binary image, in which pixels at the edge of a circle have different direction depending on its location. For a pre-determined size, those locations and directions are pre-determined. Direction of a pixel can be verified using simple patterns of 3×3 pixels. A circle will have the total number of pixels that match each specific direction to get into a predefined range. Signs Enhancement and Binarization Fig. 7 shows examples of a histogram for binary images of the speed sign numbers “4”, “5”, “6”, “8” and Local pixel directions Circle Detection on Binary image Number Recognition 60 km/h 4.1.3 Block Histogram Feature of Numbers Input image Not speed traffic sign Speed limit III IV V VII I II VI Block based histogram Fig. 8. Speed limit traffic sign recognition system based on multiple simple features. 242 Hoang et al.: Real-time Speed Limit Traffic Sign Detection System for Robust Automotive Environments (1,1) 50 SW 50×50 Consequently, the luminosity feature in Fig. 5 is easily extended and applied to detect an LED speed sign. The detection of black and white luminosity is inverted to be applied to grayscale images of the LED speed sign to uniform features of painted and LED signs. (x,y) SW 49×49 ••• Location and size of the traffic sign candidates SW 20×20 Inversion Converted to grayscale Fig. 9. Multi-scan window size raster scan in parallel by last column pixels processing. Fig. 10. Extendibility to LED sign recognition. 
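The local pixel direction feature of Section 4.1.2 can be mimicked in software as below. The actual 3×3 border templates of Fig. 6 are not reproduced in the text, so a crude gradient over the 3×3 neighborhood stands in for template matching; the per-direction vote counts are then required to fall inside a predefined range, as described above. The direction quantization, the demo shapes, and the range values are assumptions for illustration only.

```python
import numpy as np

def pixel_direction(patch3):
    """Quantized local direction (0..7, 45-degree steps) of the centre pixel
    of a 3x3 binary patch, from a crude gradient; None where the patch is flat."""
    p = patch3.astype(int)
    gx = p[:, 2].sum() - p[:, 0].sum()
    gy = p[2, :].sum() - p[0, :].sum()
    if gx == 0 and gy == 0:
        return None
    return int(np.round(np.arctan2(gy, gx) / (np.pi / 4))) % 8

def looks_like_circle(binary_sw, per_direction_range=(3, 100)):
    """Count edge pixels falling into each of the eight local directions and
    accept the window only if every direction count lies inside the range."""
    s = binary_sw.shape[0]
    votes = np.zeros(8, dtype=int)
    for y in range(1, s - 1):
        for x in range(1, s - 1):
            if binary_sw[y, x]:
                d = pixel_direction(binary_sw[y - 1:y + 2, x - 1:x + 2])
                if d is not None:
                    votes[d] += 1
    lo, hi = per_direction_range
    return bool(np.all((votes >= lo) & (votes <= hi))), votes

# A centred ring (the circle line of a sign after binarization) populates all
# eight direction bins, while a full-height vertical bar (a pole or a text
# stroke) fills only the two horizontal-gradient bins and is rejected.
s = 40
yy, xx = np.mgrid[0:s, 0:s]
dist = np.hypot(yy - (s - 1) / 2, xx - (s - 1) / 2)
ring = (np.abs(dist - 16) < 2).astype(np.uint8)
bar = np.zeros((s, s), dtype=np.uint8)
bar[:, 18:22] = 1
print(looks_like_circle(ring)[0], looks_like_circle(bar)[0])   # True False
```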
4.2.2 Parallel Raster Scanning of Multiple Scan Window Sizes Although the image size of the system is variable, and our system is applicable to 4K, full HD, and VGA sizes, in the following section, for simplification of explanation, we assume the input image size is 640×360. The detectable speed limit traffic sign is in a range from 20×20 to 50×50 pixels. We use a raster scanning method with the above scan window sizes, as shown in Fig. 9, to keep the processing time constant. At any clock cycle, when a new pixel (x,y) gets into the system, a column of 50 pixels from (x,y) will be read from FIFO to SWs for detection, as shown on the right of Fig. 9. All those SW sizes (from 20×20 to 50×50 pixels) are processed in parallel in one clock cycle to find all candidates at different sign sizes at that position. In the circle detection module, the three last continuous columns of the scan window are buffered in registers for detection in the same manner. 4.2.3 Feature and Strategy for LED Speed Limit Sign Detection Fig. 10 shows the difference between painted speed signs and LED type speed signs, in which the number in the LED speed sign is brighter than that in the painted sign, and the background of the sign is black in Japan. It makes the number in the LED sign became off-white in the grayscale image, and so the color of the circle line in the grayscale image becomes white, while the adjacent area becomes dark, as shown in the middle of Fig. 10. 5. Hardware Implementation 5.1 Algorithm Modification for Efficient Hardware Implementation Fig. 11 shows the data processing flow, which is optimized for hardware implementation. The hardware module, once implemented, will occupy hardware resources, even if it is used or not. Hence, in our hardware implementation, the sign enhancement and binarization (SEB) module will work in parallel with the rectangle pattern matching module to reduce the recognition latency by reducing random memory access and applying the raster scan. If a high-resolution camera is used, preprocessing, which only performs the down-sampling of the grayscale input image to one-third, is an option to reduce the computational resources occupied by RPM. The number recognition (NR) and the circle detection (CD) modules process each binary image candidate of a speed limit traffic sign in sequence, and the two modules can process the speed sign candidates in parallel to share the input. The circle detection result can be used to enhance the decision of speed number recognition. The hardware size is small, but processing time for NR is increased depending on the number of speed sign candidates. Another approach is making CD to process the speed sign candidate in parallel with RPM and SEB. Results of CD is used to reduce the number of speed limit traffic sign candidates detected by RPM. It reduces the processing time, but the penalty is an increase in hardware size. In our prototype design, we will introduce a pipeline design for the first approach. The optional pre-processing module is used if highresolution and/or an interlace camera is used. If an Input image Pre-Processing (optional) Rectangle Pattern Matching Signs Enhancement and Binarization Candidate location Binary image Circle Detection on Binary image Number Recognition Decision Speed limit Fig 11. Modification for hardware oriented speed limit sign recognition system. 
243 Location and SW flags FIFO (512) Number Recognition (NR) • • • x 50 FIFO50 Sign Enhancement Filter Speed recognition Binary Image Memory 2 (640 x 390) Binary Image Memory 1 64 Circle Recognition Judgement FIFO1 (640) 64 (640 x 390) RPM stage NR stage Fig. 12. Pipeline implementation of speed limit sign recognition system. interlaced scan is used, the preprocessing module deinterlaces by taking odd or even lines and columns of the input image before sending data to other modules. If a high-resolution camera such as a full HD camera is used, the preprocessing module is used to down-sample the input image to the appropriate image size, such as 640×360 for RPM and CD. It helps to reduce the number of candidates that need to be processed in NR. Since the luminosity and the local pixel direction features used in RPM and CD are not affected by down-sampling, down-sampling is enough for hardware resource reduction without sacrificing detection accuracy. 5.2 Hardware-Oriented Pipeline Architecture Fig. 12 shows the two-stage pipeline architecture of the speed limit recognition system. The system is able to scan for traffic signs up to 50×50 pixels in size. The proposed input image is 8-bit grayscale with a resolution of 640×360 = 230,400 pixels. It contains two stages of RPM and NR with four main modules of RPM, SEB, CD and NR. A judgement module is included to decide which speed limit is shown in the traffic sign. Support for RPM is a number of 8-bit FIFOs. The SEB and RPM processing are independent, and both of them work with grayscale images, and so they could be processed in parallel in the first pipeline stage as suggested by Fig. 11. Results from the CR module are used to strengthen the judgment of the speed limit recognition. CR and NR modules process binary images, and so are executed in parallel for input data sharing before the final judgment in the second pipeline stage. Those 8-bit processing modules and 1-bit binary processing modules are connected with the others through two memories. The first one, the location and scan window flags FIFO (LSW-FIFO), is used to store the position of the sign candidates in the input frame and the detected scan window sizes at that position. The second one, a general memory called binary image memory (BIM) with a size of 640×360 bits, is used to store the black-and-white bit value of each frame. Two independent memories and a memory-swapping mechanism are necessary to allow the 8-bit and 1-bit processing modules access the binary image memories to read and write in parallel. Fig. 13 shows the timing of the two pipeline stages of the proposed SLT sign detection and recognition system. RPM and SEB occur at the RPM stage in parallel using the RPM result FIFO CD SEB BIM 1 write NR RPM stage Judgement NR stage 2nd frame 8 Rectangle Pattern Matching (RPM) 1st frame IEIE Transactions on Smart Processing and Computing, vol. 4, no. 4, August 2015 RPM result FIFO CD SEB BIM 2 write NR Judgement Fig. 13. Pipeline stages of speed limit sign recognition system. Local Overlap BB BB 22 11 1 WW W 11 W 22 WW 3 BB 3 B8 W 8 3 3 B8 W 8 W B 4 B 4 B7 W 7 W 4 4 B7 W 7 WW W 66 W 55 BB 6 6 BB 55 Reuseable Global Overlap B1 W1 B2 W2 B1 W1 B2 W2 W 3 B3 B8 W 8 W 4 B4 B7 W 7 B8 W 8 B7 W 7 W6 B6 SWt W5 B5 W 3 B3 W 4 B4 W6 B6 W5 B5 SWt+n Fig. 14. Local overlap in adjacent scan windows and global overlap between with computational result reusable. same pixel data input. 
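The frame-level behaviour of the two-stage pipeline in Figs. 12 and 13, where the RPM stage of frame n runs while the NR stage consumes frame n-1 from the other binary image memory, can be modelled in software as below. The stage functions are simplified placeholders; the point of the sketch is the ping-pong use of the two binary image memories and the candidate list handed from one stage to the next.

```python
def rpm_seb_stage(frame):
    """Placeholder for RPM + sign enhancement/binarization: returns the
    candidate list (as in the LSW-FIFO) and the binarized frame."""
    candidates = [(16, 16, 20)]                   # (x, y, scan-window size) dummies
    binary_image = [[px > 128 for px in row] for row in frame]
    return candidates, binary_image

def cd_nr_stage(candidates, binary_image):
    """Placeholder for circle detection + number recognition + judgement."""
    return [f"speed sign candidate at {c}" for c in candidates]

def run_pipeline(frames):
    """Two-stage pipeline with two binary image memories used ping-pong:
    while frame n is scanned (RPM stage), frame n-1 is recognized (NR stage)
    from the memory that was written one frame earlier."""
    bim = [None, None]                            # BIM 1 and BIM 2
    pending = None                                # (candidates, BIM index) of frame n-1
    results = []
    for n, frame in enumerate(frames):
        write_buf = n % 2                         # alternate BIM 1 / BIM 2
        if pending is not None:                   # NR stage for the previous frame
            cands, read_buf = pending
            results.append(cd_nr_stage(cands, bim[read_buf]))
        cands, binary = rpm_seb_stage(frame)      # RPM stage for this frame
        bim[write_buf] = binary
        pending = (cands, write_buf)
    if pending is not None:                       # drain the last frame
        cands, read_buf = pending
        results.append(cd_nr_stage(cands, bim[read_buf]))
    return results

frames = [[[i % 256 for i in range(64)] for _ in range(48)] for _ in range(3)]
print(len(run_pipeline(frames)))                  # 3 recognized frames
```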
An optional scaling-down module can be applied to input data for RPM if necessary for highresolution camera usage. The scan windows in RPM and SEB are pipeline-processed with one input pixel in each clock cycle. Hence, about 640×360=230,400 clock cycles are necessary for the first stage. During the processing time of the first frame, the detection result is written into the result FIFO, and the binarization image result is written into BIM 1 for the second stage. At the second stage, the CD and the NR modules read data from the LSW-FIFO and BIM 1 for processing before handing the result to the judgment module. At the same time, the data of the second frame is processed in the RPM stage. The result is written into BIM 2. Then, the NR stage of the second frame occurs with the previously written data inside BIM 2. The same process occurs with other frames, and so the system processes all frames in the pipeline. 5.3 Implementation of Area Luminosity Computation for RPM using Computa-tional Result-Sharing on Overlap Depending on the distance between the vehicle (camera) and the SLT sign, the size of the sign on the input image varies from 20×20 to 50×50 pixels. We need to scan the input frame with various scan window sizes, as shown in Fig. 9 for SLT sign detection at all distances. Inside each scan window, the shared luminosity feature between rectangle and circle in Fig. 5 is used. Two reusable computations generated by local and global overlaps, as shown in Fig. 14, are applied to the RPM implementation to reduce the hardware size. The first re-usable computational result concerns local overlap between two adjacent scan windows, as shown in 244 Hoang et al.: Real-time Speed Limit Traffic Sign Detection System for Robust Automotive Environments St = Sstoret-1+Saddt + Sstore Hor_tmp Sstore Horizontal Area (B) Sadd + Ssub S ··· 20~50pixel S ··· FIFO Vertical B3 Area 12 the left side of Fig. 14. It is locally applied to the brightness computation of the same area in two adjacent scan windows (e.g. B1 area) and is compatible with the raster scan method. The algorithm, which is compatible with the raster scan method, and hardware implementation for pipeline area luminosity computation, is shown in Fig. 15. The overlapped area between St-1 and St (Sstoret-1) is reused without computation (Fig. 15(a)). The luminosity of the preferred area St is generated by that of the overlapped area Sstoret-1 plus the luminosity of the new input area Saddt. Luminosity of the overlapped area Sstoret-1 is computed from the luminosity of area St-1, which is computed during the processing of the previous SW, and subtracts that of the subtraction area (Ssubt-1). Hence, the computation is now for the luminosity of the addition area (Saddt), and storing the result for later use. At the same time, the newly computed luminosity of the Saddt area is added to that of the St-1 area before subtracting the previously stored luminosity of the Ssubt-1 area for area St luminosity computation. Hardware design of one scan window size for the brightness computation for each area of B1~W8 is shown in Fig. 15(b). The second overlap is globally applied to the scan windows inside a frame, as shown in the right side of Fig. 14. During the processing of scan window t (SWt), the brightness of areas W3, W4, B3, and B4 are computed. During SWt+n processing, these areas become B7, B8, W7 and W8, respectively. Hence, the brightness computation results of those areas in SWt can be stored in FIFO for later reuse in SWt+n. 
The upper right part of Fig. 16 shows how to use FIFO to design rectangle pattern matching for a single scan window size using global overlap. Combination of the luminosity computation using local and global overlaps results in simple and compact hardware design of a single scan window, as shown in the upper part of Fig. 16. The computation for other scan window sizes at the same position can be done in parallel with the same input pixels in a column, as shown in Fig. 9. The final design with all the necessary scan window sizes operating in parallel is shown in the lower part of Fig. 16. Due to the change in features of the LED speed sign shown in Fig. 10, that is, the black areas became white and vice versa, luminosity difference in the LED sign is also reversed. Instead of reversing the grayscale image for LED − W2 W3 W8 ・・・ − (x,y) Dt 50pixel pix_valid Fig. 15. Pipeline implementation of area luminosity computation using cmputational result sharing among local overlap. − − W1 B 8 match Frame (640×390 pixels) RPM for 50×50 SW RPM for 49×49 SW ・・・ (b) Hardware design of area luminosity computation using local overlap. B1 ×2 (1,1) 50 (a) Simple area luminosity computation using local overlap algorithm. B2 12 Horizontal Area (W) Dt Sadd + Ssub + Threshold comparison = − ··· t St-Ssubt − ··· Sstore Recursive implementation Sstoret-1 x 31 scan window sizes RPM for 20×20 SW Controller result 16 RPM flags (20~50) col_cnt line_cnt wr x,y Fig. 16. Implementation of RPM using global and local overlap and FIFO. sign detection, which requires a lot of computational resources, inversion of the luminosity difference between black and white areas is enough for LED speed sign detection. It can be done by taking the absolute value of the luminosity difference before making a comparison with the threshold. 5.4 Local Pixel Direction Based Circle Detection Implementations 5.4.1 Straight-Forward Implementation Fig. 17 shows the mechanism and design of the circle recognition module. A 3×3 pixel array is used to detect the direction of the input pixel using local border templates in Fig. 6. The direction is then compared with the expected direction of that pixel. The number of matches is voted on and stored in the register. The final number of matches is compared with a previously decided threshold. If the number of matches is in the predefined threshold range, the input SW is considered a circle. 5.4.2 Fast and Compact Implementation The above direct circle detection design in Fig. 17 can be improved for faster operation using pipeline and computation reuse of local overlap, which are suitable for raster scanning. Since the templates are 3×3 pixels, data of the last three columns for an SW is buffered for direction confirmation and voting. The pipeline computation mechanism of matched pixel voting for each expected direction in a scan window is shown in Fig. 18. When a new pixel arrives, its corresponding column in the SW and two previous ones are buffered, making a 3-pixel column, as shown in the left side of Fig. 18. There is overlap in 3×3-pixel-patterns among different lines in the 3-pixel column; hence, a 1×3-pixel pattern of each line is checked separately before combining three adjacencies for the final direction confirmation. 
In a column, the upper part needs to be verified with three directions: down-right (↘), down (↓), and down-left (↙); the middle part needs to be verified with two directions: right (→), left (←); and the lower part needs to be verified with three directions: upright (↗), up (↑), and up-left (↖). Dividing the input IEIE Transactions on Smart Processing and Computing, vol. 4, no. 4, August 2015 245 Fig. 17. Straight-forward implementation of circle detection using local pixel direction. Fig. 19. Fast and compact implementation of circle detection using local pixel direction and overlap. Fig. 18. Fast and compact local pixel direction voting using overlap. column into three parts (upper, middle, and lower) helps to reduce the number of directions that need to be verified at each location to one-third. The right part of Fig. 18 shows the direction voting for one area among the eight, using reusable computational results, in which the direction voting results in the overlap in each area between SWi-3, and SWi is reused without revoting. The voting for an area in SWi is generated by the voting result of the overlapped area plus the voting result of the new input area (dircol+i). The voting result of the overlapped area is the result of the previously voted area (dirareai-3) without the subtraction area (dircol-i). The voting is now for the addition area (dircol+i) and storing the result for later usage as subtraction (dircol-i). At the same time, the newly voted column result dircol+i is added with dirareai-3 before subtracting the previously stored dircol-i voting result for that area’s (areai) final voting result (dirareai ). As shown in Fig. 18, the waiting time after the direction of the newly input pixels is verified and voted on, until the time it becomes addition dircol+i and subtraction dircol-i of different directions, is different. The hardware implementation of CD using voting for the local pixel direction at the edge is shown in Fig. 19. When a new binary pixel arrives in the system, the corresponding three columns (the last columns in Fig. 18) are given to the direction voting module. Pattern confirmation, a shared module among various SW sizes compares each of the three inputs in a line with the line patterns. Three adjacent results are then combined in a column-direction confirmation submodule before voting for the number of pixels that match a Fig. 20. Daytime scenes in local road and highway with sign distortion: recognizable with no difficulty. specific direction for each SW size. These results are pushed into column-direction voting FIFOs and become dircol+ and dircol- at a specific time, which depends on the pixel location and size of the scan window. The voting results of all columns in the corresponding area are accumulated, forming area voting result dirarea. Then, dirarea is stored in the local area direction voting register. All local area voting results are added together, making the SWi directions voting result for CD. 6. Evaluation Results and Discussion 6.1 Datasets in Various Conditions 6.1.1 Dataset Taken in Japan The Japan dataset is used to verify speed sign recognition for moving images under various conditions, as shown in Fig. 20 and Fig. 21. It includes 125 daytime scenes on highways and local roads; 25 scenes on a clear night, and 44 scenes on rainy nights on local roads, as shown in Table 1. When a high-resolution camera (full HD) is used, images at both original and down-sampled sizes are tested. All of the frames are grayscale. 
Frames with full HD resolution are down-sized to 1/3 on each axis for simulation in our test. The algorithm is easily implemented with few hardware resources by reading every third pixel along both columns and rows. A scene is considered to be all the contiguous frames in which the same sign appears in the observable field of the camera until it disappears, as defined in Section 2 and Fig. 4. Depending on the speed of the car, the number of frames in a scene varies. The size of the sign also varies and gradually increases between adjacent frames as the vehicle gets closer to the sign. Lighting conditions for the same sign also change, depending on the distance and angle between the sign and the camera, as shown in Fig. 21. The dataset also includes signs with different recognition difficulties, depending on the weather and light conditions.

Fig. 21. Adjacent frames of the same sign under backlit condition: left: undetectable; right: detectable.

Table 1. Datasets for accuracy evaluation with various lighting, weather, and camera conditions taken in Japan.
Condition | Camera | Resolution [pixels] | No. of Frames | No. of Scenes
Daytime 1 (Japan) | Grayscale Cam *1 | 640×390 (original) | 41,120 | 60
Daytime 2 (Japan) | Video Cam *2 | 1920×1080 (original) | 40,000 | 65
Daytime 2 (Japan) | Video Cam *2 | 640×360 (down-sampled) | 40,000 | 65
Clear Night (Japan) | Video Cam *2 | 1920×1080 (original) | 39,136 | 25
Clear Night (Japan) | Video Cam *2 | 640×360 (down-sampled) | 39,136 | 25
Rainy Night (Japan) | Video Cam *2 | 1920×1080 (original) | 66,880 | 44
Rainy Night (Japan) | Video Cam *2 | 640×360 (down-sampled) | 66,880 | 44
*1: 60 fps grayscale camera (640×390 pixels). *2: 60 fps interlaced full HD color camera.

6.1.2 Dataset Taken in Germany

The German Traffic Sign Detection Benchmark (GTSDB) dataset [25], which includes 900 individual traffic images of 1360×800 pixels, was originally used for sign detection contests with machine learning. Since we aim at two-digit speed traffic sign recognition, we created a sub-GTSDB dataset by taking all frames that contain two-digit speed traffic signs. The sub-GTSDB has 255 frames at 1360×800 pixels, as shown in Figs. 22 and 23. "Scene" is not applicable to this dataset because it contains individual frames only. Sizes of the speed limit signs range from 14×16 pixels to 120×120 pixels. Color images are converted to grayscale images before testing.

Fig. 22. Dataset (single frames) taken in Germany; sign sizes of 55×55, 64×64, and 75×75 pixels are shown.

Fig. 23. German frames at day and night times in grayscale. Both are detectable and recognizable.

6.2 Simulation Results and Discussion

6.2.1 Number of Speed Sign Candidates Detected in RPM and CD, and Processing Time for NR

Table 2 shows the number of traffic sign candidates detected by the RPM module and the number of speed traffic sign candidates detected by the CD module in one frame under various conditions. The full HD input image is down-sampled to match the designed 640×360 pixel resolution.

Table 2. Number of candidates and effectiveness of Circle Detection.
Conditions | No. of sign candidates (Best / Avg.) | No. of speed sign candidates (Best / Avg.) | Effectiveness of CD (Best / Avg.)
Daytime (640×390) | 200 / 118 | 168 / 109 | 32 / 9
Night time (640×360) | 403 / 109 | 352 / 96 | 51 / 13

On average, the number of candidates detected by the RPM module is 118, and the number of candidates remaining after CD is 96 for night time. In the best case, CD removes up to 51 speed traffic sign candidates.
This is not a big number, but it helps to remove complicated cases for NR, in which random noise in QR code form and Japanese kanji are taken as speed sign candidates by RPM. The number of SLT sign candidates can also be reduced by applying region of interest, because the SLT sign appears in the input image in a known area, as shown in Fig. 24 (in Japan). We can concentrate on the SLT sign candidates detected in this area only to reduce processing time for NR. The time needed to raster scan one frame is 640×360=230,400 clock cycles. In the worst case scenario, when the speed limit traffic sign candidate is as big as 50×50 pixels, the NR module needs two clock cycles for IEIE Transactions on Smart Processing and Computing, vol. 4, no. 4, August 2015 247 Non traffic sign area Fig. 24. Appearance area of SLT sign in Japan. Covered scan window sizes Step = 1 No. detected candidates = 112 (a) 112 speed sign candidates detected with scan step = 1 Scan window sizes 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 Fig. 25. Cover of each scan window size. Red marks picked up SW sizes for best hardware and accuracy trade-off. one line of data reading and processing [18], and so 100 clock cycles for number candidate processing. Hence, during 230,400 clock cycles used in RPM, NR can process 2,304 speed sign candidates, which is more than the number of speed sign candidates detected by RPM and CD. The available time (230,400 clock cycles) is enough for all SLT sign candidate recognition in the pipeline implementation in Fig. 12. 6.2.2 Optimization of SW Sizes and Scan Step Fig. 25 shows the cover of one scan window size over the others. A small SW size, such as 20×20 pixels and 21×21 pixels cannot cover the others as well as be covered by the others. However, the cover size of a SW size is gradually increased, such that SW size of 23×23 pixels can cover SW size of 22×22 pixels, and a scan window size of 34×34 can cover a range from 32×32 to 41×41. The red lines show the coverable area for all SW sizes. Blue points show the SW sizes that are covered by other SW sizes, while the yellow points show the points in which data is not available in the dataset. It suggests that not all scan window sizes are necessary for implementation. Some scan window sizes with less detectable and recognizable areas of SLT sign can be removed. In our simulation, 14 scan window sizes (20, 21, 23, 24, 26, 28, 30, 32, 34, 36, 38, 42, 46, and 50) are enough for sign detection without reduction in accuracy. Step = variable No. detected candidates = 38 (b) 38 speed sign candidates detected with variable scan step Fig. 26. Effectiveness of variable scan step. When the scan window moves one pixel, change in the scan area is significant for small scan window sizes, such as 20×20 (a 5% change, because one row or one column of the SW is replaced), but this change is minor for big scan window sizes, such as 50×50 (a 2% change). Hence, scan step can be a variable depending on the SW size. Making the scan step variable reduces the number of candidates that need to be processed in the CD and NR stages, as shown in Fig. 26, in which the number of speed sign candidates is reduced from 112 to 38 with no impact on detection accuracy. SLT sign recognition accuracy in Fig. 
27 shows that when the SLT sign size is bigger than 23×23 pixels, the recognition rate reaches 100%, and so some SW sizes can be removed to save hardware resources with no reduction in accuracy.

Fig. 27. Speed sign recognition accuracy by SW size: (a) detection accuracy on a local road by SW size, (b) detection accuracy on a highway by SW size.

6.3 Hardware Implementation Results and Processing Ability

Table 3 shows the hardware implementation resources and frequency for RPM, CD and related modules in speed limit traffic sign candidate detection. Two implementations with different FIFO approaches are given. The first implementation utilizes the flip-flops available inside slices of the Xilinx FPGA to build the FIFO stacks and shift registers required in Figs. 17 and 20. Due to the small number of registers available inside each slice, the implementation with all 31 SW sizes in the proposed range (20×20 to 50×50) utilizes 68,552 slice LUTs (128%) and cannot fit the target FPGA. Reducing the number of SW sizes to 14 (20, 21, 23, 24, 26, 28, 30, 32, 34, 36, 38, 42, 46, and 50) significantly reduces the required resources, making it implementable on the target FPGA at a frequency of 202.2 MHz. This maximum frequency guarantees that over 60 full HD frames can be processed for sign detection in one second.

The second implementation utilizes the memory inside the LUT in a SLICEM to generate a 32-bit shift register without using the flip-flops available in a slice. Since the size of the FIFO stacks and shift registers required in Figs. 17 and 20 is not too big, the shift register generated by the memory inside the LUTs is applicable. Since the amount of memory inside the LUTs in a slice is much larger than the number of flip-flops inside a slice, this approach significantly reduces the required resources to 43,246 (81.3%) slice LUTs if all 31 SW sizes are implemented. This number reduces to 18,154 (34.1%) slice LUTs if 14 SW sizes are implemented. Fewer required computational resources also increase the maximum frequency of the detection module to 321.9 MHz, compared with 202.2 MHz achieved by the first implementation.

Table 3. Hardware resources and latency.
Module (FIFO type) | All SW sizes (20×20~50×50): # slice registers / # slice LUTs / Freq. (MHz) | 14 SW sizes*: # slice registers / # slice LUTs / Freq. (MHz)
RPM (register-based FIFO) | 19,763 / 34,189 / 202.2 | 8,289 / 14,394 / 202.2
CD (register-based FIFO) | 26,669 / 34,363 / 390.3 | 11,061 / 14,437 / 390.3
Total (register-based FIFO) | 46,432 (43.6%) / 68,552 (128.9%) / 202.2 | 19,350 (18.2%) / 28,831 (54.2%) / 202.2
RPM (RAM-based shift register FIFO) | 13,951 / 28,391 / 321.9 | 5,704 / 11,808 / 321.9
CD (RAM-based shift register FIFO) | 4,232 / 14,885 / 390.3 | 1,755 / 6,346 / 390.3
Total (RAM-based shift register FIFO) | 18,183 (17.1%) / 43,246 (81.3%) / 321.9 | 7,459 (7%) / 18,154 (34.1%) / 321.9
Controller (RPM+CD) | 19 / 350 / - | 19 / 350 / -
* 14 SW sizes: 20, 21, 23, 24, 26, 28, 30, 32, 34, 36, 38, 42, 46, and 50. Target device: Xilinx Zynq Automotive Z7020 FPGA (106,400 slice registers, 53,200 slice LUTs).

Table 4. Detection rate and throughput of the proposed system.
Dataset | Input image | Scene positive recognition rate (%) | Speed
Daytime 1, Daytime 2 | 640×360, 8-bit grayscale | 100 (125 scenes / 125 scenes) | > 60 fps (Xilinx Zynq 7020)
Night, Rainy night | Full HD, 8-bit grayscale | 100 (69 scenes / 69 scenes) | > 60 fps (Xilinx Zynq 7020)

6.4 Comparison with Related Works

In terms of speed limit traffic sign candidate detection, our method achieves 100% accuracy. When number recognition is included, the accuracy of the proposed speed

Table 5. Detection rate and throughput of related works.
Method Recognition Thpt. rate (%) (fps) Platform Color / grayscale Proposed method 98 (100%*) > 60 fps Zynq* both SIFT [15] 90.4 0.7 Pentium D 3.2 GHz Color Hough [4] 91.4 6.7 Pentium 4 2.8 GHz Color Random forest [20] 97.2 18~28 - Color Neural network [14] 82.0 62.0 Virtex 4 Color Fuzzy template [6] 93.3 66.0 Pentium 4 3.0 GHz Color Hardware/ Software [17] - 1.3 Virtex 5 Color Multi-Core SoC [9] - 2.3 Virtex 4 Color Hardware/ Software [1] - 10.4 Zynq 4 Color Hardware/ Software [3] 90.0 14.3 Virtex 5 Color Multi-scale convolutional network [22] 98.6 NA Multi-column deep neural network [21] 99.5 NA Color GPU 512 core Color 100%* : achieved 100% accuracy with contrast adjustment. NA : not applicable due to traffic sign recognition only. Zynq* : Xilinx Automotive Zynq 7020 FPGA. Virtex : Xilinx Virtex FPGA. Original rectangle detection B W BW WB W B Directly applicable for shape detection IEIE Transactions on Smart Processing and Computing, vol. 4, no. 4, August 2015 Extendable traffic sign shapes B W B W BW WB BW W B B W WB BW W B WB W B Fig. 28. Extendibility to other sign shapes. limit traffic sign detection and recognition system is 98% if no contrast adjustment is applied to the rainy night scenes. With simple contrast adjustment, accuracy in speed limit sign detection and recognition increases to 100%, even under difficult conditions such as rainy nights, as shown in Table 4. Table 5 shows the accuracy and throughput of related works. In comparison with other available research, our system gets a higher precision rate and higher throughput than all others. It also requires fewer hardware resources than the others, and so is implementable on a low-cost automotive-oriented Xilinx Zynq 7020 FPGA, whereas the others require a PC or high-end, and high-cost FPGAs, such as the Xilinx Virtex 4 or Virtex 5, in their implementations. as rainy nights, is able to process more than 60 full HD fps, and is implementable on the low-cost Xilinx automotiveoriented Zynq 7020 FPGA. Therefore, it is applicable in real life. In the future, a full speed limit traffic sign recognition system will be implemented and verified on an FPGA. Extension to LED signs and other countries’ speed limit sign recognition, as well as other traffic sign detection, will be done. Acknowledgement Part of this work was supported by Grant-in-Aid for Scientific Research (C) and Research (B) JSPS KAKENHI Grant Numbers 2459102 and 26280015, respectively. References [1] [2] 6.5 Function Extendibility In terms of application, the rectangle pattern–matching algorithm can directly apply to other sign detections aside from the rectangle shape, such as the circle and octagon shapes, as shown in Fig. 28. These three shapes have the same global location and luminosity features, and so we can roughly recognize the ROI for those signs before dividing them into the correct shape based on their local features. The algorithm can also extend to other shapes, such as the triangle and hexagon. Depending on the shape of the target sign border, we need to change the location of black and white area computation. 7. Conclusion This paper introduces our novel algorithm and implementation for speed limit traffic sign candidate detection using simple features of speed limit traffic signs in a grayscale-image (area luminosity, local pixel direction, area histogram) combination. 
By using grayscale images, our approach overcomes training and color dependence problems, which reduce recognition accuracy in unlearned conditions when color has changed, compared to other research. The proposed algorithm is robust under various conditions, such as during the day, at night, and on rainy nights, and is applicable to various platforms, such as color or grayscale cameras, high-resolution (4K) or lowresolution (VGA) cameras, and high-end or low-end FPGAs. The combination of coarse- and fine-grain pipeline architectures using results-sharing on overlap, application of a RAM-based shift register, and optimization of scan window sizes provides a small but high-performance implementation. Our proposed system achieves better than 98% recognition accuracy, even in difficult situations, such 249 [3] [4] [5] [6] [7] [8] [9] [10] Han, Y.: Real-time traffic sign recognition based on Zynq FPGA and ARM SoCs, Proc. 2014 IEEE Intl. Conf. Electro/Information Technology (EIT), pp. 373-376, 2014. Article (CrossRef Link) Hoang, A.T., Yamamoto, M., Koide, T.: Low cost hardware implementation for traffic sign detection system, Proc. 2014 IEEE Asia Pacific Conf. Circuits and Systems (APCCAS2014), pp. 363-366, 2014. Article (CrossRef Link) Irmak, M.: Real time traffic sign recognition system on FPGA, Master Thesis, The Graduate School of Natural and Applied Sciences of Middle East Technical University, 2010. Article (CrossRef Link) Ishizuka, Y., and Hirai, Y.: Recognition system of road traffic signs using opponent-color filter, Technical report of IEICE, No. 103, pp. 13-18, 2004, (in Japanese). Keller, C. G., et al.: Real-time recognition of U.S. speed signs, Proc. Intl. IEEE Intelligent Vehicles Symposium, pp. 518-523, 2008. Article (CrossRef Link) Liu, W., et al.: Real-time speed limit sign detection and recognition from image sequences, Proc. 2014 IEEE Intl. Conf. Artificial Intelligence and Computational Intelligence (AICI), pp. 262-267, 2010. Article (CrossRef Link) Miura, J., et al.: An active vision system for realtime traffic sign recognition, Proc. Intl. IEEE Conf. Intelligent Transportation Systems, pp. 52-57, 2000. Article (CrossRef Link) Moutarde, F., et al.: Robust on-vehicle real-time visual detection of American and European speed limit signs, with a modular traffic signs recognition system, Proc. Intl. IEEE Intelligent Vehicles Symposium, pp. 1122-1126, 2007. Article (CrossRef Link) Muller, M., et al.: Design of an automotive traffic sign recognition system targeting a multi-core SoC implementation, Proc. Design, Automation & Test in Europe Conference & Exhibition 2010, pp. 532-537, 2010. Article (CrossRef Link) Ozcelik, P. M., et al.: A template-based approach for real-time speed limit sign recognition on an embedded 250 [11] [12] [13] [14] [15] [16] [17] [18] [19] [20] [21] [22] [23] Hoang et al.: Real-time Speed Limit Traffic Sign Detection System for Robust Automotive Environments system using GPU computing, Proc. 32nd DAGM Conf. Pattern Recognition, pp. 162-171, 2010. Article (CrossRef Link) Raiyn, J.,and Toledo, T.: Real-time road traffic anomaly detection, Journal of Transportation Technologies, 2014, No. 4, pp. 256-266, 2014. Article (CrossRef Link) Schewior, G., et al.: A hardware accelerated configurable ASIP architecture for embedded realtime video-based driver assistance applications, Proc. 2011 IEEE Intl. Conf. Embedded Computer Systems (SAMOS), pp. 18-21, 2011. 
Article (CrossRef Link) Soendoro, D., and Supriana, I.: Traffic sign recognition with color based method, shape-arc estimation and SVM, Proc. 2014 IEEE Intl. Conf. Electrical Engi-neering and Informatics (ICEEI), pp. 1-6, 2011. Article (CrossRef Link) Souani, C., Faiedh, H., and Besbes, K.: Efficient algorithm for automatic road sign recognition and its hardware implementation, Journal of Real-Time Image Processing, Vol. 9, Issue 1, pp. 79-93, 2014. Article (CrossRef Link) Takagi, M., and Fujiyoshi, H.: Road sign recognition using SIFT feature, Proc. 18th Symposium on Sensing via Image Information, LD2-06, 2007, (in Japanese). Torresen, J., et al.: Efficient recognition of speed limit signs, Proc. 7th Intl. IEEE Conf. Intelligent Transportation Systems, pp. 652-656, 2004. Article (CrossRef Link) Waite, S. and Oruklu, E.: FPGA-based traffic sign recognition for Advanced Driver Assistance Systems, Journal of Transportation Technologies, Vol. 3, No. 1, pp. 1-16, 2013. Article (CrossRef Link) Yamamoto, M., Hoang, A-T., Koide, T.: Speed traffic sign recognition algorithm for real-time driving assistant system, Proc. 18th Workshop on Synthesis And System Integration of Mixed Information Technologies (SASIMI 2013), pp. 195200, 2013. Article (CrossRef Link) Zaklouta, F. and Stanciulescu, B.: Segmentation masks for real-time traffic sign recognition using weighted HOG-based trees, Proc. 14th Intl. IEEE Conf. Intelligent Transportation Systems (ITSC), pp. 1954-1959, 2011. Article (CrossRef Link) Zaklouta, F., Stanciulescu, B.: Real-time traffic sign recognition in three stages, Journal of Robotics and Autonomous Systems, Vol 62, Issue 1, pp. 16-24, 2014. Article (CrossRef Link) Ciresan, D., et al.: Multi-column deep neural network for traffic sign classification, Journal of Neural Networks, Vol 32, pp. 333-338, 2012. Article (CrossRef Link) Sermanet, P., LeCun, Y.: Traffic sign recognition with multi-scale convolutional networks, Proc. Intl. Joint IEEE Conf. on Neural Networks (IJCNN), pp. 2809-2813, 2011. Article (CrossRef Link) Zaklouta, F., Stanciulescu, B., and Hamdoun. O: Traffic sign classification using K-d trees and random Copyrights © 2015 The Institute of Electronics and Information Engineers forests, Proc. Intl. Joint IEEE Conf. on Neural Networks (IJCNN), pp. 2151-2155, 2011. Article (CrossRef Link) [24] Article (CrossRef Link), access at March, 15th, 2015 [25] Article (CrossRef Link), access at June, 10th, 2015. [26] Hoang, A-T., Yamamoto, M., Koide, T.: Simple yet effective two-stage speed traffic sign recognition for robust vehicle environments, Proc. 30th Intl. Technical Conference on Circuits/Systems, Computers and Communications (ITC-CSCC 2015), pp.420-423, 2015. Anh-Tuan Hoang was born in 1976. He received his Ph.D. from Ritsumeikan University in 2010. He was a postdoctoral researcher at Ritsumeikan University, Japan, from 2010 to 2012. Since 2012, he has been a postdoctoral researcher at the Research Institute for Nanodevice and Bio System, Hiroshima University, Japan. His research interest is side channel attack and tamper-resistant logic design, real-time image processing and recognition for vehicle and medical image type identification. He is a member of the IEEE and IEICE. Tetsushi Koide (M’92) was born in Wakayama, Japan, in 1967. He received a BEng in Physical Electronics, an MEng and a PhD in Systems Engineering from Hiroshima University in 1990, 1992, and 1998, respectively. 
He was a Research Associate and an Associate Professor in the Faculty of Engineering at Hiroshima University from 1992-1999 and in 1999, respectively. After 1999, he was with the VLSI Design and Education Center (VDEC), The University of Tokyo, as an Associate Professor. Since 2001, he has been an Associate Professor in the Research Center for Nanodevices and Systems, Hiroshima University. His research interests include system design and architecture issues for functional memory-based systems, real-time image processing, healthcare medical image processing systems, VLSI CAD/DA, genetic algorithms, and combinatorial optimization. Dr. Koide is a member of the Institute of Electrical and Electronics Engineers, the Association for Computing Machinery, the Institute of Electronics, Information and Communication Engineers of Japan, and the Information Processing Society of Japan. Masaharu Yamamoto was born in 1990. He received a BEng in Electrical Engineering from Hiroshima University, Hiroshima, Japan in 2013. He is currently at TMEiC, Japan. Since 2012, he has been involved in the research and development of hardware-oriented image processing for advanced driver assistance systems (ADASs). IEIE Transactions on Smart Processing and Computing, vol. 4, no. 4, August 2015 http://dx.doi.org/10.5573/IEIESPC.2015.4.4.251 251 IEIE Transactions on Smart Processing and Computing Pair-Wise Serial ROIC for Uncooled Microbolometer Array Syed Irtaza Haider1, Sohaib Majzoub2, Mohammed Alturaigi1 and Mohamed Abdel-Rahman1 1 2 College of Engineering, King Saud University, Riyadh, KSA {sirtaza, mturaigi, mabdelrahman}@ksu.edu.sa Electrical and Computer Engineering, University of Sharjah, UAE sohaib.majzoub@ieee.org * Corresponding Author: Syed Irtaza Haider Received June 20, 2015; Revised July 15, 2015; Accepted August 24, 2015; Published August 31, 2015 * Short Paper * Extended from a Conference: Preliminary results of this paper were presented at the ICEIC 2015. This present paper has been accepted by the editorial board through the regular reviewing process that confirms the original contribution. Abstract: This work presents modelling and simulation of a readout integrated circuit (ROIC) design considering pair-wise serial configuration along with thermal modeling of an uncooled microbolometer array. A fully differential approach is used at the input stage in order to reduce fixed pattern noise due to the process variation and self-heating–related issues. Each pair of microbolometers is pulse-biased such that they both fall under the same self-heating point along the self-heating trend line. A ±10% process variation is considered. The proposed design is simulated with a reference input image consisting of an array of 127x92 pixels. This configuration uses only one unity gain differential amplifier along with a single 14-bit analog-to-digital converter in order to minimize the dynamic range requirement of the ROIC Keywords: Microbolometer, Process variation, ROIC, Self-heating thermal imaging system 1. Introduction Infrared uncooled thermal imagers have been employed in a wide range of civilian and military applications for smartphone cameras, industrial process monitoring, driver night vision enhancement, and military surveillance [1-3]. Micro-electro-mechanical system (MEMS) microbolometer thermal detectors are the most widely used pixel element detectors in today's infrared uncooled thermal imaging cameras. 
Microbolometer sensor arrays are fabricated using MEMS technology, but they suffer from process variation, which introduces fixed pattern noise (FPN) in detector arrays [4]. At the early stage of sensor fabrication, a sensor’s resistance discrepancy of ±10% is expected [5]. A microbolometer detector changes its resistance when exposed to IR radiation due to its thermally sensitive layer. If the target temperature differs from the ambient temperature, i.e. ΔTscene, by 1K, it results in a temperature increase in the microbolometer membrane on the order of 4mK [2]. These thermal detectors need to be electrically biased during the readout in order to monitor the change in resistance. Electrical biasing generates heat, which results in self-heating of the microbolometer detector and causes a change in resistance. Heat generated by self-heating cannot be quickly dissipated through thermal conduction to the substrate. It results in a change in temperature due to selfheating much higher than a change in temperature due to incident radiation [6, 7]. FPN and self-heating results in major degradation and poor performance of the thermal imaging system, and hence, imposes a strict requirement on ROIC for noise compensation in order to detect the actual change due to infrared radiation. Readout topologies extensively discussed in the literature are pixel-wise [8], column-wise [5] and serial readout [9]. Pixel-wise readout improves noise performance of the microbolometer by increasing the integration time up to the frame rate [10]. Column-wise readout reduces the number of amplifiers and integrators, which thus serves as a good compromise between the silicon area and parallel components [10, 11]. Serial readout architecture is read pixel by pixel, and therefore, it requires only one amplifier and integrator, resulting in low power consumption and a compact layout [10]. This paper focuses on ROIC design considering the impact of process variation and self-heating on performance. Pair-wise, time-multiplexed, column-wise configuration is used, in which one pair of microbolometers is selected at a 252 Haider et al.: Pair-Wise Serial ROIC for Uncooled Microbolometer Array time. Readout is performed differentially during the pulse duration. The focal plane array (FPA) consists of 127x92 normal microbolometers and one row of blind microbolometers to provide a reference for the ROIC. This paper is organized as follows. Section 2 describes the literature review. Section 3 covers thermal modelling of an uncooled microbolometer. Section 4 explains the pair-wise serial readout architecture in detail, along with an ROIC simulator. Finally, we conclude this paper in the last section. 2. Literature Review This section summarizes the studies conducted on different aspects of thermal imaging systems. Some of the important figures of merit are discussed, which are helpful in evaluating the performance of infrared detectors. Some of the commonly used thermal sensing materials that influence the sensitivity of microbolometers are discussed. The focal plane array and the readout integrated circuit are two major building blocks of a thermal imaging system. The operation of a thermal imaging system starts with the absorption of the incident infrared radiation by the uncooled microbolometer detector array. Each microbolometer detector changes its resistance based on the absorbed infrared radiation. It is important to establish criteria with which different infrared thermal detectors are compared. 
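Before listing those criteria, a toy numerical sketch may help show why the differential, pair-wise readout adopted in this work suppresses drift that is common to both detectors of a pair. This is our own simplified model, not the circuit described later: the self-heating drop is treated as identical for the two detectors, which is only approximately true when their resistances differ, and all values except the ±10% spread are illustrative.

```python
# Toy model (not the authors' circuit): single-ended vs. differential readout of
# one microbolometer pair with +/-10% process variation, a common self-heating
# resistance drop, and a small IR-induced change on one detector.
import numpy as np

rng = np.random.default_rng(1)
R0 = 100e3                    # nominal detector resistance (ohm)
I_BIAS = 20e-6                # pulse bias current (A)

R_pv = R0 * (1 + rng.uniform(-0.10, 0.10, size=2))  # process-variation offsets
dR_selfheat = -1.4e3          # common self-heating drop (illustrative value, ohm)
dR_ir = np.array([0.0, 5.0])  # IR changes only the second detector here (ohm)

v_single = I_BIAS * (R_pv + dR_selfheat + dR_ir)     # single-ended readings
v_diff = v_single[1] - v_single[0]                   # differential reading

print(f"single-ended: {v_single[0]:.4f} V and {v_single[1]:.4f} V")
print(f"differential: {v_diff * 1e3:.3f} mV  (the common self-heating term cancels)")
print(f"IR-only part: {I_BIAS * dR_ir[1] * 1e3:.3f} mV")
```

The residual offset left in the differential reading is the pair's fixed pattern noise, which is why the dark-condition reference correction described later is still needed.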
The most important figures of merit are the thermal time constant (τ), noise equivalent power (NEP), responsivity, noise equivalent temperature difference (NETD) and detectivity (D*) [2, 10, 12-14]. The performance of a microbolometer is influenced by the thermal sensing material. There are three types of material that are suitable for the bolometer: metal, semiconductor and superconductor. Metal and semiconductor microbolometers operate at room temperature, whereas superconductor microbolometers require cryogenic coolers. Typical materials used for microbolometers are titanium, vanadium oxide and amorphous silicon.

Single input mode and differential input mode ROIC designs are widely used at the reading stage of the ROIC in order to detect the resistance value of a microbolometer and to generate the voltage value. Differential input mode assumes that adjacent pixels are subjected to a small radiation difference, and hence a small resistance change, causing a similar voltage change for adjacent microbolometer cell resistances that can be read and handled by a differential amplifier [15]. In addition, process variation creates resistance discrepancies among the sensors during the wafer process. A differential input mode ROIC design attempts to cancel these resistance differences among microbolometers using the differentiation method. Thus, it suppresses the common error and amplifies the differential signal. The conventional single input mode, on the other hand, is known to be inefficient at compensating for fixed pattern noise, since it has low immunity to process variation. A comparison between single input mode and differential input mode ROICs to decrease the error due to process variation was presented in [16].

Pixel-parallel, serial and column-wise readout architectures have mostly been discussed in the past. Parallel readout increases power consumption and the complexity of the readout, whereas serial readout reduces the speed of a thermal imager due to its time-multiplexed nature. Pixel-parallel readout, also known as frame-simultaneous readout, is used for very-high-speed thermal imaging systems. Each pixel in the cell array consists of a detector, an amplifier and an integrator. The pixel-wise readout architecture is suitable for very-low-noise applications because it reduces Johnson noise, one of the largest noise sources. The disadvantage of the pixel-parallel architecture is the complex readout, resulting in a large pixel area and extensive power dissipation. Conventional ROICs use column-wise readout because this architecture serves as a good compromise between the speed and complexity of readout due to parallel components. Finally, the last readout architecture is serial readout. This approach uses only one amplifier and one analog-to-digital converter (ADC) to perform the readout, due to the time-multiplexed nature of its readout. The advantages of serial readout are a compact layout and low power consumption.

3. Thermal Modeling of a Microbolometer

Self-heating is an unavoidable phenomenon which causes the temperature of a thermal detector to rise, even though the bias duration is much smaller than the thermal time constant of the detector. The heat balance equation of a microbolometer, including self-heating, can be written as:

H \frac{d\Delta T}{dt} + G\,\Delta T = P_{BIAS} + P_{IR}    (1)

where H is the thermal capacitance, G is the thermal conductance of the microbolometer, PBIAS is the bias power, and PIR is the infrared power absorbed by the microbolometer detector.
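To make the roles of H, G, and the bias power in (1) concrete, the following sketch (our own illustration, not the ROIC simulator of Section 4) integrates the heat-balance equation for one pulse-biased pixel with forward Euler, using the parameter values listed later in Table 1; the step size and the printed quantities are our choices.

```python
# Sketch (our illustration, not the authors' simulator): forward-Euler integration
# of the heat-balance equation (1), H*d(dT)/dt + G*dT = P_BIAS + P_IR,
# for a single pixel using the parameter values of Table 1.
import numpy as np

H = 4.34e-10      # thermal capacitance (J/K)
G = 3.7e-8        # thermal conductance (W/K)
R0 = 100e3        # nominal resistance (ohm)
alpha = -0.026    # TCR (1/K), i.e. -2.6 %/K
I_bias = 20e-6    # bias current (A)
t_bias = 6e-6     # bias pulse duration (s)
P_ir = 0.0        # absorbed IR power, set to zero to isolate self-heating

dt = 1e-8                           # integration step (s)
t = np.arange(0.0, 5 * t_bias, dt)  # simulate well past the pulse
dT = np.zeros_like(t)               # temperature rise above ambient (K)

for k in range(1, len(t)):
    biased = t[k] <= t_bias
    R_b = R0 * (1.0 + alpha * dT[k - 1])          # eq. (2)
    P_bias = I_bias**2 * R_b if biased else 0.0   # eq. (4), zero after the pulse
    dT[k] = dT[k - 1] + dt * (P_bias + P_ir - G * dT[k - 1]) / H   # eq. (1)

dT_end = dT[t <= t_bias][-1]
print(f"temperature rise at end of {t_bias*1e6:.0f} us pulse: {dT_end*1e3:.1f} mK")
print(f"corresponding resistance change: {R0 * alpha * dT_end:.0f} ohm")
```

A rise of roughly half a kelvin for a 6 μs, 20 μA pulse is broadly in line with the self-heating magnitudes discussed with Fig. 2 in Section 6, and is far larger than the millikelvin-level change produced by the scene, which is the motivation for the differential readout.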
For metallic microbolometer materials, resistance RB has a linear dependence on temperature and can be expressed as:

R_B(t_{BIAS}) = R_0 \left(1 + \alpha\, \Delta T(t_{BIAS})\right)    (2)

where α is the temperature coefficient of resistance (TCR) of the detector, R0 is the nominal resistance, and ΔT(tBIAS) is the temperature change due to self-heating during pulse biasing, given by:

\Delta T(t_{BIAS}) = T - T_0 = \frac{P_{BIAS}(t_{BIAS})\, t_{BIAS}}{G\,\tau}    (3)

where τ is the thermal time constant. Under normal conditions, tBIAS << τ, and ΔT due to self-heating is independent of the thermal conductance. If IBIAS is the constant bias current applied to the microbolometer during the readout, the self-heating power can be expressed as:

P_{BIAS}(t_{BIAS}) = I_{BIAS}^2\, R_B(t_{BIAS})    (4)

By solving the above equation using (2) and (3),

P_{BIAS}(t_{BIAS}) = \frac{I_{BIAS}^2\, R_0\, H}{H - I_{BIAS}^2\, R_0\, \alpha\, t_{BIAS}}    (5)

When the FPA is exposed to incident radiation, the difference in radiant flux incoming to the microbolometer can be estimated as

\Delta\phi_{IR} = \frac{A_b\, \Delta T_{scene}}{4F^2} \left(\frac{dP}{dT}\right)_{300K,\Delta\lambda}    (6)

where ΔTscene is the difference between target and ambient temperatures, and (dP/dT)300K,Δλ is the change in power per unit area with respect to temperature change radiated by a black body at ambient temperature in the wavelength interval 8 μm to 14 μm. The temperature change due to the absorbed infrared power is given by

\Delta T_{IR} = \frac{\Delta\phi_{IR}}{G}    (7)

The resistance change of a microbolometer detector due to a change in scene temperature can be evaluated as

\Delta R_{IR} = R_0\, \alpha\, \beta\, \phi_{\Delta\lambda}\, \varepsilon_{\Delta\lambda}\, \Delta T_{IR}    (8)

where β is the fill factor of the microbolometer, ΦΔλ is the transmission of the optics, and εΔλ is the absorption of the microbolometer membrane in the infrared region. When both incident and bias power are zero, the temperature of the microbolometer cools down according to:

T_{COOL}(t) = T_B\, e^{-t/\tau}    (9)

where TB is the temperature of the microbolometer at the end of pulse biasing. The infrared system parameters mentioned in Table 1 are taken from [5].

Table 1. Thermal parameters of the microbolometer.
Parameter | Value
Nominal Resistance, R0 (kΩ) | 100
Pulse Duration, tBIAS (μs) | 6
Bias Current, IBIAS (μA) | 20
Ambient Temperature, T0 (K) | 300
Thermal Time Constant, τ (ms) | 11.7
Temperature Coefficient of Resistance, α (%/K) | -2.6
Thermal Conductance, G (W/K) | 3.7e-8
Thermal Capacitance, H (J/K) | 4.34e-10
Optics F/Number | 1
Area of Microbolometer Pixel, Ab (m2) | 6.25e-10
Fill Factor of Microbolometer, β (%) | 62
Transmission of Infrared Optics, ΦΔλ (%) | 98
Absorption of Microbolometer Membrane, εΔλ (%) | 92
Temperature Contrast, (dP/dT)300K,Δλ (W K-1 m-2) | 2.624
FPA Size | 128 × 92
Frame Rate (frames per second) | 10

4. ROIC Simulator

Fig. 1 demonstrates the flowchart of the ROIC simulator. Temperature mapping of a thermal image is performed at the beginning. For each pixel, ΔTscene is calculated based on the target temperature and ambient temperature. An infrared radiation model evaluates the difference in incoming flux ΔΦIR, the change in bolometer temperature ΔTIR, and the resistance change in the bolometer due to the absorbed incident power ΔRIR.

Fig. 1. ROIC Simulator Flowchart.

The total resistance of each microbolometer is then

R_{TOTAL} = R_0 + R_{PV} + R_{IR}    (10)
It evaluates self-heating power, temperature drift and change in resistance due to pulse biasing. A 14-bit ADC is used to convert the signal to digital representation. The blind microbolometer exhibits the same thermal characteristics as that of a normal micro- 254 Haider et al.: Pair-Wise Serial ROIC for Uncooled Microbolometer Array Table 2. Pulse sequence for pair-wise serial ROIC architecture, where BD is microbolometer detector. Pulse 1 2 : (X/2) (X/2) + 1 : (X-1) Selected Microbolometer Detector (BD) Pair Blind (1,i) and BD(2,i) BD (3,i) and BD(4,i) : i = 1,2,3,…Y, BD (X-1,i) and BD (X, i) BD (2,i)and BD (3,i) : BD (X-2,i) and BD (X-1,i) The serial architecture is time-multiplexed; thus, only one pair of microbolometers is read during a single pulse duration. Thus, the minimum time to perform the readout of a single frame of a 127x92 pixel focal plane array is approximately 100ms for a TBIAS of 6μs, and approximately 150ms for a TBIAS of 10μs. This constraint limits the thermal imager to a maximum frame rate of 10 frames per second and 6 frames per second, respectively. The following are the calculations of the start time of the pulse for the microbolometer in the tenth column second row, i.e. microbolometer (2, 10): TWAIT = 2μs, TBIAS = 6μs, TCOLUMN = (TWAIT + TBIAS) * (X – 1) bolometer but remains unaffected by incident radiation, and thus, serves as a reference point. Each microbolometer, either blind or normal, is connected with a current source through a p-channel metal-oxide-semiconductor (PMOS) switch. The gate of the PMOS switch is controlled through the digital circuitry that generates the pulse sequence based on the serial readout topology. Absolute values of each pixel are then calculated based on the reference blind microbolometer after all values are converted to the digital domain. Fixed pattern noise correction is achieved by performing the complete readout under the dark condition. This measurement takes place when the focal plane array is not exposed to any incident power, and it is considered the reference value for each pixel. Later, every pixel reading has to be adjusted based on the pixel’s reference point taken during the dark condition. where X is the number of rows. It means, in order to perform the readout of a complete column, X-1 pulses are required. For the first pulse for microbolometer (2, 10): 5. Pair-wise Serial Readout Architecture Before the second pulse for the same microbolometer arrives, the cooling time of a microbolometer is evaluated as follows: Pair-wise serial readout architecture selects and bias one pair of microbolometers at a time, and the readout is performed differentially. A pair of microbolometers is biased twice during a frame rate, and the reading is performed differentially. Once the readout of the selected pair is finished, bias current source is switched to the next pair of microbolometers. The recently biased detector pair is left to cool off until the next pulse (for the next reading within the same frame rate) is applied. A blind microbolometer is biased once in a column with the first normal microbolometer, and it exhibits the same temperature drift due to self-heating as normal microbolometers, which can also be drastically minimized using a differential approach. The normal microbolometers are read twice with each adjacent neighbor, except for the last normal microbolometer. Table 2 shows the pulse sequence based on pairwise serial configuration, where X is the number of rows and Y is the number of columns. 
Consider a case for the serial readout architecture where:

T_WAIT = 2 μs, T_BIAS = 6 μs
T_COLUMN = (T_WAIT + T_BIAS) × (X − 1) = (2 μs + 6 μs) × (128 − 1) = 1.016 ms
T_FRAME = T_COLUMN × Y = 1.016 ms × 92 = 93.472 ms
T_PULSE1,START = T_COLUMN × 9 + T_WAIT = 9.146 ms
T_PULSE1,END = T_PULSE1,START + T_BIAS = 9.146 ms + 6 μs = 9.152 ms

Similarly, the time of the second pulse for microbolometer (2, 10) is calculated as follows:

T_PULSE2,START = T_PULSE1,END + ((X/2) − 1)·T_BIAS + (X/2)·T_WAIT = 9.152 ms + 63·T_BIAS + 64·T_WAIT = 9.658 ms
T_PULSE2,END = T_PULSE2,START + T_BIAS = 9.658 ms + 6 μs = 9.664 ms
T_COOL1 = T_PULSE2,START − T_PULSE1,END = 9.658 ms − 9.152 ms = 506 μs

After the second pulse, the cooling time of the microbolometer is evaluated as

T_COOL2 = T_FRAME − T_PULSE2,END = 100 ms − 9.658 ms = 90.142 ms

6. Simulation Results

Self-heating of a microbolometer causes the resistance of the microbolometer to drop due to its negative TCR. The resistance drop due to self-heating is of higher magnitude than the resistance drop due to incident infrared radiation, and thus imposes a strict requirement on the dynamic range of the ROIC. In order to relax this requirement, the microbolometers are pulse-biased during the readout time, and the voltage drop due to self-heating is minimized. Electrical biasing generates Joule heating and causes a change in the resistance of the microbolometer detector.
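The timing arithmetic above is compact enough to check in a few lines of script; the sketch below (ours, with our own variable names, not the authors' code) reproduces the worked example for microbolometer (2, 10), i.e. a detector in the first biased pair of the tenth column.

```python
# Sketch (not the authors' code): pulse-timing arithmetic of the pair-wise serial
# readout for a detector in the first biased pair of a column, as in the worked
# example for microbolometer (2, 10) above. All times are in seconds.
T_WAIT = 2e-6        # wait time before each bias pulse
T_BIAS = 6e-6        # bias pulse duration
X, Y = 128, 92       # rows (including the blind row) and columns of the FPA

def first_pair_pulses(col):
    t_column = (T_WAIT + T_BIAS) * (X - 1)        # readout time of one column
    t_frame = t_column * Y                        # active readout time of a frame
    p1_start = t_column * (col - 1) + T_WAIT
    p1_end = p1_start + T_BIAS
    p2_start = p1_end + (X // 2 - 1) * T_BIAS + (X // 2) * T_WAIT
    p2_end = p2_start + T_BIAS
    return t_column, t_frame, p1_start, p1_end, p2_start, p2_end

t_col, t_frame, p1s, p1e, p2s, p2e = first_pair_pulses(col=10)
print(f"T_COLUMN = {t_col*1e3:.3f} ms, T_FRAME = {t_frame*1e3:.3f} ms")
print(f"pulse 1: {p1s*1e3:.3f}-{p1e*1e3:.3f} ms   pulse 2: {p2s*1e3:.3f}-{p2e*1e3:.3f} ms")
print(f"T_COOL1 = {(p2s - p1e)*1e6:.0f} us")
```

Running it gives T_COLUMN = 1.016 ms, T_FRAME = 93.472 ms, the two pulses at 9.146 ms and 9.658 ms, and T_COOL1 = 506 μs, matching the values derived above.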
For an FPA of 128x92 pixels, and by using the proposed pulse sequence, each pair of microbolometers can be selected for a pulse duration of 6μs with 10 frames per second. Fig. 3 shows the temperature variation of microbolometer pixel (2, 1) due to self-heating when given two pulses, one at t = 2μs and the second at t = 82μs. From t = 8μs to t = 82μs and from t = 88µs to t = tframe, both incident and bias power are zero, and hence, the bolometer cools down, as Fig. 3. Variation in microbolometer temperature with and without IR. Fig. 4. Voltage variation of pixel (1,1) and pixel (2,1), and differential readout. (a) (b) Fig. 5. (a) Input thermal image 127 × 92, and (b) ROIC output. per (9), where tframe is the total time for the readout. Fig 4 shows the voltage variation of reference microbolometer (1, 1) and normal microbolometers (2, 1), (3, 1) and (4, 1) during the readout. Few things can be concluded from the figure. Only one pair of microbolometers is biased at a time, and the voltage variation during the readout period is different for each microbolometer, even when there is no input power. This is due to the fact that during fabrication, microbolometers suffer from process variation due to process immaturity. Individual 256 Haider et al.: Pair-Wise Serial ROIC for Uncooled Microbolometer Array slopes of reference for microbolometer (1, 1) and normal microbolometer (2, 1) are 5.348mV/μs and 4.2mV/μs, respectively. Both the microbolometers have different slopes, because they have different nominal resistances before the readout. Fig. 5(a) shows the input thermal image to an ROIC simulator [17], which is 320 × 240 but cropped to 127 × 92 for simulation purposes. Fig. 5(b) shows the output image of the simulator using the proposed architecture, mapping a temperature change of about 12°C. Fixed pattern noise can be seen in the output image. 7. Conclusion The ROIC model presented in this paper uses a pulse bias current scheme to reduce the effect of self-heating. Simulation results show that the temperature drift due to self-heating is compensated for by using differential readout, but it is not completely eliminated due to the consideration of a very high resistance discrepancy of ±10% due to process variation. A pulse sequence to each pair of microbolometers is provided, such that they both fall under the same self-heating point along the selfheating trend line, i.e. the pair picked is such that both are biased the same number of times. The proposed architecture for the ROIC requires one differential amplifier and one 14-bit ADC in order to reduce the dynamic range requirement, power dissipation and area at the expense of a longer readout time for a large focal plane array. Acknowledgement This work was supported by the NSTIP strategic technologies program number 12-ELE2936-02 in the Kingdom of Saudi Arabia. References [1] M. Perenzoni, D. Mosconi, and D. Stoppa, "A 160× 120-pixel uncooled IR-FPA readout integrated circuit with on-chip non-uniformity compensation," in ESSCIRC, 2010 Proceedings of the, 2010, pp. 122125. Article (CrossRef Link) [2] B. F. Andresen, B. Mesgarzadeh, M. R. Sadeghifar, P. Fredriksson, C. Jansson, F. Niklaus, A. Alvandpour, G. F. Fulop, and P. R. Norton, "A low-noise readout circuit in 0.35-μm CMOS for low-cost uncooled FPA infrared network camera," Infrared Technology and Applications XXXV, vol. 7298, pp. 72982F-72982F-8, 2009. Article (CrossRef Link) [3] P. Neuzil and T. Mei, "A Method of Suppressing Self-Heating Signal of Bolometers," IEEE Sensors Journal, vol. 
4, pp. 207-210, 2004. Article (CrossRef Link) [4] S. J. Hwang, H. H. Shin, and M. Y. Sung, "High performance read-out IC design for IR image sensor applications," Analog Integrated Circuits and Signal Processing, vol. 64, pp. 147-152, 2009. Article (CrossRef Link) [5] D. Svärd, C. Jansson, and A. Alvandpour, "A readout IC for an uncooled microbolometer infrared FPA with on-chip self-heating compensation in 0.35 μm CMOS," Analog Integrated Circuits and Signal Processing, vol. 77, pp. 29-44, 2013. Article (CrossRef Link) [6] X.Gu, G.Karunasiri, J.Yu, G.Chen, U.Sridhar, and W.J.Zeng, "On-chip compensation of self-heating effects in microbolometer infrared detecto arrays," Sensors and Actuators A: Physical, vol. 69, pp. 92-96, 1998. Article (CrossRef Link) [7] P. J. Thomas, A. Savchenko, P. M. Sinclair, P. Goldman, R. I. Hornsey, C. S. Hong, and T. D. Pope, "Offset and gain compensation in an integrated bolometer array," 1999, pp. 826-836. Article (CrossRef Link) [8] C. H. Hwang, C. B. Kim, Y. S. Lee, and H. C. Lee, "Pixelwise readout circuit with current mirroring injection for microbolometer FPAs," Electronics Letters, vol. 44, pp. 732-733, 2008. Article (CrossRef Link) [9] S. I. Haider, S. Majzoub, M. Alturaigi, and M. AbdelRahman, "Modeling and Simulation of a Pair-Wise Serial ROIC for Uncooled Microbolometer Array," in ICEIC 2015, Singapore, 2015. [10] D. Jakonis, C. Svensson, and C. Jansson, "Readout architectures for uncooled IR detector arrays," Sensors and Actuators A: Physical, vol. 84, pp. 220229, 2000. Article (CrossRef Link) [11] S. I. Haider, S. Majzoub, M. Alturaigi, and M. AbdelRahman, "Column-Wise ROIC Design for Uncooled Microbolometer Array," presented at the International Conference on Information and Communication Technology Research (ICTRC), Abu-Dhabi, 2015. Article (CrossRef Link) [12] R. K. Bhan, R. S. Saxena, C. R. Jalwania, and S. K. Lomash, "Uncooled Infrared Microbolometer Arrays and their Characterisation Techniques," Defence Science Journal, vol. Vol. 59, pp. 580-589, 2009. [13] R. T. R. Kumar, B. Karunagaran, D. Mangalaraj, S. K. Narayandass, P. Manoravi, M. Joseph, V. Gopal, R. K. Madaria, and J. P. Singh, "Determination of Thermal Parameters of Vanadium oxide Uncooled Microbolometer Infrared Detector," International Journal of Infrared and Millimeter Waves, vol. 24, pp. 327-334, 2003. Article (CrossRef Link) [14] J.-C. Chiao, F. Niklaus, C. Vieider, H. Jakobsen, X. Chen, Z. Zhou, and X. Li, "MEMS-based uncooled infrared bolometer arrays: a review," MEMS/MOEMS Technologies and Applications III, vol. 6836, pp. 68360D-68360D-15, 2007. Article (CrossRef Link) [15] S. J. Hwang, A. Shin, H. H. Shin, and M. Y. Sung, "A CMOS Readout IC Design for Uncooled Infrared Bolometer Image Sensor Application," presented at the IEEE ISIE, 2006. Article (CrossRef Link) [16] S. J. Hwang, H. H. Shin, and M. Y. Sung, "A New CMOS Read-out IC for Uncooled Microbolometer Infrared Image Sensor," International Journal of Infrared and Millimeter Waves, vol. 29, pp. 953-965, 2008. Article (CrossRef Link) IEIE Transactions on Smart Processing and Computing, vol. 4, no. 4, August 2015 [17] SPI infrared. Available: Article (CrossRef Link) Syed Irtaza Haider received his BE degree in Electronics Engineering from National University of Sciences and Technology (NUST), Pakistan, in 2010 and MS degree in Electronics Engineering from King Saud University (KSU), Saudi Arabia, in 2015. Currently, he is a researcher at Embedded Computing and Signal Processing Lab (ECASP) at the King Saud University. 
His research interest includes signal processing, mixed signal design and image processing. He is a student member IEEE. Sohaib Majzoub completed his BE in Electrical Engineering, Computer Section at Beirut Arab University 2000, Beirut Lebanon, and his ME degree from American University of Beirut, 2003 Beirut Lebanon. Then he worked for one year at the Processor Architecture Lab at the Swiss Federal Institute of Technology, Lausanne Switzerland. In 2010, he finished his PhD working at the System-on-Chip research Lab, University of British Columbia, Canada. He worked for two years as assistant professor at American University in Dubai, Dubai, UAE. He then worked for three years starting in 2012 at King Saud University, Riyadh, KSA, as a faculty in the electrical engineering department. In September 2015, he joined the Electrical and Computer Department at the University of Sharjah, UAE. His research field is delay and power modeling, analysis, and design at the system level. He is an IEEE member. Copyrights © 2015 The Institute of Electronics and Information Engineers 257 Dr. Muhammad Turaigi is a professor of Electronics at the EE department, College of Engineering King Saud University. He got his PhD from the department of Electrical Engineering Syracuse University, Syracuse NY, USA in the year 1983. His PhD thesis is in the topic of parallel processing. In 1980 he got his MSc from the same school. His B.Sc. is from King Saud University (formerly Riyadh University) Riyadh Saudi Arabia. In Nov. 1983 He joined the department of Electrical Engineering, King Saud University. He teaches the courses in digital and analog electronic circuits. He has more than 40 published papers in refereed journals and international conferences. His research interest is in the field of parallel processing, parallel computations, and electronic circuits and instrumentation. He was the director of the University Computer Center from July/1987 until Nov/1991. He was a member of the University Scientific Council from July/1999 until June/2003. Dr. Mohamed Ramy is an Assistant Professor in the Electrical Engineering Department, King Saud University, Riyadh, Saudi Arabia. He was previously associated with Prince Sultan Advanced Technologies Research Institute (PSATRI) as the Director of the Electro-Optics Laboratory (EOL). He has a PhD degree in Electrical Engineering from UCF (University of Central Florida), Orlando, USA. He has over 10 years of research and development experience in infrared/millimeter wave detectors and focal plane arrays. He has worked on the design and development of infrared, millimeter wave sensors, focal arrays and camera systems. IEIE Transactions on Smart Processing and Computing, vol. 4, no. 4, August 2015 http://dx.doi.org/10.5573/IEIESPC.2015.4.4.258 258 IEIE Transactions on Smart Processing and Computing Implementation of an LFM-FSK Transceiver for Automotive Radar HyunGi Yoo1, MyoungYeol Park2, YoungSu Kim2, SangChul Ahn2 and Franklin Bien1* 1 School of Electrical and Computer Engineering, Ulsan National Institute of Science and Technology, Banyeon-ri, Eonyang-eup, Ulju-gun, Ulsan, Korea {bien}@unist.ac.kr 2 Comotech corp. 
R&D center, 908-1 Apt-bldg, Hyomun dong, Buk-gu, Ulsan, Korea mypark@comotech.com * Corresponding Author: Franklin Bien Received June 20, 2015; Revised July 15, 2015; Accepted August 24, 2015; Published August 31, 2015 * Short Paper Abstract: The first 77 GHz transceiver that applies a heterodyne structure–based linear frequency modulation–frequency shift keying (LFM-FSK) front-end module (FEM) is presented. An LFMFSK waveform generator is proposed for the transceiver design to avoid ghost target detection in a multi-target environment. This FEM consists of three parts: a frequency synthesizer, a 77 GHz up/down converter, and a baseband block. The purpose of the FEM is to make an appropriate beat frequency, which will be the key to solving problems in the digital signal processor (DSP). This paper mainly focuses on the most challenging tasks, including generating and conveying the correct transmission waveform in the 77 GHz frequency band to the DSP. A synthesizer test confirmed that the developed module for the signal generator of the LFM-FSK can produce an adequate transmission signal. Additionally, a loop back test confirmed that the output frequency of this module works well. This development will contribute to future progress in integrating a radar module for multi-target detection. By using the LFM-FSK waveform method, this radar transceiver is expected to provide multi-target detection, in contrast to the existing method. Keywords: 77-GHz radar module, Front-end module (FEM), LFM-FSK, Multi-target detection, Millimeter wave transceiver, Patch array antenna, RF, Homodyne structure 1. Introduction An intelligent transportation system (ITS) is a representative fusion technology that brings vibrant change to the car itself by using information technology, such as smart cruise control and blind spot detection (BSD) [1-3]. Over the years, various attempts have been made to develop and adapt techniques to provide convenience [4-6]. Considering the potential for treacherous driving conditions, automotive assistance systems require a reliable tracking system. Thus, radar, which has been widely used in the military and aviation fields, has emerged as an alternative approach [7-9]. Today, car companies and suppliers are already working to develop the nextgeneration of long-range radar (LRR) at 77 GHz, which will improve the maximum and minimum ranges, provide a wider field of view, as well as improved range, angular resolution and accuracy, self-alignment, and blockage detection capability. The most commonly used LRR is the frequency-modulated continuous wave (FMCW) radar because of its high performance-to-cost (P/C) ratio compared to other radar modulation methods. However, this FMCW radar waveform has some serious limitations in multiple target situations due to the technically complicated association step. On the other hand, linear frequency modulation–frequency shift keying (LFM-FSK), which combines a frequency shift–keying method with a linear frequency modulation waveform, is seen as a new alternative in this situation because of its advanced structure [10, 11]. This paper presents a transceiver module for a 77 GHz complementary metal-oxide semiconductor long-range automotive radar module, which was developed to provide accurate information for consumers, even in multi-target situations, by using the LFM-FSK radar method. 
Section 2 identifies the overall architecture of this radar module, and Section 3 gives a system description, which is followed by a presentation of the measured results in Section 4. The conclusions are pre- 259 IEIE Transactions on Smart Processing and Computing, vol. 4, no. 4, August 2015 sented in Section 5, the final section of this paper. 2. Overall Architecture 2.1 LFM-FSK In the automotive radar market, the classic radar module has some limitations due to its structure. The most common limitation is difficulty detecting multiple targets in real traffic environments. There have been a variety of solutions by developing an advanced module [12-14]. In this paper, the LFM-FSK radar front-end module (FEM) is proposed for a transceiver design to avoid ghost target detection in a multi-target environment. The LFM-FSK waveform is a new waveform designed for automotive applications, based on continuous wave (CW) transmit signals, which leads to an extremely short measurement time. The basic idea is a combination of LFM and FSK CW waveforms in an intertwined technique. Unambiguous range and velocity measurement with high resolution and accuracy can be required in this case, even in multi-target situations. Fig. 1. 77 GHz LFM-FSK radar module. 2.2 Overall Architecture The main purpose of the radar module is gathering and supplying correct results to a user. As demonstrated in Fig. 1, the radar module commonly consists of three parts: an FEM, an antenna, and a digital signal processing (DSP) module. Among these, constructing an efficient radar FEM with the correct waveform generator was commonly considered to be the most challenging task. This is an FMCW signal waveform to which the classic frequency shift keying (FSK) modulation method has been applied. This module consists of the radar FEM, microstrip patch array antenna, up/down converter, baseband block, and DSP module. Every part is covered in this paper, except the DSP and antenna. To implement the LFM-FSK radar FEM, a heterodyne structure was considered, which is one of the most stable ways to generate a signal waveform using radar technology. In this structure, the signal was generated by the frequency synthesizer block though the phase locked loop (PLL) block with a crystal oscillator. The 77 GHz up/down converter down-converted the received signal, which was up-converted from the baseband block to 77 GHz in the transmission module before reaching the antenna array with an up-and-down mixer. The baseband block located after the converter handles the signal filtering process to efficiently supply information to the next module, the DSP. The hybrid coupler in the up/down converter is employed to supply this beat frequency by simultaneously distributing the generated signal to the transmit antenna and baseband block. In the LFM-FSK radar waveform method, the beat frequency needs to be calculated using special equations. As demonstrated in Fig. 2, there are key concepts that allow this module to be implemented successfully. First, the frequency range of the synthesizer is 76.5 to 76.6 GHz, with enough chirp time and delay time. Chirp time is the overall elapsed time in one period. The delay time men- (a) With fshift / fstep (b) With chirp / delay time Fig. 2. LFM-FSK waveform. tioned here is the float time during which the DSP implements the algorithm, which gives the information to the user. This module also has the goal of generating a signal with enough fstep (the difference value between two transmitting signals). 
A crystal oscillator was employed in the synthesizer block to generate the signal. The receiver translates the channel of interest directly from 77 GHz to the baseband block in a single stage. This structure needs 260 Yoo et al.: Implementation of an LFM-FSK Transceiver for Automotive Radar Fig. 4. Synthesizer of 77 GHz FEM. Fig. 3. Block diagram of the synthesizer. less hardware compared to a heterodyne structure, which is currently the most widely used structure in wireless transceivers. In this structure, the integrated circuit’s low noise amplifier (LNA) does not need to match 50 Ω because there is no image reject filter between the LNA and mixer. Another advantage of this structure is the amplification at the baseband block, which results in power savings. 3. System Description 3.1 Front-End Module (FEM) The implemented radar module includes the synthesizer, 77 GHz up/down converter, and baseband block. Fig. 3 shows the implemented block synthesizer. The PLL in the radar FEM generates a baseband signal at 3 GHz. The 77 GHz up/down converter processes the generated signal passed from the frequency synthesizer block. In the transmit stage, this block converts the signal to the 77 GHz frequency band to radiate the RF signal, while the reflected signal from the target is down-converted in the receiver stage. 3.2 Synthesizer Block As demonstrated in Fig. 3, this block is divided into the synthesizer, which generates the LFM-FSK waveform, and the PLL, which generates the local oscillator (LO) frequency, along with the power divider (PDV), which sends the phase information to the voltage-controlled oscillator (VCO) to generate the correct signal in the PLL. This module requires the signal to have an LFM-FSK shape for multi-target detection. In the synthesizer, the VCO generates a 3 GHz signal. The PLL creates a specialized shape waveform according to a command from the PIC block, which generates the programming code from the J1 outside the synthesizer block. The lock detector controls the generated signal through J2, which verifies the operation of the PLL. The conventional PLL is a chip that is used for the purpose of locking a fixed frequency. On the other hand, this module uses a fractional synthesizer that locks the Fig. 5. Block diagram of a 77 GHz up/down converter. PLL by controlling the VCO. This module includes the VCO block in the synthesizer block to generate a stable waveform, while the primary design concept is ensuring immunity to noise when each frequency step is changed. Therefore, this synthesizer generates a variety of modulated waveforms, such as single, sawtooth, and triangular ramp. In particular, this structure can generate a waveform that has regular intervals and time delay. By using this, we can make a specialized ramp that is similar to the LFM-FSK waveform in transmission output. Here, the output frequency of the LO is 3 GHz with the VCO output frequency shift. Thus, we can generate a very similar LFM-FSK signal in the synthesizer block by using this PLL. The PDV element controls the VCO by comparing the phases of the two signals, which are the input signal and reference signal in the PLL. To operate this synthesizer block, this module uses a 5 v/2000 mA bias source, which comes from outside the module. Fig. 4 shows the implemented synthesizer of a 77 GHz FEM. This block can minimize the interference from another block’s signal due to the metal wall. 3.3 77 GHz Up/down Converter Block & Baseband Block As demonstrated in Fig. 
5, the up/down converter consists of a transmitter and receiver pair, and the signal is raised to 77 GHz through the LO. After mixing with the carrier frequency in the receiver, there is a phase IEIE Transactions on Smart Processing and Computing, vol. 4, no. 4, August 2015 261 Fig. 6. Block diagram of the baseband block. difference that leads to an ambiguous measurement for distance and relative speed. The synthesizer signal is conveyed for transmission through the amplifier, a band pass filter (BPF), and an up-mixer. The LO signal is applied where the 3 GHz frequency band can generate a 70 GHz signal through the 4 GHz and 6 GHz frequency bands. The RF signal that is sent through the up-mixer is elevated to 76.5 GHz using a drive amplifier (DRA). The transmitter outputs this signal through the antenna. The received signal is amplified by the LNA and converted to an intermediate frequency (IF) signal though the downmixer. This signal is dropped to a baseband signal through the I/Q mixer. To split the generated signal and make the beat frequency, which represents the phase difference between the transmitted and received signals, the mixer employs a hybrid coupler. This module uses a mixer for the high-frequency E-band. In this system, the LO is used to generate a signal to convert the signal of interest to a different frequency. The receiver converts the received signal frequency to the IF block through the mixer. Fig. 6 shows the implemented block of the baseband block. As shown in this figure, the baseband block downconverts the high-frequency signal to the baseband frequency for accurate signal processing. It needs an automatic gain control (AGC) block because of the level difference between the received signals. It can be substituted for an op-amp, which can regulate the signal output level by controlling the gain slope of a high pass filter (HPF). By doing this, it can reduce the level dynamic range of the received signal according to the target distance. This module uses an op-amp that was used to implement the filter and an AGC function to reduce the output change in accordance with the power of the received signal and generate a constant output level. The module uses a band pass filter that consists of an HPF and a low-pass filter (LPF) pair, to reduce the noise from the adjacent frequency band. To maintain constant output power for the signal, this module employs the AGC in the system instead of a general amplifier. If the target is far from the transceiver, there might be low signal power compared to when the target is close to the system. The gain of the processing amplifier can be adjusted by using the AGC elements to maintain a constant signal strength and send the correct information to the DSP though the J3 and J4 ports. If the mixed received signal is digitized and subjected to Fourier transformation within a single period in the DSP, the ambiguities for distance and speed can be resolved by combining the measurement results in accordance with the special equations. The purpose of the radar FEM is to generate an appropriate beat frequency, which will be the key to solving the problem in the DSP. Fig. 7. 77GHz up/down converter & baseband block. Fig. 7 shows the implemented 77 GHz up/down converter and baseband block. 4. Measurement Results 4.1 Block Measurement Result 4.1.1 Synthesizer Test The most important part in this LFM-FSK radar module implementation is the synthesizer. As shown in Fig. 8, the synthesizer was measured by using this PCB. 
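As background for the measurements that follow, here is a minimal, self-contained sketch of the DSP-side step mentioned at the end of Section 3.3: the mixed (beat) signal is digitized, Fourier-transformed within one period, and the spectral peak is read off as the beat frequency. The sampling rate, tone frequency, and noise level below are made-up illustration values, not parameters of the implemented module.

import numpy as np

def beat_frequency(samples, fs):
    """Estimate the dominant beat frequency of a real-valued baseband record."""
    spectrum = np.abs(np.fft.rfft(samples * np.hanning(samples.size)))
    spectrum[0] = 0.0                        # ignore the DC bin
    peak_bin = int(np.argmax(spectrum))
    return peak_bin * fs / samples.size      # bin index -> frequency in Hz

# Synthetic example: a 12 kHz beat tone sampled at 1 MHz with additive noise.
fs = 1.0e6
t = np.arange(20000) / fs
rng = np.random.default_rng(0)
beat = np.cos(2 * np.pi * 12e3 * t) + 0.1 * rng.standard_normal(t.size)

print(f"estimated beat frequency: {beat_frequency(beat, fs) / 1e3:.2f} kHz")

In the implemented module this step belongs to the DSP, which, as described above, combines such measurements in accordance with the special equations to resolve the distance and speed ambiguities.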
Before frequency multiplication, the test frequency of the synthesizer was 3 GHz. In the synthesizer test, the Vctrl signal waveform that controls the VCO was observed, confirming that the LFM-FSK waveform appears correctly at the synthesizer output. Fig. 9 represents a transmitted signal frequency range of 76.45 to 76.55 GHz. According to the spectrum analyzer measurement in Fig. 10, the bandwidth of the transmitter is 100 MHz. Finally, according to Fig. 11, the chirp time was around 6 ms, which means the measurement time of this block is fast enough.

Fig. 8. Synthesizer test: (a) test block diagram, (b) synthesizer test.
Fig. 9. Vctrl signal waveform of the synthesizer.
Fig. 10. RF sweep of the synthesizer.
Fig. 11. Chirp time of the synthesizer.

4.1.2 Loop Back Test

Fig. 12 shows the loop back test conducted for simulation. The beat frequency is the difference between the transmitted signal and the reflected echo signal. The cable used in the transmission line represented a time delay in the actual driving environment. By changing the cable length, the reflected signal was measured with a variety of time delays. Each cable length represented a different driving situation; a cable length of 35 m represented a target that was far away from the observer, compared to the 1.4 m length. The loop back test created an artificial delay to check the beat frequency using cable length variation. An attenuator (ATTN) was employed to reduce the tested output power to create conditions similar to actual situations.

Fig. 12. 77 GHz LFM-FSK radar module.

Fig. 13(a) shows the results of the loop back test when using a 1.4 m cable length. Unlike this result, which assumes that the distance between the target and observer is small, Fig. 13(b) shows that the beat frequency was lower with the longer cable length. This is an important finding: the shorter the distance between the target and the observer, the more reliably the module hands over the beat frequency. Based on this simulation, we can conclude that the implemented radar FEM can operate well and deliver the correct signal information, even in a real traffic environment. A long cable represented a target that was far away from the transceiver module, and a short cable represented the opposite situation.

Fig. 13. Loop back test result: (a) close situation, 1.4 m cable; (b) far away situation, 35 m cable.
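As a rough illustration of how the two cable lengths map onto driving scenarios, the short sketch below converts each cable length into the delay it introduces and into the equivalent free-space target range R = c*tau/2. The cable velocity factor is an assumed, illustrative value (it is not given in the paper), and the helper names are hypothetical.

C = 3.0e8               # speed of light in free space, m/s
VELOCITY_FACTOR = 0.7   # assumed propagation velocity factor of the test cable

def cable_delay(length_m: float) -> float:
    """Delay introduced by one pass through the loop-back cable, in seconds."""
    return length_m / (VELOCITY_FACTOR * C)

def equivalent_range(delay_s: float) -> float:
    """Free-space target range whose round trip produces the same delay: R = c*tau/2."""
    return C * delay_s / 2.0

for length in (1.4, 35.0):  # the two cable lengths used in the loop back test
    tau = cable_delay(length)
    print(f"{length:5.1f} m cable -> delay {tau * 1e9:6.1f} ns "
          f"(equivalent target range about {equivalent_range(tau):4.1f} m)")

Under these assumptions, the 35 m cable emulates a target roughly 25 m away, while the 1.4 m cable corresponds to a target about 1 m away, consistent with their roles as the far-away and close cases.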
4.2 System Measurement Result

This paper mainly focuses on the most challenging tasks: generating and conveying the correct transmission waveform in the 77 GHz frequency band to the DSP. The 77 GHz radar FEM was designed using the LFM-FSK method, unlike the conventional FMCW radar. This implementation emphasizes generating the appropriate waveform at a high frequency. The synthesizer test of the developed module confirmed that the LFM-FSK signal generator could produce an adequate signal to transmit. Additionally, the loop back test confirmed that the output frequency of this module works well. Using these methods, the performance of the radar module could be verified in a simulation. The measurement results for the LFM-FSK radar transceiver are summarized in Table 1.

Table 1. Performance of the LFM-FSK radar transceiver.
Module | Parameter | Value
Transmitter | Tx (RF) | 76.5 ± 0.05 GHz
Transmitter | Tx (IF) | 2.97 ± 0.05 GHz
Transmitter | Bandwidth | 100 MHz
Transmitter | Output | +10 dBm
Transmitter | LO | 73.53 GHz
Receiver | Dynamic range | -23 to -112 dBm
Receiver | Conversion gain | 88 dB
Receiver | Rx input P1dB | -22 dBm
Receiver | Noise figure | 10 dB

5. Conclusion

This paper presented the first 77 GHz transceiver that applies an LFM-FSK FEM with a frequency synthesizer block, an up/down converter block, and a baseband block. The performance of the implemented module was experimentally evaluated twice, using a synthesizer test and a loop back test. The results of the experiment demonstrated the advantages of the proposed system. Using this implementation for an automotive radar module will promote its commercialization for multi-target detection.

References
[1] C.T. Chen, Y.S. Chen: 'Real-time approaching vehicle detection in blind-spot area', Proceedings of the 12th International IEEE Conference on Intelligent Transportation Systems, 2009.
[2] Article (CrossRef Link)
[3] Sawant, H., Jindong Tan, Qingyan Yang, Qizhi Wang: 'Using Bluetooth and sensor networks for intelligent transportation systems', Intelligent Transportation Systems, 2004. Proceedings. The 7th International IEEE Conference on, 2004, pp. 767-772.
[4] Article (CrossRef Link)
[5] Papadimitratos, P., La Fortelle, A., Evenssen, K., Brignolo, R., Cosenza, S.: 'Vehicular communication systems: Enabling technologies, applications, and future outlook on intelligent transportation', Communications Magazine, IEEE, 2009, 47, (11), pp. 84-95.
[6] Article (CrossRef Link)
[7] Kan Zheng, Fei Liu, Qiang Zheng, Wei Xiang, Wenbo Wang: 'A Graph-Based Cooperative Scheduling Scheme for Vehicular Networks', Vehicular Technology, IEEE Transactions on, 2013, 62, (4), pp. 1450-1458.
[8] Alam, N., Dempster, A.G.: 'Cooperative Positioning for Vehicular Networks: Facts and Future', Intelligent Transportation Systems, IEEE Transactions on, 2013, 14, (4), pp. 1708-1717.
[9] Punithan, M.X., Seung-Woo Seo: 'King's Graph-Based Neighbor-Vehicle Mapping Framework', Intelligent Transportation Systems, IEEE Transactions on, 2013, 14, (3), pp. 1313-1330.
[10] Brennan, P.V., Lok, L.B., Nicholls, K., Corr, H.: 'Phase-sensitive FMCW radar system for high-precision Antarctic ice shelf profile monitoring', Radar, Sonar & Navigation, IET, 2014, 8, (7), pp. 776-786.
[11] Article (CrossRef Link)
[12] M.-S. Lee, Y.-H. Kim: 'New data association method for automotive radar tracking', IEE Proceedings - Radar, Sonar and Navigation, 2001, 148, (5), pp. 297-301.
[13] Article (CrossRef Link)
[14] A. Polychronopoulos, A. Amditis, N. Floudas, H. Lind: 'Integrated object and road border tracking using 77 GHz automotive radars', IEE Proceedings - Radar, Sonar and Navigation, 2004, 151, (6), pp. 375-381.
[15] Article (CrossRef Link)
[16] Marc-Michael Meinecke, Hermann Rohling: 'Combination of LFMCW and FSK Modulation Principles for Automotive Radar Systems', German Radar Symposium GRS2000, 2000.
[17] Hermann Rohling, Christof Möller: 'Radar waveform for automotive radar systems and applications', Radar Conference 2008, IEEE, 2008, pp. 1-4.
[18] Bi Xin, Du Jinsong: 'A New Waveform for Range-Velocity Decoupling in Automotive Radar', 2010 2nd International Conference on Signal Processing Systems (ICSPS), 2010.
[19] M. Musa, S. Salous: 'Ambiguity elimination in HF FMCW radar systems', IEE Proceedings - Radar, Sonar and Navigation, 2012, 147, (4), pp. 182-188.
[20] Article (CrossRef Link)
[21] Eugin Hyun, Woojin Oh, Jong-Hun Lee: 'Multitarget detection algorithm for FMCW radar', Radar Conference (RADAR), 2012 IEEE, 2012, pp. 338-341.
HyunGi Yoo received his B.S degree in electronic Engineering from Korea Polytechnic University, Korea, in 2010. From 2013, he joined in M.S degree at Ulsan National Institute of Science and Technology (UNIST), Ulsan, Korea. His research interests are electronics for electric vehicles, control system for crack detection and analog/RF IC design for automotive radar technology. MyoungYeol Park received the B.S. degree and the M.S. degree from University of Ulsan, Korea, in 1998. Since 1999, he works for Comotech Corp, as a chief researcher. Comotech Corp. is a leading company developing the world best point-to-point ultra broad-bandwidth wireless link up to 1.25Gbps data rate by using millimeter-wave such as 60 GHz or 70/80 GHz frequency band. Also the company produces high performance components above 18 GHz Kband to 110GHz W-band including 77 GHz automotive radar front-end modules. Copyrights © 2015 The Institute of Electronics and Information Engineers YoungSu Kim received the Ph.D. degree in the School of Electrical and Computer Engineering at Ulsan National Institute of Science and Technology (UNIST) in 2014. He now works with Comotech Corp. as a Senior Field Application Manager working on Eband radiolinks and mmW transceiver. Back in 2004, he was with LG-Innotek as a Junior Research Engineer working on 77 GHz radar system and 10 GHz X-band military radars. His research interests include E-band radiolink, RF front-end module & devices in microwave and millimeter-wave frequency ranges for wireless communication systems. SangChul Ahn received the M.S. degree from University of Ulsan, Korea, in 2004. Since 2009, he works for Comotech Corp, as a Manager in Development Team. His research interests include millimeter-wave radio, radar systems and antenna design. Franklin Bien is currently an Associate professor in the School of Electrical and Computer Engineering at Ulsan National Institute of Science and Technology (UNIST), Ulsan, Korea, since March, 2009. Prior to joining UNIST, Dr. Bien was with Staccato Communications, San Diego, CA as a Senior IC Design Engineer working on analog/mixedsignal IC and RF front-end blocks for Ultra-Wideband (UWB) products such as Wireless-USB in 65nm CMOS technologies. Prior to working at Staccato, he was with Agilent Technologies and Quellan Inc., developing transceiver ICs for enterprise segments that improve the speed and reach of communications channels in consumer, broadcast, enterprise and computing markets. His research interests include analog/RF IC design for wireless communications, signal integrity improvement for 10+Gb/sec broadband communication applications, circuits for wireless power transfer technologies, and electronics for electric vehicles. IEIE Transactions on Smart Processing and Computing, vol. 4, no. 4, August 2015 http://dx.doi.org/10.5573/IEIESPC.2015.4.4.265 265 IEIE Transactions on Smart Processing and Computing A Novel Red Apple Detection Algorithm Based on AdaBoost Learning Donggi Kim, Hongchul Choi, Jaehoon Choi, Seong Joon Yoo and Dongil Han Department of Computer Engineering, Sejong University {kdg1016,maltessfox,s041735}@sju.ac.kr, {sjyoo,dihan}@sejong.ac.kr * Corresponding Author: Dongil Han Received June 20, 2015; Revised July 15, 2015; Accepted August 24, 2015; Published August 31, 2015 * Short Paper Abstract: This study proposes an algorithm for recognizing apple trees in images and detecting apples to measure the number of apples on the trees. 
The proposed algorithm explores whether there are apple trees or not based on the number of image block-unit edges, and then it detects apple areas. In order to extract colors appropriate for apple areas, the CIE L*a*b* color space is used. In order to extract apple characteristics strong against illumination changes, modified census transform (MCT) is used. Then, using the AdaBoost learning algorithm, characteristics data on the apples are learned and generated. With the generated data, the detection of apple areas is made. The proposed algorithm has a higher detection rate than existing pixel-based image processing algorithms and minimizes false detection. Keywords: Crop yield estimation, Image segmentation, Apple tree detection, Apple detection, Object detection 1. Introduction Generally, a crop disaster evaluation procedure entails a sample survey that measures the yield before and after a natural disaster through visual inspection and manual work to judge the size of the crops and the disaster damage. This consumes a lot of time and money, and it is possible to compromise fairness depending on the accuracy of the inspector. However, developments in image processing and machine vision technologies have been proposed as a solution to these problems. Most existing studies on fruit detection have determined the area with the pixel-based image processing method and counted the number of detected areas [1-3]. This study analyzes the shape of the apple tree to detect its existence and recognize the apple tree in the first stage. Then, apples on the trees are detected using learned apple data through AdaBoost Learning as an edge-based preprocess. Various detection errors occur when detecting apples on a tree. Examples of errors in apple detection include detecting other objects with similar colors to the fruit, errors due to reflection or the shade of the apples, and not detecting fruit hidden behind objects like leaves. This method extracts colors suitable to an apple area using the CIE L*a*b color space to minimize detection errors with colors similar to an apple. In addition, this study utilizes modified census transform (MCT), which reduces the influence from illumination changes by extracting structural information about the apple region. Also, in the post-processing stage, outliers are eliminated by using normal distribution characteristics, which finally reduce the fault detection area of the apple. 2. Related Work Existing studies that detected and measured the fruit area with an image processing technique generally utilized external or structural information of the pixel itself. They proposed a method that implemented labeling and boundary detection after removing the background of the input image and extracting the area of the fruit, finally counting and calculating the amount of fruit. Wang et al. [2] used the HSV color space to extract the pixel area of the red of the apple and labelled nearby pixels. Patel et al. [5] implemented a noise reduction algorithm on the extracted area based on color to detect the orange color Kim et al.: A Novel Red Apple Detection Algorithm Based on AdaBoost Learning 266 Fig. 1. Entire system block diagram. and judged circular forms as the fruit through boundary detection. Other studies have shown common detection errors including missing the fruit area owing to the light source, missing fruit hidden by other fruit or by a leaf, and counting as fruit items that have colors similar to the fruit. 3. 
The Proposed Scheme This paper proposes a system that dramatically reduces the fault detection rate while minimizing the effect from the existing light source by introducing AdaBoost Learning and MCT algorithms. The proposed apple detection system consists of apple tree recognition and apple area detection modules. The pre-process, or apple tree recognition module, analyzes the external shape of the tree on the input image based on the edge to judge the existence of the apple tree, and then goes to the apple area detection stage. The block diagram for the entire system is shown in Fig. 1. Fig. 2. Apple tree region extraction algorithm. 3.1 Apple Tree Recognition Most of existing apple detection systems generated false detection results in performing the detection process, even though the input image was not an apple tree. An unfocused camera image also generates the same results. This paper segments the input image into blocks as a preprocess of the system to solve such recognition errors. And then, the number of edges in each block is extracted to judge whether the block is included in the tree candidate area. The apple area contains a smaller number of edges than that of the tree area, causing empty space in the tree area, and the empty space is filled. Finally, the tree block candidate is judged as a tree if the block meets a certain rate. The study tested hundreds of apple tree images to analyze the edge information on the tree shape and finally extract them to more precisely apply the tree shape information to the algorithm. Th1 is the parameter indicating the number of pixels on the edge for each block area, and Th2 is the parameter indicating the number of blocks corresponding to the tree block in each block area. Th1 and Th2 had values of 50 and 120, respectively, on an experimental basis. The edge extraction resulting images from the tree shape pre-process and the proposed algorithm are as follows. 3.2 CIE LAB Color Space Analysis It is important to clearly separate the apple region from the background to precisely detect the apples in the image. Therefore, this study compared and analyzed various color (a) Input image (b) Apple tree region extraction Fig. 3. An example of the apple tree region extraction result. space models (RGB, HSI, CIE L*a*b) to find out the proper color range of the apple. The CIE L*a*b space [11], similar to the human visualization model, showed the most remarkable separation of the apple from the background. Therefore, this study used the L*, a* and b* color space to define the range of the red apple area. The color range in the defined condition and the extracted apple area are as follows. 0 ≤ L∗ ≤ 100 15 ≤ a∗ ≤ 80 (1) ∗ 0 ≤ b ≤ 60 The three coordinates of CIE L*a*b* represent the lightness of the color (L* = 0 yields black and L* = 100 indicates diffuse white), its position between red/magenta 267 IEIE Transactions on Smart Processing and Computing, vol. 4, no. 4, August 2015 Fig. 4. Apple tree shape analysis examples. source or items hidden by shade on the apple area. The MCT calculates the average brightness based on a 3×3 mask, and then compares and calculates the brightness to nearby pixels. Eq. (2) and the MCT calculation process are as follows. X = (x, y) means the location of each pixel in the image, and the image brightness corresponding to each position is defined as I(X). The 3×3 window, of which X is the center, is W(X); N′ is a set of pixels in W(X), and Y represents nine pixels each in the window. 
In addition, I ( X ) is the average value of pixels in the window, and I(Y) is the brightness value of each pixel in the window. As a comparison function, ζ() becomes 1 if I ( X ) < I ( Y ) , (a) Input image (b) Apple color area extraction Fig. 5. Apple area extraction from the CIE L*a*b* color space. and green (a*, negative values indicate green, while positive values indicate magenta) and its position between yellow and blue (b*, negative values indicate blue and positive values indicate yellow). Figures with various color space models (RGB, HSI, CIE L*a*b) used to find the proper color range of the apple and background area are shown in Fig. 6. 3.3 MCT The MCT of the apple area extracted from the input image may minimize the effect from the light source and only extract texture information in the apple area to minimize fault detection due to reflection from the light otherwise, it is 0. As a set operator, ⊗ connects binary patterns of function, and then nine binary patterns are connected through the operations. MCT was applied for apple and non-apple areas in a 20 × 20 window. The extracted features effectively distinguish apple and non-apple areas through the AdaBoost learning algorithm proposed by Viola and Jones [12]. Γ(X) = ⊗ ζ( I ( X ), I (Y )) Y ∈ N' (2) 3.4 Removal of Fault Detection Area There may exist areas with faulty detection of the apples due to colors and patterns similar to the apple area during apple area detection. The study calculated the average (μ) and standard deviation (σ) of the areas of the apple area detected during apple image extraction to eliminate such errors. The apple area detected from the normal distribution feature was distributed within a range of 3σ standard deviation (μ -3σ, μ +3σ) for 99.7%. Therefore, the study judged the red object rather than the Kim et al.: A Novel Red Apple Detection Algorithm Based on AdaBoost Learning 268 Fig. 6. Apple and background areas in the RGB, HSI and CIE L*a*b* color spaces (red: apple region, green: background). Table 1. Processing time and accuracy of the proposed method. Fig. 7. MCT calculation process. (a) Experiment image 1 (b) Experiment image 2 Image index Pieces of fruits 1 2 3 4 5 6 7 8 9 10 Average (30) 15 15 16 17 18 16 16 15 15 15 502 Proposed Detection rate (%) 86.70 80.00 68.80 70.60 77.80 68.80 87.50 80.00 86.70 80.00 80.68 False detection 2 3 5 5 4 5 2 3 2 1 18 Module name Image size Processing timing Cumulative timing Image resize 864 x 648 ㆍ ㆍ Apple tree recognition 384 x 288 0.370 sec 0.370 sec Apple detection 384 x 288 0.440 sec 0.810 sec Table 2. Comparison of detection performance. (c) MCT conversion image of (a) (d) MCT conversion image of (b) Fig. 8. MCT of two images with different objects. apple within the 3σ range as faulty and eliminated it from the detection area. 4. Performance Evaluation Apple detection was verified with about 30 apple images under environments with various apple sizes, colors and light sources. A comparison of the proposed apple detection study Algorithm Yeon Linker Proposed Chinhhuluun Aggelopoulou et al. et al. method and Lee [6] et al. [8] [1] [3] Coefficient of determination 0.8402 0.7621 0.8006 ( R 2 value) 0.8300 0.7225 against the study by Yeon et al. [1] shows that the proposed method recorded a 3.5% higher detection rate and faulty detection was dramatically decreased. Furthermore, we have compared the coefficient of determination values of existing detection systems. The proposed method shows the best results, as shown in Table 2. Fig. 
10 shows the output of our apple detection system for color images. IEIE Transactions on Smart Processing and Computing, vol. 4, no. 4, August 2015 Fig. 9. Comparison between ground truth and the proposed method. Fig. 10. Results of apple area detection. 269 270 Kim et al.: A Novel Red Apple Detection Algorithm Based on AdaBoost Learning 5. Conclusion The analysis of existing apple detection methods shows that a lot of the faulty detection was due to the color, the light source, and shades being similar to the apple. This paper recognizes the apple tree first and extracts proper colors for the apple area through various color space analyses as a pre-process to solving various problems in apple detection. In addition, the study applied MCT to minimize problems in reflection and shade, and conducts an AdaBoost machine learning process on the applied features to learn the shape information (pattern) corresponding to the apple. The study developed an apple detection algorithm that dramatically decreases faulty detection and improves the detection rate, compared to existing studies, and verified that it operates in real time. Acknowledgement This work was supported by a National Research Foundation of Korea Grant funded by the Korean Government (No. 2012-007498), (NRF-2014R1A1A2058592), and was also supported by the ICT R&D Program of MSIP [I0114-14-1016, Development of Prediction and Response Technology for Agricultural Disasters Based on ICT] and the Creative Vitamin Project. [7] M.W.Hannan, T.F.Burks, D.M.Bulano, "“A Machine Vision Algorithm for Orange Fruit Detection", Agricultural Engineering International: the CIGR Ejournal. Manuscript 1281. Vol.XI, December, 2009. Article (CrossRef Link) [8] A. D. Aggelopoulou, D. Bochtis, S. Fountas, K. C. Swain, T. A. Gemtos, G. D. Nanos, “Yield Prediction in apple orchards based on image processing”, Journal of Precision Agriculture, 2011. Article (CrossRef Link) [9] Ulzii-Orshikh Dorj, Malrey Lee, Sangsub Han, “A Comparative Study on Tangerine Detection Counting and Yield Estimation Algorithm”, Journal of Security and Its Applications, 7(3), pp.405-412, May 2013. Article (CrossRef Link) [10] Bernhard Fröba and Andreas Ernst, “Face detection with the Modified Census Transform”, IEEE International Conf. On Automatic Face and Gesture Recognition(AFGR), pp. 91-96, Seoul, Korea, May. 2004. Article (CrossRef Link) [11] CIE Color Space. [Online]. Available: Article (CrossRef Link) [12] P. Viola and M. Jones, “Fast and robust classification using asymmetric AdaBoost and a detector cascade”, in NIPS 14, 2002, pp. 1311-1318. Article (CrossRef Link) References [1] Hanbyul Yeon, SeongJoon Yoo, Dongil Han, Jinhee Lim, “Automatic Detection and Count of Apples Using N-adic Overlapped Object Separation Technology”, Proceeding of International Conference on Information Technology and Management, November 2013. Article (CrossRef Link) [2] Qi Wang, Stephen Nuske, & Marcel Bergerman, E.A., “Design of Crop Yield Estimation System for Apple Orchards Using Computer Vision”, In Proceedings of ASABE, July 2012. Article (CrossRef Link) [3] Raphael Linker, Oded Cohen, Amos Naor, “Determination of the number of green apples in RGB images recorded in orchards”, Journal of Computers and Electronics in Agriculture, pp. 45-57 February 2012. Article (CrossRef Link) [4] Y. Song, C.A. Glasbey, G.W. Horgan, G. Polder, J.A. Dieleman, G.W.A.M. van der Heijden, "Automatic fruit recognition and counting from multiple images", Journal of Biosystems Engineering, pp.203-215 February 2013. 
Article (CrossRef Link) [5] H. N. Patel, R. K. Jain, M. V. Joshi, “Automatic Segmentation and Yield Measurement of Fruit using Shape Analysis”, Journal of Computer Applications, 45(7), pp.19-24, 2012. Article (CrossRef Link) [6] Radnaabazer Chinhhuluun, Won Suk Lee, “Citrus Yield Mapping System in Natural Outdoor Scenes using the Watershed Transform”, In Proceeding of American Society of Agricultural and Biological Engineers, 2006. Article (CrossRef Link) Donggi Kim received his BSc in Electronic Engineering from Sunmoon University, Asan, Korea, in 2014. He is currently in the Master’s course in the Vision & Image Processing Laboratory at Sejong University. His research interest is image processing. Hongchul Choi received his BSc in Computer Engineering from Sejong University, Seoul, Korea, in 2013. He is currently in the Master’s course in the Vision & Image Processing Laboratory at Sejong University. His research interest is image processing. Jaehoon Choi received his BSc in Computer Engineering from Sejong University, Seoul, Korea, in 2013. He is currently in the Master’s course in the Vision & Image Processing Laboratory at Sejong University. His research interest is image processing. IEIE Transactions on Smart Processing and Computing, vol. 4, no. 4, August 2015 Seong Joon Yoo From 1982 to 2000, he was Information Search and Research Team Leader at the Electronics and Communications Research Institute and was head of the research institute of Search Cast Co., Ltd. Since 2002, he has been with the Department of Computer Engineering, Sejong University, Korea, where he is currently a professor. His research interests include data mining. Copyrights © 2015 The Institute of Electronics and Information Engineers 271 Dongil Han received his BSc in Computer Engineering from Korea University, Seoul, Korea, in 1988, and an MSc in 1990 from the Department of Electric and Electronic Engineering, KAIST, Daejeon, Korea. He received his PhD in 1995 from the Department of Electric and Electronic Engineering at KAIST, Daejeon, Korea. From 1995 to 2003, he was senior researcher in the Digital TV Lab of LG Electronics. Since 2003, he has been with the Department of Computer Engineering, Sejong University, Korea, where he is currently a professor. His research interests include image processing. IEIE Transactions on Smart Processing and Computing, vol. 4, no. 4, August 2015 http://dx.doi.org/10.5573/IEIESPC.2015.4.4.272 272 IEIE Transactions on Smart Processing and Computing Construction of Confusion Lines for Color Vision Deficiency and Verification by Ishihara Chart Keuyhong Cho, Jusun Lee, Sanghoon Song, and Dongil Han* Department of Computer Engineering, Sejong University isolat09@naver.com, jusunleeme@nate.com, song@sejong.ac.kr, dihan@sejong.ac.kr * Corresponding Author: Author: Dongil Han Received June 20, 2015; Revised July 15, 2015; Accepted August 24, 2015; Published August 31, 2015 * Regular Paper Abstract: This paper proposes color databases that can be used for various purposes for people with a color vision deficiency (CVD). The purpose of this paper is to group colors within the sRGB gamut into the CIE L*a*b* color space using the Brettel algorithm to simulate the representative colors of each group into colors visible to people with a CVD, and to establish a confusion line database by comparing colors that might cause confusion for people with different types of color vision deficiency. The validity of the established confusion lines were verified by using an Ishihara chart. 
The different colors that confuse those with a CVD in an Ishihara chart are located in the same confusion line database for both protanopia and deutanopia. Instead of the 3D RGB color space, we grouped confusion colors on the CIE L*a*b* space coordinates in a more distinctive and intuitive manner, and can therefore establish a database of colors as perceived by people with a CVD more accurately.

Keywords: Color vision deficiency, Ishihara chart, Confusion line, Color vision deficiency simulation

1. Introduction

The recent remarkable developments in display technology have ushered in an environment where color information display devices are ubiquitous, and where people can enjoy more color information than at any time in history. Although about 92% of the world's people can enjoy the benefits of rich color, for the remaining 8% who have a color vision deficiency (CVD), it is not possible. People with a CVD are usually classified into three groups: protan (red color-blind), deutan (green color-blind), and tritan (blue color-blind). The protan population has abnormal L-cone cells, which are sensitive to red light, whereas the deutan population has abnormal M-cone cells, which perceive green; both types account for approximately 95% of the people with a CVD. The remaining 5% belong to the tritan group, characterized by an absence of the S-cone cells sensitive to blue light [1].

In this paper, we propose an algorithm to construct a database of confusion lines in order to identify colors that cause confusion for people with a CVD. To solve the problem, we used the protanopia and deuteranopia simulation algorithms [2, 3, 4] proposed by Brettel to construct a database of confusion lines within the sRGB gamut, but in the CIE L*a*b* color space [5]. However, because the definition of color differences within the sRGB color gamut is ambiguous, we selected the representative values from the sRGB color gamut in the CIE L*a*b* color space, which is closer to real color perception in humans, in order to produce more effective results. By using colors that are grouped according to different types of CVD, we could establish a database of confusion lines that cause confusion for people with each type of CVD.

2. Related Work

Some previous research [6] proposed a way of changing colors from the color palette that cause confusion for people with a CVD in order to prevent that confusion, and other research [7, 8] proposed establishing a database of confusion lines by using the sRGB color space. But this previous research did not consider the characteristics of human color perception. As a result, some color data might be lost in the process of grouping representative values. Given this, we used colors within the sRGB gamut in the CIE L*a*b* color space to reflect the features of human color perception, and that color gamut is shown in Fig. 1 [9].

Fig. 1. sRGB color gamut in the CIE L*a*b* color space.

3. The Proposed Scheme

Fig. 2 diagrams how to construct the confusion line database (DB) proposed by this research. In the preliminary phase, we created color boxes in the CIE L*a*b* color space by grouping RGB color values to CIE L*a*b* coordinates, and selected their central points as their representative values. We then simulated the representative values by using the protanopia and deutanopia simulation algorithms proposed by Brettel, and constructed a database of confusion lines based on the given conditions by comparing the simulated representative values.

Fig. 2. The four steps of the confusion line database construction procedure.

3.1 Grouping Phase of RGB values in CIE L*a*b*

The gamut of the CIE L*a*b* color space [10] is L*: 0 ~ 100, a*: -128 ~ 128, b*: -128 ~ 128. This phase groups all L*a*b* color values to the CIE L*a*b* 3D color coordinates. In the CIE L*a*b* color space, five units in L* are grouped together, while 13 units in each of a* and b* are grouped together. Therefore, virtual boxes such as the L* box, a* box and b* box were created by dividing L*, a* and b*, respectively, into 20 equal parts. These virtual boxes are created using Eqs. (1) to (3). Although the number of virtual boxes in the CIE L*a*b* color space can be 8,000 (20x20x20), we removed colors that exist in the CIE L*a*b* color space but not in the sRGB space. The total number of virtual boxes used in this research is 982. After creating virtual boxes from (0, 0, -2) to (19, -2, 5), we calculated the central points of the virtual boxes by using Eqs. (4) to (6), and then assigned the representative values $L^*_{pri}$, $a^*_{pri}$ and $b^*_{pri}$ to their respective virtual boxes.

$L^*_{box} = (L^*/5)$ (1)
$a^*_{box} = (a^*/13)$ (2)
$b^*_{box} = (b^*/13)$ (3)
$L^*_{pri} = L^*_{box} + 2.5$ (4)
$a^*_{pri} = a^*_{box} + 6.5$ (5)
$b^*_{pri} = b^*_{box} + 6.5$ (6)

3.2 Simulation Phase of Representative Color Values

In the simulation phase of representative color values, we conducted the simulation of color appearance by comparing the representative color value of the central point of each rectangular L*a*b* virtual box, created in the RGB value grouping process, to the CIE L*a*b* space. We also implemented the protanopia and deutanopia simulation algorithms proposed by Vienot et al. [2]. Fig. 3 shows the confusion lines of protanopia in a CIE 1931 chromaticity diagram and the simulation image perceived by someone with protanopia.

Fig. 3. Before and after Brettel color simulation in a CIE 1931 chromaticity diagram.

3.3 Comparison Phase of Representative Values of Confusion Lines

The representative values of the existing confusion lines were calculated using Eqs. (4) to (6). Eq. (7) calculates the color difference between $(L^*_1, a^*_1, b^*_1)$ and $(L^*_2, a^*_2, b^*_2)$ in the CIE L*a*b* space. If condition (8) is met, each representative color pair, which looked different before color simulation, looks like the same color after the simulation. This color pair causes confusion for people who have a CVD, but not for people who do not. In addition, whenever a color is added to a confusion line, the representative values $P_{L^*}(x)$, $P_{a^*}(x)$, $P_{b^*}(x)$ of each confusion line were recalculated using Eqs. (9) to (11) to provide a more reliable algorithm for constructing the database of confusion lines. A conventional color-difference calculation uses Eq. (12), which was used to calculate the existing color differences [10, 11] between two colors $(L^*_1, a^*_1, b^*_1)$ and $(L^*_2, a^*_2, b^*_2)$. There are some cases where people with a CVD can distinguish similar colors, depending on the level of brightness. To avoid this, we used condition (8) instead of condition (12).

$\Delta L^* = L^*_1 - L^*_2, \quad \Delta a^*b^* = \sqrt{(a^*_1 - a^*_2)^2 + (b^*_1 - b^*_2)^2}$ (7)
$\Delta L^* \le 3 \ \textrm{and} \ \Delta a^*b^* \le 15$ (8)
$P_{L^*}(x) = \frac{1}{n_x} \sum_{i=0}^{n_x} L^*_i$ (9)
$P_{a^*}(x) = \frac{1}{n_x} \sum_{i=0}^{n_x} a^*_i$ (10)
$P_{b^*}(x) = \frac{1}{n_x} \sum_{i=0}^{n_x} b^*_i$ (11)
$\Delta E^* = \sqrt{(L^*_1 - L^*_2)^2 + (a^*_1 - a^*_2)^2 + (b^*_1 - b^*_2)^2}$ (12)
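The following minimal sketch shows, under the grouping rules stated above, how the box quantization of Eqs. (1)-(3), the color-difference test of condition (8) with Eq. (7), and the running representative values of Eqs. (9)-(11) fit together. The function names and the example L*a*b* triples are hypothetical, and the Brettel/Vienot dichromat simulation that would produce the simulated colors is not reproduced here; placeholder values stand in for its output.

import math

def lab_box(L, a, b):
    """Virtual-box index of a CIE L*a*b* color (Eqs. (1)-(3)):
    L* is grouped in 5-unit steps, a* and b* in 13-unit steps."""
    return (int(L // 5), int(a // 13), int(b // 13))

def confusable(sim1, sim2, dL_max=3.0, dab_max=15.0):
    """Condition (8): two CVD-simulated colors are treated as confusable when
    their lightness difference and chroma-plane difference (Eq. (7)) are small."""
    L1, a1, b1 = sim1
    L2, a2, b2 = sim2
    dL = abs(L1 - L2)
    dab = math.hypot(a1 - a2, b1 - b2)
    return dL <= dL_max and dab <= dab_max

def add_to_line(line_colors, new_color):
    """Append a color to a confusion line and return the updated representative
    value, i.e. the running mean of Eqs. (9)-(11)."""
    line_colors.append(new_color)
    n = len(line_colors)
    return tuple(sum(c[i] for c in line_colors) / n for i in range(3))

# Hypothetical example: two representative colors whose simulated appearances
# (placeholder values standing in for the Brettel/Vienot simulation output)
# satisfy condition (8), so they join the same confusion line.
sim_red_like = (52.0, 10.0, 41.0)
sim_green_like = (54.0, 4.0, 30.0)
line = []
if confusable(sim_red_like, sim_green_like):
    rep = add_to_line(line, sim_red_like)
    rep = add_to_line(line, sim_green_like)
    print("same confusion line, representative value:", rep)

A full implementation would iterate this check over all 982 simulated box representatives and append the resulting lines to the protanopia or deutanopia DB.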
3.4 Discrimination Phase of Confusion Lines by Major Colors

When we looked into the simulated color DB after finishing the construction of the confusion lines using the algorithm proposed by this research, the simulated colors were divided into two dominant colors. The major colors that can be perceived by people with protanopia or deutanopia are divided into yellow and blue regions. In this research, we classified each confusion line either into the yellow DB or into the blue DB. By dividing the confusion line DB by major colors, confusing colors can be clearly distinguished. Given this, we converted the coordinates in the CIE L*a*b* color space into spherical coordinates, and distinguished confusion lines by major colors using condition (13) [12], which was designed to discriminate colors using the angle φ in order to make color compensation more useful for people with a CVD. The result of this algorithm is shown in Fig. 4. The x-axis represents the number of colors in each confusion line, and the y-axis represents the confusion line type in both groups.

$\textrm{Blue}: \varphi \ge 270^{\circ} \ \textrm{or} \ \varphi \le 90^{\circ}; \quad \textrm{Yellow}: 90^{\circ} < \varphi < 270^{\circ}$ (13)

where φ is the azimuth angle component in a spherical coordinate system.

Fig. 4. Yellow confusion line group (a) and blue confusion line group (b) for people with protanopia.

4. Experimental Results

To verify the confusion lines created based on the algorithm proposed by this research, we compared them with the existing confusion lines. Fig. 5(a) shows the theoretical confusion lines for people with protanopia, and Fig. 5(b) shows the confusion lines generated by the proposed algorithm. The theoretical confusion lines ignore brightness; thus, the colors on the same theoretical confusion line can be differentiated by someone with protanopia because of the difference in brightness. But the confusion lines established by the suggested algorithm distinguish brightness, thus creating more effective confusion lines. Fig. 6 shows the confusion lines generated by the proposed algorithm in the CIE L*a*b* color space. Fig. 7 shows all confusion lines for those with protanopia and deutanopia. The vertical axis indicates the number of confusion lines for people with each type of CVD, while the horizontal axis shows the different colors that exist on the same confusion line. For example, Fig. 7(c) lists the representative color values of each confusion line for people with deutanopia. We can see that red and green
Because each CVD patient can discern similar colors depending on brightness, we can see that they do not exist on the same horizontal axis in Fig. 7(c). For Fig. 7(a), the results are the same. Fig. 8 shows three confusion lines, which were magnified to help explain Fig. 7. After the construction of confusion 276 Cho et al.: Construction of Confusion Lines for Color Vision Deficiency and Verification by Ishihara Chart Table 1. Constructed Confusion Lines. Deficiency (a) Deutanopia confusion lines in D60, D61, and D62 Protanopia Confusion Lines P1 … P27 … P104 Confusion Boxes (0 0 -2) … (5 3 -6) (5 4 -7) (5 4 -6) … (19 -1 0), (19 -1 1) D1 … D27 … D95 (0 0 -2) … (5 3 -4) (6 0 -3) (6 1 -4) (6 2 -4) (6 2 -3) … (19 -1 0), (19 -1 1) Deuteranopia (b) Simulation image (a) perceived by someone with deutanopia Fig. 8. Magnified images of arrangement of representative colors. (a) (b) (c) (d) (a) Ishihara chart perceived by people without a CVD; (b) Ishihara chart perceived by people with a CVD; (c) colors on the P14 confusion line within an Ishihara chart, and (d) colors on the P17 confusion line within an Ishihara chart. Fig. 9. Ishihara chart test results. lines for people with protanopia or deutanopia, we found that the number of confusion lines for those with protanopia amounted to 104, ranging from P1 to P104, while the number for those with deutanopia was 95, ranging from D1 to D95, as shown in Table 1. To verify these results, we classified colors based on the databases of confusion lines created in an Ishihara chart. We checked how confusion lines were distributed by analyzing colors within an Ishihara chart using the confusion line DB. As shown in Fig. 9(a), numbers 6 and 2 are visible for people without a CVD. Other examples are Figs. 9(c) and (d). Figs. 9(c) and (d) show the colors that exist on the P14 confusion line and the P17 confusion line, respectively, within each Ishihara chart. As seen below, we found that the color for numbers 6 and 2 in the Ishihara chart and the background colors exist on the same confusion lines. Therefore, people with a CVD could not see the numbers within Ishihara charts like Fig. 9(b). 5. Conclusion Previous research was conducted by grouping confusion colors in the 3D RGB color space, and some color data might be lost in the grouping process. However, if we use the confusion line construction method proposed by this research, we can group confusion colors on the CIE L*a*b* color space coordinates in a more distinctive and intuitive manner and can establish a database of colors that can be perceived by people with a CVD in a more accurate IEIE Transactions on Smart Processing and Computing, vol. 4, no. 4, August 2015 manner. Therefore, we were able to group a wider color range and more varied colors that can be perceived by people with a CVD than the existing confusion line databases. The generated confusion line DB for protanopia and deutanopia will contribute greatly to the development of other research on people with a CVD. Acknowledgement This work was supported by a National Research Foundation of Korea grant funded by the Korean Government (NRF-2014R1A1A2058592), and was also supported by a National Research Foundation of Korea grant funded by the Korean Government (No. 2012007498). References [1] M. P. Simunovic, "Colour vision deficiency," Eye, 24.5, pp. 747-755, 2009. Article (CrossRef Link) [2] F. Vienot, H. Brettel, J. D. 
Mollon, "Digital Video Colourmaps for Checking the Legibility of Displays by Dichromats," Color Research and Application, 24.4, pp. 243-252, August, 1999. Article (CrossRef Link) [3] G. M. Machado, et al., "A physiologically-based model for simulation of color vision deficiency," Visualization and Computer Graphics, IEEE Transactions on 15.6, pp. 1291-1298, 2009. Article (CrossRef Link) [4] C. Rigden, "'The Eye of the Beholder'-Designing for Colour-Blind Users," British Telecommunications Engineering 17, pp. 291-295, 1999. Article (CrossRef 277 Link) [5] CIE ColorSpace. [Online]. Available: Article (CrossRef Link) [6] D. Han, S. Yoo and B. Kim, "A Novel ConfusionLine Separation Algorithm Based on Color Segmentation for Color Vision Deficiency," Journal of Imaging Science & Technology, 56.3, 2012. Article (CrossRef Link) [7] S. Park, Y. Kim, “The Confusing Color line of the Color deficiency in Panel D-15 using CIELab Color Space” Journal of Korean Ophthalmic Optics Society 6 pp. 139-144, 2001 Article (CrossRef Link) [8] P. Doliotis, et al., "Intelligent modification of colors in digitized paintings for enhancing the visual perception of color-blind viewers," Artificial Intelligence Applications and Innovations III, pp. 293-301, 2009. Article (CrossRef Link) [9] CIE 1931. [Online].Available: Article (CrossRef Link) [10] CIE. Technical report: Industrial colour-difference evaluation. CIE Pub. No. 116. Vienna: Central Bureau of the CIE; 1995. Article (CrossRef Link) [11] Manuel Melgosa, "Testing CIELAB-based colordifference formulas," Color Research & Application, 25.1, pp. 49-55, 2000. Article (CrossRef Link) [12] H. Brettel, F. Viénot, and J. D. Mollon. "Compu-terized simulation of color appearance for dichromats," JOSA A, 14.10, pp. 2647-2655, 1997. Article (CrossRef Link) Appendix See Table 2 Table 2. 
Confusion line map for protanopin Type of confusion line P1 Box positions on same confusion line ( L* a* b* ) ( 0 0 -2) P2 ( 0 0 -1) P3 ( 1 -1 -1) P4 ( 1 -1 0), ( 1 0 0), ( 1 1 0) P5 ( 1 0 -2), ( 1 0 -1), ( 1 1 -2), ( 1 1 -1) P6 ( 1 1 -3), ( 1 2 -4), ( 1 2 -3) P7 ( 2 -1 -1), ( 2 -1 0), ( 2 0 -1), ( 2 0 0), ( 2 1 -1), ( 2 1 0), ( 2 2 -1), ( 2 2 0) P8 ( 2 0 -2), ( 2 1 -2), ( 2 2 -2) P9 ( 2 1 -3), ( 2 2 -4), ( 2 2 -3) P10 ( 2 1 1), ( 2 2 1) P11 ( 2 3 -5), ( 3 2 -4), ( 3 3 -5), ( 3 3 -4) P12 ( 3 -2 0), ( 3 -1 -1), ( 3 -1 0), ( 3 0 -1), ( 3 0 0), ( 3 1 -1), ( 3 1 0), ( 3 2 -1), ( 3 2 0), ( 4 3 -1), ( 4 3 0) P13 ( 3 -2 1), ( 3 -1 1), ( 3 0 1), ( 3 1 1), ( 3 2 1) P14 ( 3 0 -2), ( 3 1 -2), ( 3 2 -2), ( 4 3 -2) P15 ( 3 1 -3), ( 3 2 -3) P16 ( 4 -2 0), ( 4 -2 1), ( 4 -1 0), ( 4 -1 1), ( 4 0 0), ( 4 0 1), ( 4 1 0), ( 4 1 1), ( 4 2 0), ( 4 2 1), ( 5 3 0), ( 5 3 1),( 5 3 2) P17 ( 4 -1 -2), ( 4 -1 -1), ( 4 0 -2), ( 4 0 -1), ( 4 1 -2), ( 4 1 -1), ( 4 2 -2), ( 4 2 -1), ( 5 3 -2), ( 5 3 -1) 278 Cho et al.: Construction of Confusion Lines for Color Vision Deficiency and Verification by Ishihara Chart P18 ( 4 0 -3), ( 4 1 -4), ( 4 1 -3), ( 4 2 -4), ( 4 2 -3), ( 4 3 -4), ( 4 3 -3), ( 5 3 -3) P19 ( 4 1 2), ( 4 2 2) P20 ( 4 2 -5), ( 4 3 -5), ( 4 4 -6), ( 5 4 -5) P21 ( 5 -3 2), ( 5 -2 2) P22 ( 5 -2 0), ( 5 -2 1), ( 5 -1 0), ( 5 -1 1), ( 5 0 0), ( 5 0 1), ( 5 1 0), ( 5 1 1), ( 5 2 0), ( 5 2 1), ( 5 2 2), ( 6 3 0), ( 6 3 1), ( 6 3 2) P23 ( 5 -1 -2), ( 5 -1 -1), ( 5 0 -2), ( 5 0 -1), ( 5 1 -2), ( 5 1 -1), ( 5 2 -2), ( 5 2 -1), ( 6 3 -2), ( 6 3 -1), ( 6 4 -2), ( 6 4 -1) P24 ( 5 -1 2), ( 5 0 2), ( 5 1 2), ( 6 2 2), ( 6 3 3) P25 ( 5 0 -3), ( 5 1 -4), ( 5 1 -3), ( 5 2 -4), ( 5 2 -3), ( 5 3 -4), ( 6 4 -4), ( 6 4 -3) P26 ( 5 2 -5), ( 5 3 -5), ( 6 3 -4), ( 6 4 -5) P27 ( 5 3 -6), ( 5 4 -7), ( 5 4 -6) P28 ( 6 -3 1), ( 6 -3 2), ( 6 -2 1), ( 6 -2 2), ( 6 -1 1), ( 6 -1 2), ( 6 0 1), ( 6 0 2), ( 6 1 1), ( 6 1 2), ( 6 2 1), ( 7 2 1), ( 7 2 2), ( 7 2 3), ( 7 3 1), ( 7 3 2), ( 7 3 3), ( 7 4 2), ( 7 4 3) P29 ( 6 -2 -1), ( 6 -2 0), ( 6 -1 -1), ( 6 -1 0), ( 6 0 -1), ( 6 0 0), ( 6 1 -1), ( 6 1 0), ( 6 2 -1), ( 6 2 0), ( 7 3 -1), ( 7 3 0), ( 7 4 -1), ( 7 4 0), ( 7 4 1) P30 ( 6 -1 -2), ( 6 0 -3), ( 6 0 -2), ( 6 1 -3), ( 6 1 -2), ( 6 2 -3), ( 6 2 -2), ( 6 3 -3), ( 7 3 -2), ( 7 4 -3), ( 7 4 -2) P31 ( 6 1 -4), ( 6 2 -4), ( 7 4 -4) P32 ( 6 2 -5), ( 6 3 -5), ( 7 3 -4), ( 7 4 -5), ( 7 5 -5), ( 8 5 -4) P33 ( 6 3 -6), ( 6 4 -6), ( 7 5 -6) P34 ( 6 4 -7), ( 6 5 -7) P35 ( 6 5 -8), ( 7 5 -7) P36 ( 7 -3 1), ( 7 -3 2), ( 7 -2 1), ( 7 -2 2), ( 7 -1 1), ( 7 -1 2), ( 7 0 1), ( 7 0 2), ( 7 1 1), ( 7 1 2), ( 7 1 3), ( 8 2 1), ( 8 2 2), ( 8 2 3), ( 8 3 2), ( 8 3 3), ( 8 4 2), ( 8 4 3), ( 9 5 2), ( 9 5 3), ( 9 5 4) P37 ( 7 -2 -1), ( 7 -2 0), ( 7 -1 -1), ( 7 -1 0), ( 7 0 -1), ( 7 0 0), ( 7 1 -1), ( 7 1 0), ( 7 2 -1), ( 7 2 0), ( 8 3 -1), ( 8 3 0), ( 8 3 1), ( 8 4 0), ( 8 4 1),( 9 5 0), ( 9 5 1) P38 ( 7 -1 -2), ( 7 0 -3), ( 7 0 -2), ( 7 1 -3), ( 7 1 -2), ( 7 2 -3), ( 7 2 -2), ( 7 3 -3), ( 8 4 -3), ( 8 4 -2), ( 8 4 -1), ( 8 5 -2) P39 ( 7 0 -4), ( 7 1 -5), ( 7 1 -4), ( 7 2 -5), ( 7 2 -4), ( 7 3 -5), ( 8 4 -5), ( 8 4 -4), ( 8 5 -5), ( 8 5 -3) P40 ( 7 0 3) P41 ( 7 2 -6), ( 7 3 -6), ( 7 4 -6), ( 8 5 -6) P42 ( 7 4 -7), ( 8 4 -6), ( 8 5 -7), ( 9 6 -6) P43 ( 7 5 -8) P44 ( 8 -4 3), ( 8 -3 3), ( 8 -2 3), ( 8 -1 3), ( 8 0 3), ( 8 1 3), ( 9 2 3), ( 9 3 3), ( 9 4 3), ( 9 4 4), (10 5 3), (10 5 4) P45 ( 8 -3 0), ( 8 -3 1), ( 8 -2 0), ( 8 -2 1), ( 8 -1 0), ( 8 -1 1), ( 8 0 0), ( 8 0 1), ( 8 1 0), ( 8 1 1), ( 8 2 0), ( 9 2 1), ( 9 2 2), ( 9 3 0), ( 9 3 1), ( 9 3 2), ( 9 4 1), 
( 9 4 2), (10 5 1), (10 5 2) P46 ( 8 -3 2), ( 8 -2 2), ( 8 -1 2), ( 8 0 2), ( 8 1 2) P47 ( 8 -2 -1), ( 8 -1 -2), ( 8 -1 -1), ( 8 0 -2), ( 8 0 -1), ( 8 1 -2), ( 8 1 -1), ( 8 2 -2), ( 8 2 -1), ( 8 3 -2), ( 9 3 -2), ( 9 3 -1), ( 9 4 -2), ( 9 4 -1), ( 9 4 0), ( 9 5 -1) P48 ( 8 -1 -3), ( 8 0 -4), ( 8 0 -3), ( 8 1 -4), ( 8 1 -3), ( 8 2 -4), ( 8 2 -3), ( 8 3 -4), ( 8 3 -3), ( 9 4 -4), ( 9 4 -3), ( 9 5 -3), ( 9 5 -2), (10 6 -3) P49 ( 8 1 -5), ( 8 2 -5), ( 8 3 -5), ( 9 4 -5), ( 9 5 -5), ( 9 5 -4), (10 6 -4) P50 ( 8 2 -6), ( 8 3 -6), ( 9 5 -6), (10 6 -5) P51 ( 8 3 -7), ( 8 4 -7), ( 9 4 -6) P52 ( 9 -4 2), ( 9 -4 3), ( 9 -3 2), ( 9 -3 3), ( 9 -2 2), ( 9 -2 3), ( 9 -1 2), ( 9 -1 3), ( 9 0 2), ( 9 0 3), ( 9 1 2), ( 9 1 3), (10 2 2), (10 2 3), (10 2 4), (10 3 3), (10 3 4), (10 4 3), (10 4 4) P53 ( 9 -3 0), ( 9 -3 1), ( 9 -2 0), ( 9 -2 1), ( 9 -1 0), ( 9 -1 1), ( 9 0 0), ( 9 0 1), ( 9 1 0), ( 9 1 1), ( 9 2 0), (10 2 0), (10 2 1), (10 3 0), (10 3 1), (10 3 2), (10 4 1), (10 4 2), (11 5 1), (11 5 2) P54 ( 9 -2 -2), ( 9 -2 -1), ( 9 -1 -2), ( 9 -1 -1), ( 9 0 -2), ( 9 0 -1), ( 9 1 -2), ( 9 1 -1), ( 9 2 -2), ( 9 2 -1), (10 3 -2), (10 3 -1), (10 4 -2), (10 4 -1), (10 4 0), (10 5 -1), (10 5 0) P55 ( 9 -1 -3), ( 9 0 -4), ( 9 0 -3), ( 9 1 -4), ( 9 1 -3), ( 9 2 -4), ( 9 2 -3), ( 9 3 -4), ( 9 3 -3), (10 4 -4), (10 4 -3), (10 5 -3), (10 5 -2), (11 6 -3), (11 6 -2) P56 ( 9 1 -5), ( 9 2 -6), ( 9 2 -5), ( 9 3 -6), ( 9 3 -5), (10 4 -5), (10 5 -6), (10 5 -5), (10 5 -4), (11 6 -4) P57 (10 -4 1), (10 -4 2), (10 -3 1), (10 -3 2), (10 -2 1), (10 -2 2), (10 -1 1), (10 -1 2), (10 0 1), (10 0 2), (10 1 1), (10 1 2), (10 1 3), (11 2 1), (11 2 2), (11 2 3), (11 3 2), (11 3 3), (11 4 2), (11 4 3) P58 (10 -4 3), (10 -3 3), (10 -2 3), (10 -1 3), (10 0 3), (10 0 4), (10 1 4), (11 2 4), (11 3 4), (11 4 4) IEIE Transactions on Smart Processing and Computing, vol. 4, no. 
4, August 2015 P59 (10 -3 0), (10 -2 -1), (10 -2 0), (10 -1 -1), (10 -1 0), (10 0 -1), (10 0 0), (10 1 -1), (10 1 0), (10 2 -1), (11 2 0), (11 3 -1), (11 3 0), (11 3 1), (11 4 0), (11 4 1), (11 5 0) P60 (10 -2 -2), (10 -1 -3), (10 -1 -2), (10 0 -3), (10 0 -2), (10 1 -3), (10 1 -2), (10 2 -3), (10 2 -2), (10 3 -3), (11 3 -3), (11 3 -2), (11 4 -3), (11 4 -2), (11 4 -1), (11 5 -2), (11 5 -1) P61 (10 0 -4), (10 1 -5), (10 1 -4), (10 2 -5), (10 2 -4), (10 3 -5), (10 3 -4), (11 4 -4), (11 5 -5), (11 5 -4), (11 5 -3) P62 (10 2 -6), (10 3 -6), (10 4 -6), (11 6 -5) P63 (10 6 -6) P64 (11 -5 4), (11 -4 3), (11 -4 4), (11 -3 3), (11 -3 4), (11 -2 3), (11 -2 4), (11 -1 3), (11 -1 4), (11 0 3), (11 0 4), (11 1 3), (11 1 4), (12 2 3), (12 2 4), (12 3 4) P65 (11 -4 1), (11 -4 2), (11 -3 1), (11 -3 2), (11 -2 1), (11 -2 2), (11 -1 1), (11 -1 2), (11 0 1), (11 0 2), (11 1 1), (11 1 2), (12 2 1), (12 2 2), (12 3 1), (12 3 2), (12 3 3), (12 4 2) P66 (11 -3 -1), (11 -3 0), (11 -2 -1), (11 -2 0), (11 -1 -1), (11 -1 0), (11 0 -1), (11 0 0), (11 1 -1), (11 1 0), (11 2 -1), (12 2 0), (12 3 -1), (12 3 0), (12 4 -1), (12 4 0), (12 4 1) P67 (11 -2 -2), (11 -1 -3), (11 -1 -2), (11 0 -3), (11 0 -2), (11 1 -3), (11 1 -2), (11 2 -3), (11 2 -2), (12 3 -3), (12 3 -2), (12 4 -3), (12 4 -2), (12 5 -2) P68 (11 0 -5), (11 0 -4), (11 1 -5), (11 1 -4), (11 2 -5), (11 2 -4), (11 3 -5), (11 3 -4), (11 4 -5), (12 4 -4), (12 5 -4), (12 5 -3), (12 6 -4) P69 (12 -5 3), (12 -5 4), (12 -4 3), (12 -4 4), (12 -3 3), (12 -3 4), (12 -2 3), (12 -2 4), (12 -1 3), (12 -1 4), (12 0 3), (12 0 4), (12 1 3), (12 1 4), (13 1 5), (13 2 3), (13 2 4), (13 2 5) P70 (12 -4 1), (12 -4 2), (12 -3 1), (12 -3 2), (12 -2 1), (12 -2 2), (12 -1 1), (12 -1 2), (12 0 1), (12 0 2), (12 1 1), (12 1 2), (13 2 1), (13 2 2), (13 3 1), (13 3 2), (13 3 3) P71 (12 -3 -1), (12 -3 0), (12 -2 -1), (12 -2 0), (12 -1 -1), (12 -1 0), (12 0 -1), (12 0 0), (12 1 -1), (12 1 0), (12 2 -1), (13 2 -1), (13 2 0), (13 3 -1), (13 3 0), (13 4 -1) P72 (12 -2 -3), (12 -2 -2), (12 -1 -3), (12 -1 -2), (12 0 -3), (12 0 -2), (12 1 -3), (12 1 -2), (12 2 -3), (12 2 -2), (13 3 -3), (13 3 -2), (13 4 -3), (13 4 -2) P73 (12 -1 -4), (12 0 -5), (12 0 -4), (12 1 -5), (12 1 -4), (12 2 -5), (12 2 -4), (12 3 -5), (12 3 -4), (13 4 -4), (13 5 -4) P74 (13 -5 2), (13 -5 3), (13 -4 2), (13 -4 3), (13 -3 2), (13 -3 3), (13 -2 2), (13 -2 3), (13 -1 2), (13 -1 3), (13 0 2), (13 0 3), (13 1 2), (13 1 3), (13 1 4), (14 2 2), (14 2 3), (14 2 4) P75 (13 -5 4), (13 -4 4), (13 -3 4), (13 -2 4), (13 -1 4), (13 0 4), (14 1 4), (14 1 5) P76 (13 -4 0), (13 -4 1), (13 -3 0), (13 -3 1), (13 -2 0), (13 -2 1), (13 -1 0), (13 -1 1), (13 0 0), (13 0 1), (13 1 0), (13 1 1), (14 2 0), (14 2 1) P77 (13 -3 -1), (13 -2 -2), (13 -2 -1), (13 -1 -2), (13 -1 -1), (13 0 -2), (13 0 -1), (13 1 -2), (13 1 -1), (13 2 -2), (14 3 -2), (14 3 -1) P78 (13 -2 -3), (13 -1 -4), (13 -1 -3), (13 0 -4), (13 0 -3), (13 1 -4), (13 1 -3), (13 2 -4), (13 2 -3), (13 3 -4), (14 3 -3), (14 4 -3) P79 (14 -5 2), (14 -5 3), (14 -4 2), (14 -4 3), (14 -3 2), (14 -3 3), (14 -2 2), (14 -2 3), (14 -1 2), (14 -1 3), (14 0 2), (14 0 3), (14 1 2), (14 1 3), (15 1 4) P80 (14 -5 4), (14 -5 5), (14 -4 4), (14 -4 5), (14 -3 4), (14 -3 5), (14 -2 4), (14 -2 5), (14 -1 4), (14 -1 5), (14 0 4), (14 0 5), (15 1 5) P81 (14 -4 0), (14 -4 1), (14 -3 0), (14 -3 1), (14 -2 0), (14 -2 1), (14 -1 0), (14 -1 1), (14 0 0), (14 0 1), (14 1 0), (14 1 1), (15 2 0) P82 (14 -3 -2), (14 -3 -1), (14 -2 -2), (14 -2 -1), (14 -1 -2), (14 -1 -1), (14 0 -2), (14 0 -1), (14 1 -2), (14 1 -1), (14 2 
-2), (14 2 -1) P83 (14 -2 -3), (14 -1 -3), (14 0 -3), (14 1 -3), (14 2 -3), (15 3 -3) P84 (15 -6 4), (15 -6 5), (15 -5 4), (15 -5 5), (15 -4 4), (15 -4 5), (15 -3 4), (15 -3 5), (15 -2 4), (15 -2 5), (15 -1 4), (15 -1 5), (15 0 4), (15 0 5) P85 (15 -5 1), (15 -5 2), (15 -4 1), (15 -4 2), (15 -3 1), (15 -3 2) (15 -2 1), (15 -2 2), (15 -1 1), (15 -1 2), (15 0 1), (15 0 2), (15 1 1), (15 1 2) P86 (15 -5 3), (15 -4 3), (15 -3 3), (15 -2 3), (15 -1 3), (15 0 3), (15 1 3) P87 (15 -4 -1), (15 -4 0), (15 -3 -1), (15 -3 0), (15 -2 -1), (15 -2 0), (15 -1 -1), (15 -1 0), (15 0 -1), (15 0 0), (15 1 -1), (15 1 0), (15 2 -1) P88 (15 -3 -2), (15 -2 -3), (15 -2 -2), (15 -1 -3), (15 -1 -2), (15 0 -3), (15 0 -2), (15 1 -3), (15 1 -2), (15 2 -3), (15 2 -2) P89 (16 -6 3), (16 -6 4), (16 -5 3), (16 -5 4), (16 -4 3), (16 -4 4), (16 -3 3), (16 -3 4), (16 -2 3), (16 -2 4), (16 -1 3), (16 -1 4), (16 0 3), (16 0 4), (16 0 5) P90 (16 -6 5), (16 -5 5), (16 -4 5), (16 -3 5), (16 -2 5), (16 -1 5) 279 280 Cho et al.: Construction of Confusion Lines for Color Vision Deficiency and Verification by Ishihara Chart P91 (16 -5 1), (16 -5 2), (16 -4 1), (16 -4 2), (16 -3 1), (16 -3 2), (16 -2 1), (16 -2 2), (16 -1 1), (16 -1 2), (16 0 1), (16 0 2), (16 1 1) P92 (16 -4 -1), (16 -4 0), (16 -3 -1), (16 -3 0), (16 -2 -1), (16 -2 0), (16 -1 -1), (16 -1 0), (16 0 -1), (16 0 0), (16 1 -1), (16 1 0) P93 (16 -3 -2), (16 -2 -2), (16 -1 -2), (16 0 -2), (16 1 -2), (16 2 -2) P94 (17 -6 3), (17 -6 4), (17 -5 3), (17 -5 4), (17 -4 3), (17 -4 4), (17 -3 3), (17 -3 4), (17 -2 3), (17 -2 4), (17 -1 3), (17 -1 4) P95 (17 -6 5), (17 -5 5), (17 -4 5), (17 -4 6), (17 -3 5), (17 -3 6), (17 -2 5), (17 -2 6), (17 -1 5), (17 -1 6) P96 (17 -5 1), (17 -5 2), (17 -4 1), (17 -4 2), (17 -3 1), (17 -3 2), (17 -2 1), (17 -2 2), (17 -1 1), (17 -1 2), (17 0 1), (17 0 2) P97 (17 -4 -1), (17 -4 0), (17 -3 -1), (17 -3 0), (17 -2 -1), (17 -2 0), (17 -1 -1), (17 -1 0), (17 0 -1), (17 0 0), (17 1 -1) P98 (17 -3 -2), (17 -2 -2), (17 -1 -2) P99 (18 -4 1), (18 -4 2), (18 -3 1), (18 -3 2), (18 -2 1), (18 -2 2), (18 -1 1), (18 -1 2) P100 (18 -4 3), (18 -4 4), (18 -3 3), (18 -3 4), (18 -2 3), (18 -2 4), (18 -1 3) P101 (18 -4 5), (18 -4 6), (18 -3 5), (18 -3 6), (18 -2 5), (18 -2 6) P102 (18 -3 -1), (18 -3 0), (18 -2 -1), (18 -2 0), (18 -1 -1), (18 -1 0), (18 0 -1), (18 0 0) P103 (19 -2 4), (19 -2 5) P104 (19 -1 0), (19 -1 1) Keuyhong Cho received his BSc in Physics from Sejong University, Seoul, Korea, in 2015. He is currently a Master’s student in the Vision & Image Processing Laboratory at Sejong University. His research interest is image processing. Jusun Lee received his BSc in Computer Engineering from Sejong University, Seoul, Korea, in 2015. He is currently a Master’s student in the Vision & Image Processing Laboratory at Sejong University. His research interest is image processing. Copyrights © 2015 The Institute of Electronics and Information Engineers Sanghoon Song received his BSc in Electronics Engineering from Yonsei University, Seoul, Korea, in 1977, and an MSc in 1979 from the Department of Computer Science, KAIST, Korea. He received his PhD in 1992 from the Department of Computer Science at the University of Minnesota, Minneapolis, U.S.A. Since 1992, he has been with the Department of Computer Engineering, Sejong University, Korea, where he is currently a professor. His research interests are embedded computing systems, computer arithmetic, and distributed systems. 
Dongil Han received his BSc in Computer Engineering from Korea University, Seoul, Korea, in 1988 and an MSc in 1990 from the Department of Electric and Electronic Engineering, KAIST, Daejeon, Korea. He received his PhD in 1995 from the Department of Electric and Electronic Engineering at KAIST, Daejeon, Korea. From 1995 to 2003, he was a Senior Researcher in the Digital TV Lab of LG Electronics. Since 2003, he has been with the Department of Computer Engineering, Sejong University, Korea, where he is currently a professor. His research interests include image processing.
IEIE Transactions on Smart Processing and Computing, vol. 4, no. 4, August 2015 http://dx.doi.org/10.5573/IEIESPC.2015.4.4.281
IEIE Transactions on Smart Processing and Computing
A Survey of Human Action Recognition Approaches that use an RGB-D Sensor
Adnan Farooq and Chee Sun Won*
Department of Electronics and Electrical Engineering, Dongguk University, Seoul, South Korea {aadnan, cswon}@dongguk.edu
* Corresponding Author: Chee Sun Won
Received June 20, 2015; Revised July 15, 2015; Accepted August 24, 2015; Published August 31, 2015
* Regular Paper: This paper reviews the recent progress possibly including previous works in a particular research topic, and has been accepted by the editorial board through the regular reviewing process.
Abstract: Human action recognition from a video scene has remained a challenging problem in the area of computer vision and pattern recognition. The development of the low-cost RGB-depth (RGB-D) camera allows new opportunities to solve the problem of human action recognition. In this paper, we present a comprehensive review of recent approaches to human action recognition based on depth maps, skeleton joints, and other hybrid approaches. In particular, we focus on the advantages and limitations of the existing approaches and on future directions.
Keywords: Human action recognition, Depth maps, Skeleton joints, Kinect
1. Introduction
The study of human action recognition has introduced various new methods for understanding actions and activities from video data. The main concern in human action recognition systems is how to identify the type of action from a set of video sequences. Systems such as interactive consumer entertainment, gaming, surveillance, smart homes, and life-care services include several feasible applications [1, 2], which has inspired researchers to develop algorithms for human action recognition. Previously, RGB cameras were the focal point of studies on identifying actions from image sequences [3, 4]. The constraints of these 2D cameras include sensitivity to illumination changes, surrounding clutter, and disorder [3, 4], which make it a tough and difficult task to precisely recognize human actions. However, with the development of cost-effective RGB-depth (RGB-D) camera sensors (e.g., the Microsoft Kinect), the results of action recognition have improved, and such sensors have become a point of consideration for many researchers [5]. Compared to visible-light cameras, depth camera sensors provide clearer and more discriminative information by giving a 3D structural view from which to recognize actions. Furthermore, depth sensors also help lessen and ease the low-level complications found in RGB images, such as background subtraction and light variations. Also, depth cameras remain useful for the entire range of day-to-day tasks, even at night, for example in patient monitoring systems.
Depth images enable us to view and assess human skeleton joints in a 3D coordinate system. These 3D skeleton joints provide additional information for action recognition, which in turn increases the accuracy of the human–computer interface [5]. Depth sensors like the Kinect usually provide three types of data: depth images, 3D skeleton joints, and color (RGB) images. It has therefore been a big challenge to utilize these data, together or independently, to represent human behavior and to improve the accuracy of action recognition.
Human movement is classified into four levels (motion, action, activity, and behavior), where motion is a small movement of a body part over a very short time span. Motion is a key building block of actions and helps to identify the other movement levels, which are defined as follows [6]. An action is a collection of recurring different motions that show what a person is doing, like running, sitting, etc., or the interaction of the person with certain things; an action usually lasts no more than a few seconds. An activity is an assortment of various actions that helps in perceiving and understanding human behavior while performing designated tasks, like cooking, cleaning, studying, etc., and such activities can continue for much longer times. Behavior is extremely meaningful in understanding human motion that can last for hours (or even days) and that can be considered either normal or abnormal. Action, activity and behavior can thus be differentiated mainly on the basis of their time scales. In this study, we focus on shorter and medium time period actions, such as raising a hand or sitting down.
Human action recognition has mainly focused on three leading applications: 1) surveillance systems, 2) entertainment environments, and 3) healthcare systems, which comprise systems to track or follow individuals automatically [2, 7-14]. In a surveillance system, the authorities need to monitor and detect all kinds of criminal and suspicious activities [2, 7, 8]. Most surveillance systems, equipped with several cameras, require well-trained staff to monitor human actions on screen. However, using automatic human action recognition algorithms, it is possible to reduce the number of staff and immediately create a security alert in order to prevent dangerous situations. Furthermore, human action recognition systems can also be used to identify entertainment actions, including sports, dance and gaming. For entertainment actions, the response time to interact with a game is very important, and a number of techniques have been developed to address this issue using depth sensors [9, 10]. In healthcare systems, it is important to monitor the activities of a patient [11, 12]. The aim of using such healthcare systems is to assist health workers to care for, treat, and diagnose patients, hence improving the reliability of diagnosis. These medical healthcare systems can also help decrease the workload on medical staff and provide better facilities to patients.
Generally, human action recognition approaches involve several steps, as shown in Fig. 1, where feature extraction is one of the most important blocks and plays a vital role in the overall system.
Fig. 1. Flow of an action recognition system.
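Viewed abstractly, the flow in Fig. 1 reduces to mapping each recorded sequence to a fixed-length feature vector and passing it to a classifier. The snippet below is only a minimal, hedged illustration of that pipeline with scikit-learn; the feature function, the random arrays and the linear SVM are placeholders and do not correspond to any specific method reviewed in this paper.

```python
# Minimal sketch of the generic pipeline in Fig. 1 (assumed structure):
# depth/skeleton sequence -> fixed-length feature vector -> classifier.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def sequence_to_feature(depth_seq):
    """Hypothetical feature extractor: mean frame plus temporal std,
    flattened into one vector (a stand-in for the descriptors surveyed below)."""
    depth_seq = np.asarray(depth_seq, dtype=np.float32)   # (T, H, W)
    return np.concatenate([depth_seq.mean(axis=0).ravel(),
                           depth_seq.std(axis=0).ravel()])

# Placeholder data: a list of depth sequences and their action labels.
X_seqs = [np.random.rand(30, 24, 32) for _ in range(8)]
y = np.array([0, 0, 1, 1, 2, 2, 3, 3])

X = np.stack([sequence_to_feature(s) for s in X_seqs])
clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
clf.fit(X, y)
print(clf.predict(X[:2]))
```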
The performance of feature extraction methods for an action recognition system is evaluated on the basis of classification accuracy. Several datasets recorded with depth sensors are widely available and accessible for developing new recognition systems. Every dataset includes different actions and activities performed by different volunteer subjects, and each dataset is designed to address a particular challenge. Table 1 provides a summary of the most popular datasets. Most of the methods reviewed in this paper are evaluated on one or more of these datasets.
Table 1. Publicly available RGB-D datasets for evaluating action recognition systems.
Dataset | Size | Remarks
Microsoft Research Action3D [13] | 10 subjects / 20 actions / 2-3 repetitions | There are a total of 567 depth map sequences with a resolution of 320x240. The dataset was recorded using the Kinect sensor. All actions are interactions with game consoles (i.e., draw a circle, two-hand wave, forward kick, etc.).
Microsoft Research Daily Activity 3D [14] | 10 subjects / 16 activities / 2 repetitions | 16 indoor activities were done by 10 subjects. Each subject performed each activity once in a standing position and once in a sitting position. Three channels are recorded using the Kinect sensor: (i) depth maps, (ii) RGB video, (iii) skeleton joint positions.
UT-Kinect Action [15] | 10 subjects / 10 actions / 2 repetitions | There are 10 different actions with three channels: (i) RGB, (ii) depth, and (iii) 3D skeleton joints.
UCF-Kinect [16] | 16 subjects / 16 activities / 5 repetitions | A long-sequence dataset that is used to test latency.
Kitchen scene action [17] | 9 activities | Different activities in the kitchen have been performed to recognize cooking motions.
In this survey, we review the human action recognition systems that have been proposed in the literature. This review paper is organized as follows. In Section 2, we review human action recognition systems based on depth maps, skeleton joints, and hybrid methods (i.e., depth and color, or depth and skeleton). A summary of all the reviewed work is presented in Section 3, which includes the advantages and disadvantages of each reviewed method. The conclusion is presented in Section 4.
2. Human Action Recognition
2.1 Human Action Recognition Using Depth Maps
Li et al. [13] introduced a method that recognizes human actions from depth sequences. The motivation of this work was to develop a method that does not require joint tracking; it also uses 3D contour points instead of 2D points. Depth maps are projected onto three orthogonal Cartesian planes, and a specified number of points along the contours of all three projections are sampled for each frame. These sampled points are then used as a "bag of points" to illustrate a set of salient postures that correspond to the nodes of an action graph used to model the dynamics of the actions. The authors used their own dataset for the experiments, which later became known as the Microsoft Research (MSR) Action3D dataset, and they achieved a 74.4% recognition accuracy. The limitation of this approach is that the sampling of 3D points from the whole body requires a large dataset. Also, due to noise and occlusion in the depth maps, the YZ and XZ views may not be reliable for extracting 3D points.
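To make the projection step of [13] concrete, the sketch below projects a single depth frame onto the three orthogonal Cartesian planes as binary occupancy maps. This is a rough illustration rather than the authors' implementation; the depth binning, the array shapes and the validity test are assumptions.

```python
import numpy as np

def orthogonal_projections(depth, depth_bins=64):
    """Project one depth frame (H x W) onto the front (XY), side (YZ) and
    top (XZ) planes as occupancy maps. Quantizing the depth axis into
    `depth_bins` levels is an assumption of this sketch."""
    h, w = depth.shape
    valid = depth > 0                                   # 0 = no measurement
    z = np.zeros_like(depth, dtype=int)
    z[valid] = np.minimum(
        (depth[valid] / depth[valid].max() * (depth_bins - 1)).astype(int),
        depth_bins - 1)

    front = valid.astype(np.uint8)                      # XY plane
    side = np.zeros((h, depth_bins), dtype=np.uint8)    # YZ plane
    top = np.zeros((depth_bins, w), dtype=np.uint8)     # XZ plane
    ys, xs = np.nonzero(valid)
    side[ys, z[ys, xs]] = 1
    top[z[ys, xs], xs] = 1
    return front, side, top

# Contour points of each projection could then be sampled to form the
# bag-of-points representation described above.
f, s, t = orthogonal_projections(np.random.rand(240, 320) * 4.0)
```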
To overcome some of the issues of the method in [13], Vieira et al. [18] proposed space–time occupancy patterns (STOP) to represent the sequence of depth maps, where the space and time axes are divided into multiple segments so that each action sequence is embedded in a 4D grid of cells. A saturation scheme was proposed to enhance the role of sparse cells, which typically consist of points on a silhouette or on moving parts of the body. To recognize the actions, a nearest neighbor classifier based on cosine distance was employed. Experimental results on the MSR Action3D dataset show that STOP features for action classification yield better accuracy than that of Rougier et al. [12]. The major drawback of this approach is that the parameters for dividing sequences into cells are set empirically.
A method that addresses the noise and occlusion issues in action recognition systems using depth images was proposed by Wang et al. [19]. The authors considered a 3D action sequence as a 4D shape and proposed random occupancy pattern (ROP) features extracted from randomly sampled 4D sub-volumes of different sizes and at different locations using a weighted sampling scheme. An elastic-net regularized classification model is then used to further select the most discriminative features, which are robust to noise and less sensitive to occlusion. Finally, a support vector machine (SVM) is used to recognize the actions. Experimental results on the MSR Action3D dataset show that the proposed method outperforms the previous methods by Li et al. [13] and Vieira et al. [18].
An action recognition system that is capable of extracting additional shape and motion information using 3D depth maps was proposed by Yang et al. [20]. In this system, each 3D depth map is first projected onto three orthogonal Cartesian planes. For each projected view, the differences of consecutive depth frames are thresholded and stacked to obtain a depth motion map (DMM). A histogram of oriented gradients (HOG) [21] is then applied to each 2D projected view to extract the features, and the features from all three views are concatenated to form a DMM-HOG descriptor. An SVM classifier is used to recognize the actions. Steps for extracting the HOG from the DMM are shown in Fig. 2. The drawback of this system is that the approach does not capture the direction of the variation. The authors also explored the number of frames required, showing that roughly 35 frames are enough to generate acceptable results; nonetheless, the method cannot achieve satisfactory results for complex actions.
Fig. 2. Histogram of oriented gradients descriptor on motion maps.
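The DMM construction of [20] described above can be sketched in a few lines. The code below is a simplified, single-view illustration using the HOG implementation from scikit-image; the threshold value, the normalization and the HOG cell sizes are assumptions.

```python
import numpy as np
from skimage.feature import hog  # HOG descriptor (cf. [21])

def depth_motion_map(depth_seq, thresh=0.05):
    """Accumulate thresholded absolute differences of consecutive depth
    frames into a single motion map (front view only in this sketch)."""
    depth_seq = np.asarray(depth_seq, dtype=np.float32)       # (T, H, W)
    diffs = np.abs(np.diff(depth_seq, axis=0))                # (T-1, H, W)
    return (diffs > thresh).astype(np.float32).sum(axis=0)    # (H, W)

def dmm_hog_descriptor(depth_seq):
    dmm = depth_motion_map(depth_seq)
    dmm = dmm / (dmm.max() + 1e-8)                            # normalise
    return hog(dmm, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))

feat = dmm_hog_descriptor(np.random.rand(40, 240, 320))
# In the full DMM-HOG method the three projected views are processed the
# same way and their HOG vectors are concatenated before the SVM.
```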
Jalal et al. [22] employed an R transform [23] to compute a 2D angular projection map of an activity silhouette via the Radon transform, and compared the proposed method with other feature extraction methods (i.e., PCA and ICA) [24, 25]. The authors argue that PCA and ICA are sensitive to the scale and translation of depth silhouettes. Therefore, the 2D Radon transform is converted into a 1D R transform profile to provide a highly compact shape representation for each depth silhouette. To extract suitable features from the 1D R-transformed profiles of the depth silhouettes, linear discriminant analysis (LDA) is used to make the features more discriminative. Finally, the features are trained and tested using hidden Markov models (HMMs) [26] on a codebook of vectors generated using the Linde-Buzo-Gray (LBG) clustering algorithm [27]. Fig. 3 shows the overall flow of the proposed method. Experimental results show that their feature extraction method is robust on the 10 human activities collected by the authors; using this dataset, they achieved an accuracy of 96.55%. The limitation of this system is that the proposed method is view-dependent.
Fig. 3. Framework of the human activity recognition system using R transform [22].
Using depth sequences, a new feature descriptor named the histogram of oriented 4D surface normals (HON4D) was proposed by Oreifej and Liu [28]. The proposed feature descriptor is more discriminative than average 4D occupancy features [18] and is robust against noise and occlusion. HON4D considers a 3D depth sequence as a surface in the 4D spatio-temporal space of time, depth and spatial coordinates. In order to construct HON4D, the 4D space is quantized using the 120 vertices of a 600-cell polychoron. Then, the quantization is refined using a discriminative density measure by adding projectors in the directions where the 4D normals are denser and more discriminative. An SVM classifier is used to recognize the actions. Experimental results show that HON4D achieves high accuracy compared to state-of-the-art methods. The limitation of this system is that HON4D can only roughly characterize the local spatial shape around each joint to represent human–object interaction. Also, the differential operation on depth images can amplify noise.
Xia et al. [29] proposed an algorithm for extracting local spatio-temporal interest points (STIPs) from depth videos (DSTIPs) and described a local 3D depth cuboid using the depth cuboid similarity feature (DCSF). The DSTIPs deal with the noise in depth videos, and DCSF was presented as a descriptor for the local 3D cuboid in depth videos. The authors claimed that the DSTIP+DCSF pipeline recognizes activities from depth videos without depending on skeleton joint information, motion segmentation, tracking or de-noising procedures. The experimental results reported for the MSR Daily Activity 3D dataset show that it is possible to recognize human activities using DSTIPs and DCSF with an accuracy of 88.20% when 12 out of the 16 activities are used. Four activities that have little motion (i.e., sitting still, reading a book, writing on paper, and using a laptop) were removed from the experiments because most of the recognition errors come from these relatively motionless activities. Furthermore, the accuracy of the proposed system is highly dependent on the noise level of the depth images.
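Referring back to the R transform pipeline of [22, 23] described above, the 1D profile can be obtained by summing the squared Radon transform over the radial coordinate. The sketch below uses scikit-image's radon function on a toy silhouette; the angle sampling, the toy shape and the normalization are assumptions, and the LDA, LBG and HMM stages are omitted.

```python
import numpy as np
from skimage.transform import radon

def r_transform_profile(silhouette, n_angles=180):
    """1D R-transform profile of a binary depth silhouette:
    R(theta) = sum_rho Radon(rho, theta)^2, normalised to unit peak."""
    theta = np.linspace(0.0, 180.0, n_angles, endpoint=False)
    sinogram = radon(silhouette.astype(float), theta=theta, circle=False)
    profile = (sinogram ** 2).sum(axis=0)       # sum over the rho axis
    return profile / (profile.max() + 1e-8)

# Toy silhouette: a filled rectangle standing in for a depth silhouette.
sil = np.zeros((128, 128))
sil[30:100, 50:80] = 1.0
profile = r_transform_profile(sil)              # 180-dimensional feature
# In [22] these per-frame profiles are compacted with LDA and then fed to
# an HMM trained on an LBG-generated codebook.
```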
Recent work by Song et al. [30] focuses on the use of depth information to describe human actions in videos, which is of essential concern and can greatly influence the performance of human action recognition. The 3D point cloud is exploited because it holds points in the 3D real-world coordinate system that represent the human body's outer surface. A feature named body surface context (BSC) is presented to describe the distribution of the relative locations of the neighbors of a reference point in the point cloud. Tests using the Kinect Human Action Dataset resulted in an accuracy of 91.32%. Using the BSC feature, experiments on the MSR Action3D dataset yielded an average accuracy of 90.36%, and an accuracy of 77.8% was obtained with the MSR Daily Activity 3D dataset. The experiments showed that superior performance is attained with the tested feature and that it performs robustly under observation variations (i.e., translation and rotation).
2.2 Human Action Recognition Using Skeleton Joints
Xia et al. [31] showed the advantages of using 3D skeleton joints and represented 3D human postures using a histogram of 3D joint locations (HOJ3D). In their representation, 3D space is partitioned into bins using a modified spherical coordinate system, and 12 manually selected joints are used to build a compact representation of the human posture. To make the representation more robust, votes of the 3D skeleton joints are cast into neighboring bins using a Gaussian weight function. To extract the most dominant and discriminative features, LDA is applied to reduce the dimensionality. These discriminative features are then clustered into a fixed number of posture vocabularies that represent the prototypical poses of actions. Finally, these visual words are trained and tested using a discrete HMM. According to the experimental results reported on the MSR Action3D dataset and on their own proposed dataset, their method has the salient advantage of using the 3D skeleton joints of the human posture. However, the drawback of their method is its reliance on the hip joint, which might potentially compromise recognition accuracy due to the noise embedded in the estimation of the hip joint location.
In a similar way, Yang et al. [32] illustrated that skeleton joints are computationally inexpensive, more compact, and more distinctive compared to depth maps. Based on that, the authors proposed an eigen-joints-based action recognition system, which extracts three different kinds of features using skeleton joints. These features include posture features (Fcc) and motion features (Fcp), which encode the spatial and temporal characteristics of the skeleton joints, and offset features (Fci), which calculate the difference between the current pose and the initial one. Then, PCA is applied to these joint differences to obtain eigen joints by reducing redundancy and noise, and the Naive-Bayes-Nearest-Neighbor (NBNN) classifier [33] is used to recognize multiple action categories. Fig. 4 shows the process of extracting eigen joints. The authors also explored the number of frames that are sufficient for their system to recognize an action; experimental results on the MSR Action3D dataset show that a short sequence of 15-20 frames is sufficient for action recognition. The drawback of this approach is that, while calculating the offset feature, the authors assume that the initial skeleton pose is neutral, which is not always correct.
Fig. 4. Steps for computing eigen-joint features proposed by Yang et al. [32].
Using the advantages of 3D joints, Evangelidis et al. [34] proposed a compact but effective local skeleton descriptor that creates a pose representation invariant to any similarity transformation, which is, hence, view-invariant. The new skeletal feature, called the skeletal quad [34], locally encodes the relation of joint quadruples so that 3D similarity invariance is assured. Experimental results of the proposed method verify its state-of-the-art performance in human action recognition using 3D joint positions. The proposed action recognition method was tested on broadly used datasets, such as the MSR Action3D dataset and HDM05. Experimental results with MSR Action3D using skeleton joints showed an average accuracy of 89.86%, and 93.89% accuracy was obtained with HDM05.
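A compact sketch of eigen-joints-style skeleton features, as described above for [32], is shown below: pairwise joint differences within a frame, differences against the previous frame, and offsets against the initial frame, compacted with PCA. The array shapes, the PCA dimensionality and the random data are assumptions, and the NBNN classification stage is omitted.

```python
import numpy as np
from sklearn.decomposition import PCA

def eigenjoint_features(joints):
    """joints: (T, J, 3) array of 3D skeleton joints for one sequence.
    Returns a frame-wise feature matrix built from joint differences."""
    T, J, _ = joints.shape
    iu = np.triu_indices(J, k=1)                              # unique joint pairs
    pair_diff = joints[:, iu[0], :] - joints[:, iu[1], :]     # posture (f_cc)
    motion = joints[1:] - joints[:-1]                         # motion (f_cp)
    motion = np.concatenate([np.zeros((1, J, 3)), motion])
    offset = joints - joints[0:1]                             # offset (f_ci)
    return np.concatenate([pair_diff.reshape(T, -1),
                           motion.reshape(T, -1),
                           offset.reshape(T, -1)], axis=1)

joints = np.random.rand(30, 20, 3)                # 20 Kinect joints, 30 frames
frame_feats = eigenjoint_features(joints)
compact = PCA(n_components=10).fit_transform(frame_feats)
# [32] then classifies the compact frame descriptors with NBNN.
```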
2.3 Human Action Recognition Using Hybrid Methods
The work done by Wang et al. [14] utilizes the advantages of both skeleton joints and point cloud information. Many actions differ mainly in the objects involved in the interaction, and in such cases using only skeleton information is not sufficient. Moreover, to capture the intra-class variance via occupancy information, the authors proposed a novel actionlet ensemble model. An important observation they made in terms of skeleton joints is that the pairwise relative positions of joints are more discriminative than the joint positions themselves. Interaction between the human and environmental objects is characterized by a local occupancy pattern (LOP) at each joint. The proposed method was evaluated using the CMU MoCap dataset, the MSR Action3D dataset, and a new dataset called the MSR Daily Activity 3D dataset. Experimental results showed that their method has superior performance compared to previous methods. The drawback of their system is that it relies on skeleton localization, which is unreliable for postures with self-occlusion.
Lei et al. [35] combined depth and color features to recognize kitchen activities. Their method successfully demonstrated tracking of the interactions between hands and objects during kitchen activities, such as mixing flour with water and chopping vegetables. For object recognition, the reported system uses a gradient kernel descriptor on both color and depth data. The global features are extracted by applying PCA to the gradients of the hand trajectories, which are obtained by tracking skin characteristics, and local features are defined using a bag-of-words over snippets of trajectory gradients. All the features are then fed into an SVM classifier for training. The overall reported accuracy is 82% for combined action and object recognition. This work shows an initial concept for recognizing objects and actions in a real-world kitchen environment. However, using such a system in real time requires a large dataset to train the system.
Recently, Althloothi et al. [36] proposed a human activity recognition system using multi-features and multiple kernel learning (MKL) [37]. In order to recognize human actions from a sequence of RGB-D data, their method utilizes a surface representation and the kinematic structure of the human body. It extracts shape features from a depth map using a spherical harmonics representation that describes the 3D silhouette structure, whereas the motion features are extracted using 3D joints that describe the movement of the human body. The authors believe that segments such as the forearms and the shins provide sufficient and compact information to recognize human activities. Therefore, each distal limb segment is described by its orientation and translation with respect to the initial frame to create temporal features. Then, both feature sets are combined using an MKL technique to produce an optimally combined kernel matrix within the SVM for activity classification. The drawback of their system is that the shape features extracted using spherical harmonics are large in size. Also, at the beginning and at the end of each depth sequence in the MSR Action3D and MSR Daily Activity 3D datasets, the subject is in a stand-still position with small body movements.
However, while generating the motion characteristics of an action, these small movements at the beginning and at the end generate large pixel values, which ultimately contribute to large reconstruction error. 3. Summary The advantages and disadvantages of the above reviewed methods, based on depth maps, skeleton joints, and hybrid approaches, are presented in Table 2. Although Table 2. Advantages and disadvantages of the existing methods. Feature Extraction Methods 3D sampled points [13] STOP: Space–Time Occupancy Patterns [18] Random Occupancy Patterns (ROP) [19] Motion maps [20] General comments Pros Cons Using depth silhouettes, 3D points have been extracted on the contour of the depth map. They extend RGB approaches to extract contour points on depth images. However, their method can recognize the action performed by single or multiple parts of the human body without tracking the skeleton joints. Due to noise and occlusion, contours of multiple views are not reliable, and the current sampling scheme is view-dependent. Space–time occupancy patterns Spatial and temporal contextual are presented by dividing the information has been used to depth sequence into a 4D recognize the actions, which is space–time grid. All the cells in robust against noise and occlusion. the grid have the same size. ROP features are extracted from randomly sampled 4D subvolumes with different sizes and The proposed feature extraction different volumes. Then, all the method is robust to noise and less points in the sub-volumes are sensitive to occlusion. accumulated and normalized with a sigmoid function. Motion maps provide shape as They are computationally efficient well as motion information. action recognition systems based However, HOG has been used to on depth maps for extraction of extract local appearance and shape additional shape and motion of motion maps. information. There is no method defined to set the parameter for dividing the sequence into cells. Feature patterns are highly complex and need more time during processing. Motion maps do not provide directional velocity information between the frames. 286 Farooq et al.: A Survey of Human Action Recognition Approaches that use an RGB-D Sensor R Transform [22] R transform has been used to extract features from depth silhouettes, comparing the proposed method with PCA and ICA. HON4D [24] Captures histogram distribution of the surface normal orientation in the 4D volume of time, depth and spatial coordinates. DCSF [29] Extracting STIP from depth videos and describing local 3D DCSF around interest points can be efficiently used to recognize actions. Body surface context (BSC) [30] HOJ3D [31] Eigen joints [32] Quadruples [34] 3D point clouds have been used to represent the 3D surface of the body, which contains rich information to recognize human actions. Twelve manually selected skeleton joints are converted to a spherical coordinate system to make a compact representation of the human posture. This is an action recognition system that extracts spatiotemporal change between the joints. Then, PCA is used to obtain eigen joints by reducing redundancy and noise. A skeleton joint–based feature extraction method called skeletal quad ensures 3D similarity invariance of joint quadruples by local encoding using a Fisher kernel. Hybrid method (3D point cloud + skeleton) [14] Local occupancy pattern (LOP) features are calculated from depth maps around the joints’ locations. 
Kitchen activities (depth + RGB) [35] Fine-grained kitchen activities are recognized using depth and color cues. Multi-feature (3D point cloud and skeleton joints) [36] This human activity recognition system combines spherical harmonics features from depth maps and motion features using 3D joints. R transform–based translation and scale-invariant feature extraction methods can be used for human activity recognition systems. The proposed feature extraction method is robust against noise and occlusion and more discriminative than other 4D occupancy methods. Also, it captures the distribution of changing shape and motion cues together. Uses DSTIPs and DCSF to recognize the activities from depth videos without depending on skeleton joints, motion segmentation and tracking or de-noising procedures. The R transform–based feature extraction method is not viewinvariant. This method can roughly characterize the local spatial shape around each joint. Differential operation on a depth image can enhance noise. It is difficult to analyze the method for full activities, and most of the recognition errors come from those activities. 3D point clouds of the body’s surface can avoid perspective distortion in depth images. It is based on different combinations of features for each dataset, but it is not feasible for an automatic system to select the combination for high accuracy. Skeleton joints are more informative and can achieve high accuracy with a smaller number of joints. Relying only on the hip joint might potentially compromise recognition accuracy. It is a skeleton joint–based feature extraction method that extracts features in both spatial and temporal domains. It is more accurate and informative than trajectory-based methods. Offset feature computation depends on the assumption that the initial skeleton pose is neutral, which is not correct. A view-invariant descriptor using joint quadruples encodes Fisher kernel representations. It is not a good choice to completely rely on skeleton joints, because these 3D joints are noisy and fail when there are occlusions. A highly discriminative and translation invariant feature extraction method captures relations between the human body parts and the env ironmental objects in the interaction. Also, it represents the temporal structure of an individual joint. It is an efficient feature extraction method taking advantage of both RGB and depth images to recognize objects and fine-grained kitchen activities. It is a view-invariant feature extraction method based on shape representation and the kinematics structure of the human body. That is, both features are fused using MKL to produce an optimal combined kernel matrix. Heavily relying on skeleton localization becomes unreliable for postures with self-occlusion. Requires a large dataset to train the system. Shape features are large in size, which may be unreliable for postures with self-occlusion, whereas it extracts motion features on the assumption that the initial pose is in a neutral state, which is not the case. 287 IEIE Transactions on Smart Processing and Computing, vol. 4, no. 4, August 2015 Table 3. 
Summary of feature selection, classification and recognition methods
Paper Extracted Features [13] 3D points at the contour of a depth map [18] Depth values Feature Selection/ Dimension reduction Clustering Classification Action graph PCA K-means HMM LDA Elastic net regularized classifier SVM PCA, LDA LBG HMM [19] 3D point cloud [20] Histogram of gradients [22] Depth values [28] Histogram of surface normal [29] Histogram of depth pixels PCA K-means DS-SRC [30] 3D point cloud PCA K-means SVM [31] Histogram of 3D joints in spherical coordinates LDA K-means HMM [32] Skeleton joints PCA [34] Gradient values [14] Low-frequency Fourier coefficients SVM SVM NBNN SVM Actionlet ensemble SVM [35] Gradient values SVM [36] 3D point cloud and skeleton joints SVM
Table 4. Recognition accuracies of reviewed action recognition systems on benchmark datasets.
Paper | MSR Action3D | MSR Daily Activity 3D | UCF Kinect dataset | Kitchen scene action
[13] | 74.7% | - | - | -
[18] | 84.80% | - | - | -
[19] | 86.50% | - | - | -
[20] | 91.63% | - | - | -
[22] | - | - | - | -
[28] | 88.89% | 80% | - | -
[29] | 89.3% | 83.6% | - | -
[30] | 90.36% | 77.8% | - | -
[31] | 78.97% | - | - | -
[32] | 82.33% | - | 97.1% | -
[34] | 89.86% | - | - | -
[14] | 88.2% | 85.75% | - | -
[35] | - | - | - | 82%
[36] | 79.7% | 93.1% | - | -
all the above methods are capable of dealing with the actions and activities of daily life, there are also drawbacks and limitations to using depth map–based, skeleton joint–based and hybrid methods for action recognition systems. Depth maps fail to recognize human actions when fine-grained motion is required, and extracting 3D points at the contours may incur a loss of inner information from the depth silhouettes. Furthermore, shape-based features do not provide any information for calculating the directional velocity of the action between frames, which is an important parameter for differentiating actions. Hence, depth-based features alone are neither very efficient nor sufficient for certain applications, such as entertainment, human–computer interaction, and smart healthcare systems. The 3D skeleton joints estimated using the depth maps are often noisy and may have large errors when there are occlusions (e.g., legs or hands crossing over each other). Moreover, motion information extracted using 3D joints alone is not sufficient to differentiate similar activities, such as drinking water and eating. Therefore, there is a need to include extra information in the feature vector to improve classification performance. Thus, a hybrid method can be helpful by taking full advantage of both depth maps and 3D skeleton joints to enhance the classification performance of human action recognition.
A summary of all the feature-selection, clustering and recognition methods used in the above reviewed papers is in Table 3. In most of the studied action recognition systems, dominant and discriminative features are selected using LDA, these features are then represented by a codebook generated using a k-means algorithm, and, after training, the learned actions are recognized via a trained SVM. The recognition accuracy of the reviewed methods on the datasets mentioned in Table 1 is summarized in Table 4. The assessment method adopted by the mentioned works for the MSR Action3D dataset is a cross-subject test. This protocol was originally proposed by Li et al. [13], dividing the 20 actions into three subsets, with each subset containing eight actions.
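The cross-subject protocol just described splits the data by performer rather than by clip. As a hedged illustration only (the feature matrix, labels and subject assignments below are random placeholders), a subject-wise evaluation loop can be written with scikit-learn as follows; a fixed cross-subject split is simply a single instance of such a partition.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import SVC

# Placeholder data: one row per sequence, plus the subject who performed it.
X = np.random.rand(60, 50)               # 60 sequences, 50-D features
y = np.random.randint(0, 8, 60)          # 8 action classes
subjects = np.repeat(np.arange(10), 6)   # 10 subjects, 6 clips each

accs = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=subjects):
    clf = SVC(kernel="linear").fit(X[train_idx], y[train_idx])
    accs.append(clf.score(X[test_idx], y[test_idx]))
print("Leave-one-subject-out accuracy: %.3f" % np.mean(accs))
# A cross-subject test is the same idea with one fixed split,
# e.g. half of the subjects for training and the rest for testing.
```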
For the MSR Daily Activity 3D dataset, all the authors verified the performance of their method using a leave-one-subject-out (LOSO) test. For the UCF Kinect dataset, 70% of the actions were used for training and 30% for testing. Jalal et al. [22] proposed their own human activity dataset and evaluated the performance of their proposed method using 30% video clips for training and 70% for testing. 4. Conclusion Over the last few years, there has been a lot of work by researchers in the field of human action recognition using the low-cost depth sensor. The success of these works is demonstrated in entertainment systems that estimate the body poses and recognize facial and hand gestures, by smart healthcare systems to care for patients and monitor their activities, and in the security systems that recognize suspicious activities and create an alert to prevent dangerous situations. Different databases have been used by the authors to test the performance of their algorithms. For the MSR Action3D dataset, Yang et al. [20] achieved 91.63% accuracy, whereas for the MSR Daily Activity 3D dataset, Althloothi et al. [36] achieved 93.1% accuracy. Moreover, Yang et al. [32] achieved 97.1% accuracy for the UCF Kinect dataset. Currently, human action systems focus only on extracting boundary information from depth silhouettes. However, using only skeleton information may not be feasible, because the skeleton joints are not always accurate. Furthermore, to overcome the limitations and drawbacks of the current human action recognition systems, it is necessary to extract valuable information from inside the depth silhouettes. Also, it is necessary to use the joint points with the depth silhouettes for an accurate and stable human action recognition system. Acknowledgement This work was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF- 2013R1A1A2005024). References [1] A. Veeraraghavan et al., “Matching shape sequences in video with applications in human movement analysis,” Pattern Analysis and Machine Intelligence, IEEE Transactions, pp.1896-1909, Jun. 2004. Article (CrossRef Link) [2] W. Lin et al., "Human activity recognition for video surveillance,” in Circuits and Systems, IEEE International Symposium on, pp. 2737-2740, May. 2008. Article (CrossRef Link) [3] H. S. Mojidra et al., “A Literature Survey on Human Activity Recognition via Hidden Markov Model,” IJCA Proc. on International Conference on Recent Trends in Information Technology and Computer Science 2012 ICRTITCS, pp. 1-5, Feb. 2013. Article (CrossRef Link) [4] R. Gupta et al., “Human activities recognition using depth images,” in Proc. of the 21st ACM international conference on Multimedia, pp. 283-292, Oct. 2013. Article (CrossRef Link) [5] Z. Zhang et al., “Microsoft kinect sensor and its effect.” MultiMedia, IEEE, Vol. 19, No. 2, pp. 4-10, Feb. 2012. Article (CrossRef Link) [6] A. A. Chaaraoui, "Vision-based Recognition of Human Behaviour for Intelligent Environments," Director: Florez Revuelta, Franciso, Jan. 2014. Article (CrossRef Link) [7] M. Valera et al., “Intelligent distributed surveillance systems: a review,” Vision, Image and Signal Processing, IEE Proceedings, Vol. 152, No. 2, pp. 192-204. Apr. 2005. Article (CrossRef Link) [8] J. W. Hsieh et al., “Video-based human movement analysis and its application to surveillance systems,” Multimedia, IEEE Transactions on, Vol. 10, No. 3, pp. 372-384, Apr. 2008. Article (CrossRef Link) [9] V. 
Bloom et al., “G3d: A gaming action dataset and real time action recognition evaluation framework,” Computer Vision and Pattern Recognition Workshops (CVPRW), 2012 IEEE Computer Society Conference on, pp. 7-12, Jul. 2012. Article (CrossRef Link) [10] A. Fossati et al., “Consumer depth cameras for computer vision: research topics and applications,” Springer Science & Business Media, Article (CrossRef Link) [11] M. Parajuli et al., “Senior health monitoring using Kinect,” Communications and Electronics (ICCE), Fourth International Conference on, pp. 309-312, Aug. 2012. Article (CrossRef Link) [12] C. Rougier et al., “Fall detection from depth map video sequences,” Toward Useful Services for Elderly and People with Disabilities, Vol. 6719, pp. 121-128. Hun. 2011. Article (CrossRef Link) [13] W. Li et al., “Action recognition based on a bag of 3d points,” Computer Vision and Pattern Recognition Workshops (CVPRW) IEEE Computer Society Conference on, pp. 9-14, Jun. 2010. Article (CrossRef Link) [14] J. Wang et al., “Mining actionlet ensemble for action IEIE Transactions on Smart Processing and Computing, vol. 4, no. 4, August 2015 [15] [16] [17] [18] [19] [20] [21] [22] [23] [24] [25] [26] recognition with depth cameras,” Computer Vision and Pattern Recognition (CVPR) IEEE Conference on, pp. 1290-1297. Jun. 2012. Article (CrossRef Link) L. Xia et al., “View invariant human action recognition using histograms of 3d joints,” Computer Vision and Pattern Recognition Workshops (CVPRW) IEEE Computer Society Conference on, pp. 20-27, Jun. 2012. Article (CrossRef Link) C. Ellis et al., “Exploring the trade-off between accuracy and observational latency in action recognition,” International Journal of Computer Vision, Vol. 101, No. 3. Pp. 420-436. Aug. 2012. Article (CrossRef Link) A. Shimada et al., “Kitchen scene context based gesture recognition: A contest in ICPR2012,” Advances in Depth Image Analysis and Applications, Vol. 7854, pp. 168-185, Nov. 2011. Article (CrossRef Link) A. W. Vieira et al., “Stop: Space-time occupancy patterns for 3d action recognition from depth map sequences,” Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, Vol. 7441, pp. 252-259. Sep. 2012. Article (CrossRef Link) J. Wang et al., “Robust 3d action recognition with random occupancy patterns.” 12th European Conference on Computer Vision, pp. 872-885. Oct. 2012. Article (CrossRef Link) X. Yang et al., “Recognizing actions using depth motion maps-based histograms of oriented gradients,” in Proc. of the 20th ACM international conference on Multimedia, pp. 1057-1060, Nov. 2012. Article (CrossRef Link) N. Dalal et al., “Histograms of oriented gradients for human detection,” Computer Vision and Pattern Recognition, IEEE Computer Society Conference on, Vol. 1, pp. 886-893, Jun. 2005. Article (CrossRef Link) A. Jalal et al., “Depth video-based human activity recognition system using translation and scaling invariant features for life logging at smart home,” Consumer Electronics, IEEE Transactions on, Vol. 58, No. 3, pp. 863-871, Aug. 2012. Article (CrossRef Link) Y. Wang, K. Huang, and T. Tan. "Human activity recognition based on r transform." In Computer Vision and Pattern Recognition, IEEE Conference on, pp. 1-8. Jun. 2007. Article (CrossRef Link) M. Z. Uddin et al., “Independent shape componentbased human activity recognition via Hidden Markov Model,” Applied Intelligence, Vol. 33, No. 2, pp. 193-206. Jan. 2010. Article (CrossRef Link) J. 
Han et al., “Human activity recognition in thermal infrared imagery,” Computer Vision and Pattern Recognition. IEEE Computer Society Conference on, pp. 17, Jun. 2005. Article (CrossRef Link) H. Othman et al., “A separable low complexity 2D HMM with application to face recognition,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, Vol. 25. No. 10, pp. 1229 – 1238, Oct. 2003. Article (CrossRef Link) 289 [27] Y. Linde et al., "An algorithm for vector quantizer design,” Communications, IEEE Transactions on, Vol. 28, No. 1, pp. 84–95, Jan. 1980. Article (CrossRef Link) [28] O. Oreifej, & Z. Liu., “Hon4d: Histogram of oriented 4d normals for activity recognition from depth sequences,” Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, 2013. Article (CrossRef Link) [29] L. Xia et al., “Spatio-temporal depth cuboid similarity feature for activity recognition using depth camera,” Computer Vision and Pattern Recognition, IEEE Conference on, pp. 2834-2841. Jun. 2013. Article (CrossRef Link) [30] Y. Song et al., “Body Surface Context: A New Robust Feature for Action Recognition from Depth Videos,” Circuits and Systems for Video Technology, IEEE Transactions on, Vol. 24, No. 6, pp. 952-964, Jan. 2014. Article (CrossRef Link) [31] L. Xia et al., “View invariant human action recognition using histograms of 3d joints,” Computer Vision and Pattern Recognition, IEEE Computer Society Conference on, pp. 20-27. Jun. 2012. Article (CrossRef Link) [32] X. Yang et al., “Effective 3d action recognition using eigenjoints,” Journal of Visual Communication and Image Representation, Vol. 25, No. 1, pp. 2-11, Jan. 2014. Article (CrossRef Link) [33] O. Boiman et al., “In defense of nearest-neighbor based image classification,” Computer Vision and Pattern Recognition, IEEE Conference on, pp. 1-8, Jun. 2008. Article (CrossRef Link) [34] G. Evangelidis et al., “Skeletal quads: Human action recognition using joint quadruples,” Pattern Recognition (ICPR), 22nd International Conference on, pp. 4513-4518. Aug. 2014. Article (CrossRef Link) [35] J. Lei et al., “Fine-grained kitchen activity recognition using rgb-d.” in Proc. of the ACM Conference on Ubiquitous Computing, pp. 208-211. Sep. 2012. Article (CrossRef Link) [36] S. Althloothi et al., “Human activity recognition using multi-features and multiple kernel learning,” Pattern Recognition, Vol. 47. No. 5, pp. 1800-1812. May. 2014. Article (CrossRef Link) [37] M. Gönen and E. Alpaydin, “Multiple kernel learning algorithms,” The Journal of Machine Learning Research, Vol. 12, pp. 2211-2268. Jan. 2011. Article (CrossRef Link) Adnan Farooq is a Ph.D. student in Department of Electrical and Electronics Engineering at Dongguk University, Seoul, South Korea. He received his B.S degree in Computer Engineering from COMSATS Institute of Science and Technology, Abbottabad, Pakistan and M.S. degree in Biomedical Engineering from Kyung Hee University, Republic of Korea. His research interest includes Image Processing, Computer vision. 290 Farooq et al.: A Survey of Human Action Recognition Approaches that use an RGB-D Sensor Chee Sun Won received the B.S. degree in electronics engineering from Korea University, Seoul, in 1982, and the M.S. and Ph.D. degrees in electrical and computer engineering from the University of Massachusetts, Amherst, in 1986 and 1990, respectively. From 1989 to 1992, he was a Senior Engineer with GoldStar Co., Ltd. (LG Electronics), Seoul, Korea. 
In 1992, he joined Dongguk University, Seoul, Korea, where he is currently a Professor in the Division of Electronics and Electrical Engineering. He was a Visiting Professor at Stanford University, Stanford, CA, and at McMaster University, Hamilton, ON, Canada. His research interests include MRF image modeling, image segmentation, robot vision, image retrieval, image/video compression, video condensation, stereoscopic 3D video signal processing, and image watermarking.
Copyrights © 2015 The Institute of Electronics and Information Engineers
IEIE Transactions on Smart Processing and Computing, vol. 4, no. 4, August 2015 http://dx.doi.org/10.5573/IEIESPC.2015.4.4.291
IEIE Transactions on Smart Processing and Computing
Design of High-Speed Comparators for High-Speed Automatic Test Equipment
Byunghun Yoon and Shin-Il Lim*
Department of Electronics Engineering, Seokyeong University / Seoul, South Korea {bhyoon, silim}@skuniv.ac.kr
* Corresponding Author: Shin-Il Lim
Received June 20, 2015; Revised July 15, 2015; Accepted August 24, 2015; Published August 31, 2015
* Short Paper
Abstract: This paper describes the design of a high-speed comparator for high-speed automatic test equipment (ATE). The normal comparator block, which compares the detected signal from the device under test (DUT) to the reference signal from an internal digital-to-analog converter (DAC), is composed of a rail-to-rail first pre-amplifier, a hysteresis amplifier, and a third pre-amplifier and latch for high-speed operation. The proposed continuous comparator handles high-frequency signals up to 800MHz and a wide range of input signals (0~5V). Also, to compare the differences of both common mode signals and differential signals between two DUTs, the proposed differential mode comparator exploits one differential difference amplifier (DDA) as a pre-amplifier, while a conventional differential comparator uses three op-amps as a pre-amplifier. The chip was implemented with 0.18μm Bipolar CMOS DMOS (BCDMOS) technology, can compare signal differences of 5mV, and operates in a frequency range up to 800MHz. The chip area is 0.514mm2.
Keywords: ATE, High-speed continuous comparator, Hysteresis, Differential difference amplifier
1. Introduction
For testing and characterizing an application processor (AP) or a system on chip (SoC), the pin card in high-speed automatic test equipment (ATE) includes a driver integrated circuit (IC) to force the signal to the device under test (DUT) and to detect the signal from the DUT. The driver IC includes a parametric measurement unit (PMU), a digital-to-analog converter (DAC), comparators, an active load and serial peripheral interface (SPI) memory registers [1], [2]. This paper describes new techniques for designing a high-speed comparator for this driver IC in ATE with Bipolar CMOS DMOS (BCDMOS) technology. The comparator compares the detected signal values from the DUTs to the output of the DAC, which provides the reference values. The comparator must be able to handle a wide input signal range (0V to 5V) and operate with an input signal up to 800MHz. Also, it should have sufficient accuracy to compare a signal difference of 5mV. Moreover, considering unexpected noise, the comparator must have a hysteresis function. The implementation of a high-speed comparator with 0.18μm BCDMOS technology is challenging work because BCDMOS devices have large parasitic capacitances and resistances.
2. Architecture of the Comparator
A block diagram of the high-speed comparator is depicted in Fig. 1. This comparator is continuous, so it does not need a clock signal.
Fig. 1. Block diagram of the comparator.
The comparator has two operating modes. The first mode is normal comparator mode (NCM), which compares the detected values from the DUT (DUT0) to the DAC reference values (DACVOH, DACVOL). The second mode is differential comparator mode (DCM), which compares the differences of both the common mode signals and the differential signals from the two DUTs (DUT0, DUT1) to the DAC reference signals (DACVOH, DACVOL). The decision results from the comparators are transferred to the control blocks through the high-speed driver circuits.
3. The Proposed Comparator Design
3.1 High-Speed Comparator in Normal Mode
Fig. 2 shows the proposed block diagram of the high-speed continuous comparator. For high accuracy and high-speed operation with BCDMOS technology, three cascode stages of pre-amplifiers and a latch are used. The first stage pre-amplifier is a rail-to-rail amplifier to deal with the wide input range of 0V to 5V. The second stage amplifier has hysteresis circuits controlled by 2b SPI signals. Finally, the high-speed third pre-amplifier and latch are designed to meet the high-speed and high-accuracy requirements.
Fig. 2. Block diagram of the continuous comparator.
Fig. 3 shows the first stage rail-to-rail pre-amplifier for accepting the wide range of input signals (0V to 5V). Because the active load devices in BCDMOS technology have large parasitic capacitors and large output resistances, small passive resistors are used instead of active loads for high-bandwidth operation.
Fig. 3. First stage rail-to-rail pre-amplifier.
The second pre-amplifier with the hysteresis function is shown in Fig. 4 [3]. This hysteresis circuit provides hysteresis voltages in a continuous comparator to overcome unexpected noise.
Fig. 4. Second pre-amplifier with hysteresis function.
Hysteresis is achieved through the difference of the device ratios between the diode-connected active transistors and the switch-controlled active transistors. The 2b signals from the SPI register control the S1, S2, S3 and S4 switches and select one of four hysteresis voltage steps, from a minimum of 0mV to a maximum of 96mV. The relation between the hysteresis voltage and the device size of the switch-connected PMOSFET (PMOS) transistors is shown in Eq. (1):
V_hys = ((1 − M) / (1 + M)) · √( 2I / (μnCox(W/L)) )    (1)
where M is the device ratio between the diode-connected active transistors and the switch-controlled active transistors, and I is the tail current of the second pre-amplifier.
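To make Eq. (1) concrete, a small numeric sketch is given below; the tail current, the μnCox(W/L) value and the device ratios assigned to the four SPI codes are illustrative assumptions, not the designed values.

```python
import math

def hysteresis_voltage(M, I_tail, beta):
    """Eq. (1): V_hys = ((1 - M)/(1 + M)) * sqrt(2*I / (un*Cox*(W/L))).
    M      : diode-connected to switch-controlled device ratio
    I_tail : tail current of the second pre-amplifier [A] (assumed value)
    beta   : un*Cox*(W/L) of the load devices [A/V^2]     (assumed value)"""
    return abs((1.0 - M) / (1.0 + M)) * math.sqrt(2.0 * I_tail / beta)

I_tail = 100e-6          # assumed 100 uA tail current
beta = 2e-3              # assumed un*Cox*(W/L) = 2 mA/V^2
for code, M in [("00", 1.0), ("01", 0.8), ("10", 0.65), ("11", 0.5)]:
    print("SPI code %s: M=%.2f -> V_hys = %.1f mV"
          % (code, M, 1e3 * hysteresis_voltage(M, I_tail, beta)))
# M = 1 gives zero hysteresis; moving the ratio away from unity widens the
# window, which is how the 2b SPI code steps the hysteresis toward ~96 mV.
```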
Figs. 5 and 6 show the third pre-amplifier and the latch, respectively. The fully differential third pre-amplifier exploits a simple common mode feedback (CMFB) circuit with two resistors to stabilize the output common mode voltage. The latch is composed of two inverters connected to each other to form positive feedback. Inverter chains are added to clarify the high or low output level with minimum delay.
Fig. 5. Third pre-amplifier.
Fig. 6. Latch with inverter chain.
3.2 Comparator in Differential Mode
To implement DCM, a differential difference amplifier (DDA) [4] is used as a pre-amplifier, as shown in Fig. 7. A conventional design of the DCM needs three amplifiers: one for differential signal generation, another for common mode signal generation, and a final one for summing these two signals. If a DDA is used, only one DDA pre-amplifier is enough for processing both the differences in the common mode signals and the difference in the differential mode signals. Since the proposed DCM uses one op-amp, it does not suffer from the mismatches, accumulated offsets and noise of many op-amps. Moreover, the proposed DCM consumes less power and requires less hardware area, and hence offers a low-cost implementation.
Fig. 7. Block diagram of the DDA.
Fig. 8 shows a transistor-level schematic diagram of the DDA. For high gain, a folded cascode structure was used. This folded cascode structure inherently guarantees stable high-frequency operation and does not require additional compensation capacitors. The input stage is designed for rail-to-rail operation by paralleling the PMOS input pair and the NMOSFET (NMOS) input pair. The output voltage, VOUT, includes both the difference of the common mode signals and the difference of the differential signals from DUT0 (VIP) and DUT1 (VIN), as expressed in Eq. (2):
VOUT = VIP − VIN + VCM = (Vp,di − Vn,di) + (Vp,cm − Vn,cm) + VCM    (2)
Fig. 8. The circuits of the DDA.
4. Simulated and Measured Results
The proposed comparator was implemented with 0.18μm BCDMOS technology. The layout size of the proposed comparator is 620μm x 830μm, as shown in Fig. 9. There are four continuous comparators without clock signals, a DDA and two output stages. All the supply voltages in the comparators are 5V. The first rail-to-rail pre-amplifier has a gain of 10dB and a unity gain frequency of 2.5GHz, while the third pre-amplifier has a gain of 14dB and a unity gain frequency of 1.53GHz.
Fig. 9. Comparator layout (0.5146mm2).
Fig. 10 shows the simulation results of the normal mode continuous comparators. The proposed comparator can compare a difference of 5mV at a frequency of 800MHz. Fig. 11 shows the simulation results of the hysteresis voltages: hysteresis voltages of 0mV, 37.9mV, 68mV and 96mV are achieved for the codes 00, 01, 10 and 11, respectively. The AC simulation results of the DDA in DCM are shown in Fig. 12. The gain is 32dB, the unity gain frequency is 1.33GHz, and the phase margin is 65°.
Fig. 10. Transient simulation results of the comparator.
Fig. 11. Simulation results of hysteresis.
Fig. 12. AC simulation results of the DDA.
Fig. 13. Simulated and measured results of the normal comparator.
Fig. 14. Simulated and measured results of the DDA (Vout).
Table 1. Performance Comparison.
Parameter | This Work | [2]
Process | 0.18μm BCDMOS | BiCMOS
Supply Voltage | 5V | 5V
Input Range | 0 ~ 5V | 0 ~ 5V
Maximum Input Freq. | 800MHz | 1.2GHz
Hysteresis (00 ~ 11) | 0mV ~ 96mV | 0mV ~ 100mV
Power Consumption | 300mW | 1.1W (per channel)
Resolution | 5mV | N/A
Chip Area (w/o pad) | 620μm × 830μm | N/A
Fig. 13 shows the simulated results of the proposed continuous comparator on the left side and the measured results on the right side. To verify the wide operating range, a difference voltage of 5mV around common mode voltages of 2.5V, 4.9V and 0.1V was tested. The outputs of the comparator were correctly detected as high or low levels at a signal frequency of 800MHz, as shown in Fig. 13.
Fig. 14. Simulated and measured results of the DDA (VOUT).

Fig. 14 shows the simulated results of the proposed DDA on the left side and the measured results of the proposed DDA on the right side. As shown in Fig. 14, the differences in the common mode voltages and the differences in the differential mode voltages are measured correctly. Three sets of input signals were tested to verify correct operation of the DDA. In the simulated results on the left, the two input signals are at the top and the simulated output is at the bottom, while in the measured results on the right, the two input signals are shown at the bottom. The performance of the comparator is summarized in Table 1.

5. Conclusion

This paper describes the design of a high-speed comparator for a driver IC in automatic test equipment using BCDMOS technology. To accept a wide input signal range of 0 V to 5 V and to handle signal frequencies up to 800 MHz, a cascaded amplifier structure for the continuous comparator is proposed. In addition, to measure the difference in output signals between two DUTs, a DDA is exploited, detecting the differences of both the common mode signals and the differential signals with minimum hardware, lower power consumption and lower noise.

Acknowledgment

This research was supported by the Ministry of Science, ICT and Future Planning (MSIP), Korea, under the Information Technology Research Center (ITRC) support program (IITP-2015-H8501-15-1010) supervised by the Institute for Information & Communications Technology Promotion (IITP), and also supported by the Industrial Core Technology Development Program (10049009) funded by the Ministry of Trade, Industry & Energy (MOTIE), Korea.

References

[1] In-Seok Jung and Yong-Bin Kim, "Cost Effective Test Methodology Using PMU for Automated Test Equipment Systems," International Journal of VLSI Design & Communication Systems (VLSICS), vol. 5, no. 1, pp. 15-28, February 2014. Article (CrossRef Link)
[2] ADATE318 data sheet, Analog Devices. Article (CrossRef Link)
[3] Xinbo Qian, "A Low-power Comparator with Programmable Hysteresis Level for Blood Pressure Peak Detection," TENCON 2009, Singapore, pp. 1-4, Jan. 2009. Article (CrossRef Link)
[4] E. Säckinger and W. Guggenbühl, "A Versatile Building Block: The CMOS Differential Difference Amplifier," IEEE Journal of Solid-State Circuits, pp. 287-294, April 1987. Article (CrossRef Link)
[5] G. Nicollini and C. Guardiani, "A 3.3-V 800-nV noise, gain-programmable CMOS microphone preamplifier design using yield modeling technique," IEEE J. Solid-State Circuits, vol. 28, no. 8, pp. 915-920, Aug. 1993. Article (CrossRef Link)
[6] Vladimir Milovanović and H. Zimmermann, "A 40nm LP CMOS Self-Biased Continuous-Time Comparator with Sub-100ps Delay at 1.1V & 1.2mW," ESSCIRC, pp. 101-104, 2013. Article (CrossRef Link)
[7] Hong-Wei Huang, Chia-Hsiang Lin and Ke-Horng Chen, "A Programmable Dual Hysteretic Window Comparator," ISCAS, pp. 1930-1933, 2008. Article (CrossRef Link)
[8] R. Jacob Baker, CMOS: Circuit Design, Layout, and Simulation, 3rd Edition, IEEE Press Series on Microelectronic Systems, August 2010. Article (CrossRef Link)
[9] Behzad Razavi, Design of Analog CMOS Integrated Circuits, McGraw-Hill, 2001. Article (CrossRef Link)

Byung-Hun Yoon received a BSc from the Department of Electronic Engineering at Seokyeong University, Seoul, Korea, in 2014. Since 2014, he has been in the master's course at Seokyeong University.
His research interests include analog and mixed-mode IC design and PMICs.

Shin-Il Lim received his BSc, MSc and PhD in electronic engineering from Sogang University, Seoul, Korea, in 1980, 1983, and 1995, respectively. He was with the Electronics and Telecommunications Research Institute (ETRI) from 1982 to 1991 as senior technical staff, and with the Korea Electronics Technology Institute (KETI) from 1991 to 1995 as a senior engineer. Since 1995, he has been with Seokyeong University, Seoul, Korea, as a professor. His research areas are analog and mixed-mode IC design for communication, consumer, biomedical and sensor applications. He was the TPC chair of ISOCC 2009 and the general chair of ISOCC 2011.