WLD: A Robust Local Image Descriptor

Jie Chen1,2,3, Shiguang Shan1, Chu He2,4, Guoying Zhao2, Matti Pietikäinen2, Xilin Chen1, Wen Gao1,3

1 Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences (CAS), Beijing, 100080, China
2 Machine Vision Group, Department of Electrical and Information Engineering, P. O. Box 4500, FI-90014 University of Oulu, Finland
3 School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, China
4 School of Electronic Information, Wuhan University, Wuhan, 430079, China
{jiechen, gyzhao, mkp}@ee.oulu.fi, hc@eis.whu.edu.cn, {sgshan, xlchen, wgao}@jdl.ac.cn

Abstract

Inspired by Weber's Law, this paper proposes a simple yet very powerful and robust local descriptor, called the Weber Local Descriptor (WLD). It is based on the fact that human perception of a pattern depends not only on the change of a stimulus (such as sound or lighting) but also on the original intensity of the stimulus. Specifically, WLD consists of two components: differential excitation and orientation. The differential excitation is a function of the ratio between two terms: one is the relative intensity differences of a current pixel against its neighbors; the other is the intensity of the current pixel. The orientation is the gradient orientation of the current pixel. For a given image, we use the differential excitation and orientation components to construct a concatenated WLD histogram feature. Experimental results on the Brodatz and KTH-TIPS2-a texture databases show that WLD impressively outperforms other classical descriptors (e.g., Gabor and SIFT). In particular, experimental results on human face detection show promising performance: although we train only one classifier based on WLD features, it obtains performance comparable to state-of-the-art methods on the MIT+CMU frontal face test set, the AR face dataset and the CMU profile test set.

1. Introduction

Recently there has been much interest in object and view matching using local invariant features [21], in classification of textured regions using micro-textons [28], and in face detection using local features [41]. Several studies have evaluated the performance of such methods, e.g., [24, 25, 27, 32]. The proposed methods can be divided into two classes: one is the sparse descriptor, which first detects interest points in a given image, then samples a local patch and describes its invariant features [24, 25]; the other is the dense descriptor, which extracts local features from an input image pixel by pixel [27, 32].

Among sparse descriptors, a typical one is the scale-invariant feature transform (SIFT) introduced by Lowe [21]. It performs best in the context of matching and recognition [25] due to its invariance to scaling and rotation. Several attempts to improve SIFT have been reported in the literature. Ke and Sukthankar developed the PCA-SIFT descriptor, which represents local appearance by principal components of the normalized gradient field [17]. Mikolajczyk and Schmid modified the SIFT descriptor by changing the gradient location-orientation grid as well as the quantization parameters of the histograms [25]; the descriptor was then projected to a lower dimensional space with PCA. This modification slightly improved performance in matching tests. Dalal and Triggs proposed histograms of oriented gradients (HOG) [10].
HOG differs from SIFT in several aspects, such as the normalization and spatial distribution as well as the size of the bins. Lazebnik et al. proposed a rotation invariant version called the Rotation Invariant Feature Transform (RIFT) [18], which overcomes the problem of dominant gradient orientation estimation required by SIFT, at the cost of lower discriminative power. Bay et al. proposed an efficient implementation of SIFT by applying the integral image to compute image derivatives and quantizing the gradient orientations into a small number of histogram bins [3]. Winder and Brown learned optimal parameter settings on a large training set to maximize matching performance [42]. Mikolajczyk and Matas developed an optimal linear projection to improve the matching quality and speed of SIFT descriptors [26]. Likewise, to improve the efficiency of the local descriptor, Tola et al. replaced the weighted sums used by SIFT with sums of convolutions [38]. In addition, Cheng et al. proposed to use multiple support regions of different sizes surrounding an interest point, since the amount of deformation at each location varies and is unpredictable, which makes it difficult to choose a single best scale for the support region [8].

Among the most popular dense descriptors are Gabor filters [22] and the local binary pattern (LBP) [28]. The Gabor representation has been shown to be optimal in the sense of minimizing the joint two-dimensional uncertainty in space and frequency [22]. These filters can be considered as orientation- and scale-tunable edge and line (bar) detectors, and the statistics of these micro-features in a given region are often used to characterize the underlying texture information. Gabor filters have been widely used in image analysis applications including texture classification and segmentation, image registration, motion tracking [22], and face recognition [47]. Another important dense local descriptor is LBP, which has gained increasing attention due to its simplicity and excellent performance in various texture and face image analysis tasks [28]. Many variants of LBP have recently been proposed and have achieved considerable success in various tasks. Ahonen et al. exploited LBP for face recognition [1]. Rodriguez and Marcel proposed adapted local binary pattern histograms for face authentication [34]. Tan and Triggs modified the thresholding scheme for face recognition under difficult lighting conditions [37]. Zhao and Pietikäinen proposed local binary patterns on three orthogonal planes and used them for dynamic texture recognition [46]. Zhang et al. proposed the local Gabor binary pattern histogram sequence as a non-statistical model for face representation and recognition [47].

In this paper, we propose a simple yet very powerful and robust local descriptor. It is inspired by Weber's Law, a psychological law [15]. It states that the change of a stimulus (such as sound or lighting) that is just noticeable is a constant ratio of the original stimulus. When the change is smaller than this constant ratio, a human would perceive it as background noise rather than a valid signal. Motivated by this, the proposed Weber Local Descriptor (WLD) is computed based on the ratio between two terms: one is the relative intensity differences of a current pixel against its neighbors; the other is the intensity of the current pixel. We then represent input image parts by histograms of these differential excitations and orientations. Thus, WLD is a dense descriptor [7].
Although the proposed WLD descriptor shares advantages with SIFT (i.e., computing a histogram using the gradient and its orientation) and LBP (i.e., computational efficiency and support regions of about 3×3), the differences are clear. As mentioned above, the SIFT descriptor is a 3D histogram of gradient locations and orientations, in which two dimensions correspond to the image spatial dimensions and the additional dimension to the image gradient direction. It is a sparse descriptor, computed for interest regions (located around detected interest points) that have usually already been normalized with respect to scale and rotation. Texture classification with SIFT uses information in these sparsely located interest regions, e.g., [11]. WLD, on the other hand, is a dense descriptor computed for every pixel. It is a 2D histogram of local properties that depend both on the local intensity variation and on the intensity of the center pixel, and texture classification with WLD uses these 2D feature histograms. The granularity of the statistical pattern is much finer than that of SIFT: WLD is computed over a 3×3 square region, whereas SIFT is computed over a 16×16 region [21, 8]. That is to say, the salient features of WLD are of finer granularity than those of SIFT, and the smaller support region lets WLD capture more local salient patterns. As for the LBP descriptor, it represents an input image by building statistics of local micro-pattern variations, which might correspond to bright/dark spots, edges, flat areas, etc. WLD, however, first computes the salient micro-patterns and then builds statistics on these salient patterns according to the orientation of the current point; it thus combines both the gradient (salient micro-pattern) and the orientation.

Several researchers have used Weber's Law in computer vision. Bruni and Vitulano used this law for scratch detection on digital film materials [5]. Phiasai et al. employed a Weber ratio to control the strength of a watermark [31]. Different from them, we use Weber's Law to generate a new representation of an image.

The rest of this paper is organized as follows: In Section 2, we introduce a descriptor framework. In Section 3, we propose the local descriptor WLD and compare it with other existing methods by casting them into the framework. In Sections 4 and 5, experiments are carried out on the applications of WLD to texture classification and face detection. Section 6 concludes the paper.

2. Descriptor framework

Recently, He et al. proposed a descriptor framework: Filtering, Labeling and Statistics (FLS) [9]. It describes image attributes from different aspects: filtering depicts the inter-pixel relationship in a local image region; labeling (which includes quantization and mapping) describes the intensity variations which cause psychological redundancies; statistics captures attributes beyond adjacent regions. The framework is based on an analogy between image description and image compression (both try to describe an image with less information). The whole procedure of image compression is usually composed of three steps targeting three redundancies: filtering toward the inter-pixel redundancy, quantization toward the psychological redundancy, and statistics toward the coding redundancy.

Thus, by using the FLS framework and mapping each step of existing descriptors to these three stages, one can obtain further knowledge about these descriptors and compare them under a uniform framework. The framework is illustrated in Fig. 1.

The first stage of FLS is to filter an input image by a filter bank. Specifically, given an image on a rectangular pixel lattice S, which contains a set of pixels Y = {y_s, s ∈ S}, it is filtered by a filter bank of size M, i.e., F = {f^0, f^1, …, f^{M-1}}. The responses of the filter bank F at the same site of the lattice are represented as v_s^m = f^m(y_s) (m = 0, 1, …, M-1). Concatenating the response of each filter, we form a response vector for the pixel, v_s = [v_s^0, v_s^1, …, v_s^{M-1}]. Combining the response vectors of all pixels, we build up a vector image V = {v_s, s ∈ S}. The second stage is to quantize each response vector and map it into a label by a given function: in the vector space, a response vector v_s is labeled as x_s = f_{vq}(v_s), x_s ∈ R^K (K is the dimensionality of the label space). For example, Varma and Zisserman carry out clustering of the response vectors [40]; the label of each response vector is then set to a number (i.e., the index of its cluster). Thus, we obtain a mapped image X = {x_s, s ∈ S}. The third stage is to compute statistics of the pixel labels. A commonly used approach is to compute the histogram of these labels: H = f_hist(X).

Fig. 1. The FLS framework of image descriptors.
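To make the three stages concrete, the following is a minimal sketch of an FLS-style pipeline in Python/NumPy. It is our illustration rather than code from [9]; the function names (fls_descriptor, label_fn) are placeholders.

```python
import numpy as np

def fls_descriptor(image, filter_bank, label_fn, n_labels):
    """Filtering -> Labeling -> Statistics (a minimal FLS sketch)."""
    # Filtering: each filter yields one response per pixel; stacking them
    # gives a response vector v_s = [v_s^0, ..., v_s^{M-1}] for every pixel.
    responses = np.stack([f(image) for f in filter_bank], axis=-1)

    # Labeling: quantize/map each response vector to a discrete label x_s.
    labels = label_fn(responses)

    # Statistics: histogram of the labels over the whole image.
    hist, _ = np.histogram(labels, bins=n_labels, range=(0, n_labels))
    return hist / hist.sum()  # normalized histogram H
```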
3. WLD for Image Representation

In this section, we review Weber's Law and then propose the descriptor WLD. Subsequently, we compare WLD with some existing descriptors (i.e., LBP and SIFT) by casting them into the FLS framework.

3.1. Weber's Law

Ernst Weber, an experimental psychologist of the 19th century, observed that the ratio of the increment threshold to the background intensity is a constant [15]. This relationship, known since as Weber's Law, can be expressed as:

\frac{\Delta I}{I} = k,    (1)

where ΔI represents the increment threshold (the just noticeable difference for discrimination), I represents the initial stimulus intensity, and k signifies that the proportion on the left side of the equation remains constant despite variations in I. The fraction ΔI/I is known as the Weber fraction. More simply stated, Weber's Law says that the size of a just noticeable difference (i.e., ΔI) is a constant proportion of the original stimulus value. So, for example, in a noisy environment one must shout to be heard, while a whisper works in a quiet room.

3.2 WLD

Motivated by Weber's Law, we propose the descriptor WLD. It consists of two components: differential excitation (ξ) and orientation (θ). The first component, ξ, is a function of the Weber fraction (i.e., the relative intensity differences of a current pixel against its neighbors, divided by the intensity of the current pixel itself). The second component, θ, is the gradient orientation of the current pixel.

3.2.1 Differential excitation

We use the intensity differences between a current pixel and its neighbors as the changes at the current pixel. By this means, we hope to find the salient variations within an image and thus simulate human perception of patterns. Specifically, the differential excitation ξ(y_c) of a current pixel is computed as illustrated in Fig. 2, where y_c denotes the intensity of the current pixel and y_i (i = 0, 1, …, p-1) denote the intensities of its p neighbors (p = 8 here).
To compute ξ(y_c), we first calculate the differences between its neighbors and the center point with the filter f^{00}:

G_{diff}(y_i) = \Delta y_i = y_i - y_c.    (2)

Subsequently, we consider the effect of the neighbors on the current point using the sum of the differences:

v_s^{00} = \sum_{i=0}^{p-1} \Delta y_i.    (3)

Following Weber's Law, we then compute the ratio of these differences to the intensity of the current point by combining the outputs of the two filters f^{00} and f^{01} (whose output v_s^{01} is in fact the original image):

G_{ratio}(y_c) = \frac{v_s^{00}}{v_s^{01}}.    (4)

Fig. 2. An illustration of computing the WLD feature of a pixel.

To improve the robustness of WLD to noise, we apply an arctangent function as a filter on G_{ratio}(·):

G_{arctan}(G_{ratio}) = \arctan(G_{ratio}).    (5)

Combining Eqs. (2), (3), (4) and (5), we have:

x_s^0 = G_{arctan}(G_{ratio}) = \arctan\left(\frac{v_s^{00}}{v_s^{01}}\right) = \arctan\left[\sum_{i=0}^{p-1}\left(\frac{y_i - y_c}{y_c}\right)\right].    (6)

So, ξ(y_c) is computed as:

\xi(y_c) = \arctan\left(\frac{v_s^{00}}{v_s^{01}}\right) = \arctan\left[\sum_{i=0}^{p-1}\left(\frac{y_i - y_c}{y_c}\right)\right].    (7)

Note that ξ(y_c) may take a negative value if the neighbors' intensities are smaller than that of the current pixel. By this means, we attempt to preserve more discriminative information than if we used the absolute value of ξ(y_c). Intuitively, if ξ(y_c) is positive, it simulates the case that the surroundings are lighter than the current pixel; if ξ(y_c) is negative, it simulates the case that the surroundings are darker than the current pixel.

Arctangent function. As shown in Eq. (5), we use an arctangent filter G_{arctan}(·) to compute ξ(y_c). As shown in Fig. 3, the reason is that the arctangent filter bounds the output, preventing it from increasing or decreasing too quickly as the input becomes larger or smaller. One alternative filter is a logarithm function, which also matches human perception well; however, many outputs of Eq. (2) are negative, which rules out this kind of function. Another alternative filter is a sigmoid function:

sigmoid(x) = \frac{1 - e^{-\beta x}}{1 + e^{-\beta x}},    (8)

where β > 0. It is a typical neuronal non-linear transfer function, widely used in artificial neural networks [2]. The arctangent and sigmoid functions have similar curves, as shown in Fig. 3, especially when β = 2. In our case, we use the former for simplicity.

Fig. 3. Plots of y = arctan(x) together with y = x and y = sigmoid(βx) for β = 1, 2, 3, 4; arctan(x) is bounded by the asymptotes y = ±π/2. Note that the output of arctan(·) is in radians.

In Fig. 4, we plot the average histogram of the differential excitations over 2,000 texture images. One can see that there are more frequencies at the two sides of the average histogram (e.g., [-π/2, -π/3] and [π/3, π/2]). This results from the way the differential excitation ξ of a pixel is computed (i.e., as a sum of the difference ratios of p neighbors against a central pixel), as shown in Eq. (7), and it is valuable for a classification task. For more details, please refer to Section 3.2.4 and Sections 4 and 5.

Fig. 4. A plot of the average histogram of the differential excitations over 2,000 texture images.

3.2.2. Orientation

The orientation component of WLD is computed as:

\theta(y_c) = x_s^1 = \arctan(\chi),    (9)

where

\chi = \mathrm{median}\left\{\frac{v_s^{10}}{v_s^{12}}, \frac{v_s^{11}}{v_s^{13}}, \frac{v_s^{12}}{v_s^{10}}, \frac{v_s^{13}}{v_s^{11}}\right\},    (10)

and v_s^{1k} (k = 0, 1, 2, 3) is the output of the four filters f^{1k}; for example, v_s^{10} = y_0 - y_4. Note that in Eq. (10) we use only p/2 (i.e., 4) filters to compute χ, not p (i.e., 8) filters, because the ratios are symmetric when k takes its values in the two intervals [0, p/2-1] and [p/2, p-1]. For example, for k = 4 we have v_s^{14}/v_s^{16} = (y_4 - y_0)/(y_6 - y_2) = v_s^{10}/v_s^{12}.
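As an illustration of Eqs. (2)-(10), the following sketch computes the differential excitation and orientation for a single 3×3 neighborhood. It assumes the eight neighbors y_0…y_7 are ordered clockwise from the top-left corner (so y_0 and y_4 are diametrically opposite), and it adds a small ε to avoid division by zero, a practical detail the text does not discuss.

```python
import numpy as np

def differential_excitation(patch, eps=1e-6):
    """xi(y_c) = arctan(sum_i (y_i - y_c) / y_c), Eq. (7), for a 3x3 patch."""
    yc = patch[1, 1].astype(np.float64)
    neighbors = patch.astype(np.float64).ravel()
    neighbors = np.delete(neighbors, 4)          # drop the center pixel
    return np.arctan(np.sum(neighbors - yc) / (yc + eps))

def orientation(patch, eps=1e-6):
    """theta(y_c) = arctan(chi), Eqs. (9)-(10), with chi the median of the
    four ratios of opposite-neighbor differences."""
    p = patch.astype(np.float64)
    # v^{10}..v^{13}: differences of diametrically opposite neighbors,
    # assuming y0 at the top-left and clockwise ordering.
    v10 = p[0, 0] - p[2, 2]
    v11 = p[0, 1] - p[2, 1]
    v12 = p[0, 2] - p[2, 0]
    v13 = p[1, 2] - p[1, 0]
    ratios = [v10 / (v12 + eps), v11 / (v13 + eps),
              v12 / (v10 + eps), v13 / (v11 + eps)]
    return np.arctan(np.median(ratios))
```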
For simplicity, θ's value is quantized into T dominant orientations. Before the quantization, however, we perform the mapping f: θ → θ', where θ ∈ [-π/2, π/2] and θ' ∈ [0, 2π). Note that the mapping takes into account both the value of θ computed using Eq. (9) and the signs of the numerator and denominator on the right side of Eq. (10). The quantization function is then:

\Phi_t = \varphi(\theta') = \frac{2t}{T}\pi, \quad t = \mathrm{mod}\left(\left\lfloor \frac{\theta'}{2\pi/T} + \frac{1}{2} \right\rfloor, T\right).    (11)

For example, if T = 8, the T dominant orientations are Φ_t = (tπ)/4 (t = 0, 1, …, T-1). In other words, orientations located within the interval [Φ_t - π/T, Φ_t + π/T) are quantized to Φ_t.

The idea of representing image parts by histograms of gradient locations and orientations has been used in biologically plausible vision systems and in recognition [26]. Motivated by this idea, we compute the 2D histogram {WLD(ξ_j, Φ_t)} after computing the differential excitation ξ_j and orientation Φ_t of each pixel. As shown in Fig. 2, each column corresponds to a dominant orientation Φ_t and each row to a differential excitation interval; thus, the intensity of each cell corresponds to the frequency of a certain differential excitation on a dominant orientation.

3.2.3. WLD histogram

Given an image, as shown in Fig. 5 (a), we encode the WLD features into a histogram H. Using the 2D histogram {WLD(ξ_j, Φ_t)} computed as shown in Fig. 2, we project each column of the 2D histogram to form a 1D histogram H(t) (t = 0, 1, …, T-1), as shown in Fig. 5 (a). That is, we regroup the differential excitations ξ(y_c) into T sub-histograms H(t), each corresponding to a dominant orientation Φ_t. Subsequently, each sub-histogram H(t) is evenly divided into M segments H_{m,t} (m = 0, 1, …, M-1; in our implementation we let M = 6). These segments H_{m,t} are then reorganized into the histogram H. Specifically, H is a concatenation of M sub-histograms, i.e., H = {H_m}, m = 0, 1, …, M-1, and each sub-histogram H_m is a concatenation of T segments, H_m = {H_{m,t}}, t = 0, 1, …, T-1.

Fig. 5. An illustration of the WLD histogram feature for a given image. (a) H is a concatenation of M sub-histograms {H_m} (m = 0, 1, …, M-1), and each H_m is a concatenation of T histogram segments H_{m,t} (t = 0, 1, …, T-1). For each column of the histogram matrix, all M segments H_{m,t} (m = 0, 1, …, M-1) share the same dominant orientation Φ_t; in contrast, for each row, the differential excitations ξ_j of every segment H_{m,t} (t = 0, 1, …, T-1) belong to the same interval l_m. (b) A histogram segment H_{m,t}. Note that if t is fixed, the dominant orientation of a bin h_{m,t,s} is fixed (i.e., Φ_t) for any m or s.

Note that when each sub-histogram H(t) is evenly divided into M segments, the range of the differential excitations ξ_j (i.e., l = [-π/2, π/2]) is also evenly divided into M intervals l_m (m = 0, 1, …, M-1). Thus, for each interval l_m we have l_m = [η_{m,l}, η_{m,u}], where the lower bound is η_{m,l} = (m/M - 1/2)π and the upper bound is η_{m,u} = ((m+1)/M - 1/2)π. For example, l_0 = [-π/2, -π/3]. Furthermore, as shown in Fig. 5 (b), a segment H_{m,t} is composed of S bins, i.e., H_{m,t} = {h_{m,t,s}}, s = 0, 1, …, S-1. Herein, h_{m,t,s} is computed as:

h_{m,t,s} = \sum_j 1(S_j = s), \quad \text{for } \xi_j \in l_m, \; \varphi(\theta'_j) = \Phi_t, \; S_j = \left\lfloor \frac{\xi_j - \eta_{m,l}}{(\eta_{m,u} - \eta_{m,l})/S} \right\rfloor,    (12)

where 1(·) is the indicator function:

1(X) = \begin{cases} 1 & X \text{ is true} \\ 0 & \text{otherwise} \end{cases}.    (13)

Thus, h_{m,t,s} counts the pixels whose differential excitations ξ_j belong to the same interval l_m, whose orientations are quantized to the same dominant orientation Φ_t, and whose computed index S_j equals s. Here, (η_{m,u} - η_{m,l})/S is the width of each bin, and (ξ_j - η_{m,l}) / [(η_{m,u} - η_{m,l})/S] is a linear mapping that maps a differential excitation to its corresponding bin (the value is real before taking the floor).
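The following sketch puts Eqs. (11)-(13) together, assuming per-pixel maps of ξ and of the mapped orientation θ' ∈ [0, 2π) have already been computed (e.g., with the sketches above applied at every pixel); the flattening order (m outermost, then t, then s) follows Fig. 5.

```python
import numpy as np

def wld_histogram(xi_img, theta_img, M=6, T=8, S=20):
    """Concatenated WLD histogram H = {h_{m,t,s}} from per-pixel maps."""
    # Eq. (11): quantize theta' into T dominant orientations.
    t_idx = np.mod(np.floor(theta_img / (2 * np.pi / T) + 0.5), T).astype(int)
    # Eqs. (12)-(13): bin the excitations over [-pi/2, pi/2); the global
    # index over M*S levels splits into the segment m and the bin s.
    xi = np.clip(xi_img, -np.pi / 2, np.pi / 2 - 1e-9)
    ms_idx = np.floor((xi + np.pi / 2) / (np.pi / (M * S))).astype(int)
    m_idx, s_idx = ms_idx // S, ms_idx % S
    H = np.zeros((M, T, S))
    np.add.at(H, (m_idx.ravel(), t_idx.ravel(), s_idx.ravel()), 1)
    return H.reshape(-1)   # concatenated as H = {H_m}, H_m = {H_{m,t}}
```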
We segment the range of ξ into several intervals because different intervals correspond to different variances in a given image. For example, given two pixels P_i and P_j, if their differential excitations satisfy ξ_i ∈ l_0 and ξ_j ∈ l_2, we say that the intensity variance around P_i is larger than that around P_j. That is, flat regions of an image produce values of ξ of small magnitude, while non-flat regions produce values of larger magnitude. Besides flat regions, however, there are two kinds of intensity variation around a central pixel which might lead to small differential excitations: one is clutter noise around the central point; the other is the "uniform" patterns described in [28]. The latter accounts for the majority of such variations, and it can be discriminated by the orientations of the current pixels.

Here, we let M = 6 because we attempt to use these intervals to approximately simulate variances of high, middle and low frequency in a given image. That is, for a pixel P_i, if its differential excitation ξ_i ∈ l_0 or l_5, we say that the variance near P_i is of high frequency; if ξ_i ∈ l_1 or l_4, it is of middle frequency; and if ξ_i ∈ l_2 or l_3, it is of low frequency.

3.2.4. Weights for a WLD histogram

Intuitively, one pays more attention to the variations in a given image than to the flat regions. That is, the different frequency segments H_m play different roles in a classification task, so we can weight them differently for better classification performance.

Table 1. Weights for a WLD histogram
                       H0      H1      H2      H3      H4      H5
Frequency percent      0.2519  0.1179  0.1186  0.0965  0.0875  0.3276
Weights (ω_m)          0.2688  0.0852  0.0955  0.1000  0.1018  0.3487

For weight selection, a heuristic approach is to take into account the different contributions of the frequency segments H_m (m = 0, 1, …, M-1). First, by computing the recognition rate for each sub-histogram H_m separately on a texture dataset collected from the Internet, we obtain M rates R = {r_m}; we then set each weight to ω_m = r_m / Σ_i r_i, as shown in Table 1. We also collect statistics on the percentage of frequencies in each sub-histogram (Table 1). From this table, one can see that the two groups of values (i.e., frequency percent and weights) are very similar. In addition, the high frequency segments H_0 and H_5 contain more frequencies than the middle or low frequency segments (refer to Fig. 4 for details). Note that a high frequency segment H_0 or H_5 taking a larger weight is consistent with the intuition that a good classification feature should pay more attention to the salient variations of an object. Here, besides the large changes at edges or occlusion boundaries within an image, the way the differential excitation of a pixel is computed (i.e., as a sum of the differences of p neighbors against a central pixel) further contributes to the larger weights of H_0 and H_5. A side effect of the weighting, however, is that it might enlarge the influence of noise. One can avoid this disadvantage by removing a few bins at the ends of the high frequency segments, that is, the left ends of H_{0,t} and the right ends of H_{M-1,t} (t = 0, 1, …, T-1).
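A sketch of the weighting step: each frequency segment H_m is scaled by its weight ω_m from Table 1, using the histogram layout produced by the wld_histogram sketch above.

```python
import numpy as np

# omega_m for m = 0..5, taken from Table 1.
WEIGHTS = np.array([0.2688, 0.0852, 0.0955, 0.1000, 0.1018, 0.3487])

def weight_histogram(H, M=6, T=8, S=20):
    """Scale every bin of segment H_m by omega_m before matching."""
    H = H.reshape(M, T, S) * WEIGHTS[:, None, None]  # one weight per H_m
    return H.reshape(-1)
```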
3.3 Characteristics of WLD

The WLD descriptor is based on a psychophysical law: it extracts features from an image by simulating how a human senses his surroundings. Specifically, as shown in Fig. 2, WLD uses the ratio of the intensity differences v_s^{00} to v_s^{01}. As expected, WLD gains a powerful representation ability for textures.

Fig. 6. The upper row shows original images and the bottom row the filtered images. The intensity in a filtered image is the differential excitation, scaled to [0, 255].

As shown in Eq. (2), WLD preserves the differences (v_s^{00}) between the neighbors and a center pixel. Thus, the edges detected by WLD depend on the perceived luminance difference and match the subjective criterion elegantly. For example, v_s^{00} may sometimes be quite large, but if v_s^{00}/v_s^{01} is smaller than a noticeable threshold, there is no noticeable edge; conversely, v_s^{00} may be quite small, but if v_s^{00}/v_s^{01} is larger than the threshold, there is a noticeable edge. In Fig. 6 we show some images filtered by WLD, from which one can see that WLD extracts the edges of images well even under heavy noise (Fig. 6, middle column). In particular, as shown in the last column of Fig. 6, WLD can effectively extract relatively weak edges (textures). Furthermore, results of texture analysis show that much of the discriminative texture information is contained in high spatial frequencies such as edges [28]. Thus, WLD obtains a powerful feature for textures.

WLD is robust to noise in a given image. Specifically, WLD reduces the influence of noise in a way similar to smoothing in image processing: as shown in Fig. 2, the differential excitation is computed from the sum of the differences of the p neighbors against the current pixel, which reduces the influence of individual noisy pixels. Moreover, this sum is further divided by the intensity of the current pixel, which also decreases the influence of noise in the image.

The feature is also designed to reduce the effects of illumination change. On one side, WLD computes the differences v_s^{00} between a current pixel and its neighbors; thus a brightness change in which a constant is added to each image pixel does not affect the difference values. On the other side, WLD performs the division between v_s^{00} and v_s^{01}; thus a contrast change in which each pixel value is multiplied by a constant multiplies the differences by the same constant, and this contrast change is canceled by the division. Therefore, the descriptor is robust to changes in illumination. Furthermore, regrouping the differential excitations and orientations into a 2D histogram and then weighting the different frequency segments further improves the performance of WLD.

3.4. Comparison with existing descriptors

Using the FLS framework described in Section 2, we can easily compare our descriptor to existing ones, as shown in Table 2. LBP computes the intensity differences between the center pixel y_c and its neighbors in the first stage; the responses for each neighbor are thresholded at zero and then concatenated into a binary string in the second stage; and the binary strings of all pixels are used to compute a histogram feature in the last stage.
SIFT computes the gradient magnitude and orientation at each image sample point in a region around the keypoint location in the first stage. The orientations are quantized to 8 dominant ones in the second stage. In the third stage, the gradient magnitudes are weighted by a Gaussian window and then accumulated into orientation histograms summarizing the contents over designated subregions.

Table 2. Comparison of LBP, SIFT and WLD according to the framework.
       Filtering             Labeling                  Statistics
LBP    Intensity difference  Thresholding at zero      Histogram over the binary strings
SIFT   Intensity difference  Orientation quantization  Histogram over the weighted gradients on dominant orientations
WLD    Weber fraction        Orientation quantization  Histogram over the differential excitations on dominant orientations

Although WLD, like LBP, computes the differences between the center pixel y_c and its neighbors in the first stage, these differences are summed and then divided by the center pixel y_c to obtain the differential excitation, like the Weber fraction. Different from LBP, WLD uses the median of the gradient orientations to describe the direction of edges; these medians are then quantized to 8 dominant orientations in the second stage. Different from SIFT, we use the differential excitation, not the weighted gradient, to compute the histogram. Moreover, differential excitations are not accumulated over designated subregions around a keypoint location; instead, we compute the frequency of occurrence of the differential excitations for each bin of the histogram. Although we weight the WLD histogram as shown in Table 1, the weighted object is the frequency of each bin, and the weights are computed from recognition performance statistics; we do not weight gradient values by the distances between the neighbors and the keypoint as SIFT does [21].

Furthermore, we also compare the time complexity of WLD with LBP and SIFT theoretically. Given an m×n image, the time complexity of WLD is simply

O_{WLD} = C_1 mn,    (14)

where C_1 is a constant accounting for the computation at each pixel in WLD (several additions, divisions and filtering by an arctangent function). Likewise, the time complexity of LBP is also very simple:

O_{LBP} = C_2 mn,    (15)

where C_2 is also a constant, accounting for the computation at each pixel in LBP (several additions and multiplications). The time complexity of SIFT, however, is a little more complex:

O_{SIFT} = C_{31}(\alpha\beta)(pq)(mn) + C_{32}k_1 + C_{33}k_2 st + C_{34}k_2 st.    (16)

Here, the four terms correspond to the four steps of SIFT: detection of scale-space extrema, accurate keypoint localization, orientation assignment, and building the local image descriptor; C_{3i} (i = 1, …, 4) are four constants. The first term represents the convolution of a variable-scale Gaussian with the given image, where p and q are the size of the convolution template and α and β are the number of octaves and of scales per octave, respectively. In the second term, k_1 is the number of keypoint candidates. In the third and fourth terms, k_2 is the number of keypoints, and s and t are the size of the support region of each keypoint (e.g., s, t = 16). So we have:

O_{SIFT} ≈ C_{31}(\alpha\beta)(pq)(mn).    (17)

Comparing Eqs. (14), (15) and (17), one can see that both LBP and WLD extract image features more efficiently than SIFT.
4. Application to Texture Classification

In this section, we use WLD features for texture classification and compare both the performance and the computational efficiency with those of state-of-the-art methods.

4.1. Background

Texture classification plays an important role in many applications, such as robot vision, remote sensing, crop classification, content-based access to image databases, and automatic tissue recognition in medical images. This has led to wide interest in investigating the theoretical as well as practical issues of texture classification.

Several approaches for the extraction of texture features have been proposed. On one hand, there are recent attempts using sparse descriptors for this task, such as [11, 19]. Dorkó and Schmid optimized the keypoint detection and then used SIFT for the image representation [11]. Lazebnik et al. presented a probabilistic part-based approach to describe textures and objects [19]. On the other hand, there are also several attempts using dense descriptors, such as [16, 22, 27, 28, 29, 39, 40]. Jalba et al. presented a multi-scale method based on mathematical morphology [16]. Manjunath and Ma used Gabor filters for texture analysis, showing that they exhibit excellent discrimination over a broad range of textures [22]. Ojala et al. proposed to use signed gray-level differences and their multidimensional distributions for texture description [29]; the original LBP is a simplification of this, discarding the contrast of local image texture [27, 28]. Urbach et al. described a multiscale and multishape morphological method for pattern-based analysis and classification of gray-scale images using connected operators [39]. Varma and Zisserman used the joint distribution of intensity values over extremely compact neighborhoods (starting from as small as 3×3 pixel squares) as image descriptors for texture classification [40].

4.2. Dataset

Experiments are carried out on two different texture databases: Brodatz [4] and KTH-TIPS2-a [6]. Examples of the 32 Brodatz [29] textures used in the experiments are shown in Fig. 7 (a). The images are 256×256 pixels in size with 256 gray levels. Each image was divided into 16 disjoint 64×64 samples, which were independently histogram-equalized to remove luminance differences between textures. To make the classification problem more challenging and generic, and to make comparison possible, we use the same experimental set-up as [16, 29, 39]. Three additional samples were generated from each sample: (i) a sample rotated by 90 degrees, (ii) a 64×64 scaled sample obtained from the 45×45 pixels in the middle of the original sample, and (iii) a sample that was both rotated and scaled. Consequently, the entire data set, which we refer to as the Brodatz data set, comprises 2,048 samples, 64 in each of the 32 texture categories.

The KTH-TIPS2-a database contains 4 physical, planar samples of each of 11 materials under varying illumination, pose and scale. Some examples from each sample are shown in Fig. 7 (b). The KTH-TIPS2-a texture dataset contains 11 texture classes with 4,395 images. The images are 200×200 pixels in size (we did not include those images which are not of this size) and they are transformed into 256 gray levels. The database contains images at 9 scales, under four different illumination directions, and in three different poses.

Fig. 7. Some examples from the two texture datasets: (a) Brodatz and (b) KTH-TIPS2-a.

Note that for the two texture datasets (i.e., Brodatz and KTH-TIPS2-a), experiments are carried out with ten-fold cross validation to avoid bias. For each round, we randomly divide the samples in each class into two subsets of the same size, one for training and the other for testing. In this fashion, the images belonging to the training set and to the test set are disjoint. The results are reported as the average value and standard deviation over the ten runs.
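A sketch of this protocol, assuming the per-image WLD histograms and integer class labels are stored in arrays; `classify` is a placeholder for the 3-NN classifier described in Section 4.3.

```python
import numpy as np

def ten_fold_protocol(features, labels, classify, rng=np.random.default_rng(0)):
    """Ten rounds of random half/half per-class splits; returns mean, std."""
    accuracies = []
    for _ in range(10):
        train_idx, test_idx = [], []
        for c in np.unique(labels):
            idx = rng.permutation(np.where(labels == c)[0])
            half = len(idx) // 2
            train_idx.extend(idx[:half])   # half of each class for training
            test_idx.extend(idx[half:])    # the other half for testing
        acc = classify(features[train_idx], labels[train_idx],
                       features[test_idx], labels[test_idx])
        accuracies.append(acc)
    return np.mean(accuracies), np.std(accuracies)
```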
4.3. WLD Feature for Classification

To perform texture classification, two essential issues must be addressed: texture representation and classifier design. We use the WLD feature as the representation and build a system for texture classification. For texture representation, given an image, we extract WLD features as shown in Fig. 5. Here, we empirically let M = 6, T = 8, S = 20. In addition, we weight each sub-histogram H_m using the weights shown in Table 1. As the classifier we use the K-nearest neighbor rule, which has been successfully utilized in classification; in our case, K = 3. To compute the distance between two given images I_1 and I_2, we first obtain their WLD feature histograms H_1 and H_2 and then measure the similarity between them. In our experiments, we use the normalized histogram intersection Π(H_1, H_2) as the similarity measure of two histograms [36]:

\Pi(H_1, H_2) = \sum_{i=1}^{L} \min(H_{1,i}, H_{2,i}),    (18)

where L is the number of bins in a histogram. The intuitive motivation for this measure is to calculate the common parts of two histograms. Note that the input histograms H_i (i = 1, 2) are normalized.
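A sketch of the matching step under these definitions: Eq. (18) as the similarity and a 3-nearest-neighbor majority vote. The histograms are assumed to be L1-normalized and the labels are assumed to be non-negative integers.

```python
import numpy as np

def intersection(h1, h2):
    """Normalized histogram intersection Pi(H1, H2), Eq. (18)."""
    return np.minimum(h1, h2).sum()

def knn_classify(train_hists, train_labels, query_hist, k=3):
    """Label the query by majority vote among its k most similar samples."""
    sims = np.array([intersection(query_hist, h) for h in train_hists])
    top = np.argsort(sims)[-k:]              # k most similar training samples
    votes = train_labels[top]
    return np.bincount(votes).argmax()       # majority vote
```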
4.4. Experimental Results

Experimental results on the Brodatz and KTH-TIPS2-a textures are illustrated in Fig. 8. Herein, the accuracy of our method is given as the percentage of correct classifications:

accuracy = \frac{\#\text{correct classifications}}{\#\text{total images}}.    (19)

As shown in Fig. 8 (a), we compare our method with others on the classification task of the Brodatz textures: SIFT, Jalba [16], Ojala [29] (i.e., signed gray-level difference (SD) and LBP), Urbach [39], and Manjunath [22] (i.e., Gabor). Note that all the results by other methods in Fig. 8 are quoted directly from the original papers except those of Manjunath [22] and SIFT. The approach in [22] is a "traditional" texture analysis method using the global mean and standard deviation of the responses of Gabor filters; we use the results in [29] as a substitute, in which Ojala et al. use the same set-up of Gabor filters with 6 orientations and 4 scales. SIFT was re-implemented by us according to [11], in which the keypoint detection is optimized to achieve stable local descriptors; we also refer to the code by A. Vedaldi (http://vision.ucla.edu/~vedaldi/code/siftpp/siftpp.html).

Fig. 8. Comparison with state-of-the-art methods on the (a) Brodatz and (b) KTH-TIPS2-a textures; the values above the bars are the accuracies and the corresponding standard deviations (on Brodatz: 97.5 (0.6) for WLD versus 91.4 (0.6) for SIFT and 95.9 (0.9) for Gabor; on KTH-TIPS2-a: 96.3 (0.5) for WLD versus 88.6 (1.1) for SIFT and 93.8 (0.6) for LBP).

As shown in Fig. 8 (b), we also compare our method with SIFT and LBP on the classification task of the KTH-TIPS2-a textures. Likewise, both SIFT and LBP are re-implemented by us. However, in the implementation of SIFT, we use the Laplacian to detect keypoints for Fig. 8 (a), while for Fig. 8 (b) we employ the Harris detector, following the idea in [11]. From Fig. 8, one can see that our approach works in a very robust way in comparison to the other methods; moreover, the standard deviation of WLD is smaller than those of the other methods. Furthermore, although we rotated and scaled the subimages of the Brodatz textures as mentioned in Section 4.2, and although there are diverse variations in KTH-TIPS2-a (i.e., poses, scales and illuminations), we still obtain favorable results. This shows that WLD extracts powerful discriminative features which are robust to rotation, scaling and illumination. The poorer performance of SIFT can partly be explained by the small image size (e.g., 64×64 in the Brodatz database) for a sparse descriptor: too few keypoints may be located, degrading performance. Due to its robustness to illumination, LBP performs better on the KTH-TIPS2-a textures than on the histogram-equalized Brodatz samples.

In addition, we also use the two components of WLD (differential excitation ξ and orientation θ) as features separately and compute their histograms on the Brodatz textures for classification. The classification rates are 90.6% and 89.3%, respectively. That is, both the differential excitation ξ and the orientation θ provide effective information for texture classification.

4.5 The effects of using different M, S, and T parameters

In this section, we discuss the influence of the parameter settings of M, S, and T on the final results. For a histogram-based method, the setting of M, S, and T is a tradeoff between discriminability and statistical reliability. In general, if these parameters become larger, the dimensionality of the histogram (i.e., the number of its bins) grows and the histogram becomes more discriminative; however, in a real application the entries of each bin become smaller because the size of the input image/patch is fixed, which degrades the statistical reliability of the histogram. If the entries of each bin become too small, this in turn degrades the discriminability of the histogram because of its poor statistical reliability. In contrast, if these parameters become smaller, the entries of each bin become larger and the histogram becomes statistically more reliable; but if the parameters are too small, the dimensionality of the histogram also becomes too small, which again degrades its discriminability. The experimental results on the Brodatz textures obtained by varying the parameters are plotted in Fig. 9. In these experiments, we vary one parameter at a time and fix the others. One can see that the performance changes only slightly, which shows that the histogram-based method is relatively robust even though its dimensionality changes considerably.

Fig. 9. The effects of using different M, S, and T parameters: accuracy on the Brodatz textures as a function of (a) M, (b) T and (c) S.

5. Application to face detection

In this section, we use WLD features for human face detection. Although we train only one classifier, we use it to detect frontal, occluded and profile faces. Furthermore, experimental results show that this classifier obtains performance comparable to state-of-the-art methods.

5.1. Background
The goal of face detection is to determine whether there are any faces in a given image and, if one or more faces are present, to return the location and extent of each face. Recently, many methods for detecting faces have been proposed, and most of them extract features densely [45]. Among these methods, learning-based approaches that capture the variations in facial appearance have attracted much attention, such as [33, 35]. One of the most important advances is the appearance of boosting-based methods, such as [14, 20, 30, 41, 43, 44]. In addition, Hadid et al. use the local binary pattern (LBP) for face detection and recognition [13], and Garcia and Delakis use a convolutional face finder for fast and robust face detection [12].

5.2. WLD Feature for Face Samples

We use WLD features as the facial representation and build a face detection system. For the facial representation, as illustrated in Fig. 10, we divide an input sample into overlapping regions and apply a p-neighborhood WLD operator (here, p = 8). In our case, we normalize the samples to w×h (i.e., 32×32) pixels and derive the WLD representation as follows.

Fig. 10. An illustration of the WLD histogram feature for face detection.

We divide a face sample of size w×h into K overlapping blocks (here, K = 9 in our experiments) of size (w/2)×(h/2) pixels; the overlap is w/4 pixels. For each block, we compute a concatenated histogram H^k, k = 0, 1, …, K-1. Each H^k is computed as shown in Fig. 5: it is a concatenation of M sub-histograms H^k_m, m = 0, 1, …, M-1, and each H^k_m is in turn a concatenation of T histogram segments H^k_{m,t}, t = 0, 1, …, T-1, where H^k_{m,t} is an S-bin histogram segment. For this group of experiments, we empirically let M = 6, T = 4, S = 3; thus each H^k is a 72-bin histogram. (For each sub-histogram H^k_m, we use the same weights as shown in Table 1.) A sketch of this block decomposition is given below.

For each block, we train a support vector machine (SVM) classifier on the H^k histogram feature to verify whether the kth block is a valid face block (we use a second-degree polynomial kernel for the SVM classifier). If the number of valid face blocks is larger than a given threshold Ξ, we say that a face exists in the input window. The value of Ξ is a tradeoff between the detection rate and the false alarms of the face detector: when Ξ becomes larger, the detection rate decreases but the false alarms also decrease; when Ξ becomes smaller, the detection rate increases but the false alarms also increase. The value of Ξ also varies with the pose of the faces. For more details, please refer to Section 5.4.
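The block decomposition sketched below assumes the per-pixel ξ and θ' maps of a 32×32 sample and reuses the wld_histogram sketch from Section 3.2.3; with 16×16 blocks and a stride of w/4 = 8 pixels there are 3×3 = 9 blocks.

```python
import numpy as np

def face_block_histograms(xi_img, theta_img, w=32, h=32):
    """K = 9 overlapping block histograms H^k (72 bins each: M=6, T=4, S=3)."""
    bw, bh, stride = w // 2, h // 2, w // 4
    hists = []
    for y in range(0, h - bh + 1, stride):       # y = 0, 8, 16
        for x in range(0, w - bw + 1, stride):   # x = 0, 8, 16
            hists.append(wld_histogram(xi_img[y:y+bh, x:x+bw],
                                       theta_img[y:y+bh, x:x+bw],
                                       M=6, T=4, S=3))
    return hists                                  # list of K histograms H^k
```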
5.3. Dataset

The training set is composed of two parts: a positive set S_f and a negative set S_n. The positive set consists of 50,000 frontal face samples collected from the web, video and digital cameras, covering wide variations in pose, facial expression and lighting conditions. To make the detection method robust to affine transforms, the training samples are rotated, translated and scaled [33]; after this preprocessing, the set S_f includes 100,000 face samples. The negative set S_n consists of 31,085 images containing no faces, collected from the Internet.

As for the test sets, we use three. The first is the MIT+CMU frontal face test set, which consists of 130 images showing 507 upright faces [33]. The second is a subset of the Aleix Martinez-Robert (AR) face database [23]. The AR face database consists of over 3,200 color images of the frontal-view faces of 126 subjects; we choose those images with occlusions (i.e., conditions 8-13 from the first session and conditions 21-26 from the second session), so the resulting test set consists of 1,512 images. The third is the CMU profile test set [35] (441 multiview faces in 208 images). Note that the face samples are of size 32×32; in order to detect faces smaller or larger than the sample size, we enlarge and shrink each input image.

5.4. Classifier Training

As described in Section 5.3, the set S_f is composed of a large number of face samples, and hundreds of thousands of non-face samples can be extracted from the set S_n, so it would be extremely time consuming to train an SVM classifier on the full sets S_f and S_n. To address this problem, we use a resampling method to train the SVM classifier. Specifically, motivated by [44], we resample both the positive and the negative samples during classifier training. For the positive samples, we first randomly pick a subset S_f1 of size N_p (in our experiments, N_p = 3,000). Likewise, we randomly crop a subset S_n1 of size N_n (in our experiments, N_n = 3,000) from the non-face database S_n; the samples in S_n1 are normalized to w×h (i.e., 32×32) pixels. Subsequently, we extract WLD histogram features of both the face and non-face samples as shown in Fig. 10. Using the extracted features, we train a low-performance SVM classifier and obtain a support-vector set S^1, which includes a face support-vector subset S^1_f and a non-face support-vector subset S^1_n. We then run this low-performance SVM classifier on the two training sets (i.e., S_f and S_n) to collect N_p misclassified face samples S_f2 and N_n misclassified non-face samples S_n2. Combining the newly collected sample sets (i.e., S_f2 and S_n2) with the two support-vector subsets obtained in the previous round (i.e., S^1_f and S^1_n), we obtain two new training sets, (S^1_f + S_f2) and (S^1_n + S_n2), and train another SVM classifier with better performance. After several iterations of this procedure, we finally obtain a well-performing SVM classifier. Note that we actually train K SVM sub-classifiers, each corresponding to one block as shown in Fig. 10; combining these K sub-classifiers, we obtain the final strong classifier.
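A sketch of this resampling loop for one block's features, assuming scikit-learn's SVC with a second-degree polynomial kernel; the array names and the number of rounds are ours, not from the paper. face_feats and nonface_feats hold the WLD block features of all face and non-face samples.

```python
import numpy as np
from sklearn.svm import SVC  # assumed implementation of the SVM

def bootstrap_train(face_feats, nonface_feats, n_rounds=5, n=3000,
                    rng=np.random.default_rng(0)):
    # Initial random subsets S_f1 and S_n1 (labels: 1 = face, 0 = non-face).
    X_pos = face_feats[rng.choice(len(face_feats), n, replace=False)]
    X_neg = nonface_feats[rng.choice(len(nonface_feats), n, replace=False)]
    sv_X, sv_y = np.empty((0, face_feats.shape[1])), np.empty(0)
    for _ in range(n_rounds):
        X = np.vstack([X_pos, X_neg, sv_X])
        y = np.concatenate([np.ones(len(X_pos)), np.zeros(len(X_neg)), sv_y])
        clf = SVC(kernel='poly', degree=2).fit(X, y)
        # Keep the support vectors; refill with currently misclassified samples.
        sv_X, sv_y = clf.support_vectors_, y[clf.support_]
        X_pos = face_feats[clf.predict(face_feats) == 0][:n]     # missed faces
        X_neg = nonface_feats[clf.predict(nonface_feats) == 1][:n]  # false hits
    return clf
```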
5.5. Experimental Results

The resulting final strong SVM classifier is tested on the three test sets described in Section 5.3. Experimental results are shown in Figs. 11 and 12 and Table 3. Herein, we also compare the performance of the resulting SVM classifier (called "SVM-WLD") with some existing methods. Note that all the results by other methods in Figs. 11 and 12 are quoted directly from the original papers except Hadid [13], which we implemented following their idea. During testing, the parameter Ξ takes different values as described in Section 5.2: for the MIT+CMU frontal test set, the AR test set and the CMU profile test set, Ξ is equal to 8, 7 and 6, respectively.

Fig. 11. ROC curves (correct detection rate versus false positives) comparing our method with some existing methods (Bourdev, Garcia, Hadid, Huang, Lin, Viola, SVM-Grey, SVM-WLD) on the MIT+CMU frontal face test set.

Table 3. Performance of our method on the AR test set.
Detection rate    False alarms
99.74%            0
100%              3

Fig. 12. ROC curves (correct detection rate versus false positives) comparing our method with some existing methods (Huang, Schneiderman, SVM-WLD) on the CMU profile test set.

As shown in Fig. 11, we compare the performance of our method on the MIT+CMU frontal face test set with some existing methods: Bourdev [48], Garcia [12], Hadid [13], Huang [49], Lin [20], and Viola [41]. Note that Lin et al. also proposed a method to detect occluded faces. "SVM-Grey" denotes using only the grey intensities as input to the SVM classifier, with all other experimental set-ups the same as for "SVM-WLD". From Fig. 11, one can see that SVM-WLD locates 89.9% of the faces without any false alarms and works much better than SVM-Grey (76.4% of the faces without a false alarm). Furthermore, SVM-WLD is comparable to the existing methods, e.g., Lin [20].

Fig. 13. Some results of our detector on the MIT+CMU frontal test set (first row), the AR database (second row) and the CMU profile test set (third row).

As shown in Table 3, SVM-WLD locates 99.74% of the faces in the AR test set without any false alarms, and locates all faces with only 3 false alarms. In addition, in Fig. 12, we compare SVM-WLD with some existing methods, such as Huang [14] and Schneiderman [35], on the CMU profile test set. To locate profile faces with in-plane rotation, we also rotate the test images. From Fig. 12, one can see that SVM-WLD locates 84.58% of the faces without any false alarms, the best result reported so far to our knowledge.

However, different criteria (e.g., the training examples involved and the number of windows scanned during detection) can be used to favor one method over another, which makes it difficult to evaluate the performance of different methods even when they use the same benchmark data sets [45]. Thus, the results shown in Figs. 11 and 12 simply illustrate that our method works robustly and achieves performance comparable to the state-of-the-art methods. Some results of our detector on the three test sets are shown in Fig. 13.

6. Conclusion

We have proposed a novel discriminative descriptor called WLD, inspired by Weber's Law, a law developed from the study of human perception. We organize the WLD features into a histogram by encoding both the differential excitations and the orientations at certain locations. Experimental results clearly show that WLD achieves favorable performance on both the Brodatz and KTH-TIPS2-a textures compared to state-of-the-art methods: we achieve 97.5% accuracy with WLD compared to 91.4% for SIFT and 95.9% for Gabor on the Brodatz dataset, and, despite the diverse variations in the KTH-TIPS2-a set (i.e., pose, scale and illumination), we obtain 96.3% accuracy compared to 88.6% for SIFT and 93.8% for LBP. Besides the performance comparison, we also compared the computational complexity of WLD with LBP and SIFT theoretically; the analysis shows that WLD is much faster than SIFT and comparable to LBP. For the face detection task in particular, we train only one classifier, yet it can be used to detect frontal, occluded and profile faces.
The results on the three test sets, i.e., the MIT+CMU frontal face test set, the Aleix Martinez-Robert (AR) face database and the CMU profile test set, demonstrate the effectiveness of the proposed method through experiments and comparisons with other existing face detectors. The trained detector achieves 89.9% accuracy with no false alarms on the MIT+CMU frontal face test set. For the AR test set, our trained face detector locates 99.74% of the faces without any false alarms and locates all faces with only 3 false alarms. Furthermore, for the CMU profile test set, our detector locates 84.58% of the faces without any false alarms, the best result reported so far to our knowledge.

The current solution is developed for texture classification and face detection; it has not yet been exploited for face recognition or object recognition. Future work will extend the solution to these fields. In addition, exploring a multiscale version of WLD is also of future interest.

References

[1] T. Ahonen, A. Hadid and M. Pietikäinen. Face Description with Local Binary Patterns: Application to Face Recognition. PAMI, 2006.
[2] J. Anderson. An Introduction to Neural Networks. The MIT Press, Cambridge, MA, 1995.
[3] H. Bay, T. Tuytelaars and L. van Gool. SURF: Speeded Up Robust Features. ECCV, 2006.
[4] P. Brodatz. Textures: A Photographic Album for Artists and Designers. Dover Publications, New York, 1966.
[5] V. Bruni and D. Vitulano. A Generalized Model for Scratch Detection. IEEE TIP, 2004.
[6] B. Caputo, E. Hayman and P. Mallikarjuna. Class-Specific Material Categorisation. ICCV, 2005.
[7] J. Chen, S. Shan, G. Zhao, X. Chen, W. Gao and M. Pietikäinen. A Robust Descriptor based on Weber's Law. CVPR, 2008.
[8] H. Cheng, Z. Liu, N. Zheng and J. Yang. A Deformable Local Image Descriptor. CVPR, 2008.
[9] C. He, T. Ahonen and M. Pietikäinen. A Bayesian Local Binary Pattern Texture Descriptor. Submitted to ICPR, 2008.
[10] N. Dalal and B. Triggs. Histograms of Oriented Gradients for Human Detection. CVPR, 2005.
[11] G. Dorkó and C. Schmid. Maximally Stable Local Description for Scale Selection. ECCV, 2006.
[12] C. Garcia and M. Delakis. Convolutional Face Finder: A Neural Architecture for Fast and Robust Face Detection. PAMI, 2004.
[13] A. Hadid, M. Pietikäinen and T. Ahonen. A Discriminative Feature Space for Detecting and Recognizing Faces. CVPR, 2004.
[14] C. Huang, H. Ai, Y. Li and S. Lao. High-Performance Rotation Invariant Multiview Face Detection. PAMI, 2007.
[15] A. K. Jain. Fundamentals of Digital Image Processing. Prentice-Hall, Englewood Cliffs, NJ, 1989.
[16] A. C. Jalba, M. H. F. Wilkinson and J. B. T. M. Roerdink. Morphological Hat-transform Scale Spaces and Their Use in Pattern Classification. PR, 2004.
[17] Y. Ke and R. Sukthankar. PCA-SIFT: A More Distinctive Representation for Local Image Descriptors. CVPR, 2004.
[18] S. Lazebnik, C. Schmid and J. Ponce. A Sparse Texture Representation Using Local Affine Regions. CVPR, 2003.
[19] S. Lazebnik, C. Schmid and J. Ponce. A Maximum Entropy Framework for Part-Based Texture and Object Recognition. ICCV, 2005.
[20] Y.-Y. Lin, T.-L. Liu and C.-S. Fuh. Fast Object Detection with Occlusions. ECCV, 2004.
[21] D. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. IJCV, 2004.
[22] B. Manjunath and W. Ma. Texture Features for Browsing and Retrieval of Image Data. PAMI, 1996.
[23] A. M. Martinez and R. Benavente. The AR Face Database. CVC Technical Report #24, 1998.
[24] P. Moreels and P. Perona. Evaluation of Feature Detectors and Descriptors based on 3D Objects. IJCV, 2007.
[25] K. Mikolajczyk and C. Schmid. A Performance Evaluation of Local Descriptors. PAMI, 2005.
[26] K. Mikolajczyk and J. Matas. Improving Descriptors for Fast Tree Matching by Optimal Linear Projection. ICCV, 2007.
[27] T. Ojala, M. Pietikäinen and D. Harwood. A Comparative Study of Texture Measures with Classification Based on Feature Distributions. PR, 1996.
[28] T. Ojala, M. Pietikäinen and T. Mäenpää. Multiresolution Gray-Scale and Rotation Invariant Texture Classification with Local Binary Patterns. PAMI, 2002.
[29] T. Ojala, K. Valkealahti, E. Oja and M. Pietikäinen. Texture Discrimination with Multidimensional Distributions of Signed Gray-Level Differences. PR, 2001.
[30] M.-T. Pham and T.-J. Cham. Fast Training and Selection of Haar Features Using Statistics in Boosting-based Face Detection. ICCV, 2007.
[31] T. Phiasai, P. Kumhom and K. Chamnongthai. A Digital Image Watermarking Technique Using Prediction Method and Weber Ratio. International Symposium on Communications and Information Technologies, 2004.
[32] T. Randen and J. H. Husøy. Filtering for Texture Classification: A Comparative Study. PAMI, 1999.
[33] H. A. Rowley, S. Baluja and T. Kanade. Neural Network-Based Face Detection. PAMI, 1998.
[34] Y. Rodriguez and S. Marcel. Face Authentication Using Adapted Local Binary Pattern Histograms. ECCV, 2006.
[35] H. Schneiderman and T. Kanade. A Statistical Method for 3D Object Detection Applied to Faces and Cars. CVPR, 2000.
[36] M. Swain and D. Ballard. Color Indexing. IJCV, 1991.
[37] X. Tan and B. Triggs. Enhanced Local Texture Feature Sets for Face Recognition under Difficult Lighting Conditions. Analysis and Modelling of Faces and Gestures, 2007.
[38] E. Tola, V. Lepetit and P. Fua. A Fast Local Descriptor for Dense Matching. CVPR, 2008.
[39] E. R. Urbach, J. B. T. M. Roerdink and M. H. F. Wilkinson. Connected Shape-Size Pattern Spectra for Rotation and Scale-Invariant Classification of Gray-Scale Images. PAMI, 2007.
[40] M. Varma and A. Zisserman. Texture Classification: Are Filter Banks Necessary? CVPR, 2003.
[41] P. Viola and M. Jones. Rapid Object Detection Using a Boosted Cascade of Simple Features. CVPR, 2001.
[42] S. Winder and M. Brown. Learning Local Image Descriptors. CVPR, 2007.
[43] R. Xiao, H. Zhu, H. Sun and X. Tang. Dynamic Cascades for Face Detection. ICCV, 2007.
[44] S. Yan, S. Shan, X. Chen, W. Gao and J. Chen. Matrix-Structural Learning (MSL) of Cascaded Classifier from Enormous Training Set. CVPR, 2007.
[45] M.-H. Yang, D. Kriegman and N. Ahuja. Detecting Faces in Images: A Survey. PAMI, 2002.
[46] G. Zhao and M. Pietikäinen. Dynamic Texture Recognition Using Local Binary Patterns with an Application to Facial Expressions. PAMI, 2007.
[47] W. Zhang, S. Shan, W. Gao, X. Chen and H. Zhang. Local Gabor Binary Pattern Histogram Sequence (LGBPHS): A Novel Non-Statistical Model for Face Representation and Recognition. ICCV, 2005.
[48] L. Bourdev and J. Brandt. Robust Object Detection via Soft Cascade. CVPR, 2005.
[49] C. Huang, H. Ai, T. Yamashita, S. Lao and M. Kawade. Incremental Learning of Boosted Face Detector. ICCV, 2007.