WLD: A Robust Local Image Descriptor
Jie Chen^{1,2,3}, Shiguang Shan^1, Chu He^{2,4}, Guoying Zhao^2, Matti Pietikäinen^2, Xilin Chen^1, Wen Gao^{1,3}
1 Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences (CAS), Beijing, 100080, China
2 Machine Vision Group, Department of Electrical and Information Engineering, P.O. Box 4500, FI-90014 University of Oulu, Finland
3 School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, China
4 School of Electronic Information, Wuhan University, Wuhan, 430079, China
{jiechen, gyzhao, mkp}@ee.oulu.fi, hc@eis.whu.edu.cn, {sgshan, xlchen, wgao}@jdl.ac.cn
Abstract
Inspired by Weber's Law, this paper proposes a simple yet very powerful and robust local descriptor, called the Weber Local Descriptor (WLD). It is based on the fact that human perception of a pattern depends not only on the change of a stimulus (such as sound or lighting) but also on the original intensity of the stimulus. Specifically, WLD consists of two components: differential excitation and orientation. The differential excitation is a function of the ratio between two terms: one is the relative intensity difference of a current pixel against its neighbors; the other is the intensity of the current pixel. The orientation is the gradient orientation of the current pixel. For a given image, we use the differential excitation and orientation components to construct a concatenated WLD histogram feature. Experimental results on the Brodatz and KTH-TIPS2-a texture databases show that WLD impressively outperforms other classical descriptors (e.g., Gabor and SIFT). In particular, experimental results on human face detection show promising performance: although we train only one classifier based on WLD features, it obtains performance comparable to state-of-the-art methods on the MIT+CMU frontal face test set, the AR face dataset, and the CMU profile test set.
1. Introduction
Recently there has been much interest in object and view matching using local invariant features [21], in the classification of textured regions using micro-textons [28], and in face detection using local features [41]. Several studies have evaluated the performance of these methods, e.g., [24, 25, 27, 32]. The proposed methods can be divided into two classes. One is the sparse descriptor, which first detects interest points in a given image, then samples a local patch around each and describes its invariant features [24, 25]; the other is the dense descriptor, which extracts local features from an input image pixel by pixel [27, 32].
For sparse descriptors, a typical one is the scale-invariant feature transform (SIFT), introduced by Lowe [21]. It performs best in the context of matching and recognition [25] due to its invariance to scaling and rotation. Several attempts to improve SIFT have been reported in the literature. Ke and Sukthankar developed the PCA-SIFT descriptor, which represents local appearance by principal components of the normalized gradient field [17]. Mikolajczyk and Schmid modified the SIFT descriptor by changing the gradient location-orientation grid as well as the quantization parameters of the histograms [25]. The descriptor was then projected to a lower-dimensional space with PCA. This modification slightly improved performance in matching tests. Dalal and Triggs proposed histograms of oriented gradients (HOG) [10]. HOG differs from SIFT in several aspects, such as the normalization, the spatial distribution, and the size of the bins. Lazebnik et al. proposed a rotation-invariant version called the Rotation Invariant Feature Transform (RIFT) [18], which avoids the dominant gradient orientation estimation required by SIFT at the cost of lower discriminative power. Bay et al. proposed an efficient implementation of SIFT by applying the integral image to compute image derivatives and quantizing the gradient orientations into a small number of histogram bins [3]. Winder and Brown learned optimal parameter settings on a large training set to maximize matching performance [42]. Mikolajczyk and Matas developed an optimal linear projection to improve the matching quality and speed of SIFT descriptors [26]. Likewise, to improve the efficiency of the local descriptor, Tola et al. replaced the weighted sums used by SIFT with sums of convolutions [38]. In addition, Cheng et al. proposed to use multiple support regions of different sizes surrounding an interest point, since the amount of deformation at each location varies unpredictably, making it difficult to choose a single best scale for the support region [8].
Among the most popular dense descriptors are Gabor filters [22] and the local binary pattern (LBP) [28]. The Gabor representation has been shown to be optimal in the sense of minimizing the joint two-dimensional uncertainty in space and frequency [22]. These filters can be considered orientation- and scale-tunable edge and line (bar) detectors, and the statistics of these micro-features in a given region are often used to characterize the underlying texture information. Gabor filters have been widely used in image analysis applications including texture classification and segmentation, image registration, motion tracking [22], and face recognition [47]. Another important dense local descriptor is LBP, which has gained increasing attention due to its simplicity and excellent performance in various texture and face image analysis tasks [28]. Many variants of LBP have recently been proposed and have achieved considerable success in various tasks. Ahonen et al. exploited LBP for face recognition [1]. Rodriguez and Marcel proposed adapted local binary pattern histograms for face authentication [34]. Tan and Triggs modified the thresholding scheme for face recognition under difficult lighting conditions [37]. Zhao and Pietikäinen proposed local binary patterns on three orthogonal planes and used them for dynamic texture recognition [46]. Zhang et al. proposed the local Gabor binary pattern histogram sequence as a non-statistical model for face representation and recognition [47].
In this paper, we propose a simple yet very powerful and robust local descriptor. It is inspired by Weber's Law, a psychological law [15], which states that the change of a stimulus (such as sound or lighting) that will be just noticeable is a constant ratio of the original stimulus. When the change is smaller than this constant ratio, a human would perceive it as background noise rather than a valid signal. Motivated by this, the proposed Weber Local Descriptor (WLD) is computed based on the ratio between two terms: one is the relative intensity difference of a current pixel against its neighbors; the other is the intensity of the current pixel. We then represent the input image by a histogram of differential excitations and orientations. Thus, WLD is a dense descriptor [7].
Although the proposed WLD descriptor borrows advantages from SIFT (i.e., computing a histogram using the gradient and its orientation) and LBP (i.e., computational efficiency and small support regions around 3×3), the differences are obvious. As mentioned above, the SIFT descriptor is a 3D histogram of gradient locations and orientations, in which two dimensions correspond to the image spatial dimensions and the third to the image gradient direction. It is a sparse descriptor, computed for interest regions (located around detected interest points) that have usually already been normalized with respect to scale and rotation. Texture classification with SIFT is done using the information in these sparsely located interest regions, e.g., [11]. WLD, on the other hand, is a dense descriptor computed for every pixel. It is a 2D histogram of local properties that depend both on the local intensity variation and on the intensity of the center pixel. Texture classification with WLD is done using these 2D feature histograms. The granularity of the statistical pattern is much finer than that of SIFT: WLD is computed over a 3×3 square region, whereas SIFT is computed over a 16×16 region [21, 8]. That is to say, the salient features of WLD are of finer granularity than those of SIFT, and the smaller support regions allow WLD to capture more local salient patterns.
The LBP descriptor represents an input image by building statistics of local micro-pattern variations. These local patterns might correspond to bright/dark spots, edges, flat areas, etc. WLD, however, first computes the salient micro-patterns and then builds statistics on these salient patterns according to the orientation of the current point; it combines both the gradient (salient micro-pattern) and the orientation.
Several researchers have used Weber's Law in computer vision. Bruni and Vitulano used this law for scratch detection on digital film material [5]. Phiasai et al. employed a Weber ratio to control the strength of a watermark [31]. In contrast to these works, we use Weber's Law to generate a new representation of an image.
The rest of this paper is organized as follows. In Section 2, we introduce a descriptor framework. In Section 3, we propose the local descriptor WLD and compare it with other existing methods by casting them into the framework. In Sections 4 and 5, experiments are carried out on the applications of WLD to texture classification and face detection. Section 6 concludes the paper.
2. Descriptor framework
Recently, He et al. proposed a descriptor framework: Filtering, Labeling and Statistics (FLS) [9]. It describes image attributes from different aspects: filtering depicts the inter-pixel relationship in a local image region; labeling (which includes quantization and mapping) describes the intensity variations that cause psychological redundancy; statistics captures attributes over regions beyond adjacent pixels. The framework is based on an analogy between image description and image compression (both try to describe an image with less information). It is known that the whole procedure of image compression is usually composed of three steps targeting three redundancies: filtering for inter-pixel redundancy, quantization for psychological redundancy, and statistics for coding redundancy. Thus, using the FLS framework and mapping each step of the existing descriptors to these three stages, one can gain further insight into these descriptors and compare them under a uniform framework.
The framework is illustrated in Fig. 1. The first stage of FLS is to filter an input image with a filter bank. Specifically, given an image on a rectangular pixel lattice S containing a set of pixels $Y = \{y_s, s \in S\}$, the image is filtered by a filter bank of size M, i.e., $F = \{f^0, f^1, \ldots, f^{M-1}\}$. The responses of the filter bank F at the same site of the lattice are represented as $v_s^m = f^m(y_s)$ (m = 0, 1, ..., M-1). Concatenating the responses of all filters, we form a response vector for each pixel, $\mathbf{v}_s = [v_s^0, v_s^1, \ldots, v_s^{M-1}]$. Combining the response vectors of all pixels, we build a vector image $V = \{\mathbf{v}_s, s \in S\}$. The second stage is to quantize each response vector and map it to a label by a provided function: in the vector space, a response vector $\mathbf{v}_s$ is labeled as $x_s = f_{vq}(\mathbf{v}_s)$, $x_s \in \mathbb{R}^K$ (K is the dimensionality of the label space). For example, Varma and Zisserman cluster the response vectors [40], and the label of each response vector is then set to a number (i.e., the index of its cluster). Thus, we obtain a mapped image $X = \{x_s, s \in S\}$. The third stage is to compute statistics of the pixel labels; one common approach is to compute the histogram of the labels: $H = f_{\mathrm{hist}}(X)$.
Fig. 1. The FLS framework of image descriptors.
3. WLD for Image Representation
In this section, we review Weber's Law and then propose a descriptor called WLD. Subsequently, we
compare WLD with some existing descriptors (i.e., LBP and SIFT) by casting them into the FLS
framework.
3.1. Weber's Law
Ernst Weber, an experimental psychologist of the 19th century, observed that the ratio of the increment threshold to the background intensity is a constant [15]. This relationship, known since as Weber's Law, can be expressed as

$$\frac{\Delta I}{I} = k, \quad (1)$$

where $\Delta I$ represents the increment threshold (the just noticeable difference for discrimination), $I$ represents the initial stimulus intensity, and $k$ signifies that the proportion on the left side of the equation remains constant despite variations in $I$. The fraction $\Delta I / I$ is known as the Weber fraction.
Weber's Law, more simply stated, says that the size of a just noticeable difference (i.e., ∆I) is a
constant proportion of the original stimulus value. So, for example, in a noisy environment one must
shout to be heard while a whisper works in a quiet room.
3.2 WLD
Motivated by Weber's Law, we propose the descriptor WLD. It consists of two components: differential excitation (ξ) and orientation (θ). The first component, ξ, is a function of the Weber fraction (i.e., the ratio of the relative intensity differences of a current pixel against its neighbors to the intensity of the current pixel itself). The second component, θ, is the gradient orientation of the current pixel.
3.2.1 Differential excitation
We use the intensity differences between a current pixel and its neighbors as the changes of the current pixel. By this means, we hope to find the salient variations within an image and thus simulate human perception of patterns. Specifically, the differential excitation ξ(yc) of a current pixel is computed as illustrated in Fig. 2, where yc denotes the intensity of the current pixel and yi (i = 0, 1, ..., p-1) denote the intensities of the p neighbors of yc (p = 8 here).
To compute ξ(yc), we first calculate the differences between the neighbors and the center point with the filter f^{00}:

$$G_{\mathrm{diff}}(y_i) = \Delta y_i = y_i - y_c. \quad (2)$$

Subsequently, we consider the effect of the neighbors on the current point using the sum of the differences:

$$v_s^{00} = \sum_{i=0}^{p-1} \Delta y_i. \quad (3)$$

Hinted by Weber's Law, we then compute the ratio of the differences to the intensity of the current point by combining the outputs of the two filters f^{00} and f^{01} (whose output $v_s^{01}$ is in fact the original image):

$$G_{\mathrm{ratio}}(y_c) = \frac{v_s^{00}}{v_s^{01}}. \quad (4)$$
Fig. 2. An illustration of computing the WLD feature of a pixel.
To improve the robustness of WLD to noise, we apply an arctangent function as a filter on $G_{\mathrm{ratio}}(\cdot)$:

$$G_{\arctan}(G_{\mathrm{ratio}}(y_c)) = \arctan(G_{\mathrm{ratio}}(y_c)). \quad (5)$$

Combining Eqs. (2), (3), (4) and (5), we have

$$G_{\arctan}(G_{\mathrm{ratio}}(y_c)) = x_s^0 = \arctan\left(\frac{v_s^{00}}{v_s^{01}}\right) = \arctan\left(\sum_{i=0}^{p-1} \frac{y_i - y_c}{y_c}\right). \quad (6)$$
So ξ(yc) is computed as

$$\xi(y_c) = \arctan\left(\frac{v_s^{00}}{v_s^{01}}\right) = \arctan\left(\sum_{i=0}^{p-1} \frac{y_i - y_c}{y_c}\right). \quad (7)$$
Note that ξ(yc) may be negative if the neighbors' intensities are smaller than that of the current pixel. By this means, we attempt to preserve more discriminating information than if we used the absolute value of ξ(yc). Intuitively, a positive ξ(yc) simulates the case where the surroundings are lighter than the current pixel, while a negative ξ(yc) simulates the case where the surroundings are darker than the current pixel.
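
To make Eqs. (2)-(7) concrete, the following is a minimal NumPy sketch of the differential excitation over a whole grayscale image; the function name and the small eps guard against division by zero are our own additions, not part of the paper.

```python
import numpy as np

def differential_excitation(img, eps=1e-6):
    """Differential excitation xi (Eqs. 2-7) for every interior pixel of a
    grayscale image, using the p = 8 neighborhood."""
    img = img.astype(np.float64)
    yc = img[1:-1, 1:-1]                       # current pixels y_c
    v00 = np.zeros_like(yc)                    # v_s^00: sum of differences
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            if di == 0 and dj == 0:
                continue                       # skip the center itself
            v00 += img[1 + di:img.shape[0] - 1 + di,
                       1 + dj:img.shape[1] - 1 + dj] - yc
    # v_s^01 is the original intensity; arctan bounds the ratio (Eq. 7).
    # eps guards against division by zero on black pixels (our addition).
    return np.arctan(v00 / (yc + eps))
```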
Arctangent function. As shown in Eq. (5), we use an arctangent filter $G_{\arctan}(\cdot)$ to compute ξ(yc). As shown in Fig. 3, the reason is that the arctangent filter bounds the output, preventing it from increasing or decreasing too quickly as the input becomes larger or smaller.
One optional filter is a logarithm function, which also matches human perception well; however, many outputs of Eq. (2) are negative, which such a function cannot handle. Another optional filter is a sigmoid function:

$$\mathrm{sigmoid}(\beta x) = \frac{1 - e^{-\beta x}}{1 + e^{-\beta x}}, \quad (8)$$

where β > 0. It is a typical neuronal non-linear transfer function, widely used in artificial neural networks [2]. The arctangent and sigmoid functions have similar curves, as shown in Fig. 3, especially when β = 2. In our case, we use the former for simplicity.
As shown in Fig. 4, we plot the average histogram of the differential excitations over 2,000 texture images. One can see that there are higher frequencies at the two ends of the average histogram (e.g., [-π/2, -π/3] and [π/3, π/2]). This results from the way the differential excitation ξ of a pixel is computed (i.e., as a sum of the difference ratios of p neighbors against a central pixel), as shown in Eq. (7). It is, however, valuable for a classification task. For more details, please refer to Sections 3.2.4, 4, and 5.
Fig. 3. A plot of the arctangent function y = arctan(x), together with y = x and the sigmoid functions y = sigmoid(βx) for β = 1, 2, 3, 4. Note that the output of arctan(·) is in radians.
Fig. 4. A plot of an average histogram of the differential excitations on 2,000 texture images.
3.2.2. Orientation
The orientation component of WLD is computed as

$$\theta(y_c) = x_s^1 = \arctan(\chi), \quad (9)$$

where

$$\chi = \mathrm{median}\left\{\frac{v_s^{10}}{v_s^{12}}, \frac{v_s^{11}}{v_s^{13}}, \frac{v_s^{12}}{v_s^{10}}, \frac{v_s^{13}}{v_s^{11}}\right\}, \quad (10)$$

and $v_s^{1k}$ (k = 0, 1, 2, 3) is the output of the four filters $f^{1k}$; for example, $v_s^{10} = y_0 - y_4$.
Note that in Eq. (10) we use only p/2 (i.e., 4) filters to compute χ, not p (i.e., 8), because the ratios are symmetric when k takes its values in the two intervals [0, p/2-1] and [p/2, p-1]. For example, $\frac{y_4 - y_0}{y_6 - y_2} = \frac{y_0 - y_4}{y_2 - y_6} = \frac{v_s^{10}}{v_s^{12}}$, so the ratios built from k ∈ [p/2, p-1] duplicate those built from k ∈ [0, p/2-1].
For simplicity, the value of θ is quantized into T dominant orientations. Before the quantization, however, we perform a mapping $f: \theta \mapsto \theta'$, where $\theta \in [-\pi/2, \pi/2]$ and $\theta' \in [0, 2\pi]$. Note that the mapping uses both the value of θ computed by Eq. (9) and the signs of the numerator and denominator on the right side of Eq. (10). The quantization function is then

$$\Phi_t = \varphi(\theta') = \frac{2t}{T}\pi, \qquad t = \mathrm{mod}\left(\left\lfloor \frac{\theta'}{2\pi / T} + \frac{1}{2} \right\rfloor,\ T\right). \quad (11)$$

For example, if T = 8, the T dominant orientations are $\Phi_t = t\pi/4$ (t = 0, 1, ..., T-1). In other words, orientations located within the interval $[\Phi_t - \pi/T, \Phi_t + \pi/T]$ are quantized to $\Phi_t$.
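
A minimal sketch of Eqs. (9)-(11) in NumPy follows. For brevity it approximates χ by a single ratio of two opposite-neighbor differences rather than the median of four filter ratios in Eq. (10), and it uses arctan2 to realize the mapping θ → θ′ from the signs of the numerator and denominator; the function and variable names are ours.

```python
import numpy as np

def dominant_orientation(img, T=8):
    """Quantized orientation index t (Eqs. 9-11) for every interior pixel.
    chi is approximated by one ratio of opposite-neighbor differences
    instead of the median of four ratios in Eq. (10)."""
    img = img.astype(np.float64)
    v10 = img[1:-1, 2:] - img[1:-1, :-2]       # horizontal difference
    v11 = img[2:, 1:-1] - img[:-2, 1:-1]       # vertical difference
    # arctan2 realizes the mapping theta -> theta' in [0, 2*pi) from the
    # signs of the numerator and denominator, as the text requires
    theta_p = np.mod(np.arctan2(v11, v10), 2 * np.pi)
    # Eq. (11): quantize theta' into T dominant orientations Phi_t = 2*pi*t/T
    return np.mod(np.floor(theta_p / (2 * np.pi / T) + 0.5), T).astype(int)
```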
The idea of representing image parts by histograms of gradient locations and orientations has been used in biologically plausible vision systems and in recognition [26]. Motivated by this idea, we compute the 2D histogram {WLD(ξj, Φt)} after computing the differential excitation (ξj) and the orientation (Φt) of each pixel. As shown in Fig. 2, each column corresponds to a dominant orientation Φt and each row to a differential excitation; the intensity of each cell thus corresponds to the frequency of a certain differential excitation at a dominant orientation.
3.2.3. WLD histogram
Given an image, as shown in Fig. 5 (a), we encode the WLD features into a histogram H. Using the 2D histogram {WLD(ξj, Φt)} computed as shown in Fig. 2, we project each column of the 2D histogram to form a 1D histogram H(t) (t = 0, 1, ..., T-1), as shown in Fig. 5 (a). That is, we regroup the differential excitations ξ(yc) into T sub-histograms H(t), each corresponding to a dominant orientation Φt. Subsequently, each sub-histogram H(t) is evenly divided into M segments $H_{m,t}$ (m = 0, 1, ..., M-1; in our implementation we let M = 6). These segments are then reorganized into the histogram H. Specifically, H is a concatenation of M sub-histograms, H = {Hm}, m = 0, 1, ..., M-1, and each sub-histogram Hm is a concatenation of T segments, Hm = {Hm,t}, t = 0, 1, ..., T-1.
Fig. 5. An illustration of the WLD histogram feature for a given image. (a) H is a concatenation of M sub-histograms {Hm} (m = 0, 1, ..., M-1), and each Hm is a concatenation of T histogram segments Hm,t (t = 0, 1, ..., T-1). For each column of the histogram matrix, all M segments Hm,t (m = 0, 1, ..., M-1) share the same dominant orientation Φt; for each row, the differential excitations ξj of each segment Hm,t (t = 0, 1, ..., T-1) belong to the same interval lm. (b) A histogram segment Hm,t. Note that if t is fixed, the dominant orientation of a bin hm,t,s is fixed (i.e., Φt) for any m or s.
Note that when each sub-histogram H(t) is evenly divided into M segments, the range of the differential excitations ξj (i.e., l = [-π/2, π/2]) is also evenly divided into M intervals lm (m = 0, 1, ..., M-1). Thus, each interval is $l_m = [\eta_{m,l}, \eta_{m,u}]$, where the lower bound is $\eta_{m,l} = (m/M - 1/2)\pi$ and the upper bound is $\eta_{m,u} = ((m+1)/M - 1/2)\pi$. For example, $l_0 = [-\pi/2, -\pi/3]$.
Furthermore, as shown in Fig. 5 (b), a segment Hm,t is composed of S bins, i.e., Hm,t = {hm,t,s}, s = 0, 1, ..., S-1. Herein, hm,t,s is computed as

$$h_{m,t,s} = \sum_j 1(S_j = s), \qquad \xi_j \in l_m,\ \varphi(\theta'_j) = \Phi_t,\ S_j = \left\lfloor \frac{\xi_j - \eta_{m,l}}{(\eta_{m,u} - \eta_{m,l})/S} + \frac{1}{2} \right\rfloor, \quad (12)$$

where 1(·) is the indicator function

$$1(X) = \begin{cases} 1 & X \text{ is true} \\ 0 & \text{otherwise.} \end{cases} \quad (13)$$
Thus, hm,t,s counts the pixels whose differential excitations ξj belong to the same interval lm, whose orientations are quantized to the same dominant orientation Φt, and whose computed index Sj is equal to s. Here, $(\eta_{m,u} - \eta_{m,l})/S$ is the width of each bin, and $(\xi_j - \eta_{m,l}) / [(\eta_{m,u} - \eta_{m,l})/S]$ is a linear mapping of the differential excitation to its corresponding bin, since the mapped values are real while Sj must be an integer index.
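
Putting Eqs. (7), (11) and (12) together, the sketch below builds the concatenated histogram H from the per-pixel ξ and the quantized orientation indices produced by the two sketches above; the clipping of boundary values is our own choice.

```python
import numpy as np

def wld_histogram(xi, t, M=6, T=8, S=20):
    """Concatenated WLD histogram H from per-pixel differential excitations
    `xi` (in [-pi/2, pi/2]) and orientation indices `t` (in {0, ..., T-1})."""
    H = np.zeros((M, T, S))
    # interval index m: [-pi/2, pi/2] is split evenly into M intervals l_m
    m = np.clip(((xi + np.pi / 2) / (np.pi / M)).astype(int), 0, M - 1)
    eta_l = (m / M - 0.5) * np.pi              # lower bound eta_{m,l}
    # bin index S_j inside l_m (Eq. 12); each bin has width (pi/M)/S
    s = np.clip(np.floor((xi - eta_l) / (np.pi / M / S) + 0.5).astype(int),
                0, S - 1)
    np.add.at(H, (m.ravel(), t.ravel(), s.ravel()), 1)
    # concatenation order of Section 3.2.3: H = {H_m}, H_m = {H_{m,t}}
    return H.reshape(-1)
```

For an image img, H = wld_histogram(differential_excitation(img), dominant_orientation(img)) then yields the M·T·S-dimensional (here 960-bin) feature.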
We segment the range of ξ into several intervals because different intervals correspond to different intensity variations in a given image. For example, given two pixels Pi and Pj with differential excitations ξi ∈ l0 and ξj ∈ l2, we say that the intensity variation around Pi is larger than that around Pj. That is, flat regions of an image produce smaller values of ξ while non-flat regions produce larger values. Besides flat regions, however, there are two kinds of intensity variation around a central pixel that may also lead to small differential excitations: one is clutter noise around the central point; the other is the "uniform" patterns described in [28]. The latter accounts for the majority of such variations and, unlike the former, can be discriminated by the orientations of the pixels.
Here, we let M = 6 because we attempt to use these intervals to approximately capture the high-, middle-, and low-frequency variations in a given image. That is, for a pixel Pi, if its differential excitation ξi ∈ l0 or l5, we say that the variation near Pi is of high frequency; if ξi ∈ l1 or l4, or ξi ∈ l2 or l3, we say that the variation near Pi is of middle or low frequency, respectively.
3.2.4. Weight for a WLD histogram
Intuitively, one pays more attention to the varying regions of a given image than to the flat ones. That is, the different frequency segments Hm play different roles in a classification task, so we can weight them differently for better classification performance.
Table 1. Weights for a WLD histogram

                        H0       H1       H2       H3       H4       H5
  Frequency percent     0.2519   0.1179   0.1186   0.0965   0.0875   0.3276
  Weights (ω_m)         0.2688   0.0852   0.0955   0.1000   0.1018   0.3487
For weight selection, a heuristic approach is to take into account the different contributions of the different frequency segments Hm (m = 0, 1, ..., M-1). First, by computing the recognition rate for each sub-histogram Hm separately on a texture dataset collected from the Internet, we obtain M rates R = {rm}; then, we set each weight $\omega_m = r_m / \sum_i r_i$, as shown in Table 1. We also collect statistics of the frequency percentage of each sub-histogram, shown in the same table. From this table, one can see that these two groups of values (i.e., frequency percentages and weights) are very similar. In addition, the high-frequency segments H0 and H5 contain more frequencies (refer to Fig. 4 for details) than the middle- or low-frequency segments.
Note that a high-frequency segment H0 or H5 receiving a larger weight is consistent with the intuition that a good classification feature should pay more attention to the salient variations of an object. Here, besides the large changes at edges or occlusion boundaries within an image, the way the differential excitation of a pixel is computed (i.e., as a sum of the differences of p neighbors against a central pixel) further contributes to the larger weights of H0 and H5. A side effect of the weighting, however, is that it might amplify the influence of noise. One can avoid this disadvantage by removing a few bins at the outer ends of the high-frequency segments, that is, the left ends of H0,t and the right ends of HM-1,t (t = 0, 1, ..., T-1).
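
Applying the weights of Table 1 to a concatenated histogram is then a per-segment scaling; a minimal sketch (array layout as in the histogram sketch above):

```python
import numpy as np

# omega_m from Table 1, one weight per frequency segment H_m (m = 0..5)
WEIGHTS = np.array([0.2688, 0.0852, 0.0955, 0.1000, 0.1018, 0.3487])

def weight_histogram(H, M=6, T=8, S=20):
    """Scale every bin of segment H_m by its weight omega_m."""
    return (H.reshape(M, T, S) * WEIGHTS[:, None, None]).reshape(-1)
```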
3.3 Characteristics of WLD
The WLD descriptor is based on a psychological law: it extracts features from an image by simulating how a human senses his or her surroundings. Specifically, as shown in Fig. 2, WLD uses the ratio of the intensity differences $v_s^{00}$ to the original intensity $v_s^{01}$. As expected, WLD gains a powerful ability to represent textures.
Fig. 6. The upper row shows original images and the bottom row the filtered images. The intensity at each pixel of a filtered image is its differential excitation, scaled to [0, 255].
As shown in Eq. (2), WLD preserves the differences ($v_s^{00}$) between a center pixel and its neighbors. Thus, the edges detected by WLD depend on the perceived luminance difference and match the subjective criterion elegantly. For example, $v_s^{00}$ may sometimes be quite large, but if $v_s^{00} / v_s^{01}$ is smaller than a noticeable threshold, there is no noticeable edge; conversely, $v_s^{00}$ may be quite small, but if $v_s^{00} / v_s^{01}$ is larger than the threshold, there is a noticeable edge. In Fig. 6 we show some images filtered by WLD, from which one can see that WLD extracts the edges of images well even under heavy noise (Fig. 6, middle column). In particular, as shown in the last column of Fig. 6, WLD can effectively extract relatively weak edges (textures). Furthermore, results of texture analysis show that much of the discriminative texture information is contained in high spatial frequencies such as edges [28]. Thus, WLD yields a powerful feature for textures.
WLD is robust to noise in a given image. Specifically, WLD reduces the influence of noise in a manner similar to smoothing in image processing. As shown in Fig. 2, a differential excitation is computed as the sum of the differences of the p neighbors against the current pixel, which reduces the influence of noisy pixels. Moreover, this sum is further divided by the intensity of the current pixel, which also decreases the influence of noise in an image.
Moreover, the feature is constructed to reduce the effects of illumination change. On the one hand, WLD computes the differences $v_s^{00}$ between a current pixel and its neighbors, so a brightness change in which a constant is added to each image pixel does not affect the difference values. On the other hand, WLD divides the differences $v_s^{00}$ by $v_s^{01}$, so a change in image contrast in which each pixel value is multiplied by a constant multiplies the differences by the same constant, and this contrast change is canceled by the division. The descriptor is therefore robust to changes in illumination.
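
As a quick numeric check of the contrast argument (our own synthetic test, reusing the differential_excitation sketch of Section 3.2.1):

```python
import numpy as np

# Under a contrast change y -> a*y, the constant a cancels in
# v_s^00 / v_s^01, so xi is unchanged; under an offset y -> y + b the
# differences v_s^00 are unchanged but the denominator shifts, so that
# invariance holds for the numerator only.
rng = np.random.default_rng(0)
img = rng.integers(30, 220, size=(64, 64)).astype(np.float64)
xi = differential_excitation(img)              # sketch from Section 3.2.1
print(np.allclose(xi, differential_excitation(1.7 * img)))   # True
```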
Furthermore, regrouping the differential excitation and orientation into a 2D histogram and then
weighting the different frequency segments can also improve the performance of the descriptor WLD.
3.4. Comparison with existing descriptors
Using the FLS framework described in Section 2, we can easily compare our descriptor to the existing ones, as shown in Table 2. LBP computes the intensity differences between the center pixel yc and its neighbors in the first stage; in the second stage, the response of each neighbor is thresholded at zero and the results are concatenated into a binary string. The binary string of each pixel is then used to compute a histogram feature in the last stage. SIFT computes the gradient magnitude and orientation at each image sample point in a region around a keypoint location in the first stage. The orientations are quantized to 8 dominant ones in the second stage. In the third stage, the gradient magnitudes are weighted by a Gaussian window and accumulated into orientation histograms summarizing the contents of designated subregions.
Table 2. Comparison of LBP, SIFT, and WLD according to the framework.

          filtering              labeling                   statistics
  LBP     Intensity difference   Thresholding at zero       Histogram over the binary strings
  SIFT    Intensity difference   Orientation quantization   Histogram over the weighted gradient on dominant orientation
  WLD     Weber fraction         Orientation quantization   Histogram over the differential excitation on dominant orientation
Although WLD, like LBP, also computes the differences between the center pixel yc and its neighbors in the first stage, these differences are summed and then divided by the center pixel yc to obtain the differential excitation, following the Weber fraction. Unlike LBP, WLD uses the median of the gradient-orientation ratios to describe the direction of edges; these medians are then quantized to 8 dominant orientations in the second stage. Unlike SIFT, we use the differential excitation rather than the weighted gradient to compute the histogram. Moreover, the differential excitations are not accumulated over designated subregions around a keypoint location; instead, we compute the frequency of occurrence of the differential excitations for each bin of the histogram. Although we weight the WLD histogram as shown in Table 1, what is weighted is the frequency of each bin, with weights computed from recognition performance statistics, rather than weighting the gradient values by the distances between the neighbors and the keypoint as SIFT does [21].
Furthermore, we compare the time complexity of WLD with those of LBP and SIFT theoretically. Given an image of size m×n, the time complexity of WLD is simply

$$O_{\mathrm{WLD}} = C_1 mn, \quad (14)$$

where C1 is a constant covering the per-pixel computation of WLD: several additions, divisions, and filtering by an arctangent function.
Likewise, the time complexity of LBP is also very simple:

$$O_{\mathrm{LBP}} = C_2 mn, \quad (15)$$

where C2 is also a constant, covering the per-pixel computation of LBP: several additions and multiplications.
The time complexity of SIFT is a little more involved:

$$O_{\mathrm{SIFT}} = C_{31}(\alpha\beta)(pq)(mn) + C_{32}k_1 + C_{33}k_2 st + C_{34}k_2 st. \quad (16)$$

Here, the four terms correspond to the four steps of SIFT: detection of scale-space extrema, accurate keypoint localization, orientation assignment, and building the local image descriptor, and $C_{3i}$ (i = 1, ..., 4) are four constants. The first term represents the convolution of a variable-scale Gaussian with the given image, where p and q are the size of the convolution template and α and β are the number of octaves and the number of scales per octave, respectively. In the second term, k1 is the number of keypoint candidates. In the third and fourth terms, k2 is the number of keypoints and s, t are the size of the support region of each keypoint (e.g., s = t = 16). So we have

$$O_{\mathrm{SIFT}} \approx C_{31}(\alpha\beta)(pq)(mn). \quad (17)$$
Comparing Eqs. (14), (15) and (17), one can see that both LBP and WLD extract image features more efficiently than SIFT.
4. Application to Texture classification
In this section, we use WLD features for texture classification and compare both the performance and the computational efficiency with those of state-of-the-art methods.
4.1. Background
Texture classification plays an important role in many applications such as robot vision, remote sensing,
crop classification, content-based access to image databases, and automatic tissue recognition in
medical images. It has led to a wide interest in investigating the theoretical as well as practical issues
regarding texture classification.
Several approaches for the extraction of texture features have been proposed. On one hand, there are
some recent attempts using sparse descriptors for this task, such as [11, 19]. Dorkó and Schmid
optimized the keypoint detection and then used SIFT for the image representation [11]. Lazebnik et al.
presented a probabilistic part-based approach to describe the texture and object [19]. On the other hand,
there are also several attempts using dense descriptors for this task, such as [16, 22, 27, 28, 29, 39, 40].
Jalba et al. presented a multi-scale method based on mathematical morphology [16]. Manjunath and Ma
used Gabor filters for texture analysis showing that they exhibit excellent discrimination abilities over a
broad range of textures [22]. Ojala et al. proposed to use signed gray-level differences and their
multidimensional distributions for texture description [29]. The original LBP is its simplification,
discarding the contrast of local image texture [27, 28]. Urbach et al. described a multiscale and
multishape morphological method for pattern-based analysis and classification of gray-scale images
using connected operators [39]. Varma and Zisserman used the joint distribution of intensity values
over extremely compact neighborhoods (starting from as small as 3×3 pixels square) as image
descriptors for the texture classification [40].
4.2. Dataset
Experiments are carried out on two different texture databases: Brodatz [4] and KTH-TIPS2-a [6].
Examples of the 32 Brodatz textures [29] used in the experiments are shown in Fig. 7 (a). The images are 256×256 pixels in size with 256 gray levels. Each image was divided into 16 disjoint 64×64 samples, which were independently histogram-equalized to remove luminance differences between textures. To make the classification problem more challenging and generic, and to make comparison possible, we use the same experimental set-up as [16, 29, 39]. Three additional samples were generated from each sample: (i) a sample rotated by 90 degrees, (ii) a 64×64 scaled sample obtained from the 45×45 pixels in the middle of the original sample, and (iii) a sample that was both rotated and scaled. Consequently, the entire data set, which we refer to as the Brodatz data set, comprises 2048 samples, with 64 samples in each of the 32 texture categories.
The KTH-TIPS2-a database contains 4 physical, planar samples of each of 11 materials under varying illumination, pose and scale. Some examples of each sample are shown in Fig. 7 (b). The dataset contains 11 texture classes with 4,395 images in total. The images are 200×200 pixels in size (we did not include those images that are not of this size) and are transformed to 256 gray levels. The database contains images at 9 scales, under four different illumination directions and three different poses.
Fig. 7. Some examples from the two texture datasets: (a) Brodatz and (b) KTH-TIPS2-a.
Note that for both texture datasets (i.e., Brodatz and KTH-TIPS2-a), experiments are carried out with ten-fold cross-validation to avoid bias. In each round, we randomly divide the samples of each class into two subsets of equal size, one for training and the other for testing; in this fashion, the training and test images are disjoint. The results are reported as the average value and standard deviation over the ten runs.
4.3. WLD Feature for Classification
To perform texture classification, two essential issues must be addressed: texture representation and classifier design. We use the WLD feature as the representation and build a system for texture classification.
For the texture representation, given an image, we extract WLD features as shown in Fig. 5. Here, we empirically let M = 6, T = 8, S = 20. In addition, we weight each sub-histogram Hm using the weights shown in Table 1.
As the classifier we use the K-nearest-neighbor rule, which has been used successfully in classification; in our case, K = 3. To compute the distance between two given images I1 and I2, we first obtain their WLD feature histograms H1 and H2 and then measure the similarity between them. In our experiments, we use the normalized histogram intersection Π(H1, H2) as the similarity measure of two histograms [36]:

$$\Pi(H^1, H^2) = \sum_{i=1}^{L} \min(H^1_i, H^2_i), \quad (18)$$

where L is the number of bins in a histogram. The intuitive motivation of this measure is to calculate the common part of two histograms. Note that the input histograms Hi (i = 1, 2) are normalized.
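
A minimal sketch of Eq. (18) and the 3-nearest-neighbor rule follows; the function names are ours.

```python
import numpy as np

def hist_intersection(h1, h2):
    """Normalized histogram intersection Pi(H1, H2) of Eq. (18)."""
    return np.minimum(h1 / h1.sum(), h2 / h2.sum()).sum()

def knn_classify(query, train_hists, train_labels, k=3):
    """K-nearest-neighbor classification (K = 3 in the paper): the
    majority label among the k most similar training histograms wins."""
    sims = np.array([hist_intersection(query, h) for h in train_hists])
    top = np.argsort(sims)[-k:]                # k most similar neighbors
    labels, counts = np.unique(np.asarray(train_labels)[top],
                               return_counts=True)
    return labels[np.argmax(counts)]
```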
4.4. Experimental Results
Experimental results on the Brodatz and KTH-TIPS2-a textures are illustrated in Fig. 8. The accuracy of each method is given as the percentage of correct classifications, computed as

$$\mathrm{accuracy} = \frac{\#\ \mathrm{correct\ classifications}}{\#\ \mathrm{total\ images}}. \quad (19)$$
As shown in Fig. 8 (a), we compare our method with others on the classification of the Brodatz textures: SIFT, Jalba [16], Ojala [29] (i.e., signed gray-level differences (SD) and LBP), Urbach [39], and Manjunath [22] (i.e., Gabor). Note that all the results of the other methods in Fig. 8 are quoted directly from the original papers, except those of Manjunath [22] and SIFT. The approach in [22] is a "traditional" texture analysis method using the global mean and standard deviation of the responses of Gabor filters; we substitute the results from [29], in which Ojala et al. use the same set-up of Gabor filters with 6 orientations and 4 scales. The SIFT results come from our re-implementation following [11], in which the keypoint detection is optimized to achieve stable local descriptors; we also refer to the code by A. Vedaldi (http://vision.ucla.edu/~vedaldi/code/siftpp/siftpp.html).
[Figure: two bar charts of classification accuracy. (a) Results for the Brodatz textures, comparing SIFT, Jalba, SD, Gabor, Urbach, LBP, and WLD; (b) results for the KTH-TIPS2-a textures, comparing SIFT, LBP, and WLD. WLD scores highest in both panels: 97.5 (0.6) on Brodatz and 96.3 (0.5) on KTH-TIPS2-a.]
Fig. 8. Comparison with state-of-the-art methods on the Brodatz and KTH-TIPS2-a textures, where the values above the bars are the accuracies and corresponding standard deviations.
As shown in Fig. 8 (b), we also compare our method with SIFT and LBP on the classification of the KTH-TIPS2-a textures. Likewise, both SIFT and LBP are re-implemented by us; in the SIFT implementation, however, we use the Laplacian to detect keypoints for Fig. 8 (a), while for Fig. 8 (b) we employ the Harris detector, following the idea in [11].
From Fig. 8, one can see that our approach is very robust in comparison to the other methods, and the standard deviation of WLD is smaller. Furthermore, although we rotated and scaled the subimages of the Brodatz textures as described in Section 4.2, and although KTH-TIPS2-a contains diverse variations (i.e., in pose, scale and illumination), we still obtain favorable results. This shows that WLD extracts powerful discriminating features that are robust to rotation, scaling and illumination. The poorer performance of SIFT can be partly explained by the small image size (e.g., 64×64 in the Brodatz database) for a sparse descriptor: too few keypoints may be located, degrading performance. Due to its robustness to illumination, LBP performs better on the KTH-TIPS2-a textures than on the histogram-equalized Brodatz samples.
In addition, we also use the two components of WLD (the differential excitation ξ and the orientation θ) as features separately and compute their histograms on the Brodatz textures for classification. The classification rates are 90.6% and 89.3%, respectively. That is, both the differential excitation ξ and the orientation θ provide effective information for texture classification.
Fig. 9. The effects of using different M, S, and T parameters: classification accuracy as (a) M, (b) T, and (c) S vary.
4.5 The effects of using different M, S, and T parameters
In this section, we discuss the influence of the settings of the parameters M, S, and T on the final results. For a histogram-based method, the setting of M, S, and T is a tradeoff between discriminability and statistical reliability. In general, as these parameters grow, the dimensionality of the histogram (i.e., the number of its bins) grows and the histogram becomes more discriminative. In a real application, however, the count in each bin then becomes smaller because the size of the input image/patch is fixed, which degrades the statistical reliability of the histogram; if the bin counts become too small, this poor statistical reliability in turn degrades the discriminability. Conversely, if these parameters become smaller, the bin counts become larger and the histogram becomes statistically more reliable, but if the parameters are too small, the dimensionality of the histogram also becomes too small, which again degrades its discriminability.
The experimental results on the Brodatz textures obtained by varying the parameters are plotted in Fig. 9. In these experiments, we vary one parameter and fix the others. One can see that the performance changes only slightly, which shows that the histogram-based method is relatively robust even though its dimensionality changes considerably.
5. Application to face detection
In this section, we use WLD features for human face detection. Although we train only one classifier, we use it to detect frontal, occluded, and profile faces. Furthermore, experimental results show that this classifier obtains performance comparable to state-of-the-art methods.
5.1. Background
The goal of face detection is to determine whether there are any faces in a given image and, if so, to return the location and extent of each face. Recently, many methods for detecting faces have been proposed, most of which extract features densely [45]. Among these, learning-based approaches that capture the variations of facial appearance have attracted much attention, e.g., [33, 35]. One of the most important advances is the emergence of boosting-based methods, such as [14, 20, 30, 41, 43, 44]. In addition, Hadid et al. use the local binary pattern (LBP) for face detection and recognition [13], and Garcia and Delakis use a convolutional face finder for fast and robust face detection [12].
5.2. WLD Feature for Face Samples
We use WLD features as the facial representation and build a face detection system. For the facial representation, as illustrated in Fig. 9, we divide an input sample into overlapping regions and use a p-neighborhood WLD operator (here, p = 8). In our case, we normalize the samples to w×h (i.e., 32×32) pixels and derive the WLD representation as follows.
Fig. 9. An illustration of WLD histogram feature for face detection.
We divide a face sample of size w×h into K overlapping blocks (K = 9 in our experiments) of size (w/2)×(h/2) pixels, with an overlap of w/4 pixels, as in the sketch below. For each block we compute a concatenated histogram Hk, k = 0, 1, ..., K-1, computed as shown in Fig. 5: each Hk is a concatenation of M sub-histograms $H_m^k$, m = 0, 1, ..., M-1, and each $H_m^k$ is in turn a concatenation of T histogram segments $H_{m,t}^k$, t = 0, 1, ..., T-1, where $H_{m,t}^k$ is an S-bin histogram segment. For this group of experiments, we empirically let M = 6, T = 4, S = 3, so each Hk is a 72-bin histogram. (Note that for each sub-histogram $H_m^k$, we use the same weights as shown in Table 1.)
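
The block layout can be written down directly; this is a small sketch under the stated parameters (w = h = 32, 16×16 blocks with stride w/4 = 8), with names of our own choosing.

```python
def face_blocks(sample, w=32, h=32):
    """The K = 9 overlapping blocks of Section 5.2: each block is
    (w/2) x (h/2) pixels and consecutive blocks overlap by w/4 pixels,
    i.e. a 3 x 3 grid with stride w/4."""
    bw, bh, step = w // 2, h // 2, w // 4
    return [sample[top:top + bh, left:left + bw]
            for top in range(0, h - bh + 1, step)       # rows 0, 8, 16
            for left in range(0, w - bw + 1, step)]     # cols 0, 8, 16
```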
For each block, we train a support vector machine (SVM) classifier on the Hk histogram feature to verify whether the kth block is a valid face block (in our case, we use a second-degree polynomial kernel for the SVM). If the number of valid face blocks is larger than a given threshold Ξ, we declare that a face exists in the input window. The value of Ξ is a tradeoff between the detection rate and the false alarms of the face detector: as Ξ increases, the detection rate decreases but so do the false alarms; as Ξ decreases, the detection rate increases but so do the false alarms. The value of Ξ also varies with the pose of the faces. For more details, please refer to Section 5.5.
5.3. Dataset
The training set is composed of two sets, a positive set Sf and a negative set Sn. The positive set consists of 50,000 frontal face samples collected from the web, video, and digital cameras, covering wide variations in pose, facial expression, and lighting conditions. To make the detection method robust to affine transforms, the training samples are rotated, translated, and scaled, as is common practice [33]. After this preprocessing, the set Sf includes 100,000 face samples. The negative set Sn consists of 31,085 images containing no faces, collected from the Internet.
As test sets, we use three. The first is the MIT+CMU frontal face test set, which consists of 130 images containing 507 upright faces [33]. The second is a subset of the Aleix Martinez-Robert (AR) face database [23]; the AR database consists of over 3,200 color images of the frontal faces of 126 subjects, from which we choose the images with occlusions (i.e., conditions 8-13 of the first session and conditions 21-26 of the second session), yielding a test set of 1,512 images. The third is the CMU profile test set [35] (441 multiview faces in 208 images).
Note that the face samples are of size 32×32. To detect faces smaller or larger than the sample size, we enlarge and shrink each input image.
5.4. Classifier Training
As described in Section 5.3, the set Sf contains a large number of face samples, and hundreds of thousands of non-face samples can be extracted from the set Sn, so training an SVM classifier directly on Sf and Sn is extremely time-consuming. To address this problem, we use a resampling method to train the SVM classifier: motivated by [44], we resample both the positive and the negative samples during classifier training.
For the positive samples, we first randomly pick a subset Sf1 of size Np (in our experiments, Np = 3,000). Likewise, we randomly crop a subset Sn1 of size Nn (in our experiments, Nn = 3,000) from the non-face database Sn; the samples in Sn1 are normalized to w×h (i.e., 32×32) pixels. Subsequently, we extract the WLD histogram features of both the face and non-face samples as shown in Fig. 9. Using the extracted features, we train a lower-performance SVM classifier and simultaneously obtain a support-vector set S1, which includes a face support-vector subset $S_f^1$ and a non-face support-vector subset $S_n^1$.
We then run the resulting lower-performance SVM classifier on the two training sets (i.e., Sf and Sn) to collect Np misclassified face samples Sf2 and Nn misclassified non-face samples Sn2. Combining the newly collected sample sets (i.e., Sf2 and Sn2) with the two support-vector subsets obtained in the previous round (i.e., $S_f^1$ and $S_n^1$), we obtain two new training sets, ($S_f^1$ + Sf2) and ($S_n^1$ + Sn2), and train another SVM classifier with better performance. After several iterations of the aforementioned procedure, we finally obtain a well-performing SVM classifier.
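
The following is a minimal sketch of this resampling loop; scikit-learn's SVC is our own choice of SVM implementation (the paper does not name one), and the simple stopping rule is ours as well.

```python
import numpy as np
from sklearn.svm import SVC    # our choice; the paper names no library

def bootstrap_train(face_feats, nonface_feats, n_rounds=5, n_sample=3000):
    """Resampling scheme of Section 5.4: train on a random subset, then
    retrain on the previous support vectors plus misclassified samples."""
    rng = np.random.default_rng(0)
    X = np.vstack([face_feats[rng.choice(len(face_feats), n_sample, replace=False)],
                   nonface_feats[rng.choice(len(nonface_feats), n_sample, replace=False)]])
    y = np.r_[np.ones(n_sample), -np.ones(n_sample)]
    for _ in range(n_rounds):
        clf = SVC(kernel="poly", degree=2)     # 2nd-degree polynomial kernel
        clf.fit(X, y)
        sv_X, sv_y = clf.support_vectors_, y[clf.support_]
        # collect up to Np / Nn misclassified samples from the full pools
        hard_f = face_feats[clf.predict(face_feats) < 0][:n_sample]
        hard_n = nonface_feats[clf.predict(nonface_feats) > 0][:n_sample]
        if len(hard_f) + len(hard_n) == 0:
            break                              # nothing left to correct
        X = np.vstack([sv_X, hard_f, hard_n])  # support vectors + hard samples
        y = np.r_[sv_y, np.ones(len(hard_f)), -np.ones(len(hard_n))]
    return clf
```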
Note that we actually train K SVM sub-classifiers, one per block as shown in Fig. 9. Combining these K sub-classifiers, we obtain the final strong classifier.
5.5. Experimental Results
The resulting strong SVM classifier is tested on the three test sets described in Section 5.3. Experimental results are shown in Figs. 10, 11 and Table 3, respectively, where we also compare the performance of the resulting SVM classifier (which we call "SVM-WLD") with some existing methods. Note that all the results of the other methods in Figs. 10 and 11 are quoted directly from the original papers, except Hadid [13], which we implemented following their idea. During testing, the parameter Ξ takes different values, as discussed in Section 5.2: for the MIT+CMU frontal test set, the AR test set, and the CMU profile test set, Ξ is 8, 7 and 6, respectively.
[Figure: ROC curves (correct detection rate versus number of false positives) for the detectors of Bourdev, Garcia, Hadid, Huang, Lin, and Viola, and for SVM-Grey and SVM-WLD.]
Fig. 10. A performance comparison of our method with some existing methods on MIT+CMU frontal face test set.
Table 3. Performance of our method on the AR test set.

  Detection rate    False alarms
  99.74%            0
  100%              3
[Figure: ROC curves (correct detection rate versus number of false positives) for Huang, Schneiderman, and SVM-WLD.]
Fig. 11. A performance comparison of our method with some existing methods on CMU profile testing set.
As shown in Fig. 10, we compare the performance of our method on the MIT+CMU frontal face test set with some existing methods: Bourdev [48], Garcia [12], Hadid [13], Huang [49], Lin [20], and Viola [41]. Note that Lin et al. also proposed a method to detect occluded faces. "SVM-Grey" denotes using only the grey intensities as the input to the SVM classifier, with all other experimental set-ups the same as for "SVM-WLD". From Fig. 10, one can see that SVM-WLD locates 89.9% of the faces without any false alarms and works much better than SVM-Grey (76.4% of the faces without a false alarm). Furthermore, SVM-WLD performs comparably to the existing methods, e.g., Lin [20].
Fig. 12. Some results of our detector on the MIT+CMU frontal test set (first row), AR database (second row) and CMU
profile test set (third row).
Table 3 shows the detection results on the AR test set: SVM-WLD locates 99.74% of the faces without any false alarms, and locates all faces with only 3 false alarms. In addition, in Fig. 11, we compare SVM-WLD on the CMU profile test set with some existing methods, such as Huang [14] and Schneiderman [35]. To locate profile faces with in-plane rotation, we also rotate the test images. From Fig. 11, one can see that SVM-WLD locates 84.58% of the faces without any false alarms, which is, to our knowledge, the best result reported so far.
However, different criteria (e.g., the training examples involved, the number of windows scanned during detection, etc.) can be used to favor one method over another, which makes it difficult to evaluate different methods even when they use the same benchmark data sets [45]. Thus, the results shown in Figs. 10 and 11 simply illustrate that our method works robustly and achieves performance comparable to the state-of-the-art methods. Some results of our detector on the three test sets are shown in Fig. 12.
6. Conclusion
We have proposed a novel discriminative descriptor called WLD, inspired by Weber's Law, a law grounded in human perception. We organize the WLD features into a histogram by encoding both the differential excitations and the orientations at certain locations. Experimental results clearly show that WLD achieves favorable performance on both the Brodatz and KTH-TIPS2-a textures compared to state-of-the-art methods. Specifically, we achieve 97.5% accuracy with WLD, compared to 91.4% for SIFT and 95.9% for Gabor, on the Brodatz dataset; moreover, despite the diverse variations in the KTH-TIPS2-a set (i.e., pose, scale and illumination), we obtain 96.3% accuracy, compared to 88.6% for SIFT and 93.8% for LBP. Besides the performance comparison, we also compare the computational complexity of WLD with those of LBP and SIFT theoretically; the analysis shows that WLD is much faster than SIFT and comparable to LBP.
For the face detection task in particular, we train only one classifier, yet it can detect frontal, occluded, and profile faces. The results on the three sets, i.e., the MIT+CMU frontal face test set, the Aleix Martinez-Robert (AR) face database, and the CMU profile test set, demonstrate the effectiveness of the proposed method through experiments and comparisons with other existing face detectors. The trained detector achieves 89.9% accuracy with no false alarms on the MIT+CMU frontal face test set. On the AR test set, it locates 99.74% of the faces without any false alarms and locates all faces with only 3 false alarms. Furthermore, on the CMU profile test set, it locates 84.58% of the faces without any false alarms, which is, to our knowledge, the best result reported so far.
The current solution has been developed for texture classification and face detection; it has not yet been exploited for face recognition or object recognition, and enhancing the solution for these fields is future work. In addition, exploring a multiscale version of WLD is also of future interest.
References
[1] T. Ahonen, A. Hadid and M. Pietikäinen, Face Description with Local Binary Patterns: Application to Face
Recognition, PAMI, 2006.
[2] J. Anderson. An Introduction to Neural Networks. The MIT Press, Cambridge, MA. 1995.
[3] H. Bay, T. Tuytelaars, L. van Gool. SURF: Speeded Up Robust Features. ECCV, 2006.
[4] P. Brodatz. Textures: A Photographic Album for Artists and Designers, Dover Publications, New York, 1966.
[5] V. Bruni and D. Vitulano. A Generalized Model for Scratch Detection, IEEE TIP, 2004.
[6] B. Caputo, E. Hayman and P. Mallikarjuna, Class-Specific Material Categorisation, ICCV, 2005.
[7] J. Chen, S. Shan, G. Zhao, X. Chen, W. Gao, M. Pietikäinen. A Robust Descriptor based on Weber’s Law. CVPR,
2008.
[8] H. Cheng, Z. Liu, N. Zheng, and J. Yang, A Deformable Local Image Descriptor, CVPR, 2008.
[9] C. He, T. Ahonen, M. Pietikäinen. A Bayesian Local Binary Pattern texture descriptor, submitted to ICPR 2008.
[10] N. Dalal and B. Triggs. Histograms of Oriented Gradients for Human Detection. CVPR, 2005.
[11] G. Dorkó and C. Schmid. Maximally Stable Local Description for Scale Selection, ECCV, 2006.
[12] C. Garcia and M. Delakis. Convolutional Face Finder: A Neural Architecture for Fast and Robust Face Detection.
PAMI, 2004.
[13] A. Hadid, M. Pietikäinen and T. Ahonen. A Discriminative Feature Space for Detecting and Recognizing Faces, CVPR,
2004.
[14] C. Huang, H. Ai, Y. Li and S. Lao. High-Performance Rotation Invariant Multiview Face Detection, PAMI, 2007.
[15] A. K. Jain. Fundamentals of Digital Image Processing. Englewood Cliffs, NJ: Prentice-Hall, 1989.
[16] A.C. Jalba, M.H.F. Wilkinson and J.B.T.M. Roerdink. Morphological Hat-transform Scale Spaces and Their Use in
Pattern Classification. PR, 2004.
[17] Y. Ke and R. Sukthankar, PCA-SIFT: A More Distinctive Representation for Local Image Descriptors, CVPR,
2004.
[18] S. Lazebnik, C. Schmid, J. Ponce. A Sparse Texture Representation Using Local Affine Regions. CVPR, 2003.
[19] S. Lazebnik, C. Schmid, J. Ponce. A Maximum Entropy Framework for Part-Based Texture and Object Recognition,
ICCV, 2005.
[20] Y. Y. Lin, T.-L. Liu, and C.-S. Fuh. Fast Object Detection with Occlusions. ECCV, 2004.
[21] D. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. IJCV, 2004.
[22] B. Manjunath and W. Ma. Texture Features for Browsing and Retrieval of Image Data, PAMI, 1996.
[23] A. M. Martinez, R. Benavente. The AR Face Database. CVC Technical Report, #24, 1998.
[24] P. Moreels and P. Perona, Evaluation of Features Detectors and Descriptors based on 3D objects, IJCV, 2007.
[25] K. Mikolajczyk and C. Schmid. A Performance Evaluation of Local Descriptors. PAMI, 2005.
[26] K. Mikolajczyk and J. Matas, Improving Descriptors for Fast Tree Matching by Optimal Linear Projection, ICCV,
2007.
[27] T. Ojala, M. Pietikäinen, and D. Harwood, A Comparative Study of Texture Measures with Classification Based on
Feature Distributions, PR, 1996.
[28] T. Ojala, M. Pietikäinen, T. Mäenpää. Multiresolution Gray Scale and Rotation Invariant Texture Analysis with Local
Binary Patterns, PAMI, 2002.
[29] T. Ojala, K. Valkealahti, E. Oja and M. Pietikäinen, Texture Discrimination with Multidimensional Distributions of
Signed Gray Level Differences. PR, 2001.
[30] M.-T. Pham and T.-J. Cham, Fast Training and Selection of Haar Features Using Statistics in Boosting-based Face
Detection, ICCV, 2007.
[31] T. Phiasai, P. Kumhom and K. Chamnongthai. A Digital Image Watermarking Technique Using Prediction Method and
Weber Ratio, International Symposium on Communications and Information Technologies, 2004.
[32] T. Randen and J. H. Husoy, Filtering For Texture Classification: A Comparative Study, PAMI, 1999.
[33] H. A. Rowley, S. Baluja, and T. Kanade. Neural Network-Based Face Detection, PAMI, 1998.
[34] Y. Rodriguez and S. Marcel, Face Authentication Using Adapted Local Binary Pattern Histograms, ECCV, 2006.
[35] H. Schneiderman and T. Kanade. A Statistical Method for 3D Object Detection Applied to Faces and Cars. CVPR, 2000.
[36] M. Swain and D. Ballard. Color indexing. IJCV, 1991.
[37] X. Tan, B. Triggs. Enhanced Local Texture Feature Sets for Face Recognition under Difficult Lighting Conditions.
Analysis and Modelling of Faces and Gestures, 2007.
[38] E. Tola, V. Lepetit and P. Fua, A Fast Local Descriptor for Dense Matching, CVPR, 2008.
[39] E. R. Urbach, J. B.T.M. Roerdink, and M. H.F. Wilkinson. Connected Shape-Size Pattern Spectra for Rotation and
Scale-Invariant Classification of Gray-Scale Images, PAMI, 2007.
[40] M. Varma, and A. Zisserman. Texture Classification: Are Filter Banks Necessary? CVPR, 2003.
[41] P. Viola and M. Jones. Rapid Object Detection Using a Boosted Cascade of Simple Features. CVPR, 2001.
[42] S.Winder and M. Brown. Learning Local Image Descriptors. CVPR, 2007.
[43] R. Xiao, H. Zhu, H. Sun and X. Tang, Dynamic Cascades for Face Detection, ICCV, 2007.
[44] S. Yan, S. Shan, X. Chen, W. Gao, J. Chen. Matrix-Structural Learning (MSL) of Cascaded Classifier from Enormous
Training Set. CVPR, 2007.
[45] M. H. Yang, D. Kriegman, and N. Ahuja. Detecting Faces in Images: A Survey, PAMI, 2002.
[46] G. Zhao and M. Pietikäinen, Dynamic Texture Recognition Using Local Binary Patterns with an Application To Facial
Expressions, PAMI, 2007.
[47] W. Zhang, S. Shan, W. Gao, X. Chen, and H. Zhang, Local Gabor Binary Pattern Histogram Sequence (LGBPHS): A
Novel Non-Statistical Model for Face Representation and Recognition, ICCV, 2005.
[48] L. Bourdev and J. Brandt. Robust object detection via soft cascade. CVPR, 2005.
[49] C. Huang, H. Ai, T. Yamashita, S. Lao, M. Kawade, Incremental Learning of Boosted Face Detector, ICCV, 2007.