ttp2013102484s - IEEE Computer Society

ANVAR et al.: Multi-View Face Detection and Registration Requiring Minimal Manual Intervention
Finding Distinctive Features
The scale-invariant features (SIFT) extracted from face images usually include a substantial number of common or uninformative features in addition to the required distinctive features. Thus, it is necessary to select only the distinctive features obtained from the discriminative parts of a face. Two hypotheses exist: either the feature is part of a face ($H_{i,k} = 1$) or it is from the background ($H_{i,k} = 0$). Since at this stage we do not know where the face is, we assume that a distinctive feature arising from the face should be unique. Thus, in order to identify the distinctive features and discard common features, a likelihood ratio test (A.1) is applied to all the features extracted from both images. Features with a likelihood ratio less than one are considered common features and are discarded.
$$\gamma(f_{i,k}) = \frac{P(H_{i,k} = 1 \mid \Omega)}{P(H_{i,k} = 0 \mid \Omega)} < 1, \tag{A.1}$$
where $f_{i,k}$ is the $k$th feature of image $i$ and $\gamma(f_{i,k})$ is the likelihood ratio for this feature given some independent data $\Omega$. As $\gamma(f_{i,k})$ has to be determined for all features found in the image, we use the Bernoulli distribution in (A.2) to determine the probability that a feature $f_{i,k}$ comes from an object or non-object part ($H_{i,k} \in \{0, 1\}$).
$$P(H_{i,k} \mid \theta) = \theta^{H_{i,k}} (1 - \theta)^{1 - H_{i,k}}, \tag{A.2}$$
where $\theta$ is the distribution parameter. We use the maximum likelihood estimate (MLE) to estimate the probabilities in (A.1) as follows. Let $f_{1,k}$ be a feature from the first image $I_1$, which is compared to all $N_2$ features ($f_{2,j}$, $j = 1, \ldots, N_2$) in the second image $I_2$. If this feature is distinctive, it should be unique, and its occurrence elsewhere in the image (both face and background) would be very limited. Ideally, there would be only one such feature, and for symmetrical parts only two features can be found. Thus, we can estimate its probability as $P(H_{1,k} = 1 \mid \Omega) = 2/\ldots$ But if this feature originates from a non-distinctive part of the image, the number of similar features found for it could be any number $\alpha > 2$. Thus, we can estimate its probability as $P(H_{1,k} = 0 \mid \Omega) = \ldots$ To compute the similarity number $\alpha$ of feature $f_{1,k}$ in the first image, its appearance descriptor ($f^{A}_{1,k}$) is compared to the appearance descriptors ($f^{A}_{2,j}$) of all features in the second image, counting those that fall within a global threshold $\mathit{Thr}_a$. This procedure is then repeated for each feature found in the second image, which is compared to all the features in the first image. Finally, all common features found are removed. To estimate a suitable value
for $\mathit{Thr}_a$, several image pairs from different databases (FERET, CMU, and FDDB) with various poses (left, right, and frontal views) are used for evaluation. Sample results of finding the correspondence points between image pairs with different thresholds are shown in Fig. A.1. From the empirical tests, we found that a value between 0.2 and 0.3 gives the best result, and a value in this range is used for $\mathit{Thr}_a$. A larger value increases the number of wrongly eliminated distinctive features, while a lower threshold increases the number of false distinctive features. Since this threshold applies only to the appearance descriptor, its value is fixed for features extracted using the SIFT method (standard setting with 128 bins).
Fig. A.1. Results of applying different feature appearance thresholds ($\mathit{Thr}_a$) to find common and distinctive features. Several image pairs from different databases (CMU, FDDB, and FERET) and faces in various poses (left, frontal, and right views) are used to determine a suitable value for the threshold. The empirical test shows that a value between 0.2 and 0.3 is suitable for keeping valid distinctive points and eliminating noise.
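The matching procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the Euclidean distance, the default threshold of 0.25 (taken from the reported 0.2 to 0.3 range), the cut-off `max_matches = 2`, and the function names `count_matches`, `distinctive_mask`, and `filter_pair` are all assumptions made for the example. It also assumes that the two probabilities in (A.1) share the same normaliser, so that the ratio test reduces to comparing the match count $\alpha$ with 2. Features with no match at all are kept here, which the text does not specify.

```python
import math


def count_matches(descriptor, others, thr_a=0.25):
    """Similarity number alpha: how many descriptors in `others` lie within thr_a.

    Assumption: Euclidean distance between appearance descriptors.
    """
    return sum(1 for other in others if math.dist(descriptor, other) < thr_a)


def distinctive_mask(desc_1, desc_2, thr_a=0.25, max_matches=2):
    """For each feature of desc_1, True if its alpha in desc_2 is <= max_matches.

    Assumption: alpha > 2 corresponds to a likelihood ratio below one,
    so such features are treated as common and flagged False.
    """
    return [count_matches(d, desc_2, thr_a) <= max_matches for d in desc_1]


def filter_pair(desc_1, desc_2, thr_a=0.25, max_matches=2):
    """Apply the test in both directions, image 1 vs. 2 and image 2 vs. 1."""
    keep_1 = distinctive_mask(desc_1, desc_2, thr_a, max_matches)
    keep_2 = distinctive_mask(desc_2, desc_1, thr_a, max_matches)
    return keep_1, keep_2
```

With real SIFT output the descriptors would be 128-dimensional vectors. The sketch works for any dimension as long as both images use the same one.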
Face Features Estimation
Let feature $f_k$ be a binary random variable that might originate from a part of face ($H_k = 1$) or non-face ($H_k = 0$) area. We can assign a Bernoulli distribution with parameter $\theta$ to it that determines its uncertain value as:
$$P(H_k \mid \theta) = \theta^{H_k} (1 - \theta)^{1 - H_k}. \tag{B.1}$$
Assuming a sequence of $L$ independent observations of data $\psi$ over feature $f_k$ in the training data, the likelihood of obtaining $\psi$ is given by:
$$P(\psi \mid \theta) = \prod_{l=1}^{L} \theta^{H_k} (1 - \theta)^{1 - H_k} = \theta^{L_k} (1 - \theta)^{L - L_k}, \tag{B.2}$$
where $L_k$ is the number of times that feature $f_k$ has represented a face area in the observation data $\psi$. When $L$ is large, we can estimate the distribution parameter $\theta$ using the MLE as:
$$\hat{\theta}_{MLE} = \arg\max_{\theta} P(\psi \mid \theta). \tag{B.3}$$
However, if $L$ is small, the MLE will produce a biased outcome and there is also the sparse data problem. Since we know that the number of false correspondences will be low, we use a Bayesian approach and consider a further uncertainty on the distribution parameter $\theta$. Thus, we use a Beta probability distribution as the prior:
$$P(\theta \mid \alpha_1, \alpha_0) = \frac{\Gamma(\alpha_1 + \alpha_0)}{\Gamma(\alpha_1)\Gamma(\alpha_0)} \theta^{\alpha_1 - 1} (1 - \theta)^{\alpha_0 - 1}, \tag{B.4}$$
where $\alpha_1$ and $\alpha_0$ are hyper-parameters that are influenced by the number of virtual face/non-face data. The gamma function $\Gamma(x)$ is given by:
$$\Gamma(x) = \int_0^{\infty} u^{x - 1} e^{-u} \, du. \tag{B.5}$$
Considering the Bayesian rule, the posterior probability of the distribution parameter $\theta$ can be calculated in terms of the prior probabilities as:
$$P(\theta \mid \psi) = \frac{P(\psi \mid \theta) P(\theta \mid \alpha_1, \alpha_0)}{P(\psi)}. \tag{B.6}$$
Substituting (B.2) and (B.4) into (B.6) yields another Beta distribution with new hyper-parameters:
$$P(\theta \mid \psi) = \frac{1}{P(\psi)} \, \theta^{L_k} (1 - \theta)^{L - L_k} \, \frac{\Gamma(\alpha_1 + \alpha_0)}{\Gamma(\alpha_1)\Gamma(\alpha_0)} \theta^{\alpha_1 - 1} (1 - \theta)^{\alpha_0 - 1} = \mathrm{Beta}(\theta \mid L_k + \alpha_1, L - L_k + \alpha_0). \tag{B.7}$$
To calculate the probability of $f_k$ from a sequence of $L$ observations, the uncertainty of the distribution parameter $\theta$ is taken into account by integrating the feature probability function over the distribution of $\theta$, as in (B.8). This gives the probability that feature $f_k$ comes from a part of the face ($H_k = 1$) when a sequence of $L$ observations $\psi$ of this feature is given but $L$ is small.
$$\begin{aligned}
P(H_k = 1 \mid \psi) &= \int_0^1 P(H_k = 1 \mid \theta) P(\theta \mid \psi) \, d\theta \\
&= \int_0^1 \theta \, P(\theta \mid \psi) \, d\theta \\
&= \int_0^1 \theta \, \mathrm{Beta}(\theta \mid L_k + \alpha_1, L - L_k + \alpha_0) \, d\theta \\
&= \frac{L_k + \alpha_1}{L + \alpha_1 + \alpha_0}.
\end{aligned} \tag{B.8}$$
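As a numerical illustration of (B.3) and (B.8), the sketch below compares the MLE with the Bayesian estimate for a small number of observations and checks the closed form in (B.8) against a direct numerical integration of the Beta posterior. The function names, the uniform prior ($\alpha_1 = \alpha_0 = 1$) used as a default, and the midpoint integration rule are assumptions chosen for the example. The closed form $L_k / L$ for the MLE is the standard Bernoulli result rather than something stated in the text.

```python
import math


def mle_estimate(l_k, l):
    """MLE of theta for Bernoulli data: fraction of face observations."""
    if l == 0:
        raise ValueError("the MLE is undefined without observations")
    return l_k / l


def posterior_predictive(l_k, l, alpha1=1.0, alpha0=1.0):
    """P(H_k = 1 | psi): mean of Beta(l_k + alpha1, l - l_k + alpha0)."""
    return (l_k + alpha1) / (l + alpha1 + alpha0)


def beta_pdf(theta, a, b):
    """Beta density, with the Gamma terms evaluated in log space."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    log_kernel = (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta)
    return math.exp(log_norm + log_kernel)


def beta_mean_numeric(a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of theta * Beta(theta | a, b)."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        theta = (i + 0.5) * h  # midpoints avoid the endpoints 0 and 1
        total += theta * beta_pdf(theta, a, b)
    return total * h
```

For example, with $L = 4$ observations of which $L_k = 3$ are face observations, `mle_estimate(3, 4)` gives 0.75, while `posterior_predictive(3, 4)` gives 4/6 ≈ 0.667, pulled towards the prior mean of 0.5. With no observations at all, the Bayesian estimate falls back to the prior mean, whereas the MLE is undefined.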