ANVAR et al.: Multi-View Face Detection and Registration Requiring Minimal Manual Intervention

APPENDIX A
Finding Distinctive Features

The scale-invariant features (SIFT) extracted from face images usually include a substantial number of common or uninformative features in addition to the required distinctive ones. It is therefore necessary to select only the distinctive features obtained from the discriminative parts of a face. Two hypotheses exist: either the feature is part of a face (H_{i,k} = 1) or it comes from the background (H_{i,k} = 0). Since at this stage we do not know where the face is, we assume that a distinctive feature arising from the face should be unique. Thus, in order to identify the distinctive features and discard the common ones, the likelihood ratio test (A.1) is applied to all features extracted from both images. Features with a likelihood ratio less than one are considered common features and are discarded:

\gamma(f_{i,k}) = \frac{P(H_{i,k} = 1 \mid \Omega)}{P(H_{i,k} = 0 \mid \Omega)} < 1,  (A.1)

where f_{i,k} is the k-th feature of image i and \gamma(f_{i,k}) is the likelihood ratio for this feature once some independent data \Omega has been observed. Since \gamma(f_{i,k}) has to be determined for every feature found in the image, we use the Bernoulli distribution in (A.2) to model the probability that a feature f_{i,k} comes from an object or non-object part (H_{i,k} \in \{0, 1\}):

P(H_{i,k} \mid \theta) = \theta^{H_{i,k}} (1 - \theta)^{1 - H_{i,k}},  (A.2)

where \theta is the distribution parameter. We use the maximum likelihood estimate (MLE) to estimate the probabilities in (A.1) as follows. Let f_{1,k} be a feature from the first image I_1 that is compared to all N_2 features (f_{2,j}, j = 1, \ldots, N_2) in the second image I_2. If this feature is a distinctive feature, it should be unique, and its occurrence in any other part of the image (both face and background) would be very limited. Ideally there is only one such feature, and for symmetric face parts only two features can be found.
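The ratio test above can be sketched as follows. This is a minimal illustration, assuming the per-feature match count \alpha has already been computed (the function names and list-based interface are illustrative, not from the paper); since both probability estimates share the 1/N_2 factor, the ratio reduces to 2/\alpha and the test \gamma \ge 1 amounts to keeping features with at most two matches:

```python
def likelihood_ratio(alpha):
    """Likelihood ratio gamma(f) = P(H=1|Omega) / P(H=0|Omega).

    P(H=1|Omega) is estimated as 2/N2 (a distinctive feature occurs at
    most twice, once per symmetric face half) and P(H=0|Omega) as
    alpha/N2, so the N2 factors cancel. A feature with no match at all
    is treated as maximally distinctive.
    """
    return float('inf') if alpha == 0 else 2.0 / alpha

def keep_distinctive(match_counts):
    """Indices of features passing the test: gamma >= 1, i.e. alpha <= 2."""
    return [k for k, a in enumerate(match_counts)
            if likelihood_ratio(a) >= 1.0]
```

For example, features with match counts [1, 2, 5, 0] keep indices 0, 1 and 3: only the feature matched five times is discarded as common.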
Thus, we can estimate its probability as P(H_{1,k} = 1 \mid \Omega) = 2 / N_2. If, however, this feature originates from a non-distinctive part of the image, the number of similar features found for it can be any number \alpha > 2, and we estimate its probability as P(H_{1,k} = 0 \mid \Omega) = \alpha / N_2. To compute the similarity count \alpha of feature f_{1,k} in the first image, its appearance descriptor f^A_{1,k} is compared to the appearance descriptors f^A_{2,j} of all features in the second image within a global threshold, \| f^A_{1,k} - f^A_{2,j} \| < Thr_a. This procedure is then repeated for each feature found in the second image, which is compared to all the features in the first image. Finally, all common features found are removed.

To estimate a suitable value for Thr_a, several image pairs from different databases (FERET, CMU, and FDDB) with various poses (left, right, and frontal views) are used for evaluation. Sample results of finding the correspondence points between image pairs with different thresholds are shown in Fig. A.1. From the empirical tests, we found that a value between 0.2 and 0.3 gives the best results, and this range is used for Thr_a. A larger value increases the number of wrongly eliminated distinctive features, while a lower threshold increases the number of false distinctive features. Since this threshold applies only to the appearance descriptor, its value is fixed for features extracted with the SIFT method (standard setting with 128 bins).

Fig. A.1. Results of applying different feature appearance thresholds (Thr_a) to find common and distinctive features. Several image pairs from different databases (CMU, FDDB, and FERET) and faces in various poses (left, frontal, and right views) are used to determine a suitable value for the threshold. The empirical test shows that a value between 0.2 and 0.3 is suitable for keeping valid distinctive points while eliminating the noise.
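The descriptor comparison that produces \alpha can be sketched as below. This is a hedged illustration, not the paper's implementation: the function names are assumed, the descriptors are taken to be 128-dimensional SIFT vectors already normalised, and the exact form of the distance used against Thr_a is not fully recoverable from the text (plain Euclidean distance is assumed here):

```python
import math

def descriptor_distance(d1, d2):
    """Euclidean distance between two appearance descriptors
    (e.g. 128-bin SIFT vectors)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(d1, d2)))

def similarity_count(desc, other_descs, thr_a=0.25):
    """Similarity count alpha: number of features in the other image
    whose appearance descriptor lies within the global threshold
    Thr_a of `desc` (0.25 sits in the paper's empirical 0.2-0.3 range)."""
    return sum(1 for d in other_descs
               if descriptor_distance(desc, d) < thr_a)
```

Running this for every feature of image I_1 against all descriptors of I_2, and then symmetrically for I_2 against I_1, yields the match counts used in the likelihood ratio test.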
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE

APPENDIX B
Face Features Estimation

Let feature f_k be a binary random variable that may originate from a face (H_k = 1) or a non-face (H_k = 0) area. We can assign to it a Bernoulli distribution with parameter \theta that determines its uncertain value:

P(H_k \mid \theta) = \theta^{H_k} (1 - \theta)^{1 - H_k}.  (B.1)

Assuming a sequence of L independent observations D of feature f_k in the training data, the likelihood of the data given \theta is

P(D \mid \theta) = \prod_{l=1}^{L} \theta^{H_k^l} (1 - \theta)^{1 - H_k^l} = \theta^{L_f} (1 - \theta)^{L - L_f},  (B.2)

where L_f is the number of times that feature f_k has represented a face area in the observation data D. When L is large, we can estimate the distribution parameter \theta using the MLE:

\hat{\theta}_{MLE} = \arg\max_{\theta} P(D \mid \theta).  (B.3)

However, if L is small, the MLE produces a biased outcome, and there is also the sparse-data problem. Since we know that the number of false correspondences will be low, we use a Bayesian approach and consider an additional uncertainty on the distribution parameter \theta. Thus, we place a Beta prior probability distribution on \theta:

P(\theta \mid \alpha_1, \alpha_0) = \frac{\Gamma(\alpha_1 + \alpha_0)}{\Gamma(\alpha_1)\Gamma(\alpha_0)} \theta^{\alpha_1 - 1} (1 - \theta)^{\alpha_0 - 1},  (B.4)

where \alpha_1 and \alpha_0 are hyper-parameters whose values are influenced by the number of virtual face/non-face data, and the gamma function \Gamma(x) is given by

\Gamma(x) = \int_0^{\infty} u^{x-1} e^{-u} \, du.  (B.5)

By Bayes' rule, the posterior probability of the distribution parameter \theta can be written in terms of the prior probabilities as

P(\theta \mid D) = \frac{P(D \mid \theta) \, P(\theta \mid \alpha_1, \alpha_0)}{P(D)}.  (B.6)

Substituting (B.2) and (B.4) into (B.6) yields another Beta distribution with new hyper-parameters:

P(\theta \mid D) = \frac{1}{P(D)} \frac{\Gamma(\alpha_1 + \alpha_0)}{\Gamma(\alpha_1)\Gamma(\alpha_0)} \theta^{L_f} (1 - \theta)^{L - L_f} \theta^{\alpha_1 - 1} (1 - \theta)^{\alpha_0 - 1} = \mathrm{Beta}(\theta \mid L_f + \alpha_1, L - L_f + \alpha_0).  (B.7)

To calculate the probability of f_k from a sequence of L observations, the uncertainty of the distribution parameter \theta is taken into account by integrating the feature probability over the distribution of \theta, as in (B.8). This gives the probability that feature f_k comes from a part of a face (H_k = 1) given a sequence of L observations D, even when the number of observations is small:

P(H_k = 1 \mid D) = \int_0^1 P(H_k = 1 \mid \theta) \, P(\theta \mid D) \, d\theta = \int_0^1 \theta \, \mathrm{Beta}(\theta \mid L_f + \alpha_1, L - L_f + \alpha_0) \, d\theta = \frac{L_f + \alpha_1}{L + \alpha_1 + \alpha_0}.  (B.8)
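The closed form in (B.8) makes the MLE-versus-Bayes comparison easy to see numerically. The sketch below is illustrative only: the function names and the hyper-parameter values \alpha_1 = \alpha_0 = 2 (two virtual face and two virtual non-face observations) are assumptions, not values from the paper:

```python
def mle_estimate(l_f, l):
    """MLE of theta from (B.3): L_f / L, biased when L is small."""
    return l_f / l

def bayes_estimate(l_f, l, alpha1=2, alpha0=2):
    """P(H_k = 1 | D) from (B.8): the mean of the posterior
    Beta(L_f + alpha1, L - L_f + alpha0).

    alpha1 and alpha0 act as virtual face/non-face counts; the
    values used here are illustrative.
    """
    return (l_f + alpha1) / (l + alpha1 + alpha0)
```

With only L = 2 observations, both on the face (L_f = 2), the MLE returns the overconfident extreme 1.0, whereas the Bayesian estimate gives (2 + 2) / (2 + 2 + 2) = 2/3, illustrating how the prior tempers the sparse-data bias.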