Approach to Face Recognition Using Neural Networks and Bayesian Network

David Witherspoon
University of Colorado at Boulder
Boulder, Colorado 80309

ABSTRACT
This paper covers some of the basic areas of face recognition and the improvements that have been made in them over the years. It does not cover face detection; that area is left to other research papers. The focus here is on face recognition, looking at two different classifiers: the Neural Network and the Bayesian Network. The paper covers the use of Principal Component Analysis to produce Eigenfaces, which reduces the dimensionality of the space. The benefits of this reduction are explained in the section on Eigenfaces. The paper also covers Ensemble Neural Networks and their benefits, along with a way to increase the speed of Back Propagation, whose slow training is a major deterrent to the widespread use of Neural Networks with large data sources.

Keywords
Neural Network, Face Recognition, Bayesian Network, Ensemble Neural Network, Eigenfaces.

1. INTRODUCTION
Over the years there has been a dramatic increase in interest in pattern and face recognition, as seen in Figure 1. Face recognition is being used, or has the potential to be used, by police departments, federal agencies, the military, security systems, and many more. One application that I have recently been hearing about is training a face recognition application to learn a face and then search through all of your photos to determine which ones you, or that person, appear in.

Figure 1: Number of items published on Face Recognition

There have been many advances over the years in the classifiers that are used and in the way they are used. There have also been advances in which features are selected from the images and in the manner in which they are selected. I believe that there could even be an application for face recognition on sites like Facebook or MySpace.
They already allow the user to manually mark a picture and label that marking with the person in the photograph, so they already have the ability to store the results that an automated face recognition system would produce when performing the classifications. These ideas led me to the research that I am presenting in this paper, and they are explained in the section on the Proposed Project.

I will not be covering the different methods of face detection, which would be the first component that we need to deal with. My focus has been on the last three steps, which are more related to face recognition than to face detection. In the related work section I will cover Eigenfaces, Neural Networks (including Ensemble Neural Networks, Multilayer Neural Networks, and Accelerating Back Propagation), and the Bayesian Network, followed by some examples of face data sources.

2. RELATED WORK
2.1 Eigenfaces
Eigenfaces can be extracted from face images by performing Principal Component Analysis (PCA); Sirovich and Kirby were among the first researchers to use this approach. They showed that any particular face can be represented in the eigenpicture coordinate space using a much smaller amount of memory [4]. A face can also be reconstructed from a small collection of eigenpictures and their corresponding projections, called coefficients, along each eigenpicture [4]. PCA computes K orthonormal vectors that provide a basis for the normalized input data; because the data is projected onto a smaller space, this results in dimensionality reduction. The components are ordered by decreasing significance, which allows even further reduction by using only the top 20 to 50 or so coefficients. Applying PCA and producing the Eigenfaces reduces the number of dimensions that the classifier needs to explore, which helps the classifier when the data is sparse and skewed.
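As a rough illustration of the PCA step described above, the following Python sketch computes a mean face, a small set of orthonormal eigenfaces, and the projection coefficients of one image. It is a minimal sketch rather than the method of any cited paper: it assumes NumPy, uses random arrays in place of real face images, obtains the eigenvectors through a singular value decomposition instead of an explicit covariance matrix, and the function names (fit_eigenfaces, project, reconstruct) are hypothetical.

```python
import numpy as np


def fit_eigenfaces(images, k):
    """Return the mean face and the top-k eigenfaces of a stack of images.

    images: array of shape (M, H, W); the result has eigenfaces of shape (k, H*W).
    """
    # Flatten every image into one row vector.
    X = images.reshape(len(images), -1).astype(np.float64)
    mean_face = X.mean(axis=0)
    centered = X - mean_face
    # The rows of vt are orthonormal eigenvectors of the data covariance,
    # already sorted by decreasing singular value (decreasing significance).
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean_face, vt[:k]


def project(image, mean_face, eigenfaces):
    """Coefficients of one image along each eigenface."""
    return eigenfaces @ (image.ravel().astype(np.float64) - mean_face)


def reconstruct(coeffs, mean_face, eigenfaces):
    """Approximate flattened image rebuilt from its coefficients."""
    return mean_face + coeffs @ eigenfaces


# Synthetic stand-in data: 12 random 8x6 "images" (assumption, not real faces).
rng = np.random.default_rng(0)
faces = rng.random((12, 8, 6))

mean_face, eigenfaces = fit_eigenfaces(faces, k=5)
coeffs = project(faces[0], mean_face, eigenfaces)
approx = reconstruct(coeffs, mean_face, eigenfaces)
```

Keeping only the first k rows of the basis is what gives the dimensionality reduction: each 8x6 image in this toy example is summarized by 5 coefficients instead of 48 pixel values, and those coefficients are what a classifier would receive as input.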
The eigenvectors are called Eigenfaces because, when displayed, they look like ghostly faces, as seen in Figure 2.

Figure 2: Examples of Eigenfaces

The Eigenfaces define a feature space, or face space, whose dimensionality is dramatically reduced from that of the original space [4]. This reduced space lowers both the complexity of the classification and the storage needed for the images.

The average face of the training set is defined in equation (1), where the average face of the whole face distribution is calculated by averaging each pixel over all the images [3]. In the formula, each training image $I_n(x, y)$ is an N×N intensity array, and the training set of face regions is $\{I_n(x, y),\ n = 1, 2, 3, \ldots, M\}$ [9]. The mean adjusted image is defined in equation (2) [9]. The covariance matrix, denoted $C$, is defined in equation (3) [9]. Using PCA we can obtain a set of M orthonormal eigenvectors $u_k$ and their eigenvalues $\lambda_k$ of the covariance matrix $C$ [9]. Using equation (4) we can calculate the Eigenface components of a new image block $I(b)$ by projecting $I(b)$ onto the average face space defined in equation (1) [9]. Finally, we can calculate the reconstructed image projection weight using equation (5) [9]. With all of these calculations, we can use the reconstructed image value from equation (5) and the distance $(I(b) - \psi)$ from equation (4) to determine whether the block $I(b)$ is similar to a face image or not [9]. Since we select only the Eigenfaces with the higher eigenvalues, we have a smaller space in which to compare and classify the images. This is the benefit of using Eigenfaces.

2.2 Neural Network
In the overall design of the neural network, you must make sure that the number of free parameters used in the network is less than the number of training examples [1]. If it is not, overfitting will occur and it becomes impossible for the neural network to learn properly. The other issue to be concerned about is reducing the number of features that the neural network takes as input to the input layer; otherwise the size of the neural network will exceed the memory space of any server that the application would be running on.

2.2.1 Neural Network Ensemble
The Neural Network Ensemble is a collection of multilayer feed-forward neural networks (MNNs), each of which is a network with hidden neurons. These hidden neurons are contained within a hidden layer, and the most common MNNs contain only a single hidden layer, as shown in Figure 3. The advantage of adding a hidden layer to the neural network is that it enlarges the space of hypotheses that the neural network can represent [8].

Figure 3: Multilayer Neural Network

As you can see in Figure 3, the output from the previous layer is the input for the next layer. Not all MNNs have a single neuron in the output layer; they may have more than one, depending on the classification that the neural network is performing. In the domain of face recognition we will have a neuron in the output layer for each face, or person, that we are trying to classify.

The number of neurons needed in the Input Layer to classify a face from an image is the product of the row and column size of the image. For example, a 200x100 pixel image requires 20,000 neurons in the Input Layer of our MNN. Having this many neurons in the Input Layer alone is the reason for using Eigenfaces or another way of reducing the search space.

Finally we have the edges, or transformations, between the outputs of the neurons in the previous layer and the inputs of the neurons in the next layer. Each neuron has an activation function that determines what the output of the neuron will be. For pattern/face recognition it has been shown that the tan-sigmoid activation function, with an output range of [-1, 1], is better than the log-sigmoid [7]. The value of the activation function is the value that is passed as the input to the next layer of the neural network. The closer the value is to 1, the higher the probability that the class associated with the neuron in the output layer is the class of the image being evaluated.

Now that the architecture of the multilayer neural network has been described, we need to discuss the recommended method for training the MNN. While training the MNN we determine the number of nodes contained within the hidden layer through experimentation. For example, we can use cross-validation, typically 10-fold cross-validation, to test the MNN with different numbers of nodes in the hidden layer and determine the number that gives the most accurate predictions. During training we use back propagation to determine the weights on the edges between the nodes of the different layers. This is accomplished by back-propagating the error from the output layer back through the hidden layer. It has been found that the most efficient back propagation algorithm for pattern/face recognition, according to the criteria of training process stability and recognition accuracy, is the gradient descent method with momentum and an adaptive learning rate [7]. Each pass through the training data, back propagating the error to alter the weights in the neural network, is called an epoch. The system will continue through hundreds of epochs until the specified stopping criterion for changes in the weights has been met. This is both one of the big advantages and one of the big disadvantages of the neural network and back propagation. The advantage is that the developer lets the system tune itself by back propagating the error and altering the weights. The disadvantage is the amount of time it takes to reach the stopping criterion and the number of epochs that the system has to go through in order to meet it. In section 2.2.2 we will look at a way to accelerate back propagation.

Now that we have covered the background on multilayer neural networks, we need to look at the advantages of the ensemble of multilayer feed-forward neural networks (EMNN). As stated before, the EMNN is a collection of MNNs, each of which has its weights initialized randomly and is trained in the manner discussed above. After they are trained and all of the MNNs have met the stopping criterion set for altering the weights during back propagation, they are ready to classify images. With the EMNN, each individual MNN provides a vote on the classification of the image, and the EMNN returns the majority vote provided by the collection of MNNs. This is the main advantage of the EMNN architecture, since it increases the recognition rate compared to using a single MNN [7].

In Figure 4 we can see the benefits of using multiple MNNs in the EMNN over a single MNN. The results in Figure 4 show that having five or seven MNNs in the EMNN gives a lower average recognition error. Comparing the EMNN (5xMNNs) with the EMNN (7xMNNs), we can see that there is not a drastic difference between the two. The EMNN (5xMNNs) is therefore selected, because training it takes less time than training the one with seven [7].

Figure 4: Face Recognition using MNN and EMNN

In order to improve on the high recognition error rate on unknown classes, a modification to the EMNN’s decision rule is needed. Instead of using the voting rule alone, using a combination of the voting rule and a recognition threshold value works much better, as discovered in [7].
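The voting step, together with a recognition threshold of the kind just mentioned, could be sketched in Python as follows. This is only one possible reading, not the exact rule from [7]: the function name ensemble_vote is hypothetical, a strict majority of the networks is assumed to be required, and the threshold is assumed to apply to the strongest output among the networks that voted with the majority.

```python
import numpy as np


def ensemble_vote(outputs, threshold=None):
    """Combine the output layers of several networks into one decision.

    outputs: array of shape (n_networks, n_classes) holding each network's
             output-layer activations for a single image.
    Returns the winning class index, or None when the image is treated
    as belonging to an unknown class.
    """
    outputs = np.asarray(outputs, dtype=float)
    # Each network votes for the class of its strongest output neuron.
    votes = outputs.argmax(axis=1)
    counts = np.bincount(votes, minlength=outputs.shape[1])
    winner = int(counts.argmax())
    # Assumption: more than half of the networks must agree.
    if counts[winner] <= len(outputs) / 2:
        return None
    if threshold is not None:
        # Assumption: at least one network in the majority must reach
        # the recognition threshold on the winning class.
        majority_scores = outputs[votes == winner, winner]
        if majority_scores.max() < threshold:
            return None
    return winner


# Five hypothetical networks, three known classes.
confident = [[0.9, 0.1, -0.2],
             [0.7, 0.2, 0.0],
             [0.1, 0.8, 0.3],
             [0.6, 0.1, 0.2],
             [-0.1, 0.2, 0.5]]
weak = [[0.30, 0.1, 0.0],
        [0.20, 0.1, 0.0],
        [0.35, 0.1, 0.0],
        [0.00, 0.2, 0.1],
        [0.00, 0.1, 0.3]]

confident_result = ensemble_vote(confident, threshold=0.4)
weak_result = ensemble_vote(weak, threshold=0.4)
weak_without_threshold = ensemble_vote(weak)
```

In the second example three of the five networks agree on class 0, but none of them is confident, so the thresholded rule reports an unknown class where plain voting would have accepted the match.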
The new decision rule for the EMNN is that when an MNN’s output is within the majority, that output must also be greater than or equal to the threshold value [7]. This new decision rule is called 3t, and the results are shown in Figure 5. As you can see, the error rate for unknown classes was reduced from 46% to 25% when the threshold value was 0.4. The average error went from 26% down to 17% at the same threshold value.

Figure 5: EMNN with 3t decision rule

The decision rule was then altered again so that the majority must contain most of the MNNs plus one, with the threshold applied to one or two of the MNNs, which gave another improvement in the results [7]. As you can see in Figure 6, the average recognition error went from 17% under the previous decision rule to 13%.

Figure 6: EMNN with 4t decision rule

Therefore, using the EMNN architecture to classify the image yields a lower error rate because of the larger number of MNNs used for each classification. We will look at ways to improve the speed of training the classifier next.

2.2.2 Accelerating Back Propagation
The time required to train a neural network with back propagation is the biggest deterrent to using it. Back propagation learning is a slow and time consuming process for most applications. For instance, as the application becomes more complex and the number of neurons increases, back propagation requires too much time and thus does not scale to large systems. As we found earlier, gradient (or steepest) descent is the best method for pattern/face recognition to reduce the error.

The squared error for instance $n$, summed over all the neurons $j$ in the output layer, where $d_j(n)$ is the desired outcome and $a_j(n)$ is the actual outcome, is written as:

$$E(n) = \frac{1}{2} \sum_{j} \left(d_j(n) - a_j(n)\right)^2 \qquad (6)$$

The average squared error over the total number of instances $N$ is written as:

$$E_{avg} = \frac{1}{N} \sum_{n=1}^{N} E(n) \qquad (7)$$

During learning, back propagation minimizes $E_{avg}$ by adjusting the thresholds and the synaptic weight $W_{ji}(n)$ on the edge connecting the output of neuron $i$ to the input of neuron $j$. The weights are adjusted during gradient descent using the formula below, where $\eta$ is the learning rate:

$$W_{ji}(n+1) = W_{ji}(n) - \eta \frac{\partial E(n)}{\partial W_{ji}(n)} \qquad (8)$$

There are two basic issues associated with the learning rate of the back propagation algorithm. The first is that the convergence rate depends on the ratio $\lambda_{max} / \lambda_{min}$. If a large learning rate is selected to suit the weights associated with $\lambda_{min}$, it will be too large for the weights associated with $\lambda_{max}$. Likewise, if a small learning rate is selected to suit the weights associated with $\lambda_{max}$, it will be too small for the weights associated with $\lambda_{min}$ [2]. In summary, the smaller the learning rate, the slower the convergence; the higher the learning rate, the faster the convergence, but at the risk of oscillating across the minimum error.

The other basic issue is that the length of the gradient varies over a wide range, which makes solving difficult problems very time consuming [2]. Therefore, the learning rate needs to be assigned a small value in order to find a solution, at the cost of a slow convergence rate and a slow system. From these two issues we can see that the value of the learning rate is critical, which leads us to select the Method 2 learning rate update rule defined in [2].

The results presented in Figure 7, taken from [2], show that the Method 2 learning rate update rule is a large improvement over standard back propagation and that the number of epochs is greatly reduced. Even though Method 1 has better numbers, Method 2 does not require any comparison of the gradient $\partial E(n) / \partial W_{ji}(n)$; therefore Method 2 was selected as the optimal solution for accelerating the learning process of back propagation [2].

Figure 7: Results of Method to Accelerate Back Propagation

2.3 Bayesian Classifier
Here we look at a probabilistic similarity measure based on the Bayesian belief that the intensity differences between two images are characteristic of the typical variations in the appearance of a specific person [5,6]. For example, some of these differences could be due to lighting, expression, or similar small changes. The difference in intensity is represented as $\Delta = I_1 - I_2$, where $I_1$ is the intensity of image 1 and $I_2$ is the intensity of image 2. One class of facial image variations is the intrapersonal variations $\Omega_I$, which correspond to different facial expressions and lighting of the same person. The other class is the extrapersonal variations $\Omega_E$, which correspond to variations between different individuals. With these definitions, the similarity measure is expressed in terms of the probability

$$S(I_1, I_2) = P(\Delta \in \Omega_I) = P(\Omega_I \mid \Delta) \qquad (9)$$

where $P(\Omega_I \mid \Delta)$ is the a posteriori probability given by Bayes’ rule, using estimates of the likelihoods $P(\Delta \mid \Omega_I)$ and $P(\Delta \mid \Omega_E)$ [5,6].
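As an illustration of how such a posterior could be computed from the two likelihoods, the Python sketch below applies Bayes' rule to a pair of log-likelihoods. It is a sketch under stated assumptions only: each class is modeled as a zero-mean Gaussian with a diagonal covariance, the variance values are made up for the example, equal priors are used by default, and the function names are hypothetical.

```python
import math


def gaussian_log_likelihood(delta, variances):
    """Log density of a zero-mean Gaussian with diagonal covariance.

    Assumption: this simple density stands in for the estimated likelihood
    of an intensity-difference vector under one class of variation.
    """
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + d * d / v)
               for d, v in zip(delta, variances))


def intrapersonal_posterior(log_lik_intra, log_lik_extra, prior_intra=0.5):
    """Posterior probability of the intrapersonal class by Bayes' rule."""
    a = log_lik_intra + math.log(prior_intra)
    b = log_lik_extra + math.log(1.0 - prior_intra)
    # Subtract the larger exponent first so the exponentials cannot underflow
    # to 0/0 when both likelihoods are tiny.
    m = max(a, b)
    return math.exp(a - m) / (math.exp(a - m) + math.exp(b - m))


# Hypothetical variances: same-person differences are small, while
# differences between different people are large.
intra_var = [0.04, 0.04, 0.04]
extra_var = [1.0, 1.0, 1.0]

small_delta = [0.1, -0.2, 0.05]   # e.g. a slight lighting change
large_delta = [1.5, -2.0, 1.0]    # e.g. two different people

same = intrapersonal_posterior(
    gaussian_log_likelihood(small_delta, intra_var),
    gaussian_log_likelihood(small_delta, extra_var))
different = intrapersonal_posterior(
    gaussian_log_likelihood(large_delta, intra_var),
    gaussian_log_likelihood(large_delta, extra_var))

# Two images are declared the same individual when the posterior exceeds 1/2.
is_same_person = same > 0.5
```

Working with log-likelihoods is a design choice for numerical safety; with real image data the difference vectors are high dimensional and the raw densities would be far too small to multiply directly.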
Given these likelihoods, we can calculate the similarity score between image 1 and image 2 in terms of the intrapersonal a posteriori probability, as given by Bayes’ rule, using equation (10) below:

$$S(I_1, I_2) = \frac{P(\Delta \mid \Omega_I)\, P(\Omega_I)}{P(\Delta \mid \Omega_I)\, P(\Omega_I) + P(\Delta \mid \Omega_E)\, P(\Omega_E)} \qquad (10)$$

This simpler problem is then solved using the maximum a posteriori (MAP) rule, where two images are of the same individual if $S(I_1, I_2) > 1/2$, or equivalently $P(\Omega_I \mid \Delta) > P(\Omega_E \mid \Delta)$ [5,6]. An alternative probabilistic similarity measure can be defined using the intrapersonal likelihood by itself, which gives a simpler formula:

$$S' = P(\Delta \mid \Omega_I) \qquad (11)$$

This leads to maximum likelihood (ML) recognition [5,6]. The experimental results presented in [5] indicate that this simplified ML measure is almost as effective as the more complicated MAP measure in most cases.

3. DATA SOURCES
3.1 AT&T The Database of Faces
The Database of Faces was formerly known as the ORL Database of Faces. The database can be found at http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html . It contains 10 different images of each of 40 distinct individuals. For some of the individuals, the pictures were taken at different times, with varying lighting and facial expressions. The variations include open or closed eyes, smiling or not smiling, and facial details such as glasses or no glasses.

Figure 8: Example of the Database of Faces

From the research that I have done, I have noticed that more issues arise from facial augmentation, such as tinted glasses or a change in facial hair. In the first case the eyes cannot be seen, and in the other the features of the face are hidden by facial hair. All of the images in the database were taken against a dark homogeneous background, which is not representative of a real-life setting.
Also, the subjects are in an upright, frontal position with some slight side movement, which again is closer to a best-case scenario than to what realistic situations would present. An example is provided in Figure 8.

3.2 FERET Data Source
The FERET database contains images of 1196 individuals, with up to 5 different images captured for each individual. An example is provided in Figure 9. The images are separated into two sets: gallery images and probe images. Gallery images are images with known labels, while probe images are matched to gallery images for identification. The database is broken into four categories:

FB: Two images were taken of an individual, one after the other. One image is of the individual with a neutral facial expression, while the other is of the individual with a different expression. One of the images is placed into the gallery file while the other is used as a probe. In this category, the gallery contains 1196 images, and the probe set has 1195 images.

Duplicate I: The only restriction of this category is that the gallery and probe images are different. The images could have been taken on the same day or 1½ years apart. In this category, the gallery consists of the same 1196 images as the FB gallery, while the probe set contains 722 images.

FC: Images in the probe set are taken with a different camera and under different lighting than the images in the gallery set. The gallery contains the same 1196 images as the FB and Duplicate I galleries, while the probe set contains 194 images.

Duplicate II: Images in the probe set were taken at least 1 year after the images in the gallery. The gallery contains 864 images, while the probe set has 234 images.

Figure 9: Examples of FERET frontal-view image pairs

4. PROPOSED PROJECT
4.1 Face Recognition using Hybrid Network
The proposed project for working with face recognition would be a hybrid of the Ensemble Multilayer Neural Network and the Bayesian Classifier presented above. The application would begin by using PCA to generate the Eigenfaces, reducing the number of dimensions that the system has to work with. The 20 to 50 coefficients selected from the Eigenfaces would then be the inputs to the Input Layer of each of the MNNs contained within the EMNN. I would use the Accelerated Back Propagation to speed up the training of the classifier.

The goal of the EMNN is to determine whether or not the two images show completely different people. If they are completely different people, the classifier would return that result. If the EMNN classifier knows for sure that the images show the same person, it would return that classification. If for some reason it does not know for sure that they are the same person, it would pass the information on to the Bayesian classifier. At the Bayesian classifier we would use the simplified ML measure to determine whether the two images contain the same person and differ only in lighting or facial expression. At this point the Bayesian classifier would be able to determine the maximum likelihood that the two images contain the same person and return the classification as the result.

I had much success in the Data Mining class using a hybrid application (Locally Weighted Naïve Bayesian Classifier and KNN) to predict movie ratings for users, and I believe that a hybrid system will work well in this case too. I feel that this gives the best of both worlds, and with the accelerated back propagation added to address the training time issue of the Neural Network, I feel that this would be a viable application for performing face recognition.

5. CONCLUSION
Over the years a lot of great research has been done in the face detection and face recognition realm. I feel that I have only touched a small part of the entire field, but I also feel that I have learned a lot from the research that I have done. I believe that one of the true signs of having learned a lot about a topic is realizing how vast the body of knowledge on the subject is. Across all of the papers that I have read, I have found some issues that people are still working on and that I believe have not been solved. One of those is dealing with how faces change as people age. Another is men altering their facial hair, which affects the ability to extract features correctly, as does people wearing sunglasses. Another might be people altering their faces with plastic surgery and how face recognition systems would deal with that. It is truly amazing what we can do in recognizing faces in images, given all the steps that we must go through to process the image, and with such speed.

6. REFERENCES
[1] Bouattour, H., Fogelman Soulie, F., and Viennet, E., “Neural Nets for Human Face Recognition”, International Joint Conference on Neural Networks (IJCNN), Volume 3, 7-11 June 1992, Page(s): 700-704. Digital Object Identifier 10.1109/IJCNN.1992.227070.
[2] Evans, D.J., Ahmad Fadzil, M.H., and Zainuddin, Z., “Accelerating Back Propagation in Human Face Recognition”, International Conference on Neural Networks, 1997, Volume 3, 9-12 June 1997, Page(s): 1347-1352. Digital Object Identifier 10.1109/ICNN.1997.613974.
[3] Jamil, N., Iqbal, S., and Iqbal, N., “Face Recognition Using Neural Networks”, Multi Topic Conference, 2001. IEEE INMIC 2001. Technology for the 21st Century. Proceedings. IEEE International, 28-30 Dec. 2001, Page(s): 277-281.
[4] Liu, C. and Wechsler, H., “A Unified Bayesian Framework for Face Recognition”, Proc. of the 1998 IEEE International Conference on Image Processing, ICIP'98, 4-7 October 1998, Chicago, Illinois, USA, pp. 151-155.
[5] Moghaddam, B., Jebara, T., and Pentland, A., “Bayesian Face Recognition”, Pattern Recognition, Vol. 33, Issue 11, November 2000, pp. 1771-1782.
[6] Moghaddam, B., Nastar, C., and Pentland, A., “A Bayesian Similarity Measure for Deformable Image Matching”, Image and Vision Computing, Vol. 19, Issue 5, May 2001, pp. 235-244.
[7] Paliy, I., Sachenko, A., Koval, V., and Kurylyak, Y., “Approach to Face Recognition Using Neural Networks”, Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications, 2005. IDAACS 2005. IEEE, 5-7 Sept. 2005, Page(s): 112-115. Digital Object Identifier 10.1109/IDAACS.2005.282951.
[8] Russell, S. and Norvig, P., Artificial Intelligence: A Modern Approach, Second Edition. Prentice Hall, 2003.
[9] Tsai, C.C., Cheng, W.C., Taur, J.S., and Tao, C.W., “Face Detection Using Eigenface and Neural Network”, IEEE International Conference on Systems, Man, and Cybernetics, 2006. SMC '06. Volume 5, 8-11 Oct. 2006, Page(s): 4343-4347. Digital Object Identifier 10.1109/ICSMC.2006.384817.