3D Object Recognition – Face Detection

April 17, 2007
Bo Gao
Final Project Draft
The goal of this project is to study, implement, train, and test the neural network-based face detection system described in [1].
Pattern recognition and pattern detection are both instances of a wider class of computer vision
problems, called pattern classification.
Face Detection:
Given as input an arbitrary image, which could be a digitized video signal or a scanned photograph,
determine whether or not there are any human faces in the image, and if there are, return an encoding
of the location and spatial extent of each human face in the image. [3]
Face Recognition:
Given an input image of a face, compare the input face against models in a library of known faces
and report if a match is found. [3]
Face Localization:
The input image contains exactly one human face, and the task is to determine the location and scale
of the face, and sometimes also its pose. [3]
View-based Detection:
Almost all face detection systems use view-based detection: the detector classifies each small view window (20 x 20 pixels in [1]) as face or non-face.
Overall, face detection can be extremely difficult, because the inputs vary in facial appearance, lighting, shadow, scale, and image dimensions.
Training Data Preparation:
- For each face and non-face image:
o Subtract out an approximation of the shading plane to correct for single light source
o Rescale histogram so that every image has the same gray level range.
- Aggregate data into data sets.
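The gray-level rescaling step can be sketched as follows. This is a minimal illustration assuming 8-bit grayscale pixels in a flat array; it is not the project's actual code, which uses the Java Image Processing API [5].

```java
public class GrayRescale {
    // Linearly stretch the image's gray levels to the full 0..255 range,
    // so that every training image has the same gray level range.
    public static int[] rescale(int[] pixels) {
        int min = 255, max = 0;
        for (int p : pixels) {
            if (p < min) min = p;
            if (p > max) max = p;
        }
        int[] out = new int[pixels.length];
        int range = Math.max(1, max - min);   // avoid division by zero
        for (int i = 0; i < pixels.length; i++) {
            out[i] = (pixels[i] - min) * 255 / range;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] dim = {100, 110, 120};          // a low-contrast window
        int[] stretched = rescale(dim);
        System.out.println(java.util.Arrays.toString(stretched));
        // → [0, 127, 255]
    }
}
```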
Backpropagation Neural Network:
- Set all weights to random values in the range -1.0 to 1.0.
- Set an input pattern (binary values) on the neurons of the net's input layer.
- Activate each neuron of the following layer:
o Multiply the weight values of the connections leading to this neuron by the output values of the preceding neurons.
o Add up these values.
o Pass the result to an activation function, which computes the output value of this neuron.
- Repeat this until the output layer is reached.
- Compare the calculated output pattern to the desired target pattern and compute the squared error.
- Update each weight using the formula:
Weight(new) = Weight(old) + LearningRate * Error * Output(Neuron i) * Output(Neuron i + 1) * (1 - Output(Neuron i + 1))
- Repeat from the second step with the next input pattern.
- The algorithm ends when all output patterns match their target patterns.
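The weight-update formula above can be written out directly in code. This is a sketch of the update for a single connection; the variable names are illustrative:

```java
public class WeightUpdate {
    // w_new = w_old + learningRate * error * outputI * outputJ * (1 - outputJ),
    // where outputJ * (1 - outputJ) is the derivative of the sigmoid
    // activation at neuron i + 1.
    public static double update(double wOld, double learningRate,
                                double error, double outputI, double outputJ) {
        return wOld + learningRate * error * outputI * outputJ * (1.0 - outputJ);
    }

    public static void main(String[] args) {
        // Example: a small positive error nudges the weight upward.
        double wNew = update(0.1, 0.5, 0.2, 1.0, 0.8);
        System.out.println(wNew); // close to 0.116
    }
}
```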
Apply Face Detector to Image:
- Apply the 20 x 20 pixel view window at every pixel position in the input image.
- For each window region:
o Apply the linear fit function and histogram equalization function to the region.
o Pass the region to the trained neural network to decide whether or not it is a face.
o If the region is detected as a face, return a face rectangle scaled by the current scale factor.
- Scale the image down by a factor of 1.2.
- Go back to the first step if the image is still larger than the 20 x 20 pixel window.
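The scanning loop above can be sketched as follows. The example only counts window positions across pyramid levels; the real detector would classify each window with the trained net:

```java
public class WindowScan {
    public static final int WINDOW = 20;
    public static final double PYRAMID_STEP = 1.2;

    // Count how many 20 x 20 window positions (over all pyramid levels)
    // a width x height image produces.
    public static int countWindows(int width, int height) {
        int count = 0;
        double w = width, h = height;
        while (w >= WINDOW && h >= WINDOW) {
            int cols = (int) w - WINDOW + 1;
            int rows = (int) h - WINDOW + 1;
            count += cols * rows;   // one classification per position
            w /= PYRAMID_STEP;      // scale the image down by 1.2
            h /= PYRAMID_STEP;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countWindows(20, 20)); // → 1 (a single position)
        System.out.println(countWindows(100, 76));
    }
}
```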
Training Data Preparation
In order to train the neural network to distinguish face from non-face images, we pass in a set of face and non-face training data and expect the net to output 1 for a face and -1 for a non-face. Because the training images vary in lighting, shadow, and contrast, we need to equalize those differences. The linear fit function approximates the overall brightness of each part of the window. Histogram equalization then non-linearly maps the intensity values to expand the range of intensities in the window, thereby increasing the contrast.
The linear fit function can be described as follows:
(x, y, 1) * (a, b, c)^T = I(x, y)
where I(x, y) is the intensity of the pixel at (x, y), and (a, b, c)^T are the linear model parameters. Stacking one such equation per pixel, with X the matrix whose rows are (x, y, 1) and I the vector of intensities, the least-squares solution is
(a, b, c)^T = (X^T X)^-1 X^T I
and the corrected intensity is
I'(x, y) = I(x, y) - (a * x + b * y + c)
In order to implement the linear fit function, I used the Java Image Processing API [5], and the Java
Matrix Class [6]. The histogram equalization function is from the Java Image Processing API.
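The fit can also be computed without a general matrix library by solving the 3 x 3 normal equations directly. The following is a sketch under that assumption (not the project's actual code, which uses [5] and [6]):

```java
public class LinearFit {
    // Returns {a, b, c} minimizing the sum of (a*x + b*y + c - I(x,y))^2
    // over the window, i.e. the least-squares brightness plane.
    public static double[] fitPlane(double[][] img) {
        double sxx = 0, sxy = 0, sx = 0, syy = 0, sy = 0, n = 0;
        double sxi = 0, syi = 0, si = 0;
        for (int y = 0; y < img.length; y++) {
            for (int x = 0; x < img[0].length; x++) {
                double i = img[y][x];
                sxx += x * x; sxy += x * y; sx += x;
                syy += y * y; sy += y;     n  += 1;
                sxi += x * i; syi += y * i; si += i;
            }
        }
        // Normal equations (X^T X) p = X^T I, solved by Cramer's rule.
        double[][] m = {{sxx, sxy, sx}, {sxy, syy, sy}, {sx, sy, n}};
        double[] v = {sxi, syi, si};
        double det = det3(m);
        double[] p = new double[3];
        for (int k = 0; k < 3; k++) {
            double[][] mk = {
                {m[0][0], m[0][1], m[0][2]},
                {m[1][0], m[1][1], m[1][2]},
                {m[2][0], m[2][1], m[2][2]}};
            for (int r = 0; r < 3; r++) mk[r][k] = v[r];
            p[k] = det3(mk) / det;
        }
        return p;
    }

    static double det3(double[][] m) {
        return m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
             - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
             + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]);
    }

    // I'(x, y) = I(x, y) - (a*x + b*y + c)
    public static double[][] subtractPlane(double[][] img, double[] p) {
        double[][] out = new double[img.length][img[0].length];
        for (int y = 0; y < img.length; y++)
            for (int x = 0; x < img[0].length; x++)
                out[y][x] = img[y][x] - (p[0] * x + p[1] * y + p[2]);
        return out;
    }

    public static void main(String[] args) {
        // A purely linear brightness ramp should be flattened to ~0.
        double[][] img = new double[4][4];
        for (int y = 0; y < 4; y++)
            for (int x = 0; x < 4; x++)
                img[y][x] = 2 * x + 3 * y + 5;
        double[] p = fitPlane(img);
        System.out.printf("a=%.2f b=%.2f c=%.2f%n", p[0], p[1], p[2]);
    }
}
```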
Also, as [2] suggests, it is a good idea to mask an oval within the face rectangle to prune the pixels used to train the neural net. Figure 1 shows some of the processed images. Figure 2 shows some non-face images.
Figure 1. The first line shows the original faces. The second line is processed by the linear fit function. The third line is processed by the histogram equalization function. The last line is masked by the oval mask.
Figure 2. Non-face images, randomly picked from images containing no face.
Neural Network
Figure 3: A basic three-layer neural network.
A neural network is a simplified simulation of the human brain. Each neuron has an activation function, which can be a linear or sigmoid function. Basically, the activation function of a neuron computes a value based on the previous layer's outputs and the weights of the links connected to it. Commonly used activation functions include the identity, sigmoid, and tanh activation functions. There are three kinds of neuron layers: input, hidden, and output. There is exactly one input layer and one output layer, but there can be multiple hidden layers in the net. There is also a special neuron, the bias, which handles all-zero inputs: without the bias, if all input neurons receive 0, the weights have no effect and the output is always 0.
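The role of the bias can be seen in a single neuron's activation. This minimal sketch uses a tanh activation; the names are illustrative:

```java
public class Neuron {
    // Weighted sum of inputs plus the bias weight, passed through tanh.
    public static double activate(double[] inputs, double[] weights, double biasWeight) {
        double sum = biasWeight;          // the bias neuron always outputs 1
        for (int i = 0; i < inputs.length; i++)
            sum += inputs[i] * weights[i];
        return Math.tanh(sum);
    }

    public static void main(String[] args) {
        double[] zeros = {0.0, 0.0};
        double[] w = {0.5, -0.3};
        // Without a bias (biasWeight = 0), an all-zero input is stuck at tanh(0) = 0.
        System.out.println(activate(zeros, w, 0.0)); // → 0.0
        // With a bias, the neuron can still fire on all-zero input.
        System.out.println(activate(zeros, w, 1.0)); // → tanh(1) ≈ 0.7616
    }
}
```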
Both [1] and [2] suggested using a three-layer neural network, which includes one hidden layer. In [1], the authors state that it is important to split the input into small pieces instead of using complete connections to the entire input (Figure 3). In [2], however, the experiments show that the detailed net architecture is not crucial. I implemented a basic backpropagation neural network based on the code from [4]. The net has 400 input neurons mapping the 400 pixels in the view window, 20 hidden neurons, and 1 output neuron, with complete connections between layers. I also tried the architecture mentioned in [1], but there was no apparent evidence that its results are better.
Figure 4: The basic algorithm used in [1]. The left column is the input image pyramid, scaled by a factor of 1.2. The middle column shows the input windows processed by brightness gradient correction and histogram equalization. The right column is the neural network architecture, whose input neurons are grouped by different input regions.
Training and Results
Training images are collected from Picons database
(http://www.cs.indiana.edu/picons/ftp/index.html). This database consists of small (48 x 48 pixel)
images with 16 gray levels or colors, collected at Usenix conferences. Some other images are from
the Yale Face Database (http://cvc.yale.edu/projects/yalefaces/yalefaces.html). In order to balance
the face and non-face training, I created 40 face images and 40 non-face images with size of 20 x 20
pixels. I also implemented functions that mirror flip the image, translate the image by one pixel up, down, left, and right, and scale the image by up to ±10% (as suggested by [1]). These functions extend each image into 9 additional similar images (Figure 6). Therefore, the training data set contains 800 images in total. The net was expected to output 1.0 for a face image and -1.0 for a non-face image. Trained for 1000 epochs (each epoch is one cycle through the training data set), the net gave the following results:
[Table: correct percentage versus training data size, showing the face detection rate and the non-face detection rate.]
Figure 5. The results above were created by feeding the net the same data on which it was trained.
Figure 6. From left to right: original image, mirror flipped, translated up, translated down, translated left, translated right, scaled 0.9, scaled 0.95, scaled 1.05, and scaled 1.10.
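Two of these augmentation functions (mirror flip and one-pixel translation) can be sketched as follows; this is an illustration, not the project's actual implementation:

```java
public class Augment {
    // Mirror the image left-right.
    public static int[][] mirror(int[][] img) {
        int h = img.length, w = img[0].length;
        int[][] out = new int[h][w];
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
                out[y][x] = img[y][w - 1 - x];
        return out;
    }

    // Translate by (dx, dy); pixels shifted in from outside are 0.
    public static int[][] translate(int[][] img, int dx, int dy) {
        int h = img.length, w = img[0].length;
        int[][] out = new int[h][w];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int sx = x - dx, sy = y - dy;
                if (sx >= 0 && sx < w && sy >= 0 && sy < h)
                    out[y][x] = img[sy][sx];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[][] img = {{1, 2}, {3, 4}};
        System.out.println(java.util.Arrays.deepToString(mirror(img)));
        // → [[2, 1], [4, 3]]
        System.out.println(java.util.Arrays.deepToString(translate(img, 1, 0)));
        // → [[0, 1], [0, 3]]
    }
}
```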
Following are the detection test results:
Figure 7: 48 x 48 pixel images. The first two faces are in the training data set, while the last three are not. All of the faces are detected.
Figure 8: 100 x 76 pixel image. The image is not in the training data set. The face is detected, but with many false detections.
Figure 9: 120 x 86 pixel image. The image is not in the training data set. Neither of the two faces is detected.
Figure 10: 150 x 65 pixel image. The image is not in the training data set. Only one of the 7 faces is detected, with many false detections.
Conclusion and Improvement
For small images with a single face, the results are very good, regardless of whether the face is in the training data set. For larger images with single or multiple faces, few faces are detected, and there are many false detections. The order of the training data may affect the training result: I tried passing the training data in different orders, such as passing all face training images followed by all non-face training images, versus alternating face and non-face training images one by one. The latter gives better training results. In order to improve the results, the image processing functions may need to be improved. Facial feature labeling and alignment, as used in [1], would reduce the amount of variation between images of faces. Enlarging the training set will definitely improve the results: the more the net is trained, the more accurate the detection will be. For the non-face training data, there is a good procedure in [1]:
1. Create an initial set of non-face images by generating 1000 images with random pixel intensities.
2. Apply the preprocessing steps to each of these images.
3. Train a neural network to produce an output of 1 for the face examples, and -1 for the non-face examples. The training algorithm is standard error backpropagation. On the first iteration of this loop, the network's weights are initially random. After the first iteration, we use the weights computed by training in the previous iteration as the starting point for training.
4. Run the system on an image of scenery which contains no faces. Collect subimages in which the network incorrectly identifies a face (an output activation > 0).
5. Select up to 250 of these subimages at random, apply the preprocessing steps, and add them into the training set as negative examples. Go to step 2.
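The control flow of this bootstrapping loop can be sketched as follows. The network here is a stand-in stub (a brightness threshold) just to make the sketch runnable; in the real system it would be the trained net applied to real scenery windows:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class Bootstrap {
    // Placeholder for the trained network's output in (-1, 1).
    static double networkOutput(double[] window) {
        double mean = 0;
        for (double p : window) mean += p;
        mean /= window.length;
        return mean > 0.5 ? 0.8 : -0.8;   // stub decision, not the real net
    }

    // Steps 4-5: collect up to `limit` windows the network wrongly calls faces.
    public static List<double[]> collectFalsePositives(List<double[]> sceneryWindows, int limit) {
        List<double[]> negatives = new ArrayList<>();
        for (double[] w : sceneryWindows) {
            if (networkOutput(w) > 0) {       // incorrectly identified as a face
                negatives.add(w);
                if (negatives.size() >= limit) break;
            }
        }
        return negatives;
    }

    public static void main(String[] args) {
        // Step 1: non-face examples with random pixel intensities.
        Random rnd = new Random(42);
        List<double[]> scenery = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            double[] w = new double[400];     // a 20 x 20 window
            for (int j = 0; j < 400; j++) w[j] = rnd.nextDouble();
            scenery.add(w);
        }
        List<double[]> negs = collectFalsePositives(scenery, 250);
        System.out.println("collected " + negs.size() + " negatives");
    }
}
```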
Merging overlapping detections and arbitration would generate a single bounding box around each face, and the computation speed can be improved with a better search algorithm.
[1] Rowley, H., Baluja, S. and Kanade, T., Neural Network-Based Face Detection. IEEE
Transactions on Pattern Analysis and Machine Intelligence, Vol. 20, No. 1, January, 1998, pp. 23-38.
[2] Sung, Kah-Kay and Poggio, Tomaso, Example-based learning for view-based human face
detection. A.I. Memo 1521, CBCL Paper 112, MIT, December 1994.
[3] Duda, R.O., Hart, P.E. and Stork, D.G. Pattern Classification. Wiley, New York, 2001.
[4] Russell, S.J. and Norvig, P. Artificial intelligence: A modern approach. Prentice Hall/Pearson
Education, Upper Saddle River, N.J., 2003.
[5] Java Image Processing API. Version 2. http://www.ia.hiof.no/~por/imageprocAPI/version2/
[6] Sedgewick, Robert and Wayne, Kevin, Introduction to Programming: An Interdisciplinary
Approach. Addison Wesley, 2007. http://www.cs.princeton.edu/introcs/95linear/