Vision-Based Overhead View Person Recognition

Ira Cohen, Ashutosh Garg, Thomas S. Huang

Beckman Institute, University of Illinois at Urbana-Champaign

Email: {iracohen, ashutosh, huang}@ifp.uiuc.edu

Abstract

Person recognition is a fundamental problem in any computer vision system. The problem is relatively easy when a frontal view is available, but becomes much harder in its absence. We present a framework that addresses this problem using the top view of the person, in the specific scenario of a "Smart Conference Room". Although little information is available in the top view, we show that by using a decision tree classifier (DTC) and Bayesian networks the outputs of the various sensors can be combined to solve this problem. The results show that we can perform pose-independent person recognition with 96% accuracy for a group of 12 people; for the pose-dependent case, we achieve 100% accuracy. Finally, we describe a framework for achieving this in real time.

1.

Introduction
Advances in computer vision have opened new frontiers for human-computer interaction (HCI) applications. As the name suggests, HCI systems are centered on human agents interacting with the environment. The central requirement faced by these systems is the tracking and recognition of humans and their actions. In the past, most computer vision research has focused on developing the basic algorithms required for these tasks, studying tracking and recognition in depth. Some of this basic research has been devoted to face recognition (as a way of doing person identification) [7], gesture recognition [15] and tracking the human body (for surveillance).

However, the basic nature of the problem tends to be application specific. Recently a large amount of research has gone into developing application specific systems

(e.g. “Easy Living” at Microsoft, “Smart Kiosk” at CRL).

The work presented in this paper is a step in this direction.

We have focused on developing a person recognition system with the particular application of a conference room in mind. The conference-room setup automatically puts certain constraints on the environment, but at the same time brings new problems and new challenges. For example, the hard vision problem of lighting conditions can be easily controlled. On the other hand, access to a person's frontal view at all times is very hard, which poses a difficult problem since most face recognition algorithms use the person's frontal or near-frontal view. To the best of the authors' knowledge, this is the first work that tries to solve the task of person recognition from the top view.

However, the authors would like to stress that the top view can be used to solve the problem of person recognition only for a small group of people. The advantages of using the top view are the absence of occlusion and the accessibility of the view from overhead-mounted cameras. The problem, however, is the lack of information present in the top view: it is hard even for a human to recognize people just by looking at their heads. We argue that the task is possible if the number of different individuals is small (which is acceptable in this application, as at any given time the number of people in a conference room is limited by its capacity). An individual is registered when he enters the room and can then be identified at any given time. We provide a solution to this problem for a small group of individuals, and the results presented at the end show the feasibility of such a system. We have tested our system on a group of 12 individuals and have shown that, with a judicious selection of features and classification scheme, perfect recognition is achievable.

We start by describing the particular conference-room application that we are targeting. The third section describes the various features that have been selected; we describe each feature in detail, giving the reasons why we selected it for this application. The fourth section describes the decision tree classifier and the Bayesian network in the context of this application. Results and future directions conclude the paper.

2.

Problem – Smart Conference Room

The particular scenario under consideration is the "Smart Conference Room". We want it to be capable of identifying and tracking (tracking is not yet implemented and will be done in the future) the various people present in it.

These requirements are essential if we want to do automatic gisting of the meeting. The multiple-person scenario makes the problem a tough one; at the same time, occlusion makes frontal face recognition almost impossible. We propose a solution to this problem using cameras mounted on the ceiling. We consider a setup of cameras covering the entire room, all looking downwards. As a person enters the room he is labeled and his features are extracted for later recognition. This information is stored as long as the person is in the room, and we assume there will not be significant changes in the person's appearance during that time. This allows us to make use of hair and clothing color and hair texture. We argue that a limited set of people and a highly constrained environment make it possible to use vision techniques for this task. As the results show, no single feature is capable of doing the task on its own; however, information from all the features can be combined effectively to complete the task successfully.

3.

The Features

Deciding which features should be used for recognition is the first fundamental step in building a classification system. In this application, the features were chosen based on several criteria. The ability to separate the classes well is of course the most important one, but the computational expense of a feature is also a major factor in a real-time application such as ours.

In our study we have chosen simple color and texture features that can be computed in real time. The features selected are hair color, blob color (body and clothing), and hair texture features (coarseness, directionality, contrast and local binary pattern). However, a step that comes even before feature extraction is image segmentation. In the following sections we start by describing the techniques used for image segmentation, followed by the experiments done for each feature.

Figure 1. Some typical frames of different people (segmented from the background).

3.1.

Experimental Setup

We have collected a database of 12 individuals. For each individual we have 60 frames at multiple poses and under different lighting conditions. The pose variations include people sitting, standing, and different rotations and translations. The images are of low resolution. Figure 1 shows typical (segmented) frames for several people; the large variability in pose is clearly evident.

3.2.

Segmentation

The first step in feature extraction is segmentation. This task is normally hard, but the static nature of the environment (valid in this problem) allows us to make use of background subtraction. However, because of non-uniform lighting, shadowing causes a problem. To solve this, we make use of the fact that a shadow can be interpreted as a change in the intensity of the light.

Since in the YCrCb space intensity is represented only by the luminance component, we converted the image from the RGB space to the YCrCb color space. We used the CrCb components only, without the luminance component, to obtain the difference image. This method gave a very good segmentation of the person without the shadow, as can be seen in Figure 2(a-c).

Figure 2. (a) The original top view of a person. (b) The segmented image using the grayscale difference image. (c) The segmented image using the CrCb components of the YCrCb color space.
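The shadow-suppressing background subtraction described above can be sketched as follows. This is a minimal NumPy sketch; the threshold value and the BT.601 conversion coefficients are assumptions, not values taken from the paper:

```python
import numpy as np

def rgb_to_crcb(img):
    """Convert an RGB image (H x W x 3) to its Cr and Cb chrominance
    channels (BT.601 coefficients), discarding the luminance Y."""
    img = img.astype(np.float32)
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    return np.stack([cr, cb], axis=-1)

def segment_person(frame, background, thresh=10.0):
    """Foreground mask from background subtraction in the CrCb plane.
    Shadows (pure intensity changes) barely move Cr/Cb, so they are
    largely suppressed; the threshold here is an illustrative value."""
    diff = np.linalg.norm(rgb_to_crcb(frame) - rgb_to_crcb(background), axis=-1)
    return diff > thresh
```

Because a shadow scales R, G and B roughly uniformly, it leaves the chrominance of a gray surface unchanged, so only genuine color changes survive the threshold.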

3.3.

Head Segmentation and Hair Color Modeling

Hair color is a basic feature that helps in discriminating between individuals. Although intuitively limited to major categories (such as "black", "brown", "blonde" and "white"), we found that, for a small group of individuals, it can also be used to discriminate between people with similar hair color. This can be attributed to differences in luster, which give different color values for their hair.

The segmentation of the head from the rest of the segmented blob is done using generic models of the major categories of hair color. We assume that hair color is a 3-D normally distributed random variable in the RGB color space [1],[2] with a given mean vector and covariance matrix.
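The 3-D Gaussian color model can be sketched as follows. The class and method names are hypothetical; head segmentation would then label a pixel "hair" when its Mahalanobis distance to the closest category model falls below a threshold:

```python
import numpy as np

class GaussianColorModel:
    """3-D Gaussian model of a hair-color category in RGB space.
    Illustrative sketch; the paper only states the mean/covariance
    assumption, not this interface."""

    def fit(self, pixels):
        # pixels: (N, 3) array of RGB samples from labeled hair regions
        self.mean = pixels.mean(axis=0)
        self.cov = np.cov(pixels, rowvar=False)
        self.inv_cov = np.linalg.inv(self.cov)
        return self

    def mahalanobis(self, pixels):
        """Squared Mahalanobis distance of each (N, 3) pixel to the model."""
        d = pixels - self.mean
        return np.einsum('ij,jk,ik->i', d, self.inv_cov, d)
```

One such model per category ("black", "brown", "blonde", "white") would be fit from examples of several people, as described above.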

To make sure that our Gaussian assumption is approximately correct, we tested the normality of the data using the normal probability plot [12]. This test compares the quantiles of a data vector sampled from an unknown distribution against the quantiles of the normal distribution; if the underlying distribution is Gaussian, the plot shows a straight line.

Figure 3 gives the plot for the three marginal distributions of the brown hair color model. As can be seen, most of the data points lie on the straight line. The same statistical test for the other color models showed similar results, which justifies our normality assumption. Examples from several people were used to learn the parameters of each typical hair color.
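The normal probability plot amounts to comparing sorted, standardized data against standard normal quantiles. A sketch follows; the plotting positions and the correlation-based straightness score are common conventions, not taken from the paper:

```python
import numpy as np
from statistics import NormalDist

def normal_probability_points(data):
    """Points of a normal probability plot: empirical quantiles of the
    standardized data against standard normal quantiles. A straight
    line (correlation near 1) supports the Gaussian assumption."""
    x = np.sort((data - data.mean()) / data.std())
    n = len(x)
    probs = (np.arange(1, n + 1) - 0.5) / n        # plotting positions
    q = np.array([NormalDist().inv_cdf(p) for p in probs])
    return q, x

def normality_score(data):
    """Correlation between the two axes of the plot."""
    q, x = normal_probability_points(np.asarray(data, dtype=float))
    return np.corrcoef(q, x)[0, 1]
```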

After the initial segmentation of the head, a person-specific hair color model is learned. Figure 4 shows a plot of the means and variances of the red component (in RGB space) of the hair color for 8 individuals, learned from each of the 60 images of the person.

Figure 3. Normal probability plot for the three marginal distributions of brown hair color.

Figure 4. (a) Plot of the variance of the R component of each person's head. (b) Plot of the mean of the R component of each person's head. The labels (A, B, ...) identify the individuals.

The mean for each person is plotted as one point in the graph. It can be seen that there are significant differences among individuals (although not for all of them). Another apparent property is the good stability of the mean hair color for a specific individual across the different images and head poses, although the variance suffers larger changes. Figure 5 shows examples of the segmented heads of some of the individuals.

3.4.

Blob Color Modeling

The blob color is modeled in a way similar to the hair color: the mean and covariance of the blob color are computed from the segmented blob image. In this case the variance is a very important component, since the blob color deviates more from the normality assumption than the hair color does.

Figure 5. (a),(c) Segmented images of persons with black hair. (b) A person with brown hair. (d) A person with blonde hair.

Figure 6 shows the learned mean and variance of the red component of the blob color. We found that the model changes a lot under different poses (because of the wide variability in the color and its luster); however, for a specific pose, the feature is found to be quite stable.

3.5.

Texture Features

The texture of an individual's hair is something that can uniquely distinguish him from others (in most cases). Different hairstyles lead to a unique description of a person (curly hair, straight hair, short/long hair, etc.). Scores of different texture features have been proposed in the past [3,4,5,6]. However, the low-resolution imagery (a consequence of the wide-angle cameras selected for this application) makes reliable texture feature extraction extremely tricky. Some texture features [6] are good at capturing the information but are computationally very expensive. We found that simple features (like directionality) are adequate for this problem, since we are not relying on a single texture feature but on several of them. The texture features finally selected, based on their complexity (because of the real-time nature of the problem) and their performance, are directionality, coarseness, contrast and local binary pattern (LBP).

Figure 6. Mean and variance of the red component of the blob color for 8 different people (different poses).

These are commonly used features and we will not describe them in detail; we refer the reader to [3][14] for further reference. These features can be computed in real time and can be made invariant to transformations such as rotation and translation (directionality is rotation invariant if we only look at the variance of the histogram and not the actual values).

Figure 7(a-d) shows the computed values for these features for different individuals. An important thing to observe is that each feature seems to be incapable of discriminating between all people by itself but we show that the information from all of these can be combined to achieve good results.
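As one concrete example of these texture features, the basic 3x3 LBP of [14] can be sketched as follows (the histogram normalization is an assumption of this sketch):

```python
import numpy as np

def lbp_histogram(gray, bins=256):
    """Basic 3x3 local binary pattern (Ojala et al. [14]): each interior
    pixel is coded by thresholding its 8 neighbors against the center
    value, giving an 8-bit code; the feature is the code histogram."""
    g = gray.astype(np.int32)
    c = g[1:-1, 1:-1]                      # interior (center) pixels
    # 8 neighbors in a fixed clockwise order, one bit each
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c)
    for bit, (dy, dx) in enumerate(offsets):
        nb = g[1 + dy:g.shape[0] - 1 + dy, 1 + dx:g.shape[1] - 1 + dx]
        code |= (nb >= c).astype(np.int32) << bit
    hist, _ = np.histogram(code, bins=bins, range=(0, bins))
    return hist / hist.sum()               # normalized LBP histogram
```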

4.

Classification

Selection of a classification strategy should take into consideration the capacity, computational complexity and adaptivity of the classifier with respect to the problem at hand. The two classifiers selected for this application are the "Decision Tree Classifier" (DTC) and the "Bayesian Network" (BN). These were chosen because of their high capability of modeling the data combined with low computational complexity. Decision trees can be adaptively grown: if a person enters or leaves the scene, the DTC can be easily modified. Bayesian networks, in turn, present a strong probabilistic model of a person based on the various features. Another advantage offered by both is their ability to classify based on incomplete data. This gives the algorithm the flexibility of computing features "on demand", i.e. a feature is computed only if classification cannot be achieved (with high confidence) using the existing set of features.

Figure 7. Texture features computed for the 8 different people: (a) directionality, (b) contrast, (c) coarseness, (d) LBP.

The classification using the DTC makes use of a Quadratic Discriminant Function (QDF) at each node. The QDF is Bayes optimal under a normality assumption on the distribution of the features. For the QDF, the mean and covariance matrix of each class and feature set are computed. Consider class "i" with mean mu_i and covariance Sigma_i; given a new feature vector x, its discriminant score for class "i" can be written as

    g_i(x) = x^T W_i x + w_i^T x + w_i0                               (1)

where

    W_i = -(1/2) Sigma_i^-1,
    w_i = Sigma_i^-1 mu_i,
    w_i0 = -(1/2) mu_i^T Sigma_i^-1 mu_i - (1/2) ln |Sigma_i|.

x is classified as class "i" if g_i(x) > g_j(x) for all j != i. For classification using a Bayesian network, we use the network shown in Figure 8, which takes all the features from a single set (e.g. mean and variance of the hair color) and uses them to do the classification. The probability of the feature vector belonging to each class is computed, and the label of the class with the maximum probability is assigned. The difference between this and the QDF is that it does not make the normality assumption.

Figure 8. Bayesian network (BN) for recognition based on a single feature set: the "Person" node is the parent of the observed feature nodes ("observed feature 1", "observed feature 2").
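Equation (1) can be implemented directly; the following is a sketch, with hypothetical class and method names, assuming the features of each class are stacked into vectors:

```python
import numpy as np

class QDFClassifier:
    """Quadratic discriminant g_i(x) = x^T W_i x + w_i^T x + w_i0 of
    eq. (1), with W_i = -1/2 Sigma_i^-1, w_i = Sigma_i^-1 mu_i and
    w_i0 = -1/2 mu_i^T Sigma_i^-1 mu_i - 1/2 ln|Sigma_i|."""

    def fit(self, X, y):
        self.params = {}
        for c in np.unique(y):
            Xc = X[y == c]
            mu = Xc.mean(axis=0)
            cov = np.cov(Xc, rowvar=False)
            inv = np.linalg.inv(cov)
            w0 = -0.5 * mu @ inv @ mu - 0.5 * np.log(np.linalg.det(cov))
            self.params[c] = (mu, inv, w0)
        return self

    def scores(self, x):
        """g_i(x) for every class i."""
        return {c: -0.5 * x @ inv @ x + (inv @ mu) @ x + w0
                for c, (mu, inv, w0) in self.params.items()}

    def predict(self, x):
        s = self.scores(x)
        return max(s, key=s.get)   # class i with g_i(x) > g_j(x), j != i
```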

4.1.

Decision Tree Construction

We have adopted decision trees [8,9,10,11] for the purpose of classification. DTCs have certain nice properties that make them extremely attractive for this task. Their salient points can be outlined as:

1. Features can be computed on demand, i.e. only if successful classification (with high confidence) is not achieved with the existing set of features do we move on to compute the next set.

2. They are easy to implement.

3. They are fast: inference time is low.

Our decision tree classifier can be expressed as a step-by-step classification algorithm. The steps can be outlined as:

1. Do the image segmentation and check whether more than one person is present; this is achieved with the K-means algorithm. If there is more than one person, segment the image further to obtain a single-person blob.

2. Extract the hair color feature for that person.

3. Use the QDF to calculate the distance of this blob from each of the existing classes. If the confidence in the classification is more than 95% (obtained by comparing g_i(x) against the class model (mu_i, Sigma_i); this follows from the Gaussian assumption), and if the relative score of each of the other classes, g_j(x) / g_i(x), is smaller than a threshold tau for all j != i, then give the label of that class, i, to the image. tau gives a measure of between-class confidence and identifies the classes that might be confused with one another.

4. If the confidence of the classification is high and discriminates well against the other classes, we are done. Otherwise, the next set of features is extracted and step 3 is repeated with the new feature set.

The features are computed in increasing order of complexity (this minimizes the computation required). In our application, the features are computed in the following order: hair color, blob color, texture features.

This technique updates the tree automatically when the classification is done, since not only the confidence level of a class is used, but also a measure of the distance to the other classes is used.
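The on-demand cascade of steps 1-4 above can be sketched as follows; the extractor/classifier interfaces and the score-ratio stopping test are illustrative assumptions:

```python
import numpy as np

def classify_on_demand(blob, extractors, classifiers, tau=0.9):
    """Evaluate feature sets in order of cost (hair color, blob color,
    texture); stop as soon as the best class dominates every other
    class by the ratio threshold tau, mirroring g_j(x)/g_i(x) < tau.
    `extractors` map a blob to a feature vector; `classifiers` map the
    accumulated feature vector to a {label: score} dict."""
    features = []
    ranked = []
    for extract, classify in zip(extractors, classifiers):
        features.append(extract(blob))                 # compute on demand
        scores = classify(np.concatenate(features))
        ranked = sorted(scores, key=scores.get, reverse=True)
        best, runner_up = ranked[0], ranked[1]
        if scores[runner_up] / scores[best] < tau:
            return best       # confident: skip the costlier features
    return ranked[0]          # fall back to the full feature set
```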

4.2.

Bayesian Network

The second technique used is Bayesian networks. Bayesian networks [13] are a class of probabilistic models capable of encoding the probabilistic dependence among different variables. Recently people have begun to use them in all kinds of vision applications and the results are quite compelling. Hidden Markov Models, which have long been popular in the speech community, are a special class of Bayesian networks. We tried the BN approach to see whether it can effectively model this scenario. The BN modeled for this task is shown in Figure 9.

This BN is a generative model that expresses the dependence between the different features. We know that the color features are more closely related to each other than to the texture features; at a higher level, the texture features are related to the color features through the "Person" node. This means that, given the person, the output of the texture node is independent of the output of the color node. The network is learned using the junction tree algorithm [13]. Inference in this network amounts to computing P(person = i | color, texture features), the probability that the image comes from person "i" given the outputs of the color and texture nodes. Given the observations, we can compute the probability of the "hair color" and the "blob color" (corresponding to a particular person); these probabilities can in turn be used to infer the probability of the image corresponding to a particular person.

For this purpose, we use the likelihood ratio test: if "i" is the label assigned to a particular blob, then

    i = argmax_i P(person = i | color, texture features)

The decision made in this way is optimal in the Bayes sense. The network shown in Figure 9 was used for classification. Three-fold cross-validation was done to ensure that the learned parameters do not overfit the data.
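The MAP decision above can be sketched under a simplifying assumption: here we use illustrative Gaussian class-conditional densities with the color and texture features conditionally independent given the person, whereas the paper learns the full network with the junction tree algorithm:

```python
import numpy as np

def map_person(color_x, texture_x, color_models, texture_models, priors):
    """argmax_i P(person = i | color, texture) via log densities, with
    color and texture conditionally independent given the person.
    color_models / texture_models: {label: (mean, variance)} per class
    (diagonal Gaussians, an assumption of this sketch)."""
    def log_gauss(x, mu, var):
        return -0.5 * np.sum((x - mu) ** 2 / var + np.log(2 * np.pi * var))
    log_post = {
        i: np.log(priors[i])
           + log_gauss(color_x, *color_models[i])
           + log_gauss(texture_x, *texture_models[i])
        for i in priors
    }
    return max(log_post, key=log_post.get)
```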

5.

Experimentation

We used a data set consisting of 12 different individuals, with 60 samples of each under different poses. The experiments were done in two stages. First we tested the individual features using the QDF and the Bayesian network: for classification with the quadratic discriminant function, eq. (1) was used; for testing with the Bayesian network, the network shown in Figure 8 was used. We then combined all the features and used the DTC and the BN (shown in Figure 9) to do the classification. The results of classification based on individual features are shown in Table 1.

Figure 9. Bayesian network for person identification based on input from the different features: the "Person" node is connected to the "Color Features" (hair color, blob color) and "Texture Features" (contrast, coarseness, LBP) nodes. Solid nodes are the hidden variables; dotted nodes represent the sets of observed variables.

As can be seen, each feature individually gave poor results and none of them seems able to do the classification by itself. Contrary to intuition, the texture features gave worse results than the color-based features; this can be attributed to the low resolution of the images. However, we found that these features can be used together to give much better results.

Table 1. Results for classification based on individual features.

Features                       QDF      Bayesian Network
Hair Color (Mean only)         93.32%   -
Hair Color (Mean+Variance)     96.66%   96.83%
Blob Color (Mean+Variance)     96.25%   97.17%
Contrast                       75.29%   79.38%
Directionality                 63.71%   75.42%
Local Binary Pattern           80.59%   81.25%
Coarseness                     67.18%   68.75%

5.1.

Classification Results using DTC and BN

The DTC and BN, as discussed, were used for the classification. The results obtained are presented in Table 2.

We first did the experiment by combining all the color features. As mentioned, both the BN and the DTC are capable of classifying based on incomplete data. The results obtained in this way are much better than those given by each feature individually.

Table 2. Classification results for features combined in two sets (color, texture).

Features                 DTC      BN
Color (Hair + Blob)      99.44%   99.57%
Texture Features (all)   99.58%   99.69%
All Features             100%     100%

On detailed analysis, we observed that some of the features were complementary to each other. Two people may have similar hair color but different clothing color; similarly, two people may wear clothing of similar color but have different hair color. Similar behavior was observed for the texture features: although individually they gave errors ranging from 20% to 37%, together they performed extremely well. We also noticed that the BN gave better results than the DTC in all cases. This can be attributed to the normality assumption made for the QDF, which was used as the classifier at each node of the DTC.

Finally, we used all the features together, and perfect classification was obtained with both classifiers. These results are very good, and we thought it worthwhile to test the system to its limits. This time we considered a database of 12 individuals under very large pose changes and large lighting differences; Table 3 gives the results for this case. We observed a much-degraded performance: the results for individual features were extremely bad, but combining them using the DTC gave a classification accuracy of 95.9%. These results suggest the need for adaptation of a person's features during the session, which can be accomplished easily by combining the recognition system with a tracking framework.

Table 3. Classification rates for recognition across different poses using the decision tree classifier.

Feature                 Classification Rate (%)
Hair Color              89.14
Blob Color              86.22
Hair + Blob Color       93.5
Contrast                69.14
Coarseness              59.9
Directionality          57.05
Local Binary Pattern    59.01
All Texture             85.25
All Features            95.9

6.

Conclusions and Future Work

This project addresses the application of a "Smart Conference Room". We have provided a framework for real-time person identification by extracting features from images that capture the overhead view of each person. This is probably the first work in which features from the overhead view have been used for recognition.

We have shown the feasibility of such a system and the results show that the hair color and texture features are rich enough to do classification reliably for a small set of people. We have tested our system for a group of 12 different individuals with 60 frames of each. The results show that perfect classification can be obtained.

Two classifiers are proposed for this application. The Bayesian network takes a probabilistic approach to the problem, while the decision tree takes a hierarchical approach; both are shown to achieve perfect classification. The DTC has some advantages, which are discussed in Section 4.1. However, the Bayesian network approach can be easily extended to incorporate the concept of time, which opens the door to studying the motion patterns of each individual; HMMs can be combined with the Bayesian networks. Depending on the problem, one of the two can therefore be selected: for a pure recognition task, the DTC is certainly better (in terms of complexity), but for higher-level understanding of the problem, the BN will score over the DTC.

Some problems remain to be addressed, such as the fusion of tracking with the recognition techniques. Another issue is feature adaptation, which poses the problem of repeatedly re-learning the Bayesian network. This can be handled by a mechanism in which the parameters are updated in an online fashion (using a forgetting factor). Since in the case of the DTC we look at each class individually, we need to adapt only a class that has undergone a large change; this implies that we need to update the features only for a person who has been moving a lot (i.e. whose pose changes a lot). We will address these issues in future work.

References

[1] J. Yang, W. Lu, A. Waibel, "Skin-Color Modeling and Adaptation", Proc. of ACCV 98, vol. II, pp. 687-694, 1998.
[2] M. J. Jones and J. M. Rehg, "Statistical Color Models with Application to Skin Detection", Tech. report, CRL, 1998.
[3] H. Tamura, S. Mori, and T. Yamawaki, "Textural Features Corresponding to Visual Perception", IEEE Trans. on Systems, Man and Cybernetics, 1978.
[4] P. P. Ohanian and R. C. Dubes, "Performance Evaluation for Four Classes of Textural Features", Pattern Recognition Letters, 1992.
[5] R. M. Haralick, K. Shanmugam, and I. Dinstein, "Textural Features for Image Classification", IEEE Trans. on Systems, Man and Cybernetics, 1973.
[6] D. Dunn and W. E. Higgins, "Optimal Gabor Filters for Texture Segmentation", IEEE Trans. on Image Processing, 1995.
[7] A. J. Colmenarez, "Facial Analysis from Continuous Video with Application to Human-Computer Interface", Ph.D. thesis, UIUC, 1999.
[8] S. R. Safavian and D. Landgrebe, "A Survey of Decision Tree Classifier Methodology", IEEE Trans. on Systems, Man and Cybernetics, 1991.
[9] P. Argentiero, R. Chin and P. Beaudet, "An Automated Approach to the Design of Decision Trees", IEEE Trans. PAMI, 1982.
[10] W. S. Meisel and D. A. Michalopoulos, "A Partitioning Algorithm with Application in Pattern Classification and the Optimization of Decision Trees", IEEE Trans. Computers, 1973.
[11] M. W. Kurzynski, "Decision Rules for a Hierarchical Classifier", Pattern Recognition Letters, 1983.
[12] J. D. Johnson, "Applied Multivariate Data Analysis", Springer Verlag, New York, 1991.
[13] J. Pearl, "Probabilistic Reasoning in Intelligent Systems", Morgan Kaufmann, 1988.
[14] T. Ojala, M. Pietikainen and D. Harwood, "A Comparative Study of Texture Measures with Classification Based on Feature Distributions", Pattern Recognition 29, 51-59, 1996.
[15] V. I. Pavlovic, R. Sharma, and T. S. Huang, "Visual Interpretation of Hand Gestures for Human-Computer Interaction: A Review", IEEE Trans. on PAMI, July 1997.
