
Computer Science and Artificial Intelligence Laboratory
Technical Report
MIT-CSAIL-TR-2007-022
April 3, 2007
A Few Days of A Robot’s Life in the
Human’s World: Toward Incremental
Individual Recognition
Lijin Aryananda
Massachusetts Institute of Technology, Cambridge, MA 02139 USA — www.csail.mit.edu
A Few Days of A Robot’s Life in the Human’s
World: Toward Incremental Individual
Recognition
by
Lijin Aryananda
Submitted to the Department of Electrical Engineering and Computer
Science
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
February 2007
© Massachusetts Institute of Technology 2007. All rights reserved.
Author . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Department of Electrical Engineering and Computer Science
February 22, 2007
Certified by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Rodney Brooks
Professor
Thesis Supervisor
Accepted by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Arthur C. Smith
Chairman, Department Committee on Graduate Students
A Few Days of A Robot’s Life in the Human’s World:
Toward Incremental Individual Recognition
by
Lijin Aryananda
Submitted to the Department of Electrical Engineering and Computer Science
on February 22, 2007, in partial fulfillment of the
requirements for the degree of
Doctor of Philosophy
Abstract
This thesis presents an integrated framework and implementation for Mertz, an expressive robotic creature for exploring the task of face recognition through natural
interaction in an incremental and unsupervised fashion. The goal of this thesis is to
advance toward a framework which would allow robots to incrementally “get to know”
a set of familiar individuals in a natural and extendable way. This thesis is motivated
by the increasingly popular goal of integrating robots in the home. In order to be
effective in human-centric tasks, the robots must be able to not only recognize each
family member, but also to learn about the roles of various people in the household.
In this thesis, we focus on two particular limitations of the current technology.
Firstly, most face recognition research concentrates on the supervised classification
problem. Currently, one of the biggest problems in face recognition is how to generalize the system to recognize new test data that vary from the training data.
Thus, until this problem is solved completely, the existing supervised approaches may
require multiple manual introduction and labelling sessions to include training data
with enough variations. Secondly, there is typically a large gap between research
prototypes and commercial products, largely due to a lack of robustness and scalability
to different environmental settings.
In this thesis, we propose an unsupervised approach which would allow for a more
adaptive system which can incrementally update the training set with more recent
data or new individuals over time. Moreover, it gives the robots a more natural
social recognition mechanism to learn not only to recognize each person’s appearance,
but also to remember some relevant contextual information that the robot observed
during previous interaction sessions. Therefore, this thesis focuses on integrating
an unsupervised and incremental face recognition system within a physical robot
which interfaces directly with humans through natural social interaction. The robot
autonomously detects, tracks, and segments face images during these interactions and
automatically generates a training set for its face recognition system. Moreover, in
order to motivate robust solutions and address scalability issues, we chose to put the
robot, Mertz, in unstructured public environments to interact with naive passersby,
instead of with only the researchers within the laboratory environment.
While an unsupervised and incremental face recognition system is a crucial element
toward our target goal, it is only a part of the story. A face recognition system
typically receives either pre-recorded face images or a streaming video from a static
camera. As illustrated by an ACLU review of a commercial face recognition installation,
a security application which interfaces with the latter is already very challenging. In
this case, our target goal is a robot that can recognize people in a home setting. The
interface between robots and humans is even more dynamic. Both the robots and the
humans move around.
We present the robot implementation and its unsupervised incremental face recognition framework. We describe an algorithm for clustering local features extracted
from a large set of automatically generated face data. We demonstrate the robot’s
capabilities and limitations in a series of experiments in a public lobby. In a final experiment, the robot interacted with a few hundred individuals over an eight-day period
and generated a training set of over a hundred thousand face images. We evaluate the
clustering algorithm performance across a range of parameters on this automatically
generated training data and also the Honda-UCSD video face database. Lastly, we
present some recognition results using the self-labelled clusters.
Thesis Supervisor: Rodney Brooks
Title: Professor
Acknowledgments
I would like to express my gratitude to my advisor, Rodney Brooks, for his generous
support in many different aspects. I was very fortunate to have the opportunity to
work with him. His unconventional and almost rebellious thinking has really taught
me to formulate and approach my research differently. He has also been a tremendous
emotional support for me from the very beginning. Thank you, Rod, for waiting
patiently, supporting me financially during my leave of absence, and believing that I
would be able to come back to resume my study.
I would also like to thank my committee members, Cynthia Breazeal and Michael
Collins. I would like to thank Cynthia for her guidance and friendship, throughout
my Ph.D. career, as a lab-mate, a mentor, and eventually a committee member in
my thesis. I would like to thank Michael for all of his help and advice. Thank you
for taking the time to help me formulate my algorithms and improve the thesis.
The many years of the Ph.D. program were challenging, but almost nothing
compared to the final stretch of the last few months. I would not have been able to
survive the last few months without Eduardo, Iuliu, and Juan. Thank you, Eduardo,
for helping me with my thesis draft. I would not have finished it in time without
your help. Thank you, Iuliu, for going through the algorithms with me again and
again, and for helping me during my last push. Thank you, Juan, for proof-reading
my thesis, helping me with my presentation slides, and even ironing my shirt before
the defense.
Many people have kindly helped me during the project and thesis writing. I
would like to thank Ann for her warm generosity and inspiring insights; Jeff Weber
for his beautiful design of Mertz; Carrick, for helping me with so many things with
the robot; Philipp, for programming the complicated pie-chart plotter which really
saved my result presentation; Paulina, for proof-reading my thesis and going over
the clustering algorithm; Lilla, for proof-reading my thesis and always ready to help;
James, for spending so many hours helping me with my thesis and slides; Mac, for
helping me with the matlab plots; Marty, for discussing and giving feedback on my
results; Joaquin and Alex, for helping me through some difficult nights; Becky, for
always checking on me with warm smiles and encouragement; Iris, Julian, and Justin,
for helping me with various Mertz projects; Louis-Philippe and Bernd Heisele for
going over my clustering results.
While the recent memory of thesis writing is more salient, I would also like to
thank all of my lab-mates who have shaped my experience in the lab during the
past seven years: Jessica, Aaron, Una-May, Varun, Bryan, Myung-hee, Charlie, Paul,
Lorenzo, Giorgio, Artur, Howej, Scaz, Martin Martin, Maddog, and Alana. A special
thank you to Kathleen, my lab-mate, my best friend, my favorite anthropologist.
Most importantly, I would like to thank my parents. Without their unconditional support, I would not have been able to achieve my goals. And without their
progressive vision, I would not have aspired to my goals in the first place.
Contents
1 Introduction   17
  1.1 Thesis Approach   19
  1.2 Integration and Opportunities   21
  1.3 Integration, Challenges, and Robustness   22
  1.4 The Task Breakdown   23
  1.5 Thesis Scope and Criteria   24
  1.6 Thesis Contribution   26
  1.7 Demonstration and Evaluation   27
  1.8 Thesis Outline   28

2 Background and Related Work   30
  2.1 Learning from Experience   30
  2.2 Human-Robot Interaction   31
  2.3 Extending Robustness and Generality   34
  2.4 Face Recognition   35

3 Robot Design and Some Early Lessons   39
  3.1 Design Criteria and Strategy   40
    3.1.1 Increasing Robustness   40
    3.1.2 Designing A Social Interface   41
  3.2 The Robotic Platform   41
  3.3 Designing for Robustness   42
    3.3.1 Mechanical Design   42
    3.3.2 Low-Level Motor Control   45
    3.3.3 Modular Control   46
    3.3.4 Behavior-Based Control   48
    3.3.5 Long-Term Software Testing   49
  3.4 Social Interface Design   49
    3.4.1 Visual Appearance   49
    3.4.2 Generating and Perceiving Social Behaviors   49
  3.5 Some Early Experiments   51
    3.5.1 Experiment 1   52
    3.5.2 Experiment 2   54
    3.5.3 Experiment 3   57
    3.5.4 Summary and Lessons   58

4 Finding, Interacting With, and Collecting Data From People   64
  4.1 Challenges and Approach   65
  4.2 System Architecture and Overall Behavior   67
  4.3 Perceptual System   69
    4.3.1 Face Detection and Tracking   69
    4.3.2 Color Segment Tracking   71
    4.3.3 Motion Detection   71
    4.3.4 Auditory Perception   72
  4.4 Multi-modal Attention System   73
    4.4.1 Implementation Details   74
    4.4.2 Saliency Growth and Decay Rates   77
    4.4.3 An Example   79
  4.5 Spatio-temporal Sensory Learning   80
    4.5.1 Implementation Details   80
  4.6 Behavior-Based Control Architecture   84
  4.7 Automatic Data Storage and Processing   86
  4.8 Experimental Results   88
    4.8.1 Setup   88
    4.8.2 Finding and Interacting With People   90
    4.8.3 The Face Training Data   106

5 Unsupervised Incremental Face Recognition   111
  5.1 Challenges   112
  5.2 Failure Modes   114
  5.3 Unsupervised Face Clustering   115
    5.3.1 The Approach   116
    5.3.2 A Toy Example   118
    5.3.3 Implementation   120
    5.3.4 Clustering Performance Metric   124
    5.3.5 Clustering Parameters and Evaluation   128
    5.3.6 Summary and Parameter Specification Strategy   138
  5.4 The Integrated Incremental and Unsupervised Face Recognition System   140
    5.4.1 The Supervised Variant   144
    5.4.2 Incremental Recognition Results   146
    5.4.3 The Self-Generated Clusters   149
    5.4.4 Incremental Clustering Results Using Different Parameters   152
  5.5 Comparison to Related Work   159
  5.6 Discussion   163

6 Conclusion   165
  6.1 Lessons in Extending Robustness   166
  6.2 Future Work   168

A Experiment Protocol   170

B Sample Face Clusters   172

C Results of Clustering Evaluation   192

D Sample Clusters of the Familiar Individuals   227
List of Figures
1-1  Mertz, an expressive robotic head robot   18
1-2  A simplified diagram of the unsupervised incremental face recognition scheme   20
1-3  The fully integrated system from raw input to the incremental face recognition system   21
1-4  Some snapshots of Mertz in action   27
3-1  Multiple views of Mertz, an active vision humanoid head robot with 13 degrees of freedom   42
3-2  Robot's dimension and weight   43
3-3  Series Elastic Actuator   44
3-4  The robot's current system architecture   47
3-5  The first prototype of Mertz's head and face   50
3-6  The breakdown of what the robot tracked during a sequence of 14,186 frames collected on day 2   55
3-7  The face detector detected 114,880 faces with an average false positive rate of 17.8% over the course of 4 days. On the right is a sample set of the 94,426 correctly detected faces   56
3-8  The characteristics of speech input received by the robot on each day of the experiment   57
3-9  Number of words (x axis) in utterances (y axis) collected during each day of the experiment. Dashed line: all transcribed data. Solid line: robot-directed speech only   58
3-10 The top 15 words on each experiment day and the common set of most frequently said words across multiple days   59
3-11 Average pitch values extracted from robot-directed and non-robot-directed speech on each day of the experiment   59
3-12 A sample experiment interaction session   60
3-13 A set of face images collected in different locations and times of day   61
3-14 The robot's attention system output during two experiment days, inside and outside the laboratory   62
4-1  The robot's overall system architecture   68
4-2  The same-person face tracking algorithm   70
4-3  An example of an input image and the corresponding attentional map   73
4-4  The attention system's saliency function for varying growth and decay rates   78
4-5  A sample sequence of the attention map output during simultaneous interaction with two people   79
4-6  A diagram of the robot's spatio-temporal learning map   81
4-7  An example of the spatio-temporal learning system's activity function when an object is entered at time t = 3, 5, 7, 10, 18, 20, 22, 30 ms   83
4-8  Some samples of two-dimensional maps of the eleven Hebbian weights   83
4-9  MERTZ's final behavior-based controller   85
4-10 Some sample face sequences produced by the same-person tracking module   91
4-11 The distribution of the number of sequences and images from 214 people on day 6   92
4-12 The number of people that the robot interacted with during the seven hours on day 6   93
4-13 A set of histograms illustrating the distribution of session durations for 214 people on day 6   94
4-14 Segmentation and Correlation Accuracy of the Paired Face and Voice Data   98
4-15 A set of histograms of the pixel area of faces collected without active verbal requests   99
4-16 A set of histograms of the pixel area of faces collected with active verbal requests   100
4-17 A snapshot of the attention output illustrating the attention system's short-term memory capacity   102
4-18 A snapshot of the attention output while the robot switched attention due to sound cues   103
4-19 A snapshot of the attention output showing the robot interacting with two people simultaneously   104
4-20 A snapshot of the attention output showing the attention system's spatio-temporal persistence   105
4-21 A sequence of maps consisting of the eleven Hebbian weights recorded on day 7 of the experiment   107
4-22 Adaptive sound energy threshold value over time on days 6 and 7   108
4-23 Analysis of errors and variations in the face sequence training data   110
5-1  A sample of face sequences taken from interaction sessions   113
5-2  The Unsupervised Face Sequence Clustering Procedure   116
5-3  A simple toy example to provide an intuition for the sequence matching algorithm   119
5-4  The division of the face image into six regions   120
5-5  The feature extraction procedure of each face sequence   122
5-6  The Face Sequence Matching Algorithm   125
5-7  The Face Sequence Clustering Procedure   126
5-8  The clustering results with a data set of 300 sequences   130
5-9  The clustering results with a data set of 700 sequences   131
5-10 The normalized number of merging and splitting errors   132
5-11 The trade-off curves between merging and splitting errors   133
5-12 The trade-off curves between merging and splitting errors for a data set of 30 sequences   134
5-13 The trade-off merging and splitting curves for data sets of 500 sequences with different sequence distributions   135
5-14 The trade-off merging and splitting curves for data sets of 700 sequences with different sequence distributions   136
5-15 The sequence matching accuracy for different data set sizes and parameter C values   138
5-16 The face sequence clustering results with the Honda-UCSD video face database   142
5-17 The unsupervised and incremental face recognition system   143
5-18 The four phases of the unsupervised and incremental face recognition process   145
5-19 The adapted sequence matching algorithm for supervised face recognition   146
5-20 The incremental recognition results of each sequence input   148
5-21 The incremental recognition results of each sequence input using a different setting   150
5-22 The self-generated clusters constructed during the incremental recognition process   152
5-23 Visualization of the self-generated clusters   153
5-24 Six snapshots of the largest fifteen self-generated clusters   154
5-25 A sample cluster of a familiar individual   155
5-26 A sample cluster of a familiar individual   156
5-27 A sample cluster of a familiar individual   157
5-28 A sample cluster of a familiar individual   158
5-29 The set of correlation functions between the data set size and parameter C values   159
5-30 The incremental clustering results using different correlation functions between the data set size and parameter C   160
5-31 Comparison to Raytchev and Murase's unsupervised video-based face recognition system   161
5-32 Comparison to Berg et al.'s unsupervised clustering of face images and captioned text   162
B-1  An example of a falsely merged face cluster   173
B-2  An example of a falsely merged face cluster   174
B-3  An example of a good face cluster containing sequences from multiple days   175
B-4  An example of a good face cluster   176
B-5  An example of a non-face cluster   177
B-6  An example of a good face cluster containing sequences from multiple days   178
B-7  An example of a good face cluster   179
B-8  An example of a falsely merged face cluster   180
B-9  An example of a good face cluster containing sequences from multiple days   181
B-10 An example of a good face cluster   182
B-11 An example of a good face cluster   183
B-12 An example of a good face cluster   184
B-13 An example of a good face cluster   185
B-14 An example of a good face cluster   186
B-15 An example of a good face cluster   187
B-16 An example of a good face cluster   188
B-17 An example of a good face cluster containing sequences from multiple days   189
B-18 An example of a good face cluster   190
B-19 An example of a good face cluster   191
C-1  Clustering results of 30 sequences, C = 0%   193
C-2  Clustering results of 30 sequences, C = 70%   194
C-3  Clustering results of 300 sequences, C = 0%   195
C-4  Clustering results of 300 sequences, C = 30%   196
C-5  Clustering results of 300 sequences, C = 50%   197
C-6  Clustering results of 300 sequences, C = 70%   198
C-7  Clustering results of 500 sequences, C = 0%   199
C-8  Clustering results of 500 sequences, C = 30%   200
C-9  Clustering results of 500 sequences, C = 50%   201
C-10 Clustering results of 500 sequences, C = 70%   202
C-11 Clustering results of 500 sequences, C = 0%   203
C-12 Clustering results of 500 sequences, C = 30%   204
C-13 Clustering results of 500 sequences, C = 50%   205
C-14 Clustering results of 500 sequences, C = 70%   206
C-15 Clustering results of 500 sequences, C = 0%   207
C-16 Clustering results of 500 sequences, C = 30%   208
C-17 Clustering results of 500 sequences, C = 50%   209
C-18 Clustering results of 500 sequences, C = 70%   210
C-19 Clustering results of 700 sequences, C = 0%   211
C-20 Clustering results of 700 sequences, C = 30%   212
C-21 Clustering results of 700 sequences, C = 50%   213
C-22 Clustering results of 700 sequences, C = 70%   214
C-23 Clustering results of 700 sequences, C = 0%   215
C-24 Clustering results of 700 sequences, C = 30%   216
C-25 Clustering results of 700 sequences, C = 50%   217
C-26 Clustering results of 700 sequences, C = 70%   218
C-27 Clustering results of 1000 sequences, C = 0%   219
C-28 Clustering results of 1000 sequences, C = 30%   220
C-29 Clustering results of 1000 sequences, C = 50%   221
C-30 Clustering results of 1000 sequences, C = 70%   222
C-31 Clustering results of 2025 sequences, C = 0%   223
C-32 Clustering results of 2025 sequences, C = 30%   224
C-33 Clustering results of 2025 sequences, C = 50%   225
C-34 Clustering results of 2025 sequences, C = 70%   226
D-1  The generated cluster of familiar individual 2   228
D-2  The generated cluster of familiar individual 3   229
D-3  The generated cluster of familiar individual 4   230
D-4  The generated cluster of familiar individual 6   231
D-5  The generated cluster of familiar individual 9   232
D-6  The generated cluster of familiar individual 10   233
List of Tables
3.1 Experiment schedule, time, and location   52
3.2 List of observed failures during the experiment   53
4.1 A set of predefined sentences for regulating interaction with people   87
4.2 Experiment Schedule   89
4.3 The Face Tracking Sequence Output   92
4.4 The Same-Person Face Tracking Output on Day 6   92
4.5 Duration of interaction sessions for different numbers of people on day 6   95
4.6 The Voice Samples Output   96
4.7 The Paired Face and Voice Data   97
4.8 Segmentation and Correlation Accuracy of the Paired Face and Voice Data   99
4.9 The distribution of the number of images per sequence and the number of sequences and appearance days per individual   109
5.1 The data set sizes and parameter values used in the clustering algorithm evaluation   129
5.2 The batch clustering results using the Honda-UCSD face video database   141
Chapter 1
Introduction
“Learning is experiencing. Everything else is just information” (Albert
Einstein).
This thesis presents an integrated framework and implementation for Mertz, an
expressive robotic creature for exploring the task of face recognition through natural
interaction in an incremental and unsupervised fashion. The goal of this thesis is
to advance toward a framework which would allow robots to incrementally “get to
know” a set of familiar individuals in a natural and extendable way.
This thesis is motivated by the increasingly popular goal of integrating robots in
the home. We have now seen the Roomba take over the vacuum cleaning task in
many homes. As the robotic technology further advances, we would expect to see
more complex and general robotic assistants for various tasks, such as elder care,
domestic chores, etc. In order to be effective in human-centric tasks, the robots must
be able to not only recognize each family member, but also to learn about the roles of
various people in the household – who is the elderly person, who is the young child,
who is the part-time nurse caregiver.
In this thesis, we focus on two particular limitations of the current technology.
Firstly, most face recognition research concentrates on the supervised classification
problem: given a set of manually labelled training data, find the correct person label
for a new set of test data. The supervised approach is not ideal for two reasons.
Figure 1-1: Mertz is an expressive head robot designed to explore incremental individual recognition through natural interaction. The robot’s task is to learn to recognize
a set of familiar individuals in an incremental and unsupervised fashion.
The first reason is a practical one. Currently, one of the biggest problems in face
recognition is how to generalize the system to be able to recognize new test data that
look different from the training data, due to variations in pose, facial expressions,
lighting, etc. Thus, until this problem is solved completely, the existing supervised
approaches may require multiple manual introduction and labelling sessions to include training data with enough variations. The second reason involves the human
factor and social interface. In a study investigating long-term human-robot social
interaction, the authors concluded that in order to establish long-term relationships,
the robot should be able to not only identify but also “get to know” people who the
robot frequently encounters [36].
The second limitation of current technology is that there is typically a large gap
between research prototypes and commercial products. Despite tremendous research
progress in humanoid robotics over the past decade, deployment of more complex
and general robotic assistants into the home is not simply a matter of time. Lack
of reliability and robustness are two of the most important obstacles in this path.
Similarly, despite many advances in face recognition technology, the American Civil
Liberties Union's review of the deployment of a commercial face recognition surveillance
system at the Palm Beach airport yielded unsatisfactory results [92].
1.1 Thesis Approach
Our target goal is a robot which can incrementally learn to recognize a set of familiar
individuals in a home setting. As shown in figure 1-2, the system starts with an
empty database. As the robot encounters each person, it has to decide on the person’s
identity. If she is a new person (i.e. does not exist in the database), the robot will
generate a new class in the database, into which the robot will insert her face and
voice data. Upon collecting enough data in a class, the robot subsequently trains a
recognition system using its automatically generated database. After a number of
encounters, the robot should be able to recognize this new person and update her
data in the database appropriately.
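To make the flow of this decision loop concrete, the following is a minimal sketch in Python. The data layout, the placeholder similarity function, and the threshold values are illustrative assumptions, not the implementation described in chapter 5.

# A minimal sketch of the incremental recognition loop described above.
# The similarity function, thresholds, and class layout are assumptions for
# illustration; the actual algorithm is presented in chapter 5.

MATCH_THRESHOLD = 0.8   # assumed similarity needed to declare a familiar person
MIN_TRAIN_SIZE = 50     # assumed number of face images before (re)training

def similarity(face_sequence, person_class):
    """Placeholder similarity between a new face sequence and a stored class."""
    return 0.0   # stand-in so the sketch runs end to end

def on_encounter(face_sequence, voice_samples, database):
    """Decide whether the person is new or familiar, then update the database.

    `database` is a list of classes; each class holds the face and voice data
    the robot has collected for one individual.
    """
    scores = [similarity(face_sequence, c) for c in database]
    if scores and max(scores) >= MATCH_THRESHOLD:
        idx = scores.index(max(scores))      # familiar person: update the class
    else:
        database.append({"faces": [], "voices": [], "trained": False})
        idx = len(database) - 1              # new person: open a new class
    database[idx]["faces"].extend(face_sequence)
    database[idx]["voices"].extend(voice_samples)
    if len(database[idx]["faces"]) >= MIN_TRAIN_SIZE:
        database[idx]["trained"] = True      # enough data to (re)train on this class
    return idx

# The database starts empty; each encounter either adds a class or updates one.
db = []
on_encounter(["face_img_1", "face_img_2"], ["voice_clip_1"], db)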
Figure 1-2: A simplified diagram of the unsupervised incremental face recognition
scheme. The robot first starts with an empty database. As the robot encounters each
person, it has to decide on the person’s identity. If she is a new person, the robot will
generate a new class in the database, into which the robot will insert her face and
voice data. After a number of encounters, the robot should be able to recognize this
new person and update her data in the database appropriately.
In this thesis, we propose an unsupervised approach which would allow for a more
adaptive system which can incrementally update the training set with more recent
data or new individuals over time. Moreover, it gives the robots a more natural
social recognition mechanism to learn not only to recognize each person’s appearance,
but also to remember some relevant contextual information that the robot observed
during previous interaction sessions [58].
While an unsupervised and incremental face recognition system is a crucial element
toward our target goal, it is only a part of the story. A face recognition system
typically receives either pre-recorded face images or a streaming video from a static
camera. As illustrated by the ACLU review, a security application which interfaces
with the latter is already very challenging. In this case, our target goal is a robot that
can recognize people in a home setting. The interface between robots and humans is
even more dynamic. Both the robots and the humans move around.
Therefore, this thesis focuses on integrating an unsupervised and incremental face
recognition system within a physical robot which interfaces directly with humans
through natural social interaction. Moreover, in order to motivate robust solutions
and address scalability issues, we chose to put the robot, Mertz, in unstructured public
environments to interact with naive passersby, instead of with only the researchers
within the laboratory environment. Figure 1-3 shows the fully integrated system that
we implemented in this thesis, connecting raw input in real human environments to
the incremental face recognition system.

Figure 1-3: The fully integrated system from raw input to the incremental face recognition system, which we implemented in this thesis. Superimposed are three feedback
loops which allow for a set of opportunities described in the text.
1.2 Integration and Opportunities
A fully integrated system from raw input in the real world to the incremental recognition system generates a number of opportunities. As shown in figure 1-3, there are
three feedback loops which allow for the following opportunities.
First, there is a small loop between the human and the robot. This corresponds
to human-robot interaction. Through many hours of embodied interaction, the robot
can generate a large amount of natural human-centric perceptual data. In a four
day long experiment at the very early project stage, the robot collected over 90,000
face images from over 600 individuals with a wide range of natural poses and facial
expressions. Moreover, the robot can take advantage of various contextual information that is freely available through its embodied experience. Mertz utilizes and
learns to associate concurrent multi-modal sensory data to improve various perceptual mechanisms. Mertz’s attention system also heavily relies on both multi-modal
integration and spatio-temporal context. These contextual mechanisms have turned
out to be particularly useful in dealing with the chaotic and noisy nature of the
human environment.
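The attention mechanism itself is presented in chapter 4 (see figure 4-4); as a rough illustration of the saliency growth and decay idea it relies on, a single location in the attention map might be updated as follows. The update rule and the rate values here are assumptions for illustration, not the thesis formulation.

# A rough illustration of saliency growth and decay for one location in an
# attention map. The saturating-growth/exponential-decay rule and the rates
# below are illustrative assumptions, not the formulation used in chapter 4.

GROWTH_RATE = 0.3   # assumed: how quickly a stimulated location gains saliency
DECAY_RATE = 0.05   # assumed: how quickly an unstimulated location loses saliency

def update_saliency(saliency, stimulated):
    """One update step; saliency stays in [0, 1]."""
    if stimulated:
        return saliency + GROWTH_RATE * (1.0 - saliency)   # grow toward 1
    return saliency * (1.0 - DECAY_RATE)                    # decay toward 0

# A face seen for 5 frames, then absent for 10: saliency rises, then fades slowly.
s = 0.0
for t in range(15):
    s = update_saliency(s, stimulated=(t < 5))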
Second, there is a feedback loop between the human, the robot, and the learning
data. This feedback loop allows for an exploration of the experiential learning concept
which has been proposed by many, in various research disciplines [28, 99, 49]. As the
robot autonomously decides what and when to learn, the robot can organize and
influence these self-generated data in the most convenient way for its own learning
task. For example, in order to allow for higher resolution face images, the robot
verbally asks people to stand closer when they are too far away.
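As a concrete illustration of this kind of data-shaping behavior, the fragment below skips storing faces that are too small and instead issues a verbal request. The area threshold and the phrasing are assumptions; the robot's actual predefined sentences are listed in table 4.1.

# A sketch of the data-regulating behavior described above: faces that are too
# small (the person is far away) trigger a verbal request instead of being
# stored. The threshold and the sentence are illustrative assumptions.

MIN_FACE_AREA = 40 * 40   # assumed minimum face area, in pixels, worth storing

def regulate_face_distance(face_image, face_area, say, store):
    """Store the face crop, or verbally ask the person to come closer."""
    if face_area < MIN_FACE_AREA:
        say("Could you come closer?")   # hypothetical phrasing
        return False                    # the low-resolution face is not stored
    store(face_image)
    return True

# Example with stand-in callbacks: a 25x25 face is too small, so the robot asks.
regulate_face_distance("face_crop", 25 * 25, say=print, store=lambda img: None)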
Third, there is a big feedback loop between the human, the robot, and the incremental recognition system. This feedback loop allows for an opportunity for the
robot to adapt its behavior based on the recognition output, i.e. as the robot gets to
know some familiar individuals. In animal behavior research, this is widely known as
social recognition, i.e. a process whereby animals become familiar with conspecifics
and later treat them based on the nature of those previous interactions [58]. Social
recognition capabilities have been observed in bottlenose dolphins, lambs, hens, and
mantis shrimps [83, 78, 27, 24]. We do not explore this feedback loop in this thesis,
however, this thesis contributes by advancing toward social recognition capabilities
as a prerequisite for long-term human-robot social interaction [36].
1.3 Integration, Challenges, and Robustness
In addition to the above opportunities, a fully integrated system from raw input in the
real world to the incremental recognition system, as shown in figure 1-3, also presents
many challenges. Error propagation through the many subsystems is one of the
most unexplored challenges, since most research projects currently focus on building
isolated systems for face detection, tracking, clustering, recognition, etc. Moreover,
most current systems interface with data obtained from controlled experiments, by
either taking pictures of subjects or asking them to move around in front of a static
camera. Unless one chains together each of the subsystems in a fully integrated
system and interfaces the system with a real human environment, one will not see the
extent of the challenges posed by the error propagation.
Inspired by biological systems which are incredibly robust despite many redundant
and non-optimized components, our approach is not to optimize each subsystem.
Instead, we focus on achieving a robust integrated framework where failures in a
module are somehow compensated by other modules further down the line.
Moreover, our deliberate choice of an uncontrolled and challenging setup was
driven by the assumption that in a dynamic and noisy environment, errors are inevitable. It would be unrealistic to assume that 100% accuracy is possible in every
part of the system. We hypothesize that these errors and imperfections are in fact
useful as they would motivate the rest of the system to compensate robustly.
More generally, we propose that setting a higher benchmark for robustness and
generality of operating conditions is likely to motivate more scalable solutions. In the
conventional setup, where task performance has the highest priority, it is common to
employ shortcuts so as to allow for initial progress. It is quite typical for robots to
require a specific set-up, such as a particular location, set of objects, background color,
or lighting condition. Such simplifications, while far from reducing the environment
to a blocks world, naturally raise some scalability concerns. A deeper concern is that
these shortcuts might actually be hampering potential progress in the longer term.
1.4 The Task Breakdown
In order to illustrate the project goal and approach more concretely, we will enumerate
the robot’s set of tasks. While operating in the midst of hundreds of passersby, the
robot has to perform the following tasks automatically:
1. operate continuously for up to 10 hours each day;
2. attract passersby to approach the robot and engage them in spontaneous social
interaction, e.g., by visual tracking and simple verbal exchanges;
3. regulate the interaction in order to generate opportunities to collect data from
as many people as possible;
4. detect, segment, store, and process face images and voice samples during interaction;
5. use tracking and spatio-temporal assumptions to obtain a sequence of face images and voice samples of each individual as a starting point toward unsupervised recognition;
6. perform unsupervised clustering on these collected face sequences and construct
a self-generated training database, where each class consists of face images and
voice samples of each individual;
7. train a face recognition system using the self-generated training database to
recognize a set of individuals the robot has previously encountered.
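Read as a pipeline, these tasks chain together as sketched below. Every function name is a trivial stand-in for a subsystem described in chapters 4 and 5, not the thesis code; only the ordering of the stages is the point of the sketch.

# An organizational sketch of the task list above. Each function is a trivial
# stand-in for a subsystem described in chapters 4 and 5, not the thesis code.

def detect_faces(frame):              return []    # task 4: detect and segment faces
def engage_passersby(faces):          pass         # tasks 2-3: attract and regulate
def track_same_person(faces, audio):  return []    # task 5: per-person face sequences
def cluster_sequences(sequences):     return []    # task 6: unsupervised clustering
def train_recognizer(clusters):       return None  # task 7: train on self-labelled data

def daily_run(video_stream, audio_stream, collected):
    """Tasks 1-5: one day of continuous operation, growing the collected data."""
    for frame, audio in zip(video_stream, audio_stream):
        faces = detect_faces(frame)
        engage_passersby(faces)
        collected.extend(track_same_person(faces, audio))

def offline_learning(collected):
    """Tasks 6-7: cluster the collected sequences and train the recognizer."""
    return train_recognizer(cluster_sequences(collected))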
1.5 Thesis Scope and Criteria
Throughout this project, our design and implementation steps have been guided by
the following set of criteria and scope.
Full automation and integration The robot has to be fully automatic and integrated. It has to autonomously interact with passersby, segment and collect relevant
data from them, and use these data to construct a training database without any human
intervention.
Unstructured environment It has to operate in real time and in different public
environments, without the use of any markers to simplify the environment.
No manual data filtering or processing We do not allow any manual intervention to process and filter the data which the robot collected during the experiments.
The robot automatically stores these data to be used directly by the recognition
system.
Timing requirement The robot’s sensors and actuators have to perform in real
time. The face clustering system does not have to operate in real time.
Natural human-robot interaction The robot has to be able to interact with
multiple people concurrently in the most natural way possible, with minimal constraints and instructions. In some cases, we have had to compromise performance in
other parts of the system in order to enforce this requirement. For example, we have
to use a desktop microphone, instead of a headset, to allow multiple people to speak
to the robot at the same time. As a result, the robot’s speech recognition performance
is compromised. Without reliable speech understanding, the robot’s verbal behavior
is thus limited to simple word mimicking and some predefined sentences.
Trade-off between robustness and complexity We expect that a trade-off between robustness and complexity is inevitable. Thus, we have bypassed some
optimization steps in some of the robot’s subtasks. For example, the robot engages
people by simply visually tracking their faces, instead of explicitly detecting their
eyes for precise eye contact.
The recognition task The robot’s main task is to learn to recognize familiar
individuals, i.e., those who frequently interact with the robot. The robot does not
have to learn to recognize every single person the robot ever encountered. Moreover,
we will not evaluate each of the robot's subsystems and look for the most optimized
performance. We expect that the system will make mistakes, but these mistakes
must be somehow compensated by another part of the system. Instead of aiming to
achieve the highest classification accuracy for each test data, we would like to explore
alternative compensation methods which make sense given our setup. For example, in
the future, we are interested in exploring an active learning scheme where the robot
makes an inquiry to a person to check if its recognition hypothesis is correct and
somehow integrate the answers into its learning system.
1.6 Thesis Contribution
This thesis makes the following contributions:
• We implemented an integrated end-to-end incremental and fully unsupervised
face recognition framework within a robotic platform embedded in a real human
environment. This integrated system provides automation for all stages: data
collection, segmentation, labelling, training, and recognition.
• We developed a face sequence clustering algorithm that is robust to a high
level of noise in the robot’s collected data, generated by inaccuracies in face
detection, tracking, and segmentation in the dynamic and unstructured human
environments.
• We implemented a robust robotic platform capable of generating a large amount
of natural human-centric data through spontaneous interactions in public environments.
• We present an adaptive multi-modal attention system coupled with a spatio-temporal learning mechanism to allow the robot to cope with the dynamics and
noise of real human environments. These coupled mechanisms allow the robot to
learn multi-modal co-occurrence and spatio-temporal patterns based on its past
sensory experience. The design of the attention system also provides a natural
integration between bottom-up and top-down control, allowing simultaneous
interaction and learning.
Figure 1-4: Some snapshots of Mertz in action, interacting with many passersby in
the lobby of the MIT Stata Center. Mertz is typically approached by multiple people
at once. It is a quite hectic environment for a robot.
1.7 Demonstration and Evaluation
We demonstrate the robot’s capabilities and limitations in a series of experiments.
We initially summarized a set of multi-day experiments in public spaces, conducted
at the early stage of the project and interleaved with the robot development process.
Lessons from these early experiments have been very valuable for our iterations of
design and development throughout the project.
At the end of the project, we conducted a final experiment to evaluate the robot’s
overall performance from the perspective of its main task, incremental individual
recognition. This experiment was held for 8 days, 2-7 hours each day, at a public
lobby. Figure 1-4 shows some snapshots of the robot interacting with many passersby
in the lobby of the MIT Stata Center. We describe a set of quantitative and qualitative
results from each of the robot’s relevant subtasks.
We first assess the robot’s capabilities in finding and engaging passersby in spontaneous interaction. This involves the robot’s perceptual, attention, and spatiotemporal learning systems. Toward the final goal, the robot first has to collect as
many face images and voice samples as possible from each person. The longer and more
natural these interactions are, the more and better data the robot obtains for subsequent
recognition. We then evaluate the face and voice data that the robot automatically
collected during the experiment. We analyze the accuracy and other relevant characteristics of the face sequences collected from each person.
We then evaluate the incremental individual recognition system using the automatically generated training data. We analyze the face clustering performance across
different variants of the algorithm and data-dependent parameters. For comparison
purposes, we also apply the clustering algorithm on the Honda-UCSD video face
database [51, 50]. Lastly, we analyze both the incremental recognition performance
and the accuracy of the self-generated clusters in an evaluation of the integrated
incremental and unsupervised face recognition system.
1.8 Thesis Outline
We begin by discussing some background information and related work involving
face recognition, social robotics, and robustness issues in robotics in
chapter 2.
In chapter 3, we discuss the robot design and building process. We focus on
two issues that received the most amount of consideration during the design process:
robustness and social interface. We also present a series of earlier experiments conducted at different stages of the project and describe a number of valuable lessons
that heavily influenced the final implementation of the robot.
In chapter 4, we provide the implementation of the robot’s perceptual, attention,
and behavior systems. The robot has to organize these subsystems not only to solicit
spontaneous interaction, but also to regulate these interactions to generate learning
opportunities and collect as much face and voice data as possible from various individuals. We then evaluate these subsystems, with respect to the target goal of collecting
face and voice data from each individual through spontaneous interaction.
In chapter 5, we describe how these collected data are automatically processed
by the robot’s individual recognition system to incrementally learn to recognize a
set of familiar individuals. We present the implementation of the unsupervised face
clustering and how this solution is integrated into an incremental and fully unsupervised face recognition system. We evaluate this integrated system using the robot's
collected face sequences to analyze both the incremental recognition performance and
the accuracy of the self-generated clusters.
In the last chapter, we provide a conclusion and describe some possible future
directions.
Chapter 2
Background and Related Work
2.1 Learning from Experience
The main inspiration for this thesis is derived from the concept of active or experiential learning. The important role of experience in the learning process has been
emphasized in many different research areas, such as child development psychology,
educational theories, and developmental robotics.
Piaget proposed that cognitive development in children is contingent on four factors: biological maturation, experience with the physical environment, experience
with the social environment, and equilibration [77, 85]. “Experience is always necessary for intellectual development...the subject must be active....” [49]. Vygotsky
developed a sociocultural theory of cognitive development, emphasizing the role of the
socio-cultural environment in human cognitive development [99].
Learning by “doing” has been a popular theme in modern educational theories since John
Dewey's argument that children must be engaged in an active experience in order to learn [28]. The principle of sensory-motor coordination was inspired by John Dewey,
who, as early as 1896, had pointed out the importance of sensory-motor coordination
for perception. This principle implies that through coordinated interaction with the
environment, an agent can structure its own sensory input. In this way, correlated
sensory stimulation can be generated in different sensory channels – an important prerequisite for perceptual learning and concept development. The importance of direct
experience of sensory input and actuation in the world through physical embodiment
is the cornerstone of the embodied Artificial Intelligence paradigm [16, 74]. In this
thesis, by learning while experiencing the world, the robot gains the opportunity to
not only generate sensory input through its behavior, but also actively structure these
sensory inputs to accommodate its learning task. The use of social behavior has been
shown to be effective in regulating interaction and accommodating the robot’s visual
processing [15].
The notion that the human developmental process should play a significant role in
the pursuit of Artificial Intelligence has been around for a long time. The associated
idea of a child machine learning from human teachers dates back at least to Alan
Turing’s seminal paper “Computing Machinery and Intelligence” [95]. The interpretation and implementation of this developmental approach have varied from having
human operators enter common-sense knowledge into a large database system [52] to
robots that learn by imitating human teachers [84].
Developmental robotics is a very active research area that has emerged based on
these developmental concepts. Many developmental approaches have been proposed
and implemented in various robots [56]. Most relevant to our work are SAIL and Dav
at Michigan State University, humanoid robot platforms for exploring autonomous
life-long learning and development through interaction with the physical environment
and human teachers [102].
2.2 Human-Robot Interaction
The long-term objective of this thesis is to advance toward incremental individual
recognition as a prerequisite for long-term human-robot social interaction. Social
robotics is a growing research area based on the notion that human-robot social
interaction is a necessary step toward integrating robots into humans' everyday lives
[3] and for some, also a crucial element in the development of robot intelligence
[26, 11, 17].
[32] presents a survey of different approaches in developing socially interactive
robots. These systems vary in their goals and implementations. The following robots
are mainly focused on one-on-one and shorter-term interaction in controlled environments. Kismet at MIT is an expressive active vision head robot, developed to engage
people in natural and expressive face-to-face interaction [11]. The research motivation is to bootstrap from social competences to allow people to provide scaffolding to
teach the robot and facilitate learning.
WE-4R at Waseda University is an emotionally expressive humanoid robot, developed to explore new mechanisms and functions for natural communication between
humanoid robots and humans [62]. The robot has also been used to explore emotion-based conditional learning from the robot's experience [61]. Leonardo at MIT is an
embodied humanoid robot designed to utilize social interaction as a natural interface
to participate in human-robot collaboration [13]. Infanoid at National Institute of Information and Communications Technology (NICT) is an expressive humanoid robot,
developed to investigate joint attention as a crucial element in the path of children’s
social development [47].
There have also been a number of approaches in developing social robotic platforms which can operate for longer time scales in uncontrolled environments outside
the laboratory. The Nursebot at Carnegie Mellon University is a mobile platform designed and developed toward achieving a personal robotic assistant for the elderly [63].
In a two day-long experiment, the Nursebot performed various tasks to guide elderly
people in an assisted living facility. Similar to our findings in dealing with uncontrolled environments, the Nursebot’s speech recognition system initially encountered
difficulties and had to be re-adjusted during the course of the experiment. Grace at
CMU is an interactive mobile platform which has participated in the AAAI robot
challenge of attending, registering, and presenting at a conference [91].
Robovie at ATR, an interactive humanoid robot platform, has been used to study
long-term interaction with children for two weeks in their classrooms [43]. Keepon at
NICT, is a creature-like robot designed to perform emotional and attention exchange
with human interactants, especially children [48]. Keepon was used in a year and
a half long study at a day-care center to observe interaction with autistic children.
Robox, an interactive mobile robotic platform, was installed for three months at the
Swiss National Exhibition Expo 2002 [89]. RUBI and QRIO at the University of
California San Diego are two humanoid robots which were embedded at the Early
Childhood Education Center as part of a human-robot interaction study for at least
one year on a daily basis [67]. Robovie-M, a small interactive humanoid robot, was
tested in a two-day human-robot interaction experiment at the Osaka Science
Museum [88].
Most relevant to our project focus is Valerie at Carnegie Mellon University, a mobile
robotic platform designed to investigate long-term human-robot social interaction
[36]. Valerie was installed for nine months at the entranceway to a university building.
It consists of a commercial mobile platform, an expressive animated face displayed
on an LCD screen mounted on a pan-tilt unit, and a speech synthesizer. It uses a
SICK scanning laser range finder to detect and track people. People can interact with
Valerie by either speech or keyboard input. Similar to our case, the authors report
that a headset microphone is not an option and therefore the robot's speech recognition
is limited, especially given the noisy environment. Valerie recognizes individuals by
using a magnetic-stripe card reader. People can swipe any magnetic ID card in
order to uniquely identify themselves. One of Valerie’s primary interaction modes is
storytelling through 2-3 minute long monologues about its own life stories.
During these nine months, people have interacted with Valerie over 16,000 times,
counted by keyboard input of at least one line of text. An average of over 88 people interacted with Valerie each day. Typical interaction sessions are just under 30
seconds. Out of 753 people who have swiped an ID card to identify themselves, only
233 did so again during subsequent visits. Valerie encounters 7 repeat visitors
on average each day. These repeat visitors tend to interact with the robot for longer
periods, typically for a minute or longer. The authors suggest that in order to study
true long-term interactions with Valerie, the robot needs to be able to identify repeat
visitors automatically. Moreover, Valerie should not only identify but also get to
know people who frequent the booth.
We have the common goal of extending human-robot social interaction. Moreover,
Valerie's setup in a public environment amid passersby is similar to ours.
However, Valerie has been installed and tested for a much longer period. In terms of
user interface and perceptual capabilities, Mertz differs from Valerie in a number of
ways. Mertz can only interact with people through visual and verbal exchange. Thus,
it can rely only on noisy camera and microphone input in its interaction with people.
Mertz is a mechanical robotic head and is more expressive in terms of head postures.
Valerie's flat-screen face was reported to have difficulties in expressing gaze direction.
However, Mertz has only four degrees of freedom allocated to its facial expressions,
allowing a much smaller expressive range than an animated face.
2.3 Extending Robustness and Generality
The thesis goal of extending the duration and spatial range of operation is important
in that it addresses a particular limitation of current humanoid robotics research.
Despite tremendous research progress in humanoid robotics over the past decade, it
is still challenging to develop a robot capable of performing in a robust and reliable
manner.
As accurately described by Bischoff and Graefe [8], robot reliability has not received adequate attention in most humanoid robotics research projects. One possible
cause could be a false belief that when the time comes for robots to be widely deployed,
someone else will eventually address this limitation. Moreover, robustness to a wide
range of environments is crucial as the home environment is largely unstructured and
each one varies from another. This flexibility is still a major challenge in robotics.
The current trend in the field is to equip the robot to achieve a very specific and difficult task. The end goal is typically to demonstrate the robot performing its task for
a few minutes. Humanoid robots generally have a limited average running period and
are mostly demonstrated through short video clips, which provide a relatively narrow
perspective on the robot’s capabilities and limitations. This particular setup tends
to both require and generate very sophisticated but specialized solutions. Scalability
to other environments, other locations, and other users has mostly been put
on hold for now.
Bischoff and Graefe [8] present HERMES, an autonomous humanoid service robot,
designed specifically for dependability. It has been tested outside the laboratory
environment for long periods of time, including an over six-month long exhibition in
a museum. Although our project is exploring a different research direction, we fully
concur with the underlying theme of increasing robot robustness and reliability.
Reliability is also a relevant topic in other museum tour-guide robots [22, 94, 71].
Deployment to a public venue and the need to operate on a daily basis naturally place
reliability demands on these robots. Although Mertz is quite different in form and
function, we are exploiting a similar demand to have the robot perform on a daily
basis and interact with many people in a public venue.
2.4 Face Recognition
Research in person identification technology has recently received significant attention, due to the wide range of applications in biometrics, information security, law enforcement, and human-computer interaction (HCI). Face recognition is the most
frequently explored modality and has been implemented using various approaches
[105]. [68] attempted to combine face and body recognition. Speaker recognition has
also been widely investigated [35]. The use of multiple modalities has also been
explored [21, 55, 46, 25].
There are two main branches in face recognition research: image-based and video-based recognition. Image-based recognition typically involves high-resolution face
images, while video-based recognition deals with lower resolution camera input. Both
approaches have been explored using many different methods, ranging from Principal
Component Analysis and Hidden Markov Models to 3-dimensional morphable models
[96, 53, 9].
Our work falls into the latter category. While the video-based approach has its own set
of challenges, given the more dynamic input, it also has a number of advantages.
Instead of relying on a single image frame for training or recognition, we can start one
step ahead by tracking and utilizing spatio-temporal context. Video-based supervised
face recognition is increasingly more prevalent and has been explored using various
approaches [53, 106, 51, 38].
Our implementation of the robot’s face recognition system relies on the Scale
Invariant Feature Transform (SIFT) method. SIFT is a feature descriptor algorithm,
developed by David Lowe [54]. SIFT has been shown to provide robust matching
despite scale changes, rotation, noise, and illumination variations. There have been
a number of recent supervised face recognition systems which also rely on SIFT
features due to their powerful invariance properties [6, 60, 100]. However, the processing
of these SIFT features differs significantly among these approaches, including ours.
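To make the role of SIFT concrete, the following minimal Python sketch matches SIFT keypoints between two face crops using OpenCV's implementation and Lowe's ratio test. It is an illustration only, not the robot's actual pipeline; the file names are hypothetical.

    import cv2

    # Hypothetical file names; any two grayscale face crops will do.
    img_a = cv2.imread("face_a.png", cv2.IMREAD_GRAYSCALE)
    img_b = cv2.imread("face_b.png", cv2.IMREAD_GRAYSCALE)

    sift = cv2.SIFT_create()                      # SIFT keypoint detector and descriptor
    kp_a, des_a = sift.detectAndCompute(img_a, None)
    kp_b, des_b = sift.detectAndCompute(img_b, None)

    # Brute-force matching with Lowe's ratio test to keep only distinctive matches.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    good = [pair[0] for pair in matcher.knnMatch(des_a, des_b, k=2)
            if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance]
    print("good SIFT matches:", len(good))

The number of surviving matches gives a crude similarity score between the two crops; how such feature matches are accumulated and clustered is where the approaches cited above, including ours, differ.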
Most face recognition research focuses on the supervised classification problem,
i.e., given a set of manually labelled training data, find the correct person label for
a new set of test data. A number of researchers have been working on extending
this technology to allow for unsupervised training, motivated by a range of different
purposes and applications. Most of these systems, including ours, share the common
feature of relying on video-based approaches. Thus, the task is to cluster face sequences obtained by tracking instead of single face images. We will now discuss the
different goals and approaches of this related research.
Eickeler et al. proposed an image and video indexing approach that combines face
detection and recognition methods [30]. Using a neural-network-based face detector, extracted faces are grouped into clusters by a combination of a face recognition
method using pseudo two-dimensional Hidden Markov Models and a k-means clustering algorithm. The number of clusters is specified manually. Experiments on a
TV broadcast news sequence demonstrated that the system is able to discriminate
between three different newscasters and an interviewed person. In contrast to this
work, the number of clusters, i.e., the number of individuals, is unknown in our case.
Weng et al. present an incremental learning method for video-based face recognition [103]. The system receives input from a video camera as well as from a simulated
auditory sensor. Training and testing sessions are interleaved, as manually determined by the trainer. Each individual is labeled by manually entering the person's
name and gender during the training session. Both cameras and subjects are static.
A recognition accuracy of 95.1% has been achieved on 143 people. The issue of direct
coupling between the face recognition system and sensory input is very relevant to
our work, due to the requirement of an embodied setting.
Belongie et al. present a video-based face tracking method specifically designed to
allow autonomous acquisition of training data for face recognition [40]. The system
was tested using 500-frame webcam videos of six subjects in indoor environments
with significant background clutter and distracting passersby. Subjects were asked
to move to different locations to induce appearance variations. The system extracted
between zero and 12 face samples for each subject and never extracted a non-face area.
The described setup with background and distractions from other people is similar
to ours. However, our system differs in that it allows tracking of multiple people
simultaneously.
[5] presents an unsupervised clustering of face images obtained from captioned
news images and a set of names which were automatically extracted from the associated captions. The system clusters the face images together with the associated
names using a modified k-means clustering process. The face part of the clustering
system uses projected RGB pixels of rectified face images, which were further processed for dimensionality reduction and linear discriminant analysis. After various
filtering steps, clustering results were reported to produce an error rate of 6.6% using 2,417
images and 26% using 19,355 images.
Raytchev and Murase propose an unsupervised multi-view video-based face recognition system, using two novel pairwise clustering algorithms and standard image-based distance measures [81]. The algorithm uses grey-scale pixels of size-normalized
face images as features. The system was tested using about 600 frontal and multi-view
face image sequences collected from 33 subjects using a static camera over a period of
several months. The length of these video sequences ranges from 30 to 300 frames. The
subjects walked in front of the camera with different speeds and occasional stops.
Only one subject is present in each video sequence. Sample images show large variations in scale and orientation, but not in facial expressions. For evaluation purposes,
the authors defined the following performance metric: p = (1 − (E_AB + E_O)/N) × 100%,
where E_AB is the number of sequences mistakenly grouped into cluster A although
they should be in B, and E_O is the number of samples gathered in clusters in which
no single category occupies more than 50% of the nodes. Using this metric, the best
performance rate was 91.1% on the most difficult data set.
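For concreteness, the following minimal Python sketch computes this performance metric from hypothetical counts; the argument names and the example numbers are illustrative and are not values taken from [81].

    def clustering_performance(n_sequences, n_misassigned, n_in_impure_clusters):
        """p = (1 - (E_AB + E_O) / N) * 100%, where E_AB counts sequences placed in
        the wrong cluster and E_O counts samples in clusters with no majority
        (>50%) category."""
        e_ab = n_misassigned
        e_o = n_in_impure_clusters
        return (1.0 - (e_ab + e_o) / float(n_sequences)) * 100.0

    # Hypothetical example: 600 sequences, 38 misassigned, 15 in impure clusters.
    print(clustering_performance(600, 38, 15))   # roughly 91.2%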
Mou [66] presents an unsupervised video-based incremental face recognition system. The system is fully automatic, including a face detector, tracker, and unsupervised recognizer. The recognition system uses feature encoding from FaceVACS, a
commercial face recognition product. The system was first tested with a few hours of
TV news video input and automatically learned 19 people. Only a qualitative description was reported, stating that the system had no problem recognizing all the news reporters
when they showed up again. The system was also tested with 20 subjects who were
recorded over a span of two years. Other than the fact that one person was falsely
recognized as two different people, no detailed quantitative results were provided.
In this thesis, we aim to solve the same unsupervised face recognition problem as
in the last two papers. Our approach differs in that our system is integrated within an
embodied interactive robot that autonomously collected training data through active
interaction with the environment. Moreover, we deal with naive passersby in a more
dynamic public environment, instead of manual recording of subjects or TV news
video input.
Chapter 3
Robot Design and Some Early
Lessons
In this chapter, we will discuss the set of criteria and strategies that we employed
during the robot design process. As listed in section 1.4, the first task toward achieving the thesis goal is to build a robot that can operate for many hours and engage
in spontaneous interaction with passersby in different public spaces. This translates
into two major design prerequisites.
Firstly, the robot design must satisfy an adequate level of robustness to allow
for long-term continuous operation and to handle the complexity and noise of the human
environment.
Secondly, given that natural interactive behavior from humans is a prerequisite for
Mertz’s learning process, the robot must be equipped with basic social competencies
to solicit naive passersby to engage in a natural and spontaneous interaction with the
robot. As listed in the thesis performance criteria, Mertz has to be able to interact
with people in the most natural way possible, with minimum constraints. Since the
robot is placed in public spaces, this means that the robot must be able to interact
with multiple people simultaneously.
3.1 Design Criteria and Strategy
3.1.1 Increasing Robustness
The robot building process is a struggle to deal with a high level of complexity
with limited resources and a large set of constraints. In order to allow many hours
of continuous operation, the robot must be resilient to various incidents. Failures may occur at any point in the intricate dependency and interaction among the
mechanical, electrical, and software systems. Each degree of freedom of the robot
may fail because of inaccurate position/torque feedback, loose cables, obstruction in
the joint’s path, processor failures, stalled motors, error in initial calibration, power
cycle, and various other sources.
Even if all predictable problems are taken into account during design time, emergent failures often arise due to unexpected features in the environment. Perceptual
sensors particularly suffer from this problem. The environment is a rich source of
information for the robot to learn from, but is also plagued with a vast amount of
noise and uncertainty. Naturally, the more general the robot’s operating condition
needs to be, the more challenging it is for the robot to perform its task.
During the design process, maximum effort must be made to minimize the risk of
failures and to attain an appropriate balance between complexity and robustness. Moreover,
modularity in subsystems and maximizing autonomy at each control level are crucial
in order to keep failures from chaining into catastrophic accidents. We spent
a lot of time and effort stabilizing the low-level control modules. All software
programs must be able to run for many hours and thus be free of occasional bugs
and memory leaks.
In addition to fault related issues, the robot must be easily transported and set up
at different locations. The start-up procedure must be streamlined such that the robot
can be turned on and off quickly and with minimum effort. In our past experience,
even such a seemingly trivial issue generated enough hesitation that researchers avoided turning on the
robot frequently. Lastly, we conducted a series of long, exhaustive testing processes
in different environmental conditions and carried out multiple design iterations to
explore the full range of possible failure modes and appropriate solutions.
3.1.2 Designing A Social Interface
As social creatures, humans have the natural propensity to anticipate and generate
social behaviors while interacting with others. In addition, research has indicated that
humans also tend to anthropomorphize non-living objects, such as computers, and
that even minimal cues evoke social responses [69]. Taking advantage of this favorable
characteristic, Mertz must have the ability to produce and comprehend a set of social
cues that are most overt and natural to us. Results in a human-robot interaction
experiment suggest that the ability to convey expression and indicate attention are
the minimal requirements for effective social interaction between humans and robots
[20]. Thus, we have equipped Mertz with the capability to generate and perceive a
set of social behaviors, which we will describe in more detail below.
3.2 The Robotic Platform
Mertz is an active-vision head robot with thirteen degrees of freedom (DOF), using
nine brushed DC motors for the head and four RC servo motors for the face elements
(see figure 3-1). As a tradeoff between complexity and robustness, we attempted to
minimize the total number of DOFs while maintaining sufficient expressivity.
The eight primary DOFs are dedicated to emulate each category of human eye
movements, i.e., saccades, smooth pursuit, vergence, vestibular-ocular reflex, and
opto-kinetic response [17]. The head pans and uses a cable-drive differential to tilt
and roll. The eyes pan individually, but tilt together. The neck also tilts and rolls
using two Series Elastic Actuators [80] configured to form a differential joint.
The expressive element of Mertz’s design is essential for the robot’s social interaction interface. The eyelids are coupled as one axis. Each of the two eyebrows and
lips is independently actuated.
Mertz perceives visual input from two color digital cameras (Point Grey Dragonfly) with FireWire interfaces, chosen for their superior image quality. They produce
Figure 3-1: Multiple views of Mertz, an active-vision humanoid head robot with 13
degrees of freedom (DOF). The head and neck have 9 DOF. The face has actuated
eyebrows and lips for generating facial expressions. The robot perceives visual input
using two digital cameras and receives audio input using a desk voice-array microphone, placed approximately 7 inches in front of the robot. The robot is mounted
on a portable wheeled platform that is easily moved around and can be turned on
anywhere by simply plugging into a power outlet.
640 x 480, 24-bit color images at the rate of 30 frames per second. The robot receives proprioceptive feedback from both potentiometers and encoders mounted on
each axis. We also equipped the robot with an Acoustic Magic desk microphone, instead of a headset microphone, in order to allow for unconstrained interaction with
multiple people simultaneously. The robot’s vocalization is produced by the DECtalk
phoneme-based speech synthesizer using regular PC speakers.
Lastly, the robot is mounted on a portable wheeled platform that is easily moved
around and can be booted up anywhere by simply plugging into a power outlet.
3.3 Designing for Robustness
3.3.1 Mechanical Design
Mertz was mechanically designed with the goal of having the robot be able to run
for many hours at a time without supervision. Drawing from lessons from previous
robots, we incorporated various failure prevention and maintenance strategies, as
Figure 3-2: One of the mechanical design goals is to minimize the robot’s size and
weight. The smaller and lighter the robot is, the less torque is required from the
motors to achieve the same velocity. The overall head dimension is 10.65 x 6.2 x 7.1
inches and weighs 4.25 lbs. The Series Elastic Actuators extend 11.1 inches below
the neck and each one weighs 1 lb.
described below. The mechanical design of the robot was produced in collaboration
with Jeff Weber.
Compact design to minimize total size, weight, and power A high-priority
constraint was placed during the early phase of the design process to minimize the
robot’s size and weight. A smaller and lighter robot requires less torque from the
motors to reach the same velocity. Also, the robot is less prone to overloading, which causes
overheating and premature wear of the motors. The overall head size is 10.65 x 6.2
x 7.1 inches (see figure 3-2) and weighs 4.25 lbs. The Series Elastic Actuators, which
bear the weight of the head at the lower neck universal joint, extend 11.1 inches below it.
Mertz's compact design is kept light by incorporating nominal light alloy
parts, which retain stiffness and durability despite their small size. Titanium, as an
alternative to aluminum, was also used for some parts in order to minimize weight
without sacrificing strength.
Force sensing for compliancy Two linear Series Elastic Actuators (SEA) [80]
are used for the differential neck joint, a crucial axis responsible for supporting the
Figure 3-3: The Series Elastic Actuator (SEA) is equipped with a linear spring that
is placed in series between the motor and load. A pair of SEAs are used to construct
the differential neck joint, allowing easy implementation of force control. The neck
axis is the biggest joint responsible for supporting the weight of the head. Thus, the
compliancy and ability to maintain position in the absence of power provided by the
SEA are particularly useful.
entire weight of the head. As shown in figure 3-3, each SEA is equipped with a
linear spring that is placed in series with the motor, which acts like a low-pass filter,
reducing the effects of high-impact shocks while protecting the actuator and the
robot. These springs also, by isolating the drive nut on the ball screw, provide a
simple method for sensing force output, which is proportional to the deflection of
the spring (Hooke's Law, F = kx, where k is the spring constant). This deflection is
measured by a linear potentiometer between the frame of the actuator and the nut
on the ball screw. Consequently, force control can be implemented which allows the
joint to be compliant and to safely interact with external forces in the environment.
We implemented a simple gravity compensation module to adapt the force commands
for different orientations of the neck, using the method described in [70].
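The following minimal Python sketch illustrates this force sensing scheme: the linear potentiometer reading is converted to spring deflection and then to force via Hooke's law. The calibration constants are hypothetical, not the robot's actual values.

    # Hypothetical calibration values for a single SEA.
    SPRING_CONSTANT = 9000.0   # N/m, illustrative
    POT_GAIN = 0.0005          # meters of deflection per ADC count, illustrative
    POT_ZERO = 2048            # ADC reading at zero deflection, illustrative

    def sea_force(pot_reading):
        # Spring deflection from the linear pot, then force via F = k * x.
        deflection = (pot_reading - POT_ZERO) * POT_GAIN   # meters
        return SPRING_CONSTANT * deflection                # newtons

    print(sea_force(2100))   # small positive force for a reading above the zero point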
Additionally, the ball screw allows the SEA to maintain position of the head when
motor power is turned off. Collapsing joints upon power shutdown is a vulnerable
point for robots, especially large and heavy ones.
Safeguarding position-controlled axes The rest of the DOFs rely on position
feedback for motion control and thus are entirely dependent on accurate position
sensors. Incorrect readings or faulty sensors could lead to serious damage to the
robot, so a redundant relative encoder and potentiometer are used in each joint.
The potentiometer provides absolute position measurement and eliminates the need
for calibration routines during startup. Each sensor also serves as a comparison point
to detect failures in the other. Each joint is also designed to be back-drivable and
equipped with a physical stop in order to reduce failure impacts.
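As an illustration of how the redundant sensors can guard each other, the following Python sketch flags an axis fault when the potentiometer and encoder disagree persistently. The tolerance and strike count are hypothetical, not the values used on the robot.

    POSITION_TOLERANCE = 3.0   # degrees of allowed disagreement, illustrative
    MAX_STRIKES = 10           # consecutive disagreements before declaring a fault

    def check_axis(pot_deg, encoder_deg, strikes):
        # Compare the absolute pot reading against the encoder-derived position.
        if abs(pot_deg - encoder_deg) > POSITION_TOLERANCE:
            strikes += 1
        else:
            strikes = 0
        return strikes >= MAX_STRIKES, strikes   # (fault detected?, updated strike count)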
Electrical cables and connectors Placement of electrical cables is frequently an
afterthought in robot designs, even though a broken or loose cable is one of the most common
failure sources. Routing over thirty cables inside the robot without straining any of them
or obstructing the joints is not an easy task. Mertz's head design includes large
cable passages through the center of the head differential and the neck. This allows cable
bundles to be neatly tucked inside the passages from the eyes all the way through to
the base of the robot, thus minimizing cable displacement during joint movement. On
the controller side, friction or locking connectors are used to ensure solid connections.
3.3.2 Low-Level Motor Control
Whereas our previous work has made use of off-the-shelf motion control products and
PC nodes, we have implemented custom-made hardware for Mertz’s sensorimotor and
behavior control. Off-the-shelf products, though powerful and convenient, are limited
to a set of predetermined capabilities and can be unduly complex. Customizing our
hardware to more precise specifications gives us greater control and flexibility. One
caveat is that it took more time to develop custom-made hardware to a reliable state.
The custom-made motor controller is built using the Motorola DSP56F807. The
controller supports PWM generation, encoder feedback, and A/D conversion for all
existing axes. The amplifier uses the LMD18200 dual H-Bridge which accommodates
up to 3A continuous output as well as current sensing. The ability to sense current
is crucial as it provides a way to detect failures involving stalled motors. Particular
attention was given to protect the robot against power cycle or shutdown. The motors
and controller use separate power sources. Thus, we added simple circuitry to prevent
the motors from running out of control if the controller happens to be off or reset,
which could happen depending on the controller’s initial state upon power reset.
Simple PD position and velocity control were implemented on the head and eye
axes. Taking advantage of the Series Elastic Actuators, we use force feedback to
implement force control for the neck joint. A simple PD position control was then
placed on top of the force control. Various bounds are enforced to ensure that both
position/force feedback and motor output stay within reasonable values.
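The following minimal Python sketch shows a PD position step with the kind of bounds checking described above; the gains and limits are hypothetical, not the controller's actual values.

    KP, KD = 4.0, 0.2                 # illustrative PD gains
    POSITION_LIMITS = (-45.0, 45.0)   # degrees, illustrative joint range
    OUTPUT_LIMITS = (-255, 255)       # PWM counts, illustrative

    def clamp(value, lo, hi):
        return max(lo, min(hi, value))

    def pd_step(target, position, velocity):
        # Reject out-of-range feedback and targets before computing the command.
        position = clamp(position, *POSITION_LIMITS)
        error = clamp(target, *POSITION_LIMITS) - position
        command = KP * error - KD * velocity
        return clamp(command, *OUTPUT_LIMITS)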
Each axis is equipped with a potentiometer and a digital relative encoder. This
allows for a fast and automatic calibration process. Upon startup each axis is programmed to find its absolute position and then relies on the encoder for more precise
position feedback. This streamlines the startup sequence to two steps which can be
performed in any order: turning on the motor controller and turning on the motors.
While the robot is running, the motor controller can be reset at any time, causing the
robot to re-calibrate to its default initial position and resume operation. The motors
can also be turned off at any time, stopping the robot, and turned back on, letting
the robot pick up where it left off.
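A minimal sketch of this two-sensor calibration scheme is shown below: the absolute potentiometer reading fixes the joint angle at startup, after which the relative encoder provides finer position feedback. The conversion constants are hypothetical.

    POT_DEG_PER_COUNT = 0.1    # illustrative potentiometer scale
    ENC_DEG_PER_TICK = 0.01    # illustrative encoder scale

    class JointCalibration:
        def __init__(self, pot_counts, encoder_ticks):
            # Absolute angle at startup, taken once from the potentiometer.
            self.offset_deg = pot_counts * POT_DEG_PER_COUNT
            self.encoder_zero = encoder_ticks

        def position_deg(self, encoder_ticks):
            # After calibration, rely on the finer-grained relative encoder.
            return self.offset_deg + (encoder_ticks - self.encoder_zero) * ENC_DEG_PER_TICK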
3.3.3 Modular Control
Figure 3-4 illustrates the interconnections among the robot’s hardware and software
modules. The rectangular units represent the hardware components and the superimposed grey patches represent the software systems implemented on the corresponding
hardware module. We have arranged these subsystems in the same order as the
control layers. We paid careful attention to ensure that each layer of control is independent, such that the robot is safeguarded upon removal of higher-level control
while the robot is running. Motor control and behavior layers are implemented using
embedded microprocessors, instead of more powerful but complex PCs, such that
they can run autonomously and reliably at all times without having to worry about
the many other processes running on the computers.
Figure 3-4: The robot’s hardware and software control architecture. Rectangular
units represent hardware components. The superimposed grey patches represent the
software systems implemented on the corresponding hardware. We carefully designed
each control layer to be modular and independent. Higher level control layers can be
removed at any point without disrupting the robot’s operation.
We will now go through each control layer as shown in figure 3-4 and describe
how they interact and affect the robot’s overall behavior. Suppose we strip all control
layers, including the lowest level motor control layer. In this condition, the amplifier
is guaranteed to produce zero output to the motors, until the motor controller is back
up. This essentially protects the robot in the event of power loss or power cycle to
the motor controller. If the motor controller is put back into the system without
any other control layers, each degree of freedom will automatically calibrate into its
predefined zero position and stay there. At this point, the force control for the neck
joints will be active, causing both joints to be compliant to external forces. If the
behavior system is now added to the configuration, it will generate random motion
commands to all degrees of freedom. If the vision system is also turned back on, it
will communicate to the behavior system, which will in turn send commands to have
the robot respond to salient visual targets. Similarly, if the audio system is added, it
will communicate to the behavior system and activate the robot’s speech behavior.
This control modularity comes in very handy during the development and debugging
process, because one can now be very sloppy about leaving the robot’s motors on
while updating and recompiling code.
3.3.4 Behavior-Based Control
We used the behavior-based control approach to implement Mertz’s behavior system
[2]. A behavior-based controller is a decentralized network of behaviors. Each behavior independently receives sensory input and sends commands to the actuators,
while communicating with the other behaviors. The robot's overall behavior is the result of an
emergent and often unpredictable interaction among these behavioral processes. This
decentralized approach allows for a more robust implementation, as the robot’s behavior system may still work partially even if some components of the robot’s system
are non-functional.
The robot’s behavior system was implemented in L/MARS [19]. L is a Common
Lisp-based programming language specifically designed to implement behavior-based
programs in the incremental and concurrent spirit of the subsumption architecture
[18]. The L system has been retargeted to the PowerPC and is running on a
Mac Mini computer. The MARS (Multiple Agency Reactivity System) language is
embedded in L and was designed for programming multiple concurrent agents. MARS
allows users to create many asynchronous parallel threads sharing a local lexical
environment. Groups of these threads are called assemblages. Each assemblage can
communicate to others using a set of input and output ports. As defined in the
subsumption architecture, wires or connections between ports can either suppress or
inhibit each other. Assemblages can be dynamically killed and connections among
ports can be dynamically made and broken.
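The following Python sketch is a loose analogue (not actual L/MARS code) of two behaviors sharing an output port, where a higher-priority wire suppresses a lower-priority one for a short window, in the subsumption style.

    import time

    class SuppressiblePort:
        def __init__(self, suppress_window=1.0):
            self.value = None
            self.suppressed_until = 0.0
            self.suppress_window = suppress_window

        def send(self, msg, suppress=False):
            now = time.time()
            if suppress:
                # Higher-priority wire writes and gates out other writers for a while.
                self.value = msg
                self.suppressed_until = now + self.suppress_window
            elif now >= self.suppressed_until:
                self.value = msg

    motor = SuppressiblePort()
    motor.send(("look", "random"))                 # wander behavior, low priority
    motor.send(("look", "face"), suppress=True)    # face tracker takes over the port
    motor.send(("look", "random"))                 # ignored while suppression is active
    print(motor.value)                             # ('look', 'face')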
3.3.5 Long-Term Software Testing
As mentioned above, all software systems must be able to run continuously for many
hours. Long-term testing and experiments have been very helpful in identifying emergent and occasional bugs, as well as memory leaks. We conducted multiple design and
testing iterations of various software components in different environmental settings
to avoid overspecialized solutions.
3.4 Social Interface Design
3.4.1 Visual Appearance
As humans, like all primates, are primarily visual creatures, the robot's appearance
is an important factor and should be designed to facilitate its role as a social creature.
There have been some attempts to study how a humanoid robot should look from
the perspective of human-robot interaction [29, 64]. However, other than a number
of resulting guidelines, the search space is still enormous. We intuitively designed the
robot to be somewhat human-like, child-like, and friendly, as shown in figure 3-5.
3.4.2 Generating and Perceiving Social Behaviors
As noted in section 3.1.2, the ability to convey
expression and indicate attention are the minimal requirements for effective social
interaction between humans and robots [20].
We have incorporated degrees of freedom into the design of Mertz’s head and face
such that the robot can produce a set of social gestures. Two pan and one tilt DOF are
dedicated to generate various human eye movement categories, e.g. saccades, smooth
pursuit, vergence, vestibular-ocular reflex, and the opto-kinetic response. These DOF
also allow the eyes to gaze in all directions. The head has three degrees of freedom
to pan, tilt, and roll, yielding many possible head movements. Mertz has a pair of
eyebrows, eyelids, and lips for generating a number of facial expressions. The lips also
serve as a visual complement for the robot’s speech synthesizer. The two-DOF neck
Figure 3-5: A close-up image of the Mertz robot. We intuitively designed the robot’s
visual appearance to facilitate its sociability. Overall, we opted for a child-like and
friendly look. The robot can generate facial expressions by actuating the lips, eyelids,
and eyebrows. The lips also move corresponding to the robot’s speech.
adds to the robot’s expressiveness by enhancing head movements as well as producing
a number of overall postures. Clearly, these mechanical DOFs only provide a part
of the story, as they must be controlled in conjunction with the perceptual systems.
The robot’s high level behavior control is described in more detail in section 4.6.
In addition to being socially expressive, the robot must also be responsive to human social cues. Toward this goal, the robot’s first task is to detect the presence
of humans. Once the robot locates a person’s face, it then has to make eye contact
and track the person. This task has been well demonstrated in many social robotic
platforms [11, 13, 63, 62]. In our setup, where the robot has to deal with an unstructured environment, we found that the complexity of this particular task has increased
significantly. In addition to drastic lighting and acoustical variations across different
locations and times of day, the robot has to interact with a large number of people,
often simultaneously. Without any specific instructions, these individuals display a
wide range of behavior patterns and expectations. These complexities have triggered
multiple design iterations and incremental changes in our implementation throughout
the project.
In the robot’s final implementation, the robot is capable of detecting people,
attending to a few people at a time, and engaging in a simple verbal interaction.
More details on the robot’s perceptual mechanism will be covered in section 4.3.
3.5 Some Early Experiments
We conducted a number of experiments to evaluate different subsystems at various
stages of the robot development. These experiments ranged from one to seven days
in length and were carried out in different locations. In this section, we will describe three
of these experiments. We briefly state the setup of each experiment, illustrate the
failures which occurred during the experiment, and summarize a number of lessons
learned.
During these experiments, the robot collected a set of numerical data to evaluate
the performance of various subsystems. In addition, there was also a set of qualitative
data that we observed by watching the robot from a distance. A carefully designed
human subject experiment would probably generate a set of interesting quantitative
data from these spontaneous human-robot interactions. However, this would require
a more stringent protocol, including written permission forms, which would alter
the nature of the experiment in some undesirable ways. Even though we cannot
present these observed lessons in numbers, we provide a qualitative description of them
in this section, as they are very valuable in understanding the problem scope.
Moreover, as we interleaved these experiments with the robot development process,
many of these lessons were later incorporated into the robot’s final implementation,
which will be described in the next chapter.
Table 3.1: Schedule, time, and location of an early experiment to evaluate the robot’s
reliability. We set up the robot to run at four different locations for a total of 46 hours
within 4 days. At this time, the robot had a very simple visual attention system for
orienting to various visual targets. Our goal was to study failure modes while the
robot operated in its full range of motion.
3.5.1 Experiment 1
Setup We conducted a four-day-long experiment at the very early stage of the
robot development. At this time, the robot only consisted of the head and neck
frame. The robot’s face was still in the design phase. We equipped the robot with
a simple behavior system where it simply orients to salient visual targets, i.e., faces,
skin color, and saturated colors. The goal was to test the robot’s robustness and
study failure modes while the robot operated in its full range of motion. The robot
ran for 46 hours within 4 days at four different locations. We also collected raw visual
data to observe variations across different times and locations.
The experiment schedule is shown in table 3.1. The shortest and longest durations
were 6 and 19 hours, respectively. Initially, the experiment was conducted with supervision. As the robot showed a reasonable level of reliability (with the exception of
the neck joint, which had to be rested every couple of hours), we started leaving the robot
alone and checked on it every hour. During the experiment, people were allowed to
approach but not touch the robot. While the robot was unsupervised, a sign placed near it
prohibited people from touching it.
Table 3.2: List of observed failures during the experiment. All failures originate from
mechanical problems in the neck joint and its Series Elastic Actuators. It is important
to note that this experiment is slightly biased toward finding mechanical faults, since most
of the hardware and software errors were fixed during the development process.
Failure Modes. The head and eye axes were free of failures. Most of the
failures originated from mechanical problems in the neck SEAs. Table 3.2
lists each failure that occurred during the experiment.
All observed failures involve the neck joint and its Series Elastic Actuators. Loose
screws seem to be particularly problematic. A probable explanation is that the neck
actuators are constantly in motion. The force control loop produces output that is
proportional to the linear pot signal plus some noise. A series of filters has now been
put in place in order to minimize noise. In addition, the load on each SEA motor is
very small, causing the control output to be very sensitive to even a trivial amount
of noise. A dead-band was placed in software to reduce this effect, which eliminates
some but not all of the actuator's jitter. We also found that decreasing friction on the
SEA's ball screw helps to reduce jitter. In addition, we added extra protection for
the screws, applying Loctite to as many of them as possible. Lastly, it is important
to note that the experiment setup is biased toward finding mechanical failures, since much
time was spent during the development process making sure that the hardware and software systems were working
properly.
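The following minimal Python sketch illustrates the combination of low-pass filtering and a dead-band on the force error described above; the filter constant, dead-band width, and gain are hypothetical.

    ALPHA = 0.2          # exponential low-pass smoothing factor, illustrative
    DEAD_BAND = 0.5      # force error (N) below which no output is produced, illustrative

    class FilteredForceLoop:
        def __init__(self, gain=10.0):
            self.filtered = 0.0
            self.gain = gain

        def step(self, raw_pot_force, desired_force):
            # Smooth the noisy pot-derived force estimate.
            self.filtered += ALPHA * (raw_pot_force - self.filtered)
            error = desired_force - self.filtered
            if abs(error) < DEAD_BAND:
                return 0.0                  # inside the dead-band: hold still, no jitter
            return self.gain * error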
3.5.2 Experiment 2
Setup. In this experiment, the robot ran from 11 am to 6 pm for 5 days in different
spaces on the first floor of the MIT Stata Center. At this time, the robot's head
and face had been completed. The goal of this experiment was to further evaluate
robustness and the robot’s potential for soliciting human passersby to engage in an
impromptu interaction.
A written sign was placed on the robot asking people to interact with
it. The sign explained that this was an experiment to test how well the robot
operates in different environments and warned that the robot would be collecting face
images and audio samples. A set of bright colored toys were placed around the robot.
We monitored the robot from a distance to encourage people to freely interact with
the robot, instead of approaching us for questions. Figure 3-12 shows the robot on
day 5 of the experiment.
Failure Modes The robot ran without any mechanical failures for the first 3 days
of the experiment. On the 4th day, we detached one of the SEAs which seemed to be
exposed to more friction than the other one, tightened the screws, and re-attached
it. We also had to re-calibrate the motor control software to adapt to the resulting
mechanical changes. The robot continued running without any failures for the rest of
the experiment. However, as we continued to find throughout the project, human error
is simply difficult to avoid; due to such an error, we lost some recorded data during
this experiment.
Experimental Results During the experiment, we recorded the robot’s visual
input and the tracker’s output every second. We labelled a sequence of 14186 frames
collected during a close to four hour period on day 2. Figure 3-6 shows the output
of the robot’s tracker during this period. For approximately 16.9% of the sequence,
the robot tracked correctly segmented faces. For a small part of the sequence the
robot tracked faces that either included too much background or partially cropped
and tracked bright colored toys and clothing.
Figure 3-6: The breakdown of what the robot tracked during a sequence of 14,186
frames collected on day 2.
We also collected every frame of segmented face images that were detected by
the frontal face detector throughout the experiment, which amounted to a total of
114,880 images from at least 600 individuals. The robot uses the frontal face detector
developed by Viola and Jones [98]. Figure 3-7 shows the breakdown of the face
detector’s false positive error rates during each day, excluding one day due to file
loss. These results suggest that the robot is able to acquire a significant set of face
images because people do interact with the robot closely enough and for long enough
durations.
The robot received over 1000 utterances per day. An utterance starts when the
audio input exceeds a minimum energy threshold and stops when a long enough pause
is detected. We transcribed a portion of these utterances (3027 audio samples from
at least 250 people) to analyze what the robot heard. Each data portion was taken
from a continuous sequence of speech input spanning a number of hours in each day.
Due to some file loss, we were only able to transcribe fewer than 300 utterances on day
5. As shown in Figure 3-8, approximately 37% of the total utterances are intelligible
robot directed speech. The rest are made up of background noise, the robot’s own
speech, human speech that is unintelligible (cropped, foreign language, muddled, etc),
Figure 3-7: The face detector detected 114,880 faces with an average false positive rate of
17.8% over the course of 4 days. On the right is a sample set of the 94,426 correctly
detected faces.
and human speech directed not at the robot.
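A minimal Python sketch of this energy-based utterance segmentation is shown below; the energy threshold and pause length are hypothetical and would need tuning to the actual audio front end.

    ENERGY_THRESHOLD = 0.02    # normalized short-term energy, illustrative
    PAUSE_FRAMES = 15          # consecutive quiet frames that end an utterance, illustrative

    def segment_utterances(frame_energies):
        # Returns (start, end) frame indices of detected utterances.
        utterances, start, quiet = [], None, 0
        for i, energy in enumerate(frame_energies):
            if energy >= ENERGY_THRESHOLD:
                if start is None:
                    start = i              # energy crossed the threshold: utterance begins
                quiet = 0
            elif start is not None:
                quiet += 1
                if quiet >= PAUSE_FRAMES:  # long enough pause: close the utterance
                    utterances.append((start, i - quiet))
                    start, quiet = None, 0
        if start is not None:
            utterances.append((start, len(frame_energies) - 1))
        return utterances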
Figure 3-9 shows the number of words in each utterance from the set of intelligible
human speech. One-word utterances make up 38% of all intelligible human speech
and 38.64% of robot directed speech. Two-word utterances make up 17.69% of all
intelligible human speech and 17.93% of robot directed speech. Approximately 83.21%
of all intelligible human speech and 87.77% of robot directed speech contain less than
5 words.
We are also interested in finding out whether or not the robot may be able to
acquire a lexicon of relevant words. In particular, we would like to assess whether a
set of words tends to be repeated by a large number of people. Figure 3-10 illustrates
the top fifteen most frequently said words during each day and a set of frequently
said words that are shared by 3 days or more during the experiment.
Figure 3-11 illustrates the difference in average pitch and pitch gradient values
of robot directed speech versus non-robot directed speech on each experiment day.
Both female and male speakers tend to speak with a higher average pitch to the robot
than to other people.
These results seem to suggest that people do in fact speak to the robot. Moreover,
they tend to speak to it like they would to a young child. The frequency of one-word utterances seems to be high enough to provide the robot with a starting point
Figure 3-8: The characteristics of speech input received by the robot on each day of
the experiment.
for unsupervised lexical acquisition. Lastly, a set of common words tends to be repeated throughout the experiment despite the large number of speakers and minimal
constraints on the human-robot interaction.
3.5.3 Experiment 3
Setup. This experiment was done in two parts. We first conducted a 5-hour experiment inside the laboratory, for which we asked ten people to come and interact with the
robot. A few days later, we conducted a six-hour experiment where the robot interacted with over 70 people in a public space outside the laboratory. The goal of these
experiments was to evaluate the robot's multi-modal attention and spatio-temporal
learning systems. The robot setup and experiment instructions were identical to
those described in Experiment 2.
Failure Modes. One of the robot’s computers has been problematic and finally
failed during the experiment. We also discovered a software bug because of a counter
which became too large and cycled back to zero. We did not encounter this error
during the shorter-term testing periods. A similar error occured in data recording,
where a log file storing a large amount of output from the robot’s spatio-temporal
Figure 3-9: Number of words (x axis) in utterances (y axis) collected during each
day of the experiment. Dashed line: all transcribed data. Solid line: robot directed
speech only.
learning system grew too large and killed the program.
In a later informal experiment in the public lobby, the robot's head pan and tilt
motors broke. Since the failure occurred when the robot was unsupervised, we do not
know the precise cause of the failures. Our hypothesis is that one of the motors'
gearheads may have failed and caused a chain reaction in another joint.
3.5.4 Summary and Lessons
Based on the numerical results and visual observation of these experiments, we extracted a number of lessons which have triggered a set of incremental changes in the
robot’s development process.
The environment. As expected, the robot’s environment is very dynamic and
challenging. Mertz was approached by an average of 140 people a day. The robot
was approached by single individuals, small groups, and at times large groups of up to
20 people. The robot often perceives multiple faces and speech from multiple people
simultaneously. Some people spoke to the robot, while some spoke to each other.
Additionally, the auditory system’s task is made even more difficult by the high level
15 most frequent words
Day 1: it/it's, you, hello, I, to, is, what, hi, are, the, a, Mertz, here, your, this
Day 2: you, hello, it/it's, what, I, hi, yeah, are, the, oh, to, is, Mertz, a, your
Day 3: hello, it/it's, you, hi, Mertz, bye, to, robot, are, the, I, what, hey, how, is
Day 4: you, hello, it/it's, what, hi, Mertz, are, I, bye, this, here, how, is, robot, to
Day 5: you, hello, it/it's, hi, I, what, are, oh, how, say, the, a, at, can, is
Common most frequent words
Shared by 5 days: hello, you, it/it's, hi, what, are, I, is
Shared by 4 days: to, Mertz, the, this, how, hey, what's
Shared by 3 days: a, here, your, oh, can
Figure 3-10: The top 15 words on each experiment day and the common set of most
frequently said words across multiple days.
Figure 3-11: Average pitch values extracted from robot directed and non-robot directed speech on each day of the experiment.
of background noise. It is a very erratic environment for the robot’s perceptual and
attention system.
As we move the robot to different locations, we encounter drastic changes in
the visual and acoustical input. We also continue to discover unexpected features
which were absent inside the laboratory environment but caused various difficulties
for the robot's perceptual and attention system. False positive errors are particularly
troublesome. A false face detection in the background or on a large bright orange wall tends
to dominate and steer the robot's attention system away from genuinely salient stimuli.
Variation in lighting and background noise level is also very problematic. Figure 3-13
Figure 3-12: A sample interaction session on day 5 of the experiment. The robot
ran continuously for 7 hours in a different public location each day. A written sign
was placed on the robot: “ Hello, my name is Mertz. Please interact with me. I can
see, hear, and will try to mimic what you say to me.” Some bright colored toys are
available around the robot. On the bottom right corner is a full-view of the robot’s
platform in the lobby of the MIT Stata Center.
contains a set of face images collected in different locations and times of day. A fixed
sound detection threshold which works well inside the laboratory is no longer effective
when the robot is moved outside the laboratory. The much higher background noise
causes the robot to perceive sound everywhere and overwhelms the attention system.
Figure 3-14 shows the output of the robot's attention system during two experiments,
inside and outside the laboratory. Each plot contains different measurements of what
the robot attended to and shows how the number of sound events dominates
over the visual events when the robot is moved outside the laboratory.
Figure 3-13: A set of face images collected in different locations and times of day.
These difficulties have led us to put a lot more effort into the robot's attention
system than we initially expected. We upgraded the robot's attention system to include an egocentric representation of the world, instead of simply relying on retinal
coordinates. We also enhanced the robot's attention system to utilize spatio-temporal
context and correlate multi-modal signals to allow for a more robust integration of
the noisy perceptual input. Research in computer vision and speech recognition has
made a lot of advances in dealing with these environmental variations. However, we
believe that errors and imperfections in the robot’s various subsystems are simply
inevitable. Thus, a robust integration of the robot’s subsystems is a crucial element
in the path toward intelligent robots.
The passersby. There is a large variation in the expectations and behavior
patterns of the naive passersby. Many people spoke naturally to the
robot, but some simply stared at the robot. Some people who successfully attracted
the robot’s visual attention then tried to explore the robot’s tracking capabilities by
moving around and tilting their heads. Many people were not aware of the robot’s
limited visual field of view and seemed to expect that the robot should be able to
see them wherever they were. When they realized that the robot was not aware of
Figure 3-14: The robot’s attention system output during two experiment days, inside
and outside the laboratory.
their presence, many used speech or hand movements to try to attract the robot’s
attention. This led to a decision to give Mertz a sound localization module, which
has been a tremendous addition to its ability to find people.
At this point, the robot’s auditory system consisted of a phoneme recognizer. The
speech synthesizer then simply mimicked each extracted phoneme sequence. This
led to a lot of confusion for many people. The robot often produced unintelligible
phoneme sequences due to the noisy recognizer. Even when the robot produced the
correct phoneme sequence, people still had trouble understanding the robot. Many
people also expected the robot to understand language and became frustrated when
the robot did not respond to their sentences. For this reason, we have incorporated
a more complex word recognition system into the robot.
Learning while interacting. The most important lesson that we learned during
these early experiments is the difficulty of having to interact with the environment while learning from
it. In our setup, there is no boundary between testing and training
stages. The robot’s attention system has to continually decide between two conflicting
choices: to switch attention to a salient input which may lead to learning targets or
to maintain attention to the current learning targets. Moreover, even though the
robot successfully tracked over a hundred thousand faces, the accuracy required of
tracking for interaction is much lower than that required for learning. In order to collect
effective face data for recognition purposes, the robot has to be able to perform
same-person tracking accurately. This is a very difficult task when both the robot's
cameras and people are constantly moving. Additionally, the simultaneous presence
of multiple faces further increases the task complexity. We further discuss this topic
in section 4.1 and present the implications, as reflected in the final implementation
of the robot’s attention system in section 4.4.
Chapter 4
Finding, Interacting With, and
Collecting Data From People
In the previous chapter, we have demonstrated a robotic platform that is capable of
operating for many hours continuously and soliciting spontaneous interaction from a
large number of passersby in different public locations. We also described some early
experiments and showed that there is still a large gap between the ability for superficial interaction with many passersby and our final goal of unsupervised incremental
individual recognition.
In this chapter, we present the implementation details of the robot’s perceptual,
attention, and behavior systems. The robot has to organize its perceptual and behavior systems not only to solicit interaction, but also to regulate these interactions to
generate learning opportunities. More specifically, as listed in section 1.4, the robot
has to perform the following tasks automatically:
1. attract passersby to approach the robot, engage them in spontaneous social
interaction, and trigger natural interactive behavior from them;
2. regulate the interaction in order to generate opportunities to collect data from
as many people as possible;
3. detect, segment, store, and process face images and voice samples during interaction with people;
4. use tracking and spatio-temporal assumptions to obtain a sequence of face images and voice samples of each individual as a starting point toward unsupervised recognition.
4.1 Challenges and Approach
As we have shown in Chapter 3, Mertz was able to collect a large number of faces
during some early experiments due to the extremely robust face detector, developed
by Viola and Jones [98]. This of course assumes an initial condition where the robot
is facing the person somewhat frontally. Even though this is not always the case,
the robot still managed to track over 100,000 faces from over 600 individuals in an
early 7-day long experiment, described in section 3.5. However, the further task
of interacting while collecting face and voice data of each individual has generated
additional load and complexity, especially on the robot’s attention system.
The attention system serves as a front gate to hold back and select from an
abundance of streaming sensory input. In the absence of such filtering, both the
robot’s controller and learning mechanism will be overwhelmed. The importance of
an attention system for learning has been recognized in many research areas [104, 20].
Studies of the human visual system suggest that selective visual attention is guided
by a rapid, stimulus-driven selection process as well as by a volitionally controlled
top-down selection process [73]. Incorporating top-down control of attention has
been explored in [34, 41, 42]. However, top-down attention control was mostly
simulated manually in these systems. Our initial implementation of the
attention system originated from [14]. However, the requirement for operation in
unstructured environments has triggered the need for many additional functionalities.
Many properties of the robot’s current attention system were inspired by the Sensory
Ego-sphere [41].
In our setup, where there is no boundary between testing and training stages,
the robot has to perform the parallel task of interacting with while collecting data
and learning from the environment. This task is difficult for a number of reasons.
Firstly, the robot’s attention system faces conflicting tasks, as it has to be reactive to
find learning targets in the environment but also persistent to observe targets once
they are found. In the human’s visual attention system, this dichotomy is reflected
in two separate components: the bottom-up (exogenous) and top-down (endogenous)
control [79].
Secondly, attending to learn in an unconstrained social environment is a difficult
task due to noisy perceptual sensors, targets disappearing and reappearing, the simultaneous presence of multiple people, and the target's or the robot's own motion. Same-person
tracking in subsequent frames is an easy task for the human visual system, since we
are very good at maintaining spatiotemporal continuity. Even when our heads and
eyes move, we can easily determine what has moved around us. Unfortunately, for a
robot active vision system, this is not the case. The robot essentially has to process
each visual frame from scratch in order to re-discover the learning target from the
previous frame. Tracking a person’s face in order to learn to recognize the person is a
somewhat convoluted problem. The robot has to follow and record the same person’s
face in subsequent frames, which requires some knowledge about what this person
looks like, but this is exactly what the robot is trying to gather in the first place.
An additional complexity is introduced by the trade-off between the timing and accuracy requirements of the interaction and learning processes. The interaction process needs fast processing to allow for timely responses, but requires less accuracy since the consequence of attending to the wrong thing is minimal. The data collection process is less urgent in timing, but it needs higher accuracy: the consequence of incorrect segmentation or assigning data to the wrong person is quite significant for the
robot’s learning process. Interestingly, this dichotomy is also reflected in the separate
dorsal where and ventral what pathways in the human’s visual system, for locating
and identifying objects [37].
We have designed the robot’s attention system to address some of the issues
mentioned above, by incorporating object-based tracking and an egocentric multimodal attentional map based on the world coordinate system [1, 41]. The attention
system receives each instance of object-based sensory events (face, color segment, and
sound) and employs space-time-varying saliency functions, designed to provide some
spatiotemporal short-term memory capacity in order to better deal with detection
errors and having multiple targets that come in and out of the field of view.
In addition, inspired by the coupling between human infants’ attention and learning process, we implemented a spatiotemporal perceptual learning mechanism, which
incrementally adapts the attention system’s saliency parameters for different types
and locations of stimuli based on the robot’s past sensory experiences. In the case of
human infants, the attention system directs cognitive resources to significant stimuli
in the environment and largely determines what infants can learn. Conversely, the
infants’ learning experience in the world also incrementally adapts the attention system to incorporate knowledge acquired from the environment. Coupling the robot’s
attention system with spatiotemporal perceptual learning allows the robot to exploit
the large amount of regularities in the human environment. For example, in an indoor environment, we would typically expect tables and chairs to be on the floor,
light fixtures to be on the ceiling, and people’s faces to be at the average human
height. Movellan et al. present an unsupervised face detection learning system that exploits contingency detected by an attention system, associating audio signals with people's tendency to attend to the robot [23].
4.2 System Architecture and Overall Behavior
Figure 4-1 illustrates the robot’s system architecture. The robot’s visual system is
equipped with detectors and trackers for relevant stimuli, i.e., faces, motion, and color
segments. The auditory system detects, localizes, and performs various processing
on sound input. Each instance of a perceptual event (face, motion, color segment,
and sound) is projected onto the world coordinate system using the robot’s forward
kinematic model and is entered into both the multi-modal egocentric attention and
spatio-temporal learning map. The attention system computes the target output
and sends it to the robot’s behavior system to calculate the appropriate next step.
The spatio-temporal learning process incrementally updates the attention’s saliency
Figure 4-1: The robot’s overall system architecture, consisting of a perceptual, attention, and behavior system. The robot’s visual and auditory system detects and localizes various stimuli. Each instance of a perceptual event (face, motion, color segment,
and sound) is projected onto the world coordinate system and entered into both the multi-modal
attention and spatio-temporal learning map. The attention system computes and
sends the target output to the robot’s behavior system. The spatio-temporal learning
process incrementally updates the attention’s saliency parameters.
parameters, which is then fed back into the attention map. In parallel, each perceptual
event is also filtered, stored, and processed to automatically generate clusters of
individuals’ faces and voice segments for incremental person recognition.
The robot’s vision system receives a 320 × 240 color frame from each camera, but
processes at half that resolution to allow real-time processing. However, the system
retrieves the higher resolution image when it segments and stores face images. The
vision system and communication among PC nodes were implemented using YARP, an open source vision software platform developed through a collaboration between the MIT Humanoid Robotics Laboratory and the LIRA-Lab at the University of Genova [59].
YARP is a collection of libraries, providing various image processing functionalities
and message based interprocess communication across multiple nodes.
4.3 Perceptual System
The robot is capable of detecting a set of percepts, i.e. face, motion, color segment,
and sound. We describe the implementation details of each perceptual subsystem
below and present how they are integrated in the next section.
4.3.1 Face Detection and Tracking
In order to be responsive to people's interaction attempts, MERTZ must be able to detect the presence of humans and track them while they are within its field of view. To detect and track faces, we combine a set of existing face detection and feature tracking algorithms. We will not go into the implementation details, as this information is available in the publications of the original work.
We are using the frontal face detector developed by Paul Viola and Michael Jones
[98]. The face detector occasionally finds a false positive face region in certain backgrounds, causing the robot to fixate on the floor, wall, or ceiling. This is especially
problematic during long experiments where there is a lot of down time when no
one is around. We implemented a SIFT-based feature matching module to calculate a
sparse disparity measure between the face region in the images from the left and right
cameras. Using an expected ratio of estimated disparity and face size, the module
rules out some false faces in the background that are too far away.
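To make the filter concrete, the following C++ sketch illustrates the disparity-to-face-size ratio test; the SIFT matching between the left and right face regions is assumed to happen elsewhere, and the ratio bounds are illustrative placeholders rather than the calibrated values used on the robot.

    #include <algorithm>
    #include <vector>

    // Sketch of the disparity-based false-positive filter described above.
    // A separate SIFT matching step is assumed to have produced horizontal
    // disparities (in pixels) for features inside the detected face region.
    // The ratio bounds below are illustrative assumptions.
    bool plausibleFace(std::vector<double> disparities, double faceWidthPixels) {
        if (disparities.empty() || faceWidthPixels <= 0.0)
            return false;  // no stereo evidence; caller may decide to keep or drop

        // Use the median disparity to be robust against bad feature matches.
        std::nth_element(disparities.begin(),
                         disparities.begin() + disparities.size() / 2,
                         disparities.end());
        double medianDisparity = disparities[disparities.size() / 2];

        // A real face of width w at distance z yields disparity roughly
        // proportional to 1/z, and w is also roughly proportional to 1/z, so
        // the disparity-to-face-width ratio should fall in a fixed band.
        double ratio = medianDisparity / faceWidthPixels;
        const double kMinRatio = 0.05;  // assumed lower bound
        const double kMaxRatio = 0.60;  // assumed upper bound
        return ratio >= kMinRatio && ratio <= kMaxRatio;
    }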
Since both people and the robot tend to move around frequently, faces do not
remain frontal to the cameras for very long. We are using a KLT-based tracker to
complement the frontal face detector. The KLT-based tracker was obtained from Stan
Birchfield's implementation [7] and enhanced to track up to five faces simultaneously.
The robot relies on the same-person face tracking module as a stepping stone
toward unsupervised face recognition. The idea is that the better and longer the
robot can track a person continuously, the more likely that it will collect a good
sequence of face images from him/her. A good sequence is one which contains a set
Figure 4-2: The same person face tracking algorithm. We have combined the face
detector, the KLT-based tracker, and some spatio-temporal assumptions to achieve
same-person tracking.
of face images of the same person with high variations in pose, facial expression, and
other environmental aspects.
We have combined the face detector, the KLT-based tracker, and some spatio-temporal assumptions to achieve same-person tracking. As shown in figure 4-2, for every detected face the robot activates the KLT-based face tracker for subsequent
frames. If the tracker is already active and there is an overlap in the tracked and
detected region, the system will assume that they belong to the same person and
refine the face location using the newly detected region. If there is no overlap, the
system will activate a new tracker for the new person. The overlap criterion is also shown in figure 4-2. If the disparity checker catches a false positive detection,
the face tracker cancels the corresponding tracking target.
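The following is a minimal C++ sketch of this merge logic, assuming a face detector and a KLT-style tracker are available elsewhere; the Track bookkeeping is illustrative rather than the robot's actual implementation.

    #include <vector>

    // Minimal sketch of the same-person tracking logic in figure 4-2.
    struct Box { int x, y, w, h; };

    struct Track {
        int id;      // same-person sequence id
        Box region;  // last known face region, updated by the KLT tracker
    };

    static bool overlaps(const Box& a, const Box& b) {
        return a.x < b.x + b.w && b.x < a.x + a.w &&
               a.y < b.y + b.h && b.y < a.y + a.h;
    }

    // Called once per frame with the detector's output. Detections that
    // overlap an active track refine that track; the rest spawn new tracks.
    void mergeDetections(std::vector<Track>& tracks,
                         const std::vector<Box>& detections,
                         int& nextId) {
        for (const Box& det : detections) {
            bool matched = false;
            for (Track& t : tracks) {
                if (overlaps(t.region, det)) {
                    t.region = det;   // refine location with the detected region
                    matched = true;
                    break;
                }
            }
            if (!matched && tracks.size() < 5)   // tracker handles up to 5 faces
                tracks.push_back(Track{nextId++, det});
        }
    }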
With this algorithm, we make the spatio-temporal assumption that each sequence of tracked faces belongs to the same individual. Of course, this can sometimes be wrong, especially when multiple people are tracked simultaneously. We have observed two failure modes: sequences that contain face images of two people, and sequences whose images consist mostly or entirely of background. The first failure happened on multiple occasions when a parent was holding a child; in these cases, the proximity of their faces often confuses the tracker.
4.3.2 Color Segment Tracking
In order to enhance the robot’s person tracking capabilities, we augmented the face
detector by tracking the color segment found inside the detected face frame. This
can be handy when the person's face is rotated so much that neither the face detector nor the tracker can locate it.
The color segment tracker was developed using the CAMSHIFT (Continuously
Adaptive Mean Shift Algorithm) approach [10], obtained from the OpenCV Library
[82]. The tracker is initialized by the face detector and follows a similar tracking
algorithm as described in figure 4-2. This tracker can only track one segment at a
time, however. We also implemented an additional module to check for cases when the
tracker is lost, which tends to cause the CAMSHIFT algorithm to fixate on background
regions. This module performs a simple check that the color histogram intersection
of the initial and tracked region is large enough [93].
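A small C++ sketch of this lost-tracker check is given below; the histogram intersection is the standard Swain and Ballard measure, and the 0.5 cutoff is an illustrative assumption rather than the tuned value used on the robot.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Compare the normalized color histogram captured when the tracker was
    // initialized from a detected face against that of the tracked region.
    double histogramIntersection(const std::vector<double>& a,
                                 const std::vector<double>& b) {
        double sum = 0.0;
        for (std::size_t i = 0; i < a.size() && i < b.size(); ++i)
            sum += std::min(a[i], b[i]);   // intersection of normalized histograms
        return sum;                        // 1.0 for identical histograms
    }

    bool trackerStillOnTarget(const std::vector<double>& initialHist,
                              const std::vector<double>& trackedHist) {
        const double kMinIntersection = 0.5;  // assumed cutoff
        return histogramIntersection(initialHist, trackedHist) >= kMinIntersection;
    }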
4.3.3 Motion Detection
Since the robot’s cameras are moving, background differencing is not sufficient for
detecting motion. Thus, we have implemented an enhanced version of the motion
detector. Using the same approach for detecting motion with active cameras [33], we
use the KLT-based tracking approach to estimate displacement of background pixels
due to robot motion at each frame [87]. Object motion is then detected by looking
for an image region whose pixel displacements exhibit a high variance from the rest of
the image.
In order to complement the motion detector, we implemented a simple and fast
color-histogram based distance estimator to detect objects that are very close to the
robot. We simply divide the image into four vertical regions and compute color
histograms for each region on both cameras. We then calculate the histogram intersection between corresponding regions of the two cameras [93]. This method, though simple and sparse, is at times effective in detecting objects that are very close to the robot.
This detection is used to allow the robot to back up and protect itself from proximate
objects.
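The sketch below illustrates this proximity check in C++; the 4x4x4 color binning and the 0.6 intersection cutoff are illustrative assumptions, and the flat image representation stands in for whatever containers the robot's vision code used.

    #include <algorithm>
    #include <array>
    #include <cstdint>
    #include <vector>

    // Each camera image is split into four vertical strips; a coarse RGB
    // histogram is computed per strip and compared across the two cameras.
    // A nearby object produces a large stereo disparity, so the corresponding
    // strips look different and the intersection drops.
    struct Rgb { std::uint8_t r, g, b; };
    using Image = std::vector<std::vector<Rgb>>;       // image[row][col]
    using Histogram = std::array<double, 4 * 4 * 4>;   // 4 bins per channel

    Histogram stripHistogram(const Image& img, int strip, int numStrips = 4) {
        Histogram h{};
        if (img.empty()) return h;
        const int cols = static_cast<int>(img[0].size());
        const int x0 = strip * cols / numStrips, x1 = (strip + 1) * cols / numStrips;
        double count = 0.0;
        for (const auto& row : img)
            for (int x = x0; x < x1; ++x) {
                const Rgb& p = row[x];
                ++h[(p.r / 64) * 16 + (p.g / 64) * 4 + (p.b / 64)];
                ++count;
            }
        if (count > 0.0)
            for (double& bin : h) bin /= count;         // normalize
        return h;
    }

    // Returns true if any strip suggests a very close object.
    bool objectVeryClose(const Image& left, const Image& right) {
        const double kMinIntersection = 0.6;            // assumed cutoff
        for (int s = 0; s < 4; ++s) {
            Histogram hl = stripHistogram(left, s), hr = stripHistogram(right, s);
            double inter = 0.0;
            for (std::size_t i = 0; i < hl.size(); ++i) inter += std::min(hl[i], hr[i]);
            if (inter < kMinIntersection) return true;
        }
        return false;
    }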
4.3.4 Auditory Perception
The robot’s auditory system consists of a sound detection and localization module,
some low-level sound processing, and word recognition. An energy-based sound detection module determines the presence of sound events above a threshold. This
threshold value was initially empirically determined, but we quickly found that this
did not work well outside the laboratory. The threshold is therefore set adaptively using a simple mechanism described in section 4.5. Lastly, the robot inhibits its sound detection
module when it is speaking.
In an early experiment, we observed that the robot’s limited visual field of view
really limits its capability to find people using only visual cues. The microphone has a built-in sound localizer and displays the horizontal direction of the sound source
using five indicator LEDs. Thus, we now tap into these LEDs on the microphone to
obtain the sound source direction. The presence and location of each sound event
are immediately sent to the robot’s attention system, allowing the robot to attend to
stimuli from a much larger spatial range.
In parallel to this, a separate module also processes each sound input for further
speech processing. The robot’s speech recognition system was implemented using
CMU Sphinx 2 [65]. This module uses a fixed energy threshold to segment and
record sound events, because we would like to record as many sound segments as
possible for evaluation purposes.
Each recorded segment is processed redundantly for both phoneme and word
recognition, as well as for pitch periodicity detection. Firstly, the pitch periodicity detection is used to extract voiced frames and filter phoneme sequences from
noise-related errors. The filtered phoneme sequence output is then further filtered
by using TSYLB, a syllabification tool to rule out subsequences that are unlikely to
occur in the English language [31]. Lastly, the final phoneme sequence is used to filter
the hypothesized word list. The robot’s behavior system then utilizes this final list
to produce speech behaviors, as described in section 4.6.
4.4 Multi-modal Attention System
The robot’s attention system consists of an attentional map. This attentional map is
a 2D rectangle, which is an approximated projection of the front half of the geodesic
sphere centered at the robot's origin (a simplified version of the Sensory Ego-Sphere
implementation [41]).
The attention system receives multi-modal events from the robot’s perceptual
systems (face, color segment, motion, and sound). The retinal location of each perceptual event is projected onto the attentional map’s world coordinates using the
robot’s forward kinematic model. Based on this coordinate mapping, each perceptual
event is placed on a region inside the attentional map, by altering the saliency level
in the corresponding region. The location with the highest saliency value in the map
becomes the next target for the robot to attend to.
Figure 4-3 shows an example of an input image and its corresponding attentional
map. In this example, the robot is oriented slightly to the left. Thus, the input
image occupies the left region of the robot’s attentional map. The small white patch
represents the detected face. The long ellipse-shaped patch corresponds to the detected sound, since the robot’s sound localization module only provides a horizontal
direction.
Figure 4-3: An example of an input image and the corresponding attentional map.
The attention map is an approximated projection of the front half of the geodesic sphere centered at the robot's origin. The small white patch represents the detected
face. The long ellipse-shaped patch corresponds to the detected sound, since the
robot’s sound localization module only provides a horizontal direction.
4.4.1 Implementation Details
The attentional map consists of 280x210 pixels, indexed by the azimuth and elevation
angles of the robot’s gaze direction which is generated by actuations of the eyes, head,
and neck.
Each pixel at location x, y in the attentional map contains a set of cells Cx,y,n , 0 ≤
n ≤ 20.
Each cell Cx,y,n consists of the following variables.
• Feature Type Px,y,n . The attention system receives four types of perceptual
events. Px,y,n ∈ [face, color segment, motion, sound].
• Input State Stx,y,n
• Saliency Value Vx,y,n
• Saliency Growth Rate Rgx,y,n
• Saliency Decay Rate Rdx,y,n
• Start Time T 0x,y,n
• Last Update Time T lx,y,n
• Current ID CurrIDx,y,n .
• Last ID LastIDx,y,n .
• Inside Field of View F ovx,y,n .
The ID variables indicate when an input belongs to the same object. This information is determined differently for each perceptual type. Face IDs are provided
by the same-person face tracker. Color Segment IDs are provided by the color segment
tracker. Motion and Sound IDs are determined temporally, i.e. events during a brief
continuous period of sound and motion are tagged with the same ID.
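The per-cell state above can be summarized as a small data structure. The following C++ sketch paraphrases the variable list; the feature types and the 21-cell bound follow the text, while everything else about the layout is an assumption made for illustration.

    #include <array>

    // One entry of the attentional map, following the variable list above.
    enum class FeatureType { Face, ColorSegment, Motion, Sound };

    struct Cell {
        FeatureType type;       // P: which percept activated this cell
        bool   inputState;      // St: 1 while the percept is present this time step
        double saliency;        // V: current saliency value
        double growthRate;      // Rg: how quickly saliency rises
        double decayRate;       // Rd: how quickly saliency falls
        double startTime;       // T0: reference time of the saliency function
        double lastUpdateTime;  // Tl: last time this cell received input
        int    currentId;       // CurrID: id of the object currently feeding the cell
        int    lastId;          // LastID: id seen on the previous activation
        bool   insideFov;       // Fov: whether the location is currently visible
    };

    // Each attentional-map pixel holds up to 21 cells (0 <= n <= 20).
    struct MapPixel {
        std::array<Cell, 21> cells;
        int activeCount = 0;
    };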
We begin by describing how the presence of a perceptual event activates a set
of cells in the corresponding region. This activation consequently alters the saliency
value Vx,y,n in each cell Cx,y,n over time. We then illustrate how to combine the
saliency values from all active cells to produce a saliency map. Lastly, we describe
how the resulting saliency map is used to produce the final attention output of the
robot’s next target.
We first describe how the presence of each perceptual event E activates a set of
new cells and affects the Input State variable Stx,y,n in each cell Cx,y,n .
At each time step, before the attention system processes any incoming perceptual
events, the variable Stx,y,n is reset to 0 for all cells in the attentional map.
FOR each perceptual event E of type p and ID i
    Assign a region T for E in the attentional map by converting its location from the retinal to the world coordinates
    FOR each pixel Px,y, x, y ∈ T
        Activate a new cell Cx,y,n in Px,y
        Set Stx,y,n = 1
        Set Px,y,n = p
        Set Rgx,y,n = AdaptRgx,y,p, which is incrementally set by the spatio-temporal learning system for type p
        Set Rdx,y,n = AdaptRdx,y,p, which is incrementally set by the spatio-temporal learning system for type p
        Set LastIDx,y,n = CurrIDx,y,n
        Set CurrIDx,y,n = i
        IF x, y ∈ the robot's current field of view
            THEN Set Fovx,y,n = 1
            ELSE Set Fovx,y,n = 0
        ENDIF
    ENDFOR
ENDFOR
We now define how changes in the Input State variable Stx,y,n affect the T0 and Tl variables, which consequently alter the Saliency Value Vx,y,n in each active cell.
We describe the latter alteration of the Saliency Value in the next part.
FOR each cell Cx,y,n at time t
    IF it is active, Stx,y,n == 1
        THEN Set Tlx,y,n = t
        IF it is a new object, CurrIDx,y,n ≠ LastIDx,y,n
            THEN Set T0x,y,n = t + sqrt(−2 · Rgx,y,n² · log(Minit/Mmax)), Minit = 100, Mmax = 200
        ENDIF
    ENDIF
    IF it has not been updated for some time, t − Tlx,y,n > 10 msec
        THEN Set Rdx,y,n = 0.2
    ENDIF
ENDFOR
We now describe how the Saliency Value Vx,y,n is updated at each time step based
on the rest of the variables stored in each cell Cx,y,n . Figure 4-4 illustrates how the
Saliency Value changes over time for varying values of Saliency Growth Rate (Rgx,y,n )
and Decay Rate (Rdx,y,n ). The Saliency Value initially increases using the Growth
Rate until it reaches a peak value and starts decreasing using the Decay Rate.
FOR each active cell Cx,y,n (Stx,y,n == 1) at time t
    IF Px,y,n ≠ face OR Fovx,y,n == 1
        THEN Set tpeak = T0 + sqrt(−2 · Rg² · log(Minit/Mmax))
        IF t < tpeak
            THEN Vx,y,n = Mmax · exp(−(t − T0)² / (2 · Rgx,y,n²)), Mmax = 200
            ELSE Vx,y,n = Mmax · exp(−(t − T0)² / (2 · Rdx,y,n²))
        ENDIF
    ENDIF
ENDFOR
Note that if Fovx,y,n ≠ 1 for face inputs, i.e. the face is located outside the robot's
field of view, the saliency value does not change to provide short-term spatial memory.
We now combine the Saliency Value Vx,y,n from all active cells to produce a saliency
map S.
At each location x, y,
Sx,y = Σ_{n=0}^{N} Vx,y,n                    (4.1)
Lastly, we describe how the resulting saliency map S is used to produce the final
attention output O of the robot's next target: O is the coordinate (x, y) = argmax_{x,y} Sx,y.
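A compact C++ sketch of the saliency update and target selection follows the pseudocode above; the constants Minit = 100 and Mmax = 200 come from the text, while the data layout and time units are assumptions made for illustration.

    #include <cmath>
    #include <cstddef>
    #include <utility>
    #include <vector>

    // Each active cell contributes a Gaussian-shaped saliency value that grows
    // with rate Rg until its peak and then decays with rate Rd; the map sums
    // the contributions per location and the target is the argmax.
    struct ActiveCell {
        double t0;            // T0: center of the saliency function
        double growthRate;    // Rg
        double decayRate;     // Rd
        int x, y;             // location in the attentional map
    };

    double saliency(const ActiveCell& c, double t) {
        const double Minit = 100.0, Mmax = 200.0;
        const double tPeak = c.t0 + std::sqrt(-2.0 * c.growthRate * c.growthRate *
                                              std::log(Minit / Mmax));
        const double rate = (t < tPeak) ? c.growthRate : c.decayRate;
        const double d = t - c.t0;
        return Mmax * std::exp(-(d * d) / (2.0 * rate * rate));
    }

    // Sums saliency over all active cells and returns the most salient location.
    std::pair<int, int> selectTarget(const std::vector<ActiveCell>& cells,
                                     double t, int width = 280, int height = 210) {
        std::vector<double> map(static_cast<std::size_t>(width) * height, 0.0);
        for (const ActiveCell& c : cells)
            map[static_cast<std::size_t>(c.y) * width + c.x] += saliency(c, t);

        std::size_t best = 0;
        for (std::size_t i = 1; i < map.size(); ++i)
            if (map[i] > map[best]) best = i;
        return {static_cast<int>(best % width), static_cast<int>(best / width)};
    }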
4.4.2 Saliency Growth and Decay Rates
Initially, both AdaptRgx,y,p and AdaptRdx,y,p are set to 30 for all locations x, y, and
feature types p. As the robot gains experience in the environment, the spatio-temporal
learning system incrementally updates both AdaptRgx,y,p and AdaptRdx,y,p .
Figure 4-4 illustrates the saliency function for varying values of saliency growth
rate (Rg ) and decay rate (Rd ). The idea is that if a face or color segment is detected
Figure 4-4: The attention’s system’s saliency function for varying growth and decay
rates. The idea is that if a face or color segment is detected and subsequently tracked,
its saliency value will initially grow and start decaying after a while. The saliency
growth rate determines how good a particular stimulus is in capturing the robot's
attention and the decay rate specifies how well it can maintain the robot’s attention.
and subsequently tracked, its saliency value will initially grow and start decaying after a while. The saliency growth rate determines how quickly a particular stimulus captures the robot's attention, and the decay rate specifies how well it can
maintain the robot’s attention. The time-varying saliency functions and interaction
among these functions for multiple sensory events generate a number of advantages.
Firstly, since each object has to be tracked for some time to achieve a higher saliency
value, the system is more robust against short-lived false positive detection errors.
It also deals better with false negative detection gaps. The combination of decay rates and the egocentric map provides some short-term memory capability, allowing the robot to remember objects even after they have moved outside its field of view. Moreover, the emergent interaction among various
saliency functions allows the attention system to integrate top-down and bottom-up
control and also to naturally alternate among multiple learning targets. Lastly, the
system architecture provides natural opportunities to detect various spatio-temporal
and multi-modal correlations in the sensory data. The incremental adaptation of the
saliency parameters based on these observed patterns allows the attention system
to be more sensitive to a set of previously encountered learning target types and
locations.
Figure 4-5: Two sample image sequences and the corresponding attentional map,
illustrating the attention system's output while interacting with two people simultaneously. On each attention map (left column), the two vertical lines represent the
robot’s current field of view. Two people were interacting with the robot. The blue
box superimposed on the image indicates detected faces. The red cross indicates the
current attention target.
4.4.3 An Example
Figure 4-5 shows two sample sequences of the attentional map output. On each
attention map (left column), the two vertical lines represent the robot’s current field
of view. Two people were interacting with the robot. The blue box superimposed
on the image indicates detected faces. The red cross indicates the current attention
target. Once a person’s face is detected, it is represented by a white blob in the
attentional map, with time-varying intensity level determined by the saliency function
described above. Thus, the blob often remains in the map even if the face is no longer
detected for some time, allowing the robot to still be aware of a person despite failure
in detecting his or her face. In the upper sequence, the female’s face was detected
only in frame 2, but was still present in the map in frames 3 through 6. Similarly, in
the lower sequence, the infant’s face was detected in frames 2 through 4 and remains
in the map for the rest of the frames. Moreover, as shown in both sequences, after
attending to the first person, the attention system switches to the second person after
some time due to the temporal interaction among each blob’s saliency function. In
both upper and lower sequences, this attention switch from the first person to the
second person occurs in frames 4 and 5, respectively.
4.5 Spatio-temporal Sensory Learning
The robot maintains a spatio-temporal learning map to correlate spatio-temporal
patterns in the robot’s sensory experience. Like the attention map, the system receives
multi-modal events from the robot’s perceptual systems (face, color segment, and
sound). Note that the color segment corresponds to the color inside the detected face
region. Additionally, it receives an input for each response window event, which is a
fixed-duration period following each speech event produced by the robot. This input
is used to detect when speech input occurs shortly after and thus is possibly a response
to the robot’s own speech. The spatiotemporal map’s task is to detect spatio-temporal
patterns of sensory input occurrence and correlate multi-modal inputs. The idea is
that if a person is indeed present in front of the robot, concurrent presence of face,
color, sound, and response inputs is more likely. Thus, the spatio-temporal map can use this information to raise or lower confidence in the output of the
perceptual system. Moreover, this information is also useful for biasing the attention
system to favor certain regions where a person was just recently present or where
people tend to appear.
4.5.1 Implementation Details
Figure 4-6 shows an illustration of the spatio-temporal learning map. This map is
spatially equivalent to the egocentric attention map and also represents an approximated projection of the front half of the geodesic sphere centered at the robot's origin, but at a lower resolution. It consists of a 2D rectangle with 70x52 pixels.
Each pixel spatially represents a corresponding 4x4-pixel region in the robot's
attentional map. Each map pixel Px,y at location x, y in the spatio-temporal learning
Figure 4-6: A diagram of the robot’s spatio-temporal learning map. It consists of a 2D
rectangle with 70x52 pixels. Each pixel spatially represents a 4x4 pixels corresponding
region in the robot’s attentional map. Each map pixel is a storage space, containing
four cells, one for each input type (face, color, sound, response), and a hebbian
network.
map contains four cells, Cx,y,p , one for each input type (face, color, sound, response)
and a Hebbian network Hx,y .
Each cell Cx,y,p has a number of states depending on its activity level Ax,y,p , as
follows.
• Initially, all cells are empty and Ax,y,p = −1.
• When a perceptual event of type P is present at time t and location L, a set of
cells Cx,y,P , x, y ∈ L become active and Ax,y,P is set to an initial magnitude of
M = 200.
• Over time, the magnitude of Ax,y,P decays according to the function A(t) = M · exp(−D (t − Tstart)), D = 0.3, where Tstart is the time when the last perceptual event was entered into Cx,y,P.
With this decay function, if a cell has not been activated for about 2 seconds, its activity level Ax,y,P will decay to 0 and the cell becomes dormant
until activated again. Figure 4-7 illustrates an example of the spatio-temporal
learning system’s activity function when a perceptual input is entered at time
t=3,5,7,10,18,20,22,30 ms.
Using this simple mechanism, the map can be used to record various spatio-temporal patterns in the sensory input. Active cells represent sensory events that are
currently present. Dormant cells provide spatial history of each perceptual type,
which allows the robot to learn where faces typically occur, etc.
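A small C++ sketch of this per-cell activity bookkeeping is given below; the decay constant and the time unit follow the reconstruction above and are therefore assumptions, as is the "below 1" definition of dormancy.

    #include <cmath>

    // One spatio-temporal learning cell's activity level. A new perceptual
    // event of the cell's type resets the activity to M = 200, after which it
    // decays exponentially; near zero the cell is considered dormant.
    struct LearningCell {
        double lastEventTime = -1.0;   // -1 marks an empty cell
        static constexpr double M = 200.0;
        static constexpr double D = 0.3;   // assumed decay constant / time unit

        void activate(double t) { lastEventTime = t; }

        double activity(double t) const {
            if (lastEventTime < 0.0) return -1.0;            // never activated
            return M * std::exp(-D * (t - lastEventTime));   // exponential decay
        }

        bool dormant(double t) const { return activity(t) < 1.0; }
    };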
In addition, each pixel Px,y contains a Hebbian network Hx,y to perform local
computation and capture temporal pattern of events occurring within the small region
that each pixel represents. This Hebbian network contains six nodes and eleven weights (W1-W11), as illustrated in figure 4-6. For example, W4 is strengthened when face and sound are
both present in this region, while W10 is strengthened when both face and sound are
present in this region shortly after the robot speaks.
The following are the processing steps of the hebbian learning process:
FOR each pixel Px,y
    IF Cx,y,P is active, i.e. Ax,y,P > 0
        THEN Activate the corresponding Hebbian node m for feature type P by setting the node's input Im = 1
    ENDIF
    FOR each Hebbian weight Wij, which connects node i to node j
        Update each weight: dWij = Y0 · Ij · (Ii − Ij · Wij), Y0 = 0.01, Ii = input of node i
    ENDFOR
ENDFOR
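The update rule can be expressed directly in C++ as below. The 6-node, 11-weight structure follows the text, but the thesis does not fully specify which node pairs each weight connects, so the edge list is left as data and only the update rule itself is faithful.

    #include <array>
    #include <cstddef>
    #include <utility>

    // Sketch of the per-pixel Hebbian update: dW_ij = Y0 * I_j * (I_i - I_j * W_ij).
    struct HebbianNetwork {
        static constexpr double Y0 = 0.01;           // learning rate from the text
        std::array<double, 11> weights{};            // W1..W11
        std::array<std::pair<int, int>, 11> edges;   // (i, j) node pairs per weight

        // inputs[k] is 1.0 if node k's feature is currently active in this pixel.
        void update(const std::array<double, 6>& inputs) {
            for (std::size_t w = 0; w < weights.size(); ++w) {
                const double Ii = inputs[edges[w].first];
                const double Ij = inputs[edges[w].second];
                weights[w] += Y0 * Ij * (Ii - Ij * weights[w]);
            }
        }
    };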
Combined, these simple local cells provide spatio-temporal informa-
tion about where and when things tend to co-occur. Figure 4-8 shows some examples
of two-dimensional maps constructed by the spatially combined Hebbian weights from
each cell. The spatio-temporal learning system currently utilizes these Hebbian maps
in two ways. Firstly, it relies on W11 in making decisions on when to correlate pairs
of face and voice samples. In particular, when it detects co-occurring face and voice
Figure 4-7: An example of the spatio-temporal learning system’s activity function
when an object is entered a t time t=3,5,7,10,18,20,22,30 ms.
Figure 4-8: Some samples of two-dimensional maps constructed by the eleven Hebbian weights (W1-W11), from left to right.
in a region, it will only correlate and store the pair if the value of W11 within the
corresponding region is larger than a predefined threshold. In other words, it will
only correlate the pair if co-occurrence of all input types has been high within this
region. Secondly, it also uses the W11 weight map to update the AdaptRgx,y,p and
AdaptRdx,y,p parameters in the attention system. Similarly, if co-occurrence of all
input types has been high within a certain region, this will bias the attention system
to favor this region over others.
Each cell in the map also performs a simple histogram calculation for the sound
energy values occurring in two cases: when a face is present and when it is not. These
histograms are then used to adaptively set the threshold for the sound detection
module. This adaptation step is necessary since we deal with large variations of background noise in different environments. The adaptive sound energy threshold
allows the robot to disregard background noise when attending to people, simply by
modelling sound energy values when faces are present.
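One plausible way to derive the adaptive threshold from the two histograms is sketched below in C++; the thesis states that the histograms drive the threshold but not the exact rule, so the percentile choice and bin sizes here are assumptions.

    #include <array>
    #include <cstddef>

    // Two energy histograms are accumulated: one for frames where a face is
    // present and one for frames where it is not. The threshold is then set
    // just above the bulk of the face-absent (background-noise) distribution.
    struct SoundThreshold {
        static constexpr std::size_t kBins = 64;
        std::array<double, kBins> facePresent{};
        std::array<double, kBins> faceAbsent{};
        double binWidth = 10.0;                       // assumed energy units per bin

        void addSample(double energy, bool faceIsPresent) {
            std::size_t bin = static_cast<std::size_t>(energy / binWidth);
            if (bin >= kBins) bin = kBins - 1;
            (faceIsPresent ? facePresent : faceAbsent)[bin] += 1.0;
        }

        // Threshold at the 95th percentile of the background-noise histogram.
        double threshold() const {
            double total = 0.0;
            for (double c : faceAbsent) total += c;
            if (total == 0.0) return binWidth;        // fall back to a low default
            double cumulative = 0.0;
            for (std::size_t b = 0; b < kBins; ++b) {
                cumulative += faceAbsent[b];
                if (cumulative >= 0.95 * total) return (b + 1) * binWidth;
            }
            return kBins * binWidth;
        }
    };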
4.6 Behavior-Based Control Architecture
We have so far covered the robot’s physical actuators and sensors, low level motor
controller, and perceptual systems. The behavior system integrates these lower level
components into overall coherent actions and behaviors that are relevant to the robot’s
current perceptual inputs. The overall goal of the control architecture is to control
the robot’s high level behavior such that the robot is able to determine potential
learning targets in the world while engaging in social interaction with human mentors.
This has to be done in a timely manner as a human’s social behavior is a very
complex mechanism and humans are very well tuned to expect a certain degree of
expressiveness and responsiveness. At any given time, the high level controller must
in real time assess the current multi-modal perceptual state for presence of human
social cues and potential learning targets. Simultaneously, it must also determine the
most relevant behavioral response for the robot.
Figure 4-9 illustrates MERTZ’s final behavior-based controller for finding, interacting with, and learning to recognize people. The robot’s behavior system was implemented in L/MARS [19]. The controller has a number of behavior modules which
communicate with each other using inhibiting and suppressing signals, as defined in
the subsumption architecture [18].
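The inhibition wiring can be sketched generically as below. This is an illustration of subsumption-style arbitration in C++ between two of the modules named in the text, not the actual L/MARS implementation; the command values are placeholders.

    #include <optional>

    // attend inhibits the output of explore when the attention system has a target.
    struct EyeCommand { double pan, tilt; };

    struct RandomExplore {
        EyeCommand output() const { return {0.1, -0.05}; }  // placeholder wander command
    };

    struct Attend {
        std::optional<EyeCommand> target;   // set when the attention map has a winner
    };

    // The higher-level module inhibits the lower-level one when it has a target.
    EyeCommand arbitrateEyes(const Attend& attend, const RandomExplore& explore) {
        if (attend.target)          // inhibition: attend's output replaces explore's
            return *attend.target;
        return explore.output();    // otherwise the robot keeps wandering
    }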
Down Time At the lowest level, module random-explore simply generates random
motion commands to module explore which sends the commands to the robot’s eyes.
This allows the robot to randomly explore its environment when no one is around,
which is likely to happen in a long-term experiment.
Attending to Target The attend module consists of the multi-modal attention
system described in section 4.4. It receives input from both visual and audio processing, which provide detection and tracking of faces, color segments, motion, and
sound. Whenever the attention system decides that a potential learning target is
present, it sends the target coordinates to the robot’s eyes by inhibiting the output
of module explore. The eyes are actuated to simply minimize the error between the
Figure 4-9: MERTZ’s behavior-based controller for finding, interacting with, and
learning to recognize people. The robot orients to and track salient visual targets,
receives audio input, and tries to mimic what people say. The behavior system was
implemented in MARS, a programming language specifically designed to implement
behavior based programs in the incremental and concurrent spirit of the subsumption
architecture.
target location and the center of the robot’s field of view. This feedback control is
done using only one of the robot’s eyes, i.e., the right one.
Natural Human-Like Motion Module VOR monitors the eyes and head velocity
and at times sends commands to the robot’s eyes to compensate for the robot’s moving
head, inhibiting module explore. In parallel, module posture monitors the position
of the robot’s eyes and generates postural commands to the head and neck such
that the robot is always facing the target. Lastly, module lipsynch receives phoneme
sequences that the robot is currently uttering and commands the robot’s lips to move
accordingly.
Emotional Model Module emotion contains the robot’s emotional model, implemented based on various past approaches for computational modelling of emotion
[97, 12, 72]. For Mertz, the emotional model serves as an important element of the
social interface. Research in believable agents suggests that the ability to generate
appropriately timed and clearly expressed emotions is crucial to the agent’s believable quality [57, 4]. Mertz’s emotional model consists of two parameters: arousal and
valence. In the absence of emotionally salient input, these variables simply decrease or increase over time toward a neutral state. We have predefined faces, motion, and
proximate objects to trigger an increase of arousal. Medium-size faces are defined as
positive stimuli while large motion and proximate objects have negative effects on
the valence variable.
The arousal and valence variables are then mapped to a small set of facial expressions, formed by the four degrees of freedom on the robot's eyebrows and lips.
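A toy C++ sketch of the arousal/valence dynamics just described is given below; the decay rate and the per-stimulus increments are illustrative assumptions, not the robot's tuned parameters.

    #include <algorithm>

    // Two-parameter emotional model: arousal and valence drift back toward
    // neutral (0) and are pushed by predefined stimuli.
    struct EmotionModel {
        double arousal = 0.0;   // clamped to [-1, 1]
        double valence = 0.0;   // clamped to [-1, 1]

        static double clamp(double v) { return std::max(-1.0, std::min(1.0, v)); }

        void decayTowardNeutral(double dt, double rate = 0.1) {
            arousal -= rate * dt * arousal;
            valence -= rate * dt * valence;
        }

        void onMediumFace()      { arousal = clamp(arousal + 0.2); valence = clamp(valence + 0.2); }
        void onLargeMotion()     { arousal = clamp(arousal + 0.3); valence = clamp(valence - 0.2); }
        void onProximateObject() { arousal = clamp(arousal + 0.3); valence = clamp(valence - 0.3); }
    };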
Inviting and Regulating Interaction Module maintain-distance at times inhibits module posture to move the robot’s neck to maintain a comfortable distance
with a target, estimated using the proximate object detection and the relative changes
in salient target’s size.
Module speak receives input from module vision and audio proc. When the word
recognition system successfully segments two words from the speech input, it sends a
command to speak to mimic these words. When the word recognition system
is overwhelmed by a long speech input, it commands speak to produce a randomly
selected sentence from a predefined set to request that people speak in fewer words.
The visual system also sends a predefined sentence to speak when it encounters a
number of situations. Table 4.1 lists these situations and the corresponding predefined
set of sentences.
4.7 Automatic Data Storage and Processing
The robot collects and stores each face image as segmented by the same-person
tracker, except for those whose area is less than 2500 pixels. These face images are
organized as sequences. Each sequence contains the result of a continuous tracking
session and is therefore assumed to contain face images of one individual. The robot
Table 4.1: A set of predefined sentences for regulating interaction with people.

Condition                                                          Predefined Sentences
1  Segmented more than five words                                  Please say one or two words. I don't understand.
                                                                   Are you speaking to me? Too many words.
                                                                   What did you say? Can you repeat it please?
2  Segmented face area is less than 2500 pixels                    Please come closer. I cannot see you.
                                                                   You are too far away.
3  The last 3 tracked sequences each contain less than 20 images   Please do not move too much. Please look at my face.
                                                                   Please face me directly.
4  The spatiotemporal system detects a face region                 Please speak to me. Teach me some words, please.
   but no sound input                                              Please teach me some colors. Please teach me some numbers.
then eliminates all face sequences which contain fewer than 8 images and performs
a histogram equalization procedure on all remaining sequences. All face sequences
are taken as automatically produced by the robot without any manual processing or
filtering.
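The following C++ sketch illustrates this automatic filtering step (drop sequences with fewer than 8 images, then histogram-equalize each remaining grayscale face image); the flat byte-buffer image representation is an assumption made for the sketch, not the robot's actual data format.

    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <utility>
    #include <vector>

    using GrayImage = std::vector<std::uint8_t>;
    using FaceSequence = std::vector<GrayImage>;

    void equalizeHistogram(GrayImage& img) {
        if (img.empty()) return;
        std::array<std::size_t, 256> hist{};
        for (std::uint8_t p : img) ++hist[p];

        // Build the cumulative distribution and remap each intensity.
        std::array<std::uint8_t, 256> lut{};
        std::size_t cumulative = 0;
        for (int v = 0; v < 256; ++v) {
            cumulative += hist[v];
            lut[v] = static_cast<std::uint8_t>(255.0 * cumulative / img.size());
        }
        for (std::uint8_t& p : img) p = lut[p];
    }

    std::vector<FaceSequence> filterAndEqualize(std::vector<FaceSequence> sequences) {
        std::vector<FaceSequence> kept;
        for (FaceSequence& seq : sequences) {
            if (seq.size() < 8) continue;          // too short to be useful
            for (GrayImage& img : seq) equalizeHistogram(img);
            kept.push_back(std::move(seq));
        }
        return kept;
    }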
The robot assigns a unique index for each sequence and keeps track of the last
index at the end of each day. Losing data by overwriting it, because indices commonly restart at zero upon startup, was one of the many mundane failures we simply overlooked during the project.
As mentioned above, the robot's spatio-temporal learning system utilizes the simple Hebbian network within each local cell to make decisions about correlating co-occurring face and sound samples. The robot automatically stores these pair correlations, retrieves the relevant sound samples, and places them along with the correlated face
sequence.
The robot then automatically processes all of the final set of face sequences and
correlated voice samples to extract various features for further recognition purposes.
One computer is assigned solely for this processing so that the robot can run these
computationally intensive programs online without interfering with the real-time online behavior. However, at the end of each experiment day, the operator has to pause
this data processing to move the robot back into the laboratory. The operator then
manually resumes the processing after the move.
4.8 Experimental Results
In this experiment, we evaluate the robot’s perceptual, attention, and spatio-temporal
learning systems, with respect to the target goal of collecting face and voice data from
each individual through spontaneous interaction. The most important task here is to
collect face sequences from each person. The more images there are in each sequence
and the larger each image is, the better the sequence captures the visual information of each person. We also report on the collected training data, which are then used by
the robot’s incremental face recognition system, as described in the next chapter. We
analyze the accuracy and other relevant characteristics of the face sequences collected
from each person.
4.8.1 Setup
We conducted the experiment over eight days. The robot was placed in x different
locations (where?) for 2-7 hours each day. The exact schedule is shown in table 4.2.
Like the earlier experiments, the robot was set up and people were free to approach
the robot whenever they wanted. A written poster and sign were placed on the robot
platform to introduce the robot and explain the experiment (see ??).
Throughout this project, we have seen many interesting aspects and issues associated with carrying out experiments in public spaces. In addition to the valuable
lessons that we presented in section 3.5, it is a great opportunity for the robot to engage in natural and spontaneous interaction with a large number of naive passersby
and also collect a huge amount of data. However, the downside is that there is no guarantee of repeatability in the individuals that the robot encountered. This would severely
Table 4.2: Experiment Schedule

Experiment   Date     Time
1            Nov 20   1-7 pm
2            Nov 21   4-7 pm
3            Nov 22   3-6 pm
4            Nov 27   1-5 pm
5            Nov 29   1-7 pm
6            Nov 30   12-7 pm
7            Dec 1    2-7 pm
8            Dec 4    2-4 pm
limit our evaluation of the robot’s incremental individual recognition capabilities. We
thus decided on a compromise, where we recruited fourteen voluntary subjects and
requested that they come to interact with the robot on multiple days throughout the
experiment. In order to minimize control, we did not impose any rules or restriction
on these subjects. We simply announced where and when the robot would be running on each day of the experiment. Unfortunately, due to the lack of control and
instructions, most of the voluntary subjects came to interact with the robot once and
only for a very short time. The detailed recruitment protocol and written instructions
provided to the experiment subjects are attached in section ??.
As mentioned above, we were not able to record the experiment externally due to
some limitations involving human subject experiments. This would require a more
stringent protocol, such as written permission forms, which would alter the nature
of the experiment in some undesirable ways. Thus, we can only provide quantitative data based on the robot’s camera and microphone input. The camera’s limited
field of view unfortunately severely constrains the range of events that we can capture. For example, we cannot report on the number of people who approached the
robot but failed to attract the robot’s attention. Thus, we complement these data
whenever appropriate with some qualitative results through visual observation of the
experiment. Though subjective, we believe that these qualitative observations yield
some interesting insights not captured by the cameras.
4.8.2 Finding and Interacting With People
How many people did the robot find? Table 4.3 illustrates the number of face
sequences produced by the robot’s same-person tracking module during each experiment day. Each face sequence contains a set of face images produced by a continuous
tracking session of one person. The robot collected a total of 4250 sequences and
175168 images during the entire experiment.
Figure 4-10 shows some of these sample face sequences. Each sequence contains a
set of face images, which are assumed to belong to the same person. As we can see in
this figure, this is of course not always the case. Some sequences contain background,
badly segmented faces, and in a few cases even faces of another person. Thus, we
need to further analyze these face sequences for detection and segmentation accuracy.
For this purpose, we manually labelled each sequence produced on day 6, the
longest experiment day. The 863 sequences produced on day 6 come from at least 214
people. From these 863 sequences, we could not label 58 sequences due to background
inclusion, segmentation error, and bad lighting. Thus, for day 6, 93% of the collected
sequences contain correctly detected face images. Some statistics on the number
of sequences and images that belong to each individual are shown in table 4.4 and
figure 4-11.
Based on these numbers, we can infer that the robot indeed detected and tracked
a large number of people. Most people generated fewer than 200 images. A number
of people interacted with the robot much longer and thus generated up to 3177 images. These collected face sequences are later used as training data for the robot’s
incremental individual recognition system, except for some that the robot excluded
automatically. We will further analyze these face training data for detection and
segmentation accuracy in more detail in the next section. Manual labelling of these
training data indicates that the robot found and tracked at least 525 people during
the entire experiment.
How long was the interaction and how many people at a time? Figure 4-12 shows the number of people that the robot interacted with during the seven hours on
Figure 4-10: Some sample face sequences produced by the same-person tracking module. Each sequence contains a set of face images, which are assumed to belong to the
same person. As we can see in this figure, this is of course not always the case. Some
sequences contain background, occluded or badly segmented faces, and in a few cases
even faces of another person.
Figure 4-11: The distribution of the number of sequences and images from 214 people
on day 6. On the left is the histogram of the number of sequences and on the right is
the histogram of the number of images.
Table 4.3: The Face Tracking Sequence Output

Experiment   # face seq   # images   # image/seq (min / max / ave / std)
1            941          31667      1 / 534 / 33.7 / 48.1
2            268          12164      1 / 421 / 45.4 / 59.2
3            113          10023      1 / 926 / 88.7 / 138.9
4            459          22289      1 / 521 / 48.6 / 65.7
5            832          27467      1 / 458 / 33.0 / 48.1
6            863          27824      1 / 509 / 32.2 / 53.0
7            633          39622      1 / 487 / 62.6 / 77.6
8            141          4112       1 / 305 / 29.2 / 47.2
Total        4250         175168
Table 4.4: The Same-Person Face Tracking Output on Day 6

Experiment   # face seq   # people
6            863          214

                  min   max    ave     std
# seq/person      1     71     4.01    6.9
# image/person    1     3177   128.2   283.6
Figure 4-12: The number of people that the robot interacted with during the seven
hours on day 6. This plot was calculated based on the manual labels of each sequence
and the assumption that each person was still present for 15 seconds after the end of
their face sequence.
day 6, according to the recorded face sequences and their timing information. This
plot was calculated based on the manual labels of each sequence and the assumption
that each person was still present for 20 seconds after the end of their face sequence.
When actively interacting, the robot interacted with more than one person concurrently
for roughly 16% of the time. Table 4.5 shows the detailed breakdown of the duration
of interaction segments with different numbers of people. A continuous interaction
segment is defined by the presence of a tracked face at every 10 ms interval. The
robot interacted with at least one person for 32% of the seven hours. The longest durations of a continuous session with zero, one, two, three, and four people present are 510.5, 211.3, 25.9, 8.1, and 3.5 seconds, respectively; the longest stretch of down time is thus 510.5 seconds.
We also calculated the duration of continuous interaction sessions for each of the
214 people based on the manual labels and the timing information of each recorded
face sequence. Figure 4-13 left shows the histogram of the sum of continuous session
duration for these 214 people. Of the 214 people, 97% have a total duration of less than
119 seconds. Three people have a total duration of between 119-237 seconds. The last
Figure 4-13: A set of histograms illustrating the distribution of session durations for
the 214 people who were identified through manual labelling of all face sequences
collected on day 6. Some bins of value 1 are annotated because of lack of visibility.
On the left is a histogram of the sum of continuous session duration for all 214 people.
On the right is the histogram of the total time span of appearance, measured from
the time of first and last recorded sequence. The lower histogram is a zoomed in
version of the first bin in the upper histogram.
Table 4.5: Duration of interaction sessions for different numbers of people on day 6

Num of people   % of 7 hours   # segments   max duration (0.1 s)   ave duration (0.1 s)
0               67.95%         504          5105                   336.7
1               26.84%         464          2113                   144.5
2               4.84%          151          259                    80
3               0.34%          23           81                     37.1
4               0.03%          3            35                     24.3
two people have a total duration of 532 and 886 seconds, respectively. Figure 4-13
right shows the histogram of the total time span of appearance, measured from the
time of first and last recorded sequence. This gives us some information for cases of
individuals who interacted with the robot on multiple occasions throughout the day.
Of the 214 people, 97.7% appeared within a total time span of less than 1244 seconds.
The lower right histogram shows a more detailed breakdown for this bin. The other
3 people have a larger total time span, ranging from 8081-24240 seconds.
From these numbers, we can infer that the robot was actively interacting with people for roughly one third of the time. For about 16% of these active moments, the robot interacted with multiple people concurrently. Moreover, most people stopped
by once during the day and interacted briefly with the robot. A few people stayed for
a longer session with the robot. A few people also stopped by on multiple occasions
during the day.
Did people verbally interact with the robot? As described in section 4.3.4, the
word recognition system uses a fixed energy threshold to detect the onset of sound
events. The threshold is set low enough that most sound events, including
background noise, are recorded for evaluation purposes. Table 4.6 shows the number
of recorded samples and the number of sound samples that pass the robot’s word
recognition filtering mechanism described in section 4.3.4. Note that sound data
recorded on day 1 were lost due to human error. The first count does not convey much information, as it includes almost every sound event; the second gives a better estimate
Table 4.6: The Voice Samples Output

Experiment   # total samples   # processed
2            1641              915
3            560               323
4            1032              834
5            3274              1041
6            5148              1520
7            3492              892
8            882               282
Total        16029             5807
of the amount of actual speech input that the robot received.
We manually annotated 346 sound samples which had passed the word recognition filtering system. Of the 346 samples, 26 contain only background noise and 27 contain only the robot's own voice. Thus, roughly 86% of the samples contain human speech. In an earlier experiment described in section 3.5.2,
we annotated all of the recorded sound files more extensively. We also differentiated
between robot directed speech and non-robot directed speech based on the speech
content. These annotations indicate that the robot indeed received a large amount
of human speech input during the experiment.
How well did the robot correlate face and voice data? In order to complement
the unsupervised face recognition system, the robot utilizes spatio-temporal context to
correlate pairs of face and voice sequences from the same individual to allow for multimodal recognition. Table 4.7 illustrates the number of face and voice pairs produced
by the robot’s spatio-temporal correlation procedure during each experiment day.
Note that a large part of the data from day 1 was unfortunately lost due to human
error.
Each pair of face and voice samples contains one face sequence and one or more
sound samples which were spatio-temporally correlated with the corresponding face
sequence. The robot collected a total of 201 pairs during the entire experiment. We
manually annotated each collected pair and report on the correlation and segmenta-
Table 4.7: The Paired Face and Voice Data

Experiment   # pairs   # voice samples   # face images
1            7         14                381
2            48        128               4944
3            17        37                3786
4            20        23                1717
5            47        52                5430
6            24        34                3457
7            32        49                4767
8            6         7                 296
tion accuracy in table 4.8 and figure 4-14.
Figure 4-14 consists of three plots. The top plot shows the proportion of sound
samples containing purely background noise, the robot's speech, or human speech.
The middle plot shows the percentage of sound samples in each face-voice pair which
correctly belong to an individual portrayed in the correlated face image. We cannot
be absolutely sure that the voice and the face actually match up since we don’t have
a video recording of the entire experiment as ground truth. However, we at least
verified that the gender of the face and the voice match. Of these correct voice samples, we analyze the segmentation accuracy of the content. The proportion of inclusion of the robot's voice and other people's voices inside these samples is shown in the lower plot. On the first four experiment days, the robot's auditory system did not inhibit its input when the robot was speaking. Thus, we can see that the proportion of the robot's voice is very high during those four days.
Given the amount of noise and dynamics in the environment, this face and voice
correlation task is very difficult for the robot. People tend to come in groups and
they all speak at the same time. Some speak to the robot and others speak to each
other in the background. These results show that the robot can perform this task
with a reasonable accuracy, but it has to be very selective in its decision making, as
determined by the robot’s spatio-temporal learning system. Thus, out of the 5807
sound samples that the robot collected, only 344 samples were correlated with a face
sequence.
Figure 4-14: Segmentation and Correlation Accuracy of the Paired Face and Voice
Data
Table 4.8: Segmentation and Correlation Accuracy of the Paired Face and Voice Data (%)

Exp   speech   robot's   bgnd   correct   clean   +robot   +1p    +mp   +robot+1p   +robot+mp
1     85.7     0         14.3   100       58.3    4.2      0      0     0           0
2     83.6     3.1       13.3   83.3      15      64.5     1.9    1.9   11.2        5.6
3     70.3     16.2      13.5   88.2      76.9    23.1     0      0     0           0
4     87       13        0      80        85      10       5      0     0           75
5     94.2     5.8       0      93.8      98      2        0      0     0           0
6     91.2     2.9       5.9    95.8      87.1    3.2      6.5    3.2   0           0
7     85.7     8.2       6.1    93.8      88.1    2.4      4.8    2.4   2.4         0
8     100      0         0      100       71.4    0        28.6   0     0           0
Figure 4-15: A set of histograms of the pixel area of faces collected during three
experiment days. The first two were taken from an earlier experiment described in
section 3.5. The third histogram was taken from day 1 of the final experiment. During
these three experiment days, the robot did not make any verbal requests.
Was the robot able to influence the input data? Figures 4-15 and 4-16 show
a set of histograms of the pixel area of the faces collected on different experiment
days. We will compare these histograms to show an increase of face sizes when the
robot actively makes verbal requests for people to come closer when their face area
is less than 2500 pixels.
Two histograms in the first figure, titled experiment A and B were taken from an
earlier experiment described in section 3.5. The third histogram titled experiment 1
was taken from day 1 of the final experiment. During these three experiment days,
the robot did not make any verbal requests. The second figure contains histograms
Figure 4-16: A set of histograms of the pixel area of faces collected during seven
experiment days. These histograms were taken from day 2-8 of the final experiment,
where the robot made verbal requests.
taken from day 2-8 of the final experiment, where the robot made verbal requests.
These results indicate that the occurrence of face sizes within the range of 5000-10000 pixels indeed increases when the robot makes verbal requests for people to
come closer.
What did the robot attend to? Figures 4-17 through 4-20 provide a few snapshots of the
output of the attention system during the experiment. Each figure contains a time
series of pairs of the robot’s camera input image and the corresponding attention map
output, ordered from left to right. Note that the attention map is an approximated
projection of the front half of the geodesic sphere centered at the robot's origin. As shown
previously in figure 4-3, the camera input image occupies only a subregion of the
attentional map. Due to the reduced resolution, the white lines which represent the
camera input image borders in the attention map are not always visible. As mentioned
above, white patches in the attention map represent various sensory inputs: face,
sound, motion, or color segments. A white cross in the attentional map represents
the current attention target, which is the region with highest saliency value.
The camera input image is sometimes superimposed with blue boxes, which represent the output of the face detector and tracker. The red cross in the camera input
image represents the same current attention target (shown as a white cross in the
attention map) when it happens to fall within the robot’s field of view.
Figure 4-17 shows a snapshot of the attention output while the robot maintains
a short-term memory for a person while interacting with another person. The robot
detected two people in the second frame. It then continued to interact with one
person, but maintained a short-term memory for the other person in the attentional
map while he was outside the robot’s field of view. After some time, the robot
habituated to the first person and switched attention to the second person.
Figure 4-18 shows a snapshot of the attention output while the robot switched
attention from one person to another using sound cues. The robot interacted with
a person while another person was speaking but not visible to the robot. After the
robot habituated to the first person, it switched attention to the other person.
Figure 4-17: A snapshot of the attention output while the robot maintained a shortterm memory for a person while interacting with another person.
Figure 4-18: A snapshot of the attention output while the robot switched attention
from one person to another using sound cues.
Figure 4-19: A snapshot of the attention output showing the robot interacting with
two people simultaneously.
Figure 4-19 shows a snapshot of the attention output while the robot interacted
with two people simultaneously. The robot first interacted with one of the two people
in front of it. At some point, the robot detected both of them and switched attention
to the other person after some time.
Figure 4-20 shows a snapshot of the attention output while the robot recovers its
tracking of a person because of the attention system’s spatio-temporal persistence.
The robot initially tracked a person, but was having difficulty because of false positive
detection and motion in the background. At some point, the robot lost track of the
person because she was too close to the robot. However, the attention system's spatio-temporal persistence allowed the robot to maintain its target on the person until the
face detector was able to find the person’s face again.
Figure 4-20: A snapshot of the attention output showing the robot recovering its tracking of a person because of the attention system's spatio-temporal persistence.
What did it learn? Figure 4-21 shows a sequence of maps consisting of the eleven
hebbian weights W1-W11 recorded on day 7 of the experiment, ordered from left to
right. By the end of the experiment, the different hebbian maps provide patterns for
when and where sensory inputs tend to occur. For example, the W7 map provides
the information that face, color segment, and sound tend to co-occur across the
horizontal region spanning the front half of the robot, but only across a smaller vertical range. Thus, a face detected near the ceiling is more likely to be a false positive. The W8 map
provides information that face, color segment, and response tend to co-occur around
the middle region in front of the robot. This corresponds to the likely areas for people
to verbally respond to the robot’s speech.
As described above, the sound detection module also detects the onset of sound
events using an adaptive energy threshold to minimize detection of and therefore
allocation of attention to background noise. As shown previously in figure 3-14, variation in the background noise level is particularly problematic when the robot is moved away from its familiar laboratory environment to a noisy public space. Figure 4-22 illustrates the adapted sound threshold values throughout days 6 and 7 of the experiment. The blue line corresponds to the sound energy threshold value over time. The
superimposed red line corresponds to the number of faces present over the same time
period. In the lower figure, the face occurrence is shown as a red dot because we do
not have information to determine the exact number of faces. These plots show that
the robot adaptively increases the sound energy threshold when people are present.
Adapting the sound threshold according to the energy level of each interaction session
protects the robot’s attention system from being overwhelmed by the high level of
background noise coming from all directions.
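A minimal sketch of such an adaptive onset threshold is given below. The estimator, time constant, and margin factor are placeholders chosen for illustration and are not the values used by the robot's sound detection module.

class AdaptiveOnsetDetector:
    # Sound-onset detection with an adaptive energy threshold: the background estimate
    # tracks the overall energy level of the current session, so a noisy public space
    # raises the threshold instead of flooding the attention system with onsets.
    def __init__(self, alpha=0.999, margin=2.0, initial_background=0.01):
        self.alpha = alpha              # slow adaptation of the background estimate
        self.margin = margin            # how far above background an onset must rise
        self.background = initial_background

    def update(self, frame_energy):
        # Returns True if this frame is reported as a sound onset.
        onset = frame_energy > self.margin * self.background
        self.background = self.alpha * self.background + (1 - self.alpha) * frame_energy
        return onset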
4.8.3
The Face Training Data
We have shown in the last section that the robot collected a large number of face
sequences from many different individuals. The robot automatically constructs the
training data set by compiling all of the collected face sequences and making some automatic exclusions, as described in section 4.7. Some samples of these face sequences
Figure 4-21: A sequence of maps consisting of the eleven Hebbian weights W1-W11 recorded on day 7 of the experiment, ordered from left to right.

Figure 4-22: Adaptive sound energy threshold value over time on days 6 and 7. The blue line corresponds to the sound energy threshold value over time. The superimposed red line corresponds to the number of faces present over the same time period. In the lower figure, the face occurrence is shown as a red dot because we do not have information to determine the exact number of faces.
Table 4.9: The distribution of the number of images per sequence and the number of
sequences and appearance days per individual
#images/seq      0-110   111-230   231-350   351-470   471-590   591-710   711-830   831-950
#sequences        1702       257        43        16         5         0         1         1

#seq/person        1-5      6-10     11-15     16-20     21-25     26-30    31-248
#individuals       448        41        11         4         2         1         1

#days/person         1         2         3         4         5         6         7         8
#individuals       492        15         1         0         1         0         1         -
are previously shown in figure 4-23.
After these automatic filtering steps, the robot’s final set of face training data
contains 2025 sequences. We manually labelled these sequences to obtain identification ground truth. If more than 50% of the sequence contains non-face images, the
sequence is labelled as unidentified. Sequences which contain two people are labelled
according to the individual who owns the majority of the face images. From these
2025 sequences, 289 are labelled as unidentified and the rest are identified as belonging to 510 individuals. Table 4.9 shows the distribution of the number of images per sequence, the number of sequences per individual, and the number of distinct appearance days of the few repeat visitors. The sequence length ranges from 9 to 926 images. The number of sequences per person is very uneven. The robot operator has the largest number of sequences and repeat visits. A total of 19 people have more than 10 sequences
in the database. Eighteen people interacted with the robot on more than one day
during the experiment. Unfortunately, given the minimal control of our recruitment
procedure, only a few volunteers actually came for multiple visits.
We also visually inspected 25% of the data set in three contiguous segments in
the order of its recording time and analyzed the amount of errors and variations of
the content. Figure 4-23 shows the proportion of segmentation error, occlusion, and
pose variations in this annotated set. From these 567 face sequences, 8.6% consists
entirely of non-face images, 28% contain at least one badly segmented face image and
Figure 4-23: Analysis of errors and variations in the face sequence training data that the robot automatically collected during the experiment.
10% contain at least one non-face or background image. These three error categories
are mutually exclusive. A face image is considered to be badly segmented if less than two thirds of the face is included or if it fills up less than half of the image. Roughly 0.9% of the sequences contain face images of two people. No sequence in the database contains more than two people. In order to estimate the amount of noise in these sequences, we measured that 32.6% of the sequences contain large face pose variations of more than 30 degrees of in-plane or out-of-plane rotation. Lastly, about 14.5% of the
sequences contain occluded face images. This occlusion is most frequently caused by
the robot’s eyelids which occasionally come down and obstruct the robot’s field of
view.
These results indicate that the robot collected a large face training data set, containing
a large number of images from many individuals. There is a lot of variation within
each face sequence, which is valuable for encoding maximal information about a
person’s facial appearance, but challenging for recognition. As expected, there is a
high level of noise in the data, induced by both tracking error and the dynamic setting
the robot operates in.
Chapter 5
Unsupervised Incremental Face
Recognition
In the last chapter, we have presented the implementation and experimental results
of the robot’s perceptual, attention, and behavior systems. The integration of these
subsystems allows the robot to automatically instigate, segment and collect a large
amount of face and voice data from people during natural spontaneous interaction in
uncontrolled public environments. In this chapter, we describe how these data are
processed by the robot’s face recognition system to incrementally learn to recognize
a set of familiar individuals.
An incremental and unsupervised face recognition framework essentially involves
two main tasks. The first task is face clustering. Given that the system starts with
an empty training set, it has to initially collect face data and form clusters of these
unlabelled faces in an unsupervised way. The system then incrementally builds a
labelled training set using these self-generated clusters. At this point, the system is
ready for the second task of supervised face recognition based on the incrementally
constructed labelled training set.
We begin by discussing a set of challenges and failure modes in realizing these
tasks. We then present the implementation of the first task: unsupervised face clustering. We analyze the face clustering performance across a range of algorithmic
and data-dependent parameters. We also apply the clustering algorithm to the Honda-UCSD video face database [51, 50].
Lastly, we present an integrated system which incorporates the face clustering
solution to implement an incremental and fully unsupervised face recognition system. We evaluate this integrated system using the robot’s collected face sequences
to analyze both the incremental recognition performance and the accuracy of the
self-generated clusters.
5.1
Challenges
Automatic face recognition is a very useful but challenging task. While a lot of
advances have been made, the supervised face recognition problem is still unsolved
when posed with high variations in orientation, lighting, and other environmental
conditions. The Face Recognition Vendor Tests (FRVT) 2002 report indicates that
the current state of the art in face recognition using high-resolution images with
reasonably controlled indoor lighting is 90% verification at a 1% false positive rate
[75]. Recognition performance is reported to decrease as the size and time lapse of the
database increase. The American Civil Liberties Union’s review of face recognition
surveillance tests at the Palm Beach airport describes that classification performance
decreases sharply in the presence of motion, pose variations, and eyeglasses [92].
In our particular setting, embodied social interaction translates into a complex,
noisy, and constantly changing environment. The system has to work directly from
video stream input, instead of nicely cropped images. Both the robot’s cameras and
the interaction subjects are moving, leading to variations in viewing angle, distance,
etc. The experiment lasts for many hours of the day in areas with large glass windows,
resulting in lighting changes. Mertz typically encounters a large number of individuals
each day and often has to interact with multiple people concurrently. The passersby
exhibit a wide range of interaction patterns and expectations. While some people
managed to face the robot frontally and allowed the robot to extract frontal face
images, others were busy looking around to check out the robot's platform and thus rarely faced the robot frontally. Some people had trouble hearing the robot and thus tilted their heads sideways to listen to the robot, exposing only their side facial views.

Figure 5-1: A sample of the face sequences taken from interaction sessions. Both the camera and target move around, creating a complex and noisy environment. Note the large variation in face pose, scale, and expression.
Figure 5-1 illustrates a sample set of faces tracked during interaction, containing
large variations in scale, face pose, and facial expression. These dynamic and noisy
settings lead to various errors and imprecision in the robot’s collected training data,
as described in section 4.8.3.
In addition to these errors and imprecisions, the task of extending the recognition system to allow for incremental acquisition
and unsupervised labelling of training data further complicates the problem. Previous studies have shown that the intraclass distance of a person’s face is sometimes
larger than the interclass distance between two people’s faces [90]. Thus, the task of
clustering face images is challenging and cannot rely on existing distance-based clustering methods. Moreover, many existing clustering algorithms require an a priori
number of clusters, which is not available in our case.
5.2
Failure Modes
We describe a set of failure modes relevant to the incremental recognition task. In
our application and setup, some failures are more catastrophic than others. As the
entire procedure is automated, failure in one subsystem can propagate to the rest of
the system. Thus, it is important for the different subsystems to compensate for each
other’s failures.
The Face Data
• Partially cropped faces or inclusion of background due to inaccurate segmentation during tracking
• Inclusion of non-face images due to false positive face detection
• Inclusion of another person's face images due to tracking error
The first two failures are certainly not ideal, but may still be compensated by the
face clustering system. The last one, however, would lead to unacceptable results if it causes two people to be recognized as one person.
Unsupervised Clustering
• Merging – multiple individuals per cluster
• Splitting – multiple clusters per person
• Failure to form a cluster of an individual
• Inclusion of non-face images in some clusters in the database
• Formation of clusters containing only non-face images
Merging and splitting are the two main failure modes of any clustering method.
We will often refer to these failure modes by using these italicized keywords. Merging
multiple individuals into a cluster can lead to the false recognition of two people as
one person, especially if the proportion of each person in the cluster is roughly even.
This failure may not propagate to the recognition system, however, if one person holds the majority of the cluster. Splitting or forming multiple clusters per person is somewhat
less catastrophic. In fact, we expect that this failure will frequently occur initially. As
the robot interacts more with the person, we hope that these split clusters will start
getting merged together. However, if the system starts building two large clusters of
a person and never merges them together, this would lead to a negative consequence.
As the main goal is to recognize familiar individuals, the failure to form a cluster of
a person is unacceptable if he/she has interacted with the robot frequently. However,
failure to form a cluster of someone who has only shown up once or twice is perfectly
acceptable.
Inclusion of non-face images in part of a cluster or in a whole cluster in the database is inevitable, unless the robot has 100% face detection and tracking accuracy. This failure may also lead to negative consequences if the recognition system fails to compensate for them.
Person Recognition
• False recognition of a known person as an unknown
• False recognition of a known person as someone else
Both of these failure modes are unacceptable, given that the recognition system is
our last point of contact in this thesis. In the future, we are interested in exploring an
active learning scheme where the robot asks a person to check whether its recognition hypothesis is correct and somehow integrates the answers into its learning system.
This would provide a way to compensate for these unacceptable recognition failures.
5.3
Unsupervised Face Clustering
Given an unlabelled training set consisting of an arbitrary number of face sequences
from multiple individuals, the task of the clustering system is to cluster these sequences into classes such that each class corresponds to one individual. Figure 5-2
illustrates the steps of the face sequence clustering procedure. We first describe the
overall approach that we took in our solutions. We then present each implementation step in more detail below.
Figure 5-2: The Unsupervised Face Sequence Clustering Procedure. The input is a
data set containing an arbitrary number of face sequences. Each face sequence is
processed for feature extraction. Extracted features are then passed to the clustering
system.
5.3.1
The Approach
The following are four properties of the solutions that we employ in our implementation of the face clustering system.
Use of face sequence We are using a video-based approach which deals with face
sequences instead of images. The robot utilizes spatio-temporal context to perform
same-person face tracking and obtain a sequence of face images that the robot assumes
to belong to an individual. These face sequences provide a stepping stone for the
unsupervised face clustering problem. Instead of clustering face images which may
look very different from one another, we are clustering face sequences which contain images of varying poses and thus capture more information about a person's face.
Similarly, the face recognition system also utilizes this spatio-temporal context such
that it does not have to produce a hypothesis for every single image frame.
Use of local features We are using David Lowe’s SIFT (Scale Invariant Feature
Transform) to describe the face sequences [54]. SIFT is an algorithm for describing
local features in a scale and rotation invariant way. Each SIFT feature or keypoint is
represented by a 128-dimensional vector, produced by computing histograms of gradient directions in a 16x16-pixel image patch. The SIFT feature descriptor is particularly suitable for our application as it has been shown to provide robust nearest-neighbor matching despite scale changes, rotation, noise, and illumination variations. This invariance is crucial in allowing our system to deal with the high pose variations and noise in the robot's collected face data.
Sparse alignment and Face Regioning We are using very sparse face alignment
in our clustering solution. We simply use the face segmentation provided by the
robot’s face detector [98] and tracker [87]. We then blindly divide each face image into
six regions, as shown in figure 5-4. The choice for a sparse alignment is an important
design decision, as we want to step away from the requirement for a precise alignment,
which is not yet solved for the case of multi-view faces. Many current face recognition
methods depend on having good alignment of the face images, and alignment accuracy has been shown to have a large impact on face recognition performance [101, 76, 86].
Clustering of local features We are using SIFT because it is a powerful local
descriptor for small patches of image. However, this also means that we have to represent each data sample (face sequence) as a set of multiple SIFT keypoints extracted
from different parts of the face. Thus, the task of clustering these local features in the 128-dimensional feature space is not straightforward. We develop a clustering algorithm for local features based on the simple intuition that if two sequences belong to the same individual, there will be multiple occurrences of similar local image patches between these two sequences. In addition to providing a sparse alignment, the face regions are also utilized to enforce geometrical constraints in the clustering step. We describe the clustering algorithm in more detail below.
5.3.2
A Toy Example
The clustering system’s task is to cluster all face sequences in the unlabelled training
set into a set of classes such that each class corresponds to one individual. In order to
do this, we have to compute whether or not two sequences should be matched and put
in the same class based on some distance metric. Standard distance measures do not
work well in this setup since each training sample, i.e., a face sequence, is represented
by a set of local features taken from different points of the image. Instead, we develop
a matching algorithm for comparing two face sequences based on their local features.
This matching procedure receives one face sequence as an input to be matched against
a data set of an arbitrary number of sequences. It produces an output match set, which is empty if no match is found. If one or more matches are
found, the output match set may contain anywhere from one up to all of the sequences
in the training set.
Before we move on to the algorithm details, we first use a simple toy example to
provide an intuition for the algorithm. In this example, as shown in figure 5-3, we
would like to compare a face sequence S1 to a data set of two sequences, S2 and S3.
Suppose we choose a feature in a region of sequence S1, highlighted by a red dot.
We find ten nearest neighbors to this feature from other sequences in the data set,
but only within the corresponding region, highlighted by the green dots. We then
compute a list of the number of green dots found from S2 and S3, sorted from highest
to lowest. In this simple example, S3 has 7 matches while S2 has 2 matches. Thus, S3
appears first on the list. The sequence matching algorithm is based on the following
intuition. If a sequence is indeed a match to the input sequence, it is more likely to
have more matches and thus is likely to appear higher on the sorted list. Moreover,
it is more likely to appear higher on the sorted list in multiple regions of the face. In
other words, if two sequences belong to the same individual, there will be multiple
occurrences of similar local image patches between these two sequences.
Figure 5-3: A simple toy example to provide an intuition for the sequence matching
algorithm. In this example, we want to compare a face sequence S1 to a data set of
two sequences, S2 and S3.
5.3.3
Implementation
We start by defining the clustering algorithm input, i.e., a data set containing an
arbitrary number of sequences of face images Q.
Q = {Si |i ∈ [1, . . . , N ]}
(5.1)
where N is the total number of sequences.
A sequence of images Si is defined as follows:
Si = {Imi,j |j ∈ [1, . . . , Ni ]}
(5.2)
where Ni is the number of images in the sequence Si .
Figure 5-4: The division of face image into six regions. We initially divide each face
image into six regions. This division is done blindly without any precise alignment of
face or facial features. In addition to providing a sparse alignment, the face regions
are also utilized to enforce geometrical constraints in the clustering step.
Feature Extraction First, we extract a set of features from each face sequence Si
by performing the following steps.
We initially divide each face image Imi,j into six regions, as follows:

Imi,j = [ R1 R2 R3 ; R4 R5 R6 ]    (5.3)
This division is done blindly without any precise alignment of face or facial features, as shown on figure 5-4.
For each image Imi,j , we use the Harris corner detector to identify a set of interest
points, which is a subset of all pixels in Imi,j [39]. We are using the Harris corner
detector instead of the SIFT’s keypoint finding method proposed in [54], because the
latter method does not work as well with our lower resolution face images.
For each image Imi,j , we then compute one SIFT 128-dimensional feature vector
for each interest point and group the results into six batches. We define the function φ(·) that computes SIFT feature vectors from a set of interest points in an image Imi,j and groups each resulting feature vector into one of six batches according to which of the six regions its associated interest point comes from.

Bi,j,m = φm(Imi,j)   for m = 1, . . . , 6    (5.4)

where Bi,j,m is the set of 128-dimensional SIFT feature vectors for region Rm, with Bi,j,m ⊂ R128.
As shown in figure 5-5, we then combine all computed keypoints per region from
all images in the sequence. Thus, in the end of this step, each sequence Si is mapped
to six batches of keypoints Ai,m , m ∈ [1, . . . , 6], as follows.
Ai,m = Bi,1,m ∪ Bi,2,m ∪ · · · ∪ Bi,Ni,m    (5.5)
Feature Prototype Generation In order to reduce the number of keypoints to be
clustered, we convert them into a set of prototypes.

Figure 5-5: The feature extraction procedure for each face sequence. We extract SIFT features at each detected corner (found using the Harris corner detector [39]) in each image of the sequence. We then combine all computed keypoints per region from all images in the sequence. Thus, at the end of this step, each sequence has six batches of keypoints, one batch for each of the six regions. Lastly, these six batches of keypoints are converted into a set of prototypes.

For each batch of keypoints produced by each face sequence, we perform k-means clustering to compute 50 keypoint prototypes.
For this computation, we use KMlocal, a collection of software implementations
of a number of k-means clustering algorithms [45, 44]. In particular, we use a hybrid
algorithm, combining a swap-based local search heuristic and Lloyd’s algorithm with
a variant of simulated annealing.
We define the function Γ50(·) that computes the 50 k-means cluster centers.

Oi,m = Γ50(Ai,m),   m ∈ [1, . . . , 6]    (5.6)
Note that in practice, each Ai,m contains at least 50 elements.
After this final step, as shown in figure 5-5, each sequence Si is mapped to Oi,1, . . . , Oi,6, where Oi,m is defined as:

Oi,m = { Ci,m,p | Ci,m,p ∈ R128, p = 1, . . . , 50 }    (5.7)

where each Ci,m,p for p = 1, . . . , 50 is one of the 50 prototype vectors produced using k-means clustering.
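To make the pipeline concrete, the following sketch implements the per-region feature extraction and prototype generation. It substitutes OpenCV's Harris corner detector and SIFT descriptor and scikit-learn's k-means for the KMlocal hybrid algorithm used in the thesis; the corner-response threshold and keypoint size are placeholders rather than the robot's actual settings.

import cv2
import numpy as np
from sklearn.cluster import KMeans

N_REGIONS = 6        # 2 rows x 3 columns, as in figure 5-4
N_PROTOTYPES = 50    # keypoint prototypes per region (equation 5.6)

def region_index(x, y, width, height):
    # Map a pixel location to one of the six face regions R1..R6 (no precise alignment).
    col = min(int(3 * x / width), 2)
    row = min(int(2 * y / height), 1)
    return 3 * row + col

def extract_region_descriptors(image):
    # SIFT descriptors computed at Harris corners, grouped into the six face regions.
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    h, w = gray.shape
    response = cv2.cornerHarris(np.float32(gray), 2, 3, 0.04)
    ys, xs = np.where(response > 0.01 * response.max())   # corner threshold: placeholder
    keypoints = [cv2.KeyPoint(float(x), float(y), 8) for x, y in zip(xs, ys)]
    keypoints, descriptors = cv2.SIFT_create().compute(gray, keypoints)
    batches = [[] for _ in range(N_REGIONS)]
    if descriptors is None:
        return batches
    for kp, desc in zip(keypoints, descriptors):
        batches[region_index(kp.pt[0], kp.pt[1], w, h)].append(desc)
    return batches

def sequence_prototypes(face_images):
    # Pool descriptors over a whole face sequence, then reduce each region's batch to
    # 50 k-means prototypes (the Oi,m of equation 5.7). Assumes each pooled batch ends
    # up with at least 50 keypoints, as noted in the text.
    pooled = [[] for _ in range(N_REGIONS)]
    for img in face_images:
        for m, batch in enumerate(extract_region_descriptors(img)):
            pooled[m].extend(batch)
    return [KMeans(n_clusters=N_PROTOTYPES, n_init=10).fit(np.array(p)).cluster_centers_
            for p in pooled]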
Sequence Matching Now we have shown how to convert each sequence Si in the
data set Q to six sets of feature prototypes, Oi,1 , . . . , Oi,6 .
As shown in figure 5-7, we use the extracted feature sets to compare each sequence
Si against the rest of the data set R = {Sx | x ≠ i, x ∈ [1, . . . , N]}, using a function Ψ(·)
which produces an output set Mi containing matches for Si .
Mi = Ψ(Si , R)
(5.8)
Mi is a subset of {0, . . . , N − 1}.
We now define the function Ψ(.), which takes in two inputs:
• a sequence input Si which has been converted to six sets of 50 feature prototypes
Ci,m,p , m = 1, . . . , 6, p = 1, . . . , 50.
• the rest of the data set R = {Sx | x ≠ i, x ∈ [1, . . . , N]}, which has been converted to Cx,m,p, x ≠ i, m = 1, . . . , 6, p = 1, . . . , 50.
We describe each step of the sequence matching function Ψ(.) using the following
pseudo-code. This sequence matching procedure is also illustrated in figure 5-6.
Sequence Matching Algorithm

SequenceMatching(Ci,m,p, R)
  FOR each region m
    FOR each prototype p
      Find the K nearest neighbors to Ci,m,p among the region-m prototypes in R
      Define Counti(x, m, p) to be the number of these nearest neighbors that
        come from region m of sequence Sx
    ENDFOR
    Define Counti(x, m) = Σ over p = 1, . . . , 50 of Counti(x, m, p)
    Sort Counti(x, m) in descending order
    Take the top N elements of the sorted Counti(x, m)
  ENDFOR
  Sequence Sx matches Si iff Counti(x, m) is in the top N positions for all m = 1, . . . , 6.

There is one missing detail in the above description. If the value of any element in the sorted Counti(x, m) list is less than C% of the first element, for some parameter C, then this element and all elements below it in the sorted list of length N are excluded.
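A direct transcription of this matching procedure into Python might look as follows. The sketch assumes the per-region prototypes computed above and uses brute-force nearest-neighbor search; the thesis does not specify a particular search structure.

import numpy as np

def sequence_matching(query_protos, dataset_protos, K=10, N=30, C=0.3):
    # query_protos: list of 6 arrays (50 x 128), the prototypes of the input sequence Si.
    # dataset_protos: dict seq_id -> list of 6 such arrays for every other sequence.
    # Returns the set of sequence ids declared as matches for Si.
    top_lists = []
    for m in range(6):                                    # FOR each region m
        owners, feats = [], []
        for sid, protos in dataset_protos.items():        # pool region-m prototypes of all
            owners.extend([sid] * len(protos[m]))          # candidates, remembering owners
            feats.append(protos[m])
        feats = np.vstack(feats)
        counts = {}
        for proto in query_protos[m]:                     # FOR each prototype p
            dists = np.linalg.norm(feats - proto, axis=1)
            for idx in np.argsort(dists)[:K]:             # K nearest neighbors in region m
                counts[owners[idx]] = counts.get(owners[idx], 0) + 1
        ranked = sorted(counts.items(), key=lambda kv: -kv[1])[:N]
        if ranked:
            cutoff = C * ranked[0][1]                     # the C% rule: drop every entry whose
            ranked = [kv for kv in ranked if kv[1] >= cutoff]  # count falls below C * top count
        top_lists.append({sid for sid, _ in ranked})
    # Sx matches Si iff it survives the (possibly truncated) top-N list of all six regions.
    return set.intersection(*top_lists)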
Unsupervised Face Clustering We have now shown how to compare each sequence Si against the rest of the data set R = {Sx | x ≠ i, x ∈ [1, . . . , N]} to produce an output set Mi containing matches for Si, as illustrated in figure 5-7.
If the output set Mi is not empty, the system will combine Si and each element
Sx in Mi into a cluster. This process is repeated for each Mi from each sequence
Si . This clustering step is performed greedily such that if any two clusters contain
matching elements, the two clusters will be merged together.
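This greedy merging amounts to computing connected components over the "matches" relation, which can be sketched with a small union-find structure. The data structure is an implementation choice for illustration; the thesis does not prescribe one.

class UnionFind:
    # Minimal union-find; clusters are connected components of the match graph.
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, a):
        while self.parent[a] != a:
            self.parent[a] = self.parent[self.parent[a]]   # path halving
            a = self.parent[a]
        return a

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def greedy_clusters(match_sets):
    # match_sets: dict seq_id -> set of matched seq_ids (the Mi of equation 5.8).
    ids = sorted(match_sets)
    index = {sid: k for k, sid in enumerate(ids)}
    uf = UnionFind(len(ids))
    for sid, matches in match_sets.items():
        for other in matches:
            if other in index:                 # ignore ids outside the current data set
                uf.union(index[sid], index[other])
    clusters = {}
    for sid in ids:
        clusters.setdefault(uf.find(index[sid]), []).append(sid)
    return list(clusters.values())             # unmatched sequences remain as singletons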
5.3.4
Clustering Performance Metric
In order to evaluate the robot's unsupervised face clustering, we define a set of performance metrics. Given the nature of our setup and failure modes, we feel that a single number is not sufficient to reflect both the merging and splitting errors. The clustering task is essentially a struggle between merging and splitting failures. For our purposes, merging multiple people into a cluster is more detrimental than splitting an individual's face sequences into multiple clusters. Both failures would lead to
Figure 5-6: The Face Sequence Matching Algorithm. This matching procedure receives one sequence as input and produces an output match set containing one or more sequences. The intuition behind this algorithm is that if two sequences belong to the same individual, there will be multiple occurrences of similar local image patches between these two sequences.
Figure 5-7: The Face Sequence Clustering Procedure. Given a data set containing N sequences, each sequence is compared to the rest of the sequences in the data set. The sequence matching algorithm produces an output set of matches, which are greedily merged such that any two clusters containing matching elements will be merged together.
false recognition, but in an incremental setup, the latter may get fixed if the robot
acquires more data from the corresponding individual. Moreover, given the robot’s
greedy clustering mechanism, the merging failure has a compounding effect over time.
In order to reflect how the robot’s clustering mechanism performs with respect to
both of these failure modes, we opted for a set of metrics. Given that an individual P has a set of sequences X = {Si | i = 1, . . . , Xsize} in the training set, we define the following
categories:
• Perfect cluster: if the system forms one cluster containing all elements in P’s
sequence set X .
• Good cluster: if the system forms one cluster containing {Sj | j = 1, . . . , M}, M ≤ Xsize, and leaves the rest as singletons. Note that the perfect cluster category is a subset of the good cluster category.
• Split cluster: if the system splits the elements of X into multiple clusters.
• Merged cluster: if the system combines sequences from one or more other individuals with any sequences inside X into a cluster. Note that the set of sequences which are labelled as non-faces is treated as if it were an individual. Thus, merging a non-face sequence into any cluster is penalized in the same way.
We also define some additional metrics for analyzing the split and merged clusters:
• Split degree: the proportion of an individual's sequences that falls in the largest of the clusters in a splitting case.
• Merged purity: the proportion of sequences in a merged cluster that belong to the individual who holds the majority of that cluster.
• Merged maximum: the maximum number of individuals merged together in a
cluster from all of the merged cases.
The split degree and merged purity provide some information about the severity of
a split or merged failure. A high split degree corresponds to a lower severity, as
this means the clustering still successfully forms a significant cluster of an individual
instead of many small ones. A high merged purity also corresponds to a lower severity,
since it reflects cases where an individual still holds a significant majority of a cluster.
If the merged purity value is very high, the few bad sequences may not be well
represented and will not significantly affect recognition performance.
The merged maximum may provide more information than the total number of merged cases when a large number of sequences are falsely merged together into a
cluster. This would only yield one merged case, and thus its severity will not be
reflected by the total number of merged cases.
Given an unlabelled training set containing face sequences from M individuals
(according to ground truth), we then measure the clustering performance by the
following measurements (a small sketch of how these can be tallied from the cluster output and ground-truth labels follows the list):
• Number of people: the number of individuals who have at least 2 sequences in the data set
• None: the number of individuals whose sequences did not get clustered at all
• Perfect: the number of perfect clusters
• Good: the number of good clusters, which also includes the perfect clusters
• Split: the number of split clusters
• Split degree: a distribution of the split degrees of all the split cases
• Merged purity: a distribution of the merged purity of all the merged cases
• Merged maximum: the maximum number of individuals merged together in a
cluster from all the merged cases
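The following sketch shows one way these measurements could be tallied from the clustering output and the manually labelled ground truth. It follows the definitions above; edge cases that the text does not spell out (for example, ties) are resolved arbitrarily here.

from collections import Counter

def clustering_metrics(clusters, labels):
    # clusters: list of lists of sequence ids (singletons included).
    # labels: dict seq_id -> ground-truth person id; non-face sequences get their own id.
    person_totals = Counter(labels.values())
    report = {"perfect": 0, "good": 0, "split": 0, "none": 0, "merged": 0}
    split_degrees, merged_purities = [], []
    multi = [c for c in clusters if len(c) > 1]
    for cluster in multi:
        people = Counter(labels[s] for s in cluster)
        if len(people) > 1:                                   # merged cluster
            report["merged"] += 1
            merged_purities.append(max(people.values()) / len(cluster))
    for person, total in person_totals.items():
        if total < 2:
            continue                                          # only people with >= 2 sequences
        sizes = sorted((sum(1 for s in c if labels[s] == person) for c in multi), reverse=True)
        sizes = [x for x in sizes if x > 0]
        if not sizes:
            report["none"] += 1                               # sequences left entirely unclustered
        elif len(sizes) == 1:
            report["good"] += 1                               # one non-singleton cluster ...
            if sizes[0] == total:
                report["perfect"] += 1                        # ... containing all of the sequences
        else:
            report["split"] += 1
            split_degrees.append(sizes[0] / total)            # split degree of the largest piece
    return report, split_degrees, merged_purities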
5.3.5
Clustering Parameters and Evaluation
The sequence matching algorithm relies on three parameters, K, N, and C. K is
the number of nearest prototypes used to form a sorted list of sequences with the
most number of nearest prototypes. N is the maximum length of this sorted list
which is then compared for overlaps with sorted lists from other face regions. C is a
threshold value used to cut the sorted list in cases where some sequences dominate
as the source of nearest neighbors over other sequences in the list. Thus, N becomes
irrelevant when the threshold C is activated.
Note that the sequence matching algorithm does not rely directly on distance-based measures. Instead, it is very data dependent, as it relies on voting among nearest neighbors and spatial configuration constraints. In order to assess the algorithm's sensitivity to various factors, we provide an analysis of the robot's clustering
performance across a range of data-dependent properties and parameter values K, N,
and C below.
We extract a number of data sets of different sizes from the collected training
data. Each data set was formed by taking contiguous segments from the robot’s
final training data in order of its appearance during the experiment. Thus, each set
contains a similar distribution of number of individuals and number of sequences per
Table 5.1: The data set sizes and parameter values used in the clustering algorithm
evaluation.
Data set size    K     N     C
30               10    3     0%
300              30    5     30%
500              50    10    50%
700                    15    70%
1000                   20
2025                   30
                       40
                       50
individual, as shown in table 4.9. We then perform the clustering algorithm on each data set and compute the set of metrics described above. We show a subset of these results
below and provide the complete set in appendix C. These results were generated using
data set sizes and parameter values listed in table 5.1.
Medium to large data set sizes Figures 5-8 and 5-9 show two sample results generated from data sets of 300 and 700 sequences, using different combinations of N and K as listed in table 5.1. These parameter values are shown in the lowest subplot of each figure. The C parameter is kept constant at 30%.
The topmost subplot illustrates the number of good, perfect, and none resulting clusters. The second subplot shows the number of total merging errors and their distribution for different merged purity values (0-25, 25-50, 50-75, 75-100%). The third subplot shows the merged maximum. This measure is more indicative than the total merging errors when a large number of sequences are falsely
merged together. The fourth subplot shows the number of splitting errors and their
distribution for different split degree values (0-25, 25-50, 50-75, 75-100%).
Figure 5-10 shows the normalized number of merging and splitting errors. These
plots essentially illustrate the trade-off between merging and splitting, typically encountered in any unsupervised clustering task. Based on all of the clustering results of
data sets of various sizes using different parameter values, we observe that increasing
Figure 5-8: The clustering results for a data set of 300 sequences with different N and K parameter values while C is kept fixed at 30%. The topmost subplot illustrates the number of good, perfect, and none resulting clusters. The second subplot shows the number of total merging errors and their distribution for different merged purity values. The third subplot shows the merged maximum. The fourth subplot shows the number of splitting errors and their distribution for different split degree values.
the parameter K results in higher merging errors, but does not necessarily reduce the
splitting errors. Thus, for the rest of this evaluation, we assume that K should be
kept on the lower side.
Figure 5-11 illustrates how these trade-off curves between merging and splitting
errors change as we vary our data set sizes, N , and C, while keeping K fixed at 10.
Generally, turning the knob on N slides us along the split and merge trade-off. While
the results are satisfying when we are in a good zone for N , we do not want to have
to carefully tune parameter values each time. As N increases, merging errors increase
while splitting errors decrease. However, this effect is diminished as the data set size
increases. Similarly, we also observe the same diminishing effect as the parameter C
Figure 5-9: The clustering results for a data set of 700 sequences with different N and K parameter values while C is kept fixed at 30%. The topmost subplot illustrates the number of good, perfect, and none resulting clusters. The second subplot shows the number of total merging errors and their distribution for different merged purity values. The third subplot shows the merged maximum. The fourth subplot shows the number of splitting errors and their distribution for different split degree values.
increases. For larger data sets or higher C values, the merging errors still increase and
the splitting errors still decrease as N increases, but not nearly as much. This means
that the splitting errors will generally be slightly higher. However, it allows for a
larger margin for how to specify N without sacrificing too many merging or splitting
errors. We later utilize these properties in our parameter specification strategies.
Small data set sizes Figure 5-12 shows the trade-off curves between merging and
splitting errors for a data set of 30 sequences, with different K, N , and C values.
These curves exhibit a similar trend to those of the larger data sets. When C is
low, merging errors increase while splitting errors decrease as we turn the knob on N .
When C is high, the curves flatten. However, there is a difference. When C is low, the
Figure 5-10: The normalized number of merging and splitting errors from the clustering results shown in figures 5-8 and 5-9. These plots essentially illustrate the trade-off between split and merge errors, typically encountered in any unsupervised clustering task.
merging error increases drastically as N increases such that there is only a very small
good trade-off zone which occurs when N is very low. In terms of K, the results are
consistent with our previous finding that increasing K simply magnifies the merging
errors without improving the splitting errors. We incorporate these observations in
our summary and parameter specification strategies later.
Different sequence distributions within the data set We have so far analyzed
the clustering results on different contiguous subsets of the robot’s collected training
data in the order of its appearance during the experiment. Each set contains a large
variation in the number of sequences per individual, ranging from one to over two
hundred sequences per person. In order to assess the clustering algorithm’s sensitivity
to the distribution of sequences per person in the data set, we conducted the same
clustering tests using data sets with different sequence distributions per person.
Figure 5-13 and 5-14 show the trade-off curves between merging and splitting
errors for data sets with two different distributions of sequences per person, at different
C and N values while keeping K fixed at 10.
Figure 5-11: The trade-off curves between merging and splitting errors at different
data set sizes and values of N and C. As N increases, merging errors increase while
splitting errors decrease. However, this effect is diminished as the data set size increases. Similarly, we also observe the same diminishing effect as the parameter C
increases. For larger data sets or higher C values, the merging errors still increase
and the splitting errors still decrease as N increases, but not nearly as much.
Figure 5-12: The trade-off curves between merging and splitting errors for a data set
of 30 sequences, with different K, N, and C values.
In figure 5-13, both data sets contain 500 sequences. The left column corresponds
to the first distribution of 7-30 sequences per person. The right column corresponds
to the second distribution of 1-248 sequences per person. We again see a similar trend
which we previously observed, where turning the knob on N slides us along a trade-off between merging and splitting errors. Increasing the parameter C diminishes this effect and thus flattens the curves. However, comparisons between the two distributions indicate that the first distribution yields fewer merging errors and more splitting errors. This is because in the first distribution, each person has more sequences to be clustered. Thus, there is a higher chance that the clustering algorithm
ends up splitting their sequences. On the other hand, in the second distribution, a
large percentage of the people have only 2-3 sequences to be clustered.
In figure 5-14, we see the opposite case where the left column corresponds to
the first distribution of 1-4 sequences per person. The right column corresponds to
the second distribution of 1-248 sequences per person. In this case, we observe that
the first distribution yields fewer splitting errors. In fact, sliding N to higher values
increases both the merging and splitting errors.
Figure 5-13: The trade-off curves between merging and splitting errors for data sets
with two different distributions of sequences per person, at different C and N values
while keeping K fixed at 10. Both data sets contain 500 sequences. The left column
corresponds to the first distribution of 7-30 sequences per person. The right column
corresponds to the second distribution of 1-248 sequences per person.
Figure 5-14: The trade-off curves between merging and splitting errors for data sets
with two different distributions of sequences per person, at different C and N values
while keeping K fixed at 10. Both data sets contain 700 sequences. The left column
corresponds to the first distribution of 1-4 sequences per person. The right column
corresponds to the second distribution of 1-248 sequences per person.
These results indicate that the distributions of sequences per person in the data
set affect the shape of the trade-off curves at different values of N . Essentially, if
the data set contains few face sequences per individual, there is a higher chance for
merging errors to occur, especially for individuals who have only one sequence and
therefore may be falsely matched to someone else’s sequence. However, there is a
lower chance for splitting errors since there are fewer sequences to be clustered. On
the other hand, if the data set contains many face sequences per individual, there is
a lower chance for merging errors to occur because there is a match for most of the
face sequences. However, there is a higher chance for splitting errors since there are
more sequences to be clustered per person.
Lastly, we observe that increasing the parameter C value has the same effect of
flattening the trade-off curves regardless of the data set size or sequence distribution.
Sequence Matching Accuracy Figure 5-15 shows the accuracy of the sequence
matching algorithm when performed on data sets of different sizes and using different
parameter C values. This accuracy measure is defined as the percentage of occasions
when every element in the matching output set declared by the algorithm is indeed
a correct match to the input sequence.
These accuracy measures indeed confirm our previous observations. For larger
data sets, decreasing the parameter C value causes a slight decrease in the accuracy
rate. However, for smaller data sets, decreasing the parameter C value drastically
reduces the accuracy rate, even all the way down to 0% for data sets of 30 sequences.
This corresponds to a case where all sequences are falsely merged together into a
single cluster.
Based on these numbers, it seems straightforward that we should use high values of C to obtain the best results. However, keep in mind that a high accuracy value reflects low merging errors; it does not reveal anything about the splitting errors. Instead, we suggest correlating C with the data set size. We discuss these parameter specification strategies in more detail below.
Figure 5-15: The accuracy of the sequence matching algorithm when performed
on data sets of different sizes and using different parameter C values. This accuracy
measure is defined as the percentage of occasions when every element in the matching
output set declared by the algorithm is indeed a correct match to the input sequence.
5.3.6
Summary and Parameter specification strategy
Based on these results, we make the following observations and propose a set of
strategies.
• Turning the knob on the N parameter slides us along the trade-off between
merging and splitting errors. As N increases, the merging error also increases
while the splitting error decreases.
• Increasing parameter K causes higher merging error, but does not decrease
splitting error. Thus, the value of K can be fixed at a low value.
• Parameters C and N are related in that C makes N irrelevant when the algorithm finds one or more sequences which dominate in the nearest-neighbor votes over the rest of the data set. Parameter C can be thought of as a conservativeness measure. Increasing the parameter C value allows the system to be more conservative because it essentially flattens the trade-off curves between merging and splitting. In other words, at higher values of C, the algorithm is more conservative in declaring a match and thus generates lower merging errors but higher splitting errors. This property allows us to keep N at a fixed value and vary C depending on how conservative we want the algorithm to be.
• Increasing the size of the data set has a similar effect as increasing the parameter C value. This means that when we have more data, the algorithm is less susceptible to merging errors. Our intuition for this is that as more data fills up the feature space, the algorithm's sorted list of nearest prototypes has a smaller chance of catching false positive neighbors. Thus, having a lot of data reduces the chance of a false positive match. Given these properties, when we have a lot of data, we can reduce the C value and be less conservative in our clustering process.
• To summarize, we propose to keep K fixed at a low value and N at a
middle-range value. We have observed that for a data set of 30 sequences, the
merging error increases drastically as we increase N . Thus, for very small data
sets, we would need to use a low value of N .
• For smaller data sets, we have to be more conservative, as they are more susceptible to false matches. For larger data sets, we can be less conservative, as they are more immune to false matches. Thus, we propose to correlate the parameter C with the data set size (an illustrative correlation function is sketched below). For all of the experiments using data sets of over 300 sequences, we use K = 10 and N = 30. We also tested the clustering performance using a range of correlation functions between C and the data set size, as shown in figure 5-29.
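The actual correlation functions evaluated in the experiments (such as inc3) are shown in figure 5-29 and appendix C and are not reproduced here; the step function below is a purely hypothetical illustration of the strategy.

def conservativeness_C(dataset_size):
    # Hypothetical mapping from data set size to the cutoff parameter C.
    # Small sets get a high (conservative) C; large sets can afford a lower C.
    # Breakpoints and values are illustrative only, not those of figure 5-29.
    if dataset_size < 100:
        return 0.7
    if dataset_size < 500:
        return 0.5
    return 0.3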
Using A Different Data Set – The Honda/UCSD Video Face Database In
order to analyze the robot's clustering performance on a different data set, we conducted a test using the Honda/UCSD Video Database [51, 50]. The database was created to provide a standard database for evaluating video-based face tracking and recognition algorithms. The database contains 72 video sequences from 20 individuals. All individuals are supposed to have more than one sequence in the database. However, in the version that we downloaded, only 19 people have more than one sequence; one person has a single sequence.
Each video sequence contains one person at a time, lasts for at least 15 seconds,
and is recorded in an indoor environment at 15 frames per second and 640x480 resolution. In each video, the person rotates and turns his/her head in his/her own
preferred order and speed. Thus, typically in about 15 seconds, the video captures a
wide range of poses with significant in-plane and out-of-plane head rotations.
In our tests, we reduce the video resolution to 160x120 for face tracking and
retrieve the face images to be stored at 320x240 to match the setting used in the
robot’s visual processing. Using the robot’s face detector and tracker, we convert each
video into a face sequence. We then processed and clustered these face sequences in
the same way as we did in our previous test with the robot’s self-generated training
set.
Table 5.2 shows the different measurements for the 19 individuals in the database
with more than one sequence. Since the size of the database is small (72 sequences),
we knew ahead of time that it would need small parameter values. The clustering
performance is in fact highest at the smallest parameter values of N = 3, K = 10.
Out of 19 people, the clustering algorithm formed 18 good clusters. Nine of these are
perfect clusters. One individual was split with a 60% split degree.
We also show how these results degrade as the K and N parameter values increase
in figure 5-16. The plot structure is the same as that of figures 5-8 and 5-9. The parameter C
is set at a high value of 70%, as we know that we have to be conservative with very
small data sets. The clustering performance did not degrade too badly even as both
N and K are increased. The algorithm still managed to find good clusters for at least
75% of the individuals, with only one merged cluster for most of the parameter values.
Performance is worst at the highest values of N = 30, K = 30.
5.4
The Integrated Incremental and Unsupervised
Face Recognition System
We incorporate the face clustering solution described above to implement an integrated
system for unsupervised and incremental face recognition, as illustrated in figure 5-17.
Table 5.2: The batch clustering results using the Honda-UCSD face video database
person   total #seq   #seq in cluster / total #seq   split degree   merge purity
  1           9                 0.44                      1              1
  2           5                 1                         0.6            1
  3           5                 0.4                       1              1
  4           5                 1                         1              1
  5           4                 0.5                       1              1
  6           3                 1                         1              1
  7           3                 0.67                      1              1
  8           3                 0.67                      1              1
  9           3                 0.67                      1              1
 10           3                 1                         1              1
 11           3                 1                         1              1
 12           3                 0.67                      1              1
 13           3                 0.67                      1              1
 14           3                 1                         1              1
 15           3                 1                         1              1
 16           3                 1                         1              1
 17           3                 1                         1              1
 18           2                 1                         1              1
 19           2                 1                         1              1
Figure 5-16: The face sequence clustering results with the Honda-UCSD video face
database, using different N and K values, while keeping C fixed at 70%. The clustering performance did not degrade badly even as both N and K are increased. The
algorithm still managed to find good clusters for at least 75% of the individuals with
only one merged cluster for most of the parameter values.
It consists of two separate training sets, an unlabelled one for the clustering system
and a labelled one for the recognition system. Figure 5-18 illustrates the four phases
of the incremental face recognition process, from left to right.
• In the first phase, both training sets are empty.
• In the second phase, the clustering system starts receiving one face sequence
at a time and simply stores them until the clustering training set contains 300
unlabelled face sequences.
• At this point, we enter the third phase where multiple events occur. First, the
clustering system performs a batch clustering on the stored 300 face sequences.
Second, from the resulting clusters, M largest ones are automatically transferred
to form labelled data in the recognition training set where each cluster corresponds to a class. Third, upon formation of these labelled training data, each
Figure 5-17: The unsupervised and incremental face recognition system. It consists of
two separate training sets, an unlabelled one for the clustering system and a labelled
one for the recognition system. Both training sets are initially empty. Over time, the
system incrementally builds a labelled training set using self-generated clusters and
then uses this training set to recognize each sequence input.
image from the sequence input is then passed to the recognition system one
at a time. Based on the current labelled training set, the recognition system
produces a running hypothesis from each face image of the sequence input. After an arbitrary number of face images, the recognition system may decide on a final hypothesis that the sequence input belongs to either a particular
person in the training set or an unknown person. The system then stops its
processing and ignores the rest of the sequence input.
• In the fourth phase, after the recognition system makes a final hypothesis, each sequence input is also passed to the clustering system, which matches it against
the existing clusters. Depending on the output of the sequence matching algorithm, the new sequence input may be integrated into an existing cluster, form
a new cluster, or remain as a singleton. This incremental change is subsequently
reflected in the labelled training set, which will then be used to recognize the
next sequence input. We then loop back to another round of recognition as in
the third phase.
Note that the recognition hypothesis and the output of the sequence matching algorithm are redundant. Both essentially determine which existing cluster the sequence
input belongs to, if any. However, in our implementation, we only use the sequence
matching output to update our existing clusters.
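At a high level, the four phases can be summarized by the loop sketched below. The names batch_cluster, match_into_clusters, recognize, and promote stand in for the components described in this chapter and in section 5.4.1; the bootstrap size of 300 sequences and the promotion of the M largest clusters follow the text, while the rest is schematic.

BOOTSTRAP_SIZE = 300   # phase 2: buffer this many sequences before the first batch clustering
M_LARGEST = 15         # number of self-generated clusters promoted to the labelled set

def promote(clusters, m=M_LARGEST, min_size=2):
    # Turn the m largest self-generated clusters into labelled classes P0, P1, ...
    big = sorted((c for c in clusters if len(c) >= min_size), key=len, reverse=True)[:m]
    return {"P%d" % k: cluster for k, cluster in enumerate(big)}

def incremental_recognition(sequence_stream, batch_cluster, match_into_clusters, recognize):
    unlabelled, clusters, labelled = [], [], {}
    for sequence in sequence_stream:
        if len(unlabelled) < BOOTSTRAP_SIZE:                 # phases 1-2: just accumulate
            unlabelled.append(sequence)
            if len(unlabelled) == BOOTSTRAP_SIZE:
                clusters = batch_cluster(unlabelled)         # phase 3: initial batch clustering
                labelled = promote(clusters)
            continue
        hypothesis = recognize(sequence, labelled)           # known person or "unknown"
        clusters = match_into_clusters(sequence, clusters)   # phase 4: fold the sequence into the clusters
        labelled = promote(clusters)                         # refresh the labelled training set
        yield sequence, hypothesis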
5.4.1
The Supervised Variant
In theory, the supervised face recognition module can be filled by any existing supervised face recognition system. Since we are interested in evaluating our approaches,
we implemented the supervised face recognition module using a variant of the sequence matching algorithm. Instead of face sequences, the face recognition module receives a single face image as input. Moreover, the sequence matching algorithm relies on k-means clustering to obtain feature prototypes, which does not work in real time.
Thus, the matching algorithm has to be adapted to accommodate single image input
and faster processing. Figure 5-19 illustrates this adapted algorithm. It is similar
Figure 5-18: The four phases of the unsupervised and incremental face recognition
process. During this incremental process, the system builds a labelled training set
using self-generated clusters and then uses this training set to recognize each sequence
input.
to the original algorithm, except that instead of matching feature prototypes of a
sequence, it matches the original features directly from each input image.
Figure 5-19: The adapted sequence matching algorithm for supervised face recognition. This adapted version is very similar to the original algorithm, except that
instead of matching feature prototypes of a sequence, it matches the original features
directly from each input image.
5.4.2
Incremental Recognition Results
We conducted two tests to evaluate the integrated system. Figure 5-20 shows the
incremental recognition results of each sequence input generated from the first test. In
this test, the labelled recognition training set is incrementally constructed using the fifteen largest self-generated clusters, each containing at least two sequences. The lower
sub-plot shows the number of sequences in the incrementally constructed labelled
training set, which increases as the robot encounters more input data over time.
For each sequence input, the system makes a recognition hypothesis that it either
belongs to a specific known person Px or an unknown person who is not in the training database. The upper sub-plot shows the hypothesis accuracy of the first case. The blue and red solid lines correspond to the accumulated number of correct
and incorrect hypotheses respectively over time. When the system hypothesizes that
the sequence input belongs to a known person, it is correct 74.5% of the time. The
slope of the number of incorrect hypotheses decreases over time as the size of the
labelled training set increases. If we calculate the recognition performance after some
delay, as shown by the blue and red dotted lines, the performance improves to being
correct 81.8% of the time.
We calculated the recognition performance for some familiar individuals who encountered the robot on multiple days. The familiar individual whose cluster is shown
in figure 5-26 came to interact with the robot on seven different days. Once the
system integrates her cluster into the labelled training set, the recognition system
correctly classified her 15 times and misclassified her as another person and an unknown person 2 and 6 times respectively. The familiar individual whose cluster is
shown in figure D-4 interacted with the robot on two different days. Once the system
integrates his cluster into the labelled training set, the recognition system correctly
classified him 19 times and misclassified him as another person and an unknown
person 4 and 5 times respectively.
The middle sub-plot illustrates the recognition accuracy of the second case where
the system makes an unknown person hypothesis. The system is correct 74% of the
time when it hypothesizes that the sequence input belongs to an unknown person who
does not exist in the training database. As shown in figure 5-17, after each sequence
input is processed for recognition, it subsequently goes through the clustering system
and incrementally integrated into the existing self-generated clusters. Roughly 14.3%
of the sequence inputs, which were falsely recognized as an unknown person, were
later correctly integrated into an existing cluster, as shown by the green line.
Figure 5-21 shows the incremental recognition results from the second test. In this
Figure 5-20: The incremental recognition results for each sequence input. The labelled recognition training set is incrementally built using the fifteen largest self-generated clusters, each containing at least two sequences. The lower sub-plot shows the number of sequences in the incrementally constructed labelled training set, which increases as the robot encounters more input data over time. The upper sub-plot shows the recognition hypothesis accuracy. The blue and red solid lines correspond to the accumulated number of correct and incorrect hypotheses respectively over time. The middle sub-plot illustrates the recognition accuracy of the second case, where the system makes an unknown person hypothesis.
148
test, the setting is similar except that the labelled recognition training set is incrementally built using fifteen largest self-generated clusters, containing at the minimum
six sequences. At this setting, when the system hypothesizes that the sequence input
belongs to a known person, it is correct 47.4% of the time. We attribute the lower
recognition performance to the smaller size of the labelled training set due to the
more selective process of transferring only clusters containing at least six sequences.
However, if we calculate the recognition performance after some delay, as shown by
the blue and red dotted lines, the performance improves to being correct 77.6% of
the time. Thus, once the labelled training set is large enough, the performance is
comparable to that of the previous test.
When the system hypothesizes that the sequence input belongs to an unknown
person who does not exist in the training database, the performance is also comparable
to that of the previous test. The system is correct 74% of the time in declaring that
a person is unknown. Roughly 10.8% of the sequence inputs that were falsely recognized as an unknown person were later correctly integrated into an existing cluster.
To summarize, we evaluated the incremental and fully unsupervised face recognition system using the face data automatically generated by the robot during the
final experiment. During this incremental clustering and recognition test, the system
builds a labelled training set in a fully unsupervised way and then uses this training
set to recognize each sequence input. The system hypothesizes correctly 74% of the
time that the sequence input belongs to an unknown person who does not exist in
the training set. After an initial learning period, when the system hypothesizes that
the sequence input belongs to a specific known person, it is correct roughly 80% of
the time.
5.4.3 The Self-Generated Clusters
Figure 5-22 shows the clustering results produced during the incremental recognition
process. These results were generated with N = 30, K = 10, and C is correlated with
the data set size using the function inc3 shown in figure 5-29. The results are plotted at every addition of 20 sequences, starting from the initial batch of 300 sequences. We provide some examples of these generated clusters in appendix B.

Figure 5-21: The incremental recognition results of each sequence input using a different setting. The labelled recognition training set is incrementally built using the fifteen largest self-generated clusters, each containing at minimum six sequences. The lower sub-plot shows the number of sequences in the incrementally constructed labelled training set, which increases as the robot encounters more input data over time. The upper sub-plot shows the recognition hypothesis accuracy. The blue and red solid lines correspond to the accumulated number of correct and incorrect hypotheses respectively over time. The middle sub-plot illustrates the recognition accuracy of the second case, where the system makes an unknown person hypothesis.
At the end of the test, the clustering algorithm found 151 good clusters. Of these, 75 are perfect clusters. The algorithm made 22 merging failures, with the largest merged cluster containing 3 individuals. In more than half of these merged cases, the merge purity value is between 75% and 100%. There are 26 split clusters. In more than half of these split cases, the split degree value is between 50% and 75%.
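As a rough illustration of how such statistics can be tallied, the sketch below assumes that merge purity is the fraction of a cluster's sequences belonging to its majority identity, and that split degree is the fraction of a person's clustered sequences falling outside that person's largest cluster. Both definitions are assumptions made for this sketch; the exact metrics are defined earlier in the thesis.

```python
# Illustrative tally of cluster-quality statistics under the assumed definitions
# stated in the text above; not the thesis's evaluation code.
from collections import Counter

def merge_purity(cluster_labels):
    """cluster_labels: ground-truth identity of each sequence in one cluster."""
    counts = Counter(cluster_labels)
    return counts.most_common(1)[0][1] / len(cluster_labels)

def split_degree(person, clusters):
    """clusters: list of lists of ground-truth labels, one list per cluster."""
    sizes = sorted((labels.count(person) for labels in clusters), reverse=True)
    total = sum(sizes)
    return 0.0 if total == 0 else 1.0 - sizes[0] / total
```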
Figure 5-23 provides a visualization of these self-generated clusters. Each pie
chart corresponds to an individual. The size of the pie chart circle corresponds to the number of sequences an individual has in the data set. The pie charts are ordered from individuals with the most sequences to those with the fewest. We exclude individuals who have only one sequence in the data set. The green regions, in different shades, correspond to sequences that were correctly clustered; multiple green slices within a pie chart indicate a splitting error. The red regions
correspond to sequences that were falsely clustered, i.e. the merging errors. The gray
regions correspond to unclustered sequences.
Figure 5-24 shows six different snapshots of the largest fifteen clusters taken at
different times during the first incremental recognition test. These largest fifteen
clusters were essentially the content of the labelled recognition training set. In the
beginning, the selected clusters come from individuals who are not the most familiar, i.e. those with only a few sequences. At the end, the selected clusters converge to the most familiar individuals, who have had many encounters with the robot. Note that
these fifteen clusters do not actually correspond to fifteen different people, since there
are some splitting cases. In particular, the first individual represented by the largest
pie chart is represented by four clusters and therefore as four different classes in the
labelled training set.
Figures 5-25 through 5-28 show some examples of these largest clusters of the familiar individuals. Note that not all of the sequences are shown in these figures, due to visibility and space constraints. However, we make sure to include all of the merging errors, when present.
Figure 5-22: The self-generated clusters constructed during the incremental recognition process. These results were generated with N = 30, K = 10, and C is correlated
with the data set size using the function inc3 shown in figure 5-29. The results
are plotted at every addition of 20 sequences, starting from the initial batch of 300
sequences.
The first (figure 5-25) is the largest of the four split clusters formed from the individual represented by the largest pie chart. The second (figure 5-26) is of a woman who came to interact with the robot on seven different days throughout the experiment; the red rectangle points out the falsely merged sequences from another individual in the cluster. The third (figure 5-27) is a falsely merged cluster of many individuals’ faces that were poorly segmented and happened to share a similar background. The fourth (figure 5-28) is of a wall region that was falsely detected as a face by the face detector. We include more examples of these familiar individuals in appendix D.
Figure 5-23: Visualization of the self-generated clusters. Each pie chart corresponds to an individual. The size of the pie chart circle corresponds to the number of sequences an individual has in the data set. The green regions, in different shades, correspond to sequences that were correctly clustered; multiple green slices within a pie chart indicate a splitting error. The red regions correspond to sequences that were falsely clustered, i.e. the merging errors. The gray regions correspond to unclustered sequences.

Figure 5-24: Six different snapshots of the largest fifteen self-generated clusters taken at different times during the incremental recognition process.

Figure 5-25: A sample cluster of a familiar individual, who was split among four clusters in the labelled training set. The other three clusters belonging to this individual are shown in figures D-1 through D-3 in appendix D.

Figure 5-26: A sample cluster of a woman who came to interact with the robot on seven different days throughout the experiment. The red rectangle is placed to point out the falsely merged sequences from another individual in the cluster.

Figure 5-27: A sample cluster of a familiar individual, formed by falsely merging many individuals’ faces that were poorly segmented and happened to share a similar background.

Figure 5-28: A sample cluster of a familiar individual, which is in fact a wall region that was falsely detected as a face by the face detector.

5.4.4 Incremental Clustering Results Using Different Parameters

Figure 5-30 shows the incremental clustering results when using different correlation functions between the parameter C and the data set size. We use six different correlation functions, as shown in figure 5-29. As mentioned previously, the clustering system for the integrated incremental and unsupervised scheme was implemented using correlation function inc3.
Based on our previous clustering evaluations, we expect that the correlation function inc1 will yield lower merging and higher splitting errors. On the other hand,
we expect that the function inc6 will yield the opposite: higher merging and lower splitting errors. The results indeed confirm these expectations. However, the
differences in performance across the six different correlation functions are not drastic. In general, the clustering system is capable of generating clusters with low errors
for roughly half of the individuals in the data set.
Figure 5-29: The set of correlation functions between the data set size and parameter
C values used to evaluate the incremental clustering results.
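To illustrate the role such correlation functions play, the sketch below builds a small family of increasing functions mapping the current data set size to a C value: shallower curves stay conservative (low C), steeper ones push C higher as data accumulates. The closed forms and parameter values here are purely illustrative assumptions; the actual inc1 through inc6 curves are the ones plotted in figure 5-29.

```python
# Illustrative stand-ins for correlation functions between data set size and C.
# Not the thesis's actual inc1-inc6 definitions.

def make_correlation_function(c_max, growth_rate):
    """Return a function mapping data set size -> C in [0, c_max]."""
    def c_of_size(num_sequences):
        # Rises from 0 toward c_max as the data set grows.
        return c_max * (1.0 - (1.0 + growth_rate * num_sequences) ** -1)
    return c_of_size

inc1 = make_correlation_function(c_max=0.3, growth_rate=0.001)  # conservative
inc3 = make_correlation_function(c_max=0.5, growth_rate=0.002)  # intermediate
inc6 = make_correlation_function(c_max=0.7, growth_rate=0.004)  # aggressive
```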
5.5 Comparison to Related Work
In section 2.4, we describe a number of related research efforts in unsupervised face recognition. For comparison purposes, we formulate our face clustering results to match the performance metrics used by Raytchev and Murase in [81] and Berg et al. in [5], as shown in figures 5-31 and 5-32 respectively.

Figure 5-30: The incremental clustering results using different correlation functions between the data set size and parameter C.

The comparison to Raytchev and Murase is more straightforward, as both approaches are video-based and thus deal with face sequences as input. The performance metric takes into account two types of errors: the number of mistakenly clustered sequences and the number of sequences in clusters with < 50% purity. Using this
performance metric, we present our results using data sets of 2025 and 500 sequences.
For the latter, we display results using four different values of the C parameter, ranging from 0% to 70%, since we have observed that the smaller data set is more sensitive to a less conservative (lower) C parameter value and thus should use higher C values. For both data sets, our results are slightly better, except when C is reduced to 30% or less.
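One way to tally the sequence-level errors behind this comparison is sketched below, following the two error types described above. How overlap between the two types is handled, and the data structure, are assumptions for illustration only.

```python
# Sketch: count mistakenly clustered sequences plus sequences that ended up in
# clusters with less than 50% purity, avoiding double counting (an assumption).
from collections import Counter

def sequence_error_count(clusters):
    """clusters: list of lists, each inner list holding the ground-truth
    identity of every sequence assigned to that cluster."""
    errors = 0
    for labels in clusters:
        if not labels:
            continue
        counts = Counter(labels)
        majority_label, majority_size = counts.most_common(1)[0]
        if majority_size / len(labels) < 0.5:
            errors += len(labels)                  # low-purity cluster: all count
        else:
            errors += len(labels) - majority_size  # only mistakenly clustered ones
    return errors
```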
The comparison to Berg et al. is not as straightforward, as their approach is image-based and the clustering is performed using face data along with captioned text information. The performance metric is the error rate of false cluster assignment. Our results are comparable to their smaller data set and better than their larger data set, except when C is reduced to 0% for our data set of 500 sequences.
Figure 5-31: Comparison to Raytchev and Murase’s unsupervised video-based face recognition system.

Figure 5-32: Comparison to Berg et al’s unsupervised clustering of face images and captioned text.
5.6 Discussion
We have evaluated the unsupervised face clustering system by itself. We have also
evaluated an integrated system that uses the face clustering solution to incrementally
build a labelled training set and uses it to perform supervised recognition.
The presented solutions and results indicate promising steps for an unsupervised
and incremental face recognition system. The face clustering algorithm yields good
performance despite the extremely noisy data automatically collected and segmented
by the robot through spontaneous interactions with many passersby in a difficult
public environment. The face data contains a large number of poorly segmented faces,
faces of varying poses and facial expressions, and even non-face images. Moreover, the
face clustering algorithm is more robust to merging errors when more data is available.
For larger data sets, the face clustering algorithm generated stable performance across
a wide range of parameter settings.
The current implementation of the face clustering system and its supervised variant has not been optimized for speed. For each face sequence input, it currently takes
5-15 minutes to process, extract features, and determine where it should be placed
among the existing clusters. The supervised recognition process currently takes 1-2
seconds per face image. The next research step would be to optimize the computational speed of these two systems, particularly the supervised recognition process.
The most obvious candidate for optimization is the feature representation of the face sequences. We currently use 50 SIFT feature prototypes per
region to represent each face sequence. Most likely, some of these prototypes are very
useful and some are probably irrelevant. Thus, pruning these feature prototypes will
not only increase computational speed, but also improve the clustering algorithm.
Moreover, when a set of face sequences are combined into a cluster, we currently
retain all of their features to represent a class or person. In cases where a person’s
cluster has a large number of sequences, we prune them by selecting those which
have been most frequently selected as a match throughout the incremental clustering
process. This pruning of sequences within a cluster can be performed more efficiently. The combination of pruning feature prototypes within a face sequence and
pruning face sequences within a cluster would increase the computational speed of
both the face clustering system and its supervised variant.
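A minimal sketch of the pruning idea discussed above is given below: keep only the feature prototypes, or the member sequences, that have most often been selected as matches during incremental clustering. The match-count bookkeeping and the budget values are illustrative assumptions, not the thesis's actual parameters.

```python
# Sketch only: rank items by how often they were chosen as a match and keep the
# top `budget`. Works for SIFT prototypes within a sequence or for sequences
# within a cluster; the counts are assumed to be maintained elsewhere.

def prune_by_match_frequency(items, match_counts, budget):
    """items: feature prototypes of a sequence, or sequences of a cluster.
    match_counts: dict mapping each item to how often it was chosen as a match."""
    ranked = sorted(items, key=lambda item: match_counts.get(item, 0), reverse=True)
    return ranked[:budget]

# Hypothetical usage: keep the 20 most frequently matched of 50 prototypes, and
# the 30 most frequently matched sequences within a large cluster.
# pruned_prototypes = prune_by_match_frequency(prototypes, proto_counts, 20)
# pruned_sequences = prune_by_match_frequency(cluster, sequence_counts, 30)
```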
Eventually, we believe that an ultimate unsupervised recognition system for a home robot, one that is fully robust and capable of learning to recognize people in any environmental setting, would require additional perceptual cues and contextual information. Thus, the learning system can essentially combine different sources of information to make a robust unsupervised decision. This additional information may come in the form of multi-modal perceptual inputs, associated information such as names, or a reward signal. The use of this coupling of information was already explored by Berg et al. [5]. The robot can also actively acquire this additional information by using its social interface, e.g. by verbally asking for people’s names or for confirmation of one’s identity, to assist its learning task.
Chapter 6
Conclusion
We present an integrated end-to-end incremental and fully unsupervised face recognition framework within a robotic platform embedded in a real human environment.
The robot autonomously detects, tracks, and segments face images during spontaneous interactions with many passersby in public spaces and automatically generates
a training set for its face recognition system.
We present the robot implementation and its unsupervised incremental face recognition framework. We demonstrate the robot’s capabilities and limitations in a series
of experiments in a public lobby. In a final experiment, the robot interacted with a few hundred individuals over an eight-day period and generated a data set of over a
hundred thousand face images.
We describe an algorithm for clustering local features extracted from a large set
of automatically generated face data. This algorithm is based on a face sequence
matching algorithm which shows robust performance despite the noisy data. We
evaluate the clustering algorithm performance across a range of parameters on this
automatically generated training data and also on the Honda-UCSD video face database.
Using the face clustering solution, we implemented an integrated system for unsupervised and incremental face recognition. We evaluated this system using the face
data automatically generated by the robot during the final experiment. During this
incremental clustering and recognition test, the system builds a labelled training set
in a fully unsupervised way and then uses this training set to recognize each sequence
input. The system hypothesizes correctly 74% of the time that the sequence input
belongs to an unknown person who does not exist in the training set. After an initial
learning period, when the system hypothesizes that the sequence input belongs to a
specific known person, it is correct roughly 80% of the time.
6.1 Lessons in Extending Robustness
In this thesis, we learned a number of lessons and identified a set of challenges during the process of extending the space and time in which the robot operates. Although it is still difficult for humanoid robots to operate robustly in noisy environments, the issue of robustness has not received adequate attention in most research projects. Since robots will ultimately have to operate beyond the
scope of short video clips and end-of-project demonstrations, we believe that a better
understanding of these challenges is valuable for motivating further work in various
areas contributing to this interdisciplinary endeavor.
Perception is often blamed as one of the biggest hurdles in robotics, and it certainly has posed many difficulties in our case. We generally found that many existing vision and speech technologies are not suitable for our setting and constraints. Vision algorithms for static cameras are unusable because both cameras pan independently. The desktop microphone required for natural interaction with multiple people yields decreased performance compared to the headset typically used for
speech recognition. Drastic lighting changes inside the building and conducting experiments in different locations have forced us to go through many iterations of the
robot’s perceptual systems. Something that works in the morning at the laboratory
may no longer work in the evening or at another location. Many automatic adaptive mechanisms, such as for the camera’s internal parameters to deal with lighting
changes throughout the day, are now necessary.
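As a generic example of the kind of automatic adaptation referred to here, the sketch below shows a simple proportional controller that nudges a camera gain toward a target mean brightness. This is not the robot's actual mechanism, only an illustration of the idea; the target value and step size are assumptions.

```python
# Illustrative auto-adjustment loop for a camera parameter based on the mean
# brightness of the latest frame; parameter values are assumed, not measured.

def adapt_gain(frame_mean_brightness, current_gain,
               target=128.0, step=0.01, gain_min=0.0, gain_max=1.0):
    """frame_mean_brightness: mean pixel intensity of the latest frame (0-255)."""
    error = target - frame_mean_brightness
    new_gain = current_gain + step * (error / 255.0)
    return min(gain_max, max(gain_min, new_gain))
```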
For a robotic creature that continuously learns while living in its environment,
there is no separation between the learning and testing periods. The two are blurred
together and often occur in parallel. Mertz has to continually locate learning targets and carefully observe them to learn about them. These two tasks are conflicting
in many ways. The perceptual system is thus divided between fast but less precise
processes for the first task and slower but more accurate algorithms for the latter.
Similarly, the attention system has to balance between being reactive to new salient stimuli and persistent in observing the current learning target. This dichotomy is interestingly reflected in the what and where pathways of our visual cortex, as well as the
endogenous and exogenous control of visual attention.
Humans’ tendency to anthropomorphize generally makes the robot’s task of social interaction simpler. However, requiring the robot to interact with multiple people
for an extended duration has called for a more sophisticated social interface. One can
imagine that a friendly robot that makes eye contact and mimics your speech can be
quite entertaining, but not for too long. While the premise that social interaction can
guide robot learning is promising, it also suffers from the “chicken and egg” problem
in a long-term setting, i.e. in order to sustain an extended interaction, the robot also
needs to be able to learn and retain memory of past events.
In all engineering disciplines, we tend to focus on maximizing task performance.
Whenever people are present, Mertz’s task is to detect and engage them in interaction.
We learned that when the robot is on all the time, in addition to its tasks, the
robot also has to deal with down time, when no one is around. All of a sudden
the environment’s background and false positive detection errors become a big issue.
During an experiment session, the robot kept falsely detecting a face on the ceiling,
stared at it all day, and ignored everything else. Lastly, as the software complexity grows, it becomes harder to keep the entire system running for many hours.
Memory leaks and infrequent bugs emerging from subsystem interactions are very
difficult to track. Moreover, a robot that runs for many hours per day and learns
from its experiences can easily generate hundreds of gigabytes of data. While having
a lot of data is undeniably useful, figuring out how to automatically filter, store, and
retrieve it in real time is an engineering feat.
6.2 Future Work
As we discussed in section 5.6, the next research step would be to optimize the
computational speed of these two systems, particularly the supervised recognition
process. This would allow for an online evaluation of the integrated incremental face
recognition system. We have also suggested some possible optimization steps.
A natural extension to this thesis is to integrate voice recognition. The robot’s
spatio-temporal learning mechanism currently allows the robot to generate and segment not only face sequences, but also associated voice samples from the corresponding individual. Integration of voice recognition can assist both the clustering and
recognition process. Berg et al. showed that additional information given by a set of
names extracted from captioned text can assist in improving the accuracy of clustering of face images from an unlabelled data set of captioned news images. Similarly,
additional information given by voice recognition can be used to help the face clustering process. Moreover, during the supervised recognition part, an additional source
of recognition hypotheses will be very useful.
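One simple way such a second source of hypotheses could be combined with the face-based one is sketched below: fuse per-person scores with a weighted sum and fall back to an unknown-person hypothesis below a threshold. The weighting scheme and the threshold are assumptions for illustration, not components of the thesis.

```python
# Sketch: fuse a face-based and a voice-based recognition hypothesis.
# Weights and threshold are illustrative assumptions.

def fuse_hypotheses(face_scores, voice_scores, face_weight=0.7,
                    unknown_threshold=0.4):
    """face_scores, voice_scores: dicts mapping person label -> score in [0, 1]."""
    labels = set(face_scores) | set(voice_scores)
    combined = {
        label: face_weight * face_scores.get(label, 0.0)
               + (1.0 - face_weight) * voice_scores.get(label, 0.0)
        for label in labels
    }
    best_label = max(combined, key=combined.get) if combined else None
    if best_label is None or combined[best_label] < unknown_threshold:
        return "unknown"
    return best_label
```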
The face, especially when segmented without any hair, provides very limited information for individual recognition. Additional information will in fact be necessary, as we discussed in section 5.6, to achieve an ultimate incremental individual recognition system that is fully robust. This additional information can come from other visual cues, multi-modal signals, associated features such as people’s names, contextual information, and active learning. What we mean by active learning is to utilize the robot’s social behavior to actively ask specific individuals for information to
assist its learning task. For example, the robot may ask someone if the robot has ever
seen them before. Alternatively, the robot may ask someone to double-check whether they are in fact who the robot thinks they are.
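A sketch of an active-learning query policy along these lines is given below: when the recognition score is confidently high or low the robot acts on its own hypothesis, and in the uncertain band in between it asks the person for confirmation. The band boundaries and the phrasing are illustrative assumptions.

```python
# Sketch only: choose a social utterance based on recognition confidence.
# The thresholds defining the uncertain band are assumed values.

def choose_social_action(best_label, best_score, low=0.35, high=0.75):
    if best_score >= high:
        return f"Hi {best_label}, nice to see you again."
    if best_score <= low:
        return "Hello! Have we met before?"
    return f"Are you {best_label}? I want to make sure I remember you correctly."
```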
Lastly, it would be very interesting to take the next step of closing the loop from
the recognition output to the robot’s behavior system. This would allow for a social
recognition mechanism, where the robot can not only learn to recognize people, but
also to adapt its behavior based on the robot’s previous experience with specific
individuals.
Appendix A
Experiment Protocol
• The content of a large poster placed in front of the robot platform during
the experiment:
Experiment Notice
Hello, my name is Mertz. Please interact with me. I am trying to learn to
recognize different people.
This is an experiment to study human-robot interaction. The robot is collecting
data to learn from its experience.
Please be aware that the robot is recording both visual and audio input.
• The content of a small poster placed on the robot platform during the experiment:
Hello my name is Mertz. Please interact and speak to me. I am trying to learn
to recognize different people.
I can see and hear. I may ask you to come closer because I cannot see very far.
I don’t understand any language, but I am trying to learn and repeat simple
words that you say.
• The content of the email sent to the recruited subjects:
Thank you again for agreeing to participate in this experiment. The robot’s
schedule and location will be posted each day at:
http://people.csail.mit.edu/lijin/schedule.html
Tomorrow (monday nov 22), the robot will be at the student street, 1st floor,
from 12 - 7 pm. The first couple days will be ’test runs’, so I apologize if you
may find the robot not working properly or being repaired.
With the exception of the first day, the robot will generally be running from
8.30am - 7pm, at either the 1st or 4th floor of Stata. Please feel free to come on
whichever days and times that work best with your schedule and travel plans.
It would be great if the robot can see and talk to you on multiple occasions
during the second week (from Monday Nov 27 onward).
Some written information about the experiment will be posted near the robot,
in order to ensure that everyone including the passersby receives the same instructions.
If you prefer that I send you an email for each schedule update (instead of
looking up online) or have any questions/comments, please let me know.
Appendix B
Sample Face Clusters
The following are a set of sample face clusters, formed by the unsupervised clustering
algorithm. See Chapter 5 for the implementation and evaluation of the clustering
system.
Figure B-1: An example of a falsely merged face cluster.

Figure B-2: An example of a falsely merged face cluster.

Figure B-3: An example of a good face cluster containing sequences from multiple days.

Figure B-4: An example of a good face cluster.

Figure B-5: An example of a non-face cluster.

Figure B-6: An example of a good face cluster containing sequences from multiple days.

Figure B-7: An example of a good face cluster.

Figure B-8: An example of a falsely merged face cluster.

Figure B-9: An example of a good face cluster containing sequences from multiple days.

Figure B-10: An example of a good face cluster.

Figure B-11: An example of a good face cluster.

Figure B-12: An example of a good face cluster.

Figure B-13: An example of a good face cluster.

Figure B-14: An example of a good face cluster.

Figure B-15: An example of a good face cluster.

Figure B-16: An example of a good face cluster.

Figure B-17: An example of a good face cluster containing sequences from multiple days.

Figure B-18: An example of a good face cluster.

Figure B-19: An example of a good face cluster.
Appendix C
Results of Clustering Evaluation
The following is a set of clustering evaluation results. We extract a number of data
sets of different sizes from the collected training data. Each data set was formed
by taking contiguous segments from the robot’s final training data in the order of their appearance during the experiment. These results were generated using the data set sizes
and parameter values listed in table 5.1.
Figure C-1: Results of clustering evaluation of a data set of 30 sequences, C = 0%. The upper figure shows the results from the entire data set and the lower figure shows a subset of the results from the 5 individuals who have the most sequences in the data set. The topmost subplot illustrates the number of good, perfect, and none resulting clusters. The second subplot shows the total number of merging errors and their distribution over different merge purity values. The third subplot shows the maximum number of individuals merged into a single cluster. The fourth subplot shows the number of splitting errors and their distribution over different split degree values. The rest of the plots in this section follow the same structure.
Figure C-2: Results of clustering evaluation of a data set of 30 sequences, C = 70%.
The upper figure shows the results from the entire data set and the lower figure shows
a subset of the results from 5 individuals who have the most number of sequences in
the data set.
Figure C-3: Results of clustering evaluation of a data set of 300 sequences, C = 0%.
The upper figure shows the results from the entire data set and the lower figure shows
a subset of the results from 5 individuals who have the most number of sequences in
the data set.
Figure C-4: Results of clustering evaluation of a data set of 300 sequences, C = 30%.
The upper figure shows the results from the entire data set and the lower figure shows
a subset of the results from 5 individuals who have the most number of sequences in
the data set.
Figure C-5: Results of clustering evaluation of a data set of 300 sequences, C = 50%.
The upper figure shows the results from the entire data set and the lower figure shows
a subset of the results from 5 individuals who have the most number of sequences in
the data set.
Figure C-6: Results of clustering evaluation of a data set of 300 sequences, C = 70%.
The upper figure shows the results from the entire data set and the lower figure shows
a subset of the results from 5 individuals who have the most number of sequences in
the data set.
Figure C-7: Results of clustering evaluation of a data set of 500 sequences, C = 0%.
The upper figure shows the results from the entire data set and the lower figure shows
a subset of the results from 5 individuals who have the most number of sequences in
the data set.
Figure C-8: Results of clustering evaluation of a data set of 500 sequences, C = 30%.
The upper figure shows the results from the entire data set and the lower figure shows
a subset of the results from 5 individuals who have the most number of sequences in
the data set.
Figure C-9: Results of clustering evaluation of a data set of 500 sequences, C = 50%.
The upper figure shows the results from the entire data set and the lower figure shows
a subset of the results from 5 individuals who have the most number of sequences in
the data set.
Figure C-10: Results of clustering evaluation of a data set of 500 sequences, C = 70%.
The upper figure shows the results from the entire data set and the lower figure shows
a subset of the results from 5 individuals who have the most number of sequences in
the data set.
Figure C-11: Results of clustering evaluation of a data set of 500 sequences, C = 0%.
The upper figure shows the results from the entire data set and the lower figure shows
a subset of the results from 5 individuals who have the most number of sequences in
the data set.
Figure C-12: Results of clustering evaluation of a data set of 500 sequences, C = 30%.
The upper figure shows the results from the entire data set and the lower figure shows
a subset of the results from 5 individuals who have the most number of sequences in
the data set.
Figure C-13: Results of clustering evaluation of a data set of 500 sequences, C = 50%.
The upper figure shows the results from the entire data set and the lower figure shows
a subset of the results from 5 individuals who have the most number of sequences in
the data set.
Figure C-14: Results of clustering evaluation of a data set of 500 sequences, C = 70%.
The upper figure shows the results from the entire data set and the lower figure shows
a subset of the results from 5 individuals who have the most number of sequences in
the data set.
Figure C-15: Results of clustering evaluation of a data set of 500 sequences, C = 0%.
The upper figure shows the results from the entire data set and the lower figure shows
a subset of the results from 5 individuals who have the most number of sequences in
the data set.
Figure C-16: Results of clustering evaluation of a data set of 500 sequences, C = 30%.
The upper figure shows the results from the entire data set and the lower figure shows
a subset of the results from 5 individuals who have the most number of sequences in
the data set.
Figure C-17: Results of clustering evaluation of a data set of 500 sequences, C = 50%.
The upper figure shows the results from the entire data set and the lower figure shows
a subset of the results from 5 individuals who have the most number of sequences in
the data set.
Figure C-18: Results of clustering evaluation of a data set of 500 sequences, C = 70%.
The upper figure shows the results from the entire data set and the lower figure shows
a subset of the results from 5 individuals who have the most number of sequences in
the data set.
Figure C-19: Results of clustering evaluation of a data set of 700 sequences, C = 0%.
The upper figure shows the results from the entire data set and the lower figure shows
a subset of the results from 5 individuals who have the most number of sequences in
the data set.
Figure C-20: Results of clustering evaluation of a data set of 700 sequences, C = 30%.
The upper figure shows the results from the entire data set and the lower figure shows
a subset of the results from 5 individuals who have the most number of sequences in
the data set.
Figure C-21: Results of clustering evaluation of a data set of 700 sequences, C = 50%.
The upper figure shows the results from the entire data set and the lower figure shows
a subset of the results from 5 individuals who have the most number of sequences in
the data set.
Figure C-22: Results of clustering evaluation of a data set of 700 sequences, C = 70%.
The upper figure shows the results from the entire data set and the lower figure shows
a subset of the results from 5 individuals who have the most number of sequences in
the data set.
Figure C-23: Results of clustering evaluation of a data set of 700 sequences, C = 0%.
The upper figure shows the results from the entire data set and the lower figure shows
a subset of the results from 5 individuals who have the most number of sequences in
the data set.
Figure C-24: Results of clustering evaluation of a data set of 700 sequences, C = 30%.
The upper figure shows the results from the entire data set and the lower figure shows
a subset of the results from 5 individuals who have the most number of sequences in
the data set.
Figure C-25: Results of clustering evaluation of a data set of 700 sequences, C = 50%.
The upper figure shows the results from the entire data set and the lower figure shows
a subset of the results from 5 individuals who have the most number of sequences in
the data set.
Figure C-26: Results of clustering evaluation of a data set of 700 sequences, C = 70%.
The upper figure shows the results from the entire data set and the lower figure shows
a subset of the results from 5 individuals who have the most number of sequences in
the data set.
Figure C-27: Results of clustering evaluation of a data set of 1000 sequences, C = 0%.
The upper figure shows the results from the entire data set and the lower figure shows
a subset of the results from 5 individuals who have the most number of sequences in
the data set.
Figure C-28: Results of clustering evaluation of a data set of 1000 sequences, C =
30%. The upper figure shows the results from the entire data set and the lower
figure shows a subset of the results from 5 individuals who have the most number of
sequences in the data set.
Figure C-29: Results of clustering evaluation of a data set of 1000 sequences, C =
50%. The upper figure shows the results from the entire data set and the lower
figure shows a subset of the results from 5 individuals who have the most number of
sequences in the data set.
Figure C-30: Results of clustering evaluation of a data set of 1000 sequences, C =
70%. The upper figure shows the results from the entire data set and the lower
figure shows a subset of the results from 5 individuals who have the most number of
sequences in the data set.
Figure C-31: Results of clustering evaluation of a data set of 2025 sequences, C = 0%.
The upper figure shows the results from the entire data set and the lower figure shows
a subset of the results from 5 individuals who have the most number of sequences in
the data set.
Figure C-32: Results of clustering evaluation of a data set of 2025 sequences, C =
30%. The upper figure shows the results from the entire data set and the lower
figure shows a subset of the results from 5 individuals who have the most number of
sequences in the data set.
Figure C-33: Results of clustering evaluation of a data set of 2025 sequences, C =
50%. The upper figure shows the results from the entire data set and the lower
figure shows a subset of the results from 5 individuals who have the most number of
sequences in the data set.
Figure C-34: Results of clustering evaluation of a data set of 2025 sequences, C =
70%. The upper figure shows the results from the entire data set and the lower
figure shows a subset of the results from 5 individuals who have the most number of
sequences in the data set.
Appendix D
Sample Clusters of the Familiar Individuals
The following is a set of sample clusters of familiar individuals that were constructed during the incremental recognition process, as described in section 5.4. Some sample clusters have already been shown earlier in section 5.4.3. Note that not all of the sequences are shown in these figures, due to visibility and space constraints. However, we make sure to include all of the merging errors, when present.
Figure D-1: The generated cluster of familiar individual 2. Individual 2 is the same
person as individual 1, shown in section 5.4.3
228
Figure D-2: The generated cluster of familiar individual 3. Individual 3 is the same
person as individual 1, shown in section 5.4.3. Note that the five sequences in the
last row are falsely merged from another person.
Figure D-3: The generated cluster of familiar individual 4. Individual 4 is the same
person as individual 1, shown in section 5.4.3.
Figure D-4: The generated cluster of familiar individual 6. Individual 6 came to interact with the robot on two different days during the experiment. The red rectangle marks the sequences that were falsely merged into the cluster from another individual.
Figure D-5: The generated cluster of familiar individual 9. The red rectangle marks the sequences that were falsely merged into the cluster from another individual.
Figure D-6: The generated cluster of familiar individual 10.
Bibliography
[1] J.S. Albus. Outline for a theory of intelligence. IEEE Transactions on Systems, Man, and Cybernetics, vol. 21, no. 3, 1991.
[2] R. Arkin. Behavior-based Robotics. MIT Press, Cambridge, Massachusetts,
1998.
[3] C. Bartneck and J. Forlizzi. Shaping human-robot interaction: understanding
the social aspects of intelligent robotic products. Conference on Human Factors
in Computing Systems, 2004.
[4] J. Bates. The role of emotion in believable agents. Communications of the
ACM, Special Issue on Agents, 1994.
[5] T. Berg, A. Berg, J. Edwards, M. Maire, R. White, Y. Teh, E. Learned-Miller,
and D. Forsyth. Names and faces in the news. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004.
[6] M. Bicego, A. Lagorio, E. Grosso, and M. Tistarelli. On the use of sift features
for face authentication. Proc. of IEEE Int Workshop on Biometrics, 2006.
[7] S. Birchfield. Klt: An implementation of the kanade-lucas-tomasi feature tracker. http://www.ces.clemson.edu/~stb/klt/.
[8] R. Bischoff and V. Graefe. Design principles for dependable robotic assistants.
International Journal of Humanoid Robotics, vol. 1, no. 1 95–125, 2004.
[9] Volker Blanz and Thomas Vetter. Face recognition based on fitting a 3d morphable model. IEEE Trans. Pattern Analysis and Machine Intelligence, 25(9),
2003.
[10] G. R. Bradski. Computer vision face tracking for use in a perceptual user
interface. Intel Technology Journal, Q2:1-15, 1998.
[11] C. Breazeal. Sociable machines: Expressive social exchange between humans
and robots. Sc.D. dissertation, Department of EECS, MIT, 2000.
[12] C. Breazeal. Emotion and sociable humanoid robots. International Journal of
Human-Computer Studies, 59, pp.119-155, 2003.
[13] C. Breazeal, A. Brooks, J. Gray, G. Hoffman, C. Kidd, H. Lee, J. Lieberman,
A. Lockerd, and D. Mulanda. Humanoid robots as cooperative partners for
people. International Journal of Humanoid Robotics, 2004.
[14] C. Breazeal and B. Scassellati. A context-dependent attention system for a
social robot. Proceedings of the Sixteenth International Joint Conference on
Artificial Intelligence, 1999.
[15] Cynthia Breazeal, Aaron Edsinger, Paul Fitzpatrick, Brian Scassellati, and
Paulina Varchavskaia. Social constraints on animate vision. IEEE Intelligent
Systems, 15(4):32–37, 2000.
[16] R. Brooks. Intelligence without representation. Artificial Intelligence Journal
47:139-160, 1991.
[17] R. Brooks, C. Breazeal, M. Marjanovic, B. Scassellati, and M. Williamson. The
cog project: Building a humanoid robot. Computation for Metaphors, Analogy and Agents, Vol. 1562 of Springer Lecture Notes in Artificial Intelligence,
Springer-Verlag, 1998.
[18] R.A. Brooks. A robust layered control system for a mobile robot. IEEE Journal
of Robotics and Automation 2:1:14–23, 1986.
[19] R.A. Brooks and C. Rosenberg. L - a common lisp for embedded systems. Association of Lisp Users Meeting and Workshop, 1995.
[20] A. Bruce, I. Nourbakhsh, and R. Simmons. The role of expressiveness and
attention in human-robot interaction. Proc. AAAI Fall Symp. Emotional and
Intel. II: The Tangled Knot of Soc. Cognition, 2001.
[21] R. Brunelli and D. Falavigna. Person identification using multiple cues. IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 17, No. 10,
1995.
[22] W. Burgard, D. Fox, D. Hähnel, G. Lakemeyer, D. Schulz, W. Steiner, S. Thrun, and A.B. Cremers. Real robots for the real world – the rhino museum tour-guide project. Proc. of the AAAI Spring Symposium on Integrating Robotics Research, Taking the Next Leap, Stanford, CA, 1998.
[23] N. Butko, I. Fasel, and J. Movellan. Learning about humans during the first 6
minutes of life. Proceedings of the 5th International Conference on Development
and Learning, 2006.
[24] R. Caldwell. Recognition, signalling and reduced aggression between former
mates in a stomatopod. Animal Behavior 44,11-19, 1992.
[25] T. Choudhury, B. Clarkson, T. Jebara, and A. Pentland. Multimodal person
recognition using unconstrained audio and video. Proceedings of the Second
Conference on Audio- and Video-based Biometric Person Authentication, 1999.
[26] K. Dautenhahn. Getting to know each other - artificial social intelligence for
autonomous robots. Robotics and Autonomous Systems 16:333-356, 1995.
[27] M. S. Dawkins. Distance and social recognition in hens: Implications of the use
of photographs as social stimuli. Behavior 133:9-10,663-680, 1996.
[28] J. Dewey. Experience and education. New York: Macmillan, 1938.
[29] C. DiSalvo, F. Gemperle, J. Forlizzi, and S. Kiesler. All robots are not created equal: The design and perception of humanoid robot heads. Conference
Proceedings of Designing Interactive Systems, London, England, 2002.
[30] S. Eickeler, F. Wallhoff, U. Iurgel, and G. Rigoll. Content-based indexing of
images and videos using face detection and recognition methods. IEEE Int.
Conference on Acoustics, Speech, and Signal Processing, 2001.
[31] W. Fisher. Program tsylb (version 2 revision 1.1). NIST, 7 August 1996.
[32] T. Fong, I. Nourbakhsh, and K. Dautenhahn. A survey of socially interactive
robots. Robotics and Autonomous Systems 42:143-166, 2003.
[33] G.L. Foresti and C. Micheloni. Real-time video-surveillance by an active camera. Ottavo Convegno Associazione Italiana Intelligenza Artificiale (AI*IA)
- Workshop sulla Percezione e Visione nelle Macchine, Universita di Siena,
September 11-13, 2002.
[34] Simone Frintrop, Gerriet Backer, and Erich Rome. Selecting what is important: Training visual attention. Proceedings of the 28th German Conference on
Artificial Intelligence, 2005.
[35] S. Furui. An overview of speaker recognition technology. ESCA Workshop on
Automatic Speaker Recognition Identification Verification, 1994.
[36] R. Gockley, A. Bruce, J. Forlizzi, M. Michalowski, A. Mundell, S. Rosenthal,
B. Selner, R. Simmons, K. Snipes, A. Schultz, and J. Wang. Designing robots for
long-term social interaction. IEEE/RSJ International Conference on Intelligent
Robots and Systems, 2005.
[37] M. Goodale and A. Milner. Separate pathways for perception and action. Trends
in Neuroscience 15: 20-25, 1992.
[38] D. Gorodnichy. Video-based framework for face recognition in video. Proc. of
the Second Canadian Conference on Computer and Robot Vision, 2005.
[39] C. Harris and M. Stephens. A combined corner and edge detector. Alvey Vision
Conference, pp 147-152, 1988.
[40] R. Hewitt and S. Belongie. Active learning in face recognition: Using tracking
to build a face model. Proceedings of the Conference on Computer Vision and
Pattern Recognition Workshop, 2006.
[41] R.A. Peters II, K.E. Hambuchen, K. Kawamura, and D.M. Wilkes. The sensory egosphere as a short-term memory for humanoids. Proc. 2nd IEEE-RAS
International Conference on Humanoid Robots, pp 451-459, 2001.
[42] L. Itti. Models of bottom-up and top-down visual attention. Ph.D. Thesis,
California Institute of Technology, 2000.
[43] T. Kanda, T. Hirano, D. Eaton, and H. Ishiguro. Interactive robots as social
partners and peer tutors for children: A field trial. Human Computer Interaction, Vol. 19, No. 1-2, pp. 61-84, 2004.
[44] T. Kanungo, D. M. Mount, N. Netanyahu, C. Piatko, R. Silverman, and A. Wu. Efficient algorithms for k-means clustering. http://www.cs.umd.edu/~mount/Projects/KMeans/.
[45] T. Kanungo, D. M. Mount, N. Netanyahu, C. Piatko, R. Silverman, and A. Wu.
An efficient k-means clustering algorithm: Analysis and implementation. IEEE
Trans. Pattern Analysis and Machine Intelligence, 24:881-892, 2002.
[46] J. Kittler, Y. Li, J. Matas, and M. Ramos Sanchez. Combining evidence in
multimodal personal identity recognition systems. International Conference
on Audio and Video-based Biometric Person Authentication, 1997.
[47] H. Kozima. Infanoid: A babybot that explores the social environment. Socially Intelligent Agents: Creating Relationships with Computers and Robots,
Amsterdam: Kluwer Academic Publishers, pp.157-164, 2002.
[48] H. Kozima, C. Nakagawa, and Y. Yasuda. Interactive robots for communication-care: A case-study in autism therapy. Proceedings of the IEEE International
Workshop on Robot and Human Interactive Communication, 2005.
[49] E. Labinowicz. The Piaget Primer: Thinking, Learning, Teaching. Addison-Wesley, 1980.
[50] K.C. Lee, J. Ho, M.H. Yang, and D. Kriegman. Video-based face recognition
using probabilistic appearance manifolds. IEEE Conf. On Computer Vision
and Pattern Recognition, 1:313–320, 2003.
[51] K.C. Lee, J. Ho, M.H. Yang, and D. Kriegman. Visual tracking and recognition using probabilistic appearance manifolds. Computer Vision and Image
Understanding, 99:303–331, 2005.
[52] D.B. Lenat. Cyc: A large-scale investment in knowledge infrastructure. Communications of the ACM 38, no. 11, November, 1995.
[53] X. Liu and T. Chen. Video-based face recognition using adaptive hidden markov
models. Proc. of the IEEE Computer Society Conference on Computer Vision
and Pattern Recognition, 2003.
[54] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60, 2, pp. 91-110, 2004.
[55] J. Luettin. Visual Speech and Speaker Recognition. PhD thesis, University of
Sheffield, 1997.
[56] M. Lungarella, G. Metta, R. Pfeifer, and G. Sandini. Developmental robotics:
A survey. Connection Science, vol.00 no.0:1-40, 2004.
[57] S. Macskassy and H. Hirsh. Towards personable agents: A survey. 1998.
[58] J. Mateo. Recognition systems and biological organization: The perception
component of social recognition. Annales Zoologici Fennici, 41, 729-745, 2004.
[59] G. Metta, P. Fitzpatrick, and L. Natale. Yarp: Yet another robot platform.
International Journal of Advanced Robotics Systems, special issue on Software
Development and Integration in Robotics, Volume 3(1), 2006.
[60] A. S. Mian, M. Bennamoun, and R. Owens. Face recognition using 2d and
3d multimodal local features. International Symposium on Visual Computing,
2006.
[61] H. Miwa, T. Okuchi, K. Itoh, H. Takanobu, and A. Takanishi. A new mental
model for humanoid robots for human friendly communication introduction of
learning system, mood vector and second order equations of emotion. Robotics
and Automation, 2003.
[62] H. Miwa, T. Okuchi, H. Takanobu, and A. Takanishi. Development of a new
human-like head robot we-4r. Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pp.2443-2448, 2002.
[63] M. Montemerlo, J. Pineau, N. Roy, S. Thrun, and V. Verma. Experiences
with a mobile robotic guide for the elderly. Proceedings of the AAAI National
Conference on Artificial Intelligence, 2002.
[64] M. Mori. The buddha in the robot. Charles E. Tuttle Co., 1982.
[65] R. Mosur. Sphinx-ii user guide.
[66] D. Mou. Autonomous Face Recognition. PhD thesis, University of Ulm, 2005.
[67] J. Movellan, F. Tanaka, B. Fortenberry, and K. Aisaka. The rubi/qrio project,
origins, principles, and first steps. International Journal of Human-Computer
Studies, 59, pp.119-155., 2005.
[68] C. Nakajima, M. Pontil, B. Heisele, and T. Poggio. People recognition in image
sequences by supervised learning. A.I. Memo No. 1688, C.B.C.L. Paper No.
188. MIT, 2000.
[69] C. Nass and S. Brave. Emotion in human-computer interaction. In The human-computer interaction handbook: fundamentals, evolving technologies and emerging applications, Lawrence Erlbaum Associates, 2002.
[70] L. Natale. Linking Action to Perception in a Humanoid Robot: A Developmental
Approach to Grasping. PhD thesis, University of Genova, 2004.
[71] I. Nourbakhsh, C. Kunz, and T. Willeke. The mobot museum robot installations: A five year experiment. Proceedings of the IEEE/RSJ International
Conference on Intelligent Robots and Systems, Las Vegas, 2003.
[72] T. Ogata, Y. Matsuyama, T. Komiya, M. Ida, K. Noda, and S. Sugano. Development of emotional communication robot, wamoeba-2r - experimental evaluation of the emotional communication between robots and humans. Proceedings
of the IEEE/RSJ International Conference on Intelligent Robots and Systems,
2000.
[73] S. Palmer. Vision science: Photons to phenomenology. MIT Press, 1999.
[74] R. Pfeifer and J. Bongard. How the Body Shapes the Way We Think. MIT
Press, 2006.
[75] P. Phillips, P. Grother, R. Micheals, D. Blackburn, E. Tabassi, and J. Bone.
Frvt 2002 evaluation report. 2003.
[76] P. Phillips, H. Moon, S. Rizvi, and P. Rauss. The feret evaluation methodology
for face recognition algorithms. IEEE Transactions on Pattern Analysis and
Machine Intelligence 22 no. 10, pp 1090-1104, 2000.
[77] J. Piaget. The Origins of Intelligence in Children. Routledge and Kegan Paul,
1953.
[78] R. Porter, L. Desire, R. Bon, and P. Orgeur. The role of familiarity in the
development of social recognition by lambs. Behavior 138:2,207-219, 2004.
[79] M. I. Posner. Orienting of attention. Quarterly Journal of Experimental Psychology, 32:3-25, 1980.
[80] J.E. Pratt. Virtual model control of a biped walking robot. Tech. Rep. AITR-1581, MIT Artificial Intelligence Laboratory, Cambridge, MA, USA, 1995.
[81] B. Raytchev and H. Murase. Unsupervised face recognition by associative chaining. Pattern Recognition, Vol. 36, No. 1, pp. 245-257, 2003.
[82] Intel Research. Open source computer vision library. http://www.intel.com/technology/computing/opencv/.
[83] L. Sayigh, P. Tyack, R. Wells, A. Solow, M. Scott, and A. Irvine. Individual
recognition in wild bottlenose dolphins: a field test using playback experiments.
Animal Behavior 57(1):41-50, 1999.
[84] S. Schaal. Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences 3:233-242, 1999.
[85] M. Schwebel. The role of experience in cognitive development. Presented at the
5th Invitational Interdisciplinary Seminar, University of Southern California,
1975.
[86] S. Shan, Y. Chang, and W. Gao. Curse of misalignment in face recognition:
Problem and a novel misalignment learning solution. IEEE International Conference on Automatic Face and Gesture Recognition, 2004.
[87] J. Shi and C. Tomasi. Good features to track. IEEE Conference on Computer
Vision and Pattern Recognition, pages 593-600, 1994.
[88] M. Shiomi, T. Kanda, H. Ishiguro, and N. Hagita. Interactive humanoid robots
for a science museum. ACM 1st Annual Conference on Human-Robot Interaction, 2006.
[89] R. Siegwart, K. Arras, S. Bouabdallah, D. Burnier, G. Froidevaux, X. Greppin, B. Jensen, A. Lorotte, L. Mayor, M. Meisser, R. Philippsen, R. Piguet,
G. Ramel, G. Terrien, and N. Tomatis. Robox at expo.02: A large-scale installation of personal robots. Robotics and Autonomous Systems 42(3-4): 203-222,
2003.
[90] T. Sim and S. Zhang. Exploring face space. Proceedings of the Conference on
Computer Vision and Pattern Recognition Workshop, 2004.
[91] R. Simmons, D. Goldberg, A. Goode, M. Montemerlo, N. Roy, B. Sellner,
C. Urmson, A. Schultz, M. Abramson, W. Adams, A. Atrash, M. Bugajska,
M. Coblenz, M. MacMahon, D. Perzanowski, I. Horswill, R. Zubek, D. Kortenkamp, B. Wolfe, T. Milam, and B. Maxwell. Grace: An autonomous robot
for the aaai robot challenge. AI Magazine, 24(2), 2003.
[92] J. Stanley and B. Steinhardt. Drawing a blank : The failure of face recognition
in tampa. An ACLU Special Report, 2002.
[93] M. J. Swain and D. H. Ballard. Color indexing. International Journal of
Computer Vision, 7:1, 1991.
[94] S. Thrun, M. Bennewitz, W. Burgard, A.B. Cremers, F. Dellaert, D. Fox,
D. Hähnel, C. Rosenberg, N. Roy, J. Schulte, and D. Schulz. Minerva: A
second-generation museum tour-guide robot. Proceedings of IEEE International
Conference on Robotics and Automation, vol 3, pages 1999-2005, 1999.
[95] A.M. Turing. Computing machinery and intelligence. Mind, 59, 433-460, 1950.
[96] M. Turk and A. Pentland. Face recognition using eigenfaces. Proceedings IEEE
Conference Computer Vision and Pattern Recognition, 1991.
[97] J. Velasquez. When robots weep: Emotional memories and decision-making.
Proceedings of the Fifteenth National Conference on Artificial Intelligence.
Madison, WI, 1998.
[98] P. Viola and M. Jones. Robust real-time object detection. Technical Report
Series, CRL2001/01. Cambridge Research Laboratory, 2001.
[99] L. S. Vygotsky. Mind in society. Cambridge, MA: Harvard Univ Press, 1978.
[100] C. Wallraven, A. Schwaninger, and H. H. Bülthoff. Learning from humans:
Computational modeling of face recognition. Network: Computation in Neural
Systems 16(4), 401-418, 2005.
[101] P. Wang, M. Green, Q. Ji, and J. Wayman. Automatic eye detection and its
validation. IEEE Workshop on Face Recognition Grand Challenge Experiments,
2005.
[102] J. Weng. Developmental robotics: Theory and experiments. International
Journal of Humanoid Robotics, vol. 1, no. 2, 2004.
[103] J. Weng, C. Evans, and W. Hwang. An incremental learning method for face
recognition under continuous video stream. Proceedings of the Fourth IEEE
International Conference on Automatic Face and Gesture Recognition, 2000.
[104] C. Yu and D.H. Ballard. Exploring the role of attention in modeling embodied language acquisition. Proceedings of the Fifth International Conference on
Cognitive Modeling, 2003.
[105] W. Zhao, R. Chellappa, A. Rosenfeld, and P. Phillips. Face recognition: A
literature survey. ACM Computing Surveys, Volume 35, Issue 4, pp. 399-458,
December, 2003.
[106] S. Zhou, V. Krueger, and R. Chellappa. Probabilistic recognition of human faces
from video. Computer Vision and Image Understanding, 91:214-245, 2003.