Unsupervised Learning of Affordance Relations on a Humanoid Robot

Barış Akgün, Nilgün Dağ, Tahir Bilal, İlkay Atıl and Erol Şahin
Kovan Research Lab.
Computer Engineering Department
Middle East Technical University
Ankara, Turkey
Email: bakgun@ceng.metu.edu.tr
Abstract—In this paper, we study how a humanoid robot can learn affordance relations in its environment through its own interactions, in an unsupervised way. Specifically, we developed a simple tapping behavior on the iCub humanoid robot simulator and allowed the robot to interact with a set of objects of different types and sizes positioned within its reach. The interaction schema is as follows: an object is put in the visual field of the robot, the robot focuses on the object, and then applies its tapping behavior. The robot records its initial and final percepts of its view, obtained from a range camera, in the form of a feature vector. The difference between the initial and final features is considered as the effect features. The effect features are clustered using Kohonen's self-organizing maps to generate a set of effect categories in an unsupervised way. Then, the ReliefF feature selection method is used to determine the most relevant features, and a multi-class support vector machine (SVM) classifier is trained to learn the mapping between the relevant features and the effect categories. We analyzed the unsupervised clustering of effect features using both the types and sizes of the objects that fall into the effect clusters, as well as the success/fail labels (corresponding to rolled and not rolled) that were manually attached to the interactions. Our results show that 1) despite the lack of supervision, the effect clusters tend to be homogeneous in terms of success/fail, 2) the relevant features consist mainly of shape, not size, 3) the number of relevant features remains approximately constant with respect to the number of effect clusters formed, and 4) the SVM classifier can successfully learn the effect categories using the relevant features.
I. INTRODUCTION
Robots are becoming more and more integrated into our everyday life. However, our environment is highly dynamic and complex, with many different things to interact with. An internal representation of these objects and the environment is necessary for interaction, yet with so many things around, not all of the world's information can be stored in memory. Moreover, predicting the outcome of an interaction is of importance.
Affordances, a concept from psychology, offer an explanation for such information pick-up in organisms. The term was coined by J. J. Gibson, who described affordances as action possibilities offered to an agent by its environment [1]. He claimed that these possibilities are directly apparent without any mental computation on the perceived data, and that an action needs only the relevant perceptual information for its execution; this is sometimes referred to as perceptual economy. Another important idea he put forward was that affordances are relational: the relation is between the environment and the agent, encompassing the agent's physical properties, perception, behaviors, and the effects of these behaviors. Later, E. Gibson argued that the learning of affordances is "discovering features and invariant properties of things and events" [2], and stated that this discovery is a developmental process. These properties of affordances offer solutions to the aforementioned robotic problems.
We are interested in learning affordances for a humanoid robot. We perform our experiments in a simulated environment with a humanoid robot equipped with perceptual and behavioral skills, and we describe a framework in which affordances can be learned in a developmental way through interaction. We particularly follow the formalization described in [3], in which affordance relations are represented as triples of the form (effect, (entity, behavior)), where entity refers to the perceptual representation of an interesting part of the environment, such as an object. This formalization can be used for predicting the outcome of an action and for understanding another agent's actions.
In the rest of the paper, we first give a literature survey of the current state of the art on affordance use in robotics. In Section III, we describe the environment, the robot, its perceptual and behavioral capabilities, and the interaction sequence. The affordance learning model is described in Section IV, experimental results are given in Section V, and Section VI concludes with a discussion.
II. LITERATURE SURVEY
The affordance concept is gaining popularity in robotics research. Similarities between affordance theory and reactive/behavior-based robotics have been noted in [4] and [5], and the relation between action-oriented perception and affordances is also noted in [4]. From a developmental robotics point of view, affordances are higher-level concepts [6] that are learned through interaction with the environment [7]. Affordance theory has inspired robotic learning [8], [9], tool use [10] and decision making [11]. Studies on the learning of affordances are concerned with two major issues: learning the consequences of a certain action in a given situation [7], [9], [10], and learning the invariant properties of environments that afford a certain action [8], [12], [13]. Studies in the latter group also relate these properties to the effect of an action, but the effects are expressed in terms of internal values of the agent, not in terms of changes in the environment. An application for the learning and use of affordances in mobile robotics is detailed in [14]. In that study, the traversability of an environment containing simple objects like boxes, cylinders and spheres is learned through many interactions in a simulator, and relevant features for the traversability affordance, as well as effect predictors, are learned using the success/fail labels of these interactions. The work in [15] and [16] developed a neurocomputational model and tested various parts of it in two robotic experiments, in one of which object affordances are learned. A recent model for the developmental learning and use of affordances on a humanoid robot is detailed in [17]: affordances are modeled as trilateral relations between objects, effects and actions using a Bayesian network, and the structure of this network is learned through the robot's interactions with objects of different shapes and sizes.

Fig. 1: (a) A screenshot from the iCub simulator showing the robot. (b) Different objects iCub interacts with during the experiments.

Fig. 2: Segmented object in the range image.
III. EXPERIMENTAL FRAMEWORK
A. Environment
We are working with the iCub humanoid robot simulator [18], a physics-based simulator. The robot in the simulator is modeled after iCub, a child-sized humanoid robot designed for cognitive and developmental robotics research [19]. We have modified the simulator to better fit our needs. The objects used in our experiments are spheres, boxes and cylinders of different sizes. A screenshot from the simulator can be seen in Figure 1.
B. Perception
We record interaction data through regular cameras, a range camera, position sensors and touch sensors, and extract features from these data. Currently we do not use the regular cameras' data for feature extraction; implementing or integrating a stereo vision algorithm is excluded from this study for practical reasons.
We have implemented a range camera in the simulator. This simplifies the vision problem, leaving more time to gather and comment on results. The object of interest is segmented in the range image, and 49 features are extracted: 24 of them are shape related, 18 are size related, and 7 describe the hand-object relation. Segmentation of the object is currently done without any image processing techniques: thanks to the simulated environment, the pixels belonging to the object of interest are marked directly in the range image.
The extracted shape features are slightly modified versions of those described in [14]. To summarize the process in Figure 3: first, the Cartesian coordinates of the points on the object are calculated from the range readings; a normal vector is then computed for each point, and the latitude and longitude angles of the normal vectors are calculated. Two histograms of 12 bins each, one for latitude and one for longitude, are generated, and the bin values are taken as the shape features.
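As a concrete illustration, the following is a minimal sketch of the histogram computation, assuming the surface normals are already available as an (N, 3) array of unit vectors; the bin ranges and the normalization by point count are our own choices and are not specified in the paper.

```python
import numpy as np

def shape_features(normals: np.ndarray) -> np.ndarray:
    """24 shape features: 12-bin latitude and 12-bin longitude histograms."""
    nx, ny, nz = normals[:, 0], normals[:, 1], normals[:, 2]
    latitude = np.arcsin(np.clip(nz, -1.0, 1.0))   # elevation angle of each normal
    longitude = np.arctan2(ny, nx)                 # azimuth angle of each normal
    lat_hist, _ = np.histogram(latitude, bins=12, range=(-np.pi / 2, np.pi / 2))
    lon_hist, _ = np.histogram(longitude, bins=12, range=(-np.pi, np.pi))
    # Normalize by the number of points so objects of different pixel counts
    # produce comparable histograms (our assumption, not from the paper).
    return np.concatenate([lat_hist, lon_hist]) / max(len(normals), 1)
```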
To calculate the size-related features, the orientation of the object in the range image is first estimated by means of simple covariance arithmetic. Using this orientation, we detect the principal axes and the four extremum points of the object, as depicted in Figure 2. We use the average x, y, z and depth values of the pixels, the number of object pixels in the image, the orientation of the A-B line segment, and the x, y, z, Cartesian and ray distances between A-B and C-D as features. The size features also include the image-plane Cartesian distances of A-B and C-D, for a total of 18 size features.
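A minimal sketch of the covariance-based orientation estimate is shown below; `pixels` is assumed to be an (N, 2) array of object pixel coordinates, and taking the dominant eigenvector of the covariance matrix as the principal axis is the standard construction, though the paper does not spell out its exact arithmetic.

```python
import numpy as np

def object_orientation(pixels: np.ndarray) -> float:
    """Angle (radians) of the principal axis of the object's pixels."""
    centered = pixels - pixels.mean(axis=0)
    cov = np.cov(centered, rowvar=False)   # 2x2 covariance of pixel coordinates
    _, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    major = eigvecs[:, -1]                 # eigenvector of the largest eigenvalue
    return np.arctan2(major[1], major[0])
```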
We also use 7 features describing the hand-object relation: the x, y and Euclidean distances between the hand and the object in the image plane, and the x, y, z and Euclidean distances between them in 3D Cartesian space.
C. Behaviors
We use tapping as our main behavior. Tapping involves controlling the robotic hand's pose or velocity, which requires forward and inverse kinematics calculations; we perform these using the Orocos Kinematics and Dynamics Library (KDL) [20]. The shape, size and position information of the objects is taken directly from the simulator for simplicity, and then used to adjust the behavior parameters.
Fig. 5: Interaction scheme. The robot first finds the object, then focuses on it. A behavior, in this case grasping, is executed afterwards.

Fig. 3: Range normals and histograms.

Fig. 4: Robot tapping a non-rollable object (left) and a rollable object (right).
Our attention mechanism controls the eyes and the head of the robot in order to focus on a target in the visual field using the cameras. The target is found using image segmentation, and focusing is accomplished by moving the neck and the eyes. We define reaching as bringing the robotic hand into the vicinity of the target; it need not be precise, and the orientation of the hand is not important. The approximate position of the object can be calculated using the neck angles and eye vergence. The attention and reaching phases precede tapping.
We define tapping as moderately hitting an object from its side. Once the hand is aligned with the object, it moves towards the object with a specified velocity and stops when it hits the object. Figure 4 shows two instances of the tapping behavior. The aim of tapping is to have a behavior whose effect depends mostly on the object's shape: we expect spheres and horizontal cylinders to roll after being tapped, and boxes and vertical cylinders not to roll.
D. Interaction
An interaction starts with an object in the visual field of the robot. Next, using the attention mechanism, the robot focuses on the object and an environmental snapshot¹ is taken from multiple sensors. A behavior is then executed and, when the execution ends, another snapshot is taken. A simple interaction schema from the eyes of the robot can be seen in Figure 5. This sequence is repeated throughout the experimentation phase.

¹We define a snapshot as the measurements of all the sensors at a given moment.
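In pseudocode form, one interaction episode could be sketched as follows; all the function names here are hypothetical placeholders for the robot's attention, perception and behavior modules, not an actual API of the simulator.

```python
def run_interaction(robot, behavior):
    """One interaction episode: focus, snapshot, act, snapshot."""
    robot.focus_on_object()          # attention mechanism finds and fixates the object
    before = robot.take_snapshot()   # initial snapshot from all sensors
    behavior.execute(robot)          # e.g., the tapping behavior
    after = robot.take_snapshot()    # final snapshot after execution ends
    return before, after
```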
IV. AFFORDANCE LEARNING
We developed a model in which a humanoid robot learns affordances by itself through interaction with its environment. This model establishes the relations between the robot's perception, its actions, the effects of these actions, and the environment.
After a number of interactions, we extract features from the initial and final snapshots. We define the effect features as the difference between the initial and final features, and we normalize these features before processing them further.
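In code, assuming the feature vectors are stacked as numpy arrays, this step is a difference followed by normalization; min-max scaling is our own assumption, as the paper does not name the normalization scheme.

```python
import numpy as np

def effect_features(initial: np.ndarray, final: np.ndarray) -> np.ndarray:
    """Effect features: per-episode difference of final and initial features."""
    return final - initial

def normalize(effects: np.ndarray) -> np.ndarray:
    """Min-max scale each column of an (episodes, features) matrix to [0, 1]."""
    lo, hi = effects.min(axis=0), effects.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # guard against constant features
    return (effects - lo) / span
```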
There may be many objects to interact with, but the types of effect that an action can create are usually limited. Based on this observation, the effect features are clustered using Kohonen's self-organizing map (SOM) to generalize over effects. The cluster topology must be specified for this map; we chose a SOM with a 1-D neighborhood, since we found that the topology does not have a significant effect on the results. The total number of clusters, however, is important, and it is one of the parameters of our model.
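The following is a minimal sketch of a SOM with a 1-D neighborhood for clustering the effect vectors; the learning rate, neighborhood width and their decay schedules are our own assumptions, since the paper does not report these hyperparameters.

```python
import numpy as np

def train_som_1d(data, n_units, epochs=50, lr0=0.5, sigma0=1.0, seed=0):
    """Train a 1-D SOM; returns the (n_units, dim) unit weight vectors."""
    rng = np.random.default_rng(seed)
    weights = rng.uniform(size=(n_units, data.shape[1]))
    units = np.arange(n_units)
    step, n_steps = 0, epochs * len(data)
    for _ in range(epochs):
        for x in rng.permutation(data):
            frac = step / n_steps
            lr, sigma = lr0 * (1 - frac), max(sigma0 * (1 - frac), 1e-3)
            bmu = np.argmin(np.linalg.norm(weights - x, axis=1))  # best matching unit
            h = np.exp(-((units - bmu) ** 2) / (2 * sigma ** 2))  # 1-D neighborhood
            weights += lr * h[:, None] * (x - weights)
            step += 1
    return weights

def cluster_labels(data, weights):
    """Assign each effect vector to its nearest SOM unit (= effect category)."""
    dists = np.linalg.norm(data[:, None, :] - weights[None, :, :], axis=2)
    return np.argmin(dists, axis=1)
```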
We map the cluster membership of each effect back to the corresponding effect features and use these memberships as class labels. Recall that perceptual economy is an important aspect of affordances. The ReliefF algorithm, first proposed in [21] and extended in [22], is used for relevant feature selection; it assigns weights to the features according to their contribution to classification, and this relevant-feature selection provides perceptual economy. However, the ReliefF algorithm requires the number of relevant features to be specified. We accomplish this by specifying a relevancy threshold, the second parameter of our model, which is introduced below.
For affordance learning to be complete, our robot needs to predict the outcome of an action on an object; that is, we need to relate the initial features to the effects. We therefore train an SVM classifier using the class labels and the relevant features. Note that the class labels correspond to different effects, so by training the classifier we relate the initial features to the effects and close the loop of affordance learning: the robot can now predict the effect it will create on an object with a certain action. Figure 6 depicts the learning model.
The method described above covers a single affordance relation. The model can easily be extended by training a separate SOM and SVM for each behavior.
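A minimal sketch of this final training step is given below, using scikit-learn's SVC as the multi-class SVM; the RBF kernel and the variable names (`X_init` for the initial feature vectors, `labels` for the SOM cluster indices, `relevant` for the selected feature indices) are our own assumptions.

```python
from sklearn.svm import SVC

def train_effect_predictor(X_init, labels, relevant):
    """Learn the mapping from relevant initial features to effect categories."""
    clf = SVC(kernel="rbf")   # SVC handles multi-class via one-vs-one internally
    clf.fit(X_init[:, relevant], labels)
    return clf

# Closing the loop: predict which effect category tapping a new object creates.
# predicted = clf.predict(x_new[relevant].reshape(1, -1))
```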
V. EXPERIMENTAL RESULTS
We conducted 1000 experiments with the tapping behavior and ran our learning model 20 times; the evaluated values are averaged over these trials.
Fig. 6: Affordance learning model
A. Evaluation Metrics
We use two metrics. In order to assess the homogeneity of the effect clusters in terms of success/fail ratios, we use a uniformity metric, defined as

U = 100 \times \sum_{c \in \mathrm{Clusters}} \frac{|S_c - F_c|}{T}

where S_c denotes the number of success instances in cluster c, F_c denotes the number of fail instances in cluster c, and T denotes the total number of instances in all clusters. If every cluster contains an equal number of success and fail instances, the uniformity is 0; if every cluster contains exclusively success or exclusively fail instances, the uniformity is 100.
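A minimal sketch of this metric, assuming `labels` holds the SOM cluster index of each interaction and `success` holds the hand-attached success/fail label as a boolean:

```python
import numpy as np

def uniformity(labels: np.ndarray, success: np.ndarray) -> float:
    """Uniformity U in [0, 100]; 100 means every cluster is pure."""
    total = len(labels)
    acc = 0
    for c in np.unique(labels):
        in_c = labels == c
        s_c = int(np.count_nonzero(success[in_c]))    # successes in cluster c
        f_c = int(np.count_nonzero(~success[in_c]))   # failures in cluster c
        acc += abs(s_c - f_c)
    return 100.0 * acc / total
```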
Secondly, for feature selection we use a relevancy metric. The ReliefF procedure returns the most important features for classification together with their respective importance weights. Our relevancy metric always takes a value between 0 and 1. To determine the features to be used for learning, we multiply the relevancy value by the weight of the most important feature and take this product as a threshold; only the features whose importance weights are above this threshold are used.
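A minimal sketch of this threshold rule, assuming `weights` is the vector of ReliefF importance weights (e.g., as returned by an off-the-shelf ReliefF implementation) and `relevancy` is the 0-1 parameter described above:

```python
import numpy as np

def select_relevant(weights: np.ndarray, relevancy: float) -> np.ndarray:
    """Indices of features whose weight exceeds relevancy * (max weight)."""
    threshold = relevancy * weights.max()
    return np.flatnonzero(weights > threshold)
```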
B. Uniformity
In this section we present the performance of the unsupervised effect clustering with respect to hand-labelled data. Figure 7 depicts the uniformity results for the effect clusters of tapping. The essence of these results is that, although the feature vector does not include the hand labels (success/fail), the unsupervised clustering is able to recover this inherent information quite successfully. Generally speaking, the uniformity of the clusters increases with the number of clusters, as expected. The average uniformity is above 80% for tapping and 70% for grasping, and it can reach up to 95%. There is a dip in the generally increasing trend for the grasping behaviour in the 3-cluster case. The grasping behaviour is affected mostly by the size of the object to be grasped, and size varies continuously, so there is a critical size value determining graspability. One of the clusters falls near this region, drawing instances from both sides and therefore decreasing uniformity.

Fig. 7: Uniformity of clustering with respect to number of clusters, averaged over number of trials and number of clusters
C. Selected Features
In this section we investigate the effects of the relevancy threshold and the number of clusters on the selected features. In Figure 8, the upper subfigure shows the percentage of shape features among the selected features, and the lower subfigure shows the total number of selected features. As can be seen, shape features dominate the selected features. When the relevancy threshold is decreased, the number of selected features increases; since there are only 24 shape-related features, the percentage of shape-related features then decreases. The number of selected features tends to be inversely proportional to the relevancy threshold and remains roughly constant above a certain number of clusters.
Having given an overall analysis of the data, we next investigate two specific cases.
D. Case Studies
Fig. 9 shows the results of clustering the effects of the tapping behavior using a 1x2 SOM topology. Fig. 9(a) shows the hand-labelled fail/success ratios of the instances in each cluster. The results show that, although an unsupervised technique was used to cluster the effect features, each cluster has high uniformity measured against the hand labels, in line with our previous analysis. Fig. 9(b) reflects the shape characteristics of the instances in each cluster: the first cluster, which mostly consists of success instances, contains the rollable shapes (spheres and horizontal cylinders), while the second cluster consists of the unrollable shapes (boxes and vertical cylinders). Both clusters contain instances from the whole size range, as Fig. 9(c) suggests.
Fig. 8: Properties of the selected features for tapping. (a) Percentage of shape features among the selected features wrt number of clusters and relevancy threshold. (b) Total number of selected features wrt number of clusters and relevancy threshold.

Fig. 9: 1x2 topology clustering results for the tap behavior. (a) Hand-labelled success/fail ratios. (b) Shapes. (c) Sizes.

Fig. 10 shows the same results for the 1x4 topology. Looking at the hand-labelled fail/success ratios in Fig. 10(a), two of the resulting clusters correspond to success clusters, while the remaining two correspond to fail clusters. That different clusters correspond to the same hand label (success or fail) shows the variability within the hand-labelled success and fail effects. Fig. 10(b) suggests that the first cluster mostly consists of spheres, while the third consists only of boxes; the first cluster may therefore correspond to the effect of a sphere rolling, and the third to the effect of tapping an unrollable box. Again, the clusters are not determined by the size of the instances, as can be seen in Fig. 10(c).
E. SVM Learning
As we have stated before, in order to learn affordances we first select the relevant features for the behaviour, then categorize its effects into clusters in an unsupervised fashion, and finally train a classifier to predict the resulting effect category when that behaviour is applied to an initial instance. Training such a classifier corresponds to learning the affordances of a single behaviour. We use support vector machines for classification and trained SVMs for several numbers of effect clusters and relevancy values. The results are depicted in Figure 11, where different colored lines correspond to different relevancy values.

Fig. 10: 1x4 topology clustering results for the tap behavior. (a) Hand-labelled success/fail ratios. (b) Shapes. (c) Sizes.

Generally speaking, the SVM's prediction accuracy decreases as the number of effect categories is increased. This is, to a degree, contrary to our expectations: as shown in the previous sections, the homogeneity of the effect clusters increases with the number of effect clusters, and hence we expected the classifiers' accuracy to increase with more uniform target clusters. There are, however, several possible explanations. First, with a growing number of classes to predict, the probability of predicting the correct class decreases; this is a common machine learning problem.
Fig. 11: SVM prediction successes for the tapping behaviour wrt number of clusters
Secondly, SVMs are originally dichotomizers; they are not as successful in multi-class problems as they are in two-class problems. Another point to note is that the accuracy is almost the same for the relevancy values of 0%, 20% and 40%. For the tapping behaviour, we can say that 40% is the best relevancy value: not only does it provide the maximum accuracy, it also achieves this with the highest perceptual economy possible. The SVM accuracy figure also shows an anomalous increase at 4 clusters, against the subsequent decreasing trend. In our opinion, this is because there are 4 different types of objects in the environment: boxes, spheres, horizontal cylinders and vertical cylinders.
VI. CONCLUSION
In this work we have presented a model for a humanoid robot to learn object affordances through interaction. The robot's interactions took place in a simulated environment. One of the major results of this work is that the robot learns the effects it creates in the environment in a completely unsupervised fashion, without resorting to a mentor.
We analyzed the resulting effect clusters using our uniformity metric and the hand labels. It was interesting to see that the effect clusters correspond to externally observed effects. The extracted relevant features were in line with our predictions: the model found that the tapping behavior is mostly affected by shape. Using only the relevant features gives a computational performance increase; if the total number of features were much higher, this increase would be more significant. We have also seen that the number of relevant features is not affected by the chosen number of effect clusters. Overall, we have shown that, through interaction with its environment, our humanoid robot is able to learn the rollability affordance quite successfully.
ACKNOWLEDGMENT
This research was supported by EU FP7 Project ROSSI, contract no. 216125-STREP. Barış Akgün acknowledges the full support, and Tahir Bilal and İlkay Atıl acknowledge the partial support, of the TÜBİTAK graduate student fellowship. The authors would like to thank Doruk Tunaoglu for his helpful comments.
REFERENCES
[1] J. J. Gibson, The Ecological Approach to Visual Perception. Lawrence Erlbaum Associates, 1979.
[2] E. J. Gibson, "Perceptual learning in development: Some basic concepts," Ecological Psychology, vol. 12, no. 4, pp. 295–302, 2000.
[3] E. Sahin, M. Cakmak, M. R. Dogar, E. Ugur, and G. Ucoluk, "To afford or not to afford: A new formalization of affordances towards affordance-based robot control," Adaptive Behavior, 2007.
[4] R. Arkin, Behavior-based Robotics. Cambridge, MA, USA: MIT Press, 1998.
[5] R. R. Murphy, "Case studies of applying Gibson's ecological approach to mobile robots," IEEE Transactions on Systems, Man, and Cybernetics, vol. 29, no. 1, pp. 105–111, 1999.
[6] M. Lungarella, G. Metta, R. Pfeifer, and G. Sandini, "Developmental robotics: a survey," Connection Science, vol. 15, no. 4, pp. 151–190, 2003.
[7] P. Fitzpatrick, G. Metta, L. Natale, A. Rao, and G. Sandini, "Learning about objects through action - initial steps towards artificial cognition," in Proceedings of the 2003 IEEE International Conference on Robotics and Automation (ICRA), 2003, pp. 3140–3145.
[8] K. MacDorman, "Responding to affordances: Learning and projecting a sensorimotor mapping," in Proceedings of the 2000 IEEE International Conference on Robotics and Automation, San Francisco, California, USA, 2000, pp. 3253–3259.
[9] A. Stoytchev, "Toward learning the binding affordances of objects: A behavior-grounded approach," in Proceedings of the AAAI Symposium on Developmental Robotics, Stanford University, March 2005.
[10] A. Stoytchev, "Behavior-grounded representation of tool affordances," in Proceedings of the 2005 IEEE International Conference on Robotics and Automation, Barcelona, Spain, 2005, pp. 18–22.
[11] I. Cos-Aguilera, L. Canamero, and G. M. Hayes, "Motivation-driven learning of object affordances: First experiments using a simulated Khepera robot," in Proceedings of the 9th International Conference on Cognitive Modelling (ICCM'03), Bamberg, Germany, April 2003.
[12] G. Fritz, L. Paletta, M. Kumar, G. Dorffner, R. Breithaupt, and E. Rome, "Visual learning of affordance based cues," in SAB, 2006, pp. 52–64.
[13] D. Kim, J. Sun, S. M. Oh, J. M. Rehg, and A. Bobick, "Traversability classification using unsupervised on-line visual learning for outdoor robot navigation," in IEEE International Conference on Robotics and Automation (ICRA 06), Orlando, FL, May 2006.
[14] E. Ugur, M. R. Dogar, M. Cakmak, and E. Sahin, "The learning and use of traversability affordance using range images on a mobile robot," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA 07), April 2007.
[15] G. Metta, G. Sandini, L. Natale, L. Craighero, and L. Fadiga, "Understanding mirror neurons: a bio-robotic approach," Interaction Studies, vol. 7, 2006.
[16] L. Craighero, G. Metta, G. Sandini, and L. Fadiga, "The mirror-neurons system: data and models," Progress in Brain Research, vol. 164, pp. 39–59, 2007.
[17] L. Montesano, M. Lopes, A. Bernardino, and J. Santos-Victor, "Learning object affordances: From sensory-motor coordination to imitation," IEEE Transactions on Robotics, vol. 24, no. 1, pp. 15–26, 2008.
[18] V. Tikhanoff, P. Fitzpatrick, F. Nori, L. Natale, G. Metta, and A. Cangelosi, "The iCub humanoid robot simulator," Nice, France, September 2008.
[19] G. Metta, G. Sandini, D. Vernon, L. Natale, and F. Nori, "The iCub humanoid robot: an open platform for research in embodied cognition," Washington DC, USA, Aug 2008.
[20] "Orocos Kinematics and Dynamics Library." [Online]. Available: http://www.orocos.org/kdl
[21] K. Kira and L. A. Rendell, "A practical approach to feature selection," in ML92: Proceedings of the Ninth International Workshop on Machine Learning. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1992, pp. 249–256.
[22] I. Kononenko, "Estimating attributes: Analysis and extensions of Relief," in European Conference on Machine Learning, 1994, pp. 171–182.