Unsupervised Learning of Affordance Relations on a Humanoid Robot

Barış Akgün, Nilgün Dağ, Tahir Bilal, İlkay Atıl and Erol Şahin
Kovan Research Lab., Computer Engineering Department
Middle East Technical University, Ankara, Turkey
Email: bakgun@ceng.metu.edu.tr

Abstract—In this paper, we study how a humanoid robot can learn affordance relations in its environment through its own interactions, in an unsupervised way. Specifically, we developed a simple tapping behavior on the iCub humanoid robot simulator and allowed the robot to interact with a set of objects of different types and sizes positioned within its reach. The interaction schema is as follows: an object is placed in the visual field of the robot; the robot focuses on the object and then applies its tapping behavior. The robot records its initial and final percepts of its view, obtained from a range camera, in the form of a feature vector. The difference between the initial and final features is taken as the effect features. The effect features are clustered using Kohonen's self-organizing maps to generate a set of effect categories in an unsupervised way. We then used the ReliefF feature selection method to determine the most relevant features, and a multi-class support vector machine (SVM) classifier is trained to learn the mapping between the relevant features and the effect categories. We analyzed the unsupervised clustering of the effect features using both the types and sizes of the objects that fall into each effect cluster, as well as success/fail labels (corresponding to rolled and not rolled) that were manually attached to the interactions. Our results show that: 1) despite the lack of supervision, the effect clusters tend to be homogeneous in terms of success/fail; 2) the relevant features consist mainly of shape, but not size; 3) the number of relevant features remains approximately constant with respect to the number of effect clusters formed; and 4) the SVM classifier can successfully learn the effect categories using the relevant features.

I. INTRODUCTION

Robots are becoming more and more integrated into our everyday life. However, our environment is highly dynamic and complex, with many different things to interact with. An internal representation of these objects and the environment is necessary for interaction, yet with so many things around, not all of the world's information can be stored in memory. Moreover, predicting the outcome of an interaction is of central importance. Affordances, a concept from psychology, offer an explanation for how organisms pick up exactly such information. The term was coined by J. J. Gibson, who described affordances as action possibilities offered to an agent by its environment [1]. He claimed that these possibilities are directly perceivable without any mental computation on the perceived data, and that an action needs only the perceptual information relevant to its execution. This is sometimes referred to as perceptual economy. Another important idea he put forward is that affordances are relational: the relation is between the environment and the agent, encompassing the agent's physical properties, perception, behaviors, and the effects of these behaviors. Later, E. Gibson argued that learning affordances means "discovering features and invariant properties of things and events" [2], and that this discovery is a developmental process. These properties of affordances offer solutions to the aforementioned robotic problems. We are interested in learning affordances for a humanoid robot.
We perform our experiments in a simulated environment with a humanoid robot equipped with perceptual and behavioral skills, and we describe a framework in which affordances can be learned in a developmental way through interaction. We particularly follow the formalization described in [3], in which affordance relations are represented as triples of the form (effect, (entity, behavior)), where entity refers to the perceptual representation of an interesting part of the environment, such as an object. This formalization can be used for predicting the outcome of an action and for understanding another agent's actions. In the rest of the paper, we first survey the current state of the art on affordance use in robotics. Section III describes the environment, the robot and its capabilities, the perceptual features, the behaviors, and the interaction sequence. Section IV presents the learning model. Experimental results are given in Section V, followed by conclusions in Section VI.

II. LITERATURE SURVEY

The affordance concept is gaining popularity in robotics research. Similarities between affordance theory and reactive/behavior-based robotics have been noted in [4] and [5]; the relation between action-oriented perception and affordances is also noted in [4]. From the developmental robotics point of view, affordances are higher-level concepts [6] that are learned through interaction with the environment [7]. Affordance theory has inspired robotic learning [8], [9], tool use [10] and decision making [11]. Studies on the learning of affordances are concerned with two major issues: learning the consequences of a certain action in a given situation [7], [9], [10], and learning the invariant properties of environments that afford a certain action [8], [12], [13]. Studies in the latter group also relate these properties to the effect of an action, but in terms of internal values of the agent rather than changes in the environment. An application of affordance learning and use in mobile robotics is detailed in [14]. In that study, the traversability of an environment containing simple objects like boxes, cylinders and spheres is learned through many interactions in a simulator; relevant features for the traversability affordance and effect predictors are learned using success/fail labels of these interactions. The work in [15] and [16] developed a neurocomputational model and tested various parts of it in two robotic experiments, in one of which object affordances are learned. A recent model for the developmental learning and use of affordances on a humanoid robot is detailed in [17]. There, affordances are modeled as trilateral relations between objects, effects and actions using a Bayesian network, whose structure is learned through the robot's interactions with objects of different shapes and sizes.

III. EXPERIMENTAL FRAMEWORK

A. Environment

We work with the iCub Humanoid Robot Simulator [18], a physics-based simulator. The robot in the simulator is modeled after iCub, a child-sized humanoid robot designed for cognitive and developmental robotics research [19]. We have modified the simulator to better fit our needs. The objects we use in our experiments are spheres, boxes and cylinders of different sizes. A screenshot from the simulator can be seen in Fig. 1.

Fig. 1: (a) A screenshot from the iCub simulator showing the robot. (b) Different objects iCub interacts with during the experiments.

B. Perception
We record interaction data through regular cameras, a range camera, position sensors and touch sensors, and extract features from these data. Currently we do not use the regular cameras' data for feature extraction; implementing or using a stereo vision algorithm was excluded from this study for practical purposes. Instead, we implemented a range camera in the simulator, which simplifies the vision problem and leaves more time to gather and interpret results. The object of interest is segmented in the range image and 49 features are extracted: 24 are related to shape, 18 to size, and 7 to the hand-object relation. Segmentation is currently done without any image processing techniques: the pixels belonging to the object of interest are marked in the range image, which the simulated environment makes possible.

The extracted shape features are slightly modified versions of those described in [14]. To summarize the process in Fig. 3: first, the Cartesian coordinates of the points on the object are calculated from the range readings; a normal vector is then computed for each point; the latitude and longitude angles of the normal vectors are computed; and two histograms of 12 bins each, one for latitude and one for longitude, are generated. The bin values are taken as the shape features.

To calculate the size-related features, the orientation of the object in the range image is first estimated by a simple covariance computation. Using this orientation, we detect the principal axes and the four extremum points of the object, as depicted in Fig. 2. As features we use the average x, y, z and depth values of the object pixels, the number of object pixels in the image, the orientation of the A-B line segment, and the x, y, z, Cartesian and ray distances between A-B and C-D. The size features also include the image-plane Cartesian distances of A-B and C-D, for a total of 18 size features. Finally, 7 features describe the hand-object relation: the x, y and Euclidean distances between hand and object in the image plane, and the x, y, z and Euclidean distances between hand and object in 3D Cartesian space.

Fig. 2: Segmented object in the range image.
Fig. 3: Range normals and histograms.
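To make the shape-feature computation of Fig. 3 concrete, the following minimal sketch (in Python with NumPy; the function name and the details are our illustration, not our actual implementation) builds the two 12-bin angular histograms from a set of surface normals:

    import numpy as np

    def shape_features(normals, bins=12):
        # Normalize the surface normals to unit length (one Nx3 row per object pixel).
        n = normals / np.linalg.norm(normals, axis=1, keepdims=True)
        # Latitude: angle of the normal with the horizontal plane; longitude:
        # angle of its projection around the vertical axis (Section III-B).
        latitude = np.arcsin(np.clip(n[:, 2], -1.0, 1.0))
        longitude = np.arctan2(n[:, 1], n[:, 0])
        lat_hist, _ = np.histogram(latitude, bins=bins, range=(-np.pi / 2, np.pi / 2))
        lon_hist, _ = np.histogram(longitude, bins=bins, range=(-np.pi, np.pi))
        # Dividing by the pixel count keeps the 24 bin values comparable
        # across objects of different apparent sizes.
        return np.concatenate([lat_hist, lon_hist]) / len(n)

The normals themselves can be estimated, for instance, from the cross products of vectors to neighboring range points before calling this function.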
C. Behaviors

We use tapping as our main behavior. Tapping involves controlling the robotic hand's pose or velocity, which requires forward and inverse kinematics calculations; we perform these using the Orocos Kinematics and Dynamics Library (KDL) [20]. For simplicity, the shape, size and position information of the objects is taken directly from the simulator and used to adjust the behavior parameters.

Our attention mechanism controls the eyes and the head of the robot in order to focus on a target in the visual field using the cameras. The target is found using image segmentation, and focusing is accomplished by moving the neck and the eyes. We define reaching as bringing the robotic hand into the vicinity of the target; it need not be precise, and the orientation of the hand is not important. The approximate position of the object can be calculated from the neck angles and eye vergence. The attention and reaching phases precede tapping.

We define tapping as moderately hitting an object from its side. Once the hand is aligned with the object, it moves towards the object with a specified velocity and stops when it hits the object. Fig. 4 shows two instances of the tapping behavior. The aim of tapping is to have a behavior whose effect depends mostly on the object's shape: we expect spheres and horizontal cylinders to roll, and boxes and vertical cylinders not to roll, after being tapped.

Fig. 4: Robot tapping a non-rollable object (left) and a rollable object (right).

D. Interaction

An interaction starts with an object in the visual field of the robot. Using the attention mechanism, the robot focuses on the object, and an environmental snapshot, defined as the measurements of all the sensors at a given moment, is taken. A behavior is then executed, and when the execution ends, another snapshot is taken. A simple interaction schema, from the eyes of the robot, can be seen in Fig. 5. This sequence is repeated during the experimentation phase.

Fig. 5: Interaction scheme. The robot first finds the object, then focuses on it. A behavior, in this case grasping, is executed afterwards.

IV. AFFORDANCE LEARNING

We developed a model in which a humanoid robot learns affordances by itself through interaction with its environment. This model establishes the relations between the robot's perception, its actions, the effects of those actions, and the environment. After a number of interactions, we extract features from the initial and final snapshots and define the effect features as the difference between them. These features are normalized before further processing.

There may be many objects to interact with, but the types of effects that an action can create are usually limited. Based on this observation, the effect features are clustered using Kohonen's self-organizing maps (SOMs) to generalize over effects. The cluster topology of the map must be specified; we chose a SOM with a 1-D neighborhood, since we found that the topology does not have a significant effect on the results. The total number of clusters, however, is important, and it is the first parameter of our model. We map the cluster memberships of the effects back to the corresponding effect features and use them as class labels.

Recall that perceptual economy is an important aspect of affordances. The ReliefF algorithm, first proposed in [21] and extended in [22], is used for relevant feature selection: it assigns weights to the features according to their contribution to classification, providing perceptual economy. ReliefF, however, requires the number of relevant features to be specified. We accomplish this by specifying a relevancy threshold, the second parameter of our model, which we introduce later.

For affordance learning to be complete, the robot needs to predict the outcome of an action on an object, which means relating the initial features to effects. We therefore train an SVM classifier using the class labels, which correspond to the different effects, and the relevant features. By training this classifier we relate the initial features to the effects and close the loop of affordance learning: the robot can now predict the effect it will create on an object with a certain action. Fig. 6 depicts the learning model. The method described above is for a single affordance relation, but the model is easily extended by training a separate SOM and SVM for each behavior.

Fig. 6: Affordance learning model.
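Putting the pieces together, a minimal end-to-end sketch of this model could look as follows (in Python; MiniSom and scikit-rebate's ReliefF stand in for our own SOM and ReliefF implementations, and all array shapes, names and parameter values are illustrative):

    import numpy as np
    from minisom import MiniSom          # pip install minisom
    from skrebate import ReliefF         # pip install skrebate
    from sklearn.svm import SVC

    def learn_affordance(initial, final, n_clusters=4, relevancy=0.4):
        # Effect features: difference between final and initial percepts,
        # normalized before processing (Section IV).
        effects = final - initial
        effects = (effects - effects.mean(0)) / (effects.std(0) + 1e-9)

        # 1) Cluster the effect features with a 1-D SOM; the index of the
        #    winning node is used as the effect class label.
        som = MiniSom(1, n_clusters, effects.shape[1])
        som.train_random(effects, 5000)
        labels = np.array([som.winner(e)[1] for e in effects])

        # 2) Weight the initial features with ReliefF and keep those whose
        #    weight exceeds relevancy * (weight of the most important feature).
        weights = ReliefF(n_neighbors=10).fit(initial, labels).feature_importances_
        relevant = weights >= relevancy * weights.max()

        # 3) Train a multi-class SVM mapping relevant initial features to
        #    effect categories, closing the affordance-learning loop.
        svm = SVC().fit(initial[:, relevant], labels)
        return som, relevant, svm

Predicting the effect category for a new object with initial feature vector x then amounts to svm.predict(x[relevant].reshape(1, -1)).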
V. EXPERIMENTAL RESULTS

We conducted 1000 experiments with the tapping behavior and ran our learning model 20 times; the reported values are averaged over these runs.

A. Evaluation Metrics

We use two metrics. To assess the homogeneity of the effect clusters in terms of success/fail ratios, we use the uniformity metric, defined as

U = \frac{100}{T} \sum_{c \in \mathrm{Clusters}} |S_c - F_c|

where S_c denotes the number of success instances in cluster c, F_c the number of fail instances in cluster c, and T the total number of instances in all clusters. If every cluster contains an equal number of success and fail instances, the uniformity is 0; if every cluster contains exclusively success or exclusively fail instances, the uniformity is 100.

Secondly, for feature selection we use a relevancy metric. The ReliefF procedure returns the most important features for classification together with their importance weights. Our relevancy metric always takes a value between 0 and 1: we multiply it by the weight of the most important feature and use the product as a threshold, keeping only the features whose importance weights are above this threshold.

B. Uniformity

In this section we present the performance of the unsupervised effect clustering with respect to the hand-labelled data. Fig. 7 depicts the uniformity results for the effect clusters of tapping. The essence of these results is that, although the feature vectors do not include the hand labels (success/fail), unsupervised clustering is able to recover this inherent information quite successfully. Generally speaking, the uniformity of the clusters increases with the number of clusters, as expected. The average uniformity is above 80% for tapping and 70% for grasping, and can reach up to 95%. For the grasping behavior there is a drop in this generally increasing trend at the 3-cluster case. Grasping is affected mostly by the size of the object to be grasped, and size varies continuously, so there is a critical size value that determines graspability; one of the clusters falls near this region, drawing instances from both sides and therefore decreasing uniformity.

Fig. 7: Uniformity of clustering with respect to the number of clusters, averaged over the trials.

C. Selected Features

In this section we investigate the effects of the relevancy threshold and the number of clusters on the selected features. In Fig. 8, the upper subfigure shows the percentage of shape features among the selected features and the lower subfigure shows the total number of selected features. As can be seen, shape features dominate the selected features. When the relevancy threshold is decreased, the number of selected features increases; since there are only 24 shape-related features, the percentage of shape-related features then decreases. The number of selected features tends to be inversely proportional to the relevancy threshold and remains approximately constant above a certain number of clusters.

Fig. 8: Properties of the selected features for tapping. (a) Percentage of shape features among the selected features w.r.t. the number of clusters and relevancy threshold. (b) Number of selected features w.r.t. the number of clusters and relevancy threshold.

We have given an overall analysis of the data.
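Before turning to specific cases, note that the uniformity metric of Section V-A amounts to only a few lines of code (a sketch; the cluster assignments and boolean hand labels are hypothetical inputs, since the hand labels are kept outside the learning loop):

    import numpy as np

    def uniformity(cluster_ids, success):
        # cluster_ids: cluster index per instance; success: True for instances
        # hand-labelled as success (rolled), False for fail (not rolled).
        total = 0
        for c in np.unique(cluster_ids):
            in_c = cluster_ids == c
            s_c = np.sum(success[in_c])       # success instances in cluster c
            f_c = np.sum(~success[in_c])      # fail instances in cluster c
            total += abs(int(s_c) - int(f_c))
        return 100.0 * total / len(cluster_ids)

Balanced clusters yield 0 and pure clusters yield 100, matching the definition above.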
Next, we further investigate two specific cases.

D. Case Studies

Fig. 9 shows the results of clustering the effects of the tapping behavior using a 1x2 SOM topology. Fig. 9(a) shows the hand-labelled fail/success ratios of the instances in each cluster. Although an unsupervised technique was used to cluster the effect features, each cluster has high uniformity with respect to the hand labels, in line with our previous analysis. Fig. 9(b) shows the shape characteristics of the instances in each cluster: the first cluster, which mostly consists of success instances, contains the rollable shapes, spheres and horizontal cylinders, while the second cluster consists of the non-rollable shapes, boxes and vertical cylinders. Both clusters contain instances from the whole size range, as Fig. 9(c) suggests.

Fig. 9: 1x2 topology clustering results for the tap behavior. (a) Hand-labelled success/fail ratios. (b) Shapes. (c) Sizes.

Fig. 10 shows the same results for a 1x4 topology. Looking at the hand-labelled fail/success ratios in Fig. 10(a), two of the resulting clusters correspond to success clusters while the remaining two correspond to fail clusters. Different clusters corresponding to the same hand label (success or fail) show the variability within the hand-labelled success and fail effects. Fig. 10(b) suggests that the first cluster mostly consists of spheres while the third consists only of boxes; the first cluster may thus correspond to the effect of a sphere rolling, and the third to the effect of tapping a non-rollable box. Again, the clusters are not determined by the size of the instances, as can be seen in Fig. 10(c).

Fig. 10: 1x4 topology clustering results for the tap behavior. (a) Hand-labelled success/fail ratios. (b) Shapes. (c) Sizes.

E. SVM Learning

To learn affordances, as stated before, we first select the relevant features for the behavior, then categorize its effects into clusters in an unsupervised fashion, and finally train a classifier to predict the resulting effect category when that behavior is applied to an initial instance. Training such a classifier corresponds to learning the affordances of a single behavior. We use support vector machines for classification, and trained SVMs for several numbers of effect clusters and relevancy values. The results are depicted in Fig. 11, where differently colored lines correspond to different relevancy values.

Fig. 11: SVM prediction accuracies for the tapping behavior w.r.t. the number of clusters.

Generally speaking, the SVM's prediction accuracy decreases as the number of effect categories increases. This is somewhat contrary to our expectations: as shown in the previous sections, the homogeneity of the effect clusters increases with the number of clusters, so we expected the classifiers' accuracy to increase with more uniform target clusters. There are several possible explanations, however. First, with a growing number of classes to predict, the probability of predicting the correct class decreases; this is a common machine learning problem. Secondly, SVMs are originally binary classifiers, and they are not as successful on multi-class problems as on two-class problems. Another point to note is that the prediction accuracy is almost the same for the relevancy values of 0%, 20% and 40%.
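For reference, the comparison underlying Fig. 11 can be reproduced by a parameter sweep of the following kind (a sketch reusing the hypothetical pipeline pieces from Section IV; the cluster counts, relevancy values and 5-fold cross-validation are illustrative):

    import numpy as np
    from minisom import MiniSom
    from skrebate import ReliefF
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    def accuracy_sweep(initial, effects, cluster_counts=(2, 3, 4, 5, 6),
                       relevancies=(0.0, 0.2, 0.4)):
        scores = {}
        for k in cluster_counts:
            # Re-cluster the effects for each cluster count.
            som = MiniSom(1, k, effects.shape[1])
            som.train_random(effects, 5000)
            labels = np.array([som.winner(e)[1] for e in effects])
            weights = ReliefF(n_neighbors=10).fit(initial, labels).feature_importances_
            for rel in relevancies:
                keep = weights >= rel * weights.max()
                # Cross-validated multi-class SVM accuracy for this setting.
                scores[(k, rel)] = cross_val_score(
                    SVC(), initial[:, keep], labels, cv=5).mean()
        return scores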
Regarding the tapping behavior, we can say that 40% is the best relevancy value: not only does it provide the maximum accuracy, it achieves it with the highest perceptual economy possible. The SVM accuracy figure also shows an anomalous increase at 4 clusters, against the subsequent decreasing trend. In our opinion, this is because there are 4 different types of objects in the environment: boxes, spheres, horizontal cylinders and vertical cylinders.

VI. CONCLUSION

In this work we have presented a model with which a humanoid robot learns object affordances through interaction; the robot's interactions took place in a simulated environment. One of the major results of this work is that the robot learns the effects it creates in the environment in a completely unsupervised fashion, without resorting to a mentor. We analyzed the resulting effect clusters using our uniformity metric and hand labels, and it was interesting to see that the effect clusters correspond to externally observed effects. The extracted relevant features were in line with our predictions: the model found that the tapping behavior is affected mostly by shape. Using only the relevant features also gives a computational performance increase; if the total number of features were much higher, this increase would be more significant. We have further seen that the number of relevant features is not affected by the chosen number of effect clusters. In summary, we have shown that, through interaction with its environment, our humanoid robot is able to learn the rollability affordance quite successfully.

ACKNOWLEDGMENT

This research was supported by EU FP7 Project ROSSI, contract no. 216125-STREP. Barış Akgün acknowledges the full support and Tahir Bilal and İlkay Atıl acknowledge the partial support of the TÜBİTAK graduate student fellowship. The authors would like to thank Doruk Tunaoglu for his helpful comments.

REFERENCES

[1] J. J. Gibson, The Ecological Approach to Visual Perception. Lawrence Erlbaum Associates, 1979.
[2] E. J. Gibson, "Perceptual learning in development: Some basic concepts," Ecological Psychology, vol. 12, no. 4, pp. 295–302, 2000.
[3] E. Sahin, M. Cakmak, M. R. Dogar, E. Ugur, and G. Ucoluk, "To afford or not to afford: A new formalization of affordances toward affordance-based robot control," Adaptive Behavior, 2007.
[4] R. Arkin, Behavior-based Robotics. Cambridge, MA, USA: MIT Press, 1998.
[5] R. R. Murphy, "Case studies of applying Gibson's ecological approach to mobile robots," IEEE Transactions on Systems, Man, and Cybernetics, vol. 29, no. 1, pp. 105–111, 1999.
[6] M. Lungarella, G. Metta, R. Pfeifer, and G. Sandini, "Developmental robotics: a survey," Connection Science, vol. 15, no. 4, pp. 151–190, 2003.
[7] P. Fitzpatrick, G. Metta, L. Natale, A. Rao, and G. Sandini, "Learning about objects through action - initial steps towards artificial cognition," in Proceedings of the 2003 IEEE International Conference on Robotics and Automation (ICRA), 2003, pp. 3140–3145.
[8] K. MacDorman, "Responding to affordances: Learning and projecting a sensorimotor mapping," in Proceedings of the 2000 IEEE International Conference on Robotics and Automation, San Francisco, CA, USA, 2000, pp. 3253–3259.
[9] A. Stoytchev, "Toward learning the binding affordances of objects: A behavior-grounded approach," in Proceedings of the AAAI Symposium on Developmental Robotics, Stanford University, March 2005.
[10] A. Stoytchev, "Behavior-grounded representation of tool affordances," in Proceedings of the 2005 IEEE International Conference on Robotics and Automation, Barcelona, Spain, pp. 18–22, 2005.
[11] I. Cos-Aguilera, L. Canamero, and G. M. Hayes, "Motivation-driven learning of object affordances: First experiments using a simulated Khepera robot," in Proceedings of the 9th International Conference on Cognitive Modelling (ICCM'03), Bamberg, Germany, April 2003.
[12] G. Fritz, L. Paletta, M. Kumar, G. Dorffner, R. Breithaupt, and E. Rome, "Visual learning of affordance based cues," in SAB, 2006, pp. 52–64.
[13] D. Kim, J. Sun, S. M. Oh, J. M. Rehg, and A. Bobick, "Traversability classification using unsupervised on-line visual learning for outdoor robot navigation," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA 06), Orlando, FL, May 2006.
[14] E. Ugur, M. R. Dogar, M. Cakmak, and E. Sahin, "The learning and use of traversability affordance using range images on a mobile robot," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA 07), April 2007.
[15] G. Metta, G. Sandini, L. Natale, L. Craighero, and L. Fadiga, "Understanding mirror neurons: a bio-robotic approach," Interaction Studies, vol. 7, 2006.
[16] L. Craighero, G. Metta, G. Sandini, and L. Fadiga, "The mirror-neurons system: data and models," Progress in Brain Research, vol. 164, pp. 39–59, 2007.
[17] L. Montesano, M. Lopes, A. Bernardino, and J. Santos-Victor, "Learning object affordances: From sensory-motor coordination to imitation," IEEE Transactions on Robotics, vol. 24, no. 1, pp. 15–26, 2008.
[18] V. Tikhanoff, P. Fitzpatrick, F. Nori, L. Natale, G. Metta, and A. Cangelosi, "The iCub humanoid robot simulator," Nice, France, September 2008.
[19] G. Metta, G. Sandini, D. Vernon, L. Natale, and F. Nori, "The iCub humanoid robot: an open platform for research in embodied cognition," Washington DC, USA, August 2008.
[20] "Orocos Kinematics and Dynamics Library." [Online]. Available: http://www.orocos.org/kdl
[21] K. Kira and L. A. Rendell, "A practical approach to feature selection," in ML92: Proceedings of the Ninth International Workshop on Machine Learning. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1992, pp. 249–256.
[22] I. Kononenko, "Estimating attributes: Analysis and extensions of RELIEF," in European Conference on Machine Learning, 1994, pp. 171–182.