From: AAAI Technical Report FS-96-02. Compilation copyright © 1996, AAAI (www.aaai.org). All rights reserved.

Emotion, Embodiment, and Consequence Driven Systems

Stevo Bozinovski, Georgi Stojanov, Liljana Bozinovska*
Electrical Engineering Department, University of Skopje, Macedonia
* Institute of Physiology, Medical Department, University of Skopje, Macedonia
{bozinovs, georgiS}@rea.eff.ukim.edu.mk

Abstract

Emotion has been considered as an important concept of embodied systems. A learning agent, the Crossbar Adaptive Array, which implements the concept of emotion in its learning and behavioral scheme, is described. Some issues relevant to embodied intelligence and reinforcement learning are also discussed.

Introduction

In studying artificial embodied systems the focus is mainly on issues associated with the rational aspects (motion planning, reasoning, learning, etc.). There are distinguished fields of science known as Artificial Intelligence and Artificial Neural Systems. However, not so much effort has been concentrated on the issue of artificial emotion. In the terminology of agent based systems, summarized for example in the work of Wooldridge and Jennings (1994) and in the glossary of intelligent behavior (Mataric 1994), the term "emotion" does not exist. We do not know how to produce a robot which is ashamed of something, or how to model envy. Maybe it is because most of our emotions are not rationally driven, and their basis is the hormonal system, not the neural one.

We believe that there are important issues to pursue in relation to emotions, and in this paper we will only scratch the area. First we will introduce the EMbody systems, in which the importance of emotion is emphasized from the viewpoint of sensation, action, and communication. After that we will describe the CAA agent, which actually implements the concept of emotion. It is a rather early reinforcement learning system, which uses the emotional concepts of feelings and desirability to develop its learning procedure. Finally, we discuss additional issues around the concept of embodied intelligence.

EMbody agents

A cognitive system in a body assumes the existence of some spatial encapsulation (skin, shell, ...) surrounding the system. It is assumed that such an obtained body has the ability to address (contact, reach, ...) objects and beings that are placed at some distance from the body. The way of addressing can be by means of sensors, physical actions, and/or communications. The body exists in an environment in which it behaves. Figure 1 shows our basic understanding of embodied intelligence through the concept of EMbody (emotional, embodied system).

Figure 1. The concept of EMbody

As Figure 1 shows, EMbody is an embodied intelligence which has three main features: 1) existence in both a genetic and a behavioral environment, 2) expansion of its presence in the behavioral environment beyond its "raw" body, to the extent of its sensory range, action range, and communication range, and 3) emotion, for the purpose of communication, internal action, and sense of the state values. From the definition of the EMbody it is natural that we define its basic tasks as: 1) to survive in the behavioral environment, 2) to reproduce itself in the genetic environment, 3) to expand its sensory, action, and communication ranges further, and 4) to optimize the value of its internal state, i.e. to experience a "positive" emotion.
In order to achieve the above goals, an EMbody needs knowledge, sensing and learning ability, among other features of an embodied intelligence. It needs to build a cognitive model of the environment in a way it understands it. The EMbody expansion in the behavioral environment is in the sense of expanding its influence in the environment, by means of tools and, at some stage, technology.

This paper will restrict itself to the issue of emotion, which is considered to be a feature in the intersection of the notions of action, sensing, and communication. It is understood as an internal sensation which can sometimes be used as external communication and/or action. This work will describe a consequence driven agent from the EMbody class. Before we give the description of the agent, we will briefly present a part of the theory of consequence driven systems.

Consequence Driven Systems

The theory of consequence driven systems (Bozinovski 1995) covers a wide range of systems, including consequence driven teaching and consequence driven learning (or reinforcement learning) systems. A consequence driven learning system (Figure 2) is an adaptive behavior system which always assumes the existence of an environment (game opponent, world, market, ...) it interacts with. Its basic behavioral routine (below) emphasizes that a consequence driven system, in addition to being an interaction (communication) system, is also a pattern recognition system. In a given situation, it needs to recognize the consequence of a previous segment of its behavioral policy:

  receive new situation
  recognize in that situation a consequence of a previous segment of your behavior
  if so, adjust your behavior to optimize the future consequence
  send appropriate action

Figure 2. A consequence driven learning system

It should be noted that here the situation is the consequence, and there is not necessarily an explicit external reinforcement signal. The consequence driven system can be considered as a planning system. From the AI point of view it can be represented as a rule-based knowledge system using consequence driven if-then rules. They can be in planning form IF X(t) THEN A(t) IN ORDER TO X(t+1), or, more generally, IF X(t) THEN A(t) IN ORDER TO X(t+k), or, in expectation form, IF X(t) THEN A(t) EXPECTING X(t+k), where k >= 1 is in the future. The notion of consequence planning is closely connected to the notion of benefit, or on the emotional level, desirability, of being in some state as a consequence of taking some action, for which a cost should be paid. Figure 3 depicts the concept.

Figure 3. The learning process in a benefit-cost system

As Figure 3 shows, it is assumed that a consequence driven system performs some kind of benefit-cost analysis, and learns on such a basis. The learning variable associated with the basic situation-action-consequence routine is adjusted according to a learning rule of the form

  w'(s,a) = U(w(s,a), V(s'), C(a))    (1)

where w is some learning construct (for example a learning variable), s is the current situation, a is the taken action, V(.) is the (emotional) desirability function, C(.) is the cost function, and x' denotes the next value of x. The simplest learning rule is the desirability-only rule

  w'_{aj} = w_{aj} + v_k    (2)

where w_{aj} is the learning variable associated with the a-th action in the j-th situation, and v_k is the emotional value of being in the k-th situation, obtained as a consequence.
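As a reading aid, the routine above and the general rule (1) can be sketched as a generic loop. This is a minimal sketch under our own assumptions: the callables receive, send, U, V, and C mirror the symbols already introduced, while the dictionary of learning variables and the greedy action choice are illustrative conveniences rather than part of the original formulation. The cost-affected and reinforcement-based rules given below fit the same general shape.

def consequence_driven_loop(receive, send, actions, U, V, C, steps=100):
    # A sketch of the basic behavioral routine; receive() returns the new
    # situation and send(a) performs an action in the environment.
    w = {}                                   # learning variables w(s, a)
    prev_s, prev_a = None, None
    for _ in range(steps):
        s = receive()                        # receive new situation
        if prev_s is not None:
            # the new situation is the consequence of (prev_s, prev_a);
            # adjust behavior to optimize the future consequence, rule (1)
            key = (prev_s, prev_a)
            w[key] = U(w.get(key, 0.0), V(s, w), C(prev_a))
        prev_a = max(actions, key=lambda a: w.get((s, a), 0.0))  # pick the currently best action
        send(prev_a)                         # send appropriate action
        prev_s = s
    return w

# The desirability-only rule (2) is the special case U(w, v, c) = w + v,
# where the cost term is ignored.
def desirability_only(w_sa, v_next, cost):
    return w_sa + v_next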
A simple cost-affected desirability (or benefit-cost) rule is

  w'_{aj} = w_{aj} + σv_k - c_a    (3)

where σ is the benefit discount factor, 0 < σ < 1. A more complex learning rule is (Watkins 1989)

  w'_{aj} = w_{aj} + β(r_k + σv_k - w_{aj})    (4)

where r_k has the interpretation of reinforcement, not merely the cost. Here β is the memory forgetting parameter, known also as the convergence parameter. In general, all the above rules are special cases of the Bellman equation in Dynamic Programming (Bellman 1957; Barto, Bradtke, & Singh 1995), which in its benefit-cost form can be written as

  V(s) = max_{a ∈ A(s)} {σE(V(s')) - E(C(a))}    (5)

where V(s) is the desirability of being in state s, C(a) is the cost of taking action a, E is the expectation operator, and A(s) is the set of permissible actions from the state s.

Emotion as internal reinforcement

A reinforcement learning system which introduced emotional concepts in its learning scheme is the Crossbar Adaptive Array (CAA) agent. It was introduced in 1981 (Bozinovski 1981, 1982), and successfully solved the problem of assignment of credit, well known in Reinforcement Learning (RL) theory. It is an early connectionist system solving that problem, stated by Minsky (1961). It was previously solved by methods of AI, examples being Samuel (1959) and Michie and Chambers (1968); later it was solved by another connectionist learning system, the Actor-Critic architecture (Barto, Sutton, & Anderson 1983). The term delayed reinforcement learning (to the best of our knowledge introduced in modern RL in Bozinovski 1982, on the basis of the work of Keller and Schoenfeld 1950) is also in use as a synonym of temporal credit assignment. The most influential works in the area of delayed reinforcement learning are the theoretical work of Sutton (1988) and the booming work of Watkins (1989), which introduced the well known Q-learning method. There are good reviews on RL, examples being Kaelbling (1993) and Mataric (1994). Recently, it turned out (Bozinovski 1995) that the Q-method is a variant of the original CAA learning procedure. That pointed the attention again to the almost forgotten CAA system.

The key notion of the CAA learning algorithm is that of the crossbar elements (later known as Q-values) w_{aj}, where a is the action (row) and j is the situation (column). The elements are used both for action selection and for internal extraction of the emotional value (state value, benefit, desirability). The CAA learning procedure has four steps:

  1) state j: perform action a, biasing on w_{.j}
  2) state k: compute the state value v_k using w_{.k}
  3) state j: increment w_{aj} using v_k
  4) change state: j <- k; go to 1

where w_{.j} means that the computation considers all the possible actions (the column vector) from the j-th state.

Figure 4. The CAA architecture

The fourth step of the algorithm emphasizes that the CAA algorithm is an on-route acting algorithm. Some versions (for example Watkins' Q-learning) can have a fourth step in which the next state is chosen off-route, for example randomly; those are on-point acting algorithms. Let us note that in step 1 the action is performed only by biasing on, and not necessarily using, the crossbar components; it is even allowed that the action be taken randomly.
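To make the four steps concrete, the following is a minimal sketch of one crossbar episode, not a transcription of the original implementation; the caa_episode and env_step names, the greedy choice with random tie-breaking, and the use of a maximum selector for the state value are our illustrative assumptions (other choices of the value function are discussed below).

import random

def caa_episode(w, env_step, start, goal):
    # w[a][j] is the crossbar element for action a (row) in situation j (column).
    n_actions = len(w)
    j = start
    while j != goal:
        # Step 1, state j: perform an action, biasing on the column w_{.j}
        # (a purely random action is also admissible at this step).
        col_j = [w[a][j] for a in range(n_actions)]
        a = random.choice([i for i, x in enumerate(col_j) if x == max(col_j)])
        k = env_step(j, a)                      # the consequence situation
        # Step 2, state k: compute the state value v_k from the column w_{.k};
        # here a maximum selector is used.
        v_k = max(w[a2][k] for a2 in range(n_actions))
        # Step 3, state j: increment w_{aj} using v_k (desirability-only rule, eq. 2).
        w[a][j] += v_k
        # Step 4: change state, j <- k, and continue acting on-route.
        j = k
    return w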
Some systems, for example the Petitage system (Stojanov & Bozinovski 1996), use predefined pattern movements in the case where a situation-action mapping has not yet been learned for some state. Other variants are discussed elsewhere (Bozinovski 1995).

The CAA architecture is shown in Figure 4. It receives a set of distinct situations from some pattern classifier outside the CAA. That pattern classifier can be another CAA module working as a pattern classifier. Once distinguished, situations are accepted by a crossbar array (later known as a Q-table). Using its CAA learning procedure (given above) and the desirability-only learning rule (2), it is able to learn a path in an emotional graph. It is a secondary reinforcement system, i.e. a goal-subgoal system. Other early connectionist reinforcement learning systems, for example the AC architecture (Barto, Sutton, & Anderson 1983), rely on an external reinforcer. In CAA, the only input to the system are situations. Those situations are used for computation of the internal emotional value, which is further used as reinforcement. The emotional value is computed as

  v_k = lfunc{w_{.k}}    (6)

where lfunc{.} is some real valued function. It can be computed as a maximum selector, a neural sum with controlled threshold, or some other function. However, the function should be chosen carefully in order to assure convergence of the CAA learning procedure.

The CAA architecture was designed to challenge two particular problems in the area of credit assignment: the maze learning problem and the pole balancing problem. From the simple maze learning, the challenge was restated as learning emotional graphs. Here we will briefly describe a maze running environment and an emotional graph environment. The pole balancing problem is discussed elsewhere (Bozinovski 1995).

Figures 5 and 6 report on one of our first (not previously published) experiments, done in 1981 with CAA. Figure 5 shows a graph which represents a maze having only one positive reinforcing state. (This particular graph was designed by Rich Sutton, personal communication.) The state 22 is the reinforcing one. Figure 6 shows the transposed crossbar array and the evolution of its values due to the learning process.

Figure 5. A simple task for the CAA system

The matrix has 22 inputs (situations) and 3 outputs (actions). In 11 iterations one can see how a path is being learned: from the goal state, state 22, backwards, a decreasing chain of values is produced, until the starting situation (here situation 6) is reached. Then the resulting increasing chain of emotional values 1-3-4-5-7-8-A-B (A and B stand for 10 and 11) corresponds to the situation chain 1-4-5-9-13-14-16-17, which is the path toward the goal 22.

Figure 6. Crossbar matrix learning in the experiment (the number of steps per iteration over the 11 iterations was 68, 140, 189, 41, 44, 142, 71, 53, 33, 12, and 9)
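The growth of such a value chain can be illustrated on a toy problem. The sketch below uses a five-situation corridor as a stand-in for the 22-situation maze of Figure 5 (whose exact topology is not reproduced here), seeds the goal column with a positive emotional value in the spirit of the all-ones row of state 22, and takes lfunc to be a simple thresholded selector; all of these are our assumptions for illustration only.

import random

def corridor_demo(n=5, iterations=12):
    # A toy corridor of n situations: action 0 moves right, action 1 moves left;
    # situation n-1 is the only positively reinforcing one (the stand-in for state 22).
    w = [[0.0] * n, [0.0] * n]                  # crossbar elements w[a][j]
    w[0][n - 1] = w[1][n - 1] = 1.0             # goal column seeded with a positive emotional value
    for _ in range(iterations):
        j = 0
        while j != n - 1:
            col = [w[0][j], w[1][j]]
            a = random.choice([i for i, x in enumerate(col) if x == max(col)])
            k = min(j + 1, n - 1) if a == 0 else max(j - 1, 0)
            # lfunc chosen here as a thresholded selector: the consequence situation
            # counts as emotionally positive once any element of its column is positive.
            v_k = 1.0 if max(w[0][k], w[1][k]) > 0 else 0.0
            w[a][j] += v_k                      # desirability-only rule (2)
            j = k
    return w

print(corridor_demo()[0][:-1])   # right-action values of the non-goal situations: an increasing chain toward the goal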
For environments in which emotion can also have negative values, the representation of the problem can be done using the concept of subjective emotional graphs. Figure 7 shows such a graph, where cartooned human emotions represent the internal emotional state of the CAA system. This particular graph shows an environment in which the reward is not found only at the terminal state; there are rewarding states along the way. The agent has multiple goals: to reach the emotionally positive terminal state, but also to collect some pleasure along the way, which in turn can speed up learning.

Figure 7. Representing emotions using emotional graphs

The detailed experimental results, with variants where the teleportation state is a neutral (not emotionally colored) one, are given elsewhere (Bozinovski 1995).
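One way to picture such a subjective emotional graph in code is as a set of states, each carrying a genetically given emotional value together with its outgoing actions. The state names and values below are purely hypothetical stand-ins for the cartooned faces of Figure 7.

emotional_graph = {
    # state: (emotional value, {action: consequence state})
    "start":      ( 0.0, {"a": "corridor"}),
    "corridor":   ( 0.0, {"a": "pleasant", "b": "unpleasant"}),
    "pleasant":   (+0.5, {"a": "goal"}),        # a rewarding state along the way (a subgoal)
    "unpleasant": (-0.5, {"a": "corridor"}),    # an emotionally negative state
    "goal":       (+1.0, {}),                   # the emotionally positive terminal state
}

On this reading, a positive value met along the way acts as a secondary reinforcer for the crossbar columns behind it, which is one way to see why collecting some pleasure along the way can speed up learning.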
Discussion

Here we discuss some topics which appeared in our work and are of interest in RL and embodied, behavior-based intelligence (for example Mataric 1994). In addition to emotion, we will briefly mention the issues of action function approximation and virtual reality.

Continuity and Action Function Approximation

A real robot with real valued sensors and effectors faces the problem of continuity. A proper partitioning of the input space is needed in order to avoid large spaces, with dimensionality equal to the number of points that the sensor vector can produce. As CAA meets that problem, it is the well known pattern classification problem (for example Duda and Hart 1973). So CAA divides the problem in two parts: situation recognition and behavior selection. However, the problem of how to perform the learning for situation recognition is not addressed by CAA in the delayed reinforcement learning task. The CAA pattern classification training is discussed in (Bozinovski 1995), but outside the context of delayed reinforcement learning. It is supposed that somehow, for example by supervised training, the CAA will learn to classify the world sensations into manageable situations. Another possibility is the use of a fuzzy classifier, which in the process of fuzzification assigns linguistic variables as classes.

An interesting issue, not discussed in RL and Behavioral Intelligence, is the continuity of the action space. A good example is the CAA application to the pole balancing problem (Bozinovski 1981b; Bozinovski and Anderson 1983). While other RL pole balancing problem solvers use two actions, LEFT and RIGHT, CAA uses three actions: LEFT, RIGHT, and DO-NOTHING. It turned out that this greatly simplifies the problem, since a large part of the problem space becomes a "do nothing" region. So the way the action space is partitioned in the design phase of the system will reflect heavily on the partitioning of the input space. This is the not widely discussed action function approximation problem, in contrast to the value function approximation problem, which is discussed more often in connection with the input space partitioning problem. Those issues are also addressed in Mataric (1995).

States and Virtual Reality

The concept of state is one of the cornerstones in AI, Dynamic Programming, and Reinforcement Learning. There are several problems with states, one of which is the aliasing problem, which we call the virtual reality problem: how to distinguish between states if they appear to be the same to the agent's sensory system? Figure 8 represents the problem.

Figure 8. Different layers of states

Figure 8 shows an environment consisting of five states defined by the action restrictions in some environment. Some of them appear the same to the sensory system of the agent (notice the second layer), although they are actually different. States also have emotional values (third layer). Since the agent virtually lives in the second layer, it can build a wrong model of the environment. In some cases it can build a stochastic model in a deterministic environment. One solution to the problem is the Petitage system (Stojanov & Bozinovski 1996), which solves the problem by extending the definition of a state with its action connections to the neighborhood. The new definition of the state it deals with is the sensory registered state plus the connections to the neighborhood states.

Emotion and Secondary Reinforcement

In contemporary RL theory the notions of emotion and secondary reinforcement are not in wide use. More general terms from Dynamic Programming and Neural Networks, such as state value and state value backpropagation, are used instead. However, the concept of emotion emphasizes that the state value is computed internally in the agent, not given from the outside world. We believe that emotions are genetically built into the system and that they can serve as primary drives. They produce elementary behaviors. From the primary reinforcers, an agent, for example an animal, will develop its learning behavior by the secondary reinforcement mechanism, among other methods of updating the memory. The nature and necessity of the concept of emotion is still not appropriately modeled in AI, and our attempt is only one aspect, a small step toward that provoking field.

Acknowledgment. Most of the ideas expressed in this paper are results of the stay of the first author with the Adaptive Networks Group, University of Massachusetts, Amherst. The authors wish to thank Andy Barto, Nico Spinelli, and Rich Sutton for the inspiring environment in which CAA was designed and evaluated. Special thanks to John Moore, for pointing out relevant issues in animal learning and secondary reinforcement.

References

Barto, A.; Sutton, R.; and Anderson, C. 1983. Neuronlike elements that can solve difficult learning control problems. IEEE Trans. Systems, Man, and Cybernetics 13: 834-846.

Barto, A.; Bradtke, S.; and Singh, S. 1995.
Learning to act using real-time dynamic programming. Artificial Intelligence 72: 81-138.

Bellman, R. 1957. Dynamic Programming. Princeton University Press.

Bozinovski, S. 1981a. A self-learning system using secondary reinforcement. First report on CAA network, ANW Group, November 25, Computer Science Department, University of Massachusetts, Amherst.

Bozinovski, S. 1981b. Inverted pendulum learning control. ANW Memo, December 10, Computer Science Department, University of Massachusetts, Amherst.

Bozinovski, S. 1982. A self-learning system using secondary reinforcement. In R. Trappl, ed. Cybernetics and Systems Research, 397-402. North Holland.

Bozinovski, S., and Anderson, C. 1983. Associative memory as a controller of an unstable system: Simulation of a learning control. In Proceedings of the IEEE Mediterranean Electrotechnical Conference, C5.11, Athens, Greece.

Bozinovski, S. 1995. Consequence Driven Systems. Gocmar Press.

Brooks, R. 1986. A robust layered control system for a mobile robot. IEEE Journal of Robotics and Automation 2: 14-23.

Duda, R., and Hart, P. 1973. Pattern Classification and Scene Analysis. Wiley Interscience.

Kaelbling, L. 1993. Learning in Embedded Systems. MIT Press.

Keller, F., and Schoenfeld, W. 1950. Principles of Psychology. Appleton-Century-Crofts.

Mataric, M. 1994. Interaction and Intelligent Behavior. Ph.D. Thesis, MIT.

Michie, D., and Chambers, R. 1968. BOXES: An experiment in adaptive control. In E. Dale, and D. Michie, eds. Machine Intelligence 2: 137-152. Oliver and Boyd.

Minsky, M. 1961. Steps toward artificial intelligence. Proceedings of the IRE, 8-30.

Samuel, A. 1959. Some studies in machine learning using the game of checkers. IBM Journal of Research and Development 3(3): 210-229.

Stojanov, G.; Stefanovski, S.; and Bozinovski, S. 1995. Expectancy based emergent environment models for autonomous agents. In Proc. Int. Symp. on Automatic Control and Computer Science, 217-221. Iasi, Romania.

Stojanov, G., and Bozinovski, S. 1996. Interactionist-expectative view on agency and learning. In Proc. Int. Conf. Information Technology Interfaces, 85-90. Pula, Croatia.

Sutton, R. 1988. Learning to predict by the methods of temporal differences. Machine Learning 3: 9-44.

Watkins, C. 1989. Learning from delayed rewards. Ph.D. Thesis, King's College, Cambridge, England.

Wooldridge, M., and Jennings, N. 1994. Agent theories, architectures, and languages: A survey. In M. Wooldridge, and N. Jennings, eds. Intelligent Agents, Proc. ECAI-94 Workshop, 1-39. Amsterdam, Holland.