From: AAAI Technical Report FS-96-02. Compilation copyright © 1996, AAAI (www.aaai.org). All rights reserved.
Emotion, Embodiment, and Consequence Driven Systems
Stevo Bozinovski, Georgi Stojanov, Liljana Bozinovska*
Electrical Engineering Department, University of Skopje, Macedonia
* Institute of Physiology, Medical Department, University of Skopje, Macedonia
{bozinovs, georgiS}@rea.eff.ukim.edu.mk
Abstract
Emotion has been considered as an important concept of embodied systems. A learning agent, the Crossbar Adaptive Array, which implements the concept of emotion in its learning and behavioral scheme, is described. Some issues relevant to embodied intelligence and reinforcement learning are also discussed.
Introduction
In studying artificial embodied systems the focus is mainly on issues associated with the rational aspects (motion planning, reasoning, learning, etc.). There are distinguished fields of science known as Artificial Intelligence and Artificial Neural Systems. However, there is not so much effort concentrated on the issue of artificial emotion. In the terminology of agent based systems, summarized for example in the work of Wooldridge and Jennings (1994) and in the glossary of intelligent behavior (Mataric 1994), the term "emotion" does not exist. We do not know how to produce a robot which is ashamed of something, or how to model envy. Maybe it is because most of our emotions are not rationally driven, and their basis is the hormonal system, not the neural one.

We believe that there are important issues to pursue in relation to emotions, and in this paper we will only scratch the area. First we will introduce the EMbody systems, in which the importance of emotion is emphasized from the viewpoint of sensation, action, and communication. After that we will describe the CAA agent, which actually implements the concept of emotion. It is a rather early reinforcement learning system, which uses the emotional concepts of feelings and desirability to develop its learning procedure. Finally we discuss additional issues around the concept of embodied intelligence.
EMbody agents
A cognitive system in a body assumes the existence of some spatial encapsulation (skin, shell, ...) surrounding the system. It is assumed that such an obtained body has the ability to address (contact, reach, ...) objects and beings that are placed at some distance from the body. The way of addressing can be by means of sensors, physical actions, and/or communications. The body exists in an environment in which it behaves. Figure 1 shows our basic understanding of an embodied intelligence through the concept of EMbody (emotional, embodied system).
Figure 1. The concept of EMbody
As Figure 1 shows, EMbody is an embodied intelligence which has three main features: 1) existence in both genetic and behavioral environment, 2) expansion of its presence in the behavioral environment beyond its "raw" body, to the extent of its sensory range, action range and communication range, and 3) emotion, for the purpose of communication, internal action and sensing of the state values.
From the definition of the EMbody it is natural that we define its basic tasks as: 1) to survive in the behavioral environment, 2) to reproduce itself in the genetic environment, 3) to expand its sensory, action, and communication ranges further, and 4) to optimize the value of its internal state, i.e. to experience a "positive" emotion.
In order to achieve the above goals, an EMbody needs knowledge, sensing ability, and learning ability, among other features of an embodied intelligence. It needs to build a cognitive model of the environment in a way it understands it. The EMbody expansion in the behavioral environment is in the sense of expanding its influence in the environment, by means of tools and, at some stage, technology.
This paper will restrict itself to the issue of emotion, which is considered to be a feature in the intersection of the notions of action, sensing, and communication. It is understood as an internal sensation which can sometimes be used as external communication and/or action.
This work will describe a consequence driven agent from the EMbody class. Before we give the description of the agent, we will briefly present a part of the theory of consequence driven systems.
Consequence Driven Systems
The theory of consequence driven systems (Bozinovski 1995) covers a wide range of systems, including consequence driven teaching and consequence driven learning (or reinforcement learning) systems.

A consequence driven learning system (Figure 2) is an adaptive behavior system which always assumes the existence of an environment (game opponent, world, market, ...) it interacts with. Its basic behavioral routine (below) emphasizes that a consequence driven system, in addition to being an interaction (communication) system, is also a pattern recognition system. In a given situation, it needs to recognize the consequence of its previous behavioral policy:
receive new situation
recognize in that situation a consequence of a previous segment of your behavior
if so, adjust your behavior to optimize the future consequence
send appropriate action
It should be noted that here the situation is the consequence, and there is not necessarily an explicit external reinforcement signal.
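To make the routine concrete, the following is a minimal Python sketch of such an interaction loop, written under our own assumptions rather than taken from the paper's implementation: the situation/action encoding, the greedy choice, and the internally computed emotion function are illustrative stand-ins.

```python
import random

class ConsequenceDrivenAgent:
    """Minimal sketch of the behavioral routine: receive a situation, treat it
    as a consequence of earlier behavior, adjust, then act.
    Names and data structures are illustrative, not from the original paper."""

    def __init__(self, actions, primary_emotion):
        self.actions = actions
        self.primary_emotion = primary_emotion   # dict: situation -> innate emotional value
        self.w = {}                              # learning variables w(situation, action)
        self.prev = None                         # previously taken (situation, action)

    def emotion(self, situation):
        # internal desirability of a situation: innate value plus the best learned prospect
        learned = max((self.w.get((situation, a), 0.0) for a in self.actions), default=0.0)
        return self.primary_emotion.get(situation, 0.0) + learned

    def step(self, situation):
        # receive new situation; recognize it as a consequence of the previous
        # behavior segment and adjust that segment's learning variable
        if self.prev is not None:
            self.w[self.prev] = self.w.get(self.prev, 0.0) + self.emotion(situation)
        # send an appropriate action (greedy on learned variables, random tie-break)
        action = max(self.actions,
                     key=lambda a: (self.w.get((situation, a), 0.0), random.random()))
        self.prev = (situation, action)
        return action
```

Note that the reinforcement in this sketch is computed inside the agent from the received situation, mirroring the remark above that no explicit external reinforcement signal is required.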
The consequence driven system can be considered as a planning system. From the AI point of view it can be represented as a rule-based knowledge system using consequence driven if-then rules. They can be in planning form

IF X(t) THEN A(t) IN ORDER TO X(t+1),

or, more generally,

IF X(t) THEN A(t) IN ORDER TO X(t+k),

or, in its expectation form

IF X(t) THEN A(t) EXPECTING X(t+k),

where k > 1 is in the future.
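As an illustration only, such a planning rule can be held in a small data structure; the field names below are hypothetical, chosen to mirror the IF / THEN / IN ORDER TO form, and are not part of the original theory.

```python
from dataclasses import dataclass

@dataclass
class ConsequenceRule:
    """IF situation THEN action IN ORDER TO / EXPECTING expected_situation, k steps ahead."""
    situation: str
    action: str
    expected_situation: str
    horizon: int = 1        # k: how many steps ahead the consequence is expected

# a rule stating: in situation X(t), take A(t) expecting situation X(t+k)
rule = ConsequenceRule(situation="at_door", action="push", expected_situation="in_room")
print(f"IF {rule.situation} THEN {rule.action} EXPECTING {rule.expected_situation} (t+{rule.horizon})")
```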
The notion of consequence planning is closely connected to the notion of benefit, or, on the emotional level, the desirability of being in some state as a consequence of taking some action, for which a cost should be paid. Figure 3 depicts the concept.
Figure 2. A consequence driven learning system

Figure 3. The learning process in a benefit-cost system
As Figure 3 shows, it is assumed that a consequence driven system performs some kind of benefit-cost analysis, and learns on such a basis. The learning variable associated with the basic situation-action-consequence routine is adjusted according to a learning rule of the form

w'(s,a) = U(w(s,a), V(s'), C(a))

where w is some learning construct (for example a learning variable), s is the current situation, a is the taken action, V(.) is the (emotional) desirability function, C(.) is the cost function, and x' denotes the next value of x.
The simplest learning rule is the desirability-only rule

w'_aj = w_aj + v_k                                (2)

where w_aj is the learning variable associated with the a-th action in the j-th situation, and v_k is the emotional value of being in the k-th situation, obtained as a consequence.
A simple cost-affected desirability (or benefit-cost) rule is

w'_aj = w_aj + γv_k - c_a                         (3)

where γ is the benefit discount factor, 0 < γ < 1, and c_a is the cost of the taken action.
A more complex learning rule is (Watkins 1989)

w'_aj = w_aj + β(r_k + γv_k - w_aj)               (4)

where r_k has the interpretation of a reinforcement, not merely the cost. Here β is the memory forgetting parameter, known also as the convergence parameter.
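The three update rules can be written as one-line functions. The sketch below is only illustrative; the default numeric values of the discount and convergence parameters are arbitrary assumptions, not taken from the paper.

```python
def desirability_only(w_aj, v_k):
    """Rule (2): add the emotional value of the consequence situation."""
    return w_aj + v_k

def benefit_cost(w_aj, v_k, c_a, gamma=0.9):
    """Rule (3): discounted desirability minus the cost of the action taken."""
    return w_aj + gamma * v_k - c_a

def watkins_style(w_aj, v_k, r_k, gamma=0.9, beta=0.1):
    """Rule (4): move w toward the reinforcement plus the discounted consequence value."""
    return w_aj + beta * (r_k + gamma * v_k - w_aj)

# example: one update of a learning variable under each rule
w = 0.5
print(desirability_only(w, v_k=1.0))          # 1.5
print(benefit_cost(w, v_k=1.0, c_a=0.2))      # 1.2
print(watkins_style(w, v_k=1.0, r_k=0.0))     # 0.54
```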
In general, all the above rules are special cases of the Bellman equation in Dynamic Programming (Bellman 1957; Barto, Bradtke, & Singh 1995), which in its benefit-cost form can be written as

V(s) = max_{a in A(s)} { γE(V(s')) - E(C(a)) }    (5)

where V(s) is the desirability of being in state s, C(a) is the cost of taking action a, E is the expectation operator, and A(s) is the set of permissible actions from state s.
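A compact way to see equation (5) at work is a value-iteration style sweep over a small deterministic model; the transition table, costs, and terminal value below are invented purely for illustration.

```python
# states and their permissible actions: A(s) maps action -> (cost, next_state), deterministic
model = {
    "s1": {"go": (1.0, "s2"), "wait": (0.1, "s1")},
    "s2": {"go": (1.0, "goal"), "wait": (0.1, "s2")},
    "goal": {},                      # terminal, emotionally positive state
}
V = {"s1": 0.0, "s2": 0.0, "goal": 10.0}
gamma = 0.9

for _ in range(20):                  # repeated benefit-cost Bellman backups
    for s, actions in model.items():
        if actions:                  # V(s) = max_a { gamma * V(s') - C(a) }
            V[s] = max(gamma * V[s_next] - cost for cost, s_next in actions.values())

print(V)   # desirabilities decrease with distance from the goal
```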
Emotion as internal reinforcement
A reinforcement learning system which introduced emotional concepts in its learning scheme is the Crossbar Adaptive Array (CAA) agent. It was introduced in 1981 (Bozinovski 1981, 1982), and successfully solved the problem of assignment of credit, well known in Reinforcement Learning (RL) theory. It is an early connectionist system solving that problem, stated by Minsky (1961). The problem was previously solved by methods of AI, examples being Samuel (1959) and Michie and Chambers (1968); later it was solved by another connectionist learning system, the Actor-Critic architecture (Barto, Sutton, & Anderson 1983). The term delayed reinforcement learning (to the best of our knowledge introduced in modern RL in Bozinovski 1982, on the basis of the work of Keller and Schoenfeld 1950) is also in use as a synonym of temporal credit assignment. The most influential works in the area of delayed reinforcement learning are the theoretical work of Sutton (1988) and the work of Watkins (1989), which introduced the well known Q-learning method. There are good reviews of RL, examples being Kaelbling (1993) and Mataric (1994). Recently, it turned out (Bozinovski 1995) that the Q-method is a variant of the original CAA learning procedure. That pointed attention again to the almost forgotten CAA system.

The key notion of the CAA learning algorithm are the crossbar elements (later known as Q-values) w_aj, where a is the action (row) and j is the situation (column). The elements are used for both action selection and internal extraction of the emotional value (state value, benefit, desirability). The CAA learning procedure has four steps:

1) state j: perform action a biasing on w_*j
2) state k: compute state value v_k using w_*k
3) state j: increment w_aj using v_k
4) change state: j := k; go to 1

where w_*j means that the computation considers all the possible actions (the column vector) from the j-th state.
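A minimal sketch of the four-step procedure with the desirability-only rule (2) might look as follows. The plain actions-by-situations array, the maximum-selector emotion function, the innate situation values, and the env.step(j, a) interface are our own assumptions for illustration, not the original 1981 implementation.

```python
import random

class CrossbarAdaptiveArray:
    """Sketch of the CAA four-step learning procedure with the desirability-only rule (2)."""

    def __init__(self, n_actions, n_situations, innate_values=None):
        # crossbar elements w[a][j] (later known as Q-values)
        self.w = [[0.0] * n_situations for _ in range(n_actions)]
        # illustrative way to encode genetically given desirability of situations
        self.innate = list(innate_values) if innate_values is not None else [0.0] * n_situations

    def state_value(self, k):
        # emotional value v_k of situation k from its column w_*k (maximum selector)
        return self.innate[k] + max(row[k] for row in self.w)

    def act(self, j):
        # step 1: perform an action biasing on column w_*j (greedy with random tie-break)
        return max(range(len(self.w)), key=lambda a: (self.w[a][j], random.random()))

    def learn(self, env, j, steps=100):
        for _ in range(steps):
            a = self.act(j)              # step 1: act in situation j
            k = env.step(j, a)           # receive the consequence situation k
            v_k = self.state_value(k)    # step 2: compute v_k using w_*k
            self.w[a][j] += v_k          # step 3: increment w_aj using v_k (rule 2)
            j = k                        # step 4: change state j := k and go to step 1
```

Because v_k is extracted from the agent's own crossbar column (plus any innate value), the reinforcement here is internal, which is the point this section emphasizes.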
Figure 4. The CAA architecture
The fourth step of the algorithm emphasizes that the CAA algorithm is an on-route acting algorithm. Some versions (for example Watkins' Q-learning) can have a fourth step in which the next state is chosen off-route, for example randomly; those are on-point acting algorithms. Let us note that in step 1 the action is performed only biasing on, and not necessarily using, the crossbar components. It even assumes that the action can be taken randomly. Some systems, for example the Petitage system (Stojanov & Bozinovski 1996), use predefined pattern movements in the case where a situation-action mapping is not yet learned for some state. Other variants are discussed elsewhere (Bozinovski 1995).
The CAA architecture is shown in Figure 4. It receives a set of distinct situations from some pattern classifier outside the CAA. That pattern classifier can be another CAA module working as a pattern classifier. Once distinguished, situations are accepted by a crossbar array (later known as a Q-table). Using its CAA learning procedure (given above) and the desirability-only learning rule (2), it is able to learn a path in an emotional graph. It is a secondary reinforcement system, i.e. a goal-subgoal system. Other early connectionist reinforcement learning systems, for example the AC architecture (Barto, Sutton, & Anderson 1983), rely on an external reinforcer. In CAA, the only inputs to the system are situations. Those situations are used for computation of the internal emotional value, which is further used as reinforcement.
The emotional value is computed as

v_k = vfunc{w_*k}                                 (6)

where vfunc{.} is some real valued function. It can be computed as a maximum selector, a neural sum with a controlled threshold, or some other function. However, the function should be chosen carefully in order to assure convergence of the CAA learning procedure.
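Two of the options mentioned above are sketched here only as an illustration; the function names and the threshold value are assumptions, not taken from the paper.

```python
def vfunc_max(column):
    """Emotional value of a situation as a maximum selector over its crossbar column w_*k."""
    return max(column)

def vfunc_thresholded_sum(column, threshold=1.0):
    """Emotional value as a neural sum with a controlled threshold."""
    s = sum(column)
    return s if s > threshold else 0.0

w_column_k = [0.0, 0.7, 0.4]              # w_*k: one crossbar column for situation k
print(vfunc_max(w_column_k))              # 0.7
print(vfunc_thresholded_sum(w_column_k))  # 1.1 (the sum exceeds the threshold)
```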
The CAA architecture was designed to challenge two particular problems in the area of credit assignment: the maze learning problem and the pole balancing problem. From simple maze learning, the challenge was restated as learning emotional graphs. Here we will briefly describe a maze running and emotional graph environment. The pole balancing problem is discussed elsewhere (Bozinovski 1995). Figures 5 and 6 report on one of our first (not previously published) experiments, done in 1981 with CAA. Figure 5 shows a graph which represents a maze having only one positive reinforcing state. (This particular graph was designed by Rich Sutton (personal communication).) State 22 is the reinforcing one. Figure 6 shows the transposed crossbar array and the evolution of its values due to the learning process.
Figure 5. A simple task for the CAA system
The matrix has 22 inputs (situations) and 3 outputs (actions). In 11 iterations one can see how a path is being learned: from the goal state, state 22, backwards, a decreasing chain of values is produced, until the starting situation (here situation 6) is reached. Then the resulting increasing chain of emotional values 1-3-4-5-7-8-A-B (A and B stand for 10 and 11) corresponds to the situation chain 1-4-5-9-13-14-16-17, which is the path toward the goal 22.
Figure 6. Crossbar matrix learning in the experiment (the transposed crossbar matrix, 22 situations by 3 actions, over 11 learning iterations; steps per iteration: 68, 140, 189, 41, 44, 142, 71, 53, 33, 12, 9)
For environments in which emotion can also have negative values, the representation of the problem can be done using the concept of subjective emotional graphs. Figure 7 shows such a graph, where cartooned human emotions represent the internal emotional state of the CAA system. This particular graph shows an environment in which the reward is not only found at the terminal state; there are rewarding states along the way. The agent has multiple goals: to reach the emotionally positive terminal state, but also to collect some pleasure along the way, which in turn can speed up learning. Those issues are also addressed in Mataric (1995). The detailed experimental results, with variants where the teleportation state is a neutral (not emotionally colored) one, are given elsewhere (Bozinovski 1995).

Figure 7. Representing emotions using emotional graphs
Discussion
Here we discuss some topics which appeared in our work and are of interest in RL and embodied, behavior-based intelligence (for example Mataric 1994). In addition to emotion, we will briefly mention the issues of action function approximation and virtual reality.
Continuity and Action Function Approximation
A real robot with real valued sensors and effectors faces the problem of continuity. A proper partitioning of the input space is needed in order to avoid large spaces, with dimensionality equal to the number of points that the sensor vector can produce. As CAA meets that problem, it is the well known pattern classification problem (for example Duda and Hart 1973). So CAA divides the problem in two parts: situation recognition and behavior selection. However, the problem of how to perform the learning for situation recognition is not addressed by CAA in the delayed reinforcement learning task. The CAA pattern classification training is discussed in (Bozinovski 1995), but outside the context of delayed reinforcement learning. It is supposed that somehow, for example by supervised training, the CAA will learn to classify the world sensations into manageable situations. Another option is the usage of a fuzzy classifier, which in the process of fuzzification assigns linguistic variables as classes.

An interesting issue, not discussed in RL and Behavioral Intelligence, is the issue of continuity of the action space. A good example is the CAA application to the pole balancing problem (Bozinovski 1981b, Bozinovski & Anderson 1983). While other RL pole balancing problem solvers use two actions, LEFT and RIGHT, CAA uses three actions: LEFT, RIGHT, and DO-NOTHING. It turned out that this greatly simplifies the problem, since a large part of the problem space becomes a 'do nothing' region. So the partitioning of the action space in the design phase of the system will reflect itself heavily on the partitioning of the input space. This is the not widely discussed action function approximation problem, in contrast to the value function approximation problem, discussed more often in connection with the input space partitioning problem.
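As a toy illustration of why a DO-NOTHING action enlarges the inactive part of the problem space, consider a bang-bang controller with a dead band; the threshold value is an arbitrary assumption and is not taken from the CAA pole balancing experiments.

```python
LEFT, RIGHT, DO_NOTHING = "LEFT", "RIGHT", "DO-NOTHING"

def two_action_policy(pole_angle):
    # with only two actions the controller must always push, even near balance
    return LEFT if pole_angle > 0.0 else RIGHT

def three_action_policy(pole_angle, dead_band=0.05):
    # the dead band maps a large region of the state space to a single action
    if abs(pole_angle) <= dead_band:
        return DO_NOTHING
    return LEFT if pole_angle > 0.0 else RIGHT

for angle in (-0.2, -0.01, 0.0, 0.03, 0.2):
    print(angle, two_action_policy(angle), three_action_policy(angle))
```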
States and Virtual Reality
The concept of state is one of the cornerstones of AI, Dynamic Programming and Reinforcement Learning. There are several problems with states, one of which is the aliasing problem, which we call the virtual reality problem: how to distinguish between states if they appear the same to the agent's sensory system? Figure 8 represents the problem.

Figure 8. Different layers of states
Figure 8 shows an environment consisting of 5 states defined by the action restrictions in some environment. Some of them appear the same to the sensory system of the agent (notice the second layer), although they are actually different. States also have emotional values (third layer). Since the agent virtually lives in the second layer, it can build a wrong model of the environment. In some cases it can build a stochastic model of a deterministic environment.
One solution to the problem is the Petitage system (Stojanov & Bozinovski 1996), which solves it by extending the definition of a state with its action connections to the neighborhood. The new definition of the state it deals with is the sensory registered state plus the connections to the neighborhood states.
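A minimal sketch of that extended state definition is given below with hypothetical field names; the Petitage system itself is described in Stojanov & Bozinovski (1996), not here.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ExtendedState:
    """A state is the sensory reading plus its action connections to neighboring readings."""
    sensory_reading: str
    connections: tuple = field(default_factory=tuple)   # (action, neighbor_reading) pairs

# two places that look identical to the sensors ...
a = ExtendedState("corridor", (("north", "door"), ("south", "corridor")))
b = ExtendedState("corridor", (("north", "corridor"), ("south", "wall")))
print(a == b)   # False: the neighborhood connections disambiguate the aliased states
```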
Emotion and Secondary Reinforcement
In contemporary RL theory the notions of emotion and secondary reinforcement are not in wide use. More general terms from Dynamic Programming and Neural Networks, such as state value and state value backpropagation, are used instead. However, the concept of emotion emphasizes that the state value is computed internally in the agent, not given from the outside world. We believe that emotions are genetically built into the system and that they can serve as primary drives. They produce elementary behaviors. From the primary reinforcers, an agent, for example an animal, will develop its learning behavior by the secondary reinforcement mechanism, among other methods of updating the memory. The nature and necessity of the concept of emotion is still not appropriately modeled in AI, and our attempt is only one aspect, a small step toward that provoking field.
Acknowledgment. Most of the ideas expressed in this paper are results of the stay of the first author with the Adaptive Networks Group, University of Massachusetts, Amherst. The authors wish to thank Andy Barto, Nico Spinelli, and Rich Sutton for the inspiring environment in which CAA was designed and evaluated. Special thanks to John Moore, for pointing out relevant issues in animal learning and secondary reinforcement.
References
Barto, A.; Sutton, R.; and Anderson, C. 1983. Neuronlike elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics 13: 834-846.
Barto, A.; Bradtke, S.; and Singh, S. 1995. Learning to act using real-time dynamic programming. Artificial Intelligence 72: 81-138.
Bellman, R. 1957. Dynamic Programming. Princeton University Press.
Bozinovski, S. 1981a. A self-learning system using secondary reinforcement. First report on CAA network, ANW Group, November 25, Computer Science Department, University of Massachusetts, Amherst.
Bozinovski, S. 1981b. Inverted pendulum learning control. ANW Memo, December 10, Computer Science Department, University of Massachusetts, Amherst.
Bozinovski, S. 1982. A self-learning system using secondary reinforcement. In R. Trappl, ed. Cybernetics and Systems Research, 397-402. North Holland.
Bozinovski, S., and Anderson, C. 1983. Associative memory as a controller of an unstable system: Simulation of a learning control. In Proceedings of the IEEE Mediterranean Electrotechnical Conference, C5.11, Athens, Greece.
Bozinovski, S. 1995. Consequence Driven Systems. Gocmar Press.
Brooks, R. 1986. A robust layered control system for a mobile robot. IEEE Journal of Robotics and Automation 2: 14-23.
Duda, R., and Hart, P. 1973. Pattern Classification and Scene Analysis. Wiley Interscience.
Kaelbling, L. 1993. Learning in Embedded Systems. MIT Press.
Keller, F., and Schoenfeld, W. 1950. Principles of Psychology. Appleton-Century-Crofts.
Mataric, M. 1994. Interaction and Intelligent Behavior. Ph.D. Thesis, MIT.
Michie, D., and Chambers, R. 1968. BOXES: An experiment in adaptive control. In E. Dale, and D. Michie, eds. Machine Intelligence 2: 137-152. Oliver and Boyd.
Minsky, M. 1961. Steps toward artificial intelligence. Proceedings of the IRE, 8-30.
Samuel, A. 1959. Some studies in machine learning using the game of checkers. IBM Journal of Research and Development 3 (3): 210-229.
Stojanov, G.; Stefanovski, S.; and Bozinovski, S. 1995. Expectancy based emergent environment models for autonomous agents. In Proc. Int. Symp. on Automatic Control and Computer Science, 217-221. Iasi, Romania.
Stojanov, G., and Bozinovski, S. 1996. Interactionist-expectative view on agency and learning. In Proc. Int. Conf. Information Technology Interfaces, 85-90. Pula, Croatia.
Sutton, R. 1988. Learning to predict by the methods of temporal differences. Machine Learning 3: 9-44.
Watkins, C. 1989. Learning from delayed rewards. Ph.D. Thesis, King's College, Cambridge, England.
Wooldridge, M., and Jennings, N. 1994. Agent theories, architectures, and languages: A survey. In M. Wooldridge, and N. Jennings, eds. Intelligent Agents, Proc. ECAI-94 Workshop, 1-39. Amsterdam, Holland.