From: Proceedings of the First International Conference on Multiagent Systems. Copyright © 1995, AAAI (www.aaai.org). All rights reserved.

Developing Industrial Multi-Agent Systems†

N. R. Jennings, J. M. Corera, I. Laresgoiti

Dept. Electronic Engineering, Queen Mary & Westfield College, Mile End Road, London E1 4NS, UK. N.R.Jennings@qmw.ac.uk
IBERDROLA S. A., c/Gardoqui 8, 48008 Bilbao, Spain. jose.corera@iberdrola.es
LABEIN, Parque Technologico de Zamudio 101, 48016 Zamudio, Spain. lares@labein.es

(INVITED PAPER)

Abstract

The development and deployment of multi-agent systems in real world settings raise a number of important research issues and problems which must be overcome if Distributed AI (DAI) is to become a widespread solution technology. Work undertaken in the context of the ARCHON project has provided a number of important insights into these issues. By providing an in depth analysis of ARCHON’s electricity transportation management application, this paper draws together many of the experiences obtained when building one of the world’s first operational DAI systems.

Introduction

In many industrial applications a substantial amount of time, effort and finance has been devoted to developing complex and sophisticated software systems. These systems are often viewed in a piecemeal manner as isolated islands of automation, when, in reality, they should be seen as components of a much larger business function (Jennings, 1994a). The main benefit of taking a holistic perspective is that the partial subsystems can be integrated into a coherent and consistent super-system in which they work together to better meet the needs of the entire application.
By the very fact that they are integrated, the finite budgets available for information technology development can be made to go further - consistent and up-to-date versions of the data can be shared by all the problem solvers, basic functionalities need only be implemented in one place, problem solving can make use of timely information which might not otherwise be available, and so on.

Two components are required to develop a well-structured DAI system: a software framework which provides assistance for interaction between the constituent subcomponents and a design methodology which provides a means of structuring these interactions. ARCHON addresses both of these facets: providing a decentralised software platform which offers the necessary control and level of integration to help the subcomponents to work together and devising a concomitant methodology which offers guidance on how to decompose the overall application and how to distribute the constituent tasks throughout the community to make best use of the capabilities of the ARCHON framework. Both of these facets have been applied to a number of real world industrial applications (Jennings, 1994b) - however here the electricity transportation management application developed and run on-line at Iberdrola¹ is the main focus.

† The work described in this paper was supported by the ESPRIT II project P2256 (ARCHON).

ARCHON’s individual problem solving entities are called agents; these agents have the ability to control their own problem solving and to interact with other community members. The interactions typically involve agents cooperating and communicating with one another in order to enhance their individual problem solving and to better solve the overall application problem. Each agent consists of an ARCHON Layer (AL) and an application program (known as an Intelligent System (IS)). Purpose-built ISs can make use of the ARCHON functionality to enhance their problem solving and to improve their robustness.
However pre-existing ISs can also be incorporated, with a little adaptation, and can experience similar benefits (Jennings et al., 1993). This latter point is important because in many cases developing the entire application afresh would be considered too expensive or too large a change away from proven technology (Jennings and Wittig, 1992).

To successfully incorporate both purpose-built and pre-existing systems, community design must be carried out from two different perspectives simultaneously. A top down approach is needed to look at the overall needs of the application and a bottom up approach is needed to look at the capabilities of the existing systems. Once the gap between what is required and what is available has been identified, the system designer can choose to provide the additional functionality through new systems, through additions to the existing systems, or through the ARCHON software itself. This methodology, which is described more thoroughly in Varga et al. (1994), shapes the design process by providing guidelines for problem decomposition and distribution which reduce inefficiencies.

This paper is organised along the following lines: section two provides an overview of the ARCHON architecture. Section three describes the Iberdrola application - it involves seven heterogeneous agents, a substantial number of which were purpose built, which perform three main types of activity (data acquisition, fault diagnosis, and service restoration) and cooperate in a number of different styles. Finally, section four highlights some key experiences which impact upon the design and implementation of future DAI systems.

1. Iberdrola is a large Spanish electric utility. Their transport network, on which this application operates, is controlled by the North Dispatch Control Room (DCR) located in Bilbao.

The Archon Framework

The ARCHON software has been used to integrate a wide variety of application program types under the general assumption that the ensuing agents will be loosely coupled and semi-autonomous. The ISs themselves can be heterogeneous - in terms of their programming language, their algorithm, their problem solving paradigm, and their hardware platform - as their differences are masked by a standard AL-IS interface. An AL views its IS in a purely functional manner; it expects to invoke functions (tasks) which return results, and there is a fixed language (Cockburn and Jennings, 1995) for managing this interaction.

In an ARCHON community there is no centrally located global authority; each agent controls its own IS and mediates its own interactions with other agents (acquaintances). The system’s overall objectives are expressed in the separate local goals of each community member. Because the agents’ goals are often interrelated, social interactions are required to meet global constraints and to provide the necessary services and information. Such interactions are controlled by the agent’s AL. In more detail, an agent’s AL needs to: control tasks within its local IS (monitor), decide when to interact with other agents (planning and coordination module) (for which it needs to model the capabilities of its own IS and the ISs of the other agents - agent information management (AIM) module), and communicate with its acquaintances (high level communication module) (fig. 1).

[Figure 1: ARCHON Agent Architecture (ARCHON Layer: High Level Communication Manager (HLCM) exchanging messages to/from acquaintances, Planning & Coordination Module (PCM), Acquaintance Models, Self Model, Monitor; AL-IS Interface; Intelligent System)]

Monitor

The Monitor is responsible for controlling the local IS. Each IS task is represented in the Monitor by a monitoring unit (MU). MUs present a standard interface to the Monitor whatever the host programming language and hardware platform of the underlying IS. Figure 2 shows a graphical representation of a MU called UpdateTopologyWithAlarms which takes ALARM-MESSAGES and DISTURBANCE-ID as inputs and produces UPDATED-TOPOLOGY as an output. The IS task associated with this MU is called UpdateTopologyWithAlarms in the AL and topology_update in the IS.

[Figure 2: UpdateTopologyWithAlarms Monitoring Unit (MU name: UpdateTopologyWithAlarms; input: DISTURBANCE-ID; result: UPDATED-TOPOLOGY; IS task: topology_update; requests, control and confirmations pass through the AL-IS Interface)]

MUs can send and receive messages (directives, confirmations and requests) to and from the IS. All messages have to pass through the AL-IS interface which performs the translation and interpretation required for the IS to understand the AL directives and for the AL to understand the IS messages.

MUs represent the finest level of control in the AL; at the next level of granularity there are plans. Plans are pre-specified acyclic OR-graphs in which the nodes are MUs and the arcs are conditions. These conditions can: be dependent on data already available from previously executed MUs in the plan, be dependent on data input to the plan when it started, make use of the locking mechanism for critical sections of the plan, or be used to return intermediate results before a plan has completed.

A sample plan which starts the fault diagnosis activity is shown in figure 3. Firstly, ALARM-MESSAGES are used as an input to the SetNewFault MU which notes that there is a new fault in the network and generates a new identifier (DISTURBANCE-ID) for it - this identifier is returned as one of the plan’s intermediate results. The alarm messages and the disturbance identifier are then used as inputs to the previously described UpdateTopologyWithAlarms MU. When this MU is complete, the model of the network on which the diagnosis will be based is up to date and hence the process of identifying a potential list of faults can commence. There are two ways this activity can proceed: firstly, it is checked whether a list of generated hypotheses has already been provided (and stored in the domain data component of AIM) by an agent called BRS, in which case these should form the start point (the HypothesisGenerationFromForeignSource MU should be executed). If no pertinent information is available then the list should be generated from scratch (the HypothesisGeneration MU should be executed). In the latter case, the plan returns the list of generated hypotheses as an intermediate result so that they can be used elsewhere within the agent or even disseminated to relevant acquaintances.

[Figure 3: StartNewDiagnosis Plan (input: ALARM-MESSAGES; intermediate results: DISTURBANCE-ID, GENERATED-HYPOTHESES)]

The plan mechanism has an inbuilt backtracking facility which can be used to express preferences and deal with complex alternatives. Consider the plan ReceiveAlarms (figure 4) which determines what course of action an agent should take when it receives alarm messages. Here there are three cases: (i) see whether the alarms correspond to an ongoing fault which is known about; (ii) see whether the messages have been generated by planned maintenance (manoeuvres) on the network; or (iii) see whether the alarms correspond to a new fault. The Monitor first tries the leftmost branch and executes the OngoingFault MU - if this is unable to give a disturbance identifier for the alarms then they cannot correspond to a known fault and so this branch fails; if, on the other hand, an identifier is found then the rest of the branch is traversed. In the case of failure, the plan mechanism backtracks up to the last successful execution (MU CollectAlarms) and tries the next branch (the Manoeuvres MU) - this branch fails if the disturbance identifier associated with the alarms is not given the tag "MANOEUVRES". Finally, if the alarms are not generated by manoeuvres then they correspond to a new fault and so the StartNewDiagnosis branch, as described in figure 3, is invoked (this branch never fails and so the ReceiveAlarms plan will successfully terminate when the plan has been completed).

[Figure 4: ReceiveAlarms Plan (with backtracking) (input: ALARM-MESSAGES; arc conditions test DISTURBANCE-ID)]

The highest level at which the IS’s activities are represented is the behaviour level. Behaviours contain a plan, a trigger condition for activating the behaviour, descriptions of the inputs needed by the activity and the results which will be produced, and any children of the behaviour. There are two types of behaviour: those that are visible to the PCM (and the other AL components) and those that are purely internal to the Monitor. The former type are called skills, and they may be triggered by new data (either arriving from other agents or which the agent has generated itself) or by direct requests from other agents.

Planning and Coordination Module (PCM)

The PCM is the reflective part of the AL, reasoning about the agent’s role in terms of the wider cooperating community (Jennings and Pople, 1993).
This module has to assess the agent’s current status and decide which actions should be taken in order to exploit interactions with others whilst ensuring that the agent contributes to the community’s overall well being. Specific examples of the PCM’s functionality include: deciding which skills should be executed locally and which should be delegated to others, directing requests for cooperation to appropriate agents, determining how to respond to requests from other agents, and identifying when to disseminate timely information to acquaintances who would benefit from receiving it.

The PCM is composed of generic rules about cooperation and situation assessment which are applicable in all industrial applications - all the domain specific information needed to define individual behaviour is stored in the self and acquaintance models. The former contains information about the local IS and the latter contain information about the other agents in the system with which the modelling agent will interact. For example, in order to determine how to obtain information which is needed to execute a behaviour but which is not currently available, the PCM will make reference to its self model to see if the information can be provided locally by executing an appropriate skill. If the information cannot be provided locally then the acquaintance models are checked to see if another community member can provide it. The final major role of the PCM is to deal with requests arriving from other agents. By reference to its self model, it will decide whether to honour the request and will then activate the necessary skill to provide the requested data; when the information is available it will ensure that a reply is directed to the source of the request.

Agent Information Management Module

The AIM module is a distributed object management system which was designed to provide information management services to cooperating agents (Tuijnman and Afsarmanesh, 1993). Within ARCHON, it is used to store both the agent models and the domain level data.
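The way the PCM might consult these agent models when it needs information that is not yet available (self model first, then acquaintance models, as described above) can be sketched in Python. This is only an illustration under stated assumptions: the dictionary layouts, the `locate_information` function and the action labels are hypothetical and are not taken from the ARCHON software or the AIM schema.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical model layouts (assumptions, not the AIM schema):
#   self_model:          information name -> local skill able to produce it
#   acquaintance_models: agent name -> set of information names it can provide
SelfModel = Dict[str, str]
AcquaintanceModels = Dict[str, Set[str]]


def locate_information(
    info: str,
    self_model: SelfModel,
    acquaintance_models: AcquaintanceModels,
) -> Optional[Tuple[str, str]]:
    """Decide how a missing piece of information could be obtained."""
    # Step 1: can a local skill produce the information?
    skill = self_model.get(info)
    if skill is not None:
        return ("execute-local-skill", skill)
    # Step 2: otherwise look for an acquaintance that can provide it.
    # Agents are visited in name order so the choice is deterministic.
    for agent in sorted(acquaintance_models):
        if info in acquaintance_models[agent]:
            return ("request-from-acquaintance", agent)
    # Step 3: nobody known to this agent can supply the information.
    return None
```

A real PCM would apply further generic rules at each step (for example, choosing between several capable acquaintances); the sketch only captures the lookup order stated in the text.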
As an illustration of the agent models, consider an agent which is capable of producing information about ALARM-MESSAGES. The interest slots of its acquaintance models contain those agents who are interested in receiving this information and the conditions under which they are interested. The following portion of the acquaintance model specifies that an agent called BRS is interested in ALARM-MESSAGES which contain chronological information, that an agent called AAA is interested only in non-chronological alarm messages, and that an agent called BAI is only interested in non-chronological alarm messages which have the string INT within their ALARMS field:

INTEREST-DESCRIPTOR
  INFORMATION-NAME: ALARM-MESSAGES
  INFORMATION-CONDITION:
    [ ("BRS", (CONTAIN (ALARM-MESSAGES.CHRONO "YES")));
      ("AAA", (CONTAIN (ALARM-MESSAGES.CHRONO "NO")));
      ("BAI", (AND (CONTAIN (ALARM-MESSAGES.CHRONO "NO"))
                   (CONTAIN (ALARM-MESSAGES.ALARMS "INT")))); ]

In many industrial applications the domain level data which the agents need to exchange has a complex internal structure. In ARCHON, this structure is specified and maintained by AIM. For example, the information type ALARM-MESSAGES is defined in the following manner:

ALARM-MESSAGES
  AGENT-ID: AGENT
  DISTURBANCE-ID: SOURCE
  CHRONO: Y-N-FLAG
  BLOCK-TYPE: BLOCK-TYPE
  BLOCK-ID: ID-TYPE
  ALARMS: LIST-OF-ALARMS

where each of the following types has the following set of permissible values: agent (CSI | AAA), source (yymmddhhmmss | MANOEUVRES | UNKNOWN), y-n-flag (YES | NO), block-type (UNIQUE | UNKNOWN | MIXED), id-type (yymmddhhmmss), and list-of-alarms (alarm1, ..., alarmn).

High Level Communication Module

The High Level Communication Module (HLCM) allows agents to communicate with one another using services based on the TCP/IP protocol.
The HLCM incorporates the functionality of the ISO/OSI Session Layer which continuously checks communication links and provides automatic recovery of connection breaks when possible.

Electricity Transport Management

Energy management is the process of monitoring and controlling the cycle of generating, transporting and distributing electrical energy to industrial and domestic customers. Generation transforms raw energy into a more accessible form. Energy is then transported from its generation site to the consumer. To minimise losses during transportation, the electrical voltage is made high (132 kV or above) before it is placed on a transport network and sent over many hundreds of kilometres. Finally, the voltage is lowered and electricity is delivered to the consumers using a distribution network.

To ensure the transportation network remains within the desired safety and economical constraints, it is equipped with a sophisticated data acquisition system (SCADA) and several conventional application programs which help the operator (control engineer) to analyse it (these programs are primarily designed for normal operating conditions). The network’s operation is monitored from a DCR and whenever an unexpected event occurs hundreds of alarms are automatically sent to it by the SCADA system. Under these circumstances the operator has to rely on experiential knowledge to analyse the information, diagnose the situation, and take appropriate remedial actions to return the network to a safe state. To reduce the operators’ cognitive load in such circumstances, and to help them make better decisions faster, Iberdrola decided that a number of decision support systems should be developed. These systems were then interconnected and subsequently extended using ARCHON technology - for a more detailed analysis refer to Jennings et al. (1995).

Why use DAI techniques?
When Iberdrola decided, in 1988, to implement decision support tools to ease the workload of their control engineers during disturbances, several technical factors affected their design choices. Firstly, the control system itself was a proprietary product from a control systems supply company - thus it was considered too risky and too difficult to embed the additional functionality directly within it. Secondly, the state of the art for commercial systems in this domain meant that the diverse support functions could only be realised through a number of standalone systems. Consequently, Iberdrola built separate decision support systems to assist with different aspects of the control engineer’s job - the one which is most relevant to the subsequent discussion is the alarms analysis expert system which diagnosed faults produced in the network based on the alarm messages which arrived at the DCR. Their decision support systems were unconnected apart from the fact that they retrieved information about the network from the same source (the control system’s real time database). To make this information available to the non-proprietary software products a number of interfaces to the control system had to be written - as well as providing access, these interfaces could filter and pre-process network information.

By 1991, however, some important changes had occurred: (i) the evolution of IT hardware and software had significantly increased the quantity and quality of the data which could be acquired from the transport network; (ii) distributed computing had become commercially viable because of improvements in local area network technology; and (iii) the prices of computers had decreased significantly so that powerful machines were no longer prohibitively expensive.
Taken together, these changes meant that better and more powerful tools could be built to assist the control engineer. In particular it was considered important to be able to actually perform and dynamically monitor the service restoration process and also to exploit the new data sources, such as chronological information or faster rate snapshots, which became available. However the tried and tested decision support tools were still needed. Thus it was decided to adopt a system upgrading strategy which enabled the previously operational components to be used in conjunction with the new functionality. Two means of realising this strategy were considered: extend the existing systems to cover the new features or follow a distributed approach and allow the new functionality to be expressed as distinct computational entities which could interact with the pre-existing systems through a common distribution platform. The second option was chosen because it was considered to be the most effective means of:

(i) Permitting reasoning based on information of different granularity. Two types of alarm, chronological and non-chronological, now needed to be dealt with. In non-chronological alarms, the time stamp coincides with the time of acquisition by the control system, whereas in chronological alarms the time stamp coincides with the actual occurrence of the event. As chronological alarms represent a more accurate picture of events in the network they generally lead to a swifter diagnosis, however they have the disadvantage that chronological information has a low priority in Iberdrola’s communication channels.
Thus when the channels are saturated (as can happen during a disturbance) their availability time is unpredictable. For these reasons it was decided to build a new alarm analysis expert system which utilised chronological information and could subsequently integrate its results with those of the pre-existing system, rather than construct a monolithic system which received both types of data and had to embody both types of diagnostic knowledge. A similar situation occurs when considering service restoration. Two types of information are relevant to this activity: snapshots (which provide a comprehensive picture of the current state of all the components in the network) and alarm messages (which show how the state of the components has changed over a period of time). The former can be produced relatively quickly and give a complete picture of the system’s state, whereas the latter may take several minutes for a large disturbance but are needed to indicate the type of fault from which the system must be restored. Rather than trying to place both types of information and reasoning in a single system it seemed more natural to develop a service restoration subsystem which dealt mainly with snapshots and received the necessary high-level information about the equipment at fault from a diagnosis subsystem (rather than trying to deal with the raw alarm messages itself).

(ii) Allowing different network models to be included within the same system. Some of the problem solvers needed to work on the SCADA model of the network, while others needed the applications network model (a model which permits network equations to be solved and takes the physical characteristics of all its components into account).
Rather than trying to combine and harmonise these complex and disparate models at design time, it was decided that each subsystem should work on whichever model was most appropriate for its task. Then the various components should be able to interact at runtime to resolve any inconsistencies which arise from their use of different network models.

(iii) Enabling a number of different problem solving paradigms to be utilized. The diverse range of activities which needed to be performed in this application meant that there was no universally best problem solving paradigm: procedural techniques were required for algorithmic calculations like connectivity (to know which component is connected to which other) and load-flow analysis (solution of the network equations), whereas symbolic reasoning based on heuristic search was the best approach to diagnosis. A distributed approach enabled each component to be encoded in the most appropriate method.

(iv) Meeting the application’s performance criteria. Transportation management is a time-critical application and as many different types of information can be processed in parallel, with only a small synchronisation overhead, the response time of the overall system can be improved through the use of a number of interconnected machines.

Having decided upon a distributed approach, a choice had to be made between using more conventional distributed processing techniques or DAI techniques - here the latter was adopted for the following reasons (Barandiaran et al., 1991; Abel et al., 1993):

(i) Economy: The alarms analysis expert system was already operational, however new functionality needed to be added and new information needed to be treated. It was estimated that the cost of modifying the extant system was significantly larger than that of implementing a new one. However it was also judged that as the new functionalities and data were so diverse it would be an extremely expensive activity to put them within a single system.
Therefore it turned out to be more economical to build smaller systems, and re-use the existing alarm analysis expert system, and allow them to be integrated through a DAI framework. A DAI framework was needed because the interactions between these subsystems were both sophisticated and context dependent, therefore run time reasoning based on dynamic data needed to be performed.

(ii) Robustness: As the subsystems have overlapping domains of expertise, the failure of one of them to produce an answer does not necessarily mean that no solution will be forthcoming because at least one of the other systems may be able to produce a partial solution. However to achieve this back-up functionality in a flexible manner, the different problem solving components need to be intelligently coordinated - a task beyond present generation distributed processing systems.

(iii) Reliability: The solutions of the systems that overlap can be cross-referenced to enable the operator to be presented with more reliable information. Again, however, this cross-referencing functionality needs to be properly managed according to the prevailing circumstances and so requires dynamic and flexible reasoning to take place.

(iv) Natural representation of the domain: A DAI approach accurately represents the way the control engineers work when a large disturbance occurs. They specialize their roles - one looks after restoration, another tries to diagnose the problem based on different sources of information, and so on - and they then communicate relevant information to one another to ensure they are following a coherent course of action towards the overall objective of restoring the service (Jennings and Wittig, 1992).

Specification of the Agents

During normal working conditions, management of the network by the operator in the DCR consists mainly of topology changes (operation on breakers and switches), generation scheduling, and control of the energy interchange with other utilities (Corera et al., 1993).
However, during emergency situations management becomes considerably more difficult because of the large number of constraints which have to be taken into consideration and the insufficient quality of the information which is available to make these decisions. Emergency situations typically originate from a short circuit in a line, bus-bar or transformer. They can be exacerbated by equipment malfunctioning (e.g. a breaker failing to open) or subsequent overloads (a domino effect can cause one line to fail because of an overload, this in turn increases the load on neighbouring lines so they become overloaded and subsequently fail, and so on). The situation can become even worse if power stations become disconnected as this will cause an imbalance in the network’s power. Consequently, actions to restore service must be taken rapidly and accurately, so that what starts as a relatively minor problem does not escalate into a major disaster. In these circumstances, the actions which the operator can perform consist mainly of breaker operations, topology changes, and activation/deactivation of automatisms and protective relays. For larger disturbances, however, actions on power plants may also be required.
From this description of the control engineer’s job, a top-down analysis identified that a comprehensive decision support system should cover the following activities: (i) Detect the existence of disturbances; sometimes the operation of protective relays and breakers can be caused by routine maintenance and this should not be confused with a genuine disturbance situation; (ii) Determine the cause, location and type of the disturbance, including identifying if any equipment is permanently damaged; (iii) Analyse the situation of the network once it arrives at a steady state; and (iv) Prepare a restoration plan to return the network to its original operational state.

Allying this top-down analysis with the bottom-up perspective of examining the extant systems, it was decided to encapsulate the following pre-existing systems as agents: the alarms analysis expert system and the interface to the control system. As discussed earlier, the availability of chronological alarm messages necessitated a new diagnosis system which it was decided to make available as an agent. Finally, it was always known that information about the initial area out of service (the black out area) could help constrain the search for the faulty equipment, however it was never deemed cost effective to develop a dedicated stand alone system for this purpose since the original alarm analysis expert system’s performance was considered satisfactory (if somewhat slow). However through the use of DAI technology much of the basic infrastructure to implement this functionality was now available from other agents and so it was considered economically viable to develop a system capable of producing this information (in terms of the ARCHON methodology, this decision corresponds to providing additional functionality through the development of new systems).
In more detail, the operational DAI system consists of seven agents running on five different machines.

• BAI (Black-out Area Identifier) When a fault occurs, the network’s protective relays and breakers automatically try to isolate the minimum amount of equipment possible; in an ideal case only the element at fault would be isolated. The BAI’s objective is to identify which elements of the network are initially out of service as the actual element at fault must be within this region. It uses non chronological alarm messages as its information source and cooperates with the BRS and the AAA to increase the efficiency of the overall diagnosis process.

• CSI (pre-existing Control System Interface) The CSI acts as the application’s front end to the control system computers. Its objectives are to acquire and distribute network data to the other agents, to interface to the conventional management system application programs, and to monitor the restoration process to detect any unexpected deviations. It is split into two physical agents: CSI-D which detects the occurrence of disturbances and preprocesses the chronological and non chronological alarm messages which are used by the AAA, the BAI and the BRS; and CSI-R which detects and corrects inconsistencies in the snapshot data file of the network, calculates the power flowing through it and makes this information available to the SRA and the UIA. CSI-D is primarily concerned with the system’s diagnosis activities and CSI-R with its restoration activities.

• BRS (Breakers and Relays Supervisor) The new alarms analysis expert system which detects the occurrence of a disturbance, determines the type of fault and its extent, generates an ordered list of fault hypotheses, validates hypotheses, and identifies malfunctioning equipment. In order to perform its analysis, it takes two types of inputs: chronological alarm messages and snapshots of the network which give the status of every breaker and switch.
• AAA (pre-existing, non-chronological Alarms Analysis Agent expert system). This agent pursues similar goals to the BRS; however, the quality of information it receives is inferior to that of the BRS. Although the alarm messages received by both systems relate to the same physical operations, those received by the AAA represent an accuracy of ±5 seconds, while those received by the BRS are precise. This means that if the data is error free, then the BRS performs a better diagnosis than the AAA. However, if some of the chronological information is lost (a distinct possibility when the SCADA system is busy) then the BRS may perform worse than the AAA. Therefore, whenever incomplete or erroneous information exists, which is in most interesting cases, there is a need for cooperation between the two systems to make the overall system more robust and reliable.

• SRA (Service Restoration Agent). This agent devises a service restoration plan to return the network to a steady state after a blackout has occurred. To do this it takes into account the constraints imposed by the damaged equipment, as identified by the diagnosis agents.

• UIA (User Interface Agent). This agent implements the interface between the users and the community of agents. It gives the user the facility to inspect the results produced by the diagnosis agents, display the alarms received, and browse through the log of analysed disturbances. From the point of view of restoration, the user can see the plan produced, modify it, run it in a simulated environment to see its predicted effect, and request the development of a new restoration plan which takes into account some actions which he deems pertinent. Through the use of a distributed windowing system, the UIA presents the appropriate information on the consoles of the various control engineers who are working on the system.

This system design ensures that all the tasks identified by the top-down analysis are performed by at least one agent.
Robustness is achieved by having multiple agents that are able to provide the same (or at least some) overlapping results. Efficiency is obtained by the parallel activation of tasks. Reliability is increased because even if one of the agents breaks down the rest of the agents can often produce a result which, although not as good as the one provided by the complete system, is still of use to the operator.

Cooperative Scenarios

An important example of cooperation in this system involves the information interchange between the AAA, BRS and BAI agents. The AAA and the BRS produce the same result from different information sources, while the BAI applies different knowledge to produce a result that should be coherent with that of the AAA and the BRS. Assume a block of non-chronological alarm messages has been provided by the SCADA system and these alarm messages have been identified as related to a disturbance by the CSI. Using the interest descriptors of its acquaintance models - see the AIM section - the CSI will realise that this information is relevant to the AAA and the BAI. Application of the appropriate PCM generic rule will result in this information voluntarily being sent to the specified agents as unsolicited data. Some time later, the same process will be repeated and the BRS will receive the corresponding chronological alarm messages. At this point, the AAA, BAI and BRS are all operating in parallel.

When the AAA receives the alarm messages, and the corresponding disturbance identifier marks them as being a consequence of a new fault (see figure 4), the StartNewDiagnosis plan is executed and a preliminary set of hypotheses (GENERATED-HYPOTHESES) is produced. During this time, the BAI would also have received the alarm messages and would have started its skill for producing the Initial-Black-Out-Area. When this plan is complete, the BAI’s PCM checks whether any other agents are interested in this information - it finds out that the AAA is and so it sends out the information.
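The interest-driven exchange of unsolicited data described in this scenario can be illustrated with a minimal sketch. It is not ARCHON code: the table `INTERESTS`, the information-type strings and the functions `interested_agents` and `route` are hypothetical names introduced here, and a single shared table stands in for the separate acquaintance models that each agent would hold.

```python
# Hypothetical interest descriptors: the information types each agent wants.
# The entries loosely follow the scenario in the text; they are assumptions.
INTERESTS = {
    "AAA": {"non-chronological-alarms", "initial-black-out-area",
            "generated-hypotheses"},
    "BAI": {"non-chronological-alarms"},
    "BRS": {"chronological-alarms"},
}


def interested_agents(info_type, sender, interests=INTERESTS):
    """Names of agents, other than the sender, whose descriptors list info_type."""
    return sorted(name for name, wanted in interests.items()
                  if name != sender and info_type in wanted)


def route(sender, info_type, payload, interests=INTERESTS):
    """Build one unsolicited message per interested agent (no transport here)."""
    return [{"from": sender, "to": name, "type": info_type, "data": payload}
            for name in interested_agents(info_type, sender, interests)]


if __name__ == "__main__":
    # The CSI has a block of non-chronological alarms for a disturbance.
    for message in route("CSI", "non-chronological-alarms", ["alarm-1", "alarm-2"]):
        print(message["to"], message["type"])
```

In this sketch the sender never needs to be asked for data: it consults the descriptors and volunteers the result, which is the behaviour the scenario attributes to the CSI and to the BAI’s PCM.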
Simultaneously, although after a certain delay, the BRS agent starts working on the analysis of the chronological alarm messages. This will also result in a list of GENERATED-HYPOTHESES being produced. The BRS checks whether any agents are interested in this information - again the AAA is noted and the generated hypotheses are sent to it. The BRS then continues with its diagnosis to try and validate the cause of the fault.

After producing its tentative list of hypotheses, the AAA proceeds with a detailed analysis to try and ascertain the precise cause of the fault (i.e. to produce VALIDATED-HYPOTHESES). The following situations may then occur: (i) the Initial-Black-Out-Area is available to the AAA; this triggers a refinement behaviour which may reduce the number of hypotheses to be validated because the BAI has given a focused view of the situation; (ii) the generated hypotheses provided by the BRS are available to the AAA; this triggers another refinement behaviour which obtains a better reordering of the hypotheses to be validated and so helps in finding the element at fault; (iii) the validated hypotheses provided by the BRS are available to the AAA; this triggers yet another refinement behaviour, which has the same functionality as the previous one, but the reordering is based on validated hypotheses which are more accurate; (iv) if no information is available from the BAI or BRS, the AAA proceeds with its hypotheses validation as a standalone agent. Therefore, if the other agents are down or they are too slow to provide the information, the AAA will continue and find a faulty element although its diagnosis will be less reliable and will take longer.

The restoration process is activated whenever a disturbance is detected.
Once the disturbance is identified, the disturbance identifier is sent to the CSI-R which acquires the snapshot of the network, corrects any inconsistencies which have arisen in its representation, and calculates the power flow solution of the current state. This information is then passed on to the SRA so that it can prepare for its restoration planning. The SRA waits until the diagnosis agents have informed it of the element suspected of being at fault (VALIDATED-HYPOTHESES) and then proceeds to prepare a restoration plan. If, during this plan preparation, the SRA is informed that the equipment at fault is different from that originally indicated by either the AAA or the BRS, then it replans the restoration taking this information into account.

The UIA is the interface through which the user accesses the results produced by the agent community. During the diagnosis phase, the user is presented with both the tentative (early) list of suspected hypotheses and the final (validated) list. During the restoration phase, the UIA supports a more participatory interaction between the user and the agent community. The user is presented with the restoration plan and can then decide to modify it, run a detailed simulation to see the effects of the plan on the state of the network or ask for a new plan to be devised taking into account new constraints which he specifies. The UIA also supports a reporting functionality in that the control engineer can ask for the logs of the disturbances to be presented and analysed.

Observations and Reflections

This application has been in operation in Iberdrola’s North DCR since the beginning of 1994 and has afforded a number of benefits. Firstly, the agent system gives better results than its stand-alone counterparts because it takes multiple types of knowledge and data into account and then integrates them in a consistent manner. Secondly, the agent system is more robust because there are overlapping functionalities which means partial results can be produced in the case of component (agent) failure.
Thirdly, some results can be provided more quickly because cooperation provides a short cut (see previous section). Fourthly, the functionalities of the different domain systems can be increased independently which makes them easier to maintain (see, for example, the argument for developing the BAI). Fifthly, the control engineer is provided with an integrated view of the results he is interested in. Finally, the system has been designed to be open so that new agents can be added in an incremental manner.

One of the key features of this multi-agent system is the way it handles fault diagnosis by using two different types of data (the non-chronological alarms used by the AAA and the chronological alarms used by the BRS) and two different points of view (the typical diagnosis approach of hypothesis generation and validation used by the AAA and BRS, and the BAI’s monitoring approach which provides a high level view of the status of the network). With this set-up, it is possible to dynamically select the solution method which is best suited to the current situation. For example, if the BRS is operational, but the AAA is not, the solution provided to the control engineer is the one created by the BRS; but if both the BRS and the AAA are running, the solution provided is the one which is mutually agreed between them. Also the fact that multiple agents are trying to generate the same results can be exploited to avoid repetition of certain tasks if it is deemed desirable in a particular context. For example, both the AAA and the BRS can provide GENERATED-HYPOTHESES; consequently, if the generated hypotheses that are provided by the BRS are available to the AAA before it starts its own generation task, then this task need not be executed and the hypotheses provided by the BRS can be used instead (see figure 3).
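The runtime selection and task-reuse behaviour just described can be sketched as follows. This is only an illustration under stated assumptions: the paper does not say how a "mutually agreed" solution is computed, so it is modelled here as the hypotheses named by both agents; the case where only the AAA is running is filled in by symmetry; and the function names `select_diagnosis` and `aaa_generated_hypotheses` are hypothetical.

```python
def select_diagnosis(aaa_result, brs_result):
    """Choose the diagnosis shown to the control engineer.

    A result of None means that agent is not running or has not answered.
    """
    if aaa_result is None and brs_result is None:
        return []                      # no diagnosis agent available
    if aaa_result is None:
        return list(brs_result)        # only the BRS is operational
    if brs_result is None:
        return list(aaa_result)        # only the AAA is operational (assumed)
    # Both running: "mutually agreed" is modelled as the common hypotheses,
    # kept in the BRS ordering. This rule is an assumption, not the paper's.
    return [h for h in brs_result if h in aaa_result]


def aaa_generated_hypotheses(alarms, generate, brs_hypotheses=None):
    """Reuse BRS hypotheses if they arrived first, else run the AAA's own task.

    `generate` stands in for the AAA's hypothesis generation task.
    """
    if brs_hypotheses is not None:
        return list(brs_hypotheses)    # skip the redundant generation task
    return generate(alarms)


if __name__ == "__main__":
    print(select_diagnosis(None, ["line-7"]))
    print(select_diagnosis(["line-7", "bus-2"], ["bus-2", "trafo-1"]))
```

The point of the sketch is that the choice is made from whatever results happen to be available at the time, rather than from a fixed pipeline.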
This ability to flexibly manage, at runtime, multiple sources of data and multiple problem solving perspectives provides enormous robustness to the overall system because if one of the agents crashes the others will still be able to provide some form of solution.

As a consequence of the experience obtained during the development and installation of this multi-agent system, some important application design improvements are foreseen for the future. The first drawback of the current system is caused by the fact that the energy transport network covers a vast geographic area (meaning there is a huge amount of topological information) and that it encompasses a number of different voltage levels. As the network’s behaviour depends both on the voltage level and the geographic location, the main problem solving agents (the AAA, BRS, BAI and SRA) have to contain and manage information about Iberdrola’s entire transport network. This means the agents require a substantial amount of memory and computing resource because their searches are through such a large problem space. To combat this problem, the next version of the system will be designed so that the agents work with smaller portions of the network; this will make them easier to debug and maintain, faster in execution, and more cost effective in that they could run on PCs instead of workstations.

The second drawback of the current system is that all the agents have uniform knowledge of the network. For instance, the AAA applies virtually the same knowledge about protective relays to its 400 kV, 220 kV and 132 kV voltage levels.
However, if there were one AAA (or BRS or SRA) per voltage level, it would be possible to customise their domain knowledge. For example, the following pieces of potentially useful knowledge could be reflected: the fact that protective relays on the 400 kV voltage level are more reliable than the ones at the 132 kV voltage level, and the fact that the 400 kV network is more interconnected than the 132 kV network and has more complex breaker structures (like central breakers or rings of breakers). A further source of heterogeneity which is currently masked is that the network itself is the result of the fusion of a number of smaller transport networks that were developed by different companies before coming under the overarching umbrella of Iberdrola. Thus, for example, the protective relays of the Northwestern Iberdrola network are different from the protective relays in the rest of the network. Again this information could be exploited if smaller and more specialised agents were developed.

On a more general note, although the ARCHON approach was influenced to a large extent by the need to incorporate pre-existing industrial control software into a multi-agent community, it is felt that cooperative interworking, heterogeneity, semi-autonomy, and loose coupling are attributes that are likely to be encountered when building most complex systems (even if they are built entirely from scratch and are not deliberately conceived as DAI systems). This belief is based on three main observations. Firstly, most large organisations, where the majority of complex systems reside, have departmental structures which need to be observed, but the individuals within these structures often need to work together in a coherent manner.
Secondly, the components (including both the humans and the software) within an organisation, and even within a department, are likely to be heterogeneous simply because each one has to be based upon a different modelling paradigm in order to be effective. Finally, complexity is best handled by devolving responsibility for decisions to the level at which the actions are performed (hence the problem solvers will be loosely coupled and semi-autonomous).

In most cases the domain systems that already existed were each fairly complex and had been designed to encompass those aspects of the domain that could be expressed within a coherent modelling paradigm. They were designed in this manner because conventional wisdom dictates that when building a system using a particular modelling approach, one should include everything that is known about the system’s world that can be expressed within that model. However, after due reflection and observation it is noted that once several such conventional systems are brought together into a cooperative framework, then completeness and comprehensiveness are no longer the key criteria for allocating agents. In a multi-agent system, questions of the efficiency of the overall system are likely to be paramount. This, in turn, may dictate that an agent is identified with the smallest possible coherent and autonomous entity. As a consequence the system may have a large number of such agents which may, in turn, have an implication on the performance of the overall system. The ARCHON experience as yet does not extend to systems with many (hundreds) of agents; however, it is likely that in such situations the designer may need to consider if some of the smaller agents need to be coalesced again.
Other important experiences from this work are that: (i) speed of operation is a factor even where the real-time requirements of the underlying processes are longer than milliseconds (because a great many concurrent processes are active and their interactions are cumulative); (ii) passing of intermediate results (or progress reporting) is an effective means of increasing system parallelism; (iii) data which is exchanged between agents ought to have a degree of persistence so that troubleshooting can take place and audit records can be maintained; (iv) significant improvements in the overall application can be obtained through relatively straightforward cooperative interactions. Finally, it is important that a clear and detailed DAI methodology is worked out, especially in view of the above discussion about agent granularity - ARCHON’s informal hybrid approach is an important first step in this direction but it needs to be made more rigorous and have an even clearer link to the software development process.

References

Abel, E., Laresgoiti, I., Perez, J., Corera, J., and Echavarri, J., 1993. A multi-agent approach to analyse disturbances in electrical networks. Proc. Fourth Int. Conf. on Expert Systems Applications to Power Systems, Melbourne, Australia.
Barandiaran, J., Laresgoiti, I., Perez, J., Corera, J., and Echavarri, J., 1991. Diagnosing faults in electrical networks. Proc. EXPERSYS 91, Paris, France.
Corera, J., Echavarri, J., Laresgoiti, I., Lazaro, J. M., and Perez, J., 1993. On-line expert system for service restoration. Proc. Fourth Int. Conf. on Expert Systems Applications to Power Systems, Melbourne, Australia.
Cockburn, D., and Jennings, N. R., 1995. ARCHON: A DAI System for Industrial Applications. In Foundations of DAI (eds. G. M. P. O’Hare & N. R. Jennings), Wiley Interscience.
Jennings, N. R., Corera, J., Laresgoiti, I., Mamdani, E. H., Perriolat, F., Skarek, P., and Varga, L. Z. 1995.
Using ARCHON to develop real-world DAI applications for electricity transportation management and particle accelerator control. IEEE Expert.
Jennings, N. R. 1994a. Cooperation in Industrial Multi-Agent Systems. Series in Computer Science - Vol 43, World Scientific Press.
Jennings, N. R. 1994b. The ARCHON system and its Applications. Proc. Second Int. Working Conf. on Cooperating Knowledge Based Systems, Keele, UK, 13-29.
Jennings, N. R., and Pople, J. A., 1993. Design and Implementation of ARCHON’s Coordination Module. Proc. Workshop on Cooperating Knowledge Based Systems, Keele, UK, 61-82.
Jennings, N. R., Varga, L. Z., Aarnts, R., Fuchs, J., and Skarek, P. 1993. Transforming Standalone Expert Systems into a Community of Cooperating Agents. Int. Journal of Engineering Applications of AI 6 (4) 317-331.
Jennings, N. R., and Wittig, T., 1992. ARCHON: Theory and Practice. In Distributed Artificial Intelligence: Theory and Praxis (eds. N. M. Avouris and L. Gasser), Kluwer Academic Press, 179-195.
Tuijnman, F., and Afsarmanesh, A., 1993. Distributed Objects in a Federation of Autonomous Cooperating Agents. Proc. Int. Conf. on Intelligent and Cooperative Information Systems, Rotterdam, Netherlands, 256-265.
Varga, L. Z., Jennings, N. R., and Cockburn, D. 1994. Integrating Intelligent Systems into a Cooperating Community for Electricity Distribution Management. Expert Systems with Applications (1994) 7 (4) 563-579.