Ambrosia Autonomous Agent Group (AAAG)
by Jed Pickel, John Huebner, Robert Dean, Joshua Baer

System Design Document
Sunday, February 15, 1998

1. Introduction
    1.1 Purpose
    1.2 Overview
    1.3 Terms
2. Agent Execution Environment
    2.1 Overall Design
    2.2 Agent Transportation
        2.2.1 The Agent as an Object
        2.2.2 Agent Transport and Replication
        2.2.3 Caching
    2.3 Security Model
    2.4 Sandbox
    2.5 Authentication
    2.6 Encryption
    2.7 Location and Authentication Agent (LAA)
        2.7.1 Primary Name/Key Server
        2.7.2 Backup Name/Key Server
3. Agents
    3.1 Find File
        Security Issues
    3.2 Administrative
    3.3 Distributed Processing
4. Unresolved Design Issues
    4.1 Object Serialization vs. Applets
    4.2 Is static data part of the code object or the data object?
    4.3 Will nodes use network broadcasts to find administrative servers?

1. Introduction

1.1 Purpose

Ambrosia will develop an agent execution environment and prove that it has practical use by implementing several agents. An Agent Execution Environment (AEE) is more flexible than a client/server design model because it allows arbitrary code to be run on the remote machine. An agent system can localize computations near data.

For example, a traditional client/server implementation of a mechanism to find files would need a server on every machine from which files were to be found. These servers would do nothing but wait for find file requests; they would serve no other purpose. In a more flexible design, the server might provide remote directory operations as a service, allowing the client to implement find file by requesting all the available directory structures and then, once it had located the file, requesting the file itself. The client would be more efficient if it made a set of concurrent requests for directories, but this would also make the client more complex.

In contrast, an agent system could allow the browsing of directories to occur on the same machine that held the disk, while copies of the agent searched the disks of other machines. This would reduce both latency and network bandwidth use. Furthermore, the algorithm that the client used could be tailored to the type of search being performed. In fact, the Execution Node server could be used for arbitrary agents.

1.2 Overview

An agent system offers high availability and fault tolerance using a fail-stop model. Availability is increased because a user can obtain an agent from multiple sources and execute it on multiple hosts.
If the node an agent is heading for fails, the node that it was departing from can re-invoke it or re-direct the replication to another node. If the semantics are designed to allow multiple copies of an agent, and/or agent cooperation, then the state machine method of active replication is easily supported. If an agent sends a copy of itself to another node and does not hear a timely reply, it can re-direct that task to another node.

Performance can be increased by using long-term caching. An execution node saves the code and static data of an agent so that they need not be sent to the node the next time the agent is invoked.

1.3 Terms

Agent
    A self-contained execution including code, data, security,

Security Object
    The portion of an agent containing the Audit Trail, and Public Key (Private?)

Agent Execution Environment (AEE)
    The distributed collection of applications that accept, execute, and transmit agents. It mediates between the agent and the operating system to acquire resources for the agent.

Execution Node
    A single application in the execution environment. It will only send agents that it will execute itself.

Data Object
    The portion of an agent containing the state of an agent: the variables needed by an agent to run. Note: Constants are stored in the code object at compilation time.

Administrative Server (“server”)
    The application that keeps track of participating nodes and authenticates agents.

Code Object
    The portion of an agent that contains the execution instructions. It also includes static data, since static data shares its properties of being read-only and fixed after compilation.

Sandbox
    The Execution Node system that controls an agent’s access to resources.

Resources
    CPU cycles, disk space, memory, network, display

Short Term Caching
    The execution node storing the entire agent to disk, temporarily, for use in transmitting or re-transmitting to another node.

Long Term Caching
    The execution node storing the code and static data of an agent between invocations, to increase performance the next time that agent is invoked.

2. Agent Execution Environment

2.1 Overall Design

An Agent Execution Environment (AEE) is best defined as a distributed collection of execution nodes. An execution node is a machine that provides, as a service, the ability to accept and execute objects (agents) from the network. An agent is a combination of code, data, and log information that has the ability to travel through the Agent Execution Environment, under the constraints imposed by individual execution nodes. Agent transport is covered in detail later in this document.

An agent can only be introduced into the system from a node which is itself participating in the AEE. A node will only have permission to introduce agents which would have permission to execute on the local node. This mechanism will provide an incentive to permit local resources to be devoted to the AEE.

Activity between nodes within the Agent Execution Environment will be coordinated by an administrative server. The purpose of this server is to maintain the state of the AEE. This server will: keep a list of all functioning nodes, act as a key server to store public keys for agents and nodes, function as a trusted source for obtaining agents, and let nodes query for the trusted checksums or hashes of known agents. The server will be implemented as an agent, which removes the requirement of being bound to an individual node. A separate server is planned for each network segment. This will allow nodes to locate their local server by broadcasting. Local servers will have the ability to communicate with servers in other networks, so that the system will scale. Details of this server are described later in this document.

The administrator at each individual node has the ability to configure that environment according to their local policies and procedures.
The configuration options include a default security policy (for unknown agents) and individual security policies for known agents. Details of the security model are included below.

The distributed nature of this project takes place at two levels. The AEE itself is a distributed system that must maintain state, availability, and security. On top of that, the AEE provides a framework for individual agents to build their own distributed systems.

Each node will be multithreaded, have the ability to process multiple agents simultaneously, implement a GUI for configuration, and maintain its own audit trail. Any change of state will be recorded in the audit trail.

2.2 Agent Transportation

Agent transport is one of the major elements of the Agent Execution Environment. The transportation system is what allows an agent to be sent around the network from node to node. There are three main parts to the transport system design: the agent object, the agent replication process, and agent caching.

2.2.1 The Agent as an Object

The agent object has three main objects within it: the code object, the data object, and the security/authentication object. The agent is designed in this way to limit the executable’s access to corruptible data. This design also allows the agent to be easily transported across the network as a single object, and gives the execution environment access to vital security information about the agent before the agent is executed.

[Figure: the Agent Object, comprising the Code Object (Constants), the Data Object (Dynamic Data, Static Data), and the Security/Authentication Object]

Code Object: contains the actual Java byte code for the agent. The code object also contains constants required for execution, but not large static data structures. This includes such items as final variables and predefined strings. As far as the agent is concerned, this object is execute only.

Data Object: contains the current state of the agent.
If an agent is to be transferred and restarted on another execution environment at the current point of execution, all necessary data for this restart is saved here. If no state is needed at the new location, this object will be “empty.” This object is only available to the agent through Execution Environment system calls such as “Save_State” and “Restore_State.” The agent has no other access to it. However, between these calls, a local copy may be manipulated and later saved. The system call an agent uses to transport itself may also take a flag that causes the state to travel with it. The data object also includes separate static data such as graphics files, agent-specific help, and other static data which the agent may wish to bring with it.

Security/Authentication Object: contains all security data needed by the AEE for authentication and tracking of the agent. For example, this object will contain a public key to allow for agent authentication, and also an audit trail, which can be used to track the agent’s path through the network. The agent, through the use of AEE system calls, can read this object without restriction, but has no write access. The authentication portion of this object contains a checksum to ensure that the object is intact, along with version information and the author’s name. This object is more fully explained by the Security Monitor element of this document.

2.2.2 Agent Transport and Replication

The Agent Execution Environment will support two methods of agent transport and replication. The first of these methods is manual control by the user. For example, the user will be able to contact the primary server and request a specific agent from its long-term cache. The primary server will then update the agent’s audit trail and send a copy of the agent to the user’s client. Each client will also have the ability to send an agent directly to another execution node of the AEE.

The second method of agent transport is Agent Replication.
This is the process whereby an agent sends a copy of itself to one or more execution nodes. The agent achieves this using execution node system calls. Currently, there are two options for agent replication. The first is a straight transfer of the agent currently residing in the client’s cache. This means that when the agent is executed on the new node, the execution will be independent of the parent agent’s state at the time of transfer. The second transfer method is where the agent requests that its current state be sent along with the cached agent to the new node. In this method, when the agent is executed on the new node its state will be the same as the parent’s state, and the two instances of the agent will be indistinguishable.

The first method could be used, for example, to upgrade a common utility agent such as a global find file. Such an agent does not require knowledge of any execution node for its own operation and therefore can be transferred without updating its stored Execution State. The second method could be useful for a system monitor agent that is gathering data on the system as a whole. When the monitor leaves a node, it would need to take whatever data it has collected with it.

Under the current system design, it is the responsibility of the agent to update and store its own state before it is replicated. Each node of the execution environment will provide whatever system calls are necessary for the agent to complete this task if it so desires.

2.2.3 Caching

The level and complexity of the caching system used by each node is completely under the control of the node administrator. At the minimum level, all agents that are executed on a node are placed into the node’s agent cache. This allows the node to start and stop an agent as needed without having to download the agent from the network each time.
At this level, the administrator can decide to allow only handpicked agents to run on the node, and have the node refuse replication requests from other nodes. At the most complex level, the node accepts all replication requests from other nodes, and caches any agent that is sent to it, independent of the agents executing on the node. Agents are cached regardless of whether or not they are ever executed.

2.3 Security Model

Without security in mind, an Agent Execution Environment is a server. It has a listening socket that will accept connections from arbitrary hosts, and it will download and execute arbitrary code. A significant portion of this project deals with the tradeoff between security and usability. Security is a very important factor in the design of this project.

The first security issue to address is which agents will be permitted by a node. Administrators can decide whether to accept anonymous agents and can choose particular agents to accept, while rejecting others.

2.4 Sandbox

This environment is designed so that administrators at individual nodes can configure a default security policy governing access to selected resources by anonymous agents. Anonymous agents are agents that are not known by the local server. Known agents, on the other hand, each have a custom security policy based on the administrator’s level of trust for that agent. The security policy will provide access control to:

- file system
- network
    - initiate outgoing connections
    - accept incoming connections
- interaction with other locally executing agents
- memory
- processing time
- other local resources (to be determined)

2.5 Authentication

Upon receipt of an agent, the Agent Execution Environment (AEE) must perform a number of functions to authenticate that agent. Fundamentally, the two primary authentication requirements are: knowledge of where the agent came from, and assurance that the agent code is not modified from the known version.
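
As a minimal sketch of the second requirement, a node could compute a digest of the received agent code and compare it with a digest obtained from a trusted source. The class and method names below are hypothetical, MD5 is used only as an example digest algorithm, and the way the trusted digest is fetched is deliberately left out.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: checks received agent code against a trusted digest. */
final class AgentCodeVerifier {
    private AgentCodeVerifier() {}

    /** Computes the MD5 digest (16 bytes) of the agent's code bytes. */
    static byte[] digest(byte[] agentCode) {
        try {
            return MessageDigest.getInstance("MD5").digest(agentCode);
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5, so this is unexpected.
            throw new IllegalStateException("MD5 digest unavailable", e);
        }
    }

    /**
     * Returns true when the received code hashes to the trusted digest.
     * The trusted digest is assumed to come from an authenticated source.
     */
    static boolean matchesKnownVersion(byte[] receivedCode, byte[] trustedDigest) {
        return MessageDigest.isEqual(digest(receivedCode), trustedDigest);
    }
}
```

A node would call `matchesKnownVersion` before handing the code to its sandbox and reject the agent when the result is false. This check only covers code integrity; establishing where the agent came from is a separate step.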
The server will include a public key infrastructure such that each node has a unique public/private key pair and each instance of an agent has the option of having a public/private key pair. Outgoing agents will be signed, and incoming agents will be verified by checking the signature. This functionality will be implemented at the node and cannot be altered by an agent. This form of authentication proves the true source of an agent, and that the agent was not modified in transit.

In order to assure that agent code is not modified from a known version by a malicious node, a one-way hash mechanism will be used. Hashes of known versions of agent code will be stored on the server. Upon receipt of an agent, a hash of the agent code will be computed and compared with the hash stored on the server. This one-way hash function will be implemented with either MD5 or Blowfish. To reduce the chance of man-in-the-middle attacks, this comparison will have to be encrypted using the public key infrastructure already in place.

2.6 Encryption

With a public key infrastructure already implemented, we may implement the option of encrypting all data in transit between servers.

2.7 Location and Authentication Agent (LAA)

The Location and Authentication Agent (LAA) is the only agent required by the AEE. The LAA running on each execution node can be in one of three possible states:

- Primary Name/Key Server
- Backup Name/Key Server
- Name/Key Client

In a healthy AEE, there would be one primary name/key server, a few backup name/key servers, and many name/key clients. The primary name/key server would typically serve a local network segment, although there are no practical limitations on the AEE topology. Primary name/key servers group other primary name/key servers and execution nodes into logical, geographic, or other groupings.

2.7.1 Primary Name/Key Server

Primary name/key servers act like folders or directories in a tree-structured file system.
Each primary name/key server stores a list of names, IP addresses, and public keys for the other primary name/key servers and execution nodes below it in the tree. Optionally, it can define a ‘parent’ server, allowing reverse traversal of the tree.

Primary name/key servers will typically service local network segments, for optimal performance. Also, since machines which are physically close to one another often work together and know each other, this will likely be the most useful scenario.

Primary name/key servers will provide a number of services and computations:

- List all machine names in AEE
- Return parent server name
- Return most idle node in AEE (largest number of free cycles)
- Return public key for given machine name
- Return public key or hash for given agent

2.7.2 Backup Name/Key Server

When an LAA is set to backup name/key server mode, it is not necessarily used as an active backup server. It is made available as a backup server from the local execution node, but before it will be used as one it must appear in the ‘backup group’ list of a primary name server.

The state information which must be kept synchronized between the primary and backup servers consists of the public keys and agent hash results. Cached data does not need to be synchronized; the caching mechanism will keep its data current independently.

2.7.3 Name/Key Client

Most execution nodes on a given network segment will be name/key clients. A backup server which is not in the ‘backup group’ list of a primary server will also function as a name/key client until it is added to a primary server’s backup group.

Name/key clients talk to primary servers to obtain the list of all machine names in that AEE and the name of the primary server’s ‘parent’ server, and for help in choosing execution nodes to work with. They also get public keys and agent hash results from the primary server, for authenticating agents and machines.

3. Agents

3.1 Find File

The purpose of the Find File agent is to demonstrate the ability of the Agent Execution Environment to share global resources. The resource in this case is long-term storage media. The agent will give the client user the ability to search for data in parallel on multiple execution nodes, and then retrieve that data. The agent can be viewed as being similar to the Find File utility found with Windows 95 and NT, but on a global scale as opposed to a local one.

The agent will consist of two parts: an interactive user interface dialog, and a multi-threaded request listener. The user process is as follows:

1) The user fills in the interface dialog. The information entered can be an exact filename (foobar.doc), or a substring of possible file names (foo*). This data is then sent to all execution nodes that are running the Find File agent.

2) The request listeners on the execution nodes receive the search request. Each listener then retrieves the shared directory tree from its host execution node and searches it based on the request. Any matches found are returned to the proper interface dialog along with any information necessary for retrieving the data over the network.

3) The interface dialog collates all returned data and displays it in a graphical list to the user. The user then selects the file(s) that they wish to download, and the agent downloads the data to the user’s machine.

Security Issues

The main security issue is that the agent would require both read and write file system access. If no sandboxing is performed by the execution node, then a rogue agent could corrupt the file system. Our current sandboxing design resolves this issue by only allowing agents to access a file tree of user-defined shared files. Thus, the user has full control over the segments of their file system that the agent can access.

3.2 Administrative

The administrative agent will allow one machine to monitor other machines on the network.
We will attempt to track as much information as possible; however, we expect Java to be a major limitation in this area. In achieving cross-platform execution, Java often sacrifices access to detailed system information. We will attempt to track:

- Idle CPU cycles
- Free disk space
- Free RAM
- Network traffic (kb/s)
- Currently running processes
- Currently running agents
- Percentage user/agent processing time

To use the administrative agent, one execution node will launch the agent, which will send copies of itself to all execution nodes it is authorized to access (unless a subset is specified). Once at the ‘slave’ nodes, the administrative agents begin sending status reports back to the ‘master’ node at regular intervals. The master will watch the slaves for extreme values or known patterns. Upon detecting a possible problem, it will notify a human via email or possibly numeric pager. Humans could check AEE status at any time by viewing a web page that summarizes the current statistics.

3.3 Distributed Processing

Our planned demonstration for distributed processing is the generation of fractals. Fractals are convenient because they are complex, iterative mathematical formulas with a high degree of locality. Because of their locality, it is easy to separate the task into smaller tasks. The agent for fractal generation will send itself to many nodes; each copy of the agent will calculate a portion of the fractal using local CPU and memory, and will then return the result to the parent, which will re-assemble it for the user who launched it.

4. Unresolved Design Issues

4.1 Object Serialization vs. Applets

Java has a well-developed mechanism for running untrusted code, called the Applet class. Existing Java Virtual Machines (JVMs) already implement a sandbox for this class. The advantage of using applets for our agents is that we could exploit the existing sandbox. The disadvantage of using applets for agents is that we have limited control over the existing sandbox.

There are other Java classes that support transporting code. “java.io.ObjectOutputStream” marshals objects for sending over a socket. “java.io.ObjectInputStream” unmarshals the stream into an object again. Java’s Remote Method Invocation also has facilities to load a class locally. These classes provide the foundation for building a very rich execution environment, although they are at a lower level than applets.

4.2 Is static data part of the code object or the data object?

When constants or strings are part of an agent, are they stored in the data object or the code object? It is important to ensure that static data is preserved in a Long Term Cache, while dynamic data is not.

4.3 Will nodes use network broadcasts to find administrative servers?

This seems convenient, but the implementation has not been explored and there may be hazards to this approach.

4.4 Can two execution nodes communicate without a server?

Ideally, any AEE will be able to act as a name/directory server unto itself. This would allow two execution nodes to communicate without a server to mediate the transaction. One execution node would point to the other as its name/directory server, and the other execution node would act as a server with only those two execution nodes on the network. If the server execution node was already part of another AEE group, it would form a ‘virtual AEE’ with just those two execution nodes in it. No data or agent processes would be able to pass between the two AEEs.

One possible solution is to have the server be selected and configured automatically. A new execution node would broadcast to the network looking for servers. If none respond, it would declare itself a primary server.
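
The automatic selection described above can be sketched as a small decision function. This is only an illustration under stated assumptions: `ServerBootstrap`, `Role`, and `ServerLocator` are hypothetical names, and the discovery step (for example, a broadcast on the local segment) is hidden behind an interface because its implementation has not been decided.

```java
import java.util.List;

/** Hypothetical sketch of automatic server selection for a new execution node. */
final class ServerBootstrap {
    private ServerBootstrap() {}

    /** The role a new node takes after looking for servers. */
    enum Role { PRIMARY_SERVER, CLIENT }

    /**
     * Hypothetical discovery hook. A real node might broadcast on the local
     * network segment and collect the addresses of servers that answer.
     */
    interface ServerLocator {
        List<String> findServers(int timeoutMillis);
    }

    /**
     * If no server responds within the timeout, the node declares itself a
     * primary server; otherwise it joins as a client of a responding server.
     */
    static Role chooseRole(ServerLocator locator, int timeoutMillis) {
        List<String> servers = locator.findServers(timeoutMillis);
        return servers.isEmpty() ? Role.PRIMARY_SERVER : Role.CLIENT;
    }
}
```

Two nodes that start at the same moment could both see no responses and both become primary servers, so a full design would also need a way to detect and resolve duplicate primaries.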