A Tool for Parallel Distributed Reasoning about Computing Networks S. Musman1 MITRE Corp 7515 Colshire Drv McLean, Va, 22153 smusman@mitre.org F. Duncan Augemented Systems, Inc 7400 Seabrook Lane Springfield, Va, 22153 fduncan@aug-sys.com commercial monitoring tools remain centralized. Centralized approaches introduce a single point of failure for the system, leaving the monitored components unprotected if they can be separated from the central monitor. The centralized approach is ultimately not scalable. Abstract We describe a distributed reasoning system called OttoMate. Otto-Mate provides a variety of built-in agents that make it easy to ingest events from logs and add-on tools, response agents to implement responses, and reasoner agents that can reason about situations. All of these capabilities are connected using a popular agent framework, enhanced with a few additional security features. Reasoning agents in Otto-Mate implement a form of situational reasoning, related to case-based reasoning, and the facts in a reasoner’s working memory can be synchronized over multiple reasoners. This allows the implementation of parallel distributed reasoning algorithms that can detect event patterns irrespective of whether the events occur locally or remotely. Distributing the reasoning avoids a single point of failure for the monitoring system, and makes the system much more robust and survivable. Systems such as AAFID [Spafford 00] were early attempts, to apply distributed reasoning to this problem domain. Implemented in Perl, it had performance issues, and AAFID contained none of the necessary management, security or performance features to function as an operational system. Many other research systems are similar and are not useful for practical deployment. Expert systems such as P-BEST [Lindquist, 99], have been applied to reasoning about security events. Such systems are “event” driven, allowing rules to be defined to detect events, or simple event patterns. These systems usually have no explicit knowledge of time, and as such cannot easily reason about events that are expected to occur at certain times, or detect missing events. Historically, most of the knowledge they encode represents shallow knowledge of the domain. Overview Reasoning about computer and network events remains the cornerstone of automating management, security and health monitoring problems, but reasoning in this domain is complicated. Events and reasoning occur at different levels of granularity and is further complicated by the lack of standards for events that are reported. Operating, network components (i.e. switches), and add-on tools (i.e. IDS, logs, virus scanner) all report differently, yet each monitored resource often provides evidence that yields part of an overall picture once the events are correlated. To better understand the reasoning requirements, we describe a sample problem. An earlier version of the OttoMate reasoner [Musman, 00] was used in a DARPA sponsored experiment in Autonomic Information Assurance [Theriault 02]. A reasoner agent was connected into Survivable Autonomic Response Architecture (SARA) [Lewendowski 01], where the task of the reasoner agent in the experiment was to defend a computing network against Despite the maturing of agent-based systems a majority of Response working 3 1 Implement Normal Normal email email Correlated alert NonNonpolymorphic polymorphic Activity Activity msg digests Virus Virus detected detected Response(s) 4 Monitor Monitor nonnonPolymorphic Polymorphic Response(s) response response not working 2 New virus alert Detected Detected Virus alert Virus alert 7 Need Need Human Human Help Help Unknown Unknown Virus Virus Uncorrelated alert msg digests Monitor Monitor Polymorphic Polymorphic Polymorphic Polymorphic Virus Virus detected detected 5 Implement polymorphic Response(s) Response(s) not working response response 6 Human response implemented Figure 1: Reasoning States for email Virus Defense 104 Linux Adding Account to Admin Group Log Entry Linux Situation to Implement Action BSD Adding Account to Admin Group Log Entry Otto-Mate privilege escalation event Solaris Adding Account to Admin Group Log Entry Reportable User access Incident Out of hours Policy Violation Otto-Mate “Removefrom-group” action request event Solaris Situation to Implement Action Windows Situation to Implement Action Windows Adding Account to Admin Group Log Entry Raw Reports BSD Situation to Implement Action Generic Events Situations and Patterns Incidents Generic Action Requests Location Specific Action Implementation Figure 2: Normalizing Events and Requests DoS attacks caused by 0-day email viruses. A variety of sensors and some novel algorithms were developed for the task. Email client sensors, could detect suspicious email activity such as generating email messages faster than a human can possibly type, or sending email messages immediately after opening an email or its’ attachments. Email activity could also be linked together into email flow trees that can act as a detector because the propagation pattern of email viruses is often quite uniquely different from legitimate traffic. An email queue size detector [Gupta, 03], monitored the overall state of the email server and detected if the system became bogged down. Response mechanisms in this system could purge email messages labeled as containing a virus, and also dynamically insert message-blocking signatures at the email server that match certain criteria. A state diagram for the system is shown in figure 1. The SARA architecture supported only one reasoning agent at a time, acting as a centralized reasoning solution to the problem. As a result, on some occasions the network saturated, and severe SARA message latencies (i.e. in excess of 5 minutes wall clock time) occurred and contributed to test scenarios in which the virus succeeded. Often event patterns and relevant responses occur at the level of individual components (i.e. most activity on the email server relates only to its operation). Patterns of events that reflect larger system level patterns are less common, so a distributed reasoning solution that did not centralize the processing would clearly be more efficient and was a motivation for the design of our system. Our analysis, illustrates that a distributed approach could reduce events distributed over the network by over 65%. System Description This example illustrates several aspects of reasoning about computing events. It demonstrates a need to: Otto-Mate consists of a variety of different agents classes: resource monitoring agents, reasoning agents, and response agents. Resource monitoring agents monitor assigned resources and pass events into the Otto-Mate system. Reasoning agents include those described below, but also include anomaly detection agents [Varaandi, 03]. Response agents implement the resource specific actions to resolve problems. Though the base functions of these agents are simple, most of the agents also have additional capabilities to support drill-down. x Assess system state: By analyzing and correlating events to determine if the system is under attack, or noting a response is ineffective and that something else should be done about it; x Select the right response given the system state: By recognizing a certain class of attack and the type of response that could resolve the problem. Responses can also involve starting sensors looking for certain virus symptoms, dynamically adapting system state (sensing in this instance); x Assess alert relevance given system state: For a nonpolymorphic virus response, if a blocking signature is in place all future alerts should be associated with messages that were delivered before the blocking operation. Alerts from messages created after the blocking is applied indicate that there a problem remains; x Ensure responses are timely: Ensure that responses are carried out, request them again if necessary, select an alternative or alert a human. Our reasoning model allows for situations that include a combination of known bad situations, policy violations, as well as anomaly models of unusual events. Because computing networks are usually heterogeneous, event normalization is required to share information. Despite efforts such as IDMEF [IETF 04] this remains an area in which more work needs to be done. Efforts such as IDMEF provide more guidance on how to say things, rather than on what should be said. As such, they do not provide a standard ontology of the domain. 105 Our knowledge-base structure is illustrated in figure 2 where most of the reasoning occurs in the middle layers, and relates to generic abstracted events. This significantly reduces the need duplicate effort developing situations for each separate resource type (i.e. Unix vs. windows). Other differences (i.e. what might be an unusually high CPU load on a workstation would be different on a server) are addressed by being able to assign the agents to groups, representing both is-a and has-a relationships. Situations and reasoner variable values can be assigned to the groups and the intent is to be able to reason about resource types, based on location or function. Configuring every resource individually would be unscalable. (defsituation remote-access-failure :documentation "keep track of any remote access failures" :fact remote-access-event :group * :assignments ((outcome failure) (remote-host ?remote) (account ?account)) :match-on (;; nothing else ) :response ( ;; keep track of failed logins from the user (timer-action increment (5 min) (failed-access-from-user-counter (account ?account) (group dmz))) ;; keep track of failed logins from the host (timer-action increment (5 min) (failed-access-from-host-counter (remote-host ?remote) (group dmz))) ) Figure 3: Example ) timer fact “failed-access-from-user-counter” belongs to the “dmz” group if the remote-login-failures occur across any one or more of the computers in that group the value is incremented on all of them, turning this into a parallel distributed algorithm. Although not illustrated, other situations were programmed to take care of responses: blocking the host for a short period of time, re-enabling it later. By issuing only a temporary blocking of the host, most brute force attacks are repelled without creating any long term denial of service situation, and there is no need to keep track of large blocking lists. Additional situations could also monitor (defsituation too-many-failed-accesses-from-user :documentation "if the same user fails 5 times in 5 min" :fact failed-access-from-user-counter :group * :assignments ((account ?user) (count ?cnt)) :match-on ( (not-fact (disabled-account (account ?user))) (> ?cnt 5) ) :response ( (assert (brute-force-attack-for-user (user ?user))) ) ) Detecting a Brute-force Attack additional activity from hosts that had incidents, and can escalate the severity of response for repeat offenses. Reasoner agents use facts to represent information (though facts can contain confidence values). Events are represented as inheritable knowledge structures with named fields. Facts can be ordinary facts, timer-facts, or event facts. Event facts are transitory, and are automatically removed from the system if no situations match them. Timer facts are special facts associated with a timer mechanism built into the reasoners. Timer facts can be defined to exist for specific periods of time, or be used as timed (sliding window) counters so that they can detect event frequencies given a count and timeframe. Timer facts can also be scheduled for assertion at a future time, delaying when the reasoner operates on them. In Otto-Mate situations can exactly match (like a rule), or partially match. Situations have a default priority, but this priority can also be set to define the order in which matching situations are used when more than one matches. Situations also have function switches call-next, and always-fire that allow them to layer functionality. Call-next implements the response actions of the best-matching situation, and then executes a matching situation with the next highest priority (if any). The always-fire switch causes a matching situation to fire irrespective of whether or not there are any other matching situations. The general consensus when programming rule-based systems is that forcing rules to execute in a specific order is poor form. In Otto-Mate we violate that principle to reuse existing situations, or to supersede behavior of certain situations. Facts and timers also have fields called; “persistent” and “group”. Setting the persistent value causes the reasoners to cache the facts on disk until they are retracted, so they will continue to exist even if the reasoners are stopped and restarted. The group field can be used to specify that a fact is synchronized across multiple reasoners. When assigned a specific group, the fact becomes visible to all reasoners running on resources that are part of that group. Changes to this fact, occurring on any of the reasoners, will be seen in all of the reasoners in the group. Sometimes reasoner agents request additional information from other agents. The decision of where to query for information is either determined statically, based on system layout, or dynamically from location information in events (i.e. an ip address or host name). The information returned can be current values, or historic information and the reasoner is able to reason about both simultaneously. An example of a useful information query might be a reported access failure from an internal host to a sensitive database. Here there would be a desire to determine who was logged An example of detecting more than 5 failed login attempts in 5 minutes is shown in figure 3. This type of brute-force password guessing that is still surprisingly common. As the 106 into the unauthorized host accessing the database, and determine who made the access attempt. References IETF, 2004, "The Intrusion Detection Message Exchange Format." IETF Intrusion Detection Exchange Format Working Group. Internet Draft. Reference: 'draft-ietf-idwgidmef-xml-12'. Queries for information, can be returned to, and processed by the requesting agent. Alternatively the requester can delegate the decision making to another reasoning agent using the "reply-to" feature for the request, to tell the owner of the requested information where to send it. This ability to delegate decision making is particularly valuable if there are multiple hypotheses to explore. Lewandowski, S., Van Hook, D.., O'Leary, G.., Haines, J., & Rossey, L., 2001, SARA: Survivable Autonomic Response Architecture. DARPA Information Survivability Conference and Exposition II Proceedings, Vol. 1, pp. 7788, June. Practical Considerations Lewis, L., 1995, “Managing Computer Networks: A CaseBased Reasoning Approach”, Artech House Books Otto-Mate compares to a variety of other systems for which there are a variety of common methods: rule-based; model-based; fault-trees; CBR; Bayesian Networks; and others [Steinder 01]. Otto-Mate supports CBR, anomaly, and rule-based approaches, and some limited work has been done allowing Bayesian networks to reason about facts in the working memory of the reasoners. Lindqvist U & Porras P. 1999, Detecting computer and network misuse through the Production-Based Expert System Toolset (PBEST) . In Proceedings of the 1999 Symposium on Security and Privacy, Oakland, California, May 1999. IEEE Computer Society Otto-Mate functions in addition to the other the tasks that the components it monitors perform. Under a typical load, of 30 agents monitoring an average host (processes, system logs, file-access, etc.), the load created by Otto-Mate occasionally spikes at 10% (events are often bursty), but usually is much less than 1%. The reasoner itself can process over 5000 events/sec and the agent framework used is JADE, which utilizes several layers of messaging to improve efficiency [Vitaglione 02] can send 1000’s of messages/sec. Musman, S. Flesher, P, 2001, System or Security Managers Adaptive Response Tool, Proceedings DARPA Information Survivability Conference & Exposition - Volume 2, Hilton Head, NC, Gupta, A & Sekar R, An Approach for Detecting SelfPropagating Email Using Anomaly Detection, Recent Advances in Intrusion Detection (RAID), September 2003 Spafford E., & Zamboni D., 2000, “Intrusion detection using autonomous agents”, Computer Networks, 34(4):547-570, October 2000. Summary Steinder M and Sethi A. 2001, The present and future of event correlation: A need for end-to-end service fault localization. In Proc. IIIS SCI: World Multi-Conf. Systemics Cybernetics Informatics, Orlando, FL, 2001 Otto-Mate was developed to automate monitoring, analyzing, correlating, and response to events found in computing systems. In contrast to most commercial systems that do central monitoring, Otto-Mate, reasoning agents are distributed throughout the computing network so there is no single point of failure for the system. Theriault K, Farrel W, Henri H, Kong,D and Nelson W. 2002, The SARA experiment: Coordinated autonomic defense against an e-mailborne virus. In Proceedings of the 2002 IEEE Workshop on Information Assurance, June 2002. Reasoner agents can also work collaboratively and can query event monitoring agents for current, or historical events that may confirm or deny a current hypothesis. Additionally, reasoning agents can be tightly coupled together. Facts within their working memories can be synchronized across multiple reasoner agents allowing multiple reasoners to reason about related events occuring locally or remotely. In essence multiple reasoners work together to implement distributed parallel reasoning algorithms. A limited version of Otto-Mate is available for download and experimentation from http://otto-mate.augsys.com. Vaarandi.R, 2003, A Data Clustering Algorithm for Mining Patterns From Event Logs. Proceedings of the 2003 IEEE Workshop on IP Operations and Management, pp. 119126, 2003 Vitaglione G, Quarta F, Cortese E, 2002, Scalability and Performance of JADE Message Transport System, presented at AAMAS Workshop on AgentCities, Bologna, 16th July, 2002. 107