ITIL® v3 Event Management – A Look at the Theory (from the Real World)
Brenda L. Peery, 14th September 2009
BCS Specialist Group Session
All copyrights acknowledged. ITIL® is a Registered Trade Mark of the Office of Government Commerce, and is Registered in the U.S. Patent and Trademark Office.

An ‘event’
Not here for the tents and soundstage? What is worth taking from that as we go forward to look at our idea of Event Management?
• It looks like it could be a bit muddy
  – Very broad definition
  – Obscure language
• But there is an idea of purpose …
  – [from ‘Event Management’, Wikipedia] “to market themselves, build business relationships, raise money or celebrate”

Speaker’s Background
• 15+ years experience with IT Service Related projects and roles – both vendor and user sides with Event & Systems Management related work
• ITIL v2 Manager, v3 Expert, MSP and Prince2 Practitioner, ITIL instructor, APMG committee member developing new ITIL credentials
• As an independent consultant for the last 5 years, “IT Service Management Architect” is my favourite title thus far …

Main Topics / Goals
• Event Management – you may already know it and have it
  – Monitoring and Event Management (key relationship)
• Event Management – the Basics according to ITIL®
• Where EM fits & What to consider in doing it
  – First ask why – strategy
  – Planning and managing
• Evaluation of the need
• What are you trying to solve / what need are you trying to serve
• Define a model and develop a strategy

Initial Context – Familiarity?
• Event Management (EM) as a core process is new with v3 ITIL with some roots in v2
• What elements are familiar?
© Crown copyright. Reproduced from the OGC's ITIL® version 2 volume: ICT Infrastructure Management and version 3 Core volume Service Operation. All rights acknowledged.

Initial Context – Monitoring?
• Almost everyone has some familiarity with “Monitoring”
• Consider monitoring and management over the last decade:
  – Systems Management software tools: IBM Tivoli (particularly TEC), CA NSM, BMC Patrol
  – The reporting capability of underlying Operating Systems: log files and system utilities, Task Manager in Windows, the “top” command in Unix
  – And never underestimate the diagnostic scripts that your SysAdmins have written or inherited

Monitoring
(Illus.) Ops Bridge

Other kinds of monitoring?
• Other IT?
• Other sector?
• Inventory?
• Business monitoring?
• Projects to bring in & manage that monitoring
• Why do we do it?

Initial Context – History
So even though Event Management is ‘new’ there are some challenges – in creating a process model – from the back history that comes along with your infrastructure:
• There may already be strategies in place and benefits being realised from monitoring programmes
• There are likely to be ‘competing understandings’ of:
  – what events are
  – what you are or are not doing about them, and
  – at what levels you are engaging to monitor and utilise them
• Stakeholders may range from in-depth technical all the way up to non-technical consumers of the information EM can produce
Your back history, embedded in your kit, will shape or constrain your EM possibilities.

Best Practice Benefits
Develop a shared understanding and common language based on best practice recommendations, at least as your starting point …

EM Basics 1 – EM Process
“Event Management is the process that monitors all events that occur through the IT infrastructure to allow for normal operation and also to detect and escalate exception conditions” (SO p.35).
So it is about:
  – Detecting events
  – Making sense of them
  – Determining appropriate control actions in response to them
But also:
  – Acting as a basis for automating routine Operations Management, and
  – Because it provides data for comparison, supporting
    • Service Assurance and Reporting
    • Service Improvement

Event Management – Value
“Generally indirect” (SO p.39)
• EM provides mechanisms for early detection of incidents (possibly action before any impact felt)
• EM provides a basis for automated operations
• EM provides a basis for monitoring automated activity by exception
  – Reducing the need for “expensive and resource intensive real-time monitoring while reducing downtime”
• Improves performance of other major Processes (early responses, more business benefit from more effective and efficient ITSM)

EM Basics 2 – Event Definition
What is an Event?
Any detectable or discernible occurrence that has significance for the management of the IT infrastructure or the delivery of IT service and the evaluation of the impact a deviation may cause to a service. Events are typically notifications created by an IT service, Configuration Item (CI) or monitoring tool. (SO, p.35-36)

EM Basics 3 – Event Definition (Breadth)
Checking the official scope doesn’t narrow it down much:
“Event Management can be applied to any aspect of Service Management that needs to be controlled and which can be automated” (SO p.36).

EM Basics 4 – Event Type
But there is more detail – the guidance suggests that you subdivide Events and “that at least these three broad categories be represented” in your Event Types:
1. Informational
   • There is no action required
   • Signifies regular operation (not an exception)
2. Warning
   • Approaching a threshold
   • Signifies unusual, but not exceptional, operation
3. Exception
   • Abnormal operation. Breach of parameters.
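
The three categories above can be sketched as a simple threshold check. This is only an illustration under stated assumptions: the function name `classify_event`, the parameters `warning_at` and `limit`, and the disk-usage figures are hypothetical and do not come from the Service Operation guidance, which names the categories but not how they are computed.

```python
from enum import Enum


class EventType(Enum):
    INFORMATIONAL = "Informational"
    WARNING = "Warning"
    EXCEPTION = "Exception"


def classify_event(value: float, warning_at: float, limit: float) -> EventType:
    """Map one metric reading onto the three broad event categories.

    warning_at and limit are hypothetical, site-specific thresholds.
    """
    if warning_at > limit:
        raise ValueError("warning_at must not exceed limit")
    if value >= limit:
        return EventType.EXCEPTION      # abnormal operation, breach of parameters
    if value >= warning_at:
        return EventType.WARNING        # approaching a threshold
    return EventType.INFORMATIONAL      # regular operation, no action required


# Illustrative values only: disk usage in percent, warning at 80, limit at 95.
for reading in (42.0, 85.0, 97.0):
    event_type = classify_event(reading, warning_at=80.0, limit=95.0)
    print(f"{reading:5.1f}% -> {event_type.value}")
```

In practice the categorisation would more likely be expressed as rules in whatever monitoring tool is installed; the point of the sketch is only that each event type needs an explicit, agreed boundary.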
Note also: Alert (to trigger human attention or intervention) [SO, p.40]

EM Basics 5 – Process Flowchart
(Illus.) Event Management process flowchart; node labels as extracted: Event; Event Notification Generated; Event Detected; Event Filtered; Significance? (Informational / Warning / Exception); Event Correlation; Trigger; Event Logged; Auto Response; Alert; Type? (I / C / P); Human Intervention; Incident Management; Problem Management; Change Management; Review Actions; Effective? (Yes / No); Close Event; End.
© Crown copyright 2008. Reproduced from the OGC's ITIL® core volume: Service Operation. All rights acknowledged.

EM Basics 6 – Process Activities Summary
• Event Occurs – Notification / Detection
• Filtering (Categorisation)
• Correlation (Logic/rules)
  – Note: Load
• Trigger / Response Selection
  – Note: Human Perception
• Review / Close
(Diagram labels: I, C, P)

EM Basics 7 – Events and Infrastructure
Consider the extent to which your process design is and must be connected to your installed architecture
  – Notification/Detection: How are you detecting and how are notifications sent or collected (and what impact does this have)?
  – Filtering/Categorising: events into I, W, E streams, ignore event (or log/record locally)
  – Triggering an Alert, Auto Response, or related Process (does your architecture allow this?)

EM – Lifecycle & Summary
(Illus.) Lifecycle diagram: Service Strategy, Service Design, Service Transition, Service Operation, CSI
In the Lifecycle concept that is at the heart of v3 ITIL, the Event Management process is seated in Service Operation with the full set of SO processes including:
  – Event Management
  – Incident Management
  – Request Fulfilment
  – Problem Management
  – Access Management
  – Operational aspects of other Processes

The EM & Monitoring Relationship
If we revisit the basic definition:
“Event Management is the process that monitors all events that occur through the IT infrastructure to allow for normal operation and also to detect and escalate exception conditions” (SO p.35).
While the Service Operation book provides a high level model of a ‘sample’ EM process, have we really looked at its key activity sufficiently ...

Designing EM – Alternate Lifecycle
(Illus.) Lifecycle diagram: Service Strategy, Service Design, Service Transition, Service Operation, CSI
“In an ideal world, the Service Design process should define which events need to be generated and then specify how this can be done for each type of CI. During Service Transition, the event generation options would be set and tested”. (SO p.39)

Monitoring and Infrastructure
The base monitoring architecture:
• Agent based
• Agent less
• A sample of an evolved monitoring architecture

Agent based
Advantages:
* Technically more efficient
* Possible offline operation
* Often richer in functionality
Disadvantages:
* More complicated to install
* Agent disk footprint
(Illus.) Agent-based architecture; labels as extracted: GUI Console; Monitoring Server; Hub / gateway / Monitoring Server; Config; History (Alerts, Metrics); Server (Windows/Unix) running an Agent with stored config; config (once); Alerts; Metrics; UP? Mem? CPU?; system error log; app log; disks; script; CMD; Process 1, Process 2, Process 3.

Agent-less
Advantages:
* No agent to install -> easy to install
* No agent footprint
Disadvantages:
* More load on monitored machine
* Less resilient to network problems
(Illus.) Agent-less architecture; labels as extracted: Web Console; Web Server; Monitoring Server; Config; History (Alerts, Metrics); Cross-Machine Scheduling Loop; Schedules; New connection every cycle; Remote Execute; Alerts; Metrics; Server (Windows/Unix); relist processes & filter; rescan file (system error log, app log); check disks; script; CMD; Process 1, Process 2, Process 3.

Design Considerations – Starting Systems
(Illus.) Five servers: two Unix Database Servers, a Windows Database Server, a Unix Application Server and a Windows Application Server. Labels as extracted: CRON; Oracle 1, Oracle 2; Sybase 1, Sybase 2; MSS 1, MSS 2; App 1 Proc 1, App 1 Proc 2, App 1 Proc 3, App 2 Proc 1; CPU, Disk, Mem and Logs for each server.

Design Considerations – System Capacity
(Illus.) The same five servers, with added labels: Cap (one per server); Open Source Capacity Tool; In House GUI.

Design Considerations – DB Mon. Capacity
(Illus.) The same five servers, with added labels: Database Monitoring; DbMon on the database servers; Database Cap Plan; Web Reports.

Design Considerations – App. Log Check
(Illus.) The same five servers, with added labels: Agent (one per server); Monitoring Server with thresholds & app-specific monitoring configuration.

ESM Arch (Generic)
(Illus.) Generic ESM architecture; labels as extracted: Central Event Server; Rules; Network Monitoring (events); Additional Departmental Monitoring (Application Specific) (events); Database Monitoring with DbMon on the database servers (events); Cap; Live Outage Report; Incident Management System (Ticket w/ Events); CRON; Agent on each of the five servers (event details); Monitoring Server with thresholds & app-specific monitoring configuration.

The Two Perspectives
Operations led and Design led
  – Operations led delivers the everyday working process
  – Operations led vision is really pre-Incident Incident management
  – Design led establishes a conduit between IT Service Management and the underlying technology
  – Design led has the potential to be a very effective front end and interface for traditionally less visible processes:
    • Performance & Management Information (dashboards)
    • Capacity
    • Availability

Strategy – What it Takes to Do EM
(Illus.) Lifecycle diagram: Service Strategy, Service Design, Service Transition, Service Operation, CSI
Start in the center ... First ask “Why?”