Advanced Management Technologies For Exchange 5.5
Greg Todd
Program Manager, NT Solutions Group
BMC Software, Inc.

Agenda
- Current issues with problem diagnosis
- Theory of root cause analysis (RCA)
- Primer on RCA
- How RCA can help you today
- Application availability timeline
- Demos of RCA on Exchange 5.5
- Systems management vision
- Management maturity curve
- The future of Exchange management

The Business Problem
- Event automation #1 priority of IT executives
- Problem diagnosis is a critical aspect that requires attention
- Wasted Time: 80% of down time spent diagnosing, 20% of time spent fixing
- Wasted Resources: Diagnosis often a finger-pointing exercise
- Frustrated Users: Users have no idea what to expect
Source: Gartner, 1998

Application Availability Timeline
- PoF: Point of Failure
- PoN: Point of Notification
- PoD: Point of Diagnosis
- PoR: Point of Recovery
- PoP: Point of Postmortem
- Phases along the timeline: Monitoring, Analysis, Recovery, Evolution

Application Availability Timeline
[Timeline over time: PoF, PoN, PoD, PoR, PoP; phases: Monitoring, Root Cause Analysis, Recovery, Evolution]
- Application Violating Service Level

Application Availability Timeline
[Timeline over time: PoF, PoN, PoD, PoR, PoP; phases: Monitoring, Root Cause Analysis, Recovery, Evolution]
- Application Violating Service Level
- Significant Decrease
- Faster Service Restoration
- Diagnosis Time Reduced

Benefits Of RCA
- Based on well-established theories
- Quicker problem resolution
- Problem isolation saves resources to address the real problem
- Symptom filtering allows administrator to ignore sympathetic events
- Performs tests to find the root cause
- Far superior to rules-based approach
- Key enabler to make systems self-sufficient
- Provides impact analysis capability

RCA Key concepts
- Symptoms are problems to be investigated
- Faults are the root causes of these symptoms
- Tests are active tasks which gather information
- RCA is a problem analysis methodology geared towards finding the real cause of a problem and preventing it from happening again.

Rules-Based Approach Vs.
RCA

Rules-Based:
1. Symptom received
2. Possible causes looked up in a fixed table of rules
3. Set of possible causes presented to user
4. Only suggested actions can be provided to user

Root Cause Analysis:
1. Symptom received
2. Possible causes determined from a generic fault model
3. Each cause is tested against suspects
4. Actual root cause is presented to user after suspects are eliminated
5. Specific actions can be provided to user

Root Cause Analysis For Exchange Server
Three components that work synergistically:
- Exchange Server
- Windows NT
- IP Network

High Level RCA Architecture
[Diagram: Enterprise Console; Mid-Level Manager; three Managed Nodes]

RCA Architecture
[Diagram labels: Exchange Server and OS KMs; BMC PATROL; Managed Node; Mid-Level Manager; KM; ARB Bridge; RTEP KM; Javalink Bridge; Protocol Layer; Agent Request Broker; RCA Engine; Realtime Event Proxy; Mid-level agent; Enterprise Console; ARB KM; Custom ARB; Other Monitor; Diagnostic KM]

Root Cause Analysis
Sample problem
[Diagram labels: Inbound Server, Exchange Server D, Remote Office Exchange Server, T1 Link to Remote Office, Inbound Messages, To Internet, Firewall, Bridgehead Server, Exchange Server A, Outbound Messages, Outbound Server, Exchange Server C, Exchange Server B]
[Legend: Internal Mail; Internal & Internet Mail; Internet Mail]

PATROL RCA
Sample problem
Symptom received by model: Queue Growth Alarms from multiple Exchange Servers
- Queue Growth on Server A
- Queue Growth on Server B
- Queue Growth on Server C
- Queue Growth on Server D
Suspected root causes found in model
- CPU Usage High
- Memory Bottlenecks
- MTA down on target machine
- Network Problem

PATROL RCA
Sample problem
Suspected root causes tested
- ? CPU Usage High
- ? Memory Bottlenecks
- ? MTA down on target machine
- ? Network Problem
Root cause isolated: CPU usage high on bridgehead
- CPU Usage High
- Memory Bottlenecks
- MTA down on target machine
- Network Problem

Demo
Simple RCA Scenario

Sample Generic Fault Model

Sample Specific Fault Model

Sample Specific Fault Model Close-up

Demo
RCA Engine Causal Directed Graphs

Demo
Root Cause Analysis: Exchange, NT, IP Network

Demo
Impact Analysis: Exchange, NT, IP Network

Benefits Of RCA
- Based on well-researched theories
- Quicker problem resolution
- Problem isolation saves resources to address the real problem
- Symptom filtering allows administrator to ignore sympathetic events
- Performs tests to find the root cause
- Far superior to rules-based approach
- Key enabler to make systems self-sufficient
- Provides impact analysis capability

Systems Management Vision
Where’s all this stuff going?

Phases Of Management Maturity
- Based on commonly known process control theory
- Phases, from highest to lowest: VIRTUALIZE, STABILIZE, CONTROL, MANAGE, MONITOR
- Applies directly to management of complex software systems

Maturity Phases: MONITOR
- Monitoring is plumbing
- Included with Windows 2000 and Exchange 2000
- Server-centric data and event collection
- Monitors component and system data
- No awareness of other systems or apps
- Basic alerting, scripting, and actions
- WMI, PerfMon, HealthMon, Exchange 2000 monitoring

Maturity Phases: MANAGE
- Application-specific and server-centric
- View and take action on components
- Availability and performance monitoring
- Rich reporting
- Application SLA definition
- ASAP resolution when out of compliance
- Most correlation done in your head
- Some tools have reached this level
- Key enabler to Control phase

Maturity Phases: CONTROL
- Places system automation in control
- Provides holistic view of systems
- Enables high level of SLA compliance
- Quick problem diagnosis
- Action <--> Reaction
- Proactive correction before users feel impact
- Management automation maturing

Maturity Phases: STABILIZE
- Provides utility-level service
- Reliable as electric, telephone, water
- Assures continuous application service
- Clusters
(STABILIZE, continued)
- Built-in fault tolerance, re-routing, workload management
- Failure does not impact service
- Prediction / impact analysis
- Awareness of impact on SLAs caused by planned changes

Maturity Phases: VIRTUALIZE
- The system learns how to intelligently deal with various issues
- Automatic everything
- Actions and responses for the IT group
- Alerts and communications
- Acquires and stores knowledge for future reference
- Uses policy engines to control actions
- Systems become truly self-sufficient
- User becomes self-serviced

Virtualization Example
Problem Research Assistant
Correlates problem root cause diagnoses with:
- Previous resolutions - presents the user with previous remedies based on exact matches or best guess
- On-line technical documentation - integrates with vendor-supplied support documentation (e.g. Microsoft Knowledge Base articles)
- Technical Support Request Generator - formats required user information and diagnosed fault into a support request, according to vendor-specific templates

Virtualization Example
Problem Research Assistant
[Diagram labels: Correlation Backend; Diagnosed Faults; Problem Research Assistant Bridge; RCA Server; Help; Previous Resolutions; Domain Model (three instances); IP Reachability Analyzer; Online Technical Articles; Problem Response History Repository; Support Requests]

RCA Takes Management To The Next Level
[Diagram: maturity phases VIRTUALIZE, STABILIZE, CONTROL, MANAGE, MONITOR; labels: Many Players, Many Choices; Root Cause Analysis]

Summary
- GOAL: No interruptions in service
- RCA is key to Exchange availability
- RCA paves the way to virtualization
- Accelerates the diagnosis process
- Can assess impact of failures beforehand
- Not unreasonable to achieve “five 9’s”
- Managed systems that learn and adapt
- You never have to intervene
- Free to invest more time in pro-activity
- RCA is in beta now!!
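
For scale, “five 9’s” is commonly read as 99.999% availability, and the downtime it allows follows from simple arithmetic. The lines below are a minimal Python sketch of that calculation, assuming a 365-day year; the function name `downtime_budget_minutes` is made up for illustration and does not come from any BMC or Microsoft tool.

```python
# Assumption: a non-leap, 365-day year.
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_budget_minutes(availability: float) -> float:
    """Minutes of downtime per year allowed at the given availability (0..1)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


# Compare a few availability targets, ending with "five 9's".
for label, availability in [
    ("three 9's", 0.999),
    ("four 9's", 0.9999),
    ("five 9's", 0.99999),
]:
    minutes = downtime_budget_minutes(availability)
    print(f"{label}: {minutes:.2f} minutes of downtime per year")
```

Under these assumptions, five 9's leaves a little over five minutes of downtime per year, which is why shortening the diagnosis phase matters so much to the goal.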

Call To Action
- Demand sophistication and simplicity in Exchange management solutions
- Solutions that learn
- Solutions that are easy to use
- Start thinking of Exchange availability in terms of utility-level service
- Consider where to implement RCA in your current environment
- Bring along those whom you service
- Take care of your users
- Communicate with them as you progress
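
The flow in the PATROL RCA sample problem (symptom received, suspects looked up in a fault model, each suspect tested, root cause isolated) can be sketched in a few lines of Python. This is an illustration only, not the PATROL RCA engine: the fault model, the thresholds, the observation values, and every name below are hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical generic fault model: symptom -> suspected root causes.
FAULT_MODEL: Dict[str, List[str]] = {
    "queue_growth": [
        "cpu_usage_high",
        "memory_bottleneck",
        "mta_down_on_target",
        "network_problem",
    ],
}

# Made-up measurements standing in for live tests against the servers.
OBSERVATIONS = {
    "bridgehead_cpu_percent": 97,
    "bridgehead_free_memory_mb": 512,
    "target_mta_running": True,
    "network_reachable": True,
}

# One active test per suspected fault; True means the fault is confirmed.
# Thresholds are arbitrary values chosen for the example.
TESTS: Dict[str, Callable[[dict], bool]] = {
    "cpu_usage_high": lambda obs: obs["bridgehead_cpu_percent"] > 90,
    "memory_bottleneck": lambda obs: obs["bridgehead_free_memory_mb"] < 64,
    "mta_down_on_target": lambda obs: not obs["target_mta_running"],
    "network_problem": lambda obs: not obs["network_reachable"],
}


def diagnose(symptom: str, observations: dict) -> List[str]:
    """Return the suspects for `symptom` that their tests confirm."""
    suspects = FAULT_MODEL.get(symptom, [])
    return [fault for fault in suspects if TESTS[fault](observations)]


print(diagnose("queue_growth", OBSERVATIONS))  # ['cpu_usage_high']
```

A rules-based tool would stop after the lookup in `FAULT_MODEL` and hand all four suspects to the administrator; the testing step is what narrows the list to the one confirmed cause.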