BT – Managing Complex Systems
Ian Johnston & John Palmer
BCS Kingston & Croydon Branch presentation, 26/02/08

Presentation Objectives
• Approach to managing e2e systems
• A standard for application events
• Business process and component transaction monitoring
• Order tracking and jeopardy
• Leveraging the value of monitoring, eg ASGs, Service and Capacity etc.
• Managing COTS products, eg BEA, Siebel

© British Telecommunications plc

The BT experience
• BT architecture – SOA – linked reusable capabilities
• Our position has been driven by experience in monitoring complex distributed architectures.
• Configuring toolsets to monitor e2e is unachievable for large enterprises: maintenance is expensive or impossible.
• This has led us along the Design route, which now parallels ITIL's Service Design concepts.

BT’s Matrix Architecture

BT Matrix Architecture Challenges - Service Design
• Service Level Management (SLM)
– Targets, eg volumes, availability, response times
– Aligning with UCs
– Understanding CE requirements
– SLAs aligned to business requirements
– Defining measurements & BT’s outsourcing strategy
• Capacity Management
– Procedures to ensure customer targets are met
– Accurate measurement of transaction volumes
– Response times broken down by capability
• Business Continuity management / IT Service Continuity management
– Deployment designs to ensure resilience
– Dynamic deployment in virtualised environments
– Physical and geographic resilience
• Availability management
– Measure e2e availability broken down to capabilities

BT Matrix Architecture Challenges - Service Operation
• Operational management
– How to assess the impact and prioritise application events by business process and IT Service?
• Application management
– Routing of PRs to the appropriate support groups?
– Analysing high volumes of events in log files?
• Technical management
– Pinpointing root cause across multiple shared capabilities
• Metrics
– Stepped changes in volumes, errors and response times?
– Impact of changes, eg trend in error rates
– Measuring operational efficiency, eg txns vs. failures

BT Matrix Architecture Challenges – E2E Design
[Process flow diagram: the End Customer "Place Order" flow, continued from "SF – Provide – Progress-pt1". NB: incorporates Flow Stream / Manage / Monitor / Director. Steps shown: Create ServiceID, Build Port, Get Tie Cable Mapping, Build VC (RADIUS, B-RAS, VCI, etc), Update (SMPF ID, Installation DN etc), Activation email. A Network Capacity Shortfall goes into an error queue for manual processing. Order statuses shown: Pending, Acknowledged, Committed, Assigned (SMPF ID), Installed, Completed. The flow continues to the "Close Order" sub-process.]

BT Approach – Application event standard
[Diagram: the Application Standard and the attributes carried by each event]
• Business Process
• Business transaction
• Event type
• Time
• Business keys
• e2e correlation key
• Host server
• Component capability

BT Matrix Architecture Solution - Service Design
SLM
• Agile design workshop to build in measures to support SLAs
Availability
• Agile capability workshops to build in measures for monitoring of capacity, implemented by APIs
• Standardised events for common error conditions such as interface failures
IT Service Continuity
• Dynamic reports of services and deployment profile (host/server distribution)

BT Matrix Architecture Solution - Service Operation
Operational management
• Event correlation (by service and transaction identifiers)
• Impact (problem scenario and guided action)
• Performance bottlenecks
• Support group checklists (quick wins)
Application management
• Improved routing of PRs to the appropriate support groups
provided by e2e view
• We can analyse high volumes of events by restricting the types of events and providing summarisation
Technical management
• Diagnosis – root cause (e2e location and standard error)
Metrics
• Summarisation and granularity inherent in the standard

BT Application Monitoring Standard

Outsourcing Supplier Contracts
1. Monthly views to identify any stepped changes in
– Volumes, response times, error rates
2. Weekly views of top 5-10 transactions showing
– Distribution of volumes, variance in response times, peaks and spikes
– Any worsening trends in errors and thresholds
3. Monthly analysis of error messages showing
– Volumes of errors, eg aborts, application, business, etc.
– Breakdown by business process, IT service and component transaction
– Corresponding traps and CR/DRs using AlarmMis
4. Ad-hoc investigations to review
– Loadings and relative performance across servers
– Real-time transaction analysis
– Drill-down diagnostics
– COTS, platform and network root cause analysis
5. Service management process to review
– Capacity
– Supplier’s (eg Siebel, WLS) and applications development group’s CRs and DRs
– PRs against remedial activities

What is the BT experience?
Key messages
• Define a standard for application events
• Instrumentation by design, built into matrix capabilities
• Implementation using agile design workshops
• Exploitation of the toolset supported by supplier contracts
• The application monitoring standard promotes effective problem management by integration with the enterprise's diagnostic toolsets

Hunter Integration Console
[Diagram labels: Management Frameworks; System & Application Trap Definitions; Events; COTS Monitoring definitions, e.g., Siebel, BEA, Oracle; Performance; Remote Operation; Business Process & Application txn Monitoring]
• Flexible & agile
• Uses COTS out-of-the-box
• Rapid development & deployment
• Any management frameworks
• Low maintenance
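
Illustrative sketch (not BT's implementation): the application event standard in this deck names the attributes each event carries (business process, business transaction, event type, time, business keys, e2e correlation key, host server, component capability) and uses them for event correlation, root-cause location and summarisation. The Python below shows one way such a record and those three uses might look. All field names, the event type values ("start", "end", "error"), the capability and host names, and the sample data are assumptions made for illustration; only the attribute list and the "Place Order" and "Build Port" step names come from the slides.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ApplicationEvent:
    """One standard application event. Field names are hypothetical."""
    business_process: str
    business_transaction: str
    event_type: str            # assumed values: "start", "end", "error"
    time: datetime
    e2e_correlation_key: str
    host_server: str
    component_capability: str
    business_keys: tuple = ()  # e.g. (("order_id", "A1"),)


def correlate(events):
    """Group events by e2e correlation key, each group ordered by time."""
    journeys = defaultdict(list)
    for event in events:
        journeys[event.e2e_correlation_key].append(event)
    for journey in journeys.values():
        journey.sort(key=lambda e: e.time)
    return dict(journeys)


def first_error_location(journey):
    """Return (capability, host) of the earliest error in a journey, or None."""
    for event in journey:
        if event.event_type == "error":
            return event.component_capability, event.host_server
    return None


def errors_by_capability(events):
    """Summarise error events as a count per component capability."""
    return Counter(
        e.component_capability for e in events if e.event_type == "error"
    )


# Hypothetical sample data: two orders, one failing in a network build step.
SAMPLE_EVENTS = [
    ApplicationEvent("Provide", "Place Order", "start",
                     datetime(2008, 2, 26, 9, 0, 0), "ORD-1",
                     "host-a", "Order Capture"),
    ApplicationEvent("Provide", "Build Port", "error",
                     datetime(2008, 2, 26, 9, 0, 5), "ORD-1",
                     "host-b", "Network Build"),
    ApplicationEvent("Provide", "Place Order", "start",
                     datetime(2008, 2, 26, 9, 1, 0), "ORD-2",
                     "host-a", "Order Capture"),
    ApplicationEvent("Provide", "Place Order", "end",
                     datetime(2008, 2, 26, 9, 1, 2), "ORD-2",
                     "host-a", "Order Capture"),
]

if __name__ == "__main__":
    for key, journey in correlate(SAMPLE_EVENTS).items():
        print(key, first_error_location(journey))
    print(dict(errors_by_capability(SAMPLE_EVENTS)))
```

Because every event carries the same keys, the correlation and summary functions need no per-application configuration, which is the point the deck makes about instrumenting by design rather than configuring toolsets per system.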