BT – Managing Complex Systems
Ian Johnston & John Palmer
BCS Kingston & Croydon Branch presentation 26/02/08
Presentation Objectives
• Approach to managing e2e systems
• A standard for application events
• Business process and component transaction monitoring
• Order tracking and jeopardy
• Leveraging the value of monitoring, e.g. ASGs, Service and Capacity etc.
• Managing COTS products, e.g. BEA, Siebel
© British Telecommunications plc
The BT experience
• BT architecture – SOA – linked reusable capabilities
• Our position has been driven by experience in monitoring complex distributed architectures.
• Configuring toolsets to monitor e2e is unachievable for large enterprises: maintenance is expensive or impossible.
• This has led us along the Design route, which now parallels ITIL's Service Design concepts.
BT’s Matrix Architecture
BT Matrix Architecture Challenges - Service Design
• Service Level Management
– SLM targets, e.g. volumes, availability, response times
– Understanding CESLAs requirements
– SLAs aligned to business requirements
– Aligning with UCs
– Procedures to ensure customer targets are met
• Capacity Management
– Accurate measurement of transaction volumes
– Response times broken down by capability
– Defining strategy, measurements & BT's outsourcing
• IT Service Continuity management
– Business Continuity management
– Deployment designs to ensure resilience
– Dynamic deployment in virtualised environments
– Physical and geographic resilience
• Availability management
– Measure e2e availability broken down to capabilities
BT Matrix Architecture Challenges - Service Operation
• Operational management
– How to assess the impact and prioritise application events by business process and IT service?
• Application management
– Routing of PRs to the appropriate support groups?
– Analysing high volumes of events in log files?
• Technical management
– Pinpointing root cause across multiple shared capabilities?
• Metrics
– Stepped changes in volumes, errors and response times?
– Impact of changes, e.g. trend in error rates
– Measuring operational efficiency, e.g. txns vs. failures
BT Matrix Architecture Challenges – E2E Design
[Figure: e2e order flow for the End Customer, from "SF – Provide – Progress-pt1" (Place Order) to the "Close Order" sub-process. NB: incorporates Flow Stream / Manage / Monitor / Director. Steps shown: Place Order, Create ServiceID, Build Port, Get Tie Cable Mapping, Build VC (RADIUS, B-RAS, VCI, etc.), Update (SMPF ID, Installation DN etc.), Activation email. Order states shown: Pending, Acknowledged, Committed, Assigned (SMPF ID), Installed, Completed, Complete. A Network Capacity Shortfall sends the order into an error queue for manual processing.]
BT Approach – Application event standard
[Diagram: the Application Standard, surrounded by the attributes of an application event]
• Business process
• Business transaction
• Event type
• Time
• Business keys
• e2e correlation key
• Host / server
• Component capability
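As an illustration only, the event attributes named on this slide could be carried in a record like the one below. This is a minimal sketch, not the BT standard itself: the field names, the types, the example event-type value and the pipe-delimited log format are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict


@dataclass
class ApplicationEvent:
    """One application event with the attributes named on the slide.

    Field names, types and the log format are illustrative assumptions.
    """
    business_process: str
    business_transaction: str
    event_type: str               # assumed values, e.g. "START", "END", "ERROR"
    time: datetime
    application: str
    host: str
    server: str
    component_capability: str
    e2e_correlation_key: str      # links events raised by different capabilities
    business_keys: Dict[str, str] = field(default_factory=dict)

    def to_log_line(self) -> str:
        """Render the event as one pipe-delimited line (hypothetical format)."""
        keys = ",".join(f"{k}={v}" for k, v in sorted(self.business_keys.items()))
        return "|".join([
            self.time.isoformat(),
            self.event_type,
            self.business_process,
            self.business_transaction,
            self.application,
            self.component_capability,
            f"{self.host}/{self.server}",
            self.e2e_correlation_key,
            keys,
        ])


# Hypothetical example event; all values are made up for illustration.
event = ApplicationEvent(
    business_process="Provide",
    business_transaction="Place Order",
    event_type="ERROR",
    time=datetime(2008, 2, 26, 9, 30, tzinfo=timezone.utc),
    application="OrderManager",
    host="host01",
    server="wls-1",
    component_capability="Build Port",
    e2e_correlation_key="ORD-0001",
    business_keys={"ServiceID": "S123"},
)
line = event.to_log_line()
print(line)
```

Keeping every event in one fixed shape is what lets downstream tools correlate and summarise without per-application configuration.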
BT Matrix Architecture Solution - Service Design
SLM
• Agile design workshop to build in measures to support SLAs
Availability
• Agile capability workshops to build in measures for monitoring of capacity, implemented by APIs
• Standardised events for common error conditions such as interface failures
IT Service Continuity
• Dynamic reports of services and deployment profile (host/server distribution)
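A dynamic deployment-profile report of this kind could, in principle, be derived from the events themselves, since each event names its host and server. The sketch below assumes events have already been reduced to hypothetical (IT service, host, server) tuples; the service and host names are invented for the example.

```python
from collections import defaultdict

# Hypothetical events reduced to (it_service, host, server).
events = [
    ("Broadband Provide", "host01", "wls-1"),
    ("Broadband Provide", "host01", "wls-2"),
    ("Broadband Provide", "host02", "wls-1"),
    ("Billing", "host03", "wls-1"),
    ("Broadband Provide", "host01", "wls-1"),
]


def deployment_profile(events):
    """Return {service: {(host, server): event_count}} from observed events."""
    profile = defaultdict(lambda: defaultdict(int))
    for service, host, server in events:
        profile[service][(host, server)] += 1
    # Convert nested defaultdicts to plain dicts for reporting.
    return {svc: dict(dist) for svc, dist in profile.items()}


profile = deployment_profile(events)
for service, dist in sorted(profile.items()):
    print(service)
    for (host, server), count in sorted(dist.items()):
        print(f"  {host}/{server}: {count} events")
```

Because the report is built from what actually ran, it follows redeployments without anyone maintaining a separate inventory.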
BT Matrix Architecture Solution - Service Operation
Operational management
• Event correlation (by service and transaction identifiers)
• Impact (problem scenario and guided action)
• Performance bottlenecks
• Support group checklists (quick wins)
Application management
• Improved routing of PRs to the appropriate support groups, provided by the e2e view
• High volumes of events can be analysed by restricting the types of events and providing summarisation
Technical management
• Diagnosis – root cause (e2e location and standard error)
Metrics
• Summarisation and granularity inherent in the standard
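The correlation and summarisation ideas on this slide can be sketched as follows. This is an assumption-laden illustration: the tuple layout, the "OK"/"ERROR" event types and the rule "the first error in a chain marks the e2e location of the problem" are choices made for the example, not details taken from the BT standard.

```python
from collections import Counter, defaultdict

# Hypothetical events: (e2e_correlation_key, sequence, capability, event_type)
events = [
    ("ORD-1", 1, "Place Order", "OK"),
    ("ORD-1", 2, "Build Port", "ERROR"),
    ("ORD-1", 3, "Close Order", "ERROR"),
    ("ORD-2", 1, "Place Order", "OK"),
    ("ORD-2", 2, "Build Port", "OK"),
    ("ORD-3", 1, "Place Order", "OK"),
    ("ORD-3", 2, "Build Port", "ERROR"),
]


def correlate(events):
    """Group events by e2e correlation key, ordered by sequence number."""
    by_key = defaultdict(list)
    for key, seq, capability, event_type in events:
        by_key[key].append((seq, capability, event_type))
    return {key: sorted(chain) for key, chain in by_key.items()}


def first_error(chain):
    """Return the capability that raised the earliest error, or None."""
    for _seq, capability, event_type in chain:
        if event_type == "ERROR":
            return capability
    return None


chains = correlate(events)
first_errors = {key: first_error(chain) for key, chain in chains.items()}

# Summarise: how many transactions first failed in each capability.
summary = Counter(cap for cap in first_errors.values() if cap is not None)
print(first_errors)
print(summary.most_common())
```

In this example the later "Close Order" error on ORD-1 is treated as a knock-on effect, so only "Build Port" is counted for that transaction.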
BT Application Monitoring Standard
Outsourcing Supplier Contracts
1. Monthly views to identify any stepped changes in
– Volumes, response times, error rates
2. Weekly views of top 5-10 transactions showing
– Distribution of volumes, variance in response times, peaks and spikes
– Any worsening trends in errors and thresholds
3. Monthly analysis of error messages showing
– Volumes of errors, e.g. aborts, application, business, etc.
– Breakdown by business process, IT service and component transaction
– Corresponding traps and CR/DRs using AlarmMis
4. Ad-hoc investigations to review
– Loadings and relative performance across servers
– Real-time transaction analysis
– Drill-down diagnostics
– COTS, platform and network root cause analysis
5. Service management process to review
– Capacity
– Supplier's (e.g. Siebel, WLS) and applications development group's CRs and DRs
– PRs against remedial activities
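One simple way to flag the "stepped changes" mentioned in item 1 is to compare the average of the most recent periods with the average of the periods before them. The sketch below is only one possible rule: the window size, the 25% threshold and the sample figures are assumptions for illustration, not values from the presentation.

```python
from statistics import mean


def stepped_change(series, window=3, threshold=0.25):
    """Relative change between the last `window` values and the `window`
    values before them.

    Returns None when there is too little data, when the earlier average
    is zero, or when the change is smaller than `threshold`.
    """
    if len(series) < 2 * window:
        return None
    previous = mean(series[-2 * window:-window])
    recent = mean(series[-window:])
    if previous == 0:
        return None
    change = (recent - previous) / previous
    return change if abs(change) >= threshold else None


# Hypothetical monthly figures for one transaction type.
monthly_volumes = [1000, 1020, 990, 1500, 1480, 1530]
monthly_response_ms = [210, 205, 215, 212, 208, 214]

print("volumes:", stepped_change(monthly_volumes))
print("response times:", stepped_change(monthly_response_ms))
```

The same function can be applied to volumes, response times and error rates alike, which is practical only because the event standard gives all three a common shape.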
What is the BT experience?
Key messages
• Define a standard for application events
• Instrumentation by design, built into matrix capabilities
• Implementation using agile design workshops
• Exploitation of the toolset, supported by supplier contracts
• The application monitoring standard promotes effective problem management through integration with the enterprise's diagnostic toolsets
Hunter Integration Console
[Diagram labels: Management Frameworks; System & Application Trap Definitions; Events; COTS Monitoring definitions, e.g. Siebel, BEA, Oracle; Performance; Remote Operation; Business Process & Application txn Monitoring]
• Flexible & agile
• Uses COTS out-of-the-box
• Rapid development & deployment
• Any management frameworks
• Low maintenance