Problem Management Overview

Problem Management
Version 3.2
[Diagram: Service Desk, Problem Management, Incident Management]
Problems Versus Incidents
• Incident: Any event not part of standard operation of a service that causes an interruption to or a reduction
in the quality of that service.
• Problem: The unknown underlying cause of one or more Incidents. A condition identified from multiple
Incidents exhibiting common symptoms, or from a single major Incident, indicative of a single error, for which
the cause is unknown.
• Problem Management: Process to minimize the adverse impact of incidents, and to proactively prevent the
reoccurrence of incidents, problems, and errors.
Incident
• Symptom [of a problem]
• Each occurrence of the embodiment of a problem
• “Complete” when normal operation (from a customer perspective) is restored
• “Put the fire out now…” (Firefighting)
Problem
• Root/underlying cause(s) of an Incident or Incidents
• “Complete” when the underlying cause(s) are permanently removed
• Long term resolution
• “How did this happen?…” (Arson Investigation)
Incident Analysis Approach
While the incident record is open, or after it has been resolved, it is reviewed for problem
management inclusion. If the incident (or group of related incidents) meets the criteria, a problem
record is created to identify the cause, the solution, and the controls that prevent recurrence.
Incidents, criteria, and resulting problems:
• Critical / High Priority incidents with extensive impact and potential to re-occur: Critical Problem
• Independent incidents with significant impact and a possible shared root cause: High Priority Problem
• Multiple moderate incidents, possibly with the same cause or multiple causes: Medium Priority, Separated Problems
• Data analysis of incident records to determine process, system, or network issues: determine Problem
Management inclusion
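The criteria above can be sketched as a small rule set. This is only an illustration: the `IncidentGroup` fields and the `classify` function are hypothetical names, and the exact thresholds (for example, what counts as "extensive" impact) are assumptions that a real process would define itself.

```python
from dataclasses import dataclass


@dataclass
class IncidentGroup:
    """Hypothetical summary of one incident or a group of related incidents."""
    count: int                    # number of incidents in the group
    priority: str                 # "critical", "high", "moderate", or "low"
    extensive_impact: bool        # extensive or significant impact
    may_recur: bool               # potential to re-occur
    possible_shared_cause: bool   # incidents may share a root cause


def classify(group: IncidentGroup) -> str:
    """Map a group of incidents to a problem outcome, most severe rule first."""
    if (group.priority in ("critical", "high")
            and group.extensive_impact and group.may_recur):
        return "Critical Problem"
    if group.extensive_impact and group.possible_shared_cause:
        return "High Priority Problem"
    if group.count > 1 and group.priority == "moderate":
        return "Medium Priority Problem"
    # Anything else is left to data analysis of the incident records.
    return "Determine Problem Management inclusion"
```

Ordering the rules from most to least severe means a group that satisfies several criteria is always assigned the highest applicable outcome.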
Incident, Problem and Known Error definitions in ITIL
Common definitions and a shared language are key to communication. That is why ITIL has its own
vocabulary, used by everyone working in an ITIL environment. Below are the definitions of incident,
problem and known error. These three terms are very common in the Incident Management process, and it
is crucial to understand the differences between them.
Incident - any event that is not part of the standard operation of a service and that causes, or
may cause, an interruption to, or a reduction in, the quality of that service. A good example of an
incident is a lack of free space on somebody's hard disk.
Problem - the unknown underlying cause of one or more incidents. When does an incident become a
problem? Never: the problem ticket is created after the incident has been resolved and when the root
cause is not understood.
Known Error - a record created from an incident or problem for which the root cause is known and for
which a temporary workaround or permanent alternative has been identified. A known error stays in place
until it is permanently fixed by a change. An appropriate Request For Change (RFC) is raised in order to
fix the known error. Known Errors are in the scope of responsibility of Problem Management.
Problem Management Process
[Process flow diagram, "Future State"]
Sources: Service Desk, Incident Management, Event Management, Proactive Problem Management, Supplier or
Contractor
Steps: Problem Detection, Problem Logging, Categorization, Prioritization, Investigation and Diagnosis,
Workaround?, Create Known Error Record, Change Needed? (Yes / No), Resolution, Closure
Communications: Outage Notifications, Incident Reports, Problem Reports, Change Notifications
Other labels: Problem Management, Operations Team or Service Provider, Change Management, ITSM Problem
Module, Known Error Database, Solution Database
ITSM Problem Management Process Overview
[Swimlane diagram]
Roles: Requester, Problem Analyst, Problem Manager, Other Service Support / Delivery Processes (Incident
Management, Change Management, Configuration Management)
Numbered activities:
1 Problem Identification, Recording, and Classification
2 Review
3 Problem Investigation and Diagnosis
4 Problem Resolution and Closure
5 Known Error Identification and Recording
6 Known Error Classification and Assessment
7 Known Error Resolution and Closure
8 Knowledge Identification and Recording
9 Knowledge Validation and Publication
Other labels: Investigation Initiated, Change Implemented, (Mgmt Reports)
Timeline from START:
• Reviewed within 1 Business Day
• Root Cause Analysis within 3 Business Days (85%)
• Problem Report within 5 Business Days
• Permanent Solution within 30 Days
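The timeline targets can be turned into concrete due dates. The following is a minimal sketch under several assumptions that the slides do not state: the clock starts on the day the problem record is opened, business days are Monday to Friday with holidays ignored, and the 30-day target is counted in calendar days. The function names are hypothetical, and the 85% figure (a compliance target) is not modelled.

```python
from datetime import date, timedelta


def add_business_days(start: date, days: int) -> date:
    """Return the date `days` business days (Mon-Fri) after `start`."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


def problem_targets(opened: date) -> dict:
    """Due dates for each timeline milestone, keyed by milestone name."""
    return {
        "review": add_business_days(opened, 1),
        "root_cause_analysis": add_business_days(opened, 3),
        "problem_report": add_business_days(opened, 5),
        # Assumption: "30 Days" means calendar days, not business days.
        "permanent_solution": opened + timedelta(days=30),
    }
```

For a problem opened on Friday, 1 March 2024, the review would fall due on Monday, 4 March, because the weekend is skipped.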
Problem Management Workflow
Problem Management minimizes the adverse impact of incidents on the
business and enables root cause analysis to identify a permanent
solution.
Problem Detection
Trend analysis is the key to spotting problems. It gives Problem Management a
proactive approach: a problem can be prevented early rather than resolved
at a later stage.
Real-time trending as incidents come in can detect a problem early and
initiate a resolution before additional customers are impacted.
Problem investigations should be initiated for all Critical (Priority 1) and
High (Priority 2) impact incidents where the cause is unknown.
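As a rough illustration of real-time trending, the sketch below counts recent incidents per category and flags any category that reaches a threshold. The 24-hour window, the threshold of three, and the `(opened_at, category)` tuple layout are all assumptions chosen for the example. The second function encodes the Priority 1 / Priority 2 rule stated above.

```python
from collections import Counter
from datetime import datetime, timedelta


def detect_trends(incidents, now, window=timedelta(hours=24), threshold=3):
    """Return categories with at least `threshold` incidents inside the window.

    `incidents` is an iterable of (opened_at, category) tuples.
    """
    cutoff = now - window
    counts = Counter(
        category for opened_at, category in incidents
        if cutoff <= opened_at <= now
    )
    return sorted(cat for cat, n in counts.items() if n >= threshold)


def needs_investigation(priority: int, cause_known: bool) -> bool:
    """Critical (1) and High (2) incidents with an unknown cause qualify."""
    return priority in (1, 2) and not cause_known
```

A category returned by `detect_trends` is only a candidate: it still has to be reviewed against the inclusion criteria before a problem record is created.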
Problem Management Workflow - Identification and Classification
Problem Logging
Problem logging is critical because all the necessary information and conclusions from the
incident have to be captured when the problem is created. Avoid duplicates by searching for
similar existing problems before creating a new problem.
Categorization
Problem categorization is essential to avoid ambiguity. ITSM automatically applies the
incident categorization to a problem when it is created, which keeps the problem information
at the same level of understanding for the analyst.
Prioritization
Focus on business-critical problems based on problem prioritization. Prioritization helps
technicians identify the critical problems that need to be addressed. The impact and urgency
associated with a problem decide which problems are addressed first.
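One common way to combine impact and urgency is a simple matrix. The sketch below is an assumption, not the scheme used by the ITSM tool described here: it rates both values as high, medium, or low and derives a priority from 1 (most urgent) to 5. The problem IDs in the usage note are made up.

```python
# Hypothetical three-level scale; lower numbers are more severe.
LEVELS = {"high": 1, "medium": 2, "low": 3}


def problem_priority(impact: str, urgency: str) -> int:
    """Combine impact and urgency into a priority from 1 (first) to 5 (last)."""
    return LEVELS[impact] + LEVELS[urgency] - 1


def work_queue(problems):
    """Order (problem_id, impact, urgency) tuples by priority.

    `sorted` is stable, so problems with equal priority keep their input order.
    """
    return sorted(problems, key=lambda p: problem_priority(p[1], p[2]))
```

For example, `work_queue([("PBI-A", "low", "medium"), ("PBI-B", "high", "high")])` places `PBI-B` first, because high impact with high urgency yields priority 1.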
Problem Management Workflow – Review / Investigation and Diagnosis
Problem Review
Reviewing the problem, both in a peer review and with the service area problem
manager, helps verify the validity of the information and initiate root cause analysis.
Investigation and Diagnosis
• Problem investigation identifies the root cause of the problem and initiates
actions to resume the failed service. Analyze the impact, root cause
and symptoms of the problem to provide a problem resolution.
• When the cause is understood to the point where solution development can
begin, the problem ticket needs to be updated, the known error or solution
record initiated, and the ticket moved to “Completed” to reflect that the problem
investigation portion is done.
When is Root Cause Analysis Required?
• The “why” is established first, in order to then determine the “how”.
• Who does RCA: the analyst who resolved the incident initiates the problem
investigation, including initial root cause analysis, and performs a warm
handoff (with supporting data) to the next area analyst when required.
Problem Management Workflow - Resolution and Recovery
Workaround and Solution
The successful diagnosis of a root cause results in changing the problem to a
Known Error and suggesting a workaround.
ITSM categorizes solutions into three types: Known-Error Record,
Workaround and Resolution.
The problem is changed into a Known-Error record automatically when you add a
workaround. Browsing these known-error records helps users resolve
incidents by themselves, which reduces the inflow of incidents.
The records also help incident technicians lower the incident resolution time.
Workarounds and problem resolutions are automatically added
to the solution list.
Closure
Problem closure is critical because closure confirms that all aspects of the user's
problem are addressed to the user's satisfaction.
Process – ITSM 5 Stages of an Investigation
1 Identification and Classification
This stage initiates the problem management process. The purpose
of this stage is to accurately identify and classify the problem.
• Next stage
2 Review
In this stage, the problem manager validates the impact of the
problem and provides guidance to the analyst for the investigation of
the problem while it continues, or assesses re-assignment.
• Next stage
• Cost analysis
• Relate incidents
• Assign investigation
• Generate tasks
• Relate to CI
• Enter pending (or resume)
3 Investigation and Diagnosis
In this stage, the analyst attempts to identify the root cause of the
problem and to find either a permanent solution or a
temporary workaround.
• Next stage
• Generate known error
• Generate workaround / root cause
• Generate tasks
• Enter pending (or resume)
4 Resolution and Recovery
In this stage, the analyst resolves the problem. The end result of the
investigation might be a Known Error record or a Solution Database
entry.
• Next stage
• Generate known error
• Generate workaround
• Generate solution
• Close
5 Closed
In this stage, the investigation is closed. No more activities are
performed on the problem investigation.
• None
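The five stages and their permitted actions behave like a small state machine. The sketch below is one hypothetical way to model that; the `Investigation` class and the lower-cased action strings are illustrative names, not the ITSM tool's API. Only the stage names and the actions listed per stage come from the list above.

```python
STAGES = [
    "Identification and Classification",
    "Review",
    "Investigation and Diagnosis",
    "Resolution and Recovery",
    "Closed",
]

# Actions available in each stage, as listed for the five stages.
ACTIONS = {
    "Identification and Classification": {"next stage"},
    "Review": {"next stage", "cost analysis", "relate incidents",
               "assign investigation", "generate tasks", "relate to ci",
               "enter pending"},
    "Investigation and Diagnosis": {"next stage", "generate known error",
                                    "generate workaround", "generate tasks",
                                    "enter pending"},
    "Resolution and Recovery": {"next stage", "generate known error",
                                "generate workaround", "generate solution",
                                "close"},
    "Closed": set(),  # no further activities
}


class Investigation:
    """Tracks the current stage and rejects actions the stage does not allow."""

    def __init__(self):
        self.stage = STAGES[0]
        self.log = []  # (stage, action) pairs, in the order performed

    def perform(self, action: str) -> None:
        if action not in ACTIONS[self.stage]:
            raise ValueError(f"'{action}' is not allowed in stage '{self.stage}'")
        self.log.append((self.stage, action))
        if action == "next stage":
            self.stage = STAGES[STAGES.index(self.stage) + 1]
        elif action == "close":
            self.stage = "Closed"
```

Because "Closed" has an empty action set, any attempt to act on a closed investigation raises `ValueError`, which mirrors the rule that no more activities are performed once the investigation is closed.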
Problem Reports
• For Critical/High (P1/P2) incidents
• Requested by ministry (P1/P2/P3/P4)
• Distributed internally through the Problem Management Process Manager,
and through SDMs to affected ministries
Known Errors
Incidents, problems and known errors
Incidents may match existing 'Known Problems' (without a known root
cause) or 'Known Errors' (with a known root cause) that are under the control of
Problem Management and registered in the Known Error Database (KEDB) or
knowledge base.
Where workarounds have already been developed, accessing them
allows the Service Desk to provide a quick first-line fix.
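A first-line lookup against the KEDB could be sketched as follows. Everything here is an assumption made for illustration: the `KnownRecord` fields, the keyword-overlap matching, and the `min_overlap` threshold. A real knowledge base would normally provide its own search.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KnownRecord:
    """Hypothetical KEDB entry."""
    record_id: str
    symptoms: frozenset          # lower-case keywords describing the symptoms
    root_cause: Optional[str]    # None while the root cause is still unknown
    workaround: Optional[str]    # None if no workaround has been developed

    @property
    def kind(self) -> str:
        # A record with a known root cause is a Known Error;
        # without one it is a Known Problem.
        return "Known Error" if self.root_cause else "Known Problem"


def match_incident(description: str, records, min_overlap: int = 2):
    """Return the record sharing the most symptom keywords, or None."""
    words = set(description.lower().split())
    best, best_score = None, 0
    for record in records:
        score = len(words & record.symptoms)
        if score >= min_overlap and score > best_score:
            best, best_score = record, score
    return best
```

When a match has a workaround, the Service Desk can apply it immediately. When nothing matches, the incident follows the normal incident management path.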
Problem Management Structure
[Organization chart. Labels shown:]
• Service Alberta
• Service Provider Process Management: Problem Management
• Bundle 1: Service Desk
• Bundle 2: Mainframe
• Bundle 3: Desktop Management and Worksite Support
• Service Provisioning
• Non Bundled Operations
• CISO
• ITSM
• ????
• Incident & Problem Management: Compugen, Acrodex, IBM
Engagement Points
• Weekly Problem Management Review
• Daily Incident Management touch point
• Weekly Incident Management Review
• Critical Incident investigation
• Critical Incident post mortems
• Incident and Problem Reporting (WOR, MOR)
• Problem Assignment Review
Contact
• Ashish Naphray, Ashish.Naphray@gov.ab.ca
• Krishna Kaipenchery, krishna@ca.ibm.com
• Paula Pacitto, ppacitto@compugen.com
• Vincent Y Lim, Vincent.Y.Lim@gov.ab.ca
• Tuan Nguyen
• Sandy Stout
• Christine Gunarson
• Mary Boyle
• Alan Wokoeck
• Raymond Viens
Some background on the Problem Management Process
A problem is the cause of one or more incidents. The cause is not always known at the time a problem
record is created, and the problem management process is responsible for further investigation.
The key objectives of Problem Management are to prevent problems and resulting incidents
from happening, to eliminate recurring incidents, and to minimize the impact of incidents that cannot
be prevented.
Problem Management includes diagnosing the causes of incidents, determining the resolution, and ensuring
that the resolution is implemented. Problem Management also maintains information about problems and the
appropriate workarounds and resolutions.
Problems differ from incidents: for an incident, the goal is to resolve the specific issue as quickly
as possible for the customer. Incident management is engaged for escalations and for the overall
effectiveness of incident resolution. When multiple incidents seem to stem from a common cause, incident
management may engage problem management to help document the issue and to solicit appropriate owners and
resources to investigate and resolve it.
Potential problems are identified and tracked within the ITSM system problem module and reviewed. The
problem investigation can be initiated by or assigned to the service providers (for critical and high
priority issues they are aware of and engaged in cause investigation and resolution plans). The problem
ticket is documented throughout the investigation.
Problems are categorized in a similar way to incidents, but the goal is to understand causes, document
workarounds and request changes to permanently resolve the problems. Workarounds are documented in a
Known Error Database, which improves the efficiency and effectiveness of Incident Management.
In summary, when significant or repeated incidents occur, the function is to work with the service
providers' incident management and problem management. Together they perform reviews and root cause
investigations, and track the solutions and the control (does the problem stay resolved?). Information
on known errors is fed back into the system for inclusion in knowledge base articles.