Problem Management
Version 3.2
Service Desk, Problem Management, Incident Management

Problems Versus Incidents
• Incident: Any event that is not part of the standard operation of a service and that causes an interruption to, or a reduction in, the quality of that service.
• Problem: The unknown underlying cause of one or more incidents. A condition identified from multiple incidents exhibiting common symptoms, or from a single major incident, indicative of a single error, for which the cause is unknown.
• Problem Management: The process that minimizes the adverse impact of incidents and proactively prevents the recurrence of incidents, problems, and errors.

Problem
• Root/underlying cause(s) of an incident or incidents
• “Complete” when the underlying cause(s) are permanently removed
• Long-term resolution
• “How did this happen?…” (Arson Investigation)

Incident
• Symptom [of a problem]
• Each occurrence of the embodiment of a problem
• “Complete” when normal operation (from a customer perspective) is restored
• “Put the fire out now…” (Firefighting)

Incident Analysis Approach
While the incident record is open, or after it has been resolved, it is reviewed for inclusion in problem management. If the incident (or group of related incidents) meets the criteria, a problem record is created to identify the cause, the solution, and the control that prevents recurrence.

Incident criteria and resulting problems
• Critical / high priority incidents with extensive impact and potential to recur: Critical Problem
• Independent incidents with significant impact and a possible common root cause: High Priority Problem
• Multiple moderate incidents with possibly the same cause or multiple causes: Medium Priority Separated Problems
• Data analysis of incident records to determine process, system, or network issues: Determine Problem Management Inclusion

Incident, Problem and Known Error definitions in ITIL
Common definitions and language are the key factors in communication.
That is why ITIL has its own vocabulary, a set of specific terms used by everyone working in an ITIL environment. Below are the definitions of incident, problem, and known error. These three terms are very common in the Incident Management process, and it is crucial to understand the differences between them.
• Incident: any event that is not part of the standard operation of a service and that causes, or may cause, an interruption to, or a reduction in, the quality of that service. A good example of an incident is a lack of free space on somebody's hard disk.
• Problem: the unknown underlying cause of one or more incidents. One might ask when an incident becomes a problem. The answer is never: the problem ticket is a separate record, created after the incident has been resolved and while the root cause is still not understood.
• Known Error: created from an incident or problem for which the root cause is known and for which a temporary workaround or permanent alternative has been identified. A known error remains in place until it is permanently fixed by a change. An appropriate Request For Change (RFC) is raised in order to fix the known error. Known Errors are within the scope of responsibility of Problem Management.

[Figure: Problem Management Process (Future State). Parties shown: Service Desk, Incident Management, Supplier or Contractor, Event Management, Proactive Problem Management, Operations Team or Service Provider, Change Management. Flow in the ITSM Problem Module: Problem Detection, Problem Logging, Categorization, Prioritization, Investigation and Diagnosis, Workaround? (Create Known Error Record), Change Needed? (Yes / No), Resolution, Closure. Communications: Outage Notifications, Incident Reports, Problem Reports, Change Notifications. Also shown: Known Error Database, Solution Database.]
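The relationship between an incident, a problem, and a known error can be sketched as a small data model. This is only an illustration under stated assumptions: the class names, field names, record IDs, and the example root cause are hypothetical and do not come from any ITSM tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Incident:
    incident_id: str
    summary: str
    resolved: bool = False  # True once normal service is restored


@dataclass
class Problem:
    problem_id: str
    incidents: List[Incident] = field(default_factory=list)
    root_cause: Optional[str] = None  # unknown while the record is a problem


@dataclass
class KnownError:
    problem_id: str
    root_cause: str
    workaround: Optional[str] = None
    rfc_id: Optional[str] = None  # Request For Change raised to fix the error
    fixed_by_change: bool = False

    @property
    def active(self) -> bool:
        # A known error stays in place until a change permanently fixes it.
        return not self.fixed_by_change


def to_known_error(problem: Problem, root_cause: str,
                   workaround: Optional[str] = None) -> KnownError:
    """Record the root cause on the problem and return a known error for it."""
    problem.root_cause = root_cause
    return KnownError(problem.problem_id, root_cause, workaround)


# Hypothetical example data, based on the "disk full" incident example.
disk_full = Incident("INC-1", "No free space on user's hard disk", resolved=True)
problem = Problem("PRB-1", incidents=[disk_full])
known_error = to_known_error(problem, "Log files are never rotated",
                             workaround="Delete old log files manually")
known_error.rfc_id = "RFC-1"
```

The incident is never converted into a problem; the problem is a separate record that references it, and the known error remains active until the change raised through the RFC is implemented.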
[Figure: ITSM Problem Management Process Overview. Roles: Requester, Problem Analyst, Problem Manager. Numbered steps: 1 Problem Identification, Recording, and Classification; 2 Review; 3 Problem Investigation and Diagnosis (Investigation Initiated); 4 Problem Resolution and Closure; 5 Known Error Identification and Recording; 6 Known Error Classification and Assessment; 7 Known Error Resolution and Closure; 8 Knowledge Identification and Recording; 9 Knowledge Validation and Publication. Other Service Support / Delivery Processes: Incident Management, Change Management (Change Implemented), Configuration Management (Mgmt Reports).]

[Figure: timeline from START (Incident Management). Reviewed within 1 business day; root cause analysis within 3 business days (85%); problem report within 5 business days; permanent solution within 30 days.]

Problem Management Workflow
Problem Management minimizes the adverse impact of incidents on the business and enables root cause analysis to identify a permanent solution.

Problem Detection
Trend analysis is the key to spotting problems. It gives Problem Management a proactive approach: a problem can be prevented from occurring rather than resolved at a later stage. Real-time trending as incidents come in can detect a problem early and initiate a resolution before additional customers are impacted. Problem investigations should be initiated for all Critical (priority 1) and High (priority 2) impact incidents where the cause is unknown.

Problem Management Workflow - Identification and Classification

Problem Logging
Problem logging is critical because all the necessary information and conclusions from the incident have to be captured when the problem is created. Avoid duplicates by searching for similar existing problems before creating a new one.

Categorization
Problem categorization is essential to avoid ambiguity.
ITSM applies the incident categorization automatically to a problem when it is created, which keeps the problem information at the same level of understanding for the analyst.

Prioritization
Focus on the business-critical problems based on the problem prioritization. Prioritization helps technicians identify the critical problems that need to be addressed. The impact and urgency associated with a problem decide which problems need to be addressed first.

Problem Management Workflow - Review / Investigation and Diagnosis

Problem Review
Reviewing the problem, both in a peer review and with the service area problem manager, helps verify the validity of the information and initiate root cause analysis.

Investigation and Diagnosis
• Problem investigation gets at the root cause of the problem and initiates actions to resume the failed service. Analyze the impact, root cause, and symptoms of the problem to provide a problem resolution.
• When the cause is understood to the point where solution development can begin, the problem ticket needs to be updated, the known error or solution record initiated, and the ticket moved to “Completed” to reflect that the investigation portion of the problem is done.

When is Root Cause Analysis Required?
• The “why” is established first in order to then determine the “how”.
• Who performs RCA: the analyst who resolved the incident initiates the problem investigation, including the initial root cause analysis, and performs a warm handoff (with supporting data) to the next area analyst when required.

Problem Management Workflow - Resolution and Recovery

Workaround and Solution
The successful diagnosis of a root cause results in changing the problem to a Known Error and suggesting a workaround. ITSM categorizes solutions into three types: Known Error Record, Workaround, and Resolution. ITSM changes the problem into a Known Error record automatically when you add a workaround.
Browsing these Known Error records lets users resolve incidents by themselves, which reduces the inflow of incidents. The records also help incident technicians lower the incident resolution time. Workarounds and problem resolutions are added to the solution list automatically.

Closure
Problem closure is critical because it confirms that all aspects of the user's problem have been addressed to the user's satisfaction.

Process - ITSM: 5 Stages of an Investigation

1 Identification and Classification
This stage initiates the problem management process. Its purpose is to accurately identify and classify the problem.
• Next stage

2 Review
In this stage, the problem manager validates the impact of the problem and either provides guidance to the analyst so that the investigation can continue or assesses re-assignment.
• Next stage
• Cost analysis
• Relate incidents
• Assign investigation
• Generate tasks
• Relate to CI
• Enter pending (or resume)

3 Investigation and Diagnosis
In this stage, the analyst attempts to identify the root cause of the problem and to find either a permanent solution or a temporary workaround.
• Next stage
• Generate known error
• Generate workaround / root cause
• Generate tasks
• Enter pending (or resume)

4 Resolution and Recovery
In this stage, the analyst resolves the problem. The end result of the investigation might be a Known Error record or a Solution Database entry.
• Next stage
• Generate known error
• Generate workaround
• Generate solution
• Close

5 Closed
In this stage, the investigation is closed. No more activities are performed on the problem investigation.
• None

Problem Reports
• For Critical/High (P1/P2) incidents
• Requested by ministry (P1/P2/P3/P4)
• Distribution through the Problem Management Process Manager internally and through SDMs to affected ministries

Known Errors
Incidents, problems and known errors
Incidents may match existing 'Known Problems' (without a known root cause) or 'Known Errors' (with a root cause) that are under the control of Problem Management and registered in the Known Error Database (KEDB) or knowledge base. Where workarounds have already been developed, accessing them allows the Service Desk to provide a quick first-line fix.

Problem Management Structure
[Figure: Problem Management structure. Labels: Service Provider; Process Management; Problem Management; Service Provisioning; Service Alberta; Bundle 1 Service Desk; Bundle 2 Mainframe; Bundle 3 Desktop Management and Worksite Support; Non Bundled Operations; CISO; ITSM ????; Compugen Incident & Problem Management; Acrodex Incident & Problem Management; IBM Incident & Problem Management.]

Contact
• Ashish Naphray, Ashish.Naphray@gov.ab.ca
• Krishna Kaipenchery, krishna@ca.ibm.com
• Paula Pacitto, ppacitto@compugen.com
• Vincent Y Lim, Vincent.Y.Lim@gov.ab.ca
• Tuan Nguyen, Sandy Stout, Christine Gunarson, Mary Boyle, Alan Wokoeck, Raymond Viens

Engagement Points
• Weekly Problem Management Review
• Daily Incident Management Touch Point
• Weekly Incident Management Review
• Critical Incident investigation
• Critical Incident post mortems
• Incident and Problem Reporting (WOR, MOR)
• Problem Assignment Review

Some Background on the Problem Management Process
A problem is the cause of one or more incidents. The cause is not always known at the time a problem record is created, and the problem management process is responsible for further investigation. The key objectives of Problem Management are to prevent problems and resulting incidents from happening, to eliminate recurring incidents, and to minimize the impact of incidents that cannot be prevented.
Problem Management includes diagnosing the causes of incidents, determining the resolution, and ensuring that the resolution is implemented. Problem Management also maintains information about problems and the appropriate workarounds and resolutions.

Problems differ from incidents: for an incident, the goal is to resolve the specific issue as quickly as possible for the customer. Incident management is engaged for escalations and for the overall effectiveness of incident resolution. When multiple incidents seem to stem from a common cause, incident management may engage problem management to help document the issue and to solicit appropriate owners and resources to investigate and resolve it.

Potential problems are identified, tracked, and reviewed within the problem module of the ITSM system. The problem investigation can be initiated by or assigned to the service providers (for critical and high priority issues, they are already aware of the issue and engaged in the cause investigation and resolution plans). The problem ticket is documented throughout the investigation.

Problems are categorized in a similar way to incidents, but the goal is to understand causes, document workarounds, and request changes that permanently resolve the problems. Workarounds are documented in a Known Error Database, which improves the efficiency and effectiveness of Incident Management.

In summary, when significant or repeated incidents occur, the Problem Management function is to work with the service providers' incident management and problem management teams. Together they perform reviews and root cause investigations, and they track the solutions and the control (does the problem stay resolved?). Information on known errors is fed back into the system for inclusion in knowledge base articles.
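As a rough illustration of how a Known Error Database can support a quick first-line fix, the sketch below matches an incoming incident summary against known error records by shared keywords. The record layout, the `find_workaround` function, the matching threshold, and the sample entries are all assumptions made for this example; a real ITSM tool would provide its own search.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class KnownErrorRecord:
    error_id: str
    symptoms: Set[str]  # lowercase keywords describing the symptoms
    workaround: str


def find_workaround(incident_summary: str,
                    kedb: List[KnownErrorRecord],
                    min_matches: int = 2) -> Optional[KnownErrorRecord]:
    """Return the record sharing the most keywords with the summary, if any."""
    words = set(incident_summary.lower().split())
    best: Optional[KnownErrorRecord] = None
    best_score = 0
    for record in kedb:
        score = len(words & record.symptoms)
        # Require a minimum overlap so unrelated incidents do not match.
        if score >= min_matches and score > best_score:
            best, best_score = record, score
    return best


# Hypothetical sample database.
kedb = [
    KnownErrorRecord("KE-1", {"disk", "full", "space"}, "Clear the temp folder"),
    KnownErrorRecord("KE-2", {"vpn", "timeout", "login"}, "Restart the VPN client"),
]

match = find_workaround("User reports disk full on laptop", kedb)
no_match = find_workaround("Printer is out of toner", kedb)
```

When a match is found, the Service Desk can apply the documented workaround to the incident; when nothing matches, the incident is a candidate for problem review.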