Severity 3 - University of Rochester

Escalation Procedure (rev October 27, 2003)
Jump to Severity Three crisis management
Definitions
Assignment and Aging
Sample problems and their severity levels
Back to Main On Call Page
Escalation Definitions
Following are the definitions for the four severity levels referenced in this escalation procedure.
There is also a flow picture of how severity levels might change for a given problem.
Severity 0: Lowest impact. Default for unassigned problems. Requires no active
assignment in a database. Definition: "Minor Failure [Single user, non-critical facility,
not clustered patient care]".
Severity 1: May be assigned by agent upon report. Definition: "Multiple Users, or
Single in Critical Area".
Severity 2: Equivalent of severity 1, with limited direct assignment by agent. This
level is typically a severity 1 problem that has been escalated because its expected
repair time exceeds 12 business hours. Definition: Severity 1 definition with repair
expected to take more than 12 business hours from the time of the report.
Severity 3: Major failure; CBX or Voicemail node down. Strategic business unit
network down. Backbone down. Internet connection down.
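The four definitions above can be read as a small decision rule. The following Python sketch is one illustrative way to encode them, not part of the procedure itself. The class and field names (`Report`, `users_affected`, `critical_area`, `major_failure`, `expected_repair_business_hours`) are assumptions made for this example.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    SEV0 = 0  # minor failure: single user, non-critical facility
    SEV1 = 1  # multiple users, or single user in a critical area
    SEV2 = 2  # severity 1 with repair expected to exceed 12 business hours
    SEV3 = 3  # major failure: CBX/voicemail node, backbone, or Internet down


@dataclass
class Report:
    # All field names are hypothetical, chosen only for this sketch.
    users_affected: int = 1
    critical_area: bool = False
    major_failure: bool = False
    expected_repair_business_hours: float = 0.0


def classify(report: Report) -> Severity:
    """Map a problem report to a severity level using the definitions above."""
    if report.major_failure:
        return Severity.SEV3
    if report.users_affected > 1 or report.critical_area:
        # Severity 2 is severity 1 with an expected repair time over 12 business hours.
        if report.expected_repair_business_hours > 12:
            return Severity.SEV2
        return Severity.SEV1
    return Severity.SEV0
```

In practice the agent, technician, or Manager On Call makes this call; the sketch only mirrors the written definitions and ignores judgement-based escalation.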
There are two e-mail lists used to communicate and escalate problems based on their severity. These
messages are sent from the NTOC. One of the email lists sends the 33999 page and the other list is used to
send a text message to a VIP list.
Escalation Flows
[Figure: External Report’s Initial Assignment and Ageing. Flow diagram showing how a problem moves among severity levels 0, 1, 2, and 3.]
Severity 0 : Minor Failure [Single user, non-critical facility, not
clustered patient care]
Voice Examples:
NEC phone, Analog line, Metered Business Line (no dial tone, features not working, etc.);
VoiceMail box (can't access, lost messages);
A single individual's pager is not working.
A single student authorization code is not functioning.
Network Examples:
Soft failure degradation of departmental network performance (network functioning but
not efficiently), or intermittent component failure. Network problem clearly identified to be
within the department's internal network.
Patient Care Examples:
Single patient unit phone, or multiple patient unit phones where there is remaining service,
such that patient care has minimal impact.
Response time:
Average 4 elapsed hours, up to 12 business hours. Discretionary negotiation between
Manager On Call and patient unit leader. Staff phones serviced same evening. Patient
phones escalated through the patient phones telephone # x50143.
Notification / Business Hours:
Network and Telecommunications Operations Center (NTOC) > technician.
Problem owner is responsible for communication with customer and Network and Telecommunications
Operations Center (NTOC).
Notification / After hours:
Network and Telecommunications Operations Center (NTOC) via Answering Service >
Manager On Call >
customer >
Network and Telecommunications Operations Center (NTOC) evening reporting number @ 3-1159 >
tech >
customer >
Manager On Call >
Closure report to Network and Telecommunications Operations Center (NTOC) @ 3-1159.
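The after-hours chain above is an ordered sequence of hand-offs in which some parties appear more than once. As a rough illustration only, it can be written down as a list and expanded into (from, to) pairs. The names `AFTER_HOURS_CHAIN` and `handoffs` are hypothetical and exist only in this sketch.

```python
# Ordered after-hours notification chain for a severity 0 problem,
# transcribed from the procedure text above.
AFTER_HOURS_CHAIN = [
    "NTOC via Answering Service",
    "Manager On Call",
    "customer",
    "NTOC evening reporting number (3-1159)",
    "tech",
    "customer",
    "Manager On Call",
    "Closure report to NTOC (3-1159)",
]


def handoffs(chain):
    """Return each consecutive (from, to) hand-off in the chain, in order."""
    return list(zip(chain, chain[1:]))


for sender, receiver in handoffs(AFTER_HOURS_CHAIN):
    print(f"{sender} > {receiver}")
```

Because the customer and the Manager On Call each appear twice, the chain is walked by position rather than looked up by name.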
Escalation:
At any point, if trouble is determined to have propagated to more than one user, trouble
will be immediately escalated to a "severity 2 or 1" as appropriate by the technician
dispatched. Technician and Manager On Call discuss escalation to telecommunications
engineer on call.
Response > 12 business hours? = escalation to severity 1
Comments:
Customer and Network and Telecommunications Operations Center (NTOC) will be advised of status if
problem is not projected to be resolved within 12 business hours. Long standing problem status will be
provided to Network and Telecommunications Operations Center (NTOC) Triage position and customer
by EOB day by the problem owner *1.
Severity 1: Multiple Users, or Single in Critical Area
Voice Examples:
Can't access area code (software config); several phones down in an area (CBX group);
T line down to medical off site location; any problem, single critical user or area. Multiple
critical phones out of service in given area;
A single pager is not operational for a person who is on-call.
Network Examples:
Hard failure of important network components (department LAN, network router interfaces,
SLA devices, single switch, internet connectivity or single modem pool server).
Security Examples:
Limited display of individualized security concerns on individual, isolated machines. See the Security
Incident Response Team actions that follow the general response time section below.
Patient Care Examples:
Three or more patient unit phones, or multiple staff phones in patient unit where business
is impacted and patient care may degrade as a result.
Response time:
Immediate dispatch for escalated trouble ticket (>12 business hours). On site within 1.5 to 2 hours (for non-escalated trouble tickets). Average response in 4 elapsed hours, up to 6 hours before escalation to severity
two or three, as appropriate. Discretionary negotiation between Manager On Call and patient unit leader.
Staff phones serviced same evening. Patient phones escalated through the patient phones telephone #.
Critical phones on the Emergency Preparedness list require response with two-way
radios (or cellular, if the customer prefers). The technician responds immediately and troubleshoots the
suspected system, contacting the customer if further information is required to assess the problem.
The technician determines the problem, advises the Network and Telecommunications Operations Center (NTOC), and resolves it.
The technician/engineer assigned will be dedicated to the problem until it is resolved.
The Network and Telecommunications Operations Center (NTOC) immediately advises customers. If the
problem cause has not been identified one hour after the start of problem determination, consult with the
switch tech and/or engineering. The switch tech assumes ownership of the problem (even if engineering is
consulted). The switch tech and engineer assess the problem, advise the Network and Telecommunications
Operations Center (NTOC), and fix the problem. If appropriate, the trouble is escalated to the vendor.
Security Incident Response Team Actions:
Upon receiving notification of a serious security vulnerability, the Engineer On Call e-mails notification to
its.telecomhelp@rochester.edu and works with the Incident Response Team as well as any desktop
support staff to address the localized problem and keep it localized.
Notification / Business Hours:
Network and Telecommunications Operations Center (NTOC) > technician
Problem owner is responsible for communication with customer, Manager On Call,
and Triage at Network and Telecommunications Operations Center (NTOC).
Notification / After hours:
Network and Telecommunications Operations Center (NTOC) via Answering Service >
Manager On Call >
customer >
------ assessment of severity two, potential escalation
------ and notification per severity one guidelines
----- if affecting patient care, potential consult with
----- AOC to validate impact and whether a need to
----- escalate
Network and Telecommunications Operations Center (NTOC) evening reporting number @ 3-1159 >
tech and engineer >
customer >
Manager On Call >
Closure report to Network and Telecommunications Operations Center (NTOC) @ 3-1159.
If severity two, the Network and Telecommunications Operations Center (NTOC) or Manager On Call
may opt to initiate a severity page following the NTOC response process (at the end of this section). After hours, the Manager On Call may choose to initiate a call-tree.
Escalation:
At any point, if trouble is determined to have propagated to strategic business
units (e.g. Emergency or event locations) or to multiple users, trouble will be immediately
escalated to severity two or three, as assessed by the technician and/or
engineer on site. Technician and Manager On Call discuss escalation to
telecommunications engineer on call. Consultation with AOC may occur to assess.
Comments:
Customer and Network and Telecommunications Operations Center (NTOC) will be advised of status if
problem is not projected to be resolved within 12 business hours. Long standing problem status will be
provided to Network and Telecommunications Operations Center (NTOC) Triage position and customer
by EOB day by the problem owner *1.
Severity 2: Multiple Users, or Single in Critical Area with response time
likely to exceed the initial 12 hours from time of report.
Voice Examples:
Can't access area code (software config); several phones down in an area (CBX
group); T line down to medical off site location; any problem, single critical user
or area. Multiple critical phones out of service in given area.
Network Examples:
Hard failure of important network components (department LAN, network router
interfaces, SLA devices, single switch, internet connectivity or single modem
pool server).
Security Examples:
Display of security concerns on machines with somewhat limited impact to the business environment.
Limited danger to University assets. Presents a level of inconvenience. See Incident Response Team or
other actions following the generalized response time section. Specific examples include port scans, large
numbers of virus infected e-mail messages, reports of attempted exploit originating from the UofR. See
security response section below.
Patient Care Examples:
Three or more patient unit phones, or multiple staff phones in patient unit where
business is impacted and patient care may degrade as a result.
Response time:
Immediate dispatch. On site within 1.5 to 2 hours with sniffers or other complex diagnostic devices.
Average response in 4 elapsed hours, up to 6 hours before escalation to severity three. Discretionary
negotiation between Manager On Call and patient unit leader. Staff phones serviced same evening. Patient
phones escalated through the patient phones telephone #. Critical phones on the Emergency Preparedness
list require response with two-way radios (or cellular, if the customer prefers). The technician responds
immediately and troubleshoots the suspected system, contacting the customer if further information is
required to assess the problem.
The technician determines the problem, advises the Network and Telecommunications Operations Center (NTOC), and resolves it.
The technician/engineer assigned will be dedicated to the problem until it is resolved.
The Network and Telecommunications Operations Center (NTOC) immediately advises customers. The ITS web
site is updated. If the problem cause has not been identified one hour after the start of problem
determination, consult with the switch tech and/or engineering. The switch tech assumes ownership of the
problem (even if engineering is consulted). The switch tech and engineer assess the problem, advise the
Network and Telecommunications Operations Center (NTOC), and fix the problem. If appropriate, the
trouble is escalated to the vendor.
Security Incident Response Team Actions:
Upon receiving notification of a serious security vulnerability as previously described, the Engineer On
Call e-mails notification to its.telecomhelp@rochester.edu and works with the Incident Response
Team as well as any desktop support staff to address the problem and reduce further spread. Consider whether
escalation to severity three is called for. If so, determine whether we communicate via the
abuse@rochester.edu and urcert@utd.rochester.edu mail lists.
Notification for other than a Security Issue / Business Hours:
Network and Telecommunications Operations Center (NTOC) > technician
Problem owner is responsible for communication with customer, Manager On
Call, and Triage at Network and Telecommunications Operations Center (NTOC).
Notification for other than a Security Issue / After hours:
Network and Telecommunications Operations Center (NTOC) via Answering Service >
Manager On Call >
customer >
------ assessment of severity two, potential escalation
------ and notification per severity one guidelines
----- if affecting patient care, potential consult with
----- AOC to validate impact and whether a need to
----- escalate
Network and Telecommunications Operations Center (NTOC) evening reporting number @ 3-1159 >
tech and engineer >
customer >
Manager On Call >
Closure report to Network and Telecommunications Operations Center (NTOC) @ 3-1159.
If severity two, the Network and Telecommunications Operations Center (NTOC) or Manager On Call
may opt to initiate a severity page following the NTOC response process (at the end of this section). After hours, the Manager On Call may choose to initiate a call-tree.
Escalation:
At any point, if trouble is determined to have propagated to strategic business units (e.g. Emergency or
event locations) or to multiple users, trouble will be immediately escalated to "severity 3", as appropriate, by
the technician and/or engineer on site. Technician and Manager On Call discuss escalation to
telecommunications engineer on call. Consultation with AOC may occur to assess.
Comments:
Customer and Network and Telecommunications Operations Center (NTOC) will be advised of status if
problem is not projected to be resolved within 12 business hours. Long standing problem status will be
provided to Network and Telecommunications Operations Center (NTOC) Triage position and customer
by EOB day by the problem owner *1.
Severity 3 : Major Failure; CBX or VoiceMail Node down. Strategic
(critical) business unit network down. Backbone down. Internet
connection down. Serious information security problem affects multiple
clients and has non-trivial impact creating outages versus
inconvenience. Information security problems that include serious
disruption to business activities, exemplified by problems such as worms
and security vulnerability exploits, especially those launched from the
UofR community.
IMMEDIATE Organizational Actions, if voice services impacted.
1. Upon receipt of a severity three condition, during daytime business hours, the Network
and Telecommunications Operations Center (NTOC) notifies a Senior Manager of the
severity three and assigns a scribe to record events in a chronology. During non-business
hours, the Manager On Call notifies a Senior Manager. Notifier clearly states "severity
three emergency condition".
2. Chain of command defined.
3. The Senior Manager determines a designated location for the response team.
4. Manager presence in the affected area, as appropriate.
5. Lead Technical Engineer assigned.
6. Two way radios are distributed as follows:
a) Network and Telecommunications Operations Center (NTOC);
b) Manager;
c) Emergency Operations Center (EOC) rep;
d) Lead technical engineer;
e) Comm Center (as appropriate);
and / or
f) Area runner and walkabout. Two-way radios will be available locked in the Network and
Telecommunications Operations Center (NTOC) for this purpose.
g) Each holder of a radio must be familiar with two-way radio guidelines. Note that the
key to the cabinet holding the radios is in the Network and Telecommunications
Operations Center (NTOC) Team Leader's desk drawer. It is labeled radios.
7. Emergency Operations Center phone number list distributed to individuals (either
x50500, or two-way radio). Non-University phone service at this location.
8. 15 minute updates initiated.
9. Debrief document available within 72 hours of re-institution of service.
Response Time:
Immediate and continuous effort. Immediate technician dispatch and engineering involvement.
Immediate response by technician to Network and Telecommunications Operations Center (NTOC) who
advises status every 15 minutes. Work begins immediately. Identify, report and resolve problem.
Technician/engineer assigned to this problem will be dedicated to its resolution until fixed. Complex
diagnostic gear is immediately brought to both ends of communications points.
Security Incident Response Team Actions:
Upon receiving notification of a serious security vulnerability as previously described, the Engineer On
Call e-mails notification to its.telecomhelp@rochester.edu and works with the Incident Response
Team as well as any desktop support staff to address the problem and reduce further impact. Communicate
via abuse@rochester.edu and urcert@utd.rochester.edu mail lists. Engineer On Call notifies the Manager
On Call. Manager On Call notifies Senior Managers.
Notification for other than a Security Issue:
Network and Telecommunications Operations Center (NTOC) > engineer > Senior Manager,
Director, and University Administration, as appropriate.
Immediate notification of critical areas affected by Senior Telecommunications Engineer. Engineer
responsible for 15 minute updates* to Network and Telecommunications Operations Center (NTOC).
Network and Telecommunications Operations Center (NTOC) or engineer communicates with critical
customers every 15 minutes. Immediate notification to Security Dispatch as a "condition utility /
telecommunication".
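The 15-minute update cadence described above can be laid out as a simple schedule. The sketch below is illustrative only: the function name `update_times`, its parameters, and the example start time are assumptions, not part of the procedure.

```python
from datetime import datetime, timedelta
from typing import List


def update_times(start: datetime, count: int, interval_minutes: int = 15) -> List[datetime]:
    """Return the next `count` status-update times after `start`."""
    step = timedelta(minutes=interval_minutes)
    return [start + step * i for i in range(1, count + 1)]


# Hypothetical outage reported at 09:00; list the first three update deadlines.
for due in update_times(datetime(2003, 10, 27, 9, 0), 3):
    print(due.strftime("%H:%M"))
```

The interval is a parameter because other parts of the procedure allow a different cadence when one is agreed.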
Comments:
*If delay is caused by equipment availability, tech will track shipment and notify Network and
Telecommunications Operations Center (NTOC) as soon as repair can be scheduled
Expectations:
Whenever a "hand-off" occurs, tech rep and engineer will ensure that problem ownership is clear and
communicated to the Network and Telecommunications Operations Center (NTOC).
Problem Owner* Duties:
*1 (technician or engineer until the trouble ticket (TT) is returned to the Network and Telecommunications
Operations Center (NTOC))
1 - Informing the Network and Telecommunications Operations Center (NTOC) at regular intervals:
every 15 minutes to one hour for Severity 3 status; at end of day, or as possible, for Severity 1
or 2; for an ongoing problem, by end of business day to Triage.
2 - Network and Telecommunications Operations Center (NTOC) sends a severity page following the
NTOC response process (at the end of this section).
After hours communications protocol:
Network and Telecommunications Operations Center (NTOC) >
On Call Rep >
customer >
----- Notification to Security Dispatch for condition utility occurs here -----
Network and Telecommunications Operations Center (NTOC) after hours @ 3-1159 >
tech/engineer >
customer >
On Call Rep.
1 - Updating "Notification" parties identified above. Tracking accumulated tech and engineer labor and
materials for repair.
2 - Verifying billing with appropriate tech supervisor or backup.
3 - Providing all data to the Network and Telecommunications Operations Center (NTOC) at close out.
4 - Filing, or categorizing in AimWorX, any trouble tickets for SLA customers.
5 - Entering information into tracking dbase (as defined above).
Sample Problems and Their Severity level (under construction)
Severity Level | Sample Problem Description
0 | dead phone, not on critical list
0 | single report of inability to connect to VPN and therefore to the University network
0 | single voicemail box problem, not on critical list
1 | inability to connect to VPN determined to be a multiple user problem. VPN never exceeds this severity level as a service that is not mission critical.
1 | notification of serious security vulnerability that has not implemented itself.
1 | pager broken for person who is on-call
2 | a pager is broken for a person on-call, the problem is 6 hours old, and estimated time for repair is > 6 more hours. Thus total response exceeds 12 business hours and problem became a sev 2 when that estimate was first known.
2 | Vulnerability exploit which has begun implementing itself and has created inconveniences rather than serious impacts such as an outage.
2 | Auth Code Manager is down (though sev 2 is typically an ageing from sev 1, this problem was deemed more critical than the usual sev 1, while keeping it from the glut of sev 3 conditions that may require a SWAT team response).
3 | An FPC (processor) in the PBX is down
3 | Voicemail is down
3 | URNet backbone is down
3 | Worm or security vulnerability exploit in progress, with outage impact or other serious disruption to mission critical services. Includes those both affecting and launched from the UofR community.
3 | Ability to call out, whether long distance, or local, is impaired by congestion in the public telephone network (therefore affects multiple users in critical areas)
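The pager example above is a piece of ageing arithmetic: 6 business hours already elapsed plus more than 6 hours of estimated repair puts the total over 12 business hours, so the severity 1 problem becomes severity 2. The sketch below works through that rule under stated assumptions. The function name `aged_severity` and its parameters are hypothetical, and only the 12-business-hour ageing of severity 0 and 1 is modelled; escalation to severity 3 is left to the judgement described in the procedure.

```python
BUSINESS_HOUR_LIMIT = 12  # threshold stated in the procedure


def aged_severity(severity: int, hours_elapsed: float, hours_remaining: float) -> int:
    """Apply one ageing step to a severity 0 or 1 problem.

    If elapsed plus estimated remaining business hours exceeds 12,
    the problem moves up one level; other levels are returned unchanged.
    """
    projected_total = hours_elapsed + hours_remaining
    if severity in (0, 1) and projected_total > BUSINESS_HOUR_LIMIT:
        return severity + 1
    return severity


# Pager for an on-call person: severity 1, 6 hours old, repair estimated
# at more than 6 further hours (6.5 is used here purely as an example value).
print(aged_severity(1, 6, 6.5))
```

The function applies a single step, so a severity 0 problem that ages to severity 1 is not automatically pushed on to severity 2 in the same call.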
http://www.utd.rochester.edu/
Published by the University of Rochester Telecommunications Division. Copyright 2000.
Daytime Severity 3 - Call Center Responsibilities
Definition - Severity 3 : Major Failure; PBX or VoiceMail
Node down. Strategic (critical) business unit network
down. Backbone down. Internet connection down.
IMMEDIATE Organizational Actions, if voice services impacted.
Upon receipt of a severity 3 condition, during daytime business hours, the Call Center notifies a
Senior Manager and the Director of the severity 3 and assigns a scribe to record events in a
chronology.
1. A SEV3 page will be issued.
2. _______________________________
Identify the engineering lead responsible for responding to and communicating
about the severity. The engineering lead will be determined by the type of outage
and the on-call schedule.
3. _______________________________
Assign an NTOC partner role of lead communicator to someone in the Call Center:
either a Senior Call Agent, Call Center Manager, or Triage. That person will identify
themselves to Kate; if she is unavailable, this role will be identified to Norm. If
neither is on site, the escalation is to David Lewis.
4. _______________________________
Establish communication expectation with Dave, Kate & Norm: status updates every
20 minutes (or other acceptable interval).
5. Open a trouble ticket. A new trouble ticket will be opened for each report if it is
necessary to track specific information to assess and correct the outage.
6. _______________________________
Assign chronology duties to track the "history" of the event.
In the chronology, include:
    non-technical description of what happened (customer experience)
    technical description
    clear understanding of whether the problem is solved
    if not, who is following up
7. Notify the Directory Service Agents and provide them a script to use for callers.
8. Notify the Front Office and provide them a script to use for callers.
9. The NTOC lead assigned in Step 3 will be responsible for email notification to the
proper list(s) (pager, phonedown, netdown, etc).
    Prepare an email notification.
    Ask Engineering Lead to approve.
    Provide notification to either Kate or Norm to review.
    Suggest the lists that need to be accessed.
_______________________________ David Lewis
_______________________________ Med Ctr Director’s office
(Julie Choate (x54601), Roberta Parker)
_______________________________ ‘Phonedown’
_______________________________ CIOs office
(Maureen Baisch (x55240))
_______________________________ President’s Office
_______________________________ Provost’s Office
10. Copies of each communiqué go to the chron keeper. The phonedown & netdown
communiqués will be sent by the NTOC communication lead. If they are unable to
send the communiqué, alert Kate or Norm, who will assign this function.
11. Call Center staff will ready 2-way radios for possible issuance.
12. Call Center will ready cell phones for issuance.
13. Determine need for other communication devices.
14. Determine additional needs for follow up email to lists.
Post sev 3 activity:
15. Verify with Engineering & the NTOC that the interruption has been resolved.
16. Notify each of the lists & communication points that the interruption has been
resolved. Work with Kate & Mike to
17. Follow up with each customer that has a trouble ticket to confirm that the issue has been
resolved.
18. Prepare & email chron to mgrs & appropriate engineers. Engineering prepares any
additional documentation required. Post document to On-call/Reporting folder on
server.