BUYER’S Guide

Incident Management Buyer’s Guide
Ten years ago, on-call teams were considered a mere support function. Only a chosen few had insight into what kept the machines
running. And IT essentially operated under the doctrine of the “check engine” light: we need to be alerted to generic problems that can
be handled out-of-band.
Fast-forward to today. The rapid evolution of how the world delivers software has put an end to the legend of the lone IT
hero. Businesses have adopted 24-hour delivery cycles, and customers now expect applications to be
always on, always available, and always lightning fast.
The realm of “IT” or “on-call” is now a complex landscape that requires constant attention to monitor performance, correct problems
and deliver service 24/7. For today’s teams, on-call is a team sport in which multiple disciplines work together around-the-clock to
solve the company’s biggest challenges. It’s these teams that are quickly becoming the backbone of every business.
Welcome to the Era of Real-time Incident Management
Fortunately, there is a wide range of products available to help you monitor systems, collaborate effectively, and reduce time-to-resolution.
But what should you look for in an incident management platform? What capabilities do you really need? And is there really
anything wrong with a homegrown solution? It helps to have a construct for product evaluation in such an emerging space.
In this Buyer’s Guide, you will discover:
• Why incident management platforms are essential for today’s teams
• How to move beyond basic alerting to manage the entire incident lifecycle
• Key questions to ask your vendor before making a purchase
• 2 new on-call models for modern DevOps
• Top 5 challenges with homegrown solutions
Let’s first look at how the concept of on-call has evolved beyond simple alerting to a more holistic view across
the entire incident lifecycle.
www.victorops.com
Why a Real-time Incident Management Platform is
Essential for Today’s Teams
The average cost of downtime varies widely - from $90K/hour in the media sector to $6.48M/hour for online brokerages [1].
Regardless of where a company falls on that scale, the economic impact of downtime is undeniable, and the softer business
impacts of customer satisfaction and brand sentiment are equally so.
Many Teams Still Rely on Email for Incident Notifications
How much does downtime cost your company? And how much do you invest to
reduce it? The rapid transformation of software delivery has caused a radical
imbalance: reducing downtime is an ever-increasing business priority, yet many
companies have not invested enough in a legitimate solution.
Eighty-two percent of on-call professionals use email as their primary incident
notification, and 57% use SMS [2]. This is a shocking statistic. Why are so many
businesses relying on noisy, asynchronous and non-urgent channels for their
most critical interactions?
The answer is relatively simple: even just two years ago, few commercial products existed. To support applications 24/7, teams
had two choices: patch together an internal process or use a basic “paging” tool to centralize alerts. Those solutions were
sufficient... for a time.
Early on-call solutions are no longer enough for the complexity of our systems and processes, the diversity of our teams, and the
business-critical focus of our uptime. Today’s teams must manage the incident across the entire lifecycle - before, during and after
an incident.
How to Move Beyond Basic Alerting to Manage the Incident Lifecycle
Let’s investigate the four stages of the incident lifecycle and what specific product capabilities are needed for each. In each
subsequent section, this Buyer’s Guide provides a starting point for questions to ask during a vendor evaluation, or to ask about
your homegrown solution before making further investments.
1) Before the Incident: Alerting and Notification
Incident notification, or alerting, is probably the most traditional view of what
it means to be on-call: “There is a problem, so tell someone about it.” There
is one primary goal for your incident management product in this phase: programmatically alert a team member or team members to an issue that
needs attention, and don’t stop alerting until someone responds. This includes
not only the initial alert, but also a way to programmatically escalate and
reroute to other individuals or teams.
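To make "don't stop alerting until someone responds" concrete, here is a minimal Python sketch of an escalation policy. It is an illustration only: `EscalationStep`, `run_escalation`, and the `notify` and `is_acked` callbacks are hypothetical names invented for this example, not the API of any product mentioned in this guide.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EscalationStep:
    target: str    # person or team to page (hypothetical names)
    attempts: int  # how many times to page before escalating


def run_escalation(
    steps: List[EscalationStep],
    notify: Callable[[str], None],
    is_acked: Callable[[], bool],
) -> str:
    """Page each target in order until someone acknowledges.

    Returns the target that acknowledged, or "unacknowledged" if the
    whole policy was exhausted. A real system would wait between
    attempts; the wait is omitted so the sketch runs instantly.
    """
    for step in steps:
        for _ in range(step.attempts):
            notify(step.target)
            if is_acked():
                return step.target
    return "unacknowledged"


# Usage: the primary never responds; the secondary acks on the first page.
pages: List[str] = []
policy = [EscalationStep("primary", 2), EscalationStep("secondary", 2)]
acked_by = run_escalation(
    policy,
    notify=pages.append,
    is_acked=lambda: pages[-1] == "secondary",
)
print(acked_by, pages)
```

In this run the primary is paged twice without a response, so the alert escalates and the secondary acknowledges on the third page.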
[Figure: Time Spent on Each Stage of the Incident Lifecycle]
For most companies, whether it’s a startup with a handful of customers or
a major ecommerce site, incident notification must be as close to flawless
as possible. If your company supports free or paying customers on a general
availability product, an off-the-shelf incident management solution is
highly recommended.
Key questions to ask:
• Does the solution send timely alert notifications from existing monitoring systems (most easily tested during the trial period)?
• Can a user customize their own notification policies (push, SMS, email, phone, or a combination) at specified intervals?
• Does the solution support escalation policies to teams, individuals, webhooks, emails, etc.?
• Does the solution surface up-to-date team contact information and individual contact preferences?
• Does the solution support automatic on-call rotations, including follow-the-sun scheduling?
• Does the solution support one-touch on-call handoffs that are immediately visible to all team members?
• Can the solution be set to page unlimited team members simultaneously?
• Can the incident be created, routed and acked within the same view?
• Is the product a fully functioning mobile app? (Caution here, as some products offer a mobile app that simply provides notifications, with acknowledgements and team collaboration occurring elsewhere.)
2) During the Incident: Triage and Investigation
Alerting, or “paging,” is just the tip of the iceberg. For most on-call teams, triage and investigation is the bulk of the
work - and the bulk of the time. During this stage, teams need to gain situational awareness as quickly as possible in
order to begin resolution.
Today’s systems are so complex and ever-changing that they require a whole new level of visibility. Add to that the fact that many
teams now have a varied group of individuals in the on-call rotation - ranging from a system admin to a database developer.
Instead of allowing triage to splinter and break down, the platform you choose must pull together systems, people and knowledge
to provide a single source of truth.
Key questions to ask:
• Does the solution provide a single view of all activities (sometimes called a timeline) including all alerts, paging, team chat, etc.?
• Is the timeline fully functional in the web application, and on Android and iOS platforms?
• Can a user re-route incidents to one or more teams and stakeholders?
• Can a user “soft-escalate” to another team member within the application?
• Can a user filter views, such as by items that are paging them, paging their team, or all paging events?
• Does the solution automatically surface links to contextual documentation based on alert content?
3) During the Incident: Collaboration and Resolution
On-call team collaboration is unique, and should be treated as such. Once the alert sounds, the firefight begins with
a rapid, and sometimes chaotic, succession of messages and escalations. There are three key considerations during this stage:
1) make it as easy as possible for teams to connect and collaborate at any hour of the day or night, 2) ensure visibility within the
team and to stakeholders, and 3) collect all communication for later post-mortem analysis.
Several off-the-shelf products address these issues by harnessing the stream of activities occurring around an
incident and integrating with existing collaboration platforms. Relatively minor changes in how your team
communicates before, during and after an event can make a measurable impact on your average time-to-resolution.
Key questions to ask:
• Can a user chat directly into the primary activity stream, or timeline?
• Can a user view a list of team members who are currently online and available to assist?
• Does the solution support contextual chat (either native or via integration), such as chats that attach to a specific incident?
• Does the solution support bi-directional integrations with horizontal chat platforms, such as HipChat and Slack?
• Does the solution support private chat between team members within the application?
• Can a user facilitate documentation clean-up by making notes about annotated assets in real-time?
4) After the Incident: Post-mortem and Remediation
There’s certainly nothing new or groundbreaking about conducting a post-mortem after an incident. What is new and
groundbreaking about today’s on-call management solutions is the ability to pull together activities that occurred throughout the
entire incident lifecycle into a single activity stream, with a single authoritative clock, and including all of the relevant communication.
An accurate and comprehensive post-mortem is an essential tool for communicating with internal and external stakeholders.
More importantly, it helps teams prepare for similar incidents and, ideally, prevent them from occurring again. If you don’t have a process
or tool for conducting post-mortems, you should invest time here.
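The idea of a single activity stream with a single authoritative clock can be sketched in a few lines of Python. This is an assumption-laden illustration, not how any particular product works: the `Event` type, the `build_timeline` function, and the sample events and dates are all made up for the example.

```python
from datetime import datetime, timezone
from typing import List, NamedTuple


class Event(NamedTuple):
    at: datetime  # timestamp from one authoritative clock (UTC here)
    source: str   # "alert", "system" or "chat"
    text: str


def build_timeline(*streams: List[Event]) -> List[Event]:
    """Merge per-source event lists into one stream ordered by time."""
    merged = [event for stream in streams for event in stream]
    return sorted(merged, key=lambda event: event.at)


def utc(hour: int, minute: int) -> datetime:
    """Helper for the made-up sample data below."""
    return datetime(2014, 6, 1, hour, minute, tzinfo=timezone.utc)


# Hypothetical events captured separately during one incident.
alerts = [Event(utc(2, 0), "alert", "CRITICAL: checkout latency above threshold")]
system = [
    Event(utc(2, 1), "system", "paged primary on-call"),
    Event(utc(2, 9), "system", "incident resolved"),
]
chat = [Event(utc(2, 4), "chat", "rolling back the last deploy")]

timeline = build_timeline(alerts, system, chat)
for event in timeline:
    print(event.at.strftime("%H:%M"), event.source, event.text)
```

Because every event carries a timestamp from the same clock, the merged report interleaves alerts, system actions and chat in the order they actually happened.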
Key questions to ask:
• Can a user specify a timeframe and report on all the activities for that period, broken into alerts, system actions and chat?
• Can the solution serve as the “authoritative clock,” accurately timestamping throughout the lifecycle?
• Are post-mortem reports editable, for example to edit out lines that aren’t relevant?
• Are post-mortem reports customizable, such as annotating points of resolution or adding a summary to the event?
• Does the solution provide bi-directional integrations for other tools your team may use during the incident, such as Slack and HipChat, or after the incident for remediation, such as GitHub?
• Does the solution support continuous documentation, for example incident frequency reports to identify whether alerts are actionable?
Two New On-call Models for Modern DevOps
Traditional on-call rotation is by far the most common model used today, but new
advances have opened the door to more sophisticated alternatives. Two new on-call
models - annotated alerting and rules-based routing - go beyond solving for initial
response time and instead tackle the bigger picture of total time-to-resolution.
Traditional On-call Rotation
[Figure: Traditional on-call rotation, fed by an omnipresent enterprise alert stream]
In a traditional on-call rotation, a designated individual receives all alerts that occur
during his or her shift. It is often necessary to escalate the issue to other individuals
or teams, especially as more diverse teams participate in on-call duty. Every
escalation cycle adds time to resolution, and often problem solving only begins
in earnest once the correct person is identified, contacted and engaged. While
a traditional concept of on-call duty is how our industry evolved, isn’t it akin
to throwing bodies at the problem?
What if instead, early warning alerts routed to the right people with the right
context could prevent incidents before they impact the business?
New methods are being adopted in progressive teams in which alert routing is
more flexible, time-to-resolution shrinks, and yet the team still reaps the benefits
of spreading on-call tasks across a larger group of people.
[Figure labels: APM problems, SecOps problems, slow site problems, network connectivity problems, partner integration problems]
Let’s look at two new capabilities that allow us to rethink traditional rotations.
1. Annotated alerts
Annotated alerts solve one of the biggest challenges for multidisciplinary on-call teams: how to programmatically
associate runbooks and other contextual documentation based on alert type. Using a system of annotations, your
on-call management platform, such as VictorOps, recognizes a problem that has occurred in the past and automatically
attaches documentation with remediation information. When the on-call team is alerted to the problem, documentation
previously prepared by the team aids in triage. While escalation may still be necessary, contextual documentation
increases the self-sufficiency of the individual on-call and helps them more easily identify the proper escalation path.
Annotated alerts are a natural next step for teams currently relying on traditional on-call routing.
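As a rough sketch of the idea, annotation can be thought of as a lookup from alert type to documentation the team has already written. The `Alert` class, the `RUNBOOKS` mapping, the `annotate` function and the example.com URLs below are all hypothetical, chosen only to illustrate the concept.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Alert:
    alert_type: str
    message: str
    annotations: List[str] = field(default_factory=list)


# Hypothetical mapping from alert type to runbooks the team wrote earlier.
RUNBOOKS: Dict[str, List[str]] = {
    "disk_full": ["https://wiki.example.com/runbooks/disk-full"],
    "db_replication_lag": ["https://wiki.example.com/runbooks/replication-lag"],
}


def annotate(alert: Alert, runbooks: Dict[str, List[str]]) -> Alert:
    """Attach any known runbook links for this alert type."""
    alert.annotations.extend(runbooks.get(alert.alert_type, []))
    return alert


known = annotate(Alert("disk_full", "/var is 98% full on web-3"), RUNBOOKS)
unknown = annotate(Alert("cert_expiry", "TLS cert expires in 3 days"), RUNBOOKS)
print(known.annotations)    # one runbook link attached
print(unknown.annotations)  # no documentation yet, so the list is empty
```

An alert type with no entry still pages someone as usual; the empty annotation list is a prompt to write the missing documentation once the incident is over.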
2. Rules-based routing
Rules-based routing has the potential to flip traditional routing on its head. Using rules-based routing,
the traditional model of notifying any available body evolves into notifying the right body. Your incident
management platform recognizes problems that have occurred in the past and immediately routes
them to the correct individual or team. Users can establish broad-based matching to send categorical alerts to teams,
or apply very specific rules for very specific problems. As more rules are created, fewer alerts go through
the traditional on-call rotation, reducing load on those individuals and avoiding costly delays in escalation cycles.
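One simple way to picture rules-based routing is an ordered list of patterns, with the traditional rotation as the fallback. The sketch below assumes alerts carry a dotted key such as `network.vpn.tunnel_down`; that key format, the `RULES` table, the `route` function and the team names are all hypothetical.

```python
import re
from typing import List, Tuple

# Ordered (pattern, destination) rules; the first match wins. Specific
# rules come first, broad categorical rules after them. All hypothetical.
RULES: List[Tuple[str, str]] = [
    (r"payments-db.*replication", "dba-alice"),  # very specific problem
    (r"^network\.", "network-team"),             # broad category
    (r"^apm\.", "app-team"),
]

DEFAULT_ROUTE = "on-call-rotation"


def route(alert_key: str, rules: List[Tuple[str, str]] = RULES) -> str:
    """Return the destination for an alert, falling back to the rotation."""
    for pattern, destination in rules:
        if re.search(pattern, alert_key):
            return destination
    return DEFAULT_ROUTE


print(route("db.payments-db.replication_lag"))  # dba-alice
print(route("network.vpn.tunnel_down"))         # network-team
print(route("cron.nightly_backup.failed"))      # on-call-rotation
```

Only the alert that matches no rule reaches the general rotation, which is how the load on that rotation shrinks as the rule set grows.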
Annotated Alerts + Rules-based Routing = The Ultimate Solution?
Today’s most sophisticated teams use a blended model of annotations and
rules-based routing. Time-to-resolution shrinks as rules route incidents to specific
individuals, and annotations automatically provide contextual documentation.
[Figure: Annotated Alerts + Rules-based Routing (omnipresent enterprise alert stream)]
This combined model has many benefits: incidents reach the right person faster;
the on-call load is spread across the widest possible group of people; and team
members gain self-sufficiency as they are armed with contextual documentation.
Transitioning from a traditional on-call rotation does require a certain level of
commitment, but the bulk of that work can occur incrementally as incidents
occur, especially if you have an on-call management platform. For example,
several VictorOps customers have effectively tackled the problem of outdated
documentation by incrementally building documentation in context of alerts
as they occur. For more best practices on using annotations and rules,
contact VictorOps support at support@victorops.com.
Key questions to ask:
• Does the solution allow a user to annotate relevant assets directly into an alert, such as notes, images and links?
• Does the solution support customizable alert fields to make future alerts behave differently, such as changing a recurring alert from critical to warning status?
• Does the solution support advanced routing, such as routing keys to automatically send specified alerts to the right person or team?
• Does the solution integrate ChatOps, such as leveraging Hubot commands for automated actions?
• Does the solution support your team in adopting best practices, such as continuous documentation and ChatOps?
Top 5 Challenges with Homegrown On-call Solutions
“Do anything other than build it yourself.”
- State of On-Call Report survey respondent
From the State of On-Call Report, surveying more than 500 on-call professionals, and from dozens of VictorOps customer
interviews, there are 5 inherent problems with homegrown on-call solutions.
1. Lack of visibility - There are two key problems with relying on email and/or SMS for initial alert notification:
1) they are inadequate as a means of instilling urgency at the crucial first moments of an incident, and
2) they often lack visibility, as individuals can’t cut through the noise to know what steps to take. That initial lack
of immediacy can cause a ripple effect of quantifiable damage during troubleshooting.
2. Lack of accountability and ownership - Homegrown incident notifications are often sent to a group of
5 or more individuals simultaneously, leaving time-to-resolution at the mercy of the right team member checking in
and responding to the communication. Email responses often overlap, and communication splinters across
other tools like HipChat, Skype or Slack. The result is slower time-to-resolution and a massive clean-up exercise
for post-mortem analysis.
3. Inflexibility to schedule changes - On-call schedules are always created with the best intentions of a single individual
being on-call for a specified period of time. Homegrown solutions often can’t handle the inevitable - a child’s first violin
concert or an illness. Instead, they often leave last-minute schedule changes in the hands of individual team members.
That works in theory until someone forgets that they took a shift and quite innocently silences their phone to step
into a movie theater.
4. Lack of contextual documentation - Fifty-six percent of on-call professionals say that their biggest challenge is
sharing contextual information. The top reported obstacle to problem remediation was effectively surfacing the correct
internal wikis.
5. Alert fatigue - Homegrown solutions, or alert-only commercial products, often have one or more factors that contribute
to alert fatigue: sending multiple messages without appropriate escalation paths, sending alerts to multiple individuals,
and lacking contextual data to solve the problem. Several of these factors combined quickly result in alert fatigue
and deteriorating response times.
Sources:
[1] Evolven.com, “Downtime, Outages, and Failures: Understanding their True Costs,” Sept 18, 2012
[2] VictorOps, “The State of On-Call,” 2014
Key Questions to Ask Your Vendor
Before the Incident: Alerting and Notification
• Does the solution send timely alert notifications from existing monitoring systems (most easily tested during the trial period)?
• Can a user customize their own notification policies (push, SMS, email, phone, or a combination) at specified intervals?
• Does the solution support escalation policies to teams, individuals, webhooks, emails, etc.?
• Does the solution surface up-to-date team contact information and individual contact preferences?
• Does the solution support automatic on-call rotations, including follow-the-sun scheduling?
• Does the solution support one-touch on-call handoffs that are immediately visible to all team members?
• Can the solution be set to page unlimited team members simultaneously?
• Can the incident be created, routed and acked within the same view?
• Is the product a fully functioning mobile app? (Caution here, as some products offer a mobile app that simply provides notifications, with acknowledgements and team collaboration occurring elsewhere.)
During the Incident: Triage and Investigation
• Does the solution provide a single view of all activities (sometimes called a timeline) including all alerts, paging, team chat, etc.?
• Is the timeline fully functional in the web application, and on Android and iOS platforms?
• Can a user re-route incidents to one or more teams and stakeholders?
• Can a user “soft-escalate” to another team member within the application?
• Can a user filter views, such as by items that are paging them, paging their team, or all paging events?
• Does the solution automatically surface links to contextual documentation based on alert content?
During the Incident: Collaboration and Resolution
• Can a user chat directly into the primary activity stream, or timeline?
• Can a user view a list of team members who are currently online and available to assist?
• Does the solution support contextual chat (either native or via integration), such as chats that attach to a specific incident?
• Does the solution support bi-directional integrations with horizontal chat platforms, such as HipChat and Slack?
• Does the solution support private chat between team members within the application?
• Can a user facilitate documentation clean-up by making notes about annotated assets in real-time?
After the Incident: Post-mortem and Remediation
• Can a user specify a timeframe and report on all the activities for that period, broken into alerts, system actions and chat?
• Can the solution serve as the “authoritative clock,” accurately timestamping throughout the lifecycle?
• Are post-mortem reports editable, for example to edit out lines that aren’t relevant?
• Are post-mortem reports customizable, such as annotating points of resolution or adding a summary to the event?
• Does the solution provide bi-directional integrations for other tools your team may use during the incident, such as Slack and HipChat, or after the incident for remediation, such as GitHub?
• Does the solution support continuous documentation, for example incident frequency reports to identify whether alerts are actionable?
Annotated Alerts and Rules-based Routing
• Does the solution allow a user to annotate relevant assets directly into an alert, such as notes, images and links?
• Does the solution support customizable alert fields to make future alerts behave differently, such as changing a recurring alert from critical to warning status?
• Does the solution support advanced routing, such as routing keys to automatically send specified alerts to the right person or team?
• Does the solution integrate ChatOps, such as leveraging Hubot commands for automated actions?
• Does the solution support your team in adopting best practices, such as continuous documentation and ChatOps?