BUYER’S Guide

Incident Management Buyer’s Guide

Ten years ago, on-call teams were considered a mere support function. Only a chosen few had insight into what kept the machines running, and IT essentially operated under the doctrine of the “check engine” light: alert us to generic problems that can be handled out-of-band.

Fast forward to today. Rapid evolution in the way the world delivers software has put an end to the legend of the lone IT hero. Businesses adopted 24-hour delivery cycles, and customers came to expect applications that are always on, always available, and always lightning fast. The realm of “IT” or “on-call” is now a complex landscape that requires constant attention to monitor performance, correct problems and deliver service 24/7.

For today’s teams, on-call is a team sport in which multiple disciplines work together around the clock to solve the company’s biggest challenges. These teams are quickly becoming the backbone of every business.

Welcome to the Era of Real-time Incident Management

Fortunately, there is a wide range of products available to help you monitor systems, collaborate effectively, and reduce time-to-resolution. But what should you look for in an incident management platform? What capabilities do you really need? And is there really anything wrong with a homegrown solution? It helps to have a framework for product evaluation in such an emerging space.

In this Buyer’s Guide, you will discover:
• Why incident management platforms are essential for today’s teams
• How to move beyond basic alerting to manage the entire incident lifecycle
• Key questions to ask your vendor before making a purchase
• 2 new on-call models for modern DevOps
• Top 5 challenges with homegrown solutions

Let’s first look at how the concept of on-call has evolved beyond simple alerting to a more holistic view across the entire incident lifecycle.
www.victorops.com

Why a Real-time Incident Management Platform is Essential for Today’s Teams

The average cost of downtime varies widely - from $90K/hour in the media sector to $6.48M/hour for online brokerages [1]. Regardless of where a company falls on that scale, the economic impact of downtime is undeniable, and the softer business impacts on customer satisfaction and brand sentiment are equally so.

Many Teams Still Rely on Email for Incident Notifications

How much does downtime cost your company? And how much do you invest to reduce it? The rapid transformation of software delivery has caused a radical imbalance: reducing downtime is an ever-increasing business priority, yet many companies have not invested enough in a legitimate solution.

Eighty-two percent of on-call professionals use email as their primary incident notification channel, and 57% use SMS [2]. This is a shocking statistic. Why are so many businesses relying on noisy, asynchronous and non-urgent channels for their most critical interactions?

The answer is relatively simple: even just two years ago, few commercial products existed. To support applications 24/7, teams had two choices: patch together an internal process or use a basic “paging” tool to centralize alerts. Those solutions were sufficient... for a time.

Early on-call solutions are no longer enough for the complexity of our systems and processes, the diversity of our teams, and the business-critical focus of our uptime. Today’s teams must manage the incident across the entire lifecycle - before, during and after an incident.

How to Move Beyond Basic Alerting to Manage the Incident Lifecycle

Let’s investigate the four stages of the incident lifecycle and the specific product capabilities needed for each. In each subsequent section, this Buyer’s Guide provides a starting point for questions to ask during a vendor evaluation, or to ask about your homegrown solution before making further investments.
1) Before the Incident: Alerting and Notification

Incident notification, or alerting, is probably the most traditional view of what it means to be on-call: “There is a problem, so tell someone about it.” There is one primary goal for your incident management product in this phase: programmatically alert a team member or team members to an issue that needs attention, and don’t stop alerting until someone responds. This includes not only the initial alert, but also a way to programmatically escalate and reroute to other individuals or teams.

[Figure: Time Spent on Each Stage of the Incident Lifecycle]

For most companies, whether a startup with a handful of customers or a major ecommerce site, incident notification must be as close to flawless as possible. If your company supports free or paying customers on a general availability product, an off-the-shelf incident management solution is highly recommended.

Key questions to ask:
• Does the solution send timely alert notifications from existing monitoring systems (most easily tested during the trial period)?
• Can a user customize their own notification policies (push, SMS, email, phone, or a combination) at specified intervals?
• Does the solution support escalation policies to teams, individuals, webhooks, emails, etc.?
• Does the solution surface up-to-date team contact information and individual contact preferences?
• Does the solution support automatic on-call rotations, including follow-the-sun scheduling?
• Does the solution support one-touch on-call handoffs that are immediately visible to all team members?
• Can the solution be set to page unlimited team members simultaneously?
• Can the incident be created, routed and acked within the same view?
• Is the product a fully functioning mobile app? (Use caution here: some products offer a mobile app that simply provides notifications, with acknowledgements and team collaboration occurring elsewhere.)
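
The alerting goal described in this section (keep notifying, and escalate, until someone responds) can be sketched as a small escalation policy. This is a minimal illustration under stated assumptions, not any vendor's implementation: the names `EscalationStep`, `POLICY` and `who_is_paged`, the targets, the channels, and the 10- and 20-minute thresholds are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int   # minutes since the alert fired
    targets: tuple       # people or teams to notify at this step
    channels: tuple      # notification channels, e.g. ("push", "sms")


# Hypothetical policy: page the primary first, then widen the circle.
POLICY = (
    EscalationStep(0, ("primary-on-call",), ("push", "sms")),
    EscalationStep(10, ("secondary-on-call",), ("push", "phone")),
    EscalationStep(20, ("ops-team", "engineering-manager"), ("phone",)),
)


def steps_due(policy, minutes_elapsed, acknowledged):
    """Return every step whose delay has passed; none once the alert is acked."""
    if acknowledged:
        return []
    return [step for step in policy if step.after_minutes <= minutes_elapsed]


def who_is_paged(policy, minutes_elapsed, acknowledged=False):
    """List the targets being paged right now, in escalation order."""
    paged = []
    for step in steps_due(policy, minutes_elapsed, acknowledged):
        for target in step.targets:
            if target not in paged:
                paged.append(target)
    return paged
```

In this sketch, calling `who_is_paged(POLICY, 12)` includes both the primary and the secondary responder, and passing `acknowledged=True` stops all paging. A real product would also track which channel each person prefers and re-send at the intervals that person has configured.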
2) During the Incident: Triage and Investigation

Alerting, or “paging,” is just the tip of the iceberg. For most on-call teams, triage and investigation is the bulk of the work - and the bulk of the time. During this stage, teams need to gain situational awareness as quickly as possible in order to begin resolution.

Today’s systems are so complex and ever-changing that they require a whole new level of visibility. Add to that the fact that many teams now have a varied group of individuals in the on-call rotation - ranging from a system admin to a database developer. Instead of allowing triage to splinter and break down, the platform you choose must pull together systems, people and knowledge to provide a single source of truth.

Key questions to ask:
• Does the solution provide a single view of all activities (sometimes called a timeline) including all alerts, paging, team chat, etc.?
• Is the timeline fully functional in the web application, and on Android and iOS platforms?
• Can a user re-route incidents to one or more teams and stakeholders?
• Can a user “soft-escalate” to another team member within the application?
• Can a user filter views, such as by items that are paging them, paging their team, or all paging events?
• Does the solution automatically surface links to contextual documentation based on alert content?

3) During the Incident: Collaboration and Resolution

On-call team collaboration is unique, and should be treated as such. Once that alert sounds, the firefight begins with a rapid, and sometimes chaotic, succession of messages and escalations. There are three key considerations during this stage: 1) make it as easy as possible for teams to connect and collaborate at any hour of the day or night, 2) ensure visibility within the team and to stakeholders, and 3) collect all communication for later post-mortem analysis.
Several off-the-shelf products address these issues by harnessing the stream of activities occurring around an incident and integrating with existing collaboration platforms. Relatively minor changes in how your team communicates before, during and after an event can make a measurable impact on your average time-to-resolution.

Key questions to ask:
• Can a user chat directly into the primary activity stream, or timeline?
• Can a user view a list of team members who are currently online and available to assist?
• Does the solution support contextual chat (either native or via integration), such as chats that attach to a specific incident?
• Does the solution support bi-directional integrations with horizontal chat platforms, such as HipChat and Slack?
• Does the solution support private chat between team members within the application?
• Can a user facilitate documentation clean-up by making notes about annotated assets in real-time?

4) After the Incident: Post-mortem and Remediation

There’s certainly nothing new or groundbreaking about conducting a post-mortem after an incident. What is new about today’s on-call management solutions is the ability to pull together activities that occurred throughout the entire incident lifecycle into a single activity stream, with a single authoritative clock, including all of the relevant communication.

An accurate and comprehensive post-mortem is an essential tool for communicating with internal and external stakeholders. More importantly, it helps teams prepare for, and ideally prevent, similar incidents in the future. If you don’t have a process or tool for conducting post-mortems, you should invest time here.

Key questions to ask:
• Can a user specify a timeframe and report on all the activities for that period, broken into alerts, system actions and chat?
• Can the solution serve as the “authoritative clock,” accurately timestamping throughout the lifecycle?
• Are post-mortem reports editable, for example to edit out lines that aren’t relevant?
• Are post-mortem reports customizable, such as annotating points of resolution or adding a summary to the event?
• Does the solution provide bi-directional integrations for other tools your team may use during the incident, such as Slack and HipChat, or after the incident for remediation, such as GitHub?
• Does the solution support continuous documentation, for example incident frequency reports to identify whether alerts are actionable?

Two New On-call Models for Modern DevOps

Traditional on-call rotation is by far the most common model used today, but new advances have opened the door to more sophisticated alternatives. Two new on-call models - annotated alerting and rules-based routing - go beyond solving for initial response time and instead tackle the bigger picture of total time-to-resolution.

Traditional On-call Rotation

[Figure: Traditional On-call Rotation. Labels: Omnipresent Enterprise Alert Stream; On-Call Rotation]

In a traditional on-call rotation, a designated individual receives all alerts that occur during his or her shift. It is often necessary to escalate the issue to other individuals or teams, especially as more diverse teams participate in on-call duty. Every escalation cycle adds time to resolution, and problem solving often begins in earnest only once the correct person is identified, contacted and engaged. The traditional concept of on-call duty is how our industry evolved, but isn’t it akin to throwing bodies at the problem?

What if, instead, early warning alerts routed to the right people with the right context could prevent incidents before they impact the business? Progressive teams are adopting new methods in which alert routing is more flexible, time-to-resolution shrinks, and the team still reaps the benefits of spreading on-call tasks across a larger group of people.
[Figure labels: APM Problems; SecOps Problems; Slow Site Problems; Network Connectivity Problems; Partner Integration Problems]

Let’s look at two new capabilities that allow us to rethink traditional rotations.

1. Annotated alerts

Annotated alerts solve one of the biggest challenges for multidisciplinary on-call teams: how to programmatically associate runbooks and other contextual documentation with an alert based on its type. Using a system of annotations, your on-call management platform, such as VictorOps, recognizes a problem that has occurred in the past and automatically attaches documentation with remediation information. When the on-call team is alerted to the problem, documentation previously prepared by the team aids in triage. While escalation may still be necessary, contextual documentation increases the self-sufficiency of the individual on-call and helps them more easily identify the proper escalation path. Annotated alerts are a natural next step for teams currently relying on traditional on-call routing.

2. Rules-based routing

Rules-based routing has the potential to flip traditional routing on its head. With rules-based routing, the traditional model of notifying any available body evolves into notifying the right body. Your incident management platform recognizes problems that have occurred in the past and immediately routes them to the correct individual or team. Users can establish broad-based matching to send categorical alerts to teams, or apply very specific rules for very specific problems. As more rules are created, fewer alerts go through the traditional on-call rotation, reducing load on those individuals and avoiding costly delays in escalation cycles.

Annotated Alerts + Rules-based Routing = The Ultimate Solution?

Today’s most sophisticated teams use a blended model of annotations and rules-based routing.
Time-to-resolution shrinks as rules route incidents to specific individuals, and annotations automatically provide contextual documentation.

[Figure: Annotated Alerts + Rules-based Routing. Labels: Omnipresent Enterprise Alert Stream; On-Call Rotation; Network Connectivity Problems; Slow Site Problems; APM Problems; SecOps Problems; Partner Integration Problems]

This combined model has many benefits: incidents reach the right person faster; the on-call load is spread across the widest possible group of people; and team members gain self-sufficiency as they are armed with contextual documentation.

Transitioning from a traditional on-call rotation does require a certain level of commitment, but the bulk of that work can happen incrementally as incidents occur, especially if you have an on-call management platform. For example, several VictorOps customers have effectively tackled the problem of outdated documentation by incrementally building documentation in the context of alerts as they occur. For more best practices on using annotations and rules, contact VictorOps support at support@victorops.com.

Key questions to ask:
• Does the solution allow a user to annotate relevant assets directly into an alert, such as notes, images and links?
• Does the solution support customizable alert fields to make future alerts behave differently, such as changing a recurring alert from critical to warning status?
• Does the solution support advanced routing, such as routing keys to automatically send specified alerts to the right person or team?
• Does the solution integrate ChatOps, such as leveraging Hubot commands for automated actions?
• Does the solution support your team in adopting best practices, such as continuous documentation and ChatOps?
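
The blended model of annotations and rules-based routing can be illustrated with a short sketch. It is only a rough picture of the idea, not how VictorOps or any other product implements it: the `Rule` class, the `process` function, the glob-style patterns, the team names, and the example.com runbook link are all assumptions invented for the example.

```python
import fnmatch
from dataclasses import dataclass, field

DEFAULT_ROUTE = "on-call-rotation"  # hypothetical fallback: the traditional rotation


@dataclass
class Rule:
    pattern: str        # glob pattern matched against the alert's entity name
    route_to: str = ""  # team or person; empty means "leave routing unchanged"
    annotations: list = field(default_factory=list)  # runbook links, notes


# Hypothetical rules: one broad categorical rule and one annotation-only rule.
RULES = [
    Rule("db-*", route_to="database-team",
         annotations=["https://wiki.example.com/runbooks/db"]),
    Rule("*disk*", annotations=["Note: check log rotation first"]),
]


def process(alert, rules=RULES):
    """Attach matching annotations and pick a route for one alert."""
    routed = dict(alert, route=DEFAULT_ROUTE, annotations=[])
    for rule in rules:
        if fnmatch.fnmatchcase(alert["entity"], rule.pattern):
            routed["annotations"].extend(rule.annotations)
            # The first matching rule with a route wins; later rules only annotate.
            if rule.route_to and routed["route"] == DEFAULT_ROUTE:
                routed["route"] = rule.route_to
    return routed
```

In this sketch, an alert such as `{"entity": "db-01/disk-usage"}` matches both rules, so it is routed to the database team with both annotations attached. An alert that matches no rule falls through to the traditional rotation, which is why each added rule reduces the load on that rotation.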
Top 5 Challenges with Homegrown On-call Solutions

“Do anything other than build it yourself.”
- State of On-Call Report survey respondent

The State of On-Call Report, which surveyed more than 500 on-call professionals, and dozens of VictorOps customer interviews point to five inherent problems with homegrown on-call solutions.

1. Lack of visibility - There are two key problems with relying on email and/or SMS for initial alert notification: 1) they are inadequate as a means of instilling urgency at the crucial first moments of an incident, and 2) they often lack visibility, as individuals can’t cut through the noise to know what steps to take. That initial lack of immediacy can cause a ripple effect of quantifiable damage during troubleshooting.

2. Lack of accountability and ownership - Homegrown incident notifications are often sent to a group of five or more individuals simultaneously, leaving time-to-resolution at the mercy of the right team member checking in and responding to the communication. Email responses often overlap, and communication splinters across other tools like HipChat, Skype or Slack. The result is slower time-to-resolution and a massive clean-up exercise for post-mortem analysis.

3. Inflexibility to schedule changes - On-call schedules are created with the best of intentions: a single individual is on-call for a specified period of time. Homegrown solutions often can’t handle the inevitable - a child’s first violin concert or an illness - and leave last-minute schedule changes in the hands of individual team members. That works in theory until someone forgets that they took a shift and quite innocently silences their phone to step into a movie theater.

4. Lack of contextual documentation - Fifty-six percent of on-call professionals say that their biggest challenge is sharing contextual information.
The top reported obstacle to problem remediation was effectively surfacing the correct internal wikis.

5. Alert fatigue - Homegrown solutions, or alert-only commercial products, often have one or more factors that contribute to alert fatigue: sending multiple messages without appropriate escalation paths, sending alerts to multiple individuals, and lacking contextual data to solve the problem. Combined, these factors quickly result in alert fatigue and deteriorating response times.

Sources:
[1] Evolven.com, “Downtime, Outages, and Failures: Understanding their True Costs,” Sept 18, 2012
[2] VictorOps, “The State of On-Call,” 2014

Key Questions to Ask Your Vendor

Before the Incident: Alerting and Notification
• Does the solution send timely alert notifications from existing monitoring systems (most easily tested during the trial period)?
• Can a user customize their own notification policies (push, SMS, email, phone, or a combination) at specified intervals?
• Does the solution support escalation policies to teams, individuals, webhooks, emails, etc.?
• Does the solution surface up-to-date team contact information and individual contact preferences?
• Does the solution support automatic on-call rotations, including follow-the-sun scheduling?
• Does the solution support one-touch on-call handoffs that are immediately visible to all team members?
• Can the solution be set to page unlimited team members simultaneously?
• Can the incident be created, routed and acked within the same view?
• Is the product a fully functioning mobile app? (Use caution here: some products offer a mobile app that simply provides notifications, with acknowledgements and team collaboration occurring elsewhere.)

During the Incident: Triage and Investigation
• Does the solution provide a single view of all activities (sometimes called a timeline) including all alerts, paging, team chat, etc.?
• Is the timeline fully functional in the web application, and on Android and iOS platforms?
• Can a user re-route incidents to one or more teams and stakeholders?
• Can a user “soft-escalate” to another team member within the application?
• Can a user filter views, such as by items that are paging them, paging their team, or all paging events?
• Does the solution automatically surface links to contextual documentation based on alert content?

During the Incident: Collaboration and Resolution
• Can a user chat directly into the primary activity stream, or timeline?
• Can a user view a list of team members who are currently online and available to assist?
• Does the solution support contextual chat (either native or via integration), such as chats that attach to a specific incident?
• Does the solution support bi-directional integrations with horizontal chat platforms, such as HipChat and Slack?
• Does the solution support private chat between team members within the application?
• Can a user facilitate documentation clean-up by making notes about annotated assets in real-time?

After the Incident: Post-mortem and Remediation
• Can a user specify a timeframe and report on all the activities for that period, broken into alerts, system actions and chat?
• Can the solution serve as the “authoritative clock,” accurately timestamping throughout the lifecycle?
• Are post-mortem reports editable, for example to edit out lines that aren’t relevant?
• Are post-mortem reports customizable, such as annotating points of resolution or adding a summary to the event?
• Does the solution provide bi-directional integrations for other tools your team may use during the incident, such as Slack and HipChat, or after the incident for remediation, such as GitHub?
• Does the solution support continuous documentation, for example incident frequency reports to identify whether alerts are actionable?

Annotated Alerts and Rules-based Routing
• Does the solution allow a user to annotate relevant assets directly into an alert, such as notes, images and links?
• Does the solution support customizable alert fields to make future alerts behave differently, such as changing a recurring alert from critical to warning status?
• Does the solution support advanced routing, such as routing keys to automatically send specified alerts to the right person or team?
• Does the solution integrate ChatOps, such as leveraging Hubot commands for automated actions?
• Does the solution support your team in adopting best practices, such as continuous documentation and ChatOps?