Apigee Monitoring and Operations Strategy

<Company Name>
<Author>
<Date>

©2014 Apigee. All Rights Reserved.

Introduction

The purpose of this document is to:

- Describe key elements of Apigee’s monitoring and operations.
- Define the specific strategy for monitoring and operations for your API program.

This document provides high-level information on what needs to be monitored and what needs to be done to define your monitoring solution, and it positions the Apigee support model and support processes.

The following topics are covered:

Topic | Description
Monitoring Roles | Description of the split of responsibilities for monitoring between Apigee and the Customer.
Monitoring Strategy | High-level description of the approach the Customer will be taking for their specific monitoring implementation.
Operations Strategy | High-level descriptions of the Customer’s operational teams and processes.
Operational Support | High-level overview of Apigee’s support processes and a framework for defining your processes.
Further Reading | Additional reference information.

APPENDICES
Hierarchy of Monitoring | Core concepts of monitoring and operations along with key definitions.
General Monitoring Strategy – Best Practices | Collection of high-level core monitoring and alerting approaches and principles.

Monitoring Roles

The detail required for Monitoring and Operations depends on whether the Customer is configuring Apigee on-premise or utilizing the Apigee hosted solution. Some of the sections in this template are divided into:

- On-Premise specific content
- Hosted specific content
- Generic content applicable to both hosted and on-premise

When completing this document, delete the headers / content that are not applicable.

On-Premise Specific:

Monitoring and Operations for Apigee on-premise requires implementing appropriate controls covering the infrastructure and run-time components of Apigee (for example: servers, Cassandra, ZooKeeper, Postgres, load balancer, router, etc.) in addition to the API-level monitoring. A highly detailed operations guide exists for this and will be made available to help you implement appropriate monitoring.

You will be responsible for monitoring, alerting and responding to alerts on all aspects of the Apigee stack, from the base infrastructure that Apigee software runs on, through the API proxies, Developer Portal and any other Apigee components, all the way to your API program service. Apigee will play a limited role in your day-to-day operations, but will be available to provide additional remote support in the event that you experience an issue which may require our assistance.

Hosted Specific:

When utilizing the hosted Apigee solution, you do not need to monitor the infrastructure and run-time components of Apigee. The Apigee Global Support Center (GSC) will monitor and manage the operations of the Apigee infrastructure and run-time components and play an active role in day-to-day operations. If you experience a service issue, the Apigee Support Team will become involved in investigating and addressing it. You will need to integrate Apigee Support Processes into your existing Support Processes.

Common: API Monitoring

A well-structured API monitoring capability is a key part of ensuring your service is working and, in the event of service issues, of rapidly identifying where the issue is so that you can focus your efforts on that area. API monitoring can be implemented in a variety of ways, but a key objective should be to define a monitoring strategy that enables rapid isolation of the source of any issues.
The remainder of this document provides a framework for doing that by describing what needs to be monitored and describing Apigee’s role in operations.

Monitoring Strategy

Identify the specific types of monitoring that will be utilized for the Customer’s specific APIs.

- Passive End to End Service Monitoring: What, if any, passive monitoring will be utilized? Which APIs will be monitored and what are the key specific metrics being monitored?
- Active End to End Service Monitoring: What, if any, active monitoring will be utilized? Which APIs will be monitored, where and how will traffic be generated and what are the key specific metrics being monitored?
- Back-End Service Monitoring: What specific independent back-end service monitoring will be in place?
- Threshold Definition and Management: What are the key thresholds?
- Alerting Management: What mechanisms are going to be used to generate alerts, e.g. email, SMS, visual displays? What, if any, alert aggregation will be utilized to apply prioritization rules to alerts?

Operations Strategy

Identify the Customer’s operational teams and processes.

- Customer’s Operations Team Overview: Description of the operations team, including geographic locations.
- Customer’s Operational Processes: High-level summary of core operational processes.
- Integration with Apigee: Description of how the Customer’s operational teams will utilize Apigee Support, any process changes, and identification of any training needs on how Apigee Support works.

Apigee Support

Brief overview of Apigee Support and how to integrate it into your existing support processes. There are two views expressed here, one that is relevant to On-Premise Customers and one for Hosted Customers. Delete whichever is not applicable.

On-Premise

As an on-premise Customer, you will be responsible for the day-to-day operations and monitoring of all aspects of your Apigee solution.
The on-premise monitoring guide document provides you with all the details you need to get set up.

In the event that you experience an issue with your API service, your ideal first step is to review your monitoring and isolate the source of the issue (e.g. back-end server offline, rogue app creating a traffic spike, Apigee gateway failure, etc.). If you suspect that the issue is related to your Apigee software and you are unable to diagnose further, then please engage Apigee Support.

The nature of on-premise deployments typically means that it is difficult or impossible for Apigee Support Engineers to log directly into your Apigee servers. You will need to be aware of access issues and utilize remote sharing technologies as required so that Apigee Support Engineers can review your installation.

Hosted

For hosted Customers, Apigee will be constantly monitoring all cloud instances, will detect most service-impacting issues automatically and will initiate response protocols to resolve the service incident rapidly.

In the event that you experience an issue with your API service, the first step you should take is to review any existing monitoring along with Apigee Analytics and attempt to isolate the source of the service issue. If you are unable to determine the source and / or you determine that an Apigee component is not functioning, please engage Apigee Support.

Please refer to the following Further Reading material for details on the Apigee Support Processes, including Service Levels and Escalations.

Further Reading

Additional details are available on our website: apigee.com.

Location | Description
http://community.apigee.com/ | Landing page for a wealth of best practices, strategies, training materials and more.
http://community.apigee.com/enterpriseonboarding | Collection of videos covering multiple topics related to getting up and running with Apigee. Of particular relevance is the video “Introduction to Support”.
http://apigee.com/about/documents/apigeespecification-sheet-current | Detailed support level agreement.
http://apigee.com/docs/ | Product documentation.

APPENDIX: Hierarchy of Monitoring

Simplified hierarchy of monitoring.

There is no universally applicable monitoring tool or process. End-to-end service delivery relies on multiple layers of software and hardware that all work together but have very different monitoring and alerting solutions. Your current IT systems will be operating over an existing infrastructure and software potentially comprising a mix of 3rd-party hosted elements, open source elements, custom in-house elements and an uncontrolled wider developer community.

The picture below presents a simplified way of abstracting that complexity, by decomposing the end to end journey into three parts:

- The Consumption layer, illustrated by the Client(s) and / or App(s) that consume your services
- The Digital Enablement layer, Apigee, providing API management, analytics, monetization, app services and insights
- The Exposure layer, illustrated by the Back-End Services that expose your internal data

In turn, each of those layers is itself composed of a series of layers:

- Infrastructure, the physical and virtual compute layers
- Run Time, the utility software, comprising operating systems, databases, web servers, etc.
- Services, the consumable software, comprising back-end services, Apigee services, clients and end-user apps

Each layer serves a unique role in your end to end service, but each layer has different key operating metrics that are used to determine whether that layer is functioning as expected. Metrics from monitoring at each layer are combined to provide an overall view of the health of your end to end service.
Great monitoring enables you to prioritize issues, trigger alerts when preset thresholds are breached and, in doing so, predict and take proactive action to resolve issues before they threaten service. In the event that an issue becomes service affecting, great monitoring can make the difference in enabling rapid identification of the source of the service issue and, in turn, shorten the duration of the service issue.

APPENDIX: General Monitoring Strategy – Best Practices

Decomposing your monitoring to enable the combination of overall service monitoring and isolation of components.

The most important aspect of monitoring is measuring service in the same way that Client(s) and / or App(s) are consuming it, or observing how those Client(s) and / or App(s) are currently consuming it, i.e. real-time usage analytics. This is your front line in knowing whether your API is working.

Passive vs. Active Monitoring

With Passive monitoring, you are watching the current usage patterns. You are looking out for discrepancies, for example a sudden reduction or increase in traffic volumes, or a sudden increase in error rates or latency. Apigee Edge comes with a sophisticated Analytics capability that enables creation of highly customized, near real-time passive monitoring.

With Active monitoring, you are injecting specific requests that mimic the requests that Client(s) and / or App(s) are making and then you are measuring the responses.

The advantage of Passive monitoring is that you are not adding additional transactional load / overhead to your system. The disadvantage is that it depends on establishing usage patterns, which in turn depends on having sufficient volumes of traffic to provide large enough sample sizes. Conversely, if your traffic volumes are very high, you may have aggregation lags as large volumes of data are analyzed and you may not detect an issue immediately.

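The passive approach described above can be illustrated with a minimal Python sketch. It is not an Apigee API: the Window record, the check_window function and all default values (min_requests, traffic_factor, error_rate_margin) are hypothetical names and numbers chosen for illustration. It compares the current time window with the average of recent windows and reports a sudden traffic change or a rise in error rate, and it declines to judge when the baseline traffic is too low to give a useful sample.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class Window:
    """Aggregated traffic for one time window (hypothetical shape)."""
    requests: int
    errors: int


def check_window(history: List[Window], current: Window,
                 min_requests: int = 100,
                 traffic_factor: float = 2.0,
                 error_rate_margin: float = 0.05) -> List[str]:
    """Compare the current window with the average of recent windows."""
    if not history:
        return []
    baseline_requests = mean(w.requests for w in history)
    # Without enough baseline traffic there is no usage pattern to compare with.
    if baseline_requests < min_requests:
        return []
    baseline_error_rate = sum(w.errors for w in history) / sum(w.requests for w in history)
    current_error_rate = current.errors / current.requests if current.requests else 0.0

    findings = []
    if current.requests > baseline_requests * traffic_factor:
        findings.append("traffic spike")
    if current.requests < baseline_requests / traffic_factor:
        findings.append("traffic drop")
    if current_error_rate - baseline_error_rate > error_rate_margin:
        findings.append("error rate increase")
    return findings


if __name__ == "__main__":
    recent = [Window(1000, 10), Window(1000, 10), Window(1000, 10)]
    print(check_window(recent, Window(300, 60)))
```

In practice the window counts would come from whatever analytics source you already use, and the factors would be tuned to your own traffic patterns.
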
The advantage of Active monitoring is that you know what your ‘uncertainty’ window is: it is the duration between two monitoring requests, and you can tune those requests so that you get rapid detection of service issues. The disadvantages are that you need to invest in custom load-generating solutions, you are unlikely to be able to generate traffic from every potential source, and the load that you are generating can mask real traffic patterns that could be useful for deriving business insights.

A third alternative is a hybrid of Active and Passive, where you utilize both and tune them to achieve a balance between load and detection rate. As such, there is no single, ideal monitoring solution for end to end service and you will need to work through the pros and cons of different options to arrive at the one that best meets your needs.

Utilizing Monitoring to Drive Alerts by Setting Thresholds

The next important aspect of your monitoring strategy is how you utilize the metrics you are monitoring. Even simple end to end services have multiple components, each of which could be generating large amounts of monitoring metrics. You can very rapidly become overwhelmed with data. Defining thresholds is an important step in taming the data. Thresholds let you define which alerts are raised when a metric starts to breach its limit.

Smart Alerting – Building Intelligence into Alert Prioritization

The final aspect is building an intelligent rule-based alert processing system. Here, you identify which combinations of alerts need to trigger immediate action, which warrant delayed action and which are simply informational. For example, if a single disk fails in a multi-disk storage array, you may want to trigger a procurement request and a low priority support request for an engineer to replace the faulty disk, but there is no service issue. However, if multiple disks inside the same array fail, the priority of the fix and the urgency of the procurement increase.
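
The disk example above could be expressed as a simple rule, sketched here in Python. Everything in it is an assumption made for illustration: the Alert record, the "disk_failure" kind, the "low" / "high" labels and the cut-off of two failures per array are hypothetical and would need to be replaced with the rules that fit your own environment.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Alert:
    component: str  # e.g. the storage array that raised the alert
    kind: str       # e.g. "disk_failure"


def prioritize(alerts: List[Alert], escalate_at: int = 2) -> Dict[str, str]:
    """Assign a priority per component from the number of disk failures it reports."""
    failures = Counter(a.component for a in alerts if a.kind == "disk_failure")
    priorities = {}
    for component, count in failures.items():
        # A single failed disk is a routine replacement with no service issue;
        # several failures in the same array raise the priority.
        priorities[component] = "high" if count >= escalate_at else "low"
    return priorities


if __name__ == "__main__":
    incoming = [
        Alert("array-a", "disk_failure"),
        Alert("array-b", "disk_failure"),
        Alert("array-b", "disk_failure"),
    ]
    print(prioritize(incoming))
```

Grouping alerts by component before assigning a priority is what lets the rule distinguish one isolated failure from several related ones.
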
Smart, rule-based alert post-processing can help ensure focused and appropriate responses.

About Apigee

Apigee empowers enterprises to thrive in today’s digital world by transforming digital assets into innovation engines. Results are delivered as apps, optimized by data and catalyzed by APIs. Hundreds of companies, including Walgreens, eBay, Shell, Bechtel, Marks & Spencer and Vodafone, partner with Apigee to accelerate their digital transformations.

To learn more, visit www.apigee.com