How Microsoft Does IT: Monitoring Best Practices
Published January 2014

The following content may no longer reflect Microsoft’s current position or infrastructure. This content should be viewed as reference documentation only, to inform IT business decisions within your own company or organization.

In this paper, Microsoft IT discusses how it has centralized the business of systems monitoring into a set of core and dedicated services that enable business units and other internal teams to focus on adding value to their customers.

Situation
With over 35,000 servers to manage, over 1,200 line-of-business applications and services to monitor, and six business process units to engage, Microsoft IT needed to create a functioning monitoring system to meet internal service levels and to provide internal customers with a valuable monitoring service.

Solution
Microsoft IT designed a new monitoring platform based on Microsoft System Center 2012 that centralizes the organization’s monitoring efforts and offers two monitoring service offerings that cater to different types of internal customers with unique monitoring requirements.

Benefits
• Mitigated operational risk
• Global perspective on system availability
• Faster time to respond and resolve issues
• Streamlined development

Products and Technology
• System Center 2012 SP1 and R2 Operations Manager
• Windows Azure
• Windows Server 2012

Technical Case Study

Situation
As part of Microsoft’s transformation to a services and devices company, Microsoft Information Technology (Microsoft IT) is shifting its focus from technology to maximizing the organization’s value to the business. One key area where Microsoft IT is working to enhance its services is in systems monitoring. With currently over 35,000 servers to manage and monitor, the Service Delivery and Operations team within Microsoft IT uses a variety of tools and processes to generate and collect alerts.
The challenge with monitoring large numbers of servers is in the proper tuning of alerts. A few years ago, Microsoft IT was receiving more alerts than were needed or could be responded to properly, and the alert stream was filled with superfluous data that was degrading the reliability of the core monitoring function. Microsoft IT recognized that developing a better-tuned alert stream was critical to supporting the company’s growth. Moreover, by ensuring that only actionable alerts were sent to console operators, Microsoft IT could add value to business units (BUs) by delivering the data they needed to make good data-driven decisions.

Solution
Microsoft IT decided to build a new monitoring platform to centralize the organization’s monitoring efforts. Based on System Center 2012, the new platform had to support two service offerings, with each service catering to a different type of internal customer that had a unique set of monitoring requirements (see Figure 1).

Figure 1. Microsoft IT's new monitoring platform caters to two different customer types.

Design Principles
The following list summarizes the key design principles Microsoft IT followed to build the monitoring platform.
• Build the right kind of monitoring service for each customer type. For Microsoft IT, this meant providing service offerings that cater to its monitoring consumers: infrastructure incident and application incident management teams.
• Keep the system actionable through governance and tuning. Ensuring a clean alert stream was, in turn, derived from applying the following principles:
o Only generate alerts that console operators can respond to and that can be resolved at the console within a 15-minute window; otherwise, escalate to a line two team.
o Only send alerts that contain a published troubleshooting guide.
o The troubleshooting guide should never take an operator longer than Microsoft IT’s response service level agreement (SLA) of 15 minutes before requiring a ticket.
o Narrow the number of performance and event collections down to the few key items that are used to guide performance and problem management discussions.
o Override all Management Pack performance collections that aren’t being used for alerting, reporting, or analysis by setting them to not collected. Similarly, govern state data and event collection to reduce system overhead.
• Keep the platform up to date. Benefit from new product features and improvements by evaluating and installing the latest quarterly System Center update release, Management Pack update, and Windows version or Service Pack.

The Monitoring Platform Services
As mentioned previously, Microsoft IT’s new monitoring platform needed to offer different levels of monitoring to support different customers’ needs: core OS and dedicated. Each service is described in the following sections.

Core Global Monitoring Service
The Core Global Monitoring Service is focused on detecting availability issues with the operating systems and the health of the supporting server hardware through reactive monitoring. This service does so by using System Center 2012 Operations Manager to send alerts to a centralized infrastructure incident response team whenever the core OS or hardware stops running. This environment is also used to collect a standard set of performance data from each server. The performance data is then provided to the server and application owners so that they can identify where their systems are under- or over-utilized and perform capacity management. Alerts generated by these agents are used by the following teams:
• Incident Management: Uses a set of alerts that allows them to know when an OS is offline or at risk of being offline.
• Problem Management: Collects event and state data concerning trends in the data center to help the team identify which 50 servers are rebooting the most frequently or are experiencing the biggest performance issues.

Figure 2 below highlights some of the key characteristics of the Core Global Monitoring Service.

Figure 2. Key characteristics of the Core Global Monitoring Service.

Dedicated Monitoring Platforms
The Dedicated Monitoring Platform service was primarily designed for the business units (BUs) (such as Corporate Functions IT) that perform specialized business functions for various business units (such as HR). Each BU is allocated its own dedicated Operations Manager platform that Microsoft IT designs, builds, and operates. In this case, the customer BU is responsible for defining what is in the alert suite and how notifications are sent to the incident response teams; Microsoft IT’s role is to ensure that these business-critical platforms are continuously available for monitoring.

Dedicated Monitoring customers are supported by a centralized helpdesk, which uses an alert noise reduction program to identify the top “talkers” in the environment that create either the most alerts or the most support tickets. Ticket data is joined with data from Operations Manager to help Microsoft IT identify the rules and monitors that produce both the best and worst outcomes in terms of actionability (such as, Was the trouble ticket generated from an alert closed with a productive outcome?), and to find any gaps in the alerting that, if filled, may improve the customer experience by reducing operational risk.

Figure 3 below highlights some of the key characteristics of the Dedicated Monitoring Service.

Figure 3. Key characteristics of the Dedicated Monitoring Service.

Implementation
Topology
The monitoring platform uses System Center 2012 to manage both on-premises assets and cloud-based assets running in Windows Azure.
The bulk of the content and building blocks for monitoring come from the retail Management Packs, which are provided by Microsoft and various third parties (such as the hardware vendors for servers, among others). Microsoft IT’s deployment and upgrade of System Center 2007 servers aligned very closely to the deployment and upgrade guidance provided by the System Center 2012 product group.

Some of the key features of the topology include:
• Databases are either clustered or configured with SQL AlwaysOn for business continuity across multiple data centers.
• Management servers manage between 1,500 and 2,000 agents.
• Gateway servers are relied on to reach into highly secure network environments and across complex Active Directory forest relationships.
• When scale requires breaking agents up into multiple management groups, the agents should be grouped into broad functional groups (for example, production vs. non-production).

Platform Upgrading
When a newer version of System Center 2012 became available, Microsoft IT followed the out-of-the-box instructions to upgrade the platform’s System Center 2012 management services and databases. In addition to abiding by the key principle to keep the platform up to date, incorporating System Center 2012 SP1 and beyond into the monitoring platform enabled Microsoft IT to implement the following new features into its monitoring services:
• Application performance management (APM): Incorporated into the application component level of monitoring, APM gives Microsoft IT deep insight into application performance and exceptions.
• Global Service Monitoring (GSM): Provides operators with an at-a-glance global view of the health status of all monitored servers from a customer point of view.
• Visual Studio Team Foundation (VSTF) integration: Enables customers to forward an alert with debug information to VSTF for a developer to respond to.
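
As a rough illustration of the topology guideline that each management server handles between 1,500 and 2,000 agents, the following Python sketch estimates how many management servers a given agent population would need. It is a back-of-the-envelope calculation, not Microsoft IT's sizing method: the function names are hypothetical, it assumes one agent per monitored server, and it ignores gateway servers, redundancy, and the split into multiple management groups.

```python
def management_servers_needed(agent_count: int, agents_per_server: int) -> int:
    """Return the minimum number of management servers for agent_count agents.

    Hypothetical helper: assumes every management server carries the same
    load and that no spare capacity is reserved for failover.
    """
    if agent_count < 0 or agents_per_server <= 0:
        raise ValueError("agent_count must be >= 0 and agents_per_server > 0")
    # Integer ceiling division avoids floating-point rounding.
    return -(-agent_count // agents_per_server)


def sizing_range(agent_count: int, low: int = 1500, high: int = 2000) -> tuple:
    """Return (fewest, most) management servers for the per-server range.

    The defaults come from the 1,500 to 2,000 agents-per-server figure
    quoted in the topology list; everything else is an assumption.
    """
    fewest = management_servers_needed(agent_count, high)
    most = management_servers_needed(agent_count, low)
    return fewest, most


if __name__ == "__main__":
    # 35,000 is the server count cited in this paper; treating it as the
    # agent count is an assumption made only for this example.
    fewest, most = sizing_range(35_000)
    print(f"Estimated management servers: {fewest} to {most}")
```

Under those assumptions, 35,000 agents would call for roughly 18 to 24 management servers. A real deployment would also need to account for high availability and for how agents are divided among management groups.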
Tip: For more information about the key features available in System Center 2012 R2, see http://technet.microsoft.com/en-us/library/dn249519.aspx; for System Center 2012 SP1, see http://technet.microsoft.com/en-us/library/jj649385.aspx.

Agent Upgrading
In this new platform, Microsoft IT uses the System Center 2012 Operations Manager SDK to push new installations and upgrades to agents. The administrative burden associated with the SDK is small because Microsoft IT uses it for remote agent administration, including a simple path to upgrade agents via built-in product features. This new upgrade process is a significant change from Microsoft IT’s previous use of System Center Configuration Manager packages to deploy and configure agents, which did not allow for remote administration. Having agent upgrades pushed out with the SDK via the Operations Manager console provides a “free” upgrade from old to new versions: as soon as Microsoft IT upgrades the infrastructure, the agents are automatically upgraded.

Alert Tuning and Validation
As described previously, Microsoft IT uses out-of-the-box management packs as a starting point for developing alerts and then uses a pre-production environment to validate their tuning. A key aspect of the process is to work with subject matter experts (SMEs) on an ongoing basis to ensure that new alerts introduced into the system are meaningful and actionable.

Figure 4. Alert tuning and validation life cycle.

The following steps summarize Microsoft IT’s alert development life cycle:
1. Identify the need for a new alert. Hallway conversations, major incident reviews, problem managers, and monthly operations reviews will keep a monitoring team busy.
2. Evaluate potential solutions. Determine whether an existing Management Pack will suffice, or if instead an ad hoc alert must be developed.
o If an appropriate alert is available in the Retail Management Pack, import it, making sure to turn off everything except for the few availability rules and monitors that have been determined to be vital. Note: Rather than alerting on potential problems, Microsoft IT focuses only on true up/down indicators that flag when a system or service is offline.
o If a custom ad hoc rule is necessary, author it using the Operations Console.
3. Validate the alert. Talk to the consuming team, in this case the incident management teams, to ensure readiness. Then deploy the vital few agents to a pre-production environment and tune them based on alert volumes and their actionability from the resulting tickets they generate.
4. Deploy into production systems. When future reactive incidents occur (such as when something fails and monitoring didn’t detect it), determine if the likelihood of future occurrence merits broad monitoring. If so, then determine the most basic indicator of the problem and deploy custom monitoring for that particular case.

Adding Automation
As the platform matured and the alert stream became more and more actionable, Microsoft IT continued to evolve the environment in the following stages:
1. Winnow alerts down to the vital few indicators of up/down availability.
2. Collect the vital performance data to enable capacity and utilization analysis.
3. Use System Center 2012 SP1’s Company Knowledge feature to supplement existing product knowledge that comes with the Management Pack with custom company knowledge, such as specialized procedures that are unique to the IT organization.
4. Review alert volumes regularly to find problem servers, problem clusters, or problem behaviors that need fixing. This can include better monitoring, better knowledge, more training, or better systems.
5. Use product and company knowledge as a guide for implementing automated diagnostics and recoveries within Operations Manager that will reduce the number of incidents that would otherwise require human intervention. Microsoft IT’s troubleshooting guide is the starting point to discover how much effort, complexity, and risk is involved in automating a given scenario.

After tuning the monitoring alerts and cutting down the set of alerts to the vital few, Microsoft IT saw an opportunity to add automation and deeper diagnostics to the environment to further reduce alert generation and streamline the overall response process. Some example areas where Microsoft IT incorporated automation into the monitoring platform include:
• Automating operator response: In cases where an existing monitor had a high rate of false alerts, Microsoft IT added Windows PowerShell scripts to the management pack to check on additional server states before generating a Computer Down alert.
• Drive space management: Another service Microsoft IT designed on top of the monitoring platform addresses the situation where systems’ C: drives were running out of capacity. Microsoft IT developed scripts to automate the cleaning out of temporary files, user profiles, and other cached information to free up space, all without requiring any operator input.

Benefits
Microsoft IT has derived the following benefits from implementing the monitoring platform:
• Mitigated operational risk: Microsoft IT sees this monitoring platform as a universal risk mitigation platform, because it is instrumental in helping the organization reduce the number of unexpected events that occur in its business-critical systems and to detect and react to issues before a significant customer impact occurs.
• Global perspective on system availability: Through the use of System Center 2012 SP1’s GSM features and the various synthetic transactions that are provided in the Management Pack templates, Microsoft IT has gained an unprecedented global view of application performance and availability. This insight extends from the assets all the way through to the customer experience, and it is especially helpful in providing application teams with dedicated monitoring environments that cater to the needs of their business.
• Faster time to respond and resolve issues: Implementing a tuned alert stream enables Microsoft IT to identify and respond to issues (and potential issues) more quickly, which helps keep business-critical systems running and improves customer satisfaction.
• Streamlined development: Microsoft IT uses retail Management Pack templates to speed development and deployment of new alerts and monitoring interfaces. As newer versions of the Management Packs are released, Microsoft IT benefits by leveraging new features or support that the updated Management Packs provide. When customization is required, Microsoft IT first relies on the Management Pack templates, which ensure forward compatibility. In the few cases where extending the product beyond Management Packs is required, Microsoft IT uses the Operations Manager SDK or other tools (such as the Visual Studio Authoring Extensions) to integrate development and maintenance of those solutions into source control.

Best Practices
Microsoft IT followed these best practices when designing and implementing its monitoring platform.

System Design
• Promote internal collaboration among all teams involved. Monitoring is relied on by many different teams in an IT organization for proactive and reactive alerting and for data collection. All these stakeholders must provide input at an early stage and work together to design a system that fulfills all key criteria.
• Identify what your customers need.
Identify the types of service offerings your organization requires from your monitoring platform to ensure different types of customers can use the platform at the appropriate level (OS, dedicated platform, or otherwise). Microsoft IT found that its customers are best served by a centrally managed and customer-run alerting system. By providing teams with dedicated environments, each can engage with Operations Manager more completely and build out a solution that meets their needs without interfering with, or being constrained by, the needs of other teams.
• Determine the optimal agent upgrading mechanism for your environment. Microsoft IT uses the documented processes for upgrading management groups and the push install method (via the UI, PowerShell, or the SDK) to push and maintain agents.
• Ensure that the monitoring platform is a key aspect of operational risk strategy. Microsoft IT’s priorities for the platform were operational risk mitigation and cost avoidance.
• Prioritize your operational efforts. Microsoft IT established three key areas of concentration and precedence:
1. Focus on availability monitoring first. This could purely be up/down indicators (events or synthetic transactions) and for some customers won’t be more than three to five alerts or performance collections per technology. Use the volume of alerts that are generated by workflows to rank the workflows that need to be reviewed and tuned.
2. Focus on performance data collection second. Get the right data into the warehouse and establish reports and routines for reviewing it. Microsoft IT has found this type of information to be much more actionable than alerting on performance data.
3. Focus on synthetic transactions last.
After the vital alerts are in place and key performance data is being collected, determine the endpoints that must be tested on a regular basis and use the APM features and Management Pack templates built into Operations Manager to run regular transactions against these endpoints. The results of these transactions can be used both for availability monitoring (such as, Did the transaction fail?) and performance monitoring (such as, How long did the transaction take?).
• Invest in automation to help you scale. Automation can help streamline operations at any level, but it is especially effective at scale. After your monitoring platform grows beyond 10 management groups or 20,000 agents, consider creating automated agent deployment processes to keep your agents deployed and healthy. For example, if your organization already has a robust troubleshooting guide that people execute consistently for a given alert, consider converting it into an automated recovery process.

Alert Design
• Always start with the Retail Management Packs or MP templates. Only build custom alerts where it absolutely cannot be avoided.
• Keep the system available by maintaining a clean alert stream.
• Define your alerts on actionable events. Microsoft IT required its alerts to be actionable by a console operator within a 15-minute window.
• Narrow down the number of performance counters. Stick to the few key items that are used to guide performance and problem management discussions. By focusing on the top 20 most important performance counters, Microsoft IT was able to winnow out the bulk of the less valued data that was reducing the system’s efficacy and performance.
• Work with the teams who are your data customers to ensure you are providing good data that enables them to make good decisions. Microsoft IT worked closely with the Operations and Platforms & Services teams to ensure the platform was delivering the right type of data.
• Assign a cost/benefit to each requested alert.
The benefit must outweigh the cost to the system and operators.

Operations
• Set clear ownership boundaries. The team that responds to the alerts must own the definition of the monitoring and ideally have full control over its configuration.
• Specialize roles in multi-Operations Manager environments. If your organization needs multiple Operations Manager deployments, then consider having a set of individuals who specialize in running the platform. You can then delegate the definition and tuning of alerts, and the reaction to them, as needed for each environment.
• Know when to go horizontal or vertical in your monitoring. Be aware of when monitoring needs to be horizontal for one-size-fits-all situations (such as with availability monitoring) versus vertical, customized for a particular environment or application (such as with performance monitoring).

Resources
Microsoft IT Improves Operational Efficiency and Reduces Costs with Microsoft System Center
Microsoft IT Showcase

For More Information
For more information about Microsoft products or services, call the Microsoft Sales Information Center at (800) 426-9400. In Canada, call the Microsoft Canada Order Centre at (800) 933-4750. Outside the 50 United States and Canada, please contact your local Microsoft subsidiary. To access information via the World Wide Web, go to:
http://www.microsoft.com
http://www.microsoft.com/microsoft-IT

© 2014 Microsoft Corporation. All rights reserved. Microsoft and Windows are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries. The names of actual companies and products mentioned herein may be the trademarks of their respective owners. This document is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY.