How Microsoft Does IT: Monitoring Best Practices

Published January 2014
The following content may no longer reflect Microsoft’s current position or infrastructure. This
content should be viewed as reference documentation only, to inform IT business decisions
within your own company or organization.
In this paper, Microsoft IT discusses how it has centralized the business of
systems monitoring into a set of core and dedicated services that enable
business units and other internal teams to focus on adding value to their
customers.
Situation
With over 35,000 servers to manage, over 1,200 line-of-business applications and services to
monitor, and six business process units to engage, Microsoft IT needed to create a functioning
monitoring system to meet internal service levels and to provide internal customers with a
valuable monitoring service.
Solution
Microsoft IT designed a new monitoring platform based on Microsoft System Center 2012 that
centralizes the organization’s monitoring efforts and offers two monitoring service offerings that
cater to different types of internal customers with unique monitoring requirements.
Benefits
• Mitigated operational risk
• Global perspective on system availability
• Faster time to respond and resolve issues
• Streamlined development
Products and Technology
System Center 2012 SP1 and R2 Operations Manager
Windows Azure
Windows Server 2012
2 | Technical Case Study
Situation
As part of Microsoft’s transformation to a services and devices company, Microsoft Information
Technology (Microsoft IT) is shifting its focus from technology to maximizing the organization’s value
to the business. One key area where Microsoft IT is working to enhance its services is in systems
monitoring. With currently over 35,000 servers to manage and monitor, the Service Delivery and
Operations team within Microsoft IT uses a variety of tools and processes to generate and collect
alerts.
The challenge with monitoring large numbers of servers is in the proper tuning of alerts. A few years
ago, Microsoft IT was receiving more alerts than were needed or could be responded to properly,
and the alert stream was filled with superfluous data that was impacting the reliability of the core
monitoring function.
Microsoft IT recognized that developing a better tuned alert stream was critical to supporting the
company’s growth. Moreover, by ensuring that only actionable alerts were sent to console operators,
Microsoft IT could add value to business units (BUs) by delivering the data they needed to make good
data-driven decisions.
Solution
Microsoft IT decided to build a new monitoring platform to centralize the organization’s monitoring
efforts. Based on System Center 2012, the new platform had to support two service offerings, with
each service catering to a different type of internal customer that had a unique set of monitoring
requirements (see Figure 1).
Figure 1. Microsoft IT's new monitoring platform caters to two different customer types.
Design Principles
The following list summarizes the key design principles Microsoft IT followed to build the monitoring
platform.
• Build the right kind of monitoring service for each customer type. For Microsoft IT, this meant
providing service offerings that cater to its monitoring consumers: infrastructure incident and
application incident management teams.
• Keep the system actionable through governance and tuning. Ensuring a clean alert stream was,
in turn, derived from applying the following principles:
o Only generate alerts that console operators can respond to and that can be resolved
at the console within a 15-minute window; otherwise, escalate to a line two team.
o Only send alerts that contain a published troubleshooting guide.
o The troubleshooting guide should never take an operator longer than Microsoft IT’s
response service level agreement (SLA) of 15 minutes before requiring a ticket.
o Narrow the number of performance and event collections down to the few key items that
are used to guide performance and problem management discussions.
o Override all Management Pack performance collections that aren’t being used for
alerting, reporting, or analysis by setting them to not collected. Similarly, govern state
data and event collection to reduce system overhead.
• Keep the platform up to date. Benefit from new product features and improvements by
evaluating and installing the latest quarterly System Center update release, Management Pack
update, and Windows version or Service Pack.
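The alert-governance principles above can be read as a simple routing rule. The following Python sketch is illustrative only: the `AlertRule` type, its fields, and the `route` function are hypothetical names introduced here, not part of Operations Manager or of Microsoft IT's implementation. Only the 15-minute figure comes from the text.

```python
from dataclasses import dataclass
from typing import Optional

# Response SLA stated in the design principles above.
RESPONSE_SLA_MINUTES = 15


@dataclass
class AlertRule:
    """Hypothetical description of a candidate alert (not an Operations Manager type)."""
    name: str
    guide_url: Optional[str]       # published troubleshooting guide, if one exists
    est_resolution_minutes: int    # estimated time for an operator to work the guide


def route(rule: AlertRule) -> str:
    """Return 'suppress', 'console', or 'escalate' for a candidate alert."""
    if not rule.guide_url:
        # No published troubleshooting guide: the alert should not be sent.
        return "suppress"
    if rule.est_resolution_minutes <= RESPONSE_SLA_MINUTES:
        # Resolvable at the console within the SLA window.
        return "console"
    # Otherwise hand off to a line two team.
    return "escalate"
```

For example, a rule with a guide and a 10-minute estimate routes to the console, while the same rule with a 45-minute estimate is escalated.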
The Monitoring Platform Services
As mentioned previously, Microsoft IT’s new monitoring platform needed to offer different levels of
monitoring to support different customers’ needs: core OS and dedicated. Each service is described in
the following sections.
Core Global Monitoring Service
The Core Global Monitoring Service is focused on detecting availability issues with the Operating
Systems and health of the supporting server hardware through reactive monitoring. This service does
so by using System Center 2012 Operations Manager to send alerts to a centralized infrastructure
incident response team whenever the core OS or hardware stops running.
This environment is also used to collect a standard set of performance data from each server. The
performance data is then provided to the server and application owners so that they can identify
where their systems are under- or over-utilized and perform capacity management.
Alerts generated by these agents are used by the following teams:
• Incident Management: Uses a set of alerts that allows them to know when an OS is offline or at
risk of being offline.
• Problem Management: Collects event and state data concerning trends in the data center to help
the team identify which 50 servers are rebooting the most frequently or are experiencing the biggest
performance issues.
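As a minimal sketch of the "top 50" analysis described above, the following Python function ranks servers by reboot count. It assumes the reboot events have already been exported as a flat list of server names, one entry per reboot; the function name and the input shape are assumptions made for illustration, not a description of how Microsoft IT queries its data.

```python
from collections import Counter


def top_rebooters(reboot_events, n=50):
    """Return the n most frequently rebooting servers as (server, count) pairs.

    reboot_events: iterable of server names, one entry per observed reboot.
    """
    return Counter(reboot_events).most_common(n)


# Hypothetical sample data for illustration.
sample_events = ["srv-a", "srv-b", "srv-a", "srv-c", "srv-a", "srv-b"]
worst_two = top_rebooters(sample_events, n=2)
```

The same counting approach applies to any event type where the goal is to find the handful of servers generating most of the volume.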
Figure 2 below highlights some of the key characteristics of the Core Global Monitoring Service.
Figure 2. Key characteristics of the Core Global Monitoring Service.
Dedicated Monitoring Platforms
The Dedicated Monitoring Platform service was designed primarily for business units (BUs), such
as Corporate Functions IT, that perform specialized business functions for other business units (such
as HR). Each BU is allocated its own dedicated Operations Manager platform that Microsoft IT
designs, builds, and operates.
In this case, the customer BU is responsible for defining what is in the alert suite and how notifications
are sent to the incident response teams; Microsoft IT’s role is to ensure that these business-critical
platforms are continuously available for monitoring.
Dedicated Monitoring customers are supported by a centralized helpdesk, which uses an alert noise
reduction program to identify the top “talkers” in the environment that either create the most alerts
or the most support tickets. Ticket data is joined with data from Operations Manager to help
Microsoft IT identify the rules and monitors that produce both the best and worst outcomes in terms
of actionability (such as, Was the trouble ticket generated from an alert closed with a productive
outcome?), and to find if there are any gaps in the alerting which may, if added, improve the customer
experience by reducing operational risk.
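The join between ticket data and alert data described above might look like the following Python sketch. The record shapes (`alert_id`, `rule`, `productive`) and the function name are hypothetical; the source does not describe the actual schema. The sketch counts alerts per rule, counts how many of those alerts produced a ticket, and counts how many tickets closed with a productive outcome, then lists the noisiest rules first.

```python
from collections import defaultdict


def rule_scorecard(alerts, tickets):
    """Join alerts to tickets and summarize actionability per rule.

    alerts:  list of dicts with keys 'alert_id' and 'rule'
    tickets: list of dicts with keys 'alert_id' and 'productive' (bool)
    Returns a list of (rule, stats) pairs, noisiest rule first.
    """
    outcome_by_alert = {t["alert_id"]: t["productive"] for t in tickets}
    stats = defaultdict(lambda: {"alerts": 0, "tickets": 0, "productive": 0})
    for alert in alerts:
        entry = stats[alert["rule"]]
        entry["alerts"] += 1
        if alert["alert_id"] in outcome_by_alert:
            entry["tickets"] += 1
            entry["productive"] += int(outcome_by_alert[alert["alert_id"]])
    return sorted(
        ((rule, dict(entry)) for rule, entry in stats.items()),
        key=lambda pair: pair[1]["alerts"],
        reverse=True,
    )
```

A rule with many alerts but few productive tickets is a candidate for tuning or removal; a rule whose tickets are almost always productive is one to keep.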
Figure 3 below highlights some of the key characteristics of the Dedicated Monitoring Service.
Figure 3. Key characteristics of the Dedicated Monitoring Service.
Implementation
Topology
The monitoring platform uses System Center 2012 to manage both on-premises assets and cloud-based assets running in Windows Azure. The bulk of the content and building blocks for monitoring
come from the retail Management Packs, which are provided by Microsoft and various third parties
(such as the hardware vendors for servers, among others). Microsoft IT’s deployment and upgrade of
System Center 2007 servers aligned very closely to the deployment and upgrade guidance provided
by the System Center 2012 product group.
Some of the key features of the topology include:
• Databases are either clustered or configured with SQL AlwaysOn for business continuity across
multiple data centers
• Management servers manage between 1,500 and 2,000 agents each
• Rely on gateway servers to reach into highly secure network environments and across complex
Active Directory forest relationships
• When scale requires breaking agents up into multiple management groups, the agents should be
grouped into broad functional groups (for example, production vs. non-production)
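As a rough worked example of the agents-per-management-server range above, the following Python sketch estimates how many management servers a given agent count implies. It assumes one agent per managed server and a single flat pool of agents, and it ignores gateway servers, redundancy, and the split into management groups; the source does not state how many management servers Microsoft IT actually runs.

```python
import math


def management_servers_needed(agent_count, agents_per_server):
    """Minimum number of management servers for a given per-server agent load."""
    return math.ceil(agent_count / agents_per_server)


# 35,000 is the server count quoted in this paper; one agent per server is an assumption.
fewest = management_servers_needed(35_000, 2_000)   # densest packing -> 18
most = management_servers_needed(35_000, 1_500)     # lightest packing -> 24
```

Under these assumptions the range works out to roughly 18 to 24 management servers before any allowance for failover capacity.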
Platform Upgrading
When a newer version of System Center 2012 became available, Microsoft IT followed the out-of-the-box instructions to upgrade the platform’s System Center 2012 management services and databases.
In addition to abiding by the key principle to keep the platform up to date, incorporating System
Center 2012 SP1 and beyond into the monitoring platform enabled Microsoft IT to implement the
following new features into its monitoring services:
• Application performance management (APM): Incorporated into the application component
level of monitoring, APM gives Microsoft IT deep insight into application performance and
exceptions.
• Global Service Monitoring (GSM): Provides operators with an at-a-glance global view of the
health status of all monitored servers from a customer point of view.
• Visual Studio Team Foundation (VSTF) integration: Enables customers to forward an alert with
debug information to VSTF for a developer to respond to.
Tip: For more information about the key features available in System Center 2012 R2, see
http://technet.microsoft.com/en-us/library/dn249519.aspx; for System Center 2012 SP1, see
http://technet.microsoft.com/en-us/library/jj649385.aspx.
Agent Upgrading
In this new platform, Microsoft IT uses the System Center 2012 Operations Manager SDK to push new
installations and upgrades to agents. The administrative burden associated with the SDK is small
because Microsoft IT uses it for remote agent administration—including a simple path to upgrade
agents via built-in product features.
This new upgrade process is a significant change from Microsoft IT’s previous use of System Center
Configuration Manager packages to deploy and configure agents, which did not allow for remote
administration. Having agent upgrades pushed out with the SDK via the Operations Manager console
provides a “free” upgrade from old to new versions: as soon as Microsoft IT upgrades the
infrastructure, the agents are automatically upgraded.
Alert Tuning and Validation
As described previously, Microsoft IT uses out-of-the-box management packs as a starting point for
developing alerts and then uses a pre-production environment to validate their tuning. A key aspect
of the process is to work with subject matter experts (SMEs) on an ongoing basis to ensure that new
alerts introduced into the system are meaningful and actionable.
Figure 4. Alert tuning and validation life cycle.
The following steps summarize Microsoft IT’s alert development life cycle:
1. Identify the need for a new alert. Hallway conversations, major incident reviews, problem
managers, and monthly operations reviews will keep a monitoring team busy.
2. Evaluate potential solutions. Determine whether an existing Management Pack will suffice,
or if instead an ad hoc alert must be developed.
o If an appropriate alert is available in the Retail Management Pack, import it, making sure
to turn off everything except for the few availability rules and monitors that have been
determined to be vital.
Note: As opposed to alerts that indicate a potential problem, Microsoft IT only focuses on
true up/down indicators that flag when a system or service is offline.
o If a custom ad hoc rule is necessary, author it using the Operations Console.
3. Validate the alert. Talk to the consuming team, in this case the incident management
teams, to ensure readiness. Then deploy the vital few agents to a pre-production
environment and tune them based on alert volumes and their actionability from the resulting
tickets they generate.
4. Deploy into production systems. When future reactive incidents occur (such as when
something fails and monitoring didn’t detect it), determine if the likelihood of future
occurrence merits broad monitoring. If so, then determine the most basic indicator of the
problem and deploy custom monitoring for that particular case.
Adding Automation
As the platform matured and the alert stream became more and more actionable, Microsoft IT
continued to evolve the environment in the following stages:
1. Winnow alerts down to the vital few indicators of up/down availability.
2. Collect the vital performance data to enable capacity and utilization analysis.
3. Use System Center 2012 SP1’s Company Knowledge feature to supplement existing product
knowledge that comes with the Management Pack with custom company knowledge, such
as specialized procedures that are unique to the IT organization.
4. Review alert volumes regularly to find problem servers, problem clusters, or problem
behaviors which need fixing. This can include better monitoring, better knowledge, more
training, or better systems.
5. Use product and company knowledge as a guide for implementing automated diagnostics
and recoveries within Operations Manager that will reduce the number of incidents that
would otherwise require human intervention. Microsoft IT’s troubleshooting guide is the
starting point to discover how much effort, complexity, and risk is involved in automating a
given scenario.
After tuning the monitoring alerts and cutting down the set of alerts to the vital few, Microsoft IT saw
an opportunity to add automation and deeper diagnostics to the environment to further reduce alert
generation and streamline the overall response process. Some example areas where Microsoft IT
incorporated automation into the monitoring platform include:
• Automating operator response: In cases where an existing monitor had a high rate of false alerts,
Microsoft IT added Windows PowerShell scripts to the management pack to check on additional
server states before generating a Computer Down alert.
• Drive space management: Another service Microsoft IT built on top of the monitoring
platform addresses the situation where systems’ C: drives were running out of capacity.
Microsoft IT developed scripts to automate the cleaning out of temporary files, user profiles, and
other cached information to free up space, all without requiring any operator input.
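The idea of checking additional server states before raising a Computer Down alert can be sketched as follows. Microsoft IT implemented this with Windows PowerShell scripts inside the management pack; the Python below is a stand-in used only to show the logic, and the function name and the shape of the checks are assumptions. Each secondary check is any callable that returns True when the host shows signs of life.

```python
def should_raise_computer_down(host, secondary_checks):
    """Return True only if no secondary check finds the host responding.

    secondary_checks: iterable of callables taking the host name and
    returning True when the host responds (hypothetical probes such as
    a ping or a service query).
    """
    for check in secondary_checks:
        try:
            if check(host):
                # Any sign of life means the primary monitor was a false alert.
                return False
        except Exception:
            # A probe that errors out is treated as "no response".
            continue
    return True
```

With this gate in place, a heartbeat failure alone does not page an operator; the alert is generated only when every additional probe also fails to reach the server.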
Benefits
Microsoft IT has derived the following benefits from implementing the monitoring platform:
• Mitigated operational risk: Microsoft IT sees this monitoring platform as a universal risk
mitigation platform, because it is instrumental in helping the organization reduce the number of
unexpected events that occur in its business-critical systems and to detect and react to issues
before a significant customer impact occurs.
• Global perspective on system availability: Through the use of System Center 2012 SP1’s GSM
features and the various synthetic transactions that are provided in the Management Pack
templates, Microsoft IT has gained an unprecedented global view of application performance and
availability. This insight extends from the assets all the way through to the customer experience, and
it is especially helpful in providing application teams with dedicated monitoring environments that
cater to the needs of their business.
• Faster time to respond and resolve issues: Implementing a tuned alert stream enables Microsoft
IT to identify and respond to issues (and potential issues) more quickly, which helps keep business-critical systems running and improves customer satisfaction.
• Streamlined development: Microsoft IT uses retail Management Pack templates to speed
development and deployment of new alerts and monitoring interfaces. As newer versions of the
Management Packs are released, Microsoft IT benefits by leveraging new features or support that
the updated Management Packs provide.
When customization is required, Microsoft IT first relies on the Management Pack templates, which
ensure forward compatibility. In the few cases where extending the product beyond Management
Packs is required, Microsoft IT uses the Operations Manager SDK or other tools (such as the Visual
Studio Authoring Extensions) to integrate development and maintenance of those solutions into
source control.
Best Practices
Microsoft IT followed these best practices when designing and implementing its monitoring platform.
System Design
• Promote internal collaboration among all teams involved. Monitoring is relied on by many
different teams in an IT organization for proactive and reactive alerting and for data collection. All these
stakeholders must provide input at an early stage and work together to design a system that fulfills
all key criteria.
• Identify what your customers need. Identify the types of service offerings your organization
requires from your monitoring platform to ensure different types of customers can use the
platform at the appropriate level (OS, dedicated platform, or otherwise). Microsoft IT found that its
customers are best served by a centrally managed and customer-run alerting system. By providing
teams with dedicated environments, each can engage with Operations Manager more completely
and build out a solution that meets their needs without interfering with, or being constrained by,
the needs of other teams.
• Determine the optimal agent upgrading mechanism for your environment. Microsoft IT uses
the documented processes for upgrading management groups and the push install method (via
the UI, PowerShell, or the SDK) to push and maintain agents.
• Ensure that the monitoring platform is a key aspect of operational risk strategy. Microsoft IT’s
priorities for the platform were operational risk mitigation and cost avoidance.
• Prioritize your operational efforts. Microsoft IT established three key areas of concentration and
precedence:
1. Focus on availability monitoring first. This could purely be up/down indicators (events
or synthetic transactions) and for some customers won’t be more than three to five alerts
or performance collections per technology. Use the volume of alerts that are generated
by workflows to rank the workflows that need to be reviewed and tuned.
2. Focus on performance data collection second. Get the right data into the warehouse
and establish reports and routines for reviewing it. Microsoft IT has found this type of
information to be much more actionable than alerting on performance data.
3. Focus on synthetic transactions last. After the vital alerts are in place and key
performance data is being collected, determine the endpoints that must be tested on a
regular basis and use the APM features and Management Pack templates built into
Operations Manager to run regular transactions against these endpoints. The results of
these transactions can be used both for availability monitoring (such as, Did the
transaction fail?) and performance monitoring (such as, How long did the transaction
take?).
• Invest in automation to help you scale. Automation can help streamline operations at any level,
but it is especially effective at scale. After your monitoring platform grows beyond 10 management
groups or 20,000 agents, consider creating automated agent deployment processes to keep your
agents deployed and healthy. For example, if your organization already has a robust
troubleshooting guide that people execute consistently for a given alert, consider converting it into
an automated recovery process.
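The synthetic transactions described in the prioritization list above answer two questions from a single run: did the transaction fail, and how long did it take. The Python sketch below shows that dual use in the simplest form. It is not the Operations Manager APM or Management Pack template mechanism; the function name, the `slow_threshold_seconds` parameter, and the result keys are assumptions for illustration, and the transaction is any callable that raises on failure.

```python
import time


def run_synthetic_transaction(transaction, slow_threshold_seconds=5.0):
    """Run one synthetic transaction and report availability and duration.

    transaction: callable that raises an exception when the endpoint fails.
    """
    start = time.monotonic()
    try:
        transaction()
        available = True
    except Exception:
        available = False
    elapsed = time.monotonic() - start
    return {
        "available": available,                    # availability: did it fail?
        "seconds": elapsed,                        # performance: how long did it take?
        "slow": elapsed > slow_threshold_seconds,  # simple performance flag
    }


# Hypothetical transactions: one that succeeds and one that fails.
result_ok = run_synthetic_transaction(lambda: None)
result_fail = run_synthetic_transaction(lambda: 1 / 0)
```

Run on a schedule against each key endpoint, the `available` field feeds availability alerting while the `seconds` field is better suited to collection and reporting than to alerting, in line with the priorities above.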
Alert Design
• Always start with the Retail Management Packs or MP templates. Only build custom alerts
where it absolutely cannot be avoided.
• Keep the system available by maintaining a clean alert stream. Define your alerts on actionable
events. Microsoft IT required its alerts to be actionable by a console operator within a half-hour
window.
• Narrow down the number of performance counters. Stick to the few key items that are used to
guide performance and problem management discussions. By focusing on the top 20 most
important performance counters, Microsoft IT was able to winnow out the bulk of the less valued
data that was reducing the system’s efficacy and performance.
• Work with the teams who are your data customers to ensure you are providing good data
that enables them to make good decisions. Microsoft IT worked closely with the Operations and
Platforms & Services teams to ensure the platform was delivering the right type of data.
• Assign a cost/benefit to each requested alert. The benefit must outweigh the cost to the system and operators.
Operations
• Set clear ownership boundaries. The team that responds to the alerts must own the definition of
the monitoring and ideally have full control over its configuration.
• Specialize roles in multi-Operations Manager environments. If your organization needs
multiple Operations Manager deployments, then consider having a set of individuals who
specialize in running the platform. You can then delegate the definition and tuning of alerts, and
the response to them, as needed for each environment.
• Know when to go horizontal or vertical in your monitoring. Be aware of when monitoring
needs to be horizontal for one-size-fits-all situations (such as with availability monitoring) versus
vertical monitoring that is customized for a particular environment or application (such as
with performance monitoring).
Resources
Microsoft IT Improves Operational Efficiency and Reduces Costs with Microsoft System Center
Microsoft IT Showcase
For More Information
For more information about Microsoft products or services, call the Microsoft Sales Information
Center at (800) 426-9400. In Canada, call the Microsoft Canada Order Centre at (800) 933-4750.
Outside the 50 United States and Canada, please contact your local Microsoft subsidiary. To access
information via the World Wide Web, go to:
http://www.microsoft.com
http://www.microsoft.com/microsoft-IT
© 2014 Microsoft Corporation. All rights reserved. Microsoft and Windows are either registered
trademarks or trademarks of Microsoft Corporation in the United States and/or other countries. The
names of actual companies and products mentioned herein may be the trademarks of their respective
owners. This document is for informational purposes only. MICROSOFT MAKES NO WARRANTIES,
EXPRESS OR IMPLIED, IN THIS SUMMARY.