Fixing Intermittent Application Performance Problems: Turning Hours into Minutes
Intermittent performance problems are among the most frustrating issues IT administrators face. The apparently random nature of occurrences, the near-impossibility of reproducing all conditions, and the complexity of applications and their dependencies make it difficult to diagnose root causes and resolve an issue quickly.
Administrators commonly resort to time-consuming manual event log searches and trial-and-error attempts to reproduce and diagnose problems. What they really need is a more efficient approach that replaces this resource-intensive and frustrating process, one that provides the tools and the insight to identify and resolve intermittent performance problems quickly.
Combating Inefficiencies
System administrators are trained problem solvers, but it really helps to have some idea where to start. Unfortunately, users usually can't give you much information to go on. They typically report only general symptoms, such as a slow application or an inexplicable disconnect, and are unlikely to know whether an intermittent performance problem is the cause.
Intermittent problems are especially difficult to diagnose because they
don’t last long enough to examine and solve. They appear, wreak havoc,
and vanish. The only trace evidence is captured in event logs, often buried
in large volumes of hard-to-connect data. Finding such evidence can be as
difficult as finding a needle in a haystack. The root cause of the problem can
be hidden almost anywhere because most business applications are complex
entities that interact with multiple resources, such as databases, web
servers, directory services, and the network itself. That complexity forces
the administrator through a slow, labor-intensive investigative process that
can delay other daily tasks and projects.
Without a clear lead to the cause of the problem, the system administrator is
forced to review event logs from every part of the application environment.
That process requires analyzing long lists of events in multiple logs item
by item to find an outstanding event, error condition, or combination of
conditions that correlate to the timeframe in which users began to complain.
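The manual correlation described above amounts to merging several time-ordered logs into one timeline and keeping only the events near the time users began to complain. The following Python sketch assumes each log has already been parsed into (timestamp, source, message) tuples sorted by time; the sample entries and the `events_in_window` helper are hypothetical, and real logs would need product-specific parsing first.

```python
from datetime import datetime, timedelta
from heapq import merge

# Hypothetical parsed log entries: (timestamp, source, message), each list sorted by time.
app_log = [
    (datetime(2013, 5, 6, 9, 58), "app", "Request queue length rising"),
    (datetime(2013, 5, 6, 10, 2), "app", "Request timed out"),
]
db_log = [
    (datetime(2013, 5, 6, 8, 30), "db", "Backup completed"),
    (datetime(2013, 5, 6, 10, 1), "db", "Low memory warning"),
]
web_log = [
    (datetime(2013, 5, 6, 10, 3), "web", "HTTP 503 returned"),
]


def events_in_window(logs, complaint_time, before, after):
    """Merge time-sorted logs into one timeline and keep events near the complaint."""
    start, end = complaint_time - before, complaint_time + after
    # heapq.merge lazily merges already-sorted inputs; tuples sort by timestamp first.
    return [event for event in merge(*logs) if start <= event[0] <= end]


complaint = datetime(2013, 5, 6, 10, 5)
for ts, source, message in events_in_window(
    [app_log, db_log, web_log], complaint, timedelta(minutes=10), timedelta(minutes=5)
):
    print(ts.strftime("%H:%M"), source, message)
```

Widening the window or adding more log sources quickly lengthens the list an administrator must read item by item, which is the burden described above.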
To isolate the problem, the system administrator may attempt to write test scripts, but complex applications can have such a wide variety of conditions that this trial-and-error process can take many hours, or even weeks.
In organizations large enough to employ specialized administrators for
various functions, a database administrator (DBA) or network administrator
might be called in to help isolate a problem, following the same onerous
analytical process as the system administrator. Scheduling another
administrator with a full workload to participate in this process without
proof that the problem lies in his or her area can consume a lot of time
unnecessarily. And if the issue requires the collaboration of multiple groups,
additional time will be wasted before the issue is resolved.
Even after the problem is solved, if it recurs, business operations will be disrupted again, user complaints will continue to roll in, and the pressure will build to find a more efficient solution, one that helps administrators overcome the challenges and expense of intermittent performance problems.
3 Common Underlying Causes That Complicate Intermittent Problem Resolution
The sporadic nature of a performance problem often makes diagnosis difficult. Keep your eyes open for these three common underlying causes of intermittent problems:
• Memory Leak: Microsoft Internet Information Services (IIS) for Windows® Server is a flexible, secure, and manageable web server. Applications commonly pull information from this server into local memory for processing. After processing, the application should release that memory, but that doesn't always happen. Unreleased memory accumulates until random access memory (RAM) is exhausted and cannot accept more data, forcing the server to page to disk. The time spent paging drastically reduces effective central processing unit (CPU) utilization, which in turn reduces application performance.
• Long-Running Query: Seemingly innocent acts, such as changing a schema or updating a stored procedure, can introduce blocking that results in a very long-running SQL query. Such queries sap CPU and memory and slow application performance.
• Wireless Latency: Many applications, especially older ones, do not tolerate the inherent latency of wireless networks. When these applications interact with databases and web servers over a wireless network, responses might not arrive within the time the application allows. The result is a timed-out application and an annoyed user.
The most useful Application Performance Monitoring tools can help diagnose these causes, and other sources of intermittent performance problems, quickly and accurately.
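Each of the three causes above leaves a recognizable signature in monitoring data. The Python sketch below shows one deliberately simple heuristic per cause; the sample values, thresholds, and function names are hypothetical assumptions chosen for illustration, not rules from any particular monitoring product.

```python
def looks_like_memory_leak(samples_mb, min_growth_mb):
    """Flag steady growth: memory never drops between samples and grows by at least min_growth_mb."""
    never_released = all(later >= earlier for earlier, later in zip(samples_mb, samples_mb[1:]))
    return never_released and samples_mb[-1] - samples_mb[0] >= min_growth_mb


def long_running_queries(query_durations_s, threshold_s):
    """Return (query, seconds) pairs whose duration exceeds threshold_s, slowest first."""
    slow = [(name, secs) for name, secs in query_durations_s.items() if secs > threshold_s]
    return sorted(slow, key=lambda item: item[1], reverse=True)


def timeout_rate(latencies_ms, app_timeout_ms):
    """Fraction of round trips slower than the application's (assumed per-request) timeout."""
    return sum(1 for lat in latencies_ms if lat > app_timeout_ms) / len(latencies_ms)


# Hypothetical monitoring samples.
iis_worker_memory_mb = [410, 455, 520, 580, 640, 715]
queries = {"GetOpenOrders": 0.4, "RefreshCustomerView": 95.0, "NightlyRollup": 38.0}
wireless_rtt_ms = [40, 55, 1200, 60, 2500, 45, 70, 3100]

print(looks_like_memory_leak(iis_worker_memory_mb, min_growth_mb=200))  # True
print(long_running_queries(queries, threshold_s=30))
print(timeout_rate(wireless_rtt_ms, app_timeout_ms=1000))  # 0.375
```

The strict never-decreasing test for a leak is a simplification; real memory data is noisy, so a practical check would look at the trend over a longer window.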
Application Performance Monitoring Speeds Resolution
The greatest challenge in diagnosing an intermittent performance problem is to know where to begin. Is the root
cause in the application itself, or in a database or web server? Or is it a network issue? Without a valid starting point,
there is no way to select the right diagnostic path and conduct an efficient analysis.
An Application Performance Monitoring (APM) solution that lets administrators link all application dependencies can
overcome this impasse. In this environment, targeted, real-time monitoring immediately puts administrators on the
right diagnostic path, and clear graphic displays make it easy to follow that path to the root cause of the performance
problem.
Create Profiles
Figure 1 is an example of an IT environment for a customer relationship management (CRM) service. It includes
multiple applications running on multiple servers. Any issue that affects the performance of any application in this
environment, such as a database problem or a Microsoft® Internet Information Services (IIS) server problem, can
affect the performance of the CRM service. Consequently, diagnosing an intermittent performance problem in the CRM
service requires monitoring all the elements that make up that service and presenting that information in a high-level
view that provides a starting point for analysis.
Figure 1. The CRM Service IT infrastructure consists of multiple components.
An Application Performance Monitor uses application profiles to diagnose intermittent application performance problems. Application profiles define how an application is monitored and what actions should be taken when an application or one of its components fails. The most useful Application Performance Monitors not only include application profiles but also define complex relationships and dependencies, from simple n-tier applications to large server farms and complete IT services.
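Conceptually, an application profile pairs a set of monitors with the actions to take when one of them fails. The Python sketch below models that idea with hypothetical `Monitor` and `ApplicationProfile` classes; the metric names, thresholds, and logging action are illustrative assumptions, not any product's actual profile format.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Monitor:
    """Hypothetical monitor: fails when the named metric exceeds max_value."""
    metric: str
    max_value: float


@dataclass
class ApplicationProfile:
    """Hypothetical profile: what to monitor and which actions run on failure."""
    name: str
    monitors: List[Monitor]
    on_failure: List[Callable[[str, Monitor, float], None]] = field(default_factory=list)

    def evaluate(self, readings: Dict[str, float]) -> List[Monitor]:
        """Check each monitor against current readings and run the failure actions."""
        failed = []
        for monitor in self.monitors:
            value = readings.get(monitor.metric)  # a missing reading is skipped, not failed
            if value is not None and value > monitor.max_value:
                failed.append(monitor)
                for action in self.on_failure:
                    action(self.name, monitor, value)
        return failed


def log_event(profile_name: str, monitor: Monitor, value: float) -> None:
    print(f"[{profile_name}] {monitor.metric}={value} exceeds {monitor.max_value}")


sql_profile = ApplicationProfile(
    name="SQL server instance",
    monitors=[Monitor("memory_used_pct", 90.0), Monitor("avg_query_seconds", 5.0)],
    on_failure=[log_event],
)
sql_profile.evaluate({"memory_used_pct": 97.5, "avg_query_seconds": 1.2})
```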
For example, a SQL server farm is one of the components that requires monitoring in the CRM service shown in Figure 1. A typical SQL server configuration consists of an online transaction processing (OLTP) server, a staging server, and a data warehousing server. The administrator creates an application profile and assigns it to monitor each SQL server instance. These individual profiles are then embedded into a higher-level profile that monitors the entire SQL server farm (Figure 2).
Once the server farm profile is created, it can be embedded into an even higher-level profile that encompasses the
entire CRM service (Figure 3). Replicating this process for each IT service component, such as the IIS server farm,
creates a comprehensive service profile that’s the key to a quick diagnosis of a problem’s root cause. The profile
ensures that the administrator can view the status of the entire service or drill down to any component within that
service, down to a specific instance or component of an application.
The resulting comprehensive service monitoring profile is the foundation for fast, accurate problem diagnoses.
Figure 2. The server application profile is embedded into the server farm profile.
Figure 3. The server farm profile is embedded into the CRM service profile.
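The nesting shown in Figures 2 and 3 is essentially a tree: each profile either monitors a component directly or embeds lower-level profiles. The Python sketch below assumes a simple roll-up rule, where a profile is down if anything embedded in it is down (real products may use more nuanced rules, such as tolerating one failed server in a farm), and shows how the same tree supports drilling down from the service to the failing component. The `Profile` class and component names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

UP, DOWN = "up", "down"


@dataclass
class Profile:
    """Hypothetical monitoring profile: a monitored component or a container of profiles."""
    name: str
    status: str = UP  # status reported by this profile's own monitors
    children: List["Profile"] = field(default_factory=list)  # embedded lower-level profiles

    def rolled_up_status(self) -> str:
        """Assumed roll-up rule: down if this profile or any embedded profile is down."""
        if self.status == DOWN:
            return DOWN
        return DOWN if any(c.rolled_up_status() == DOWN for c in self.children) else UP

    def drill_down(self) -> List[str]:
        """Follow failing embedded profiles until no deeper failure remains."""
        path, node = [self.name], self
        while True:
            failing = [c for c in node.children if c.rolled_up_status() == DOWN]
            if not failing:
                return path
            node = failing[0]  # follow the first failing child
            path.append(node.name)


# Hypothetical profiles mirroring Figures 1 to 3.
sql_farm = Profile("SQL server farm", children=[
    Profile("OLTP server"),
    Profile("Staging server"),
    Profile("Data Warehouse server", status=DOWN),
])
iis_farm = Profile("IIS server farm", children=[Profile("IIS server 1"), Profile("IIS server 2")])
crm_service = Profile("CRM service", children=[sql_farm, iis_farm])

print(crm_service.rolled_up_status())  # down
print(" -> ".join(crm_service.drill_down()))  # CRM service -> SQL server farm -> Data Warehouse server
```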
Diagnose Problems
Completing a service profile such as the CRM example generally takes less than two hours. After that small investment of time, diagnosing an intermittent performance problem can shrink from hours, days, or even weeks of frustration to a straightforward process that takes just minutes. Multiply the time saved per incident by the number of application performance incidents you handle now, and the time you gain back in your day can be considerable.
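As a rough, back-of-the-envelope model of that claim (an illustration, not a measured result), the time recovered over N incidents can be estimated as follows, where t_manual and t_APM are the average diagnosis times without and with monitoring and t_profile is the one-time cost of building the service profile:

```latex
T_{\text{saved}} \approx N \,\bigl(t_{\text{manual}} - t_{\text{APM}}\bigr) - t_{\text{profile}}
```

With purely hypothetical figures of N = 10 incidents, t_manual = 4 h, t_APM = 0.25 h, and t_profile = 2 h, the estimate is 10 × 3.75 h − 2 h = 35.5 h.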
For example, the Application Performance Monitor might issue a real-time alert that Microsoft Dynamics is down. The administrator can then view a high-level profile of the CRM service and see that there is a problem in the supporting SQL server farm (Figure 4). The problem area is shown in red, making it easy to identify.
Figure 4. In this example, you can quickly determine that Microsoft Dynamics is down due to a problem in the supporting server farm.
The administrator can then drill down into the SQL server farm data and see that the problem lies in the Data Warehouse server (Figure 5). Isolating the problem to a single server points to the next step in the diagnostic process: a detailed look at the Data Warehouse server status.
Figure 5. Drill-down analysis shows there's a problem with the Data Warehouse Server.
Clicking the Data Warehouse server's red square opens its status view (Figure 6). You can immediately see that the root cause of the problem, which the user experienced as a Microsoft Dynamics failure, is actually a memory issue in the Data Warehouse server. With only a few mouse clicks, you've determined the root cause of the Microsoft Dynamics failure.
Figure 6. The Data Warehouse Server profile reveals the root cause of the problem.
A historical graph of the Memory Manager metrics (Figure 7) provides further evidence: the memory issue appears at the same times that users report Microsoft Dynamics failures.
Figure 7. Intermittent Microsoft Dynamics failures correlate with Data Warehouse Server memory spikes over time.
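A simple way to check the kind of correlation Figure 7 shows is to ask, for each user complaint, whether a memory spike occurred within a short window around it. The Python sketch below does that; the sample readings, the 95% spike threshold, the 30-minute window, and the `complaints_near_spikes` helper are all hypothetical. Coincidence in time supports, but does not by itself prove, a causal link.

```python
from datetime import datetime, timedelta

# Hypothetical samples for the Data Warehouse server: (time, memory used as % of RAM).
memory_samples = [
    (datetime(2013, 5, 6, 9, 0), 62.0),
    (datetime(2013, 5, 6, 10, 0), 96.5),
    (datetime(2013, 5, 6, 11, 0), 64.0),
    (datetime(2013, 5, 6, 14, 0), 97.8),
    (datetime(2013, 5, 6, 15, 0), 61.0),
]
# Hypothetical times at which users reported Microsoft Dynamics failures.
complaints = [
    datetime(2013, 5, 6, 10, 10),
    datetime(2013, 5, 6, 14, 20),
    datetime(2013, 5, 6, 16, 45),
]


def complaints_near_spikes(samples, complaint_times, spike_pct, window):
    """Pair each complaint with whether memory reached spike_pct within +/- window of it."""
    spike_times = [t for t, pct in samples if pct >= spike_pct]
    return [(c, any(abs(c - s) <= window for s in spike_times)) for c in complaint_times]


matches = complaints_near_spikes(
    memory_samples, complaints, spike_pct=95.0, window=timedelta(minutes=30)
)
for when, near_spike in matches:
    print(when.strftime("%H:%M"), "memory spike nearby" if near_spike else "no spike nearby")
print(f"{sum(near for _, near in matches)} of {len(matches)} complaints coincide with a memory spike")
```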
But what happens if a user reports a Microsoft Dynamics performance problem, and a quick look at the high-level CRM
service profile reveals that all the system components are working properly? In this case, the problem very likely lies
in the network itself. Many applications were not designed with wireless networks in mind. These applications tend to
be intolerant of the delays and latencies that wireless can introduce. As a result, wireless users can experience timeouts, disconnects, and data loss.
Extending Application Performance Monitoring to the network can help administrators identify the root cause of a network problem as easily as the server problem in the example above.
To determine if there is a network root cause to the Microsoft Dynamics performance problem, for example, an
administrator can start by determining whether the user was on a wired or wireless network. If the answer is a
wireless network, one obvious culprit is an oversubscribed access point.
A quick look at a graphic showing access point usage confirms that the access point is indeed oversubscribed (Figure 8). But knowing that doesn't solve the problem. Why is the access point oversubscribed, and if the problem is intermittent, what triggers it? Is network traffic from a monthly meeting in a nearby conference room temporarily overloading the access point? Or are employees disregarding IT policy and overloading it by streaming unauthorized applications such as Pandora or YouTube? Determining the root cause is critical to implementing a cost-effective solution: if policy violations are to blame, enforcing bandwidth usage policies more strictly might make the purchase of additional access points unnecessary.
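To separate a recurring overload (such as a monthly meeting) from a constant one, it helps to look at when the access point is oversubscribed rather than only whether it is. The Python sketch below groups oversubscribed samples by weekday and hour; the client counts, the 25-client limit, the sample dates, and the `oversubscribed_slots` helper are hypothetical assumptions.

```python
from collections import Counter
from datetime import datetime

DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")

# Hypothetical samples for one access point: (time, number of associated clients).
ap_samples = [
    (datetime(2013, 4, 1, 9, 0), 12),
    (datetime(2013, 4, 1, 10, 0), 31),
    (datetime(2013, 4, 1, 11, 0), 14),
    (datetime(2013, 5, 6, 10, 0), 34),
    (datetime(2013, 5, 6, 13, 0), 11),
    (datetime(2013, 5, 7, 10, 0), 15),
]


def oversubscribed_slots(samples, max_clients=25):
    """Count samples above max_clients by (weekday, hour) so recurring overloads stand out."""
    return Counter(
        (DAY_NAMES[t.weekday()], t.hour) for t, clients in samples if clients > max_clients
    )


print(oversubscribed_slots(ap_samples))  # Counter({('Monday', 10): 2})
```

In this made-up data, both overloads fall on the first Monday of the month at 10:00, the kind of pattern a recurring meeting would produce.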
Figure 8. An oversubscribed access point is related to the performance problem.
Through the same drill-down process used to isolate the SQL server farm problem, an administrator can determine
who was using the access point when Microsoft Dynamics experienced the performance problem, and how much
bandwidth each person was using. Figure 9 shows that one person was clearly using an excessive amount of
bandwidth. A final drill-down shows that the vast majority of the bandwidth consumed by that user was taken by an unclassified application, most likely one that IT policy does not allow (Figure 10).
Figure 9. The administrator can view access point usage by user and bandwidth.
Figure 10. An unclassified application is consuming an excessive amount of bandwidth.
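The two drill-downs in Figures 9 and 10 are aggregations over the same traffic records: first by user, then by application for the top user. The following Python sketch assumes flow records of the form (user, application, megabytes) have already been collected for the access point during the incident; the records and the `usage_by` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical flow records seen on the access point during the incident:
# (user, application, megabytes transferred).
flows = [
    ("alice", "Microsoft Dynamics", 12.0),
    ("bob", "Microsoft Dynamics", 9.5),
    ("bob", "Email", 3.0),
    ("carol", "Microsoft Dynamics", 8.0),
    ("carol", "Unclassified", 410.0),
    ("carol", "Web browsing", 22.0),
]


def usage_by(records, key_index):
    """Total megabytes per key (0 = user, 1 = application), largest first."""
    totals = defaultdict(float)
    for record in records:
        totals[record[key_index]] += record[2]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# First drill-down: bandwidth per user.
top_user, top_user_mb = usage_by(flows, 0)[0]
print(f"Top user: {top_user} ({top_user_mb} MB)")

# Second drill-down: which applications account for that user's traffic?
per_app = usage_by([r for r in flows if r[0] == top_user], 1)
print(per_app)
```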
Automated Remediation
Once the root causes of intermittent application performance problems are identified, system administrators can use the Application Performance Monitor to create multi-step action policies that address future occurrences more quickly. Action policies can include event logging, real-time alerts, and PowerShell self-healing scripts, such as scripts that restart a service or reboot a server. Action policies can be assigned at the service, application, and component levels. Because application profiles are dependency-aware, action policies can be coordinated across multiple tiers to keep complex applications and IT services performing well.
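A multi-step action policy can be thought of as an ordered escalation: record the event, notify someone, then try progressively stronger self-healing steps until the problem clears. The Python sketch below assumes that model; the step functions are hypothetical stand-ins that only print and return simulated results, whereas a real policy would run PowerShell scripts (for example, ones using the `Restart-Service` or `Restart-Computer` cmdlets) and then re-check the monitor.

```python
from typing import Callable, List, Tuple

# Each step is (description, action); an action returns True if the problem is resolved.
Step = Tuple[str, Callable[[], bool]]


def run_action_policy(steps: List[Step]) -> List[str]:
    """Run steps in order, stopping at the first one that resolves the problem."""
    executed = []
    for description, action in steps:
        executed.append(description)
        if action():
            break
    return executed


def log_event() -> bool:
    print("Logged: Data Warehouse memory above threshold")
    return False  # logging records the event but fixes nothing


def send_alert() -> bool:
    print("Alert sent to on-call administrator")
    return False  # alerting notifies but fixes nothing


def restart_service() -> bool:
    # Hypothetical stand-in: a real policy might run a PowerShell script here.
    print("Restarting service...")
    memory_used_pct = 58.0  # simulated reading after the restart
    return memory_used_pct < 90.0


def reboot_server() -> bool:
    # Hypothetical stand-in for a reboot script; only reached if the restart fails.
    print("Rebooting server...")
    return True


policy = [
    ("log", log_event),
    ("alert", send_alert),
    ("restart service", restart_service),
    ("reboot", reboot_server),
]
print(run_action_policy(policy))  # ['log', 'alert', 'restart service']
```

Because logging and alerting come first and never report the problem as resolved, they always run; stopping at the first successful remediation avoids rebooting a server when a service restart is enough.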
Widespread Business Benefits
An Application Performance Monitoring tool can streamline the resolution of intermittent problems, turning many hours of exasperating work into a few highly productive minutes. The most successful solutions feature:
• A unified view of the complete IT infrastructure and applications
• Application profiles that can be customized to include all dependencies
• The option to create multi-step action policies that automatically address future instances of the same intermittent conditions
These capabilities allow administrators to diagnose intermittent performance problems quickly and accurately, whether
they reside in a device or in the network itself. The ability to customize profiles can also foster cooperation throughout
IT. For example, a system administrator can easily ask a database administrator to recommend areas to monitor that
have historically tended to cause problems. Working cooperatively with common priorities and shared information
shortens the diagnostic cycle and streamlines the process to save administrator time and reduce user complaints.
About the Network Management Division of Ipswitch, Inc.
The Network Management Division of Ipswitch, Inc. is the force behind WhatsUp Gold, the integrated suite of IT
management solutions. Over 150,000 networks at large, mid-sized and small enterprises depend on WhatsUp Gold
for comprehensive network, system, application and end-user experience monitoring in physical, virtual and wireless
infrastructure environments.
For 21 years, Ipswitch has developed easy-to-use and affordable products that help IT managers worldwide improve
their ability to provide services that drive business success. Ipswitch is headquartered in Lexington, MA and has
offices in Atlanta and Augusta, GA; Madison, WI; and American Fork, UT, as well as international offices in the Netherlands
and Japan. To learn more about WhatsUp Gold, please visit: http://www.whatsupgold.com/products/download/
Ipswitch, Inc.
83 Hartwell Avenue
Lexington, MA 02421 USA
(781) 676-5700
Kingsfordweg 151
1043 GR Amsterdam
The Netherlands
Copyright © 2013, Ipswitch, Inc. All rights reserved. WhatsUp is a registered trademark and Ipswitch is
a trademark of Ipswitch, Inc. Other products or company names are or may be trademarks or registered
trademarks and are the property of their respective holders.
Ipswitch Japan K.K.
Minami-Azabu T&F Building, 8th Floor
4-11-22 Minami-Azabu, Minato-ku, Tokyo 106-0047, Japan