Performance testing and analysis service

by Vladimir Marchenko
Contents
Introduction & Skills
Case Studies
Engagement Plan
Appendix 1: Why Performance Matters
Appendix 2: Performance Test Program
Introduction
Who I am:
Vladimir Marchenko, a performance engineer/manager with more than 8 years of experience
working for major American companies
Purpose of the Presentation:
To obtain a contractor position in a fast-paced software development environment requiring
performance improvement and monitoring
Area of Expertise:
Software system performance optimization, ranging from simple web sites to
complex software platforms comprising multiple services on hundreds of machines.
This includes locating performance bottlenecks; code, memory, and database profiling;
optimizing system configuration for the required customer usage; and coding special-purpose
test software.
Business Opportunities:
Backed by a team of engineers with comparable experience, I am looking to grow my
business as a service to your company.
References:
Available upon request. Please see my resume attached.
Skills & Expertise
Application types: web sites, web services, databases, data integration solutions
Technologies: primarily .NET, some Java, unmanaged code
Database systems: primarily MS SQL (including SSIS), some MySQL
Test environments: production (including testing a production instance from the cloud), staging,
pre-production, test
Operating systems: Windows, Linux
Test tools: Apache JMeter, HP LoadRunner / Performance Center, and other tools
Client-side analysis tools: Google Page Speed, HTTP Watch, dynaTrace, Firebug, Fiddler2
Profilers: WinDBG, ANTS profiler, MAT, MS SQL profiler
Various other tools for monitoring, sniffing, and log analysis, plus various in-house tools,
including an HTTP test framework for quick script setup in LoadRunner
Case Studies
Case Studies Overview
I have participated in a number of performance testing and analysis efforts
during my career.
This section describes the three most typical case studies:
1. Complex web application with a big web site on the front end and dozens of
web services on the back end
2. Typical 3-tier web application with many performance issues on each tier
3. Test against a production site from the Amazon cloud
Case Study #1
Problem:
Application under test: a financial information product from one of the world’s leading providers of intelligent
information.
Technically, it was a very large web site (more than 300 pages) on the front end, with dozens of SOAP web
services on the back end and a number of data stores.
Tech stack: ASP.NET, SOAP, MS SQL.
Key challenge: establish a regular performance test against the whole platform (more than 400
servers with a vast number of interconnections). When a test revealed a high-level performance
issue, it also had to indicate which specific component of the system was to blame.
Solution:
• Production usage statistics (hits/sec distribution) were collected for the front-end site and for
each of the web services it calls.
• Test scripts were developed in LoadRunner for each web service and for the front-end web site.
• To speed up and simplify script development and modification, a special performance test
framework was developed first. After that, developing a particular script mostly meant listing
requests and their weights and providing parameterization.
• A standard server-side test program was implemented: capacity, response time, and stability tests.
• For each release, the performance test program was run against the front-end site, with unloaded monitoring runs
of each web service in parallel. This made it possible to see which web service to investigate
more deeply in case of problems. For deeper investigation, additional tests were then run against specific web
services.
• The tests were run regularly and prevented a large number of performance issues,
both application and environmental, from escaping to production.
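The idea of describing a script as "requests and their weights" can be illustrated with a minimal Python sketch. This is not the LoadRunner framework from the case study; the request names and weights below are hypothetical and only show how a weighted mix turns into a request sequence.

```python
import random
from collections import Counter

# Hypothetical workload model: request name -> relative weight.
# In a real project the weights would come from production hits/sec statistics.
WORKLOAD = {
    "GetQuote": 60,
    "SearchCompany": 30,
    "GetNews": 10,
}

def pick_requests(workload, count, seed=None):
    """Draw `count` request names, each chosen in proportion to its weight."""
    rng = random.Random(seed)
    names = list(workload)
    weights = [workload[name] for name in names]
    return rng.choices(names, weights=weights, k=count)

def observed_mix(requests):
    """Return the share of each request name in the generated sequence."""
    total = len(requests)
    return {name: hits / total for name, hits in Counter(requests).items()}

if __name__ == "__main__":
    sample = pick_requests(WORKLOAD, 10_000, seed=1)
    for name, share in sorted(observed_mix(sample).items()):
        print(f"{name}: {share:.1%}")
```

With enough requests, the observed shares approach the configured weights, which is what makes the generated load resemble the production distribution.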
Case Study #2
Problem:
Application under test: a web application used by health insurance companies to carry out care management.
It was a web site on the front end, with a number of COM and NT services and an MS SQL database on the back
end. A BI tool for automatic rules processing was also incorporated into the DB.
Tech stack: ASP.NET, SOAP, REST, COM+, MS SQL including SSIS.
The key challenge was application performance, which was very poor when I picked up
the effort: the application was very slow and unstable.
Another challenge was the set of LoadRunner scripts developed before I joined the project. The test
scripts were overcomplicated, and there was no time to redevelop them: performance tests were required
from day 1.
Solution:
• A map of data flows in the existing scripts was quickly created, just to understand which of the 15 scripts and 20
library files call which other scripts and files, and thus which application functionality they exercise.
• In addition to the typical benchmark test that was already designed, capacity and stability tests were
added to the test program.
• Direct communication with each work stream lead was established for quicker and easier troubleshooting
of performance issues.
• Code profiling of the web site and DB profiling were used extensively to troubleshoot performance
issues.
• As a result, the performance test effort was successfully picked up and re-established. More than 20
performance issues were found in 7 months (most of them related to inefficient data operations in the
DB) and fixed in close cooperation with the development teams.
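A map of which script calls which file is essentially a small dependency graph. The sketch below is a minimal Python illustration under that assumption; the file names are hypothetical and do not come from the project.

```python
# Hypothetical call map: file -> files it calls directly.
CALLS = {
    "login_script": ["auth_lib", "common_lib"],
    "search_script": ["search_lib", "common_lib"],
    "auth_lib": ["common_lib"],
    "search_lib": [],
    "common_lib": [],
}

def reachable(calls, start):
    """Return every file that `start` depends on, directly or indirectly."""
    seen = set()
    stack = list(calls.get(start, []))
    while stack:
        name = stack.pop()
        if name not in seen:
            seen.add(name)
            stack.extend(calls.get(name, []))
    return seen

def used_by(calls, target):
    """Return every file whose dependency chain includes `target`."""
    return {name for name in calls if target in reachable(calls, name)}

if __name__ == "__main__":
    print(sorted(reachable(CALLS, "login_script")))
    print(sorted(used_by(CALLS, "common_lib")))
```

Answering "what breaks if I touch this library file" then becomes a lookup with `used_by` instead of a manual read of every script.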
Case Study #3
Problem:
Application under test: a web player for broadcasting important events from one of the world’s leading stock
exchanges.
It was a very simple web site with an MS SQL DB on the back end. The video was streamed from Akamai and was out
of scope for testing.
Tech stack: ColdFusion, Windows, IIS, MS SQL.
The customer’s complaint was that when the total number of viewers entering the presentation within a 5-minute
interval exceeded approximately 30,000, it became very difficult for new users to open the player. The task was
to find the exact maximum number of users and, if possible, the cause of the issue.
The primary challenge was that the system under test was a production system, so the tests had to be
run in close cooperation with operations engineers. It also meant no access to the servers, which significantly
limited our ability to analyze the issues. In addition, the customer could not provide infrastructure for running
tests inside their data center.
Solution:
• Since the site itself was not complex, a test script was quickly developed with the free Apache JMeter tool.
• Instructions and scripts for setting up monitoring of specific counters were prepared and sent to the customer;
IIS logs were also collected from the servers.
• Tests were run from the Amazon cloud over the Internet, in close contact with the customer’s operations team.
The IPs of the load generation servers were added to the firewall’s whitelist.
• Analysis of the test results revealed that overall performance is limited not only by the number of users but
also by how fast they enter the system. 27K users with a 200+/sec entrance rate was identified as the point of
instability: if the number of users is higher or they enter faster, the system starts to refuse connections. With a
lower entrance rate, 40K users was the ceiling of system capacity, with CPU utilization up to 95%.
• It was also found that connections are refused when the total number of open connections exceeds 14K;
the firewall settings were the primary suspect.
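The relationship between user count and entrance rate is simple arithmetic, sketched below in Python. The figures are the ones quoted in this case study; treating arrivals as perfectly uniform is an assumption made only for illustration, since real audiences tend to arrive in bursts.

```python
def ramp_seconds(users, entrance_rate_per_sec):
    """Time needed for `users` to enter at a constant entrance rate."""
    return users / entrance_rate_per_sec

def average_entrance_rate(users, window_seconds):
    """Average users per second if `users` enter within `window_seconds`."""
    return users / window_seconds

if __name__ == "__main__":
    # 27K users entering at 200/sec (the observed point of instability).
    print(ramp_seconds(27_000, 200))               # 135.0 seconds
    # 30,000 viewers within a 5-minute interval (the customer's complaint).
    print(average_entrance_rate(30_000, 5 * 60))   # 100.0 users/sec
```

Under the uniform-arrival assumption, the complaint scenario averages about 100 entrances per second, while the instability point was reached when 27K users entered in a little over two minutes.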
Engagement Plan
High Level Plan of Engagement
At a high level, to set up a performance test, you need to:
1) Contact me and discuss what you expect from the performance test, i.e. the primary
objective → as a result, I can suggest a vision of what needs to be done in your specific
case
2) Allocate a technical contact on your side → so that I can discuss technical questions in
an organized manner without disturbing your development and test staff directly
3) Describe your system at a high level, from the top to the bottom layer → so that I can
estimate the effort at a very high level and highlight important contingencies
4) Provide me with user access to your system → so that I can explore it and estimate the effort
for scripting and testing
5) Provide typical user scenarios or a weighted distribution of requests, as well as values for
request parameterization (e.g. search keywords or test user IDs / passwords) → this is
required to model a realistic workload on your system
6) Allocate a test window → so that loaded tests do not interfere with your development and
test teams and/or production usage of your system
7) Optional, but still desirable: provide access to the servers under load → so that I can
set up monitoring as required and conduct profiling, which makes the
performance effort most efficient. If direct access is not possible, a contact on your side who can handle
online requests for logs and similar tasks would be a great help.
Work Models
I am happy to work under either model:
1. Fixed Price
2. Time & Material
Legal status: Private Entrepreneur in the Republic of Belarus, entitled
to sign contracts with US companies as an independent contractor.
If a complex task or project requires more than 1 FTE, I can
bring in up to 3 subcontractors with experience comparable to
mine.
Contact Information
Vladimir Marchenko
Please contact me if you are interested in my service or have any
further questions:
E-mail: vladimir.a.marchenko@gmail.com
Skype: vladimir-a-marchenko
Mobile: +375 (29) 767-36-68
LinkedIn: http://www.linkedin.com/pub/vladimir-marchenko/68/869/6bb
Thank you!
Appendix 1:
Why Performance Matters
Reliability under Load
Just one vivid and recent example of application reliability:
“Healthcare.gov was frighteningly dysfunctional on day one. Users experienced
multi-hour wait times, menus filled with blanks, and bizarre quirks like the "prison
glitch" which stopped a user from proceeding until they specified how long they had
been incarcerated, even if they had never been to prison.” (The Verge)
While experts reasonably argue that the problems with healthcare.gov were
caused by a flawed system architecture, there is no doubt that proper and
timely load testing could have prevented them in production.
http://www.theverge.com/us-world/2013/12/3/5163228/healthcare-gov-obamacare-website-shows-how-government-can-do-tech-better
http://www.reuters.com/article/2013/10/05/us-usa-healthcare-technology-analysis-idUSBRE99407T20131005
Velocity
These are some excerpts from Strangeloop Networks infographics that clearly show
what application velocity can mean for end-user experience and business.
“Amazon.com. Increased revenue
by 1% for every 100 milliseconds of
improvement”
“AOL. Visitors in the top ten
percentile of site speed viewed 50%
more pages than the visitors in the
bottom ten percentile”
“YAHOO. Increased traffic by 9% for
every 400 milliseconds of
improvement”
“Shopzilla. Sped up average page
load time from 6 to 1.2 seconds.
Results: increased revenue by 12%
and page views by 25%”
Complete report:
http://www.strangeloopnetworks.com/resources/infographics/web-performance-and-user-expectations/poster-visualizing-web-performance/
So,
Problems with system performance usually lead to:
1. Application unavailability to end users when the
load gets high
2. Poor experience for end users working with your
application, often leading to loss of revenue
Performance testing (including load testing) and
analysis help to assess the system under test in time
and prevent those issues in production.
Appendix 2:
Performance Test Program
Program Overview
I firmly believe that the following tests are the bare minimum when it comes to
performance assessment of a web-based system.
Performance of the server side of a system:
1) Assessing system capacity
2) Checking long-term reliability
3) Assessing response times
Plus, finding the causes of bad performance (eliminating performance
bottlenecks)
Performance of the client side of a system:
A number of analysis activities to make sure that the interaction between the
client and server sides is efficient, and that the client-side code performs well.
Capacity
Special tests are conducted to assess system capacity in a given configuration.
Typical questions to answer here:
1. At what number of users does the system become unavailable to end users?
2. What is the maximum number of business operations the system can process
per second (minute, hour, …)?
3. What hardware resource utilization is observed on each server when the
system saturates?
4. What factors limit the current system capacity: inefficient code,
improper configuration, weak hardware, etc.?
Capacity testing can also be a great help when it is necessary to assess
how system capacity scales when hardware or other parameters (such
as network bandwidth) change.
Typical Capacity Graph
The green line (left axis) represents the number of virtual users, which is gradually increased in the course of
the test, so that the level of load gradually increases, too.
The blue line (left axis, scaled 10x) shows total operations/sec performed by the system under test. This value
increases until it reaches a maximum (~10 operations/sec), which is the system capacity.
The red line (right axis) is the response time of the slowest operation, in seconds. The graph shows that
system saturation correlates with the increase in this operation’s response time.
After that particular test, application code profiling and DB profiling were conducted, and it was found that the
root cause of the performance issue was non-optimal indexing of one of the DB tables, which led to a slow
stored procedure call, which in turn led to slow response times for end users. This was fixed, response time
improved, and system capacity increased threefold, to the desired level.
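Reading capacity off such a graph can also be done numerically. The following is a minimal Python sketch, assuming the test produces (virtual users, operations/sec) samples at increasing load; the sample values and the 5% tolerance are hypothetical, chosen only to resemble a curve that flattens near 10 operations/sec.

```python
# Hypothetical step-load results: (virtual users, total operations/sec).
SAMPLES = [
    (10, 2.0), (20, 4.1), (30, 6.0), (40, 8.0),
    (50, 9.6), (60, 10.0), (70, 9.9), (80, 10.0),
]

def find_saturation(samples, tolerance=0.05):
    """Return (capacity, users_at_saturation).

    capacity is the highest throughput observed; users_at_saturation is the
    first load level whose throughput is within `tolerance` of that maximum.
    `samples` must be ordered by increasing number of users.
    """
    if not samples:
        raise ValueError("no samples")
    capacity = max(ops for _, ops in samples)
    threshold = capacity * (1 - tolerance)
    for users, ops in samples:
        if ops >= threshold:
            return capacity, users

if __name__ == "__main__":
    capacity, users = find_saturation(SAMPLES)
    print(f"capacity ~{capacity} ops/sec, reached at about {users} users")
```

Adding more users beyond the saturation point no longer raises throughput, so response times grow instead, which is the correlation the graph describes.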
Response Times
While a capacity test usually gives a quick snapshot of application velocity and
performance bottlenecks in general, it also makes sense to conduct
separate tests to measure response times for key transactions under
defined levels of load.
It is usually worth separating the task of applying a realistic load from the task of
measuring response times for the transactions of interest.
Questions to answer with a response time test:
1. How fast do different pieces of functionality execute under a specific level of
load (typical production load, peak load, etc.)?
2. Does this release / build perform better or worse than the previous one?
3. As with the capacity test: if some transaction is not fast enough, what
is the reason?
Typical Response Times Table
The table below shows a typical example of a response time report.
Response times for a number of transactions (pages of a web site) were measured under medium load for
the original release candidate build (RC) and for the hot fix (HF).
The data shows that while the RC build had serious trouble with the filtering functionality (response times of 8
sec and above), the hot fix helped: response times of the filtering functionality improved to 300-400 msec.
The fix itself was all about adjusting the queries in two stored procedures. One stored procedure showed
great improvement, while the other showed a degradation that was large in relative terms but still acceptable
in absolute terms. Together, these changes led to a great improvement in the UI.
The hot fix also greatly decreased the load on the DB server’s CPU, so the web server was able to
process more requests per second.
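A build-to-build comparison like this one is easy to automate. The sketch below is a minimal Python illustration; the transaction names, response times, and the 10% threshold are hypothetical and are not taken from the report described above.

```python
# Hypothetical response times in seconds, measured under the same load.
RC_BUILD = {"Filter by date": 8.0, "Home page": 0.50, "Details": 0.40}
HOT_FIX = {"Filter by date": 0.35, "Home page": 0.52, "Details": 0.60}

def compare_builds(baseline, candidate, threshold_pct=10.0):
    """Compare transactions present in both builds.

    Returns transaction -> (change in percent, verdict), where the verdict is
    "better", "worse", or "same" relative to `threshold_pct`.
    Baseline response times are assumed to be positive.
    """
    report = {}
    for name in sorted(baseline.keys() & candidate.keys()):
        old, new = baseline[name], candidate[name]
        change = (new - old) / old * 100.0
        if change > threshold_pct:
            verdict = "worse"
        elif change < -threshold_pct:
            verdict = "better"
        else:
            verdict = "same"
        report[name] = (round(change, 1), verdict)
    return report

if __name__ == "__main__":
    for name, (change, verdict) in compare_builds(RC_BUILD, HOT_FIX).items():
        print(f"{name}: {change:+.1f}% ({verdict})")
```

A relative verdict alone can mislead: in this made-up data, "Details" is flagged as worse while staying well under a second, so the absolute values should be read alongside the percentages.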
Reliability
Performance testing covers not only velocity but also long-term
reliability.
To check application reliability, a long test under load is conducted, with the
closest attention paid to memory consumption. Usually, this is a test at a load level
of 80% of saturation with a duration of at least 12 hours, but the specific setup
is chosen according to the situation.
Typical questions to answer here:
1. Does the application perform stably over time? Response times, network
traffic, and operations/sec should be stable over the whole duration of the test.
2. Are there any abnormal spikes of errors, and if so, what are the causes?
3. Does the application free allocated memory in a timely manner?
4. If there is a memory leak, what is the cause?
5. If it is not possible to fix the memory leak quickly, what safe
server reboot interval should we apply?
6. <and more similar questions related to reliability>
Longevity Test Graph
The graph below is fairly typical of a longevity test that shows suspicious results.
The green horizontal line (right axis) is the number of virtual users applying the load, which is constant for the
whole duration of the test.
The total operations performed per second are represented by the blue line (left axis).
The bold red line is memory consumption as a percentage of the total (left axis).
Although total operations/sec remain stable for the whole duration of the test, which is
good, memory consumption shows a visible upward trend. And although no out-of-memory exceptions or
application crashes were encountered in the test, this is not a good sign in general, and additional
memory profiling is recommended to check whether it is a real memory leak and, if so, what causes it.
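A growth trend like this can be quantified with a straight-line fit, which also gives a rough first answer to the "safe reboot interval" question. The Python sketch below is only an illustration: the memory samples and the 90% limit are hypothetical, and linear extrapolation is an assumption that real memory behavior may not follow.

```python
# Hypothetical longevity test samples: hours since start, memory use in percent.
SAMPLE_HOURS = [0, 2, 4, 6, 8, 10, 12]
SAMPLE_MEMORY_PCT = [40, 42, 44, 46, 48, 50, 52]

def linear_trend(times, values):
    """Least-squares fit of values over time; returns (slope, intercept)."""
    n = len(times)
    if n < 2 or n != len(values):
        raise ValueError("need at least two samples of equal length")
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("all samples have the same timestamp")
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = sxy / sxx
    return slope, mean_v - slope * mean_t

def hours_until_limit(times, values, limit=90.0):
    """Extrapolate the trend: hours after the last sample until `limit` is hit.

    Returns None when memory is not growing.
    """
    slope, intercept = linear_trend(times, values)
    if slope <= 0:
        return None
    return (limit - intercept) / slope - times[-1]

if __name__ == "__main__":
    slope, _ = linear_trend(SAMPLE_HOURS, SAMPLE_MEMORY_PCT)
    remaining = hours_until_limit(SAMPLE_HOURS, SAMPLE_MEMORY_PCT)
    print(f"growth: {slope:.2f} %/hour, about {remaining:.0f} hours to the limit")
```

An estimate like this only bounds the reboot interval; memory profiling is still needed to tell a real leak from caches that are simply warming up.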
Client-side Analysis
Capacity, stability, and response time testing are the bare minimum to assure server-side
performance. However, even with good server-side performance, client-side
issues may ruin overall web site velocity.
Using special tools for client-side analysis, it is possible to highlight those issues and
greatly improve the end-user experience.
Some of these issues are not purely client-side, but more about the interaction
between the client and server sides.
There are many questions to answer here; these are just a few of them:
• Is it possible for client-side code to execute faster?
• Is it possible to reduce the network traffic from the server to the client side?
• Can the site HTML be optimized for better rendering in the browser?
• Does client-side caching work efficiently?
• Are the image sizes optimized enough?
• Does it make sense to add data pre-loading to the site?
• Is the site well optimized for mobile users’ networks and screens?
• <and many other questions like the ones above>
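Two of these questions, caching and network traffic, can be partly checked from HTTP response headers alone. The sketch below is a minimal Python illustration, assuming the headers of a static resource have already been captured (for example with one of the client-side tools listed earlier); the rules are deliberately simplistic and the function name is hypothetical.

```python
def audit_headers(headers):
    """Check response headers of a static resource for two common issues.

    `headers` maps HTTP header names to values. Returns a list of findings;
    an empty list means neither issue was detected.
    """
    h = {name.lower(): value.lower() for name, value in headers.items()}
    findings = []

    # Caching: expect Cache-Control or Expires to be present and permissive.
    cache_control = h.get("cache-control", "")
    if not cache_control and "expires" not in h:
        findings.append("no caching headers")
    elif "no-store" in cache_control or "no-cache" in cache_control:
        findings.append("caching restricted")

    # Traffic: text-like content should normally be sent compressed.
    content_type = h.get("content-type", "")
    is_text = (content_type.startswith("text/")
               or "javascript" in content_type
               or "json" in content_type)
    if is_text and h.get("content-encoding", "") not in ("gzip", "br", "deflate"):
        findings.append("text resource not compressed")

    return findings

if __name__ == "__main__":
    print(audit_headers({"Content-Type": "text/css"}))
    print(audit_headers({
        "Content-Type": "text/css",
        "Cache-Control": "max-age=86400",
        "Content-Encoding": "gzip",
    }))
```

A check like this only flags candidates for review; whether a given resource should be cached or compressed still depends on how the site uses it.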