Performance testing & analysis service
by Vladimir Marchenko

Contents
Introduction & Skills
Case Studies
Engagement Plan
Appendix 1: Why Performance Matters
Appendix 2: Performance Test Program

Introduction

Who I am: Vladimir Marchenko, a performance engineer/manager with more than 8 years of experience working for major American companies.

Purpose of the Presentation: To obtain a contractor position in a fast-paced software development environment that requires performance improvement and monitoring.

Area of Expertise: Software system performance optimization, ranging from simple web sites to complex software platforms comprising multiple services running on hundreds of machines. This includes locating performance bottlenecks; code, memory, and database profiling; optimizing system configuration for the required customer usage; and coding special-case test software.

Business Opportunities: Backed by a team of engineers with comparable experience, I am looking to grow my business as a service to your Company.

References: Available upon request. Please see my attached resume.
Skills & Expertise

Application types: web sites, web services, databases, data integration solutions
Technologies: primarily .NET, some Java, unmanaged code
Database systems: primarily MS SQL (including SSIS), some MySQL
Test environments: production (including testing a production instance from the cloud), staging, pre-production, test
Operating systems: Windows, Linux
Test tools: Apache JMeter, HP LoadRunner / Performance Center, and other tools
Client-side analysis tools: Google Page Speed, HTTP Watch, dynaTrace, Firebug, Fiddler2
Profilers: WinDBG, ANTS profiler, MAT, MS SQL profiler
Various other tools for monitoring, sniffing, and log analysis, as well as in-house tools, including an HTTP test framework for quick script setup in LoadRunner

Case Studies

Case Studies Overview

I have taken part in a number of performance testing and analysis efforts during my career. This section describes the three most typical case studies:
1. A complex web application with a big web site on the front end and dozens of web services on the back end
2. A typical 3-tier web application with many performance issues on each tier
3. A test against a production site from the Amazon cloud

Case Study #1

Problem:
Application under test: a financial information product from one of the world’s leading providers of intelligent information. Technically, it was a very large web site (more than 300 pages) on the front end with dozens of SOAP web services on the back end and a number of data stores.
Tech stack: ASP.NET, SOAP, MS SQL.
Key challenge: establish a regular performance test against the whole Platform (more than 400 servers with a vast number of interconnections). When a test revealed a high-level performance issue, it also needed to indicate which specific component of the system was to blame.
Solution:
Production usage statistics (hits/sec distribution) were collected for the front-end site and for each web service it calls.
Test scripts were developed in LoadRunner for each web service and for the front-end web site.
To speed up and simplify script development and modification, a special performance test framework was developed first. After that, developing a particular script mostly came down to listing requests and their weights and providing parameterization.
A standard server-side test program was implemented: capacity, response time, and stability tests.
For each release, the performance test program was run against the front-end site, with unloaded monitoring runs of each web service in parallel. This made it possible to see which web service to investigate more deeply when problems appeared. For deeper investigation, additional tests were then run against specific web services.
The tests were run regularly and kept a large number of performance issues, both application and environmental, from reaching production.

Case Study #2

Problem:
Application under test: a web application used by health insurance companies to carry out care management. It had a web site on the front end, with a number of COM and NT services and an MS SQL database on the back end. A BI tool for automatic rules processing was also incorporated into the DB.
Tech stack: ASP.NET, SOAP, REST, COM+, MS SQL including SSIS.
Key challenge: application performance was very poor by the time I picked up the effort; the application was very slow and unstable.
Another challenge was the set of LoadRunner scripts developed before I joined the project. The test scripts were overcomplicated, and there was no time to redevelop them: performance tests were required on day 1.
Solution:
A map of data flows in the existing scripts was quickly created to understand which of the 15 scripts and 20 library files called which other scripts and files, and thus which application functionality they covered.
In addition to the typical benchmark test that was already designed, capacity and stability tests were added to the test program.
Direct communication with each work stream lead was established for quicker and easier troubleshooting of performance issues.
Code profiling of the web site and DB profiling were used extensively to troubleshoot performance issues.
As a result, the performance test effort was successfully picked up and re-established. More than 20 performance issues were found in 7 months (most of them related to inefficient handling of data in the DB) and fixed in close cooperation with the development teams.

Case Study #3

Problem:
Application under test: a web player for broadcasting important events from one of the world’s leading stock exchanges. It was a very simple web site with an MS SQL DB on the back end. The video was streamed from Akamai and was out of scope for testing.
Tech stack: ColdFusion, Windows, IIS, MS SQL.
The customer’s complaint was that when the total number of viewers entering the presentation within a 5-minute interval exceeded approximately 30,000, it became very difficult for new users to open the player. The task was to find the exact maximum number of users and, if possible, the cause of the issue.
The primary challenge was that the system under test was a production system, so the tests had to be run in close cooperation with operations engineers. This also meant no access to the servers, which significantly limited our ability to analyze the issues. In addition, the customer could not provide infrastructure for running tests inside their data center.
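To put the customer's figure in perspective, the short sketch below converts "about 30,000 viewers within 5 minutes" into an average arrival rate. It assumes arrivals are spread evenly across the interval, which real event traffic rarely is, and the function name `average_arrival_rate` is purely illustrative rather than part of any tool used in the project.

```python
def average_arrival_rate(users: int, window_seconds: float) -> float:
    """Average number of users arriving per second over the window.

    Assumes a uniform arrival pattern (an assumption, not a measurement).
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return users / window_seconds


# Figures from the customer's complaint: ~30,000 viewers within 5 minutes.
rate = average_arrival_rate(30_000, 5 * 60)
print(f"Average arrival rate: {rate:.0f} users/sec")
```

Because viewers tend to arrive in bursts around the start of an event, the peak arrival rate can be noticeably higher than this average, which is why the entrance rate deserves to be tested separately from the total number of users.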
Solution:
Since the site itself was not complex, a test script was quickly developed with the free Apache JMeter tool.
Instructions and scripts for setting up monitoring of specific counters were prepared and sent to the customer; IIS logs were also collected from the servers.
Tests were run from the Amazon cloud over the Internet in close contact with the customer’s Operations team. The IPs of the load generation servers were added to the firewall’s whitelist.
Analysis of the test results revealed that overall performance was limited not only by the number of users but also by how fast they entered the system. 27K users with an entrance rate of +200/sec was identified as the point of instability: with more users or a faster entrance rate, the system started to refuse connections. With a lower entrance rate, 40K users was the ceiling of system capacity, with CPU utilization up to 95%. It was also found that connections were refused when the total number of open connections exceeded 14K; the firewall settings were the primary suspect.

Engagement Plan

High Level Plan of Engagement

At a high level, to set up a performance test, you need to:
1) Contact me and discuss what you expect from the performance test, i.e. what the primary objective is; as a result, I can suggest a vision of what needs to be done in your specific case
2) Allocate a technical contact on your side so that I can discuss technical questions in an organized manner without disturbing your development and test staff directly
3) Describe your system at a high level, from the top layer to the bottom, so that I can make a rough estimate of the effort and highlight important contingencies
4) Provide me with user access to your system so that I can estimate the effort for scripting and testing
5) Provide typical user scenarios or a weighted distribution of requests, as well as values for request parameterization (e.g. search keywords, test user IDs / passwords, etc.);
this is required to model a realistic workload on your system
6) Allocate a test window so that loaded tests do not interfere with your development and test teams and/or production usage of your system
7) Optional, but still desirable: provide access to the servers under load so that I can set up monitoring as required and conduct profiling, for the highest efficiency of the performance effort. If direct access is not possible, a contact on your side who can handle online requests, such as retrieving logs, would be a great help.

Work Models

I am happy to work under either model:
1. Fixed Price
2. Time & Material
Legal status: Private Entrepreneur in the Republic of Belarus, with the right to sign contracts with US companies as an independent contractor.
For a complex task or project requiring more than 1 FTE, I can bring in up to 3 sub-contractors with experience comparable to mine.

Contact Information

Vladimir Marchenko
Please contact me if you are interested in my service and/or have any further questions:
E-mail: vladimir.a.marchenko@gmail.com
Skype: vladimir-a-marchenko
Mobile: +375 (29) 767-36-68
LinkedIn: http://www.linkedin.com/pub/vladimir-marchenko/68/869/6bb
Thank you!

Appendix 1: Why Performance Matters

Reliability under Load

Just one vivid and recent example of application reliability:
“Healthcare.gov was frighteningly dysfunctional on day one. Users experienced multi-hour wait times, menus filled with blanks, and bizarre quirks like the "prison glitch" which stopped a user from proceeding until they specified how long they had been incarcerated, even if they had never been to prison.” (The Verge)
While experts reasonably say that the problems with healthcare.gov were caused by a flawed system architecture, there is no doubt that proper and timely load and performance testing could have prevented them in production.
http://www.theverge.com/us-world/2013/12/3/5163228/healthcare-gov-obamacare-website-shows-how-government-can-do-tech-better
http://www.reuters.com/article/2013/10/05/us-usa-healthcare-technology-analysis-idUSBRE99407T20131005

Velocity

These are some excerpts from Strangeloop Networks infographics that clearly show what application velocity can mean for end user experience and for businesses.
“Amazon.com. Increased revenue by 1% for every 100 milliseconds of improvement”
“AOL. Visitors in the top ten percentile of site speed viewed 50% more pages than the visitors in the bottom ten percentile”
“YAHOO. Increased traffic by 9% for every 400 milliseconds of improvement”
“Shopzilla. Sped up average page load time from 6 to 1.2 seconds. Results: increased revenue by 12% and page views by 25%”
Complete report: http://www.strangeloopnetworks.com/resources/infographics/web-performance-and-user-expectations/poster-visualizing-web-performance/

So, problems with system performance usually lead to:
1. Application unavailability to end users when the load gets high
2. Poor experience for end users working with your application, often leading to loss of revenue
Performance testing (including load testing) and analysis help to assess the system under test in time and prevent those issues in production.

Appendix 2: Performance Test Program

Program Overview

I firmly believe that the following tests are the bare minimum when it comes to performance assessment of a web-based system.
Performance of the server side of a system:
1) Assessing system capacity
2) Checking long-term reliability
3) Assessing response times
Plus finding the causes of bad performance (eliminating performance bottlenecks)
Performance of the client side of a system:
A number of analysis activities to make sure that the interaction between the client and server sides is efficient, and that the client-side code performs well.
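As a rough illustration of how the server-side tests listed above differ in the shape of the load they apply, the sketch below generates virtual-user counts for a ramp-up (capacity) run and a constant-load (reliability or response time) run. The function names and the saturation figure of 1,000 users are hypothetical; the 80% level and 12-hour duration follow the reliability guideline given later in this appendix.

```python
from typing import List


def ramp_profile(max_users: int, steps: int) -> List[int]:
    """Capacity test: raise the number of virtual users in equal steps."""
    if steps <= 0:
        raise ValueError("steps must be positive")
    return [max_users * (i + 1) // steps for i in range(steps)]


def constant_profile(users: int, intervals: int) -> List[int]:
    """Reliability or response time test: hold one fixed level of load."""
    return [users] * intervals


# Hypothetical figure: the system is assumed to saturate at 1,000 virtual users.
saturation_users = 1000

# Capacity run: climb to the assumed saturation point in 10 steps.
capacity_test = ramp_profile(saturation_users, steps=10)

# Reliability run: 80% of saturation, one entry per hour for 12 hours.
reliability_test = constant_profile(saturation_users * 80 // 100, intervals=12)

print(capacity_test)
print(reliability_test)
```

In a real load tool the same idea is usually expressed through the scenario or thread-group schedule rather than in code; the lists here only make the difference between the two load shapes explicit.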
Capacity

Special tests are conducted to assess system capacity in a given configuration. Typical questions to answer here:
1. At what number of users does the system become unavailable to end users?
2. What is the maximum number of business operations the system can process per second (minute, hour, …)?
3. What hardware resource utilization is observed on each server when the system saturates?
4. What factors limit the current system capacity: inefficient code, improper configuration, weak hardware, etc.?
Capacity testing also helps greatly when it is necessary to assess how system capacity scales when system hardware or other parameters (such as network bandwidth) change.

Typical Capacity Graph

The green line (left axis) represents the number of virtual users, which is gradually increased over the course of the test so that the level of load gradually increases too.
The blue line (left axis, scaled 10x) shows the total operations/sec performed by the system under test. This value increases until it reaches a maximum (~10 operations/sec), which is the system capacity.
The red line (right axis) is the response time of the slowest operation, in seconds. The graph shows that system saturation correlates with the increase in this operation’s response time.
After that particular test, application code profiling and DB profiling were conducted. The root cause of the performance issue turned out to be non-optimal indexing of one of the DB tables, which led to a slow stored procedure call and, in turn, to slow response times for end users. Once this was fixed, response time improved and system capacity increased threefold, to the desired level.

Response Times

While a Capacity test usually gives a quick snapshot of application velocity and performance bottlenecks in general, it also makes great sense to conduct separate tests to measure response times for key transactions under defined levels of load.
It is usually worth separating the task of applying a good, realistic load from the task of measuring response times for the transactions of interest.
Questions to answer with a response time test:
1. How fast do different pieces of functionality execute under a specific level of load (typical production load, peak load, etc.)?
2. Does this release / build perform better or worse than the previous one?
3. Similarly to the Capacity test: if some transaction is not fast enough, what is the reason?

Typical Response Times Table

The table below shows a typical example of a response time report. Response times for a number of transactions (pages of a web site) were measured under medium load for the original release candidate build (RC) and for the hot fix (HF).
The data shows that while the RC build had serious trouble with the filtering functionality (response times of 8 sec and above), the hot fix helped: response times of the filtering functionality improved to 300-400 msec. The fix itself was all about adjusting the queries in two stored procedures. One stored procedure showed great improvement; the other showed a degradation that was large in relative terms but still acceptable in absolute terms. Together, these led to a great improvement on the UI.
It can also be seen that the hot fix greatly decreased the load on the DB server’s CPU, so the web server was able to process more requests per second.

Reliability

Performance testing is about not only velocity but also long-term reliability. To check application reliability, a long loaded test is conducted, with the closest attention paid to memory consumption. Usually, this is a test at a load level of 80% of saturation lasting at least 12 hours, but the specific setup is chosen according to the situation.
Typical questions to answer here:
1. Does the application perform stably over time? Response times, network traffic, and operations/sec should be stable for the whole duration of the test
2.
Are there any abnormal spikes of errors, and if so, what causes them?
3. Does the application free allocated memory in a timely manner?
4. If there is a memory leak, what is the cause?
5. If the memory leak cannot be fixed quickly, what safe server reboot interval should we apply?
6. <and more similar questions related to reliability>

Longevity Test Graph

The graph below is fairly typical of a longevity test that shows suspicious results.
The green horizontal line (right axis) is the number of virtual users applying the load, which is constant for the whole duration of the test.
The total number of operations performed per second is represented by the blue line (left axis).
The bold red line is memory consumption, given as a percentage (left axis).
Although total operations/sec remains stable for the whole duration of the test, which is good, memory consumption shows a visible upward trend. And although no out-of-memory exceptions (or application crashes) were encountered during the test, this is not a good sign in general, and additional memory profiling is recommended to check whether this is a real memory leak and, if so, what causes it.

Client-side Analysis

Capacity, stability, and response time testing is the bare minimum for assuring server-side performance. However, even with good server-side performance, client-side issues may ruin the overall web site velocity. Special tools for client-side analysis make it possible to highlight those issues and greatly improve the end user experience. Some of these issues are not purely client-side but have more to do with the interaction between the client and server sides.
There are many questions to answer here; these are just a few of them:
• Is it possible for client-side code to execute faster?
• Is it possible to reduce the network traffic from the server to the client side?
• Can the site HTML be optimized for better rendering in the browser?
• Does client-side caching work efficiently?
• Are the image sizes optimized enough?
• Does it make sense to add data pre-loading for the site?
• Is the site optimized well for mobile users’ networks and screens?
• <and many other questions like the ones above>
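As one illustration of the caching question in the list above, the sketch below inspects a response's HTTP headers for common signs that the browser cannot cache a static resource efficiently. It is a simple heuristic under stated assumptions, not the check performed by any of the tools named in this presentation; the function name `caching_issues` and the sample headers are hypothetical.

```python
from typing import Dict, List


def caching_issues(headers: Dict[str, str]) -> List[str]:
    """Return notes on caching-related headers of a static resource."""
    # HTTP header names are case-insensitive, so normalize them first.
    h = {name.lower(): value.lower() for name, value in headers.items()}
    issues = []

    cache_control = h.get("cache-control", "")
    if "no-store" in cache_control or "no-cache" in cache_control:
        issues.append("Cache-Control disables or restricts caching")
    elif "max-age" not in cache_control and "expires" not in h:
        issues.append("no explicit freshness lifetime (max-age or Expires)")

    if "etag" not in h and "last-modified" not in h:
        issues.append("no validator (ETag or Last-Modified) for conditional requests")

    return issues


# Hypothetical headers, as they might be copied from a browser's network panel.
sample = {"Content-Type": "image/png", "Cache-Control": "no-cache"}
for note in caching_issues(sample):
    print(note)
```

Headers such as these can be taken from the traffic captured by the client-side analysis tools mentioned earlier and fed to a check like this one for every static resource on a page.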