VMware Farm Optimization
By Jeremy Kampwerth
jkampy@hotmail.com

Introduction To Me
• Windows and Unix system administrator for 8 years.
• Capacity and performance engineer for 6 years.
• Apparently I like working for large companies.
• I consider myself a jack of many trades.
• Presentations like this are not one of the trades.

In General
• The topic has VMware in the title, but this is not about VMware specifically; the concepts I will discuss could be applied to any virtual environment.
• The concepts are more common sense than they are trade secrets.
• My role was to assist the project by providing analysis and technical expertise.
  – I was not doing the dirty work.

Introduction to the Topic
• Working hand-in-hand with the virtual support team to rein in the wildfire that was virtual sprawl.
• Optimize VMs based on historical utilization data, with added controls around application requirements.
• Today the capacity team is part of the before-and-after process to regulate and review.

Introduction to the Topic
• I will discuss:
  – How we got into the mess, and how the capacity team helped to get out of it.
  – How and why the capacity team was engaged.
  – The expensive tools we used to do the job.
  – The guidelines we used to make safe decisions.
  – What were the failures?
  – What led to the successes?

How did we get into this mess?
• Many factors fed the wildfire (virtual sprawl):
  – A corporate decision to push virtualization.
  – A lack of controls in the request process.
    • Led to many over-provisioned VMs.
  – An existing large, non-centralized environment managed across many different internal organizations, each with a different set of rules.

Ask the Capacity Team for help
• Surely the internal capacity team was the first call.
• Surely before they asked for money they would think of the capacity team.
• Surely upper management would know the capacity team exists.
  – Luckily, they did.

What was being asked
• Can we help the virtual team rein in the madness?
• Can we produce the same results as the outside company?
• Can we do it in a safe manner?
• Can we do it reliably and reproducibly?

What did they do? (the outside company)
• Looked at data for thousands of VMs.
  – The data covered only 4 weeks.
• Analysis via a modeling tool.
  – A fancy tool with a top-secret formula.
• CMDB details were not considered:
  – No application relationships.
  – No accounting for the age of the VM.
• Many reductions found:
  – Over 40% vCPU reduction.
  – Over 70% vMem reduction.

Our Guidelines
• Make sure the server is being used for what it was intended:
  – In deployment for 180 days.
• Consider the application:
  – Match by role and function.
    • Within each application, all production web servers should be sized the same.
• Enough data:
  – A minimum of 90 days of data.
• Peak utilization:
  – No arguing (but... but why?).
  – 15-minute interval.
  – Add headroom:
    • 20% headroom for vCPU.
    • 5% headroom for vMem (consumed memory).

Candidacy Analysis Overview (Outside Co. vs. Internal)
• Location: only a subset of locations included vs. all locations included.
• Resources: vCPU & vMem vs. an initial vCPU effort with a small vMem pilot.
• Asset status/duration: not taken into consideration vs. 180 days deployed.
• Environment matching (by AppID): not taken into consideration vs. match Prod/BCP, match Non-Prod.
• Minimum data required for a recommendation: 4 weeks via the modeling tool vs. 90 days of data.
• vCPU formula used: not disclosed vs. single max vCPU (15-minute interval) + 20%, rounded up.
• vMem formula used: not disclosed vs. single max consumed vMem + 5%, rounded up.
• VMs analyzed: thousands vs. thousands plus thousands.
• Candidates identified: 92% vs. 15%.

Analysis Comparison (configurations shown as vCPU x vMem GB)

Data Center 1
• Server 1: current 4x8; Outside Co. 1x1; internal 2x8 (max vCPU 1.48, max vMem 7.8).
• Server 2: current 4x8; Outside Co. 1x1; internal 2x8 (BCP matched to PROD).
• Server 3: current 4x8; Outside Co. 1x1; internal 4x8 (max vCPU 2.92, max vMem 7.67).
• Server 4: current 4x8; Outside Co. 1x1; internal none (disposed).

Data Center 2
• Server 1: current 4x16; Outside Co. 1x4; internal 3x12 (max vCPU 2.8, max vMem 11.4).
• Server 2: current 4x16; Outside Co. 1x7; internal 2x16 (max vCPU 1.52, max vMem 15.62).
• Server 3: current 4x16; Outside Co. 1x6; internal 2x16 (max vCPU 1.14, max vMem 15.62).
• Server 4: current 4x16; Outside Co. 1x7; internal 3x16 (max vCPU 2.72, max vMem 15.61).

Data Center 3
• Server 1: current 4x6; Outside Co. 1x1; internal 2x6 (max vCPU 1.44, max vMem 5.49).
• Server 2: current 4x6; Outside Co. 1x1; internal 2x6 (max vCPU 1.7, max vMem 5.6).
• Server 3: current 2x4; Outside Co. 1x2; internal none (disposed).

The Process
• The capacity team produced the results and reviewed them with the project team to identify candidates.
• The project team communicated the plan to planners and application owners.
• Allow for rebuttal.
  – But you had better bring the facts.
• Optimize.

The First Year Results
• Of the 15% of VMs identified as candidates:
  – 23% were cancelled after the appeals process.
  – Of the completed:
    • 50% reduction in configured vCPUs.
    • vMem was excluded.
  – 100% of the reductions were made with no issues.

The Second Year Results
• Of the 8% of VMs identified as candidates:
  – 24% were cancelled after the appeals process.
  – Of the completed:
    • 20% reduction in configured vCPUs.
    • 10% reduction in configured vMem.
  – 100% of the reductions were made with no issues.

Realized Benefits
• Better-performing VMs.
  – Over-provisioning of resources can hurt.
• Better-performing hosts.
  – An accurate view allowed for higher utilization of the clusters.
• Costs.
  – Delayed the purchase of new farms for over a year.
• Time to focus on the future.
  – New farms running more powerful hardware allowed for a many-to-one replacement.

What were the issues?
• Communication breakdown.
  – For some owners, the first knowledge of an optimization came from the change request.
• Lack of understanding.
  – Not knowing how and why.
• Coordination of optimizations.
  – Had to learn how things would work.

What led to the Success?
• Management backing.
  – "You will be optimized unless you can produce evidence."
• Conservative formula.
  – Peak utilization served us well.
• Communication, communication, communication.
• Processes in place.
  – An appeals process.
  – Resources on demand (or at least with a phone call).

What if we got more aggressive?
All four scenarios share the same controls (all infrastructure; vCPU & vMem; 180 days deployed; environment matching by AppID, match Prod/BCP and match Non-Prod) and cover over 10k VMs analyzed.
• Scenario 1 (same as Year 2): 90 days of data minimum (up to a 15-month max); single max vCPU (15-minute interval) + 20%, rounded up; single max consumed vMem (15-minute interval) + 5%, rounded up. Candidates identified: 4%.
• Scenario 2 (1-hour interval): 90 days of data minimum (up to a 15-month max); single max vCPU (1-hour interval) + 20%, rounded up; single max consumed vMem (1-hour interval) + 5%, rounded up. Candidates identified: 18%.
• Scenario 3 (12-month max): 90 days of data minimum (up to a 12-month max); single max vCPU (1-hour interval) + 20%, rounded up; single max consumed vMem (1-hour interval) + 5%, rounded up. Candidates identified: 19%.
• Scenario 4 (less overhead): same data window as Scenario 3; single max vCPU (1-hour interval), rounded up; single max consumed vMem (1-hour interval), rounded up; no headroom added. Candidates identified: 29%.
• Risk increases with each scenario.

Where are we today
• Part of the request process.
  – Previously, we may or may not have been asked for a sizing.
  – Currently, all sizings come through us.
• All existing servers get a sizing recommendation.
• Annual optimization review.
  – At least one optimization per year.
  – Optimization now includes vCPU, vMem, and storage.
    • Storage follows the same type of guidelines, but the analysis is not done by the capacity team.

Thanks for listening!
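Appendix: Sizing Formula (illustrative)

The internal guideline (single max over 15-minute peak samples, plus headroom, rounded up) reduces to a couple of lines of arithmetic. The sketch below is a hypothetical reconstruction in Python, not the tooling actually used; the function names and sample values are invented for illustration.

```python
import math

def recommend_vcpu(peak_samples, headroom=0.20):
    """vCPU recommendation: single max of the 15-minute peak samples,
    plus 20% headroom, rounded up (per the internal guideline)."""
    return math.ceil(max(peak_samples) * (1 + headroom))

def recommend_vmem(peak_samples_gb, headroom=0.05):
    """vMem recommendation (GB): single max consumed memory,
    plus 5% headroom, rounded up."""
    return math.ceil(max(peak_samples_gb) * (1 + headroom))

# Hypothetical excerpts from 90 days of 15-minute samples.
vcpu_peaks = [0.9, 1.1, 1.48, 1.2]   # vCPUs actually busy
vmem_peaks = [6.2, 7.1, 7.4, 6.9]    # consumed vMem, GB

print(recommend_vcpu(vcpu_peaks))  # ceil(1.48 * 1.20) -> 2 vCPUs
print(recommend_vmem(vmem_peaks))  # ceil(7.4 * 1.05)  -> 8 GB
```

A VM configured as 4x8 with these (made-up) peaks would be flagged as a 2x8 candidate, the same shape of reduction shown in the comparison tables.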
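Appendix: Why the Sampling Interval Matters (illustrative)

The jump in candidates between the 15-minute-interval and 1-hour-interval scenarios comes from peak flattening: averaging four 15-minute samples into one 1-hour sample hides short bursts, so the same formula produces smaller recommendations and flags more VMs, at higher risk. A hypothetical Python illustration (the sample values are invented):

```python
import math

def hourly_averages(samples_15min):
    """Collapse 15-minute samples into 1-hour averages (4 per hour)."""
    return [sum(samples_15min[i:i + 4]) / 4
            for i in range(0, len(samples_15min), 4)]

def recommend(peaks, headroom=0.20):
    """Single max plus headroom, rounded up."""
    return math.ceil(max(peaks) * (1 + headroom))

# Two hours of hypothetical 15-minute vCPU peaks with one short burst.
samples = [0.6, 0.7, 3.4, 0.8,   # hour 1: a 15-minute spike to 3.4
           0.5, 0.6, 0.7, 0.6]   # hour 2: quiet

print(recommend(samples))                   # 15-min basis: ceil(3.4 * 1.2) = 5
print(recommend(hourly_averages(samples)))  # 1-hour basis: hourly peak ~1.375 -> 2
```

The burst that justifies 5 vCPUs at 15-minute resolution disappears into the hourly average; that hidden demand is exactly the added risk the aggressive scenarios trade away.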