June 3, 2011
AIX Rightsizing

Clea Zolotow, Senior Technical Staff Member, IBM Corporation
Nicholas Lydakis, Manager, Capacity Planning, WellPoint Corporation
© 2009 IBM Corporation

ABSTRACT

There are many ways to reduce cost in a datacenter. One of the easiest is to decrease the number of servers on the floor. Now, along with physical consolidation, we can logically simplify the datacenter by utilizing virtualization. Some technical barriers to virtualization are:
– Performance concerns – workloads competing for resources;
– Growth concerns – workloads cannot reserve space for growth; and
– Architectural constraints – servers run out of IO or memory before they run out of CPU.

This presentation provides a mass analysis methodology to address performance and growth concerns and architectural constraints, as well as methodologies that can be used to coadunate LPARs to achieve higher utilization rates at the hardware level. This methodology has been quite successful at IBM. Our biggest cost savings was a run rate of $2.4 million yearly in hardware and a $2 million software savings due to decreased engine utilization.

Virtualization = Infrastructure Simplification

Efficient virtualization provides the best ROI and minimizes the risk.

Complex
– 1 workload per server
– Manual provisioning
– No sharing
– Vertical silos
– Disparate mgmt tools
– Multiple sites

Physical Consolidation
– Fewer sites
– Use of larger servers / SANs
– Mostly environmental savings
– Disparate management tools
– Labor-intensive provisioning
– Workload mgmt and isolation issues

Virtualization (Logical Simplification)
– Multiple virtual servers (OSs) per physical server
– Significant savings – fewer servers, higher utilization
– Rapid "provisioning"
– Automatic workload mgmt
– Preserve logical "server to application" relations

[Diagram: Windows, Unix, Linux and management servers with their own networking and storage, consolidated onto a SAN, then onto virtual servers, storage and networks.]

Virtualization's popularity today is based on its ability to optimize IT

Why do organizations adopt virtualization?
Virtualization has been around for decades, and it is here to stay. Large and small organizations alike are rapidly adopting the technology, for reasons that range from reduced IT costs to simplified IT environments, streamlined management and increased IT flexibility.

Virtualization motivators:
  Reduce costs                                     57%
  Simplify IT infrastructure and administration    48%
  Increase server utilization                      48%
  Increase scalability of infrastructure           29%
  Enhance resiliency and reliability               25%
  Improve application performance                  15%
  Manage a heterogeneous server environment         9%
Source: IBM Systems and Technology Group (1Q06)

Each Workload is Evaluated for Suitability Based on Technical Attributes

Priority workloads for consolidation:
– WebSphere® applications
– Domino® applications
– Selected tools: Tivoli®, WebSphere® and internally developed
– WebSphere MQ
– DB2® Universal Database™

Current Mid-Range Server Location by State – Physical Consolidation Opportunities Still Exist!

Unix and Intel by location: California, Colorado, Connecticut, Georgia, Illinois, Indiana, Kentucky, Maine, Massachusetts, Michigan, Missouri, Nevada, New Hampshire, New York, North Carolina, Ohio, Texas, Virginia, West Virginia, Wisconsin.

Analysis Methodology to Address Performance and Growth Concerns: Rightsize Individual LPARs (CPU and Memory)

Know your current hardware utilization rates and derive potential cost savings to get customer/app owner buy-in.

Roll out resizing in waves.
– Capacity planning has to measure pre- and post-wave to ensure that there is headroom for processing.
– Find potential resource problems before the app owner does.

Rightsize individual LPARs.
– Initial pass is "perfect world".
– Second pass is initial meeting with app owners.
– Third and subsequent passes take into account most-loved and business-critical applications.

[Chart: Physical Box Busy (%). Data labels: Production 8.43; 13.66.]
[Chart, continued: Non-Production; axis 0 to 16.]

Actual hardware savings are usually 50% or less of what the perfect-world analysis predicts.

UNIX Virtualized vs. Non-Virtualized Utilization
Large Company – Recent Data

                                  Non-Production  Non-Production  Production  Production  Total/
                                  LPAR            Server          LPAR        Server      Averages
Average CPU Busy                  11.88           9.43            12.94       10.85       11.28
Average CPU Max                   84.36           71.26           81.81       67.25       76.17
Number of Virtual Machines/LPARs  499             54              406         153         1112

Capped and Uncapped Mode

Micro-Partitioning offers two partition types, capped and uncapped. The difference is in the ability of a partition to use extra capacity available in the system. If a processor donates unused cycles back to the shared pool, or if the system has idle capacity (because there is not enough workload running), the extra cycles may be used by other partitions, depending on their type and configuration.

Capped mode: the processing capacity never exceeds the assigned processing capacity.
Uncapped mode: the processing capacity may be exceeded when the shared processing pool has available resources.

Capped Mode

A capped partition is defined with a hard maximum limit of processing capacity. It cannot go over its defined maximum capacity in any situation unless you change the configuration for that partition (either by modifying the partition profile or by executing a dynamic LPAR operation). Even if the system is otherwise idle, the capped partition cannot exceed its entitled capacity.

Uncapped Mode

With an uncapped partition, you must specify the uncapped weight of that partition.
If multiple uncapped logical partitions require idle processing units, the managed system distributes idle processing units to the logical partitions in proportion to each logical partition's uncapped weight. The higher the uncapped weight of a logical partition, the more processing units the logical partition gets.

Min, Max and Desired

When assigning processor values you must specify minimum, desired, and maximum values for both processing units and virtual processors. If any of the three types of resources cannot satisfy the specified minimum and required values, the activation of a partition fails. If the available resources satisfy all the minimum and required values but do not satisfy the desired values, the activated partition gets as many of the resources as are available.

  Min Processing Unit      0.1
  Desired Processing Unit  0.5
  Max Processing Unit      1
  Min Virtual CPU          1
  Desired Virtual CPU      1
  Max Virtual CPU          2

(Slide callout: "This is the Cap")

The maximum value is used to limit the maximum processor resources when dynamic logical partitioning operations are performed on the partition.

Rightsizing Methodology: AIX CPU Sizing Parameters (Uncapped)

Minimum = the lowest configuration available without rebooting
Physical Entitlement = the starting configuration of the LPAR
Virtual Entitlement = the maximum the LPAR can receive
Maximum = the highest configuration available without rebooting

Physical engines:
– Minimum: half of the Physical Entitlement.
– Entitlement: the average CPU consumed by the LPAR, or 10% of the Virtual Entitlement, whichever is higher. The total of this number cannot exceed the activated CPUs on the frame.
– Maximum: twice the Physical Entitlement.

Virtual engines:
– Minimum: half of the Virtual Entitlement.
– Entitlement: the maximum of the CPU consumed by the LPAR * 1.30 (a 30% uplift).
– Maximum (virtual engines): twice the Virtual Entitlement.

Rightsizing Methodology: AIX CPU Sizing Parameters (Capped)

Minimum = the lowest configuration available without rebooting
Physical Entitlement = the capacity the LPAR can receive
Maximum = the highest configuration available without rebooting

Physical engines:
– Minimum: half of the Physical Entitlement.
– Entitlement: the maximum of the CPU consumed by the LPAR * 1.30 (a 30% uplift). The total of this number cannot exceed the activated CPUs on the frame.
– Maximum: twice the Physical Entitlement.

Advanced Power Virtualization

Virtual I/O Server
– Shared Ethernet
– Shared SCSI and Fibre Channel-attached disk subsystems
– Supports AIX 5L V5.3 and Linux partitions

Micro-Partitioning
– Share processors across multiple partitions
– Minimum partition 1/10th processor

Partition Load Manager
– Balances processor and memory requests

Managed via HMC or IVM

[Diagram: a Hypervisor hosting a Virtual I/O Server partition (Ethernet sharing, storage sharing, virtual I/O paths); dynamically resizable AIX 5L V5.2, AIX 5L V5.3, Linux and i5/OS V5R3 partitions; Micro-Partitioning; and a Partition Load Manager server with PLM agents on the managed partitions (LPAR 1 AIX 5L V5.2, LPAR 2 AIX 5L V5.3) alongside unmanaged partitions (LPAR 3 Linux).]

Tooling and Data Retrieval: SRM

To the right is the SRM methodology and data streams. This works like many other performance and capacity systems. Agents that collect data every minute are deployed (1), and their data is sent to an interim holding spot (2), where it is processed and summarized to 15-minute or hourly intervals (3), stored in DB2 (4) and presented on the SRM website (4).
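
The summarization step (3) can be sketched in a few lines of Python. This is only an illustration of the idea, not the SRM implementation: the function name, the (timestamp, CPU busy) sample layout and the use of a plain mean are all assumptions.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def summarize(samples, interval_minutes=15):
    """Average per-minute CPU samples into fixed-width intervals.

    samples: iterable of (datetime, cpu_busy_percent) pairs (hypothetical layout).
    interval_minutes: interval width; assumed to divide 60 evenly (15 or 60 here).
    Returns a dict mapping each interval's start time to its mean CPU busy.
    """
    buckets = defaultdict(list)
    for stamp, busy in samples:
        # Truncate the timestamp down to the start of its interval.
        minute = (stamp.minute // interval_minutes) * interval_minutes
        start = stamp.replace(minute=minute, second=0, microsecond=0)
        buckets[start].append(busy)
    return {start: mean(values) for start, values in sorted(buckets.items())}


if __name__ == "__main__":
    # Thirty synthetic one-minute samples starting at 10:00.
    demo = [(datetime(2011, 6, 3, 10, m), float(m)) for m in range(30)]
    for start, busy in summarize(demo).items():
        print(start.isoformat(), round(busy, 2))
```

Averaging hides short spikes inside each interval, which is one reason the sizing rules later in this deck add an uplift on top of hourly data.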

Tooling and Data Retrieval: Brio (ODBC)

After the data is loaded to the SRM data warehouse, it is extracted to the PC utilizing Microsoft's Open Database Connectivity (ODBC). There, the architectural and utilization information is merged together to produce three reports utilized for rightsizing and server consolidation studies.

[Diagram: the SRM Data Warehouse supplies utilization information and architectural information to Brio, which produces utilization reporting, architectural reporting and rightsizing reporting, with custom categorization.]

Rightsizing Methodology: AIX CPU Sizing Parameters

Part One: Pull the data:
=ROUNDUP(IF(A3="Capped",(G3*I3/100)*1.3,J3),0)
=IF(K3/10>M3,K3/10,M3)
=ROUNDUP(IF(A3="Capped",G3*I3/100,J3),1)
Use this later; start with the forest, not the trees.

Part Two: Analyze it

The Big Picture

In the previous example, I chose only the 34 32-way boxes at this corporation (1088 CPUs). 385 physical CPUs on capped LPARs are currently allocated to the workload. After rightsizing, in a perfect world, we uncapped all the LPARs and could run them on 261 virtual CPUs and 174.8 physical CPUs, or 5.5 32-way boxes, a savings of 25 physical frames after accounting for headroom (2 CPUs per frame) and 4 engines per frame dedicated to VIOS. Your mileage will vary.

Technical Barriers to Virtualization: Workloads Competing for Resources

Monitoring workloads is essential. Siloed corporations seem to believe that in shared-host systems, someone else is stealing their CPU. The next chart shows how physical utilization can be calculated at the frame level.

Uncapped LPAR utilization is calculated from the number of CPUs dispatched to service the workload and therefore includes any LPAR overhead or frame overhead (PURR value, physical processors consumed).
Capped LPAR utilization can be calculated in two ways:
– Simple count of engines, as they are no longer in the shared pool (i.e., the number of physical CPUs).
– CPU Utilization * the number of physical CPUs assigned.

To prove to management that the boxes are underutilized and run a cost savings project, I usually use CPU Utilization (as seen on the next page). To prove to application owners that the CPU isn't being "stolen", I use the "simple count of engines" for the capped environment and the CPU dispatched for the uncapped.

Technical Barriers to Virtualization: Workloads Competing for Resources

– The top (yellow bar) is the number of physical CPUs, here 32.
– The top of the blue line is the maximum CPU utilization of the frame.
– The bottom of the blue line is the average utilization of the frame.
– The red square is the 90th percentile of the CPU utilization of the frame, utilizing hourly data.

[Chart: CPU Utilization by frame. Y-axis: Physical CPUs (0 to 35); x-axis: the 34 frame serial numbers. Series: CPUs on Frame, Avg HW Used, Max HW Used, 90thPCtile HW Used.]

Growth Concerns – Workloads Cannot Reserve Space for Growth

In an uncapped environment, workloads can reserve space for growth by utilizing the number of virtual CPUs available to the workload. This was used to "sell" the benefits of uncapped LPARs to the application owners.
In the previous example, a 30% uplift was built into the calculation for the virtual CPUs:
– =ROUNDUP(IF(A3="Capped",(G3*I3/100)*1.3,J3),0).
– As you work with your individual environment, you can customize that uplift.
– Note that the uplift covers not only growth but also intra-hour peaks (as I utilized hourly average data).

Architectural Constraints – Servers Run Out of IO or Memory Before They Run Out of CPU

These machines require 1,393,664 MB of memory to run their workload. (Memory optimization will have to wait for another day.) Spread evenly over 7 machines, each machine would require 199,095 MB of memory, which rounds up to 200,704 MB in 4,096 MB increments or 204,800 MB in 8,192 MB increments. Unfortunately, these machines came with 131,072 MB.

Further, there are 7 Oracle databases for which the application owner will not let the LPAR run on shared VIOS, adding to the number of frames and the number of engines.

Methodologies to Coadunate LPARs

coadunation: the state or condition of being united by growth. (coadunate, adj.)

Coadunation Example

Mixing workload shares headroom, but you pay in response time at low utilization. Workload management shifts peaks based on business priorities to use "white space", but the response time of lower-priority work is traded off.

Data Preparation

Data is readily available from the SRM database at srmweb.raleigh.ibm.com. Data is extracted and normalized to the receiving machine using the Ideas International database. The CSV file is briefly edited, then run into SPOT. This extraction and load process takes about 20 minutes (depending on the response time of the SRM database). The SPOT tool takes about 10 minutes to run each datacenter (Southbury and Boulder). Total study time is 60 minutes. Easy!
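
Normalizing to the receiving machine amounts to scaling each LPAR's CPU consumption by the ratio of source to target performance ratings. The sketch below assumes a per-CPU relative performance rating for each machine model; the function name and the rating values are hypothetical and are not taken from the Ideas International database.

```python
def normalize_cpus(cpus_used, source_rating, target_rating):
    """Convert CPUs consumed on a source machine into equivalent target CPUs.

    cpus_used: physical CPUs consumed on the source machine.
    source_rating, target_rating: per-CPU relative performance ratings
    (hypothetical values; a real study would look them up per machine model).
    """
    if source_rating <= 0 or target_rating <= 0:
        raise ValueError("ratings must be positive")
    return cpus_used * source_rating / target_rating


if __name__ == "__main__":
    # An LPAR using 2.0 CPUs on an older box rated 100 per CPU,
    # moved to a receiving machine rated 250 per CPU.
    print(normalize_cpus(2.0, 100.0, 250.0))
```

With these made-up ratings, 2.0 source CPUs shrink to 0.8 target CPUs, so utilization from frames of different generations can be compared on one scale before it is loaded into the consolidation tool.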

[SPOT Screenshot #1]

[SPOT Screenshot #2]

[SPOT Screenshot #3]

Results of Coadunation Study, Boulder

Boulder has 24 physical frames holding 93 LPARs, averaging 3.875 LPARs per frame. Based on CPU utilization, the LPARs could all be deployed to 5 x445s, which would then run an average of 47.4% busy, a savings of 19 physical frames. 2 LPARs would be migrated to stand-alone. (This is an average of 18.2 LPARs per frame.) Current host utilization for Boulder for March 2007 was 7.33% busy.

Results of Coadunation Study, Southbury

Southbury has 17 physical frames holding 59 LPARs, averaging 3.47 LPARs per frame. Based on CPU utilization, the LPARs could all be deployed to 4 x445s, which would then run an average of 45.6% busy, a savings of 13 physical frames. 2 LPARs would be migrated to stand-alone. (This is an average of 14.25 LPARs per frame.) Current utilization for Southbury was 7.62% busy.

Conclusion

There are many ways to reduce cost in a datacenter. Decrease the number of servers on the floor using physical or virtual consolidation.

Address concerns:
– Performance concerns – workloads competing for resources;
– Growth concerns – workloads cannot reserve space for growth; and
– Architectural constraints – servers run out of IO or memory before they run out of CPU.

Utilize a statistical or bin-packing mass analysis methodology to coadunate LPARs to achieve higher utilization rates at the hardware level. Get those cost savings!

Questions?
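
As a closing illustration of the bin-packing idea named in the conclusion, here is a minimal first-fit-decreasing sketch. It is a generic textbook heuristic, not the algorithm used by SPOT. The function name and the LPAR demands are hypothetical; the defaults reuse the 32-way frame, 2 CPUs of headroom and 4 VIOS engines per frame quoted earlier in the deck.

```python
def pack_lpars(lpar_cpus, frame_cpus=32, headroom_cpus=2, vios_cpus=4):
    """First-fit decreasing: place each LPAR on the first frame with room.

    lpar_cpus: dict of LPAR name -> physical CPUs required (hypothetical data).
    Returns a list of frames, each a list of the LPAR names placed on it.
    """
    usable = frame_cpus - headroom_cpus - vios_cpus
    frames = []  # LPAR names per frame
    free = []    # usable CPUs still unallocated on each frame
    # Largest demands first, so big LPARs are not stranded at the end.
    ordered = sorted(lpar_cpus.items(), key=lambda item: item[1], reverse=True)
    for name, need in ordered:
        if need > usable:
            raise ValueError(f"{name} needs {need} CPUs; a frame offers {usable}")
        for index, room in enumerate(free):
            if need <= room:
                frames[index].append(name)
                free[index] -= need
                break
        else:
            # No existing frame has room: open a new one.
            frames.append([name])
            free.append(usable - need)
    return frames


if __name__ == "__main__":
    demo = {"a": 20, "b": 10, "c": 6, "d": 6, "e": 4}
    for number, members in enumerate(pack_lpars(demo), start=1):
        print(f"frame {number}: {members}")
```

A real study would also have to respect memory and IO limits and any placement restrictions from application owners, which a CPU-only packing like this one ignores.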