Characterizing Cloud Management Performance
Adarsh Jagadeeshwaran
CMG INDIA CONFERENCE, December 12, 2014
© 2014 VMware Inc. All rights reserved.

Agenda
• Building Blocks of VMware’s Cloud Infrastructure
• The Software Defined Datacenter
• Cloud Management Performance at VMware
• Performance Challenges
• Tools and Benchmarks
• Role of Simulation
• Performance Testing Methodology
• Conclusion

Building Blocks of VMware’s Cloud Infrastructure

It all started with x86 virtualization
[Figure: Traditional Architecture vs. Virtual Architecture]
CONFIDENTIAL

And features like...
• VM.migrate
  – Move the compute state of a Virtual Machine (VM) from one physical box to another
  – Typically used for resource load balancing
• VM.snapshot
  – Preserve state and data of a VM at a specific point in time
  – Snapshots help avoid damage to VMs when a patch or upgrade goes wrong
• Distributed Resource Scheduling

Building the cloud
The New Role for IT: IT as a Service
• Virtual Workspace: Manage access to services, applications and data for any device
• Hybrid Cloud: Seamlessly extend your data center to the public cloud (Private Clouds, Public Clouds)
• Software-Defined Data Center: Virtualize the entire data center (Management and Automation, Storage and Availability, Compute, Network and Security)

Cloud Infrastructure = Software Defined Data Center

Compute: CPU, memory resources
[Figure: applications (APP) on operating systems (OS) running on the Compute layer]

+ Storage
[Figure: the same stack with a Storage layer added]

+ Networking/Security
[Figure: the same stack with a Network/Security layer added]

+ Automation/Management – This is key
[Figure: the same stack with an Automation & Management layer added]

= Virtual Datacenter
[Figure: Software-defined Datacenter hosting VDC 1 and VDC 2 on top of the Compute, Storage, Network/Security and Automation & Management layers]

Typical Deployment
[Figure: Finance, R&D and Grid; Software-defined Datacenter Services; Software-defined Datacenter]

Cloud Management Performance at VMware

SDDC Management Suite
[Figure: VMware vCloud® Suite: Cloud Service Provisioning, Virtual Networking and Security, Software-Defined Storage and Availability, Operations Management]

VMware Performance R&D
PERFORMANCE
• MEASURE: instrument, benchmark, analyze
• OPTIMIZE: design, fix code, tune settings
• PUBLISH: white papers, blogs, KB articles, flings

Performance Challenges

The Management Server
[Figure: UI Clients, UI Server, Single Sign-On, Stats Processing, Inventory DB (xml), Relational Database, and Server 1 and Server 2, each running a host_agent and VMs]

Components affecting performance
• VM resources like CPU and memory
  – shared across other VMs on same physical server (host)
• Virtual devices
  – storage, networking, VM devices
  – data stored in management server database
• Number of managed objects
  – data stored in management server database
  – ESXi hosts
  – VMs
  – Resource Pools
  – Clusters
• Performance statistics about objects
  – stored and processed in the database
  – Multiple levels of statistics from less to more detailed
• Incoming tasks and queries
  – CPU/memory usage on management server

Deployment Size
• Overall Size:
  – Small: up to 150 servers, 3000 VMs
  – Medium: up to 300 servers, 6000 VMs
  – Large: up to 1000 servers, 10000 VMs
• Single Cluster Size:
  – Resource Scheduling, Availability and Power Management work at a cluster level
  – Up to 32 servers or 4000 VMs in a single cluster
• A setup with 50 servers and 2000 VMs with least detailed statistics can result in a database size of approx.
  16GB

Identify Common Use Cases
Cloud Solutions – Ex: vCloud Director (spans multiple Management Servers)
• Cloud Management Workflow 1
  – Instantiate vApp
  – Deploy vApp
  – Edit vApp
  – Undeploy vApp
  – Delete vApp
• Cloud Management Workflow 2
  – Clone vApp
  – Delete vApp

Identify Common Use Cases – Contd.
Customer Usage Patterns
• Customer Support Data
• Software support bundle – logs, events, traces
• Identify common operation pattern and frequency
• Group patterns by deployment size

Build Tools for Stats and Monitoring
• Monitor Resource Usage
  – Server level
  – Management level
  – Components of the Management Server
• Build Internal Profiling Counters
  – Count of objects in memory
  – Aggregated stats about tasks, events, etc.
  – Locking information

Tools and Benchmarks

Microbenchmark
• Simulates load on server from a given operation
  – Example: 256 VM.powerOn operations in sequence
• Focus on specific operation (no background load)
• Study scaling trend for a given operation (latency)
• Study resource usage trend
• Performance of a specific server component

Macro-benchmark
• In-house benchmark: VCBench
• Simulate (Admin) User Tasks
  – Issues management operations using public APIs
• Simulate Multiple Users
  – Multiple threads issuing a series of operations
• (User) Think time
  – User can specify “think” time between operations
• Realistic workload
  – Operation mix & frequency extracted from customer data
• Measure throughput
  – Number of operations completed in given time
• Measure latency of operation in the presence of load and corresponding resource usage

Benchmark Run Profile
• Two primary modes
  – “Light”: around 100 operations issued per minute
  – “Heavy”: around 500 operations issued per minute
• Light load slightly above most customer workloads
  – Lets us exercise the entire management stack
  – And anticipate increased realistic demand in the short term
• Heavy load for saturating the
  management server
  – Saturation is the point where giving the server more resources no longer increases throughput

Realistic Operation Mix

Operation          Operations/min. (light)
Power On VM        40
Power Off VM       40
Clone VM           10
Migrate VM         40
Remove VM          10
Create Snapshot    5
Delete Snapshot    5
Reconfigure VM     10

• Mix of operations revised constantly based on new features and changing datacenter use cases.
• Mix and frequency varied simply by editing a run list.

Tools for monitoring performance
• Resource Usage Tool
  – Tool built into hypervisor (esxtop) and management server
  – Monitoring at component level
• Profiling tools (post-process)
  – Uses management server’s internal profiling information from log bundle
  – Summarizes performance metrics of internal objects

Role of Simulation

Why Simulation?
• 1024 physical servers running ESXi (host) is a management nightmare
• Plus 15K VMs and the associated networking and storage components
• Solution?
  – Have a simulated version of the hypervisor
  – Fake the existence of VMs and datastores
  – Management Server sees no difference

Simulating the hypervisor
• Hypervisor agent is the Management Server’s agent running on the ESXi server
• With the hardware and VMs simulated, we can have the real hypervisor agents run as separate threads in different containers
• We retain the agent to management server communication intact
• Number of objects & properties to be managed by server remains the same
• Some Challenges:
  – Simulating performance statistics, events and alarms
  – Simulating VM IO
• Advantages:
  – Hypervisor layer is a black box with consistent performance
  – No hypervisor or storage performance bottleneck
  – Focus is purely on management server scaling and performance

Performance Testing Methodology

Testing for Performance and Scale
• Testing at supported scale
• Hypervisor Scaling (Scale-up)
  – Stacking more VMs on the same physical box
  – Focus is on Hypervisor performance
• Management Server Scaling (Scale-out)
  – Managing more physical boxes and VMs
  – Focus is on Management Server performance
  – a) Single Cluster at scale
  – b) Overall large deployment

Test configurations
• Scale-Up
  – 1 or 2 ESXi Hosts
  – 0.5-1K VMs per Host
  – Microbenchmark with focus on one operation at a time
  – 1, 32, 64, 128, 256, 512 vm.powerOn, vm.reconfigure, etc.
  – Metrics measured: end-to-end latency, CPU/memory usage
• Scale-Out
  – 1024 ESXi Hosts managed by a single Management Server
  – 15K VMs total
  – Benchmark with concurrently issued operations: datacenter.powerOn, vm.migrate, etc.
  – Metrics measured: operation throughput, latency, CPU/memory
    usage

Regression Tracking
• Performance Automation automates processes for setup and regression tracking
• Tracking for different scale inventories
• Track benchmark data (throughput, latency), and resource usage of management server components for regression
• Analyze and fix regressions in performance
• Also useful for sizing guidelines

Conclusion

Takeaways
• Understand factors affecting performance
• Have a comprehensive stats/monitoring framework
• Build a realistic benchmark that replicates customer behavior
• Ideal benchmark run should
  – Include common use cases and user behavior
  – Remove variability in a multi-tiered setup
  – Be able to focus on single component
• Simulation can help remove variability and makes testing at scale practical
• Generate microbenchmarks that stress a single/small number of components

References
• “Benchmarking a Virtualized Platform”, Vijayaraghavan Soundararajan et al., IISWC 2014 (http://www.iiswc.org/iiswc2014/program2014.html)

Thanks To
• VMware vCenter Server Performance Team

Backup

Example SDDC Management Task: Distributed Resource Scheduling using VMotion
[Figure: Resource Pool spanning VMware ESX, VMware ESX and VMware ESXi hosts]
• Balance VM load in a cluster of ESXi servers
• Enforce Policy Based Rules
• Power Management
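
Illustration: operation mix as a run list

The "Realistic Operation Mix" and "Macro-benchmark" slides describe a workload driven by a run list, several simulated users, and think time. The sketch below is one possible way to express that idea and does not show VCBench's actual run-list format, which the deck does not describe. The names `LIGHT_MIX`, `build_run_list` and `think_time_seconds` are hypothetical. The per-minute rates are copied from the table as given; they add up to 160 operations per minute, more than the "around 100" quoted for the light profile, so treat them as relative weights if the lower total is intended.

```python
import random

# Per-minute rates copied from the "Realistic Operation Mix" table (light load).
LIGHT_MIX = {
    "Power On VM": 40,
    "Power Off VM": 40,
    "Clone VM": 10,
    "Migrate VM": 40,
    "Remove VM": 10,
    "Create Snapshot": 5,
    "Delete Snapshot": 5,
    "Reconfigure VM": 10,
}


def build_run_list(mix, minutes, seed=0):
    """Return a shuffled list of operation names covering `minutes` of load.

    Each operation appears rate * minutes times, so the mix is preserved
    exactly; the fixed seed keeps runs repeatable.
    """
    rng = random.Random(seed)
    run_list = [op for op, per_min in mix.items() for _ in range(per_min * minutes)]
    rng.shuffle(run_list)
    return run_list


def think_time_seconds(mix, users):
    """Pause between operations for each simulated user.

    Assumes `users` threads share the load evenly and ignores the latency
    of the operations themselves (a simplification).
    """
    total_per_min = sum(mix.values())
    return 60.0 * users / total_per_min


run_list = build_run_list(LIGHT_MIX, minutes=1)
print(len(run_list), "operations in a one-minute run list")
print(think_time_seconds(LIGHT_MIX, users=8), "seconds of think time per user")
```

Changing the mix or the frequency then only means editing the dictionary, which mirrors the "editing a run list" point on the slide.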
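
Illustration: microbenchmark scaling sweep

The scale-up configuration issues one operation at batch sizes of 1, 32, 64, 128, 256 and 512 and records end-to-end latency. A minimal sketch of such a sweep follows. `fake_power_on` is a stand-in that does a small amount of local work; in a real run it would be replaced by a call to the management server, which is not shown here. All function names are hypothetical.

```python
import time

# Batch sizes taken from the "Test configurations" slide.
BATCH_SIZES = [1, 32, 64, 128, 256, 512]


def run_batch(operation, count):
    """Issue `count` operations in sequence and return the elapsed seconds."""
    start = time.perf_counter()
    for index in range(count):
        operation(index)
    return time.perf_counter() - start


def scaling_trend(operation, batch_sizes=BATCH_SIZES):
    """Return (batch size, total seconds, seconds per operation) per batch."""
    results = []
    for size in batch_sizes:
        elapsed = run_batch(operation, size)
        results.append((size, elapsed, elapsed / size))
    return results


def fake_power_on(vm_index):
    """Placeholder for a real VM.powerOn call (assumption, not a real API)."""
    return sum(range(100)) + vm_index


for size, total, per_op in scaling_trend(fake_power_on):
    print(f"{size:4d} ops  total={total:.6f}s  per-op={per_op:.9f}s")
```

If the per-operation time grows with batch size, the operation does not scale linearly, which is the trend the microbenchmark is meant to expose.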
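
Illustration: flagging a regression

The "Regression Tracking" slide tracks throughput, latency and resource usage across builds. One simple way to flag a regression is to compare each metric against a baseline with a tolerance, as sketched below. The 5% tolerance, the metric names and the sample numbers are all illustrative assumptions, not values from the deck or from VMware's automation.

```python
# Metrics where a larger value is better; everything else is "lower is better".
HIGHER_IS_BETTER = {"throughput_ops_per_min"}


def find_regressions(baseline, candidate, tolerance=0.05):
    """Return {metric: relative change} for metrics that got worse.

    A metric regresses when it moves in the bad direction by more than
    `tolerance` (a fraction of the baseline value).
    """
    regressions = {}
    for metric, base in baseline.items():
        change = (candidate[metric] - base) / base
        worse_by = -change if metric in HIGHER_IS_BETTER else change
        if worse_by > tolerance:
            regressions[metric] = round(change, 3)
    return regressions


# Made-up sample data for two builds.
baseline = {"throughput_ops_per_min": 100.0, "latency_s": 2.0, "cpu_pct": 40.0}
candidate = {"throughput_ops_per_min": 90.0, "latency_s": 2.04, "cpu_pct": 50.0}

print(find_regressions(baseline, candidate))
```

With this sample data the throughput drop and the CPU increase are flagged, while the 2% latency change stays within tolerance.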