State of the Benchmarks
Daniel Bowers
daniel.bowers@gartner.com
@Daniel_Bowers

Agenda
• Server Benchmarks
• Benchmark Types
• Public Result Pools
• Why aren't there more results?
• TPC and SPEC
• Vendor Benchmarks
• Virtualization Benchmarks
• Public and Open-Source Benchmarks
• "Good" and "Bad" Benchmarks

Server Benchmarks: A Definition
The level of performance for a given server configuration that vendors promise you can't achieve.

What People Do With Benchmark Results
• Purchase or deployment decisions
• Right-sizing & configuration definition
• Capacity planning / consolidation
• Normalize server value for chargeback
• Performance baselining and monitoring
• Troubleshooting
• Setting hardware and software pricing
• Configuration advice

Using Benchmarks: Bad to Great
| If you size systems this way…                           | Your results will be… |
|---------------------------------------------------------|-----------------------|
| Nothing                                                 | Catastrophic          |
| Use published SPEC CPU2006 for everything               | Poor                  |
| Use someone else's results on a standard benchmark      | Marginal              |
| Run standard benchmarks on your gear                    | OK                    |
| Have someone else run your applications in their lab    | OK                    |
| Run profiles of your workload on your gear              | Good                  |
| Run your actual workloads on lab systems                | Really Good           |
| Run your actual workloads on actual production systems  | Great                 |
Published result pools sit near the bottom: basically, better than nothing. The approaches at the top require being smart, hard-working, and rich.

Why Use Published Results?
• They're free
• They're quick
• No skill required
• You don't need to have the servers
• More accurate than MHz or MIPS
• They're often audited (or at least scrutinized)

Examples of Benchmarks with Public Result Pools
• Consortia: SPEC, TPC
• Academic & HPC: LINPACK / HPL, STREAM
• Vendor application benchmarks: SAP, VMware, Oracle, Fujitsu OLTP2
• Vendor relative metrics: IBM rPerf
• Government: APP, ENERGY STAR
• Embedded: EEMBC CoreMark
• Desktop: SiSoftware Sandra, Geekbench
• "Open": DVDStore, BogoMIPS
• Purchasing metrics: Amazon ECU, Google GCE

Example Result Pools (number of published results)
| Benchmark                      | Results |
|--------------------------------|---------|
| Oracle R12 (Payroll Large 2T)  | 8       |
| SPECjEnterprise2010            | 28      |
| VMmark 2.x                     | 38      |
| TPC-E                          | 54      |
| TPC-H                          | 162     |
| TPC-C                          | 264     |
| SPECpower_ssj2008              | 357     |
| SPECjbb2005                    | 669     |
| SAP R/3 SD 2-Tier              | 673     |
| SPECint2000 (retired)          | 1,100   |
| SPECint2006_rate               | 6,675   |

Benchmarks
[Cartoon: "Dilbert", December 25, 2004. Source: Dilbert.com.]

The Gap
19% of server configurations have published SPEC CPU2006 results.

Why Aren't There More Results?
• Vendors won't publish losers
• Publishing is rarely required, and can be prohibited
• Can take lots of manpower & money
• Little incentive for end users to publish results
• Benchmarks get broken or become stale

Why Aren't There More Results?
"Benchmarks are chosen to show off a platform, not to allow comparisons."
- From an IBM presentation to CMG, 2008

Why Aren't There More Results?
"A TPC-E measurement can take up to 3 months from preparation through the official acceptance by the TPC committee."
- From a Fujitsu document on server performance

TPC
Non-profit industry consortium
• Creates benchmarks to compare database systems
• Membership is primarily server hardware, OS, and database vendors
• 1997: 53 members (including associates)
• Today: 21 members
• Disclosure: Gartner is an associate member
All benchmark results are audited by a 3rd party & subject to challenge
• Full disclosure required (actually two); must include pricing
• Estimates and non-audited results not usually allowed
Produces specifications, not programs
Benchmark setups are large, usually dominated by storage

TPC Benchmark Timeline, 1990-2012
• TPC-A: Lightweight OLTP (retired)
• TPC-B: Batch / database stress (retired)
• TPC-C: OLTP, product supplier (active)
• TPC-E: OLTP, brokerage house (active)
• TPC-D: Decision support (retired)
• TPC-H: Ad hoc decision support / DSS (active)
• TPC-R: Business reporting / decision support (retired)
• TPC-DS: Decision support (new, 2012)
• TPC-W: Web commerce, transactional (retired)
• TPC-APP: Application server (retired)
• TPC-Energy: Energy add-on (active)
• TPC-VMS: (new, 2012)

TPC-C
• Long history, tremendous mindshare
• Results, estimates, and predictions for "tpmC" are plentiful
• Allows comparisons across many server generations
• OLTP workload that's old and weak
• Disparity between processor & I/O performance growth
• Storage costs dominate
• Server-to-storage I/O path is the bottleneck
• Quiz: why don't SSDs yield higher results?
• TPC has tried to replace it

Cost breakdown of an example TPC-C result
DL385 G1 using Microsoft SQL Server 2005. Full report: http://tpc.org/tpcc/results/tpcc_result_detail.asp?id=106032001
• Software: 13%
• Storage: 68%
• Client hardware: 5%
• Client software: 1%
• Server hardware: 11%
• Other / 3rd party: 2%

TPC-C: Example
IBM TPC-C Full Disclosure Report
http://tpc.org/tpcc/results/tpcc_result_detail.asp?id=112041101

Microsoft & TPC-C
Microsoft: "TPC-E is far superior to TPC-C."
- Charles Levine, Principal Program Manager, SQL Server Performance Engineering
Microsoft won't approve TPC-C publications using SQL Server 2008 or later.
Chart source: Microsoft (http://blogs.technet.com/b/dataplatforminsider/archive/2012/08/28/tpc-e-raising-the-bar-in-oltp-performance.aspx)

Benchmarks
[Cartoon: "Pepper…and Salt", January 29, 2013. Source: Wall Street Journal.]

TPC-E
• OLTP, like TPC-C
• More tables (33)
• More transaction types (~100), including more complex transactions
• Only results to date are on x86 with Microsoft SQL Server
• Trivia: dataset is based on the NYSE company list and uses some US census data

Helpful Hint: Fujitsu OLTP2
• Results for all recent Xeon processor models
• Search for PDF files entitled "Fujitsu PRIMERGY Servers Performance Report"

TPC-H
• Benchmark results are for specific database sizes (scales)
  - TPC: don't compare different sizes (but my research says that's OK)
• Parts of the data set scale linearly with performance
  - Some have become unrealistic: e.g., 50 billion customer records
• Smaller database sizes are "broken" by in-memory, columnar databases
  - Actian VectorWise results are about double the expected results
• Benchmark appears to be fading away, but may see a surge of activity as Oracle & Microsoft add columnar support to their databases
Source: HP whitepaper

TPC-DS
Decision support database benchmark meant to replace TPC-H. Released in mid-2012. No results to date. (No auditors either.)
• Includes many more query types than TPC-H
• Periodic database update process that more closely matches that of today's databases
• "Modern": modified star schema with fact tables and dimension tables

Other TPC Benchmarks
TPC-Energy
• Optional add-on to other TPC benchmarks
TPC-VMS
• Just released
• Runs 3 other TPC benchmarks simultaneously on a single system under test
In development:
• TPC-V
• TPC-ETL
• TPC-Express

Benchmarks
Why pay attention to BogoMIPS?
"To see whether your system is faster than mine. Of course this is completely wrong, unreliable, ill-founded, and utterly useless, but all benchmarks suffer from this same problem. So why not use it?"
- From the Linux BogoMIPS HOWTO

Standard Performance Evaluation Corporation (SPEC)
"The goal of SPEC is to ensure that the marketplace has a fair and useful set of metrics to differentiate candidate systems."
• Sells source code, including common ports
• Searchable results pool
115 members across four independent groups:
1) Open Systems Group (mostly vendors)
2) Workstation group
3) Graphics group
4) Research group (mostly academics)
Disclosure: Gartner is a member
Results generally require a standard-format report
- Lists intermediary results and optimizations used
- Price not included
- Estimates are allowed for most benchmarks

SPEC CPU2006
Measures CPU integer and floating-point capacity
• Often correlates with overall system performance, because server designs typically balance memory, I/O, and CPU
Actually 8 different metrics:
• Integer and floating-point tests
• Speed and rate tests
• Base and peak results
~25 different individual workloads, from games to quantum computing
Changes versions every 6-8 years
• CPU92, CPU95, CPU2000, CPU2006
• CPUv6 currently under development
• Results not comparable between versions

SPEC CPU2006
• Depends almost entirely on CPU model, core count, clock speed
• Some impact from compiler (e.g. +15%)
• Small impact from OS, cache
• Floating point impacted by memory speed
• "Turbo Mode" frequency correlation
• Benchmarked configurations must be "offered"
• Published results are peer reviewed (by 'competitors')
• Reviewers are PICKY!

SPEC CPU2006
"Benchmark results are usually for
• Sales and marketing
• Customer awareness
• Customer confidence"
- Fujitsu presentation

SPECjbb2005
Server-side Java benchmark
Heavily dependent on JVM
• Also highly dependent on processor speed, core count, Java garbage collection
• Results "plateau" beyond a certain amount of cache and memory
• Disk and network I/O play no part
Emulates a 3-tier system on a single host
- Database tier is emulated, memory-resident
Useful tidbits:
- Cheap & easy to run, so lots of results
- Measures transactions per second; transactions are similar to TPC-C's
- Full report includes performance beyond peak
- Being replaced (SPECjbb2013)

SPECjbb2013
Released last month!
Scales more realistically than SPECjbb2005
• Includes inter-JVM communication
Includes a response-time requirement & reporting, in addition to "operations per second"
Like jbb2005, a key design goal was making it easy to run.

SPECjEnterprise2010
Java server benchmark designed to test whole-system Java EE performance
• Includes database and storage
• System under test can be more than one server
Harder to set up and run vs. SPECjbb2005, so fewer results.
SPECjEnterprise2010: Top 15 results (as of 1 Feb 2013)
| #  | Product                   | Processor    | Chips | App Server           | Database   | OS      | EjOPS     |
|----|---------------------------|--------------|-------|----------------------|------------|---------|-----------|
| 1  | Oracle - SPARC T4-4       | SPARC T4     | 16    | WebLogic 12.x        | Oracle 11g | Solaris | 40,104.86 |
| 2  | Oracle - Sun Server X2-8  | Xeon E7-8870 | 8     | WebLogic 12.x        | Oracle 11g | Linux   | 27,150.05 |
| 3  | Oracle - Sun Server X2-8  | Xeon E7-8870 | 8     | WebLogic 12.x        | Oracle 11g | Linux   | 27,150.05 |
| 4  | Cisco - UCS B440 M2       | Xeon E7-4870 | 8     | Oracle WebLogic 10.X | Oracle 11g | Linux   | 26,118.67 |
| 5  | Cisco - UCS B440 M1       | Xeon X7560   | 8     | Oracle WebLogic 10.X | Oracle 11g | Linux   | 17,301.86 |
| 6  | IBM - Power 780 (MHB)     | POWER7       | 8     | WebSphere V7         | DB2 9.7    | AIX     | 16,646.34 |
| 7  | IBM - Power 780 (MHB)     | POWER7       | 8     | WebSphere V7         | DB2 9.7    | AIX     | 15,885.09 |
| 8  | IBM - BladeCenter HS22    | Xeon X5670   | 12    | WebSphere V7         | DB2 9.7    | Linux   | 15,829.86 |
| 9  | Dell - PowerEdge R910     | Xeon E7-4870 | 4     | Oracle WebLogic 10.X | Oracle 11g | Linux   | 11,946.60 |
| 10 | IBM - Power 780 (MHD)     | POWER7+      | 4     | WebSphere V8         | DB2 10.1   | AIX     | 10,902.30 |
| 11 | IBM - System x3650 M4     | Xeon E5-2690 | 2     | WebSphere V8         | DB2 10.1   | Linux   | 9,696.43  |
| 12 | Oracle - SPARC T3-4       | SPARC T3     | 4     | Oracle WebLogic 10.X | Oracle 11g | Solaris | 9,456.28  |
| 13 | Oracle - Sun Server X3-2  | Xeon E5-2690 | 2     | Oracle WebLogic 10.X | Oracle 11g | Linux   | 8,310.19  |
| 14 | Oracle - Sun Server X3-2  | Xeon E5-2690 | 2     | Oracle WebLogic 10.X | Oracle 11g | Linux   | 8,310.19  |
| 15 | IBM - BladeCenter HS22    | Xeon X5570   | 16    | WebSphere V7         | DB2 9.7    | Linux   | 7,903.16  |

Other SPEC benchmarks
• Power: SPECpower_ssj2008
• HPC: SPEC MPI, SPEC OMP
• File system: SPECsfs2008
• Messaging: SPECsip
• Web: SPECweb2009, SPECweb2005
• Mail: SPECmail2009
• SPECCloud, handheld working group
SPEC also has a research group that creates benchmarks for research & development purposes.

Vendor-Sponsored Application Benchmarks
SAP
• Various, but SD 2-Tier is most popular
• Results published on x86 due to support requirements
• Correlates with clock, cores, OS, database
• Plateaus at relatively low memory
• Pre-2009 results not comparable to current results
• Used for SAP "QuickSizer" system-sizing tool
Oracle
• Official: EBS benchmarks, Siebel & PeopleSoft benchmarks, etc.
• Good: workload-specific
• Bad: seeing fewer results than in the past
Microsoft
• Fast-Track system benchmarks: MCR/BCR

Oracle Benchmarks
| Benchmark                                                                                                  | # Results | Architectures |
|------------------------------------------------------------------------------------------------------------|-----------|---------------|
| Oracle EBS, Oracle Applications Release 12 (12.1.3) Single Instance, R12 Batch Order-to-Cash Large 2-Tier   | 1         | 1             |
| Oracle EBS, Oracle Applications Release 12 (12.1.3) Single Instance, R12 Batch Payroll Large/Extra-Large 2-Tier | 2     | 1             |
| Oracle EBS, Oracle Applications Release 12 (12.1.2) Single Instance, R12 OLTP X-Large 3-Tier                | 3         | 2             |
| Oracle EBS, Oracle Applications Release 12 (12.1.2) Single Instance, R12 Batch Order-to-Cash Large 2-Tier   | 3         | 2             |
| Oracle EBS, Oracle Applications Release 12 (12.1.2) Single Instance, R12 Batch Payroll Large/Extra-Large 2-Tier | 5     | 2             |
| Oracle EBS, Oracle Applications Release 12 (RUP 4) Single Instance, R12 Batch Order-to-Cash Medium 3-Tier   | 1         | 0             |
| Oracle EBS, Oracle Applications Release 12 (RUP 4) Single Instance, R12 Batch Order-to-Cash Medium 2-Tier   | 6         | 1             |
| Oracle EBS, Oracle Applications Release 12 (RUP 4) Single Instance, R12 Batch Payroll Large/Extra-Large 2-Tier | 4      | 0             |
| Oracle EBS, Oracle Applications Release 12 (RUP 4) Single Instance, R12 Batch Payroll Medium 3-Tier         | 2         | 0             |
| Oracle EBS, Oracle Applications Release 12 (RUP 4) Single Instance, R12 Batch Payroll Medium 2-Tier         | 7         | 1             |
| Oracle EBS, Oracle Applications Release 11i (11.5.10) Medium Configuration - RAC                            | 6         | 0             |
| Oracle EBS, Oracle Applications Release 11i (11.5.10) Medium Configuration - Single DB (Database Tier)      | 33        | 0             |
| Oracle EBS, Oracle Applications Release 11i (11.5.10) Small Configuration - Single DB (Database Tier)       | 42        | 0             |
| Oracle EBS, Oracle Applications Release 11i (11.5.10) Small Configuration - RAC (Database Tier)             | 3         | 0             |
| Oracle EBS, Real Application Clusters (11.5.9)                                                              | 2         | 0             |
| PeopleSoft Human Capital Management, Payroll for North America 9.x Batch                                    | 12        | 2             |
| PeopleSoft Financial Management, Financials 9.x day-in-the-life Online                                      | 3         | 2             |
| Siebel CRM Release 8.0                                                                                      | 15        | 0             |
| Siebel CRM Release 8.1.x                                                                                    | 3         | 2             |

Benchmarks
[Cartoon: "Dilbert", March 02, 2009. Source: Dilbert.com.]
Virtualization Benchmarks
VMmark
• Includes both application and "infrastructure" workloads
  - DVDStore, OLIO, Exchange 2007
  - Idle machine
  - vMotion, storage vMotion
• Based on a concept of "tiles"; each tile = 8 workloads
• VMware (and therefore x86) only
• CPU is the bottleneck, not memory
• With the same CPU, results from different vendors are almost identical
• vSphere license contains a DeWitt clause
SPECvirt
• Uses 3 other SPEC benchmarks as its workloads:
  - SPECweb2005
  - SPECjAppServer2004
  - SPECmail2008
• Uses a similar "tiles" concept to VMmark
• Just vSphere, KVM, and Xen results

Other Consortia Benchmarks

Government "Benchmarks"
ENERGY STAR
• Sponsored by the US EPA
• Rewards servers that achieve "best in class" power-efficiency targets
• Version 1 and the upcoming Version 2 disqualify some server categories
APP
• Calculated number used by the US for export-control reasons
• Similar to MIPS

Some commercial benchmark software
Server-oriented
• Quest (Dell) Benchmark Factory
Desktop-oriented
• SiSoftware Sandra*
• Primate Labs Geekbench*
• SysMark
• Phoronix Test Suite*
• Maxon Cinebench
* Include public repositories of user-submitted results
Repositories
• CloudHarmony (cloud instances)
Tools with metrics
• BMC Capacity Optimization
• Computer Associates Hyperformix
• VMware Capacity Planner

Popular Open Source or Public Domain benchmarks
STREAM
• Simple memory bandwidth test
• Gets close to server theoretical maximums
• (See the triad sketch at the end of this deck.)
LINPACK / HPL
• Floating-point tests used to compare HPC & supercomputer performance
• "Results should not be taken too seriously"
Other examples
• PRIME95
• Terasort
• DVDStore
• ApacheBench
• OLIO

Vendor attitudes towards benchmarks
Source: http://online.asbis.sk/documents/ORACLE/dokumenty/Kubacek_Dalibor_OS_Positioning.pdf

Benchmarks We Lack
• Converged systems
• Public cloud versus on-premises
• "Microserver" power
• 3rd-party mainframe

My advice for using other people's benchmark results
• Only use them when you're lazy and poor
  - Full disclosure: I am lazy and poor
• Ask vendors for non-published results
• Ignore differences < 10%
• For big servers, don't divide results by the number of cores
• If you're going to just use SPEC CPU2006…
  - Use SPEC CPU Integer Rate base results
  - (See the comparison sketch at the end of this deck.)

Questions?
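To make the STREAM bullet above a bit more concrete, here is a minimal, single-threaded sketch of the idea behind STREAM's "triad" kernel (a[i] = b[i] + s*c[i]). It is not the official STREAM benchmark, which repeats each kernel several times, validates results, and is usually built with OpenMP; the array size and scalar below are arbitrary illustration values.

```c
/* Minimal sketch of the STREAM "triad" idea: a[i] = b[i] + s*c[i].
 * Not the official STREAM code; arrays should be much larger than the
 * last-level cache so the loop is limited by memory bandwidth. */
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 20000000UL              /* elements per array (~160 MB per array) */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    if (!a || !b || !c) return 1;

    for (size_t i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];          /* triad: two reads, one write */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double bytes = 3.0 * (double)N * sizeof(double);   /* a written, b and c read */

    /* Print a[0] so the compiler cannot discard the loop as dead code. */
    printf("Triad: %.1f MB/s (check value %.1f)\n", bytes / secs / 1e6, a[0]);

    free(a); free(b); free(c);
    return 0;
}
```

Build with something like `gcc -O2 triad.c -o triad`. A figure far below the platform's theoretical bandwidth usually just reflects the single-threaded, un-tuned nature of this sketch; the real STREAM kit is the number to quote.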
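And a tiny sketch of the closing advice: compare SPEC CPU2006 Integer Rate base results directly, treat gaps under roughly 10% as noise, and resist dividing big-server results by core count. The two scores below are hypothetical placeholders, not published results.

```c
/* Sketch of the closing advice: compare SPECint_rate2006 *base* results,
 * ignore small differences, and don't turn them into per-core numbers.
 * The scores are made-up placeholders, not real published results. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    double rate_a = 640.0;   /* hypothetical SPECint_rate2006 base, server A */
    double rate_b = 605.0;   /* hypothetical SPECint_rate2006 base, server B */

    double gap = fabs(rate_a - rate_b) / fmin(rate_a, rate_b);

    if (gap < 0.10)
        printf("Gap is %.0f%%: treat the two servers as roughly equivalent.\n", gap * 100.0);
    else
        printf("Gap is %.0f%%: worth a closer look (same benchmark version? base, not peak?).\n", gap * 100.0);

    /* Deliberately not computing rate / cores: the deck's advice is to avoid
     * per-core comparisons for big servers. */
    return 0;
}
```

Build with `gcc -O2 compare.c -lm -o compare`; with the placeholder values it reports a ~6% gap and treats the servers as equivalent.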