Performance of OpenVMS V8.4 with i2 Servers
Rafiq Ahamed K, OpenVMS Engineering
©2011 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Agenda
• Introduction
– What's new in i2 servers?
– The new i2 server: a performance differentiator
• Performance Enhancements
• Future Focus
• Performance Results
– Platform
– Application
– Alpha to i2 server migration differences
• Summary
• Q&A

"The performance results shared in this session are from an engineering test environment; they do not represent any specific customer workload. Your mileage may vary."

What Is New in i2 Servers, Post i2 Blades?
• i2 server family: introduced the rx2800 i2 rack-mounted server, running OpenVMS V8.4
– Integrity 2-socket rack server (new!)
– Integrity server blades: the world's first scale-up blades, built on the industry's #1 blade infrastructure

Introducing rx2800: A Performance Differentiator
rx2800 i2 server architecture (Intel Itanium 9300 processor):
• 8-core scalability with 3x improved density, without sacrificing RAS
• CPU-to-CPU links: 19.2 GB/s
• Memory: 28.8 GB/s peak per processor module
• QPI, IOH to processors: 38.4 GB/s

Performance Features of i2 Servers
Memory
• Dual integrated memory controllers (IMCs) with 4 SMI channels; peak memory bandwidth up to 34 GB/s (6x)
• The Intel Scalable Memory Interconnect (Intel SMI) connects to the Intel 7500 Scalable Memory Buffer (SMB) to support larger physical memory with DDR3 RDIMMs; the SMB supports different DIMM sizes and types
• DDR3: higher throughput (800 MT/s), lower power, faster response time, increased capacity per DIMM (16 GB)
• Directory-based cache coherency reduces snoop traffic and contention; 1 MB directory cache per IMC
• Capability to support up to 1 TB of memory per IMC
I/O
• 6 PCIe Gen 2 slots supporting 5 GT/s; I/O is NUMA aware
• p410i SAS 6G RAID controller with 8 10K RPM disks
Processor: Itanium 9300 (Tukwila-MC)
• Enhanced thread-level parallelism (TLP): 8 threads per processor
• Instruction-level parallelism (ILP): keeps threads from stalling the pipeline
• Intel Turbo Boost Technology: performance on demand
• Intel VT-i2 introduced
• Data TLB support for 8 KB and 16 KB pages
• New Intel QuickPath Interconnect (QPI) technology replaces the front-side bus with point-to-point links: 4 full-width and 2 half-width QPI links per processor, connecting to the Intel 7500 IOH (Boxboro-MC) and the Intel ICH10
• PCIe Gen 2 and Gen 1 devices supported
• Peak processor-to-processor and processor-to-I/O communications up to 96 GB/s (9x)
• Glueless system designs up to eight sockets, removing the FSB limitations

Key Characteristics: Intel Itanium 9100 vs. Intel Itanium 9300 (source: Intel Corporation)
• Cores: 2 vs. 4
• Total on-die cache: 27.5 MB vs. 30 MB
• Software threads per core: 2 vs. 2 (with enhanced thread management)
• System interconnect (bandwidth per processor, 2-socket system): front-side bus, 5 GB/s peak per processor vs. Intel QuickPath Interconnect, 48 GB/s peak (up to 9x improvement), with enhanced RAS and common IOHs with next-generation Intel Xeon processors
• Memory interconnect (bandwidth per processor, 2-socket system): front-side bus, 5 GB/s peak per processor vs. dual integrated memory controllers, 34 GB/s peak (up to 6x improvement)
• Memory capacity (4-socket system): 128–384 GB vs. 1 TB using 16 GB RDIMMs (up to 8x improvement)
• Partitioning and virtualization: Intel VT-i vs. Intel VT-i2
• Energy efficiency: Demand Based Switching (DBS) vs. enhanced DBS (voltage modulation in addition to frequency), Intel Turbo Boost Technology, and advanced CPU and memory thermal management
• SMP scalability: 64-bit virtual and 50-bit physical addressability with home-snoop coherency vs. 64-bit virtual and 50-bit physical addressability with directory coherency for better performance in large SMP configurations, and up to 8-socket glueless systems (higher scalability with OEM chipsets)

OpenVMS V8.4 Performance Enhancements (includes post-V8.4)

V8.4 Major Performance Features
• Resource Affinity Domain (RAD) support for IA64
• Packet Processing Engine (PPE) support for TCP/IP
• Automatic Dynamic Processor Resilience (DPR)
• Compression support for BACKUP
• RMS SET MBC count support for 255 blocks
• XFC support for 4 GB global hint pages
• Shadowing T4 collectors
• File system consistency check control for reducing directory corruption
• Asynchronous Virtual IO (AVIO) support for guest OpenVMS running on HPVM

V8.4 Performance Enhancements (continued)
• Core OS improvements
– Dedicated Lock Manager and cluster driver optimizations
– Exception handling optimizations
– Reduced spinlock contention (SCHED, MMG, etc.)
– Optimizations in global section deletion and creation algorithms
– Introduced Paged Pool Lookaside List (LAL)
– SYSMAN IO AUTO performance improvements (Fibre only)
– RTL changes to optimize strcmp() and memcmp() (a comparison-heavy sketch follows this list)
– Support for new high-speed USB connectivity and enhancements
– rx2800 console performance fixed!
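The strcmp()/memcmp() RTL optimization pays off in workloads that spend much of their time comparing buffers. As a hedged illustration only (the buffer count, buffer size, and iteration count below are invented for the sketch, not taken from the tests behind these slides), a comparison-heavy loop of the kind that would exercise the optimized routines might look like this in C:

```c
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Hypothetical comparison-heavy loop: workloads shaped like this are
 * the ones that benefit from the optimized RTL memcmp()/strcmp(). */
#define NBUF  4096
#define BUFSZ 64
#define ITERS 10000

int main(void)
{
    static char buf[NBUF][BUFSZ];
    long matches = 0;
    clock_t t0, t1;
    int i, j;

    for (i = 0; i < NBUF; i++)            /* fill with repeatable data */
        for (j = 0; j < BUFSZ; j++)
            buf[i][j] = (char)((i * 31 + j) & 0x7f);

    t0 = clock();
    for (i = 0; i < ITERS; i++)           /* compare every adjacent pair */
        for (j = 0; j + 1 < NBUF; j++)
            if (memcmp(buf[j], buf[j + 1], BUFSZ) == 0)
                matches++;
    t1 = clock();

    printf("%ld matches, %.2f s\n", matches,
           (double)(t1 - t0) / CLOCKS_PER_SEC);
    return 0;
}
```

Built with HP C on OpenVMS (or any C compiler elsewhere), nearly all of this program's runtime lands inside memcmp(), so a faster RTL implementation shows up almost one-for-one in the elapsed time.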
V8.4 Performance Enhancements (continued)
• Shadowing feature improvements to WriteBitMap, MiniCopy, MiniMerge, and SPLIT_READ_LBN
• Compiler improvements (BASIC RTL, COBOL, OTS$STRCMP)
• Miscellaneous improvements
– FLT tracing has 2 new options for the SUMMARY report (using INDEX/PID)
– LIBRARIAN, MONITOR, DCL DELETE

OpenVMS 8.4 Performance Results

OpenVMS V8.4 Summary of Improvements
Significant performance gains across various workloads and system configurations (% improvement):
• Faster bitmap updating in shadowing: 400
• High usage of mem(str)cmp(): 350
• Faster shadow READ: 300
• Reduced BACKUP compressed data: 272
• Stressful DLM resources: 200
• Larger AST queuing: 50
• Heavy exception handling: 50
• Heavy image activation: 45
• High-speed USB boot: 25

Future Focus
• Continuous work to optimize performance on new servers with new processors
• Ongoing effort to reduce alignment faults
• Maximize utilization using hyper-threading
• Additional T4 collectors
• Spinlock optimizations
• Many more…

Performance of i2 Servers: Platform and Application Tests

Price/Performance Migration Path
(Chart: price and performance by socket count; 2-socket rx2660, rx2800, rx3600, and BL860c migrate to the BL860c i2; 4-socket, 8-core rx6600 and BL870c to the BL870c i2; 8-socket, 16-core rx7640 to the BL890c i2. Blue is price, light blue is performance.)
New Integrity blades are more scalable than the previous generation:
• Double the number of cores
• More memory
• More I/O

OpenVMS Running on i2 Servers: Performance Highlights
i2 servers are architected for high performance:
– The architecture provides more, and faster, cores per socket
– Superior memory and interconnect technology; memory-intensive applications benefit from the low-latency, high-bandwidth architecture
– Higher IO bandwidth and throughput resulting from the new IO architecture
– More headroom for CPU-, memory-, and IO-intensive workloads, with improved response time
Up to 3x performance improvement with i2 servers running OpenVMS:
– Our tests have shown up to 2x improvement with Java, some database, and web server applications
– Oracle has shown 3x improvement

CPU Ratings: Integer Tests (per processor)
The 1.73 GHz and 1.6 GHz 9300-series processors show 2–2.3x performance improvement.
(Chart series, ratings, more is better: 9300 BL8x0c i2 at 1.73 GHz/6.0 MB, 1.60 GHz/6.0 MB, and 1.33 GHz/4.0 MB; 9000 BL860c at 1.59 GHz/9.0 MB; 9100 rx7640 at 1.60 GHz/12.0 MB.)
– These numbers are per processor/socket
– As the frequency increases, the rating increases
– Applications with large integer computations, financial analysis, and heavy data processing should benefit

CPU Ratings: Floating-Point Computation Tests (per processor)
The 1.73 GHz and 1.6 GHz 9300-series processors show 2.1–2.3x performance improvement.
(Chart series, FP ratings, more is better: the same five systems as the integer tests.)
– Intel Itanium 9300-series processors have a new high-precision floating-point architecture
– Fast response to complex operations; scientific, automation, and robotics applications should benefit

Memory Tests
55% improvement in memory bandwidth between the rx3600 and the BL860c i2.
(Chart, single-stream bandwidth in MB/s, more is better: BL860c i2 1.73 GHz/6.0 MB, rx3600 1.67 GHz/9.0 MB, rx6600 1.59 GHz/12.0 MB, rx7640 1.60 GHz/12.0 MB, SD32A 1.60 GHz/9.0 MB.)
– Computed via a memory test program, single stream (see the sketch below)
– The new BL8x0c i2 server demonstrated very good memory bandwidth
– Memory-bound applications should see a difference with the aggregated bandwidth
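The memory test program behind these results is internal to the test environment, but the single-stream technique itself is standard: one thread streams through arrays much larger than the caches, and bytes moved are divided by elapsed time. A minimal sketch, with the array size and repetition count chosen arbitrarily for illustration:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Single-stream bandwidth sketch: copy a buffer far larger than the
 * last-level cache and report MB/s. Sizes are illustrative only. */
#define N (64 * 1024 * 1024 / sizeof(double))   /* 64 MB per array */
#define REPS 10

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    size_t i;
    int rep;
    clock_t t0, t1;
    double secs, mb;

    if (!a || !b) return 1;
    for (i = 0; i < N; i++) a[i] = 1.0;

    t0 = clock();
    for (rep = 0; rep < REPS; rep++)
        for (i = 0; i < N; i++)
            b[i] = a[i];                         /* one read + one write */
    t1 = clock();

    secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
    mb = (double)REPS * 2.0 * N * sizeof(double) / (1024.0 * 1024.0);
    printf("copy: %.0f MB in %.2f s = %.0f MB/s (b[0]=%f)\n",
           mb, secs, mb / secs, b[0]);           /* print b[0] so the loop
                                                    cannot be optimized away */
    free(a);
    free(b);
    return 0;
}
```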
Core I/O on rx2800
• The rx2800 i2 server comes with the p410i SAS controller as its core SAS I/O
(Charts, IOPS vs. load from 1 to 256: core SAS caching, with and without cache; SAS logical disk striping, 1, 2, and 4 disks with cache.)
– Small-block random tests were run on same-sized disks
– Logical volumes were spread across multiple disks to show the I/O striping effect
• The p410i with cache (an additional cache kit) boosts performance dramatically, up to 2.5x
• Increasing the number of disks in a RAID group increases performance roughly linearly

Rack-Mounted Core I/O: i2 Comparison
(Chart, Integrity core SAS, IOPS vs. load from 1 to 256: rx6600 1.59 GHz/12.0 MB; rx2800 i2 1.46 GHz/5.0 MB with core cache; rx2800 i2 without cache.)
• Core IO SAS controllers: rx6600 uses the LSISAS106x, rx2800 the p410i
• Older IA64 SAS performance is similar to the i2 server without cache
• The rx2800 i2 server with cache is the clear winner (up to 2.5x better); a small-block random-read sketch follows
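The loads on these slides were generated with OpenVMS QIO/FastIO against the SAS disks. The sketch below is only a rough, portable approximation of a small-block random-read load using the C RTL; the file name, block size, and I/O count are assumptions, and buffered stdio will not bypass caches the way the real tests did:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Small-block random-read sketch: a portable stand-in for the
 * QIO/FastIO loads on the slides, not the actual test program.
 * Reads NIOS random 4 KB blocks from an existing large file and
 * reports IOPS using the coarse 1-second wall clock from time(). */
#define BLKSZ 4096L
#define NIOS  100000L

int main(int argc, char *argv[])
{
    const char *path = (argc > 1) ? argv[1] : "testfile.dat"; /* assumed name */
    FILE *fp = fopen(path, "rb");
    char buf[BLKSZ];
    long fsize, nblocks, i;
    time_t t0, t1;
    double secs;

    if (!fp) { perror("fopen"); return 1; }
    fseek(fp, 0L, SEEK_END);
    fsize = ftell(fp);
    nblocks = fsize / BLKSZ;
    if (nblocks <= 0) { fclose(fp); return 1; }

    srand(42);                             /* fixed seed: repeatable run */
    t0 = time(NULL);
    for (i = 0; i < NIOS; i++) {
        long blk = rand() % nblocks;       /* random 4 KB-aligned block */
        fseek(fp, blk * BLKSZ, SEEK_SET);
        if (fread(buf, 1, BLKSZ, fp) != (size_t)BLKSZ)
            break;
    }
    t1 = time(NULL);

    secs = difftime(t1, t0);
    if (secs < 1.0) secs = 1.0;            /* guard the 1 s clock granularity */
    printf("%ld reads in %.0f s = %.0f IOPS\n", i, secs, i / secs);
    fclose(fp);
    return 0;
}
```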
Apache Performance: 2x Improvement
Apache Bench tests on OpenVMS 8.4, BL860c i2 (1.73 GHz/6.0 MB) vs. BL860c (1.59 GHz/9.0 MB):
• Throughput (more is better): 531 vs. 276.21 requests per second
• Bandwidth (more is better): 1162 vs. 597 KB/s transfer rate
• Time taken (less is better): 25.818 vs. 50.277 seconds
– Configuration details: the tests were run on OpenVMS 8.4 with 1G end-to-end network connectivity; Apache 2.1-1 with ECO2, Apache Bench 2.0.40-dev
– The BL860c i2 delivered 2x the performance of the BL860c

Java Workload Tests: 2x Improvement
Native Java tests on OpenVMS 8.4 (operation rate vs. threads; more is better), BL870c i2 (1.60 GHz/5.0 MB) vs. rx6600 (1.59 GHz/12.0 MB):
– Java workloads scale up better on i2 servers
– Java workloads are highly CPU- and memory-intensive

Rdb Tests: 2x Improvement
Rdb performance load tests on OpenVMS 8.4, HP Integrity rx2800 i2 (1.60 GHz/5.0 MB) vs. HP rx3600 (1.59 GHz/9.0 MB); TPS (more is better) and sec/txn (less is better):
– The test runs a load of 100 jobs with 100,000 transactions per job, with the database on a single local disk
– The rx2800 i2 server shows 2x better throughput than the rx3600: transactions per second (TPS) is 2x
– Rdb is a memory- and I/O-intensive application, which takes advantage of the p410i 6G SAS controller and the high-speed memory architecture
– Results have shown that Rdb takes advantage of hyper-threading on high-end i2 servers

Oracle 10gR2 on the New i2 Server: 3x Improvement
(Chart: TPM vs. users at 16, 32, and 48; rx7640 1.60 GHz/12.0 MB vs. BL890c i2 1.60 GHz/6.0 MB.)
– Oracle Swingbench tests were run with the same tuning configuration on both systems
• rx7640 and BL890c i2 are NUMA-based systems, mostly RAD-enabled and with hyper-threading disabled, using 6 RAID 5 EVA8100 volumes
– Oracle was run in shared server mode
– The BL890c i2 server consistently shows 3x improvement for the same number of users

Oracle Tests: Resource Usage
– The BL890c i2 drives a 3x TPM increase for the same CPU usage as the rx7640

ALPHA TO IA64: Speeds and Feeds

Mission-Critical Scaling: Speeds and Feeds Comparison (GS1280 M32 vs. BL890c i2)
• Processor type: 1150 MHz EV7 vs. 8 Intel Itanium 9300-series processors (quad-core and dual-core)
• Max memory: 256 GB vs. 1.5 TB
• Internal storage: Ultra3 SCSI vs. 6G SAS HW RAID
• Networking (integrated): 1 GbE NICs vs. 10 GbE (Flex-10) NICs
• IO slots: PCI/PCI-X slots at 133 MHz vs. PCIe Gen2 mezzanine slots
• Form factor: 2x 41U vs. 4U
The BL890c i2 scales to support mission-critical workloads.

Compilation Tests
(Chart, charged CPU, less is better: BL890c 8-socket Itanium 9300 1.6 GHz, 242,455; AlphaServer GS1280 EV7 1150 MHz, 557,110.)
• The test compiles around 200 modules using the C++ compiler
• The CPU cost of running this on the BL890c i2 was ~2.3x lower than on the GS1280

Java Workload Tests
(Chart: operation rate vs. threads, 1–61; BL890c i2 vs. GS1280.)
– Java workloads scale far better with load on i2 than on Alpha
– There is no active Java development on Alpha: the last JDK available on Alpha is 1.5, versus 1.6 on Integrity
– Java workloads are highly CPU- and memory-intensive

EV7 vs. Itanium 9300 (Tukwila), Per Core: Floating Point
(Chart, floating-point rating, higher is better: BL890c 8-socket Itanium 9300 1.6 GHz, 1882; AlphaServer GS1280 EV7 1150 MHz, 699.)
– Tukwila processors are ~2.6x faster than EV7
– These numbers are per core (within a processor/socket)
– Tukwila processors have a new high-precision floating-point architecture
– Fast response to complex operations; scientific, automation, and robotics applications should benefit (a rate-measurement sketch follows)
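Per-core ratings like the ones above come from standardized CPU suites, but the underlying technique, timing a fixed number of arithmetic operations and dividing by elapsed time, can be sketched in a few lines. This is an illustration of the method only, not the benchmark used for these slides; the operation mix and iteration count are arbitrary:

```c
#include <stdio.h>
#include <time.h>

/* Floating-point rate sketch: a dependent multiply-add recurrence long
 * enough to time, with the result printed so the compiler cannot
 * discard the loop. Illustrates the rating technique only. */
int main(void)
{
    double x = 0.5, a = 1.0000001, b = 1e-9;
    long i, n = 200000000L;            /* 2e8 iterations, 2 flops each */
    clock_t t0, t1;
    double secs;

    t0 = clock();
    for (i = 0; i < n; i++)
        x = x * a + b;                 /* dependent multiply-add chain */
    t1 = clock();

    secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
    printf("x=%g  %.0f MFLOP/s\n", x, 2.0 * n / secs / 1e6);
    return 0;
}
```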
EV7 vs. Intel Itanium 9300 (Tukwila), Per Core: Integer
(Chart, integer rating, higher is better: BL890c 8-socket Itanium 9300 1.6 GHz, 851.46; AlphaServer GS1280 EV7 1150 MHz, 563.38.)
– Tukwila per-core integer ratings are ~1.5x those of EV7 (the EV7 score is ~33% lower)
– These numbers are per core (within a processor/socket)
– CPU-bound applications should benefit (database queries), specifically integer-computation-bound applications

Memory Bandwidth Tests
(Chart, MB/s, higher is better: BL890c 8-socket Itanium 9300 1.6 GHz, 3200; AlphaServer GS1280 EV7 1150 MHz, 2100.)
• This test simulates a heavy load directly on the memory modules by bypassing the cache
• The bandwidth on a "single stream" test is 52% better on the BL890c i2

IO Tests: Single Process
IO tests on an EVA8100 with 2/4 Gb FC cards and RAID 5 320 GB volumes, AlphaServer ES45 Model 2B vs. BL860c i2.
(Charts: random-workload IOPS vs. threads, and small-block random mixed-load operation rate vs. load from 1 to 256; plotted IOPS values range from 112.7 to 647.9.)
– 4K IO tests were run with mixed QIO/FastIO loads; random workloads are very CPU-intensive
– We see a ~3x increase in throughput for the same load on the i2 server
– On Alpha platforms the largest supported device is a 2 Gb FC HBA, and bus speeds and feeds strongly influenced the results

Rdb Tests
(Chart, TPS, more is better: BL890c 8-socket Itanium 9300 1.6 GHz, 214; AlphaServer GS1280 EV7 1150 MHz, 145.)
– The test runs a load of 100 jobs with 100,000 transactions per job, with the database on a single local disk
– The BL890c 8S performs approximately 1.5x better than the GS1280 (Alpha): transactions per second (TPS) is 1.5x

Performance Enhanced with the Integrity Server Blades
Based on the Blade Scale Architecture, Integrity server blades deliver up to 2x¹ faster performance than dual-core, 2- and 4-socket Integrity servers, with built-in resiliency and less power consumption.
¹ Per-socket performance increases 2.3x on integer and floating-point tests and up to 2x on applications; some applications have shown up to 3x.

Summary
• i2 servers replace the front-side bus (FSB) with the QuickPath Interconnect (QPI) fabric
• Even the low-end i2 server comes with scalable features and supports NUMA
• The performance improvements delivered with i2 servers are a mix of hardware-level improvements and operating-system innovations
• Customers will see up to 2x performance benefits with i2 servers
• The i2 servers support state-of-the-art RAS capabilities
• The TCO paper outlines the cost, power, and density benefits
• The speeds and feeds of i2 servers far exceed those of Alpha!
References
• Please mail your OpenVMS performance feedback or concerns to t4@hp.com
• Please refer to these two white papers on OpenVMS performance and scalability:
– Performance Benefits on BL8x0c i2 Server Blades
– Total Cost of Upgrade
• i2 server technical reads:
– Take a tour around the new HP Integrity rx2800 server (video, 2:44)
– New HP Integrity rx2800 technical white paper
– White paper: Why scalable blades: HP Integrity server blades
• Didn't find what you are looking for?
– Enterprise servers, workstations, and systems hardware technical web site

Questions/Comments
• Business Manager (Rohini Madhavan) – rohini.madhavan@hp.com
• Office of Customer Programs – OpenVMS.Programs@hp.com
THANK YOU

Boxboro & ICH10
• The Boxboro I/O hub (Intel 7500 chipset) connects to the local CPUs via QPI links
– Provides 36 PCIe Gen 2 lanes
– Delivers an order-of-magnitude peak IO bandwidth increase over the previous generation in the BL870c i2
– Hosts the major IO functions:
• p410i RAID/SAS controller
• Two dual-port 10 GbE Flex-10 NICs
• Three x8-provisioned mezzanine slots
• Gromit XE (iLO3) management controller
• ICH10 I/O controller hub (southbridge)
(Diagram, peak bidirectional link bandwidths from the BIOH: 2.5 GB/s to the RAID/SAS controller and to the ICH10 card, 5 GB/s to each dual 10 GbE NIC, 10 GB/s to each of the three mezzanine slots, and 1.2 GB/s to the Gromit XE management controller.)
• ICH10 utilization
– x4 PCIe Gen 1 link for partner blade support
– Supports the VGA controller and USB controller

Intel Itanium 9300 Changes: High Performance
Enhanced thread-level parallelism (TLP)
• 2 threads per core (8 threads per quad-core processor); improves performance and scalability for heavily threaded software
• Improved thread management provides high core utilization, with thread switching on medium-latency events and spin locks
• Dedicated cache instead of a shared cache model: quick response, no contention on cache
Enhanced instruction-level parallelism (ILP)
• Simultaneously processes multiple instructions on each software thread, increasing throughput and giving faster response
• Wide, short pipeline with many optimizations; supports many zero-cycle loads and re-steering, with extensive bypassing to keep threads from stalling
• The first-level data translation lookaside buffer (DTLB) per core supports larger pages (8 KB and 16 KB)
Intel Turbo Boost Technology: performance on demand
• Automatically allows processor cores to run faster than the base operating frequency
• When the workload demands additional performance, the processor frequency dynamically increases in 133 MHz steps over short, regular intervals (6 µs), with 120 events monitored
Intel VT-i2 introduced, providing additional performance optimizations
• Hardware assistance helps reduce emulation code and OS–VMM transitions

Changes/Improvements: Scalability and Headroom
Enhanced communication channels (memory, IO) – highly scalable
• New Intel QuickPath Interconnect technology replaces the front-side bus with point-to-point links
• 4 full-width and 2 half-width Intel QuickPath Interconnect links per processor
• Peak processor-to-processor and processor-to-I/O communications up to 96 GB/s (9x)
• Supports PCI Express Gen2 (5.0 GT/s) as well as PCI Express Gen1 (2.5 GT/s)
Two integrated memory controllers and the Intel Scalable Memory Interconnect
• 2 integrated memory controllers, with peak memory bandwidth up to 34 GB/s (6x)
• Capability to support up to 1 TB of memory per IMC
• The Intel Scalable Memory Interconnect (Intel SMI) connects to the Intel 7500 Scalable Memory Buffer to support larger physical memory with faster DDR3 RDIMMs
Directory-based cache coherency reduces snoop traffic and contention
• Cache coherency ensures that data in memory and cache remain synchronized, so the most current data is used in every transaction
• In combination with QPI, it provides excellent scaling
Glueless system designs up to eight sockets: the FSB limit of four sockets is eliminated

Memory Tests: Memory Latency on NUMA Systems
(Chart, latency in nsec, single stream, less is better: local, remote, and interleaved access; rx7640 1.60 GHz/12.0 MB vs. BL860c i2 1.73 GHz/6.0 MB.)
• Computed via a memory test program, single stream (a pointer-chase sketch of the technique follows)
• On the BL860c i2, local latency is 10% better, remote 28% better, and interleaved 20% better than on the rx7640
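Load-to-use latency of the kind plotted above is usually measured with a dependent-load "pointer chase", where each load cannot issue until the previous one completes. A minimal sketch of the technique follows; whether the buffer ends up local, remote, or interleaved relative to the running CPU would be arranged outside the program (for example, via RAD assignment), and all sizes here are illustrative:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Pointer-chase latency sketch: walk a randomly permuted ring of
 * pointers so every load depends on the one before it. Dividing the
 * elapsed time by the number of hops approximates load-to-use latency.
 * Whether the buffer is local, remote, or interleaved relative to the
 * running CPU is set by the environment (e.g. RAD assignment), not here. */
#define N    (16 * 1024 * 1024 / sizeof(void *))  /* 16 MB of pointers */
#define HOPS 20000000L

int main(void)
{
    void **ring = malloc(N * sizeof(void *));
    size_t *perm = malloc(N * sizeof(size_t));
    size_t i, j, tmp;
    void **p;
    long h;
    clock_t t0, t1;
    double ns;

    if (!ring || !perm) return 1;
    for (i = 0; i < N; i++) perm[i] = i;
    srand(1);                                /* fixed seed: repeatable walk */
    for (i = N - 1; i > 0; i--) {            /* Fisher-Yates shuffle */
        j = (size_t)rand() % (i + 1);
        tmp = perm[i]; perm[i] = perm[j]; perm[j] = tmp;
    }
    for (i = 0; i < N; i++)                  /* link into one random ring */
        ring[perm[i]] = &ring[perm[(i + 1) % N]];

    p = (void **)ring[0];
    t0 = clock();
    for (h = 0; h < HOPS; h++)
        p = (void **)*p;                     /* dependent load chain */
    t1 = clock();

    ns = 1e9 * ((double)(t1 - t0) / CLOCKS_PER_SEC) / HOPS;
    printf("~%.1f ns per load (%p)\n", ns, (void *)p);
    free(ring);
    free(perm);
    return 0;
}
```

Running one copy with memory allocated on the local RAD and another with it forced remote reproduces the local/remote gap shown in the chart; the interleaved case falls between the two.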