Session: C05 DB2 9.7 Performance Update
Presented by Serge Rielau, IBM Toronto Lab
Tuesday, October 6th, 11:00
Platform: DB2 for Linux, UNIX, Windows

Agenda
• Basics
• Benchmarks
• DB2 9.7 Performance Improvements
– Compression
– Index Compression
– Temp Table Compression
– XML Compression
– Range Partitioning with local indexes
– Scan Sharing
– XML in DPF
– Statement Concentrator
– Currently Committed
– LOB Inlining
• Summary

Basics – Platform / OS
• The basic fundamentals have not changed
• You still want/need a balanced configuration (I/O, memory, CPU)
• We recommend 4 GB-8 GB RAM per core
• 6-20 disks per core where feasible
• Use a recommended, generally available 64-bit OS
– Applies to Linux, Windows, AIX, Solaris, HP-UX
– e.g. AIX 5.3 TL09, AIX 6.1 TL03, SLES 10 SP2, RHEL 5.3, etc.
• All performance measurements/assumptions are for a 64-bit DB2 server
• Clients can be 32-bit, 64-bit, or mixed
– Even LOCAL clients

Basics – Storage
• Disk spindles still matter
– With sophisticated storage subsystems and storage virtualization, it just requires more sleuthing than ever to find them
– Drives keep getting bigger; 146 GB-300 GB is now the norm
• Be leery of Storage Administrators who tell you
– "Don't worry, it doesn't matter"
– "The cache will take care of it"
• Make the Storage Administrator your best friend!
– Take them out for lunch/dinner, whatever it takes!

Benchmarks
• DB2 is the performance leader (TPC-C, TPC-H, SAP SD, TPoX)

World Record Performance with TPC-C
• Single machine, single operating system, single database
• [Chart: tpmC, higher is better]
– DB2 8.2 on 64-way POWER5 (64x 1.9 GHz POWER5, 2 TB RAM, 6,400 disks): 3,210,540 tpmC
– DB2 9.1 on 64-way POWER5+ (64x 2.3 GHz POWER5+, 2 TB RAM, 6,400 disks): 4,033,378 tpmC
– DB2 9.5 on 64-way POWER6 (64x 5 GHz POWER6, 4 TB RAM, 10,900 disks): 6,085,166 tpmC
• TPC Benchmark, TPC-C, tpmC are trademarks of the Transaction Processing Performance Council.
• DB2 8.2 on IBM System p5 595 (64 core POWER5 1.9 GHz): 3,210,540 tpmC @ $5.07/tpmC, available May 14, 2005
• DB2 9.1 on IBM System p5 595 (64 core POWER5+ 2.3 GHz): 4,033,378 tpmC @ $2.97/tpmC, available January 22, 2007
• DB2 9.5 on IBM POWER 595 (64 core POWER6 5.0 GHz): 6,085,166 tpmC @ $2.81/tpmC, available December 10, 2008
• Results current as of September 6, 2009; check http://www.tpc.org for latest results

World Record TPC-C Performance on x64 with Red Hat Linux
• Single machine, single operating system, single database
• [Chart: tpmC, higher is better]
– DB2 9.5 on IBM x3950 M2 (Intel Xeon 7460), RHEL 5.2: 1,200,632 tpmC
– SQL Server 2005 on Intel Xeon 7350, Windows 2003: 841,809 tpmC
• TPC Benchmark, TPC-C, tpmC are trademarks of the Transaction Processing Performance Council.
• DB2 9.5 on IBM System x3950 M2 (8 processors / 48 cores, Intel Xeon 7460 2.66 GHz): 1,200,632 tpmC @ $1.99/tpmC, available December 10, 2008
• SQL Server 2005 on HP DL580 G5 (8 processors / 32 cores, Intel Xeon 7350 2.93 GHz): 841,809 tpmC @ $3.46/tpmC, available April 1, 2008
• Results current as of September 6, 2009; check http://www.tpc.org for latest results

World Record 10 TB TPC-H Result on IBM Balanced Warehouse E7100
• IBM System p6 570 and DB2 9.5 create the top 10 TB TPC-H performance
• Significant proof point for the IBM Balanced Warehouse E7100
• DB2 Warehouse 9.5 takes DB2 performance on AIX to new levels
• 65% faster than Oracle 11g's best result
• Loaded 10 TB of data @ 6 TB/hour (incl. data load, index creation, runstats)
• [Chart: QphH, higher is better]
– IBM p6 570 / DB2 9.5: 343,551 QphH
– HP Integrity Superdome-DC Itanium / Oracle 11g: 208,457 QphH
– Sun Fire 25K / Oracle 10g: 108,099 QphH
• TPC Benchmark, TPC-H, QphH are trademarks of the Transaction Processing Performance Council.
• DB2 Warehouse 9.5 on IBM System p6 570 (128 core p6 4.7 GHz): 343,551 QphH@10000GB, 32.89 USD per QphH@10000GB, available April 15, 2008
• Oracle 10g Enterprise Ed. R2 w/ Partitioning on HP Integrity Superdome-DC Itanium 2 (128 core Intel Dual-Core Itanium 2 9140 1.6 GHz): 208,457 QphH@10000GB, 27.97 USD per QphH@10000GB, available September 10, 2008
• Oracle 10g Enterprise Ed. R2 w/ Partitioning on Sun Fire E25K (144 core Sun UltraSPARC IV+ 1500 MHz): 108,099 QphH@10000GB, 53.80 USD per QphH@10000GB, available January 23, 2006
• Results current as of September 6, 2009; check http://www.tpc.org for latest results

World Record SAP 3-tier SD Benchmark
• This benchmark represents a 3-tier SAP R/3 environment in which the database resides on its own server, where database performance is the critical factor
• [Chart: top SAP SD 3-tier results by DBMS vendor, SD users, higher is better]
– DB2 8.2 on 32-way p5 595: 168,300 SD users
– Oracle 10g on 64-way HP Integrity: 100,000 SD users
– SQL Server on 64-way HP Integrity: 93,000 SD users
• Results current as of September 6, 2009; check http://www.sap.com/benchmark for latest results

TPoX – Transaction Processing over XML Data
• Open source benchmark: http://tpox.sourceforge.net/
• Online brokerage scenario based on the standardized FIXML schema
– FIXML: financial industry XML schema for security trading
– CustAcc: modeled after a real banking system that uses XML
– Security: information similar to investment web sites
• [Entity diagram: Customer 1:n Account, Account 1:n Holding, with Order and Security entities; schemas CustAcc.xsd, FIXML (41 XSD files), Security.xsd]
• TPoX 2.0 Benchmark
– Scale factor "M", 1 TB raw data
– 50M CustAcc XML docs, 2-23 KB
– 500M Order XML docs, 1-2 KB
– 20,833 Security XML docs, 2-9 KB
– 3 tables + XML indexes
XML Transaction Processing with DB2 on IBM JS43 Blade
• TPoX 2.0 mixed workload, 300 concurrent connections
• DB2 compression ratio of 58%: 1 TB raw data, 1.4 TB database with indexes, 604 GB compressed database
• Mixed workload (70% queries, 30% insert/update/delete) achieves
– 4,107 XML tx/sec (246,420 tx/min, 14.7M tx/hr)
– 5,119 Customer docs/sec, avg. size 6.6 KB (18.2M docs/hour, 120 GB/hour)
– 13,945 Order docs/sec, avg. size 1.5 KB (50M docs/hour, 75 GB/hour)
– Avg. response time 0.07 sec
• Insert-only workload: document injection rate of 12,810 docs/sec on DB2 9.5 vs. 13,945 docs/sec on DB2 9.7
• [Charts, DB2 V9.5 vs. DB2 V9.7, higher is better: TPoX transactions per second (3,987 vs. 4,107) and documents per second for the document injection rate (12,810 vs. 13,945)]

XML Transaction Processing with DB2 on Intel® Xeon®
• DB2 9.7 on the Intel Xeon processor X5570 delivers outstanding out-of-the-box performance
– TPoX benchmark 1.3 results
– Default db2 registry, no db or dbm configuration changes
• Excellent performance scalability of 1.78x from the Intel Xeon processor 5400 series to the Intel Xeon processor 5500 series (2-socket, quad-core)
– TPoX TPM: 172,993 (Xeon 5400 series) vs. 308,384 (Xeon 5500 series)
• Performance per watt improves by 1.52x
– TPoX TPM/watt: 481.45 (Xeon 5400 series) vs. 733.27 (Xeon 5500 series)

Performance Improvements
• DB2 9.7 has tremendous new capabilities that can substantially improve performance
• When you think about the new features …
– "It depends"
– We don't know everything (yet)
– Your mileage will vary
• Please provide feedback!

Upgrading to DB2 9.7
• You can directly upgrade to DB2 9.7 from DB2 8.2, DB2 9.1, or DB2 9.5
• You can expect overall performance to be similar or better (0%-15% improvement) without exploiting new features
• Your individual "mileage" will vary by
– Platform
– Workload
– CPU utilization
• Upgraded databases retain their basic configuration characteristics
• New databases get the new default behavior
– e.g. monitoring, currently committed

Process/Thread Organization
• DB2 threaded architecture: a single, multi-threaded process (db2sysc)
• [Diagram: per-instance, per-database, and per-application threads]
– Instance level: listeners (db2tcpcm for remote clients over TCP/IP, db2ipccm for local clients over shared memory and semaphores) and the idle agent pool (idle agents and subagents)
– Application level: coordinator agents (db2agent) and subagents (db2agntp)
– Database level: logging subsystem (db2loggr, db2loggw, log buffer, log disks), buffer pool(s), deadlock detector (db2dlock), prefetchers (db2pfchr), and page cleaners (db2pclnr) writing to the data disks

Performance Advantages of the Threaded Architecture
• Context switching between threads is generally faster than between processes
– No need to switch address space
– Less cache "pollution"
• Operating system threads require less context than processes
– They share address space and context information (such as uid, file handle table, etc.)
– Memory savings
• Significantly fewer system file descriptors used
– All threads in a process can share the same file descriptors
– No need for each agent to maintain its own file descriptor table
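To make the upgrade path above concrete, here is a minimal sketch of the typical command sequence, assuming a hypothetical instance db2inst1 and a database named SAMPLE; the exact steps (pre-checks, backups, fix pack levels) depend on your environment, so verify against the DB2 9.7 upgrade documentation.

  Verify the database can be upgraded (run with the DB2 9.7 code):
    db2ckupgrade SAMPLE -l upgrade_check.log
  Upgrade the instance to the DB2 9.7 copy (run as root on Linux/UNIX):
    db2iupgrade db2inst1
  Upgrade each database under the instance:
    db2 UPGRADE DATABASE SAMPLE
  Rebind packages so they pick up DB2 9.7 access plans:
    db2rbind SAMPLE -l rbind.log all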
From the existing DB2 9 Deep Compression …
• "With DB2 9, we're seeing compression rates up to 83% on the Data Warehouse. The projected cost savings are more than $2 million initially, with ongoing savings of $500,000 a year." – Michael Henson
• "We achieved a 43 per cent saving in total storage requirements when using DB2 with Deep Compression for its SAP NetWeaver BI application, compared with the former Oracle database. The total size of the database shrank from 8 TB to 4.5 TB, and response times improved by 15 per cent. Some batch applications and change runs were reduced by a factor of ten when using IBM DB2." – Markus Dellermann
• Reduce storage costs
• Improve performance
• Easy to implement
• [Chart: DB2 9 compression vs. other – 1.5 times better, 2.0 times better, 3.3 times better, 8.7 times better]

Index Compression
What is Index Compression?
• The ability to decrease the storage requirements of indexes through compression
• By default, if the table is compressed, the indexes created for the table will also be compressed
– including the XML indexes
• Index compression can be explicitly enabled/disabled when creating or altering an index
Why do we need Index Compression?
• Index compression reduces disk cost and TCO (total cost of ownership)
• Index compression can improve the runtime performance of queries that are I/O bound
When does Index Compression work best?
• Indexes for tables declared in large RID DMS table spaces (the default since DB2 9)
• Indexes that have low key cardinality and a high cluster ratio

Index Compression – How does it work?
• Index page (pre DB2 9.7): page header, fixed slot directory (maximum size reserved), then index keys each followed by their RID list, e.g.
– AAAB, 1, CCC → 1055, 1056
– AAAB, 1, CCD → 3011, 3025, 3026, 3027, 3029, 3033, 3035, 3036, 3037
– BBBZ, 1, ZZZ → 3009, 3012, 3013, 3015, 3016, 3017, 3109
– BBBZ, 1, ZZCCAAAE → 6008, 6009, 6010, 6011
• DB2 considers multiple compression algorithms to attain maximum index space savings

Variable slot directory (DB2 9.7)
• In DB2 9.7, the slot directory is dynamically adjusted to fit as many keys into an index page as possible, reclaiming the space previously reserved for a maximum-size slot directory

RID list compression
• Instead of storing the full version of every RID, DB2 saves space by storing the first RID and then the deltas between consecutive RIDs; e.g. the list 3011, 3025, 3026, 3027, … is stored as 3011, 14, 1, 1, 2, 4, 2, 1, 1
• RID list compression is enabled when there are 3 or more RIDs in an index page

Prefix compression
• Instead of storing every full key value, DB2 saves space by storing a common prefix once plus suffix records; e.g. the keys AAAB,1,CCC and AAAB,1,CCD share the prefix AAAB,1,CC and are stored as the suffixes C and D
• During index creation or insertion, DB2 compares the new key with adjacent index keys and finds the longest common prefix between them
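As a concrete illustration of the DDL involved, here is a minimal sketch using a hypothetical SALES table and index; COMPRESS YES on CREATE/ALTER INDEX and the ADMIN_GET_INDEX_COMPRESS_INFO table function are the DB2 9.7 interfaces, but verify the exact column names and arguments against the Information Center for your fix pack.

  -- Create a new index with compression explicitly enabled
  CREATE INDEX sales_dt_ix ON sales (sale_date) COMPRESS YES;

  -- Enable compression on an existing index; a REORG rebuilds it in compressed format
  ALTER INDEX sales_ix COMPRESS YES;
  REORG INDEXES ALL FOR TABLE sales;

  -- Estimate or report the space saved for the table's indexes
  SELECT indname, compress_attr, pct_pages_saved
  FROM TABLE (SYSPROC.ADMIN_GET_INDEX_COMPRESS_INFO('T', 'MYSCHEMA', 'SALES', NULL, NULL)) AS t;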
Index Compression – Results in a Nutshell
• Estimated index compression savings for complex query database warehouses tested (percentage compressed, higher is better):
– Warehouse #7: 57%
– Warehouse #6: 55%
– Warehouse #5: 50%
– Warehouse #4: 31%
– Warehouse #3: 24%
– Warehouse #2: 20%
– Warehouse #1: 16%
– Average: 36%
• Index compression uses idle CPU cycles and cycles spent waiting for I/O to compress and decompress index data
• When we are not CPU bound, we achieve better performance on all inserts, deletes, and updates
• [Charts: machine utilization (user/system/idle/iowait) and elapsed time for simple select, insert, update, and delete tests with and without index compression (lower is better) – three of the four tests run 18%, 16%, and 19% faster; the fourth runs as fast as without compression]

Temp Table Compression
What is Temp Table Compression?
• The ability to decrease storage requirements by compressing temp table data
• Temp tables created as a result of the following operations are compressed by default:
– Temps from sorts
– Created global temp tables
– Declared global temp tables
– Table queues (TQ)
Why do we need Temp Table Compression on relational databases?
• Temp table spaces can account for up to 1/3 of the overall table space storage in some database environments
• Temp compression reduces disk cost and TCO (total cost of ownership)

Temp Table Compression – How does it work?
• It extends the existing row-level compression mechanism that applies to permanent tables to temp tables
• Example (Lempel-Ziv algorithm; a dictionary is created from sample data):
– Rows: Canada|Ontario|Toronto|Matthew, Canada|Ontario|Toronto|Mark, USA|Illinois|Chicago|Luke, USA|Illinois|Chicago|John
– Dictionary: 0x12f0 = CanadaOntarioToronto, 0xe57a = Matthew, 0xff0a = Mark, 0x15ab = USAIllinoisChicago, 0xdb0a = Luke, 0x544d = John
– Saved (compressed) data: 0x12f0,0xe57a; 0x12f0,0xff0a; 0x15ab,0xdb0a; 0x15ab,0x544d

Temp Table Compression – Results in a Nutshell
• For the affected complex queries with temp compression enabled, an average of 35% temp table space savings was observed; for the 100 GB warehouse database setup, this adds up to over 28 GB of saved temp space
• [Charts: query workload CPU analysis (user/system/idle/iowait); space savings for complex warehouse queries with temp compression – 78.3 GB vs. 50.2 GB stored, saving 35% (lower is better); elapsed time – 183.98 vs. 175.56 minutes, 5% faster (lower is better)]
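Temp compression rides on the same row-compression machinery used for permanent tables. As a reference point, here is a minimal sketch of enabling row compression on a permanent table, assuming a hypothetical table STAGING.ORDERS; the classic offline REORG with RESETDICTIONARY builds the compression dictionary, and PCTPAGESSAVED in SYSCAT.TABLES is populated after RUNSTATS.

  -- Enable row compression on an existing table
  ALTER TABLE staging.orders COMPRESS YES;

  -- Build the compression dictionary and compress the existing rows
  REORG TABLE staging.orders RESETDICTIONARY;

  -- Refresh statistics, then check the space saved
  RUNSTATS ON TABLE staging.orders;
  SELECT tabname, compression, pctpagessaved
  FROM syscat.tables
  WHERE tabschema = 'STAGING' AND tabname = 'ORDERS';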
XML Data Compression
What is XML Data Compression?
• The ability to decrease the storage requirements of XML data through compression
• XML compression extends row compression support to XML documents
• If row compression is enabled for the table, the XML data will also be compressed; if row compression is not enabled, the XML data will not be compressed either
Why do we need XML Data Compression?
• Compressing XML data can improve storage efficiency and the runtime performance of queries that are I/O bound
• XML compression reduces disk cost and TCO (total cost of ownership) for databases with XML data

XML Data Compression – How does it work?
• Small XML documents (< 32 KB) can be inlined with the relational data in the row, and the entire row is compressed (dictionary #1)
• Larger XML documents that reside in a data area separate from the relational data can also be compressed; by default, DB2 places XML data in the XDA to handle documents up to 2 GB in size
• XML compression relies on a separate dictionary (dictionary #2) from the one used for row compression

XML Data Compression – Results in a Nutshell
• Average compression savings of ⅔ across 7 different XML customer databases, and about ¾ space savings for 3 of those 7 databases
– [Chart: percentage compressed, higher is better – XML DB Test #7: 77%, #6: 77%, #5: 74%, #4: 63%, #3: 63%, #2: 43%, #1: 61%; average 67%]
• Significantly improved query performance for I/O-bound workloads
• Achieved 30% faster maintenance operations such as RUNSTATS, index creation, and import
• [Chart: average elapsed time for SQL/XML and XQuery queries over an XML and relational database using XDA compression – 31.1 sec without XML compression vs. 19.7 sec with it, 37% faster (lower is better)]

Range Partitioning with Local Indexes
What does Range Partitioning with Local Indexes mean?
• A partitioned index is an index that is divided across multiple storage objects, one per data partition, and is partitioned in the same manner as the table data
• Local indexes can be created using the PARTITIONED keyword when creating an index on a partitioned table (note: MDC block indexes are partitioned by default)
Why do we need Range Partitioning with local indexes?
• Improved ATTACH and DETACH partition operations
• More efficient access plans
• More efficient REORGs
When does Range Partitioning with Local Indexes work best?
• When frequent roll-in and roll-out of data is performed
• When one table space is defined per range
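A minimal DDL sketch of the feature, using a hypothetical SALES table partitioned by month; PARTITION BY RANGE, the PARTITIONED index keyword, and ATTACH PARTITION are DB2 9.7 syntax, but the names, ranges, and table spaces here are illustrative only.

  -- Range-partitioned table, one data partition per month
  CREATE TABLE sales (
    sale_date DATE NOT NULL,
    store_id  INTEGER,
    amount    DECIMAL(10,2)
  )
  PARTITION BY RANGE (sale_date)
    (STARTING FROM ('2009-01-01') ENDING ('2009-12-31') EVERY 1 MONTH);

  -- Local (partitioned) index: one index object per data partition
  CREATE INDEX sales_store_ix ON sales (store_id) PARTITIONED;

  -- Roll-in: attach a pre-loaded table as a new partition, then validate it
  ALTER TABLE sales
    ATTACH PARTITION p2010jan
    STARTING FROM ('2010-01-01') ENDING AT ('2010-01-31')
    FROM sales_jan2010;
  SET INTEGRITY FOR sales IMMEDIATE CHECKED;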
Range Partitioning with Local Indexes – Results in a Nutshell
• Partition maintenance with ATTACH:
– 20x speedup compared to DB2 9.5 global indexes because of reduced index maintenance
– 3,000x less log space used than with DB2 9.5 global indexes
– [Chart: total time and log space required to ATTACH 1.2 million rows (lower is better) – log space of 651.84 MB with V9.5 global indexes vs. 0.21 MB with local indexes built during ATTACH, 0.05 MB with local indexes built before ATTACH, and 0.03 MB with no indexes (baseline)]
• Asynchronous index maintenance on DETACH is eliminated
• Local indexes occupy fewer disk pages than 9.5 global indexes
– 25% space savings is typical
– [Chart: index leaf page count (lower is better) – 18,409 leaf pages for a global index on a range-partitioned table vs. 13,476 for a local index]
• 12% query speedup over global indexes for index queries – fewer page reads

Scan Sharing
What is Scan Sharing?
• The ability of one scan to exploit the work done by another scan
• This feature targets heavy scans such as table scans or MDC block index scans of large tables
• Scan sharing is enabled by default in DB2 9.7
Why do we need Scan Sharing?
• Improved concurrency
• Faster query response times
• Increased throughput
When does Scan Sharing work best?
• Scan sharing works best on workloads in which several clients run similar queries (simple or complex) that involve the same heavy scanning mechanism (table scans or MDC block index scans)

Scan Sharing – How does it work?
• When scan sharing applies, a scan may start somewhere other than the usual beginning, to take advantage of pages that are already in the buffer pool from scans that are already running
• When a sharing scan reaches the end of the file, it starts over at the beginning and finishes when it reaches the point at which it started
• Eligibility for scan sharing and for wrapping is determined automatically by the SQL compiler
• [Diagram: with unshared scans, scan B re-reads pages already read by scan A, causing extra I/O; with a shared scan, B joins A's in-progress scan and then wraps to the beginning to read the pages it missed]
• In DB2 9.7, scan sharing is supported for table scans and block index scans

Scan Sharing – Test Results
• Block index scan test, Q1 and Q6 interleaved, queries started staggered every 10 sec (Q1 is CPU intensive, Q6 is I/O intensive):
– MDC block index scan sharing shows a 47% average query improvement
– The fastest query shows up to a 56% runtime gain with scan sharing
• Table scan test: the average time for 100 concurrent instances of Q1 drops from 1,284.6 sec without scan sharing to 90.3 sec with it – 100 concurrent table scans now run 14 times faster (lower is better)

Scan Sharing – Results in a Nutshell
• When running 8 concurrent streams of complex queries in parallel on a 10 GB warehouse database, a 15% increase in throughput is attained with scan sharing (339.59 with scan sharing off vs. 391.72 with scan sharing on; higher is better)

XML Scalability on InfoSphere Warehouse (a.k.a. DPF)
What does it mean?
• Tables containing XML column definitions can now be stored and distributed on any partition
• XML data processing is optimized based on its partitioning
Why do we need XML in database-partitioned environments?
• As customers adopt the XML data type in their warehouses, XML data needs to scale just as relational data does
• XML data also gets the same benefit from the performance improvements attained through parallelization in DPF environments
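A minimal sketch of what this enables, with hypothetical names: an XML column in a table that is hash-distributed across database partitions by a relational key (the distribution key itself must be a non-XML column); the table space and index shown are illustrative only.

  -- Hash-distributed table in a DPF database that also carries an XML column
  CREATE TABLE orders (
    order_id  BIGINT NOT NULL,
    cust_id   INTEGER NOT NULL,
    order_doc XML
  )
  DISTRIBUTE BY HASH (cust_id)
  IN ts_pd;   -- table space defined in the multi-partition database partition group

  -- XML indexes and queries work as in a single-partition database
  CREATE INDEX orders_status_ix ON orders (order_doc)
    GENERATE KEY USING XMLPATTERN '/Order/@Status' AS SQL VARCHAR(20);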
XML Scalability on InfoSphere Warehouse (a.k.a. DPF) – Results in a Nutshell
• The charts show the elapsed-time speedup of queries when going from a 4-partition setup to an 8-partition setup; the queries tested have a similar star-schema balance for relational and XML data
• Each query was run in 2 or 3 equivalent variants:
– Completely relational ("rel")
– Completely XML ("xml")
– XML extraction/predicates with relational joins ("xmlrel") (join queries only)
• Simple queries measured: count with index, count without index, grouped aggregate, update, collocated join, non-collocated join; plus 10 complex queries
• XML queries/updates/deletes scale as well as relational ones
• Average XML query speedup is 96% of the relational speedup

Statement Concentrator
What is the statement concentrator?
• A technology that allows dynamic SQL statements that are identical except for the values of their literals to share the same access plan
• The statement concentrator is disabled by default and can be enabled either through the database configuration parameter (STMT_CONC) or through the prepare attribute
Why do we need the statement concentrator?
• This feature is aimed at OLTP workloads where simple statements are repeatedly generated with different literal values; in these workloads, the cost of recompiling the statements many times adds significant overhead
• The statement concentrator avoids this compilation overhead by allowing the compiled statement to be reused, regardless of the values of the literals

Statement Concentrator – Results in a Nutshell
• The statement concentrator allows prepares to run up to 25x faster for a single user and 19x faster for 20 users
– [Chart: prepare time for 20,000 statements with 20 users – 436 sec with the concentrator off vs. 23 sec with it on, a 19x reduction (lower is better)]
• The statement concentrator improved throughput by 35% in a typical OLTP workload using 25 users
– [Chart: OLTP throughput – 133 with the concentrator off vs. 180 with it on (higher is better)]

Currently Committed
What is Currently Committed?
• Currently committed semantics were introduced in DB2 9.7 to improve concurrency: under Cursor Stability (CS) isolation, readers are no longer blocked waiting for writers to release row locks
• Readers are given the currently committed version of the data, that is, the version prior to the start of the write operation
• Currently committed is controlled with the CUR_COMMIT database configuration parameter
Why do we need the Currently Committed feature?
• Customers running high-throughput database applications cannot tolerate waiting on locks during transaction processing and require non-blocking behavior for read transactions
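Both features are switched on through database configuration parameters. A minimal CLP sketch, assuming a database named SAMPLE; note that CUR_COMMIT defaults to ON only for databases created in 9.7, so upgraded databases need it enabled explicitly.

  -- Enable the statement concentrator for dynamic SQL with literals
  UPDATE DB CFG FOR sample USING STMT_CONC LITERALS;

  -- Enable currently committed semantics
  UPDATE DB CFG FOR sample USING CUR_COMMIT ON;

  -- The settings are picked up when the database is next activated
  -- (deactivate/activate, or wait for all applications to disconnect)
  DEACTIVATE DB sample;
  ACTIVATE DB sample;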
Currently Committed – Results in a Nutshell
• By enabling currently committed, we use CPU that was previously idle (18%), leading to an increase of over 28% in throughput
– [Charts: CPU analysis (user/system/idle/iowait) with CC disabled vs. enabled, and throughput of an OLTP workload – 981.25 transactions per second with currently committed disabled vs. 1,260.89 with it enabled, 28% more throughput (higher is better)]
• With currently committed enabled, LOCK WAIT time is reduced by nearly 20%
• We observe the expected increases in LSN GAP cleaning and in logging

LOB Inlining
What is LOB Inlining?
• LOB inlining allows customers to store LOB data within the formatted data row in a data page instead of creating a separate LOB object
• Once the LOB data is inlined into the base table row, the LOB data becomes eligible to be compressed
Why do we need the LOB Inlining feature?
• Performance increases for queries that access inlined LOB data, as no additional I/O is required to fetch the LOB data
• LOBs are prime candidates for compression given their size and the type of data they represent; by inlining LOBs, this data becomes eligible for row compression, allowing further space savings and I/O reduction

LOB Inlining – Results in a Nutshell
• [Chart: % improvement of inlined vs. non-inlined LOBs for insert, select, and update operations at 8 KB, 16 KB, and 32 KB LOB sizes; improvements range from roughly 7% to 75% (higher is better)]
• INSERT and SELECT operations benefit the most; the smaller the LOB, the bigger the benefit of inlining
• For UPDATE operations, the larger the LOB, the better the improvement
• We can expect inlined LOBs to perform the same as a VARCHAR(N+4)
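A minimal DDL sketch with hypothetical names: the INLINE LENGTH clause (DB2 9.7) sets how many bytes of a LOB are kept in the data row, so that small documents can be compressed along with the rest of the row; the sizes chosen here are illustrative only.

  -- Keep CLOB values up to about 1,000 bytes inline in the (compressed) row;
  -- larger values spill to the separate LOB storage object as before
  CREATE TABLE customer_notes (
    cust_id INTEGER NOT NULL,
    note    CLOB(1M) INLINE LENGTH 1000
  ) COMPRESS YES;

  -- INLINE LENGTH can also be set on an existing LOB column; rows already
  -- stored are only inlined when they are subsequently updated (or reorganized)
  ALTER TABLE customer_notes
    ALTER COLUMN note SET INLINE LENGTH 1000;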
Summary of Key DB2 9.7 Performance Features
• Compression for indexes, temp table spaces, and XML data results in space savings and better performance
• Range partitioning with local indexes results in space savings and better performance, including increased concurrency for operations like REORG and SET INTEGRITY; it also makes roll-in and roll-out of data more efficient
• Scan sharing improves workloads that have multiple heavy scans of the same table
• XML scalability lets customers exploit the same benefits in data warehouses as exist for relational data
• The statement concentrator improves the performance of queries that use literals by reducing their prepare times
• Currently committed increases throughput and reduces contention on locks
• LOB inlining makes this type of data eligible for compression

A Glimpse at the Future
• Preparing for new workloads
– Combined OLTP and analytics
• Preparing for new operating environments
– Virtualization
– Cloud
– Power-aware
• Preparing for new hardware
– SSD flash storage
– IBM POWER7
– Intel Nehalem EX

Conclusion
• DB2 is the performance leader
• New features in DB2 9.7 further boost performance
– For BOTH the OLTP and data warehouse areas
• Performance is a critical and integral part of DB2!
– Maintaining excellent performance on current hardware
– Over the course of DB2 maintenance
– Preparing for future hardware/OS technology

Appendix – Mandatory SAP Publication Data
Required SAP Information
• For more information regarding these results and SAP benchmarks, visit www.sap.com/benchmark.
• These benchmarks fully comply with the SAP Benchmark Council regulations and have been audited and certified by SAP AG.

SAP 3-tier SD Benchmark:
• 168,300 SD benchmark users. SAP R/3 4.7, 3-tier with database server: IBM eServer p5 Model 595, 32-way SMP, POWER5 1.9 GHz, 32 KB(D) + 64 KB(I) L1 cache per processor, 1.92 MB L2 cache and 36 MB L3 cache per 2 processors. DB2 v8.2.2, AIX 5.3 (cert # 2005021)
• 100,000 SD benchmark users. SAP R/3 4.7, 3-tier with database server: HP Integrity Model SD64A, 64-way SMP, Intel Itanium 2 1.6 GHz, 32 KB L1 cache, 256 KB L2 cache, 9 MB L3 cache. Oracle 10g, HP-UX 11i (cert # 2004068)
• 93,000 SD benchmark users. SAP R/3 4.7, 3-tier with database server: HP Integrity Superdome 64P Server, 64-way SMP, Intel Itanium 2 1.6 GHz, 32 KB L1 cache, 256 KB L2 cache, 9 MB L3 cache. SQL Server 2005, Windows 2003 (cert # 2005045)

SAP 3-tier BW Benchmark:
• 311,004 throughput/hour query navigation steps. SAP BW 3.5. Cluster of 32 servers, each an IBM x346 Model 884041U, 1 processor / 1 core / 2 threads, Intel Xeon 3.6 GHz, L1 Execution Trace Cache, 2 MB L2 cache, 2 GB main memory. DB2 8.2.3, SLES 9 (cert # 2005043)

SAP TRBK Benchmark:
• 15,519,000 day-processing postings to bank accounts/hour. SAP Deposit Management 4.0. IBM System p570, 4 core, POWER6, 64 GB RAM. DB2 9 on AIX 5.3 (cert # 2007050)
• 10,012,000 day-processing postings to bank accounts/hour. SAP Account Management 3.0. Sun Fire E6900, 16 core, UltraSPARC IV, 56 GB RAM. Oracle 10g on Solaris 10 (cert # 2006018)
• 8,279,000 day-processing postings to bank accounts/hour. SAP Account Management 3.0. HP rx8620, 16 core, HP mx2 DC, 64 GB RAM. SQL Server on Windows Server (cert # 2005052)

SAP 2-tier SD Benchmark:
• 39,100 SD benchmark users. SAP ECC 6.0. Sun SPARC Enterprise Server M9000, 64 processors / 256 cores / 512 threads, SPARC64 VII 2.52 GHz, 64 KB(D) + 64 KB(I) L1 cache per core, 6 MB L2 cache per processor, 1024 GB main memory. Oracle 10g on Solaris 10 (cert # 2008-042-1)
• 35,400 SD benchmark users. SAP ECC 6.0. IBM Power 595, 32 processors / 64 cores / 128 threads, POWER6 5.0 GHz, 128 KB L1 cache and 4 MB L2 cache per core, 32 MB L3 cache per processor, 512 GB main memory. DB2 9.5, AIX 6.1 (cert # 2008019)
• 30,000 SD benchmark users. SAP ECC 6.0. HP Integrity SD64B, 64 processors / 128 cores / 256 threads, Dual-Core Intel Itanium 2 9050 1.6 GHz, 32 KB(I) + 32 KB(D) L1 cache, 2 MB(I) + 512 KB(D) L2 cache, 24 MB L3 cache, 512 GB main memory. Oracle 10g on HP-UX 11iV3 (cert # 2006089)
• 23,456 SD benchmark users. SAP ECC 5.0. Central server: IBM System p5 Model 595, 64-way SMP, POWER5+ 2.3 GHz, 32 KB(D) + 64 KB(I) L1 cache per processor, 1.92 MB L2 cache and 36 MB L3 cache per 2 processors. DB2 9, AIX 5.3 (cert # 2006045)
• 20,000 SD benchmark users. SAP ECC 4.7. IBM eServer p5 Model 595, 64-way SMP, POWER5 1.9 GHz, 32 KB(D) + 64 KB(I) L1 cache per processor, 1.92 MB L2 cache and 36 MB L3 cache per 2 processors, 512 GB main memory. (cert # 2004062)

These benchmarks fully comply with the SAP Benchmark Council's benchmark regulations and have been audited and certified by SAP. For more information, see http://www.sap.com/benchmark

Session C05: DB2 9.7 Performance Update
Serge Rielau, IBM Toronto Lab
srielau@ca.ibm.com