Oracle Active Data Guard Performance
Joseph Meeks, Director, Product Management, Oracle High Availability Systems
Copyright © 2012, Oracle and/or its affiliates. All rights reserved.

Note to viewer
These slides provide various aspects of performance data for Data Guard and Active Data Guard; we are in the process of updating them for Oracle Database 12c. They can be shared with customers, but they are not intended to be a canned presentation ready to go in its entirety. They provide SCs with data that can be used to substantiate Data Guard performance or to provide focused answers to particular concerns expressed by customers.

Note to viewer
See this FAQ for more customer and sales collateral:
http://database.us.oracle.com/pls/htmldb/f?p=301:75:101451461043366::::P75_ID,P75_AREAID:21704,2

Agenda – Data Guard Performance
– Failover and Switchover Timings
– SYNC Transport Performance
– ASYNC Transport Performance
– Primary Performance with Multiple Standby Databases
– Redo Transport Compression
– Standby Apply Performance

Data Guard 12.1 Example – Faster Failover
[Chart: failover completed in 48 seconds and in 43 seconds, in each case with 2,000 database sessions on both primary and standby]

Data Guard 12.1 Example – Faster Switchover
[Chart: switchover completed in 72 seconds with 1,000 database sessions, and in 83 seconds with 500 database sessions, on both primary and standby]

Synchronous Redo Transport – Zero Data Loss
Primary database performance is impacted by the total round-trip time for an acknowledgement to be received from the standby database:
– The Data Guard NSS process transmits redo to the standby directly from the log buffer, in parallel with the local log file write
– The standby receives the redo, writes it to a standby redo log file (SRL), then returns an ACK
– The primary receives the standby ACK, then acknowledges commit success to the application
The following performance tests show the impact of SYNC transport on the primary database using various workloads and latencies. In all cases, transport was able to keep pace with redo generation – no lag.
We are working on test data for Fast Sync (SYNC NOAFFIRM) in Oracle Database 12c: the same process as above, except that the standby acknowledges the primary as soon as redo is received in memory – it does not wait for the SRL write.
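To make the mechanics above concrete, here is a minimal configuration sketch for a synchronous destination, with the 12c Fast Sync variant as a commented alternative. The service and DB_UNIQUE_NAME value 'boston' is a hypothetical example, not a name from these tests:

    -- On the primary: ship redo synchronously; commit waits for the standby ACK
    ALTER SYSTEM SET LOG_ARCHIVE_DEST_2 =
      'SERVICE=boston SYNC AFFIRM NET_TIMEOUT=30
       VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=boston';
    ALTER SYSTEM SET LOG_ARCHIVE_DEST_STATE_2 = ENABLE;

    -- Oracle Database 12c Fast Sync: the standby acknowledges on receipt of
    -- redo in memory instead of waiting for the SRL write
    -- ALTER SYSTEM SET LOG_ARCHIVE_DEST_2 =
    --   'SERVICE=boston SYNC NOAFFIRM NET_TIMEOUT=30
    --    VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=boston';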
Test 1) Synchronous Redo Transport – OLTP with Random Small Inserts, <1 ms RTT Network Latency
Workload:
– Random small inserts (OLTP) to 9 tables with 787 commits per second
– 132 KB redo size, 1,368 logical reads, 692 block changes per transaction
Sun Fire X4800 M2 (Exadata X2-8)
– 1 TB RAM, 64 cores, Oracle Database 11.2.0.3, Oracle Linux
– InfiniBand, seven Exadata cells, Exadata Software 11.2.3.2
Exadata Smart Flash, Smart Flash Logging, and Write-Back flash were enabled and provided significant gains

Test 1) Synchronous Redo Transport – Results (local standby, <1 ms RTT)

                                  Txn rate (per sec)   Redo rate (bytes/sec)
  Data Guard SYNC enabled               787               104,051,368.8
  Data Guard transport disabled         790.6             104,143,368.0

At a 99 MB/sec redo rate: <1% impact on database throughput, 1% impact on transaction rate. (RTT = network round-trip time)

Test 2) Synchronous Redo Transport – Swingbench OLTP Workload with Metro-Area Network Latency
Exadata X2-8, 2-node RAC database – Smart Flash Logging, smart write-back flash
Swingbench OLTP workload
– Random DMLs, 1 ms think time, 400 users, 6,000+ transactions per second, 30 MB/sec peak redo rate (a different workload from Test 1)
Transaction profile
– 5 KB redo size, 120 logical reads, 30 block changes per transaction
1 and 5 ms RTT network latency

Test 2) Synchronous Redo Transport – Results
[Chart: Swingbench OLTP at 30 MB/sec redo – 6,363 tps with no Data Guard (baseline); 6,151 tps with Data Guard SYNC at 1 ms RTT (3% impact); 6,077 tps with Data Guard SYNC at 5 ms RTT (5% impact)]

Test 3) Synchronous Redo Transport – Large Insert OLTP Workload with Metro-Area Network Latency
Exadata X2-8, 2-node RAC database – Smart Flash Logging, smart write-back flash
Large insert OLTP workload
– 180+ transactions per second, 83 MB/sec peak redo rate, random tables
Transaction profile
– 440 KB redo size, 6,000 logical reads, 2,100 block changes per transaction
1, 2, and 5 ms RTT network latency

Test 3) Synchronous Redo Transport – Results
[Chart: large insert OLTP at 83 MB/sec redo – 189 tps with no Data Guard (baseline); 188 tps at 1 ms RTT (<1% impact); 177 tps at 2 ms RTT (7% impact); 167 tps at 5 ms RTT (12% impact)]

Test 4) Synchronous Redo Transport – Mixed OLTP Workload with Metro-Area Network Latency
Exadata X2-8, 2-node RAC database – Smart Flash Logging, smart write-back flash
Mixed workload with high TPS
– Swingbench plus large insert workloads
– 26,000+ transactions per second and 112 MB/sec peak redo rate
Transaction profile
– 4 KB redo size, 51 logical reads, 22 block changes per transaction
1, 2, and 5 ms RTT network latency
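Per-transaction profiles like those quoted in these test descriptions can be approximated from V$SYSSTAT counters. A minimal sketch (these counters are cumulative since instance startup, so for an accurate profile take the delta of each value across the test interval):

    -- Approximate redo bytes, logical reads, and block changes per commit
    SELECT ROUND(redo.value  / commits.value) AS redo_bytes_per_txn,
           ROUND(reads.value / commits.value) AS logical_reads_per_txn,
           ROUND(chgs.value  / commits.value) AS block_changes_per_txn
    FROM  (SELECT value FROM v$sysstat WHERE name = 'redo size')             redo,
          (SELECT value FROM v$sysstat WHERE name = 'session logical reads') reads,
          (SELECT value FROM v$sysstat WHERE name = 'db block changes')      chgs,
          (SELECT value FROM v$sysstat WHERE name = 'user commits')          commits;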
Test 4) Synchronous Redo Transport – Results
Swingbench plus large insert at 112 MB/sec redo: 3% impact at <1 ms RTT, 5% impact at 2 ms RTT, 6% impact at 5 ms RTT

  RTT       Txns/sec   Redo rate (MB/sec)   % of baseline workload
  No SYNC    29,496         116                    100%
  0 ms*      28,751         112                     97%
  2 ms       27,995         109                     95%
  5 ms       27,581         107                     94%
  10 ms      26,860         104                     91%
  20 ms      26,206         102                     89%

  * 0 ms on the graph represents values falling in the range <1 ms.

Additional SYNC Configuration Details – For the Previous Series of Synchronous Transport Tests
No system bottlenecks (CPU, I/O, or memory) were encountered during any of the test runs:
– Primary and standby databases had 4 GB online redo logs
– Log buffer was set to the maximum of 256 MB
– OS max TCP socket buffer size was set to 128 MB on both primary and standby
– Oracle Net was configured on both sides to send and receive 128 MB, with an SDU of 32K
– Redo was shipped over a 10 GigE network between the two systems
– Approximately 8–12 checkpoints/log switches occurred per run

Customer References for SYNC Transport
Fannie Mae – case study that includes performance data
Other SYNC references:
– Amazon
– Intel
– MorphoTrak (prior biometrics division of Motorola) – case study, podcast, presentation
– Enterprise Holdings
– Discover Financial Services – podcast, presentation
– Paychex
– VocaLink

Synchronous Redo Transport – Caveat that Applies to ALL SYNC Performance Comparisons
Redo rates achieved are influenced by network latency, redo-write size, and commit concurrency, in a dynamic relationship with each other that will vary for every environment and application. These test results illustrate how an example workload can scale with minimal impact to primary database performance. Actual mileage will vary with each application and environment; Oracle recommends that customers conduct their own tests, using their own workload and environment. Oracle tests are not a substitute.

Agenda
– Failover and Switchover Timings
– SYNC Transport Performance
– ASYNC Transport Performance
– Primary Performance with Multiple Standby Databases
– Redo Transport Compression
– Standby Apply Performance

Asynchronous Redo Transport – Near Zero Data Loss
With ASYNC, the primary does not wait for an acknowledgement from the standby:
– A Data Guard NSA process transmits directly from the log buffer, in parallel with the local log file write
– NSA reads from disk (the online redo log file) if the log buffer is recycled before redo transmission is complete
ASYNC has minimal impact on primary database performance, and network latency has little, if any, impact on transport throughput
– Uses the Data Guard 11g streaming protocol and correctly sized TCP send/receive buffers
Performance tests are useful to characterize the maximum redo volume that ASYNC can support without a transport lag
– The goal is to ship redo as fast as it is generated without impacting primary performance
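A minimal sketch of the corresponding asynchronous destination, together with the standby-side query used to confirm "no transport lag" claims like those in these tests (the 'boston' names are again hypothetical):

    -- On the primary: asynchronous transport; commits do not wait for the standby
    ALTER SYSTEM SET LOG_ARCHIVE_DEST_2 =
      'SERVICE=boston ASYNC NOAFFIRM NET_TIMEOUT=30
       VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=boston';

    -- On the standby: verify transport is keeping pace with redo generation
    SELECT name, value, time_computed
    FROM   v$dataguard_stats
    WHERE  name IN ('transport lag', 'apply lag');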
Asynchronous Test Configuration Details
– 100 GB online redo logs
– Log buffer set to the maximum of 256 MB
– OS max TCP socket buffer size set to 128 MB on primary and standby
– Oracle Net configured on both sides to send and receive 128 MB
– Read buffer size set to 256 (_log_read_buffer_size=256) and archive buffers set to 256 (_log_archive_buffers=256) on primary and standby
– Redo shipped over the InfiniBand network between primary and standby nodes, which ensures that transport is not bandwidth constrained: near-zero network latency, approximate throughput of 1,200 MB/sec

ASYNC Redo Transport Performance Test (Oracle Database 11.2)
Data Guard ASYNC transport can sustain very high rates:
– 484 MB/sec on a single node
– Zero transport lag
Add RAC nodes to scale transport performance:
– Each node generates its own redo thread and has a dedicated Data Guard transport process
– Performance will scale as nodes are added, assuming adequate CPU, I/O, and network resources
A 10 GigE NIC on the standby receives data at a maximum of 1.2 GB/second
– The standby can be configured to receive redo across two or more instances

Data Guard 11g Streaming Network Protocol – High Network Latency has Negligible Impact on Network Throughput
The streaming protocol is new with Data Guard 11g. The test measured ASYNC throughput with 0–100 ms RTT.
[Chart: ASYNC redo transport rate (MB/sec) essentially unchanged at 0, 25, 50, and 100 ms RTT]
ASYNC tuning best practices:
– Set the TCP send/receive buffer size to 3 x BDP (bandwidth delay product), where BDP = bandwidth x round-trip network latency. For example, a 1 Gbit/sec link with a 25 ms RTT has a BDP of about 3.1 MB, calling for buffers of roughly 9.4 MB.
– Increase the log buffer size if needed to keep the NSA process reading from memory; see support note 951152.1 and X$LOGBUF_READHIST to determine the buffer hit rate

Agenda
– Failover and Switchover Timings
– SYNC Transport Performance
– ASYNC Transport Performance
– Primary Performance with Multiple Standby Databases
– Redo Transport Compression
– Standby Apply Performance

Multi-Standby Configuration
[Diagram: Primary A ships redo via SYNC to Local Standby B and via ASYNC to Remote Standby C]
A growing number of customers use multi-standby Data Guard configurations. Additional standbys are used for:
– Local zero data loss HA failover combined with remote DR
– Rolling maintenance to reduce planned downtime
– Offloading backups, reporting, and recovery from the primary
– Reader farms – scaling read-only performance
This leads to the question: how is primary database performance affected as the number of remote transport destinations increases? (A configuration sketch for a multi-standby setup follows the next chart.)

Redo Transport in Multi-Standby Configuration – Primary Performance Impact: 14 Asynchronous Transport Destinations
[Charts: increase in CPU and change in redo volume, each compared to baseline, as ASYNC destinations grow from 0 to 14; both remain within a few percent of baseline]
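The multi-standby measurements above and below correspond to defining one LOG_ARCHIVE_DEST_n per standby. A minimal sketch with one local SYNC and one remote ASYNC destination: all names are hypothetical, the COMPRESSION attribute (which requires the Advanced Compression option) anticipates the compression section later in this deck, and FAL_SERVER anticipates the gap-resolution slide that follows:

    -- Local standby 'boston': zero data loss via SYNC
    ALTER SYSTEM SET LOG_ARCHIVE_DEST_2 =
      'SERVICE=boston SYNC AFFIRM NET_TIMEOUT=30
       VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=boston';

    -- Remote standby 'denver': ASYNC, optionally compressing redo in flight
    ALTER SYSTEM SET LOG_ARCHIVE_DEST_3 =
      'SERVICE=denver ASYNC NOAFFIRM COMPRESSION=ENABLE
       VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=denver';

    -- On each standby: prefer other standbys (not the primary) for gap requests
    ALTER SYSTEM SET FAL_SERVER = 'boston', 'denver';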
Redo Transport in Multi-Standby Configuration – Primary Performance Impact: 1 SYNC and Multiple ASYNC Destinations
[Charts: increase in CPU and change in redo volume, each compared to baseline, for 0, 1/0, 1/1, and 1/14 SYNC/ASYNC destinations; both remain within a few percent of baseline]

Redo Transport for Gap Resolution
Standby databases can be configured to request log files needed to resolve gaps from other standbys in a multi-standby configuration. A standby database that is local to the primary database is normally the preferred location to service gap requests:
– A local standby database is least likely to be impacted by network outages
– Other standbys are listed next
– The primary database services gap requests only as a last resort
– Utilizing a standby for gap resolution avoids any overhead on the primary database

Agenda
– Failover and Switchover Timings
– SYNC Transport Performance
– ASYNC Transport Performance
– Primary Performance with Multiple Standby Databases
– Redo Transport Compression
– Standby Apply Performance

Redo Transport Compression – Conserve Bandwidth and Improve RPO when Bandwidth Constrained
Test configuration:
– 12.5 MB/sec bandwidth
– 22 MB/sec redo volume
Uncompressed, the redo volume exceeds the available bandwidth: the Recovery Point Objective (RPO) is impossible to achieve, and the transport lag increases perpetually.
[Chart: transport lag (MB) over elapsed time (minutes) – the 22 MB/sec uncompressed stream falls ever further behind, while the 12 MB/sec compressed stream keeps pace]
A 50% compression ratio results in:
– volume < bandwidth = RPO achieved
– the compression ratio will vary across workloads
Requires the Advanced Compression option.

Agenda
– Failover and Switchover Timings
– SYNC Transport Performance
– ASYNC Transport Performance
– Primary Performance with Multiple Standby Databases
– Redo Transport Compression
– Standby Apply Performance

Standby Apply Performance Test
Redo apply was first disabled to accumulate a large number of log files at the standby database; redo apply was then restarted to evaluate the maximum apply rate for this workload. All standby log files were written to disk in the Fast Recovery Area.
Exadata Write-Back Flash Cache increased the redo apply rate from 72 MB/second to 174 MB/second using the test workload (Oracle 11.2.0.3)
– Apply rates will vary based on platform and workload
The achieved volumes do not represent physical limits – they represent only this particular test configuration and workload; higher apply rates have been achieved in practice by production customers.

Apply Performance at Standby Database – Test 1: No Write-Back Flash Cache
Exadata X2-2 quarter rack, Swingbench OLTP workload
72 MB/second apply rate
– I/O bound during checkpoints
– 1,762 ms for checkpoint complete
– 110 ms db file parallel write
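Tests like this one pause redo apply to build a backlog of log files and then restart it. A minimal sketch of the physical-standby statements involved (11.2 syntax, run on the standby):

    -- Stop redo apply so archived logs accumulate in the Fast Recovery Area
    ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;

    -- ...run the workload on the primary to build the backlog, then restart
    -- real-time apply and measure the rate at which the backlog drains...
    ALTER DATABASE RECOVER MANAGED STANDBY DATABASE
      USING CURRENT LOGFILE DISCONNECT FROM SESSION;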
Apply Performance at Standby Database – Test 2: Repeat of the Previous Test with Write-Back Flash Cache Enabled
Exadata X2-2 quarter rack, Swingbench OLTP workload
174 MB/second apply rate
– Checkpoint completes in 633 ms vs 1,762 ms
– db file parallel write is 21 ms vs 110 ms

Two Production Customer Examples – Data Guard Redo Apply Performance
Thomson-Reuters
– Data warehouse on Exadata, prior to write-back flash cache
– While resolving a gap, observed an average apply rate of 580 MB/second
Allstate Insurance
– Data warehouse ETL processing resulted in an average apply rate of 668 MB/second over a 3-hour period, with peaks hitting 900 MB/second

Redo Apply Performance for Different Releases – Range of Observed Apply Rates for Batch and OLTP
[Chart: high-end batch and high-end OLTP standby apply rates (MB/sec, axis 0–700) for Oracle Database 9i, Oracle Database 10g, Oracle Database 11g (non-Exadata), and Oracle Database 11g (Exadata)]
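To compare a given standby against apply-rate figures like these, the achieved rate can be read from V$RECOVERY_PROGRESS while redo apply is running. A minimal sketch (the UNITS column reports the unit of measure for each item):

    -- Current and average redo apply rates on the standby
    SELECT item, units, sofar
    FROM   v$recovery_progress
    WHERE  item IN ('Active Apply Rate', 'Average Apply Rate');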