Oracle 10g RAC Scalability – Lessons Learned
Bert Scalzo, Ph.D.
Bert.Scalzo@Quest.com

About the Author
• Oracle developer and DBA for 20 years, versions 4 through 10g
• Worked for Oracle Education & Consulting
• Holds several Oracle Masters certifications (DBA & CASE)
• BS, MS, and PhD in Computer Science, plus an MBA
• LOMA insurance industry designations: FLMI and ACS
• Books
  – The TOAD Handbook (March 2003)
  – Oracle DBA Guide to Data Warehousing and Star Schemas (June 2003)
  – TOAD Pocket Reference, 2nd Edition (June 2005)
• Articles
  – Oracle Magazine
  – Oracle Technology Network (OTN)
  – Oracle Informant
  – PC Week (now eWeek)
  – Linux Journal
  – www.Linux.com

About Quest Software
The Quest products used in this paper are Spotlight on RAC, Benchmark Factory, and TOAD for Oracle.

Project Formation
This paper is based upon collaborative RAC research efforts between Quest Software and Dell Computers.
Quest:
• Bert Scalzo
• Murali Vallath – author of RAC articles and books
Dell:
• Anthony Fernandez
• Zafar Mahmood
Also, an extra special thanks to Dell for allocating a million dollars' worth of equipment to make such testing possible.

Project Purpose
Quest:
• To partner with a leading hardware vendor
• To field test and showcase our RAC-enabled software
  – Spotlight on RAC
  – Benchmark Factory
  – TOAD for Oracle with the DBA module
Dell:
• To write a Dell Power Edge Magazine article about the OLTP scalability of Oracle 10g RAC running on typical Dell servers and EMC storage arrays
• To create a standard methodology for all benchmarking of database servers, to be used for future articles and for lab testing and demonstration purposes

OLTP Benchmarking
TPC benchmark (www.tpc.org)
TPC Benchmark™ C (TPC-C) is an OLTP workload. It is a mixture of read-only and update-intensive transactions that simulate the activities found in complex OLTP application environments. It does so by exercising a breadth of system components associated with such environments, which are characterized by:
• The simultaneous execution of multiple transaction types that span a breadth of complexity
• On-line and deferred transaction execution modes
• Multiple on-line terminal sessions
• Moderate system and application execution time
• Significant disk input/output
• Transaction integrity (ACID properties)
• Non-uniform distribution of data access through primary and secondary keys
• Databases consisting of many tables with a wide variety of sizes, attributes, and relationships
• Contention on data access and update
Excerpt from "TPC BENCHMARK™ C: Standard Specification, Revision 3.5"

Create the Load - Benchmark Factory
The TPC-C like benchmark measures on-line transaction processing (OLTP) workloads. It combines read-only and update-intensive transactions simulating the activities found in complex OLTP enterprise environments.
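To make the workload shape concrete, here is a minimal sketch of one TPC-C style "New Order" interaction, written as a shell script around SQL*Plus. The table and column names (warehouse, district) follow the TPC-C specification, but the connect string (tpcc/tpcc@racdb) and the hard-coded keys are hypothetical; this is not Benchmark Factory's generated workload, which drives the full five-transaction mix at far higher volume.

#!/bin/sh
# Illustrative sketch only -- not Benchmark Factory's generated SQL.
# Approximates the read/update shape of a TPC-C "New Order" interaction
# against the standard TPC-C tables; connect string and keys are hypothetical.
W_ID=1; D_ID=5

sqlplus -s tpcc/tpcc@racdb <<EOF
SET FEEDBACK OFF
-- Read-only portion of the mix: look up the warehouse tax rate
SELECT w_tax FROM warehouse WHERE w_id = ${W_ID};

-- Update-intensive portion: take the district's next order number
UPDATE district
   SET d_next_o_id = d_next_o_id + 1
 WHERE d_w_id = ${W_ID}
   AND d_id   = ${D_ID};

COMMIT;
EXIT
EOF

Benchmark Factory generates and schedules this kind of work for hundreds or thousands of virtual users, which is what the user-load scenarios in the methodology below refer to.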
Monitor the Load - Spotlight on RAC

Hardware & Software
Servers, Storage and Software

Oracle 10g RAC Cluster Servers
• 10 x 2-CPU Dell PowerEdge 1850
• 3.8 GHz P4 processors with Hyper-Threading
• 4 GB RAM (later expanded to 8 GB RAM)
• 1 x 1 Gb NIC (Intel) for LAN
• 2 x 1 Gb LOMs teamed for the RAC interconnect
• 1 x dual-port HBA (QLogic 2342)
• DRAC

Benchmark Factory Servers
• 2 x 4-CPU Dell PowerEdge 6650
• 8 GB RAM

Storage
• 1 x Dell | EMC CX700
• 1 x DAE unit: total 30 x 73 GB 15K RPM disks
  – Raid Group 1: 16 disks providing 4 x 50 GB RAID 1/0 LUNs for data and backup
  – Raid Group 2: 10 disks providing 2 x 20 GB RAID 1/0 LUNs for redo logs
  – Raid Group 3: 4 disks providing 1 x 5 GB RAID 1/0 LUN for the voting disk, OCR, and spfiles
• 2 x Brocade SilkWorm 3800 Fibre Channel switches (16 port)
• Configured with 8 paths to each logical volume

Network
• 1 x Gigabit 5224 Ethernet switch (24 port) for the private interconnect
• 1 x Gigabit 5224 Ethernet switch for the public LAN

Software
• RHEL AS 4 QU1 (32-bit)
• EMC PowerPath 4.4
• EMC Navisphere agent
• Oracle 10g R1 10.1.0.4
• Oracle ASM 10.1.0.4
• Oracle Cluster Ready Services 10.1.0.4
• Linux bonding driver for the interconnect
• Dell OpenManage
• Windows 2003 Server
• Quest Benchmark Factory application
• Quest Benchmark Factory agents
• Quest Spotlight on RAC
• Quest TOAD for Oracle
• Flare Code Release 16

The Linux bonding driver was used to team the dual onboard NICs for the private interconnect.
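For reference, the interconnect teaming above uses the standard Red Hat Enterprise Linux 4 bonding driver. The sketch below shows one typical way such a configuration is expressed; the bonding mode (balance-alb), the eth1/eth2 slave devices, and the 192.168.10.x private address are assumptions for illustration, since the paper does not record the exact settings used.

#!/bin/sh
# Run as root. Hedged sketch of RHEL 4 NIC bonding for the RAC interconnect;
# mode, device names, and addresses are illustrative assumptions.

# Load the bonding driver for a bond0 device
cat >> /etc/modprobe.conf <<EOF
alias bond0 bonding
options bonding miimon=100 mode=balance-alb
EOF

# The bonded interface carries the private interconnect address
cat > /etc/sysconfig/network-scripts/ifcfg-bond0 <<EOF
DEVICE=bond0
IPADDR=192.168.10.1
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none
EOF

# Each onboard LOM is enslaved to bond0 (repeat for eth2)
cat > /etc/sysconfig/network-scripts/ifcfg-eth1 <<EOF
DEVICE=eth1
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none
EOF

With both LOMs enslaved, traffic is balanced across the pair and either link can fail without losing the interconnect, which is the availability benefit noted in the conclusions.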
TOAD for Oracle Setup

Planned vs. Actual
Planned:
• Red Hat 4 Update 1 64-bit
• Oracle 10.2.0.1 64-bit
Actual:
• Red Hat 4 Update 1 32-bit
• Oracle 10.1.0.4 32-bit
Issues:
• Driver problems with 64-bit (no real surprise)
• Some software incompatibilities with 10g R2
• Known ASM issues require 10.1.0.4, not earlier

Testing Methodology – Steps 1 A-C
1. For a single node and instance
  a. Establish a fundamental baseline
    i. Install the operating system and Oracle database (keeping all normal installation defaults)
    ii. Create and populate the test database schema
    iii. Shutdown and startup the database
    iv. Run a simple benchmark (e.g. TPC-C for 200 users) to establish a baseline for default operating system and database settings
  b. Optimize the basic operating system
    i. Manually optimize typical operating system settings
    ii. Shutdown and startup the database
    iii. Run a simple benchmark (e.g. TPC-C for 200 users) to establish a new baseline for basic operating system improvements
    iv. Repeat the prior three steps until a performance balance results
  c. Optimize the basic non-RAC database
    i. Manually optimize typical database "spfile" parameters
    ii. Shutdown and startup the database
    iii. Run a simple benchmark (e.g. TPC-C for 200 users) to establish a new baseline for basic Oracle database improvements
    iv. Repeat the prior three steps until a performance balance results

Testing Methodology – Steps 1 D-E
  d. Ascertain the reasonable per-node load
    i. Manually optimize scalability database "spfile" parameters
    ii. Shutdown and startup the database
    iii. Run an increasing user load benchmark (e.g. TPC-C for 100 to 800 users, incrementing by 100) to find the "sweet spot" of how many concurrent users a node can reasonably support
    iv. Monitor the benchmark run via the vmstat command, looking for the point where excessive paging and swapping begins and where the CPU idle time consistently approaches zero
    v. Record the "sweet spot" number of concurrent users – this represents an upper limit
    vi. Reduce the "sweet spot" number of concurrent users by some reasonable percentage to account for RAC architecture and inter/intra-node overheads (e.g. reduce by, say, 10%)
  e. Establish the baseline RAC benchmark
    i. Shutdown and startup the database
    ii. Create an increasing user load benchmark based upon the node count and the "sweet spot" (e.g. TPC-C for 100 to node count * sweet spot users, incrementing by 100)
    iii. Run the baseline RAC benchmark

Step 1B - Optimize Linux Kernel
Linux kernel parameters in /etc/sysctl.conf:
kernel.shmmax = 2147483648
kernel.sem = 250 32000 100 128
fs.file-max = 65536
fs.aio-max-nr = 1048576
net.ipv4.ip_local_port_range = 1024 65000
net.core.rmem_default = 262144
net.core.rmem_max = 262144
net.core.wmem_default = 262144
net.core.wmem_max = 262144

Step 1C - Optimize Oracle Binaries
Oracle compiled and linked for asynchronous IO:
1. cd to $ORACLE_HOME/rdbms/lib
  a. make -f ins_rdbms.mk async_on
  b. make -f ins_rdbms.mk ioracle
2. Set the necessary "spfile" parameters
  a. disk_asynch_io = true (default value is true)
  b. filesystemio_options = setall (for both async and direct IO)
Note that in Oracle 10g Release 2, asynchronous IO is compiled and linked in by default.

Step 1C - Optimize Oracle SPFILE
spfile adjustments shown below:
cluster_database=true
cluster_database_instances=10
db_block_size=8192
processes=16000
sga_max_size=1500m
sga_target=1500m
pga_aggregate_target=700m
db_writer_processes=2
open_cursors=300
optimizer_index_caching=80
optimizer_index_cost_adj=40

The key idea was to eke out as much SGA memory usage as possible within the 32-bit operating system limit (about 1.7 GB). Since our servers had only 4 GB of RAM each, we figured that allocating half to Oracle was sufficient, with the remaining memory shared by the operating system and the thousands of dedicated Oracle server processes that the TPC-C like benchmark would create as its user load.

Step 1D – Find Per Node Sweet Spot
Finding the ideal per-node sweet spot is arguably the most critical aspect of the entire benchmark testing process, especially for RAC environments with more than just a few nodes.

What we did at first:
• We initially ran a 100-800 user TPC-C on the single node
• We did not monitor the database server using the vmstat command
• We simply looked at the BMF transactions per second graph, which kept climbing to beyond 700 users
• We assumed this meant the "sweet spot" was 700 users per node (and did not factor in any overhead)

What was happening in reality:
• The operating system was being overstressed and exhibited thrashing characteristics at about 600 users
• Running benchmarks at 700 users per node did not scale either reliably or predictably beyond four servers
• Our belief is that by taking each box to a near-thrashing threshold through our overzealous per-node user load selection, the nodes did not have sufficient resources available to communicate in a timely enough fashion for inter/intra-node messaging, and thus Oracle began to think that nodes were either dead or non-responsive
• Furthermore, when relying upon Oracle's client- and server-side load balancing feature, which allocates connections based upon which nodes are responding, the user load per node became skewed and then exceeded our per-node "sweet spot" value. For example, when we tested 7000 users on 10 nodes, since some nodes appeared dead to Oracle, the load balancer simply directed all the sessions across whatever nodes were responding. So we ended up with nodes trying to handle far more than 700 users, and thus the thrashing was even worse.
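The monitoring we initially skipped is easy to script. The sketch below tails vmstat and flags the two symptoms described above, swapping activity and CPU idle time approaching zero; the column positions match the vmstat layout on RHEL 4, and the 5% idle threshold is an arbitrary example rather than a value from the original tests.

#!/bin/sh
# Hedged sketch: watch for the two sweet-spot warning signs while a
# benchmark user-load step runs -- swapping (si/so) and CPU idle (id)
# approaching zero. Thresholds and the 5-second interval are examples.
vmstat 5 | awk 'NR > 2 {
    si = $7; so = $8; idle = $15;   # column positions per the RHEL 4 vmstat layout
    if (si > 0 || so > 0) print "WARNING: swapping detected (si=" si " so=" so ")";
    if (idle < 5)         print "WARNING: CPU idle at " idle "% -- near saturation";
    fflush();
}'

The user load at which these warnings start firing marks the upper limit; the per-node "sweet spot" is then that value reduced by the RAC overhead percentage described in step 1d.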
Sweet Spot Lessons Learned
• Cannot rely solely on the BMF transactions per second graph
• Throughput can still be increasing while the server is beginning to thrash
• Need to monitor the database server with vmstat and other tools
• Must stop just shy of bandwidth limits (RAM, CPU, IO)
• Must factor in multi-node overhead, and reduce accordingly
• Prior to 10g R2, it is better to rely on application (BMF) load balancing
• If you're not careful on this step, you'll run into roadblocks which either invalidate your results or simply cannot scale!!!

Testing Methodology – Steps 2 A-C
2. For the 2nd through Nth nodes and instances
  a. Duplicate the environment
    i. Install the operating system
    ii. Duplicate all of the base node's operating system settings
  b. Add the node to the cluster
    i. Perform node registration tasks
    ii. Propagate the Oracle software to the new node
    iii. Update the database "spfile" parameters for the new node
    iv. Alter the database to add node-specific items (e.g. redo logs)
  c. Run the baseline RAC benchmark
    i. Update the baseline benchmark criteria to include user load scenarios from the prior run's maximum up to the new maximum based upon node count * "sweet spot" of concurrent users, using the baseline benchmark's constant for the increment
    ii. Shutdown and startup the database, adding the new instance
    iii. Run the baseline RAC benchmark
    iv. Plot the transactions per second graph showing this run versus all the prior baseline benchmark runs – the results should show a predictable and reliable scalability factor

Step 2C – Run OLTP Test per Node
With the correct per-node user load now identified and load balancing guaranteed, it was a very simple (although time consuming) exercise to run the TPC-C like benchmarks listed below:

1 node:   100 to 500 users, increment by 100
2 nodes:  100 to 1000 users, increment by 100
4 nodes:  100 to 2000 users, increment by 100
6 nodes:  100 to 3000 users, increment by 100
8 nodes:  100 to 4000 users, increment by 100
10 nodes: 100 to 5000 users, increment by 100

Benchmark Factory's default TPC-C like test iteration requires about 4 minutes for a given user load. So for the single node with five user load scenarios, the overall OLTP benchmark test run requires 20 minutes. During the entire testing process the load was monitored with Spotlight on RAC to identify any hiccups.

Some Speed Bumps Along the Way
As illustrated below, when we reached our four-node tests we found that CPU utilization on nodes racdb1 and racdb3 reached 84% and 76% respectively. Analyzing the root cause, the problem was related to a temporary overload of users on these servers and to ASM response time.

Some ASM Fine Tuning Necessary
We increased the following parameters on the ASM instance, ran our four-node tests again, and all was well beyond this point:

Parameter      Default Value   New Value
SHARED_POOL    32M             67M
LARGE_POOL     12M             67M

These were the only parameter changes we had to make to the ASM instance, and beyond this everything worked smoothly.
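For readers wanting to reproduce the change, the ASM adjustment above maps to the SHARED_POOL_SIZE and LARGE_POOL_SIZE initialization parameters. A minimal sketch is shown below; it assumes the ASM instance is named +ASM1, that its ORACLE_HOME is already set in the environment, and that it runs from an spfile, and the instance must be bounced for the new values to take effect.

#!/bin/sh
# Hedged sketch of the ASM tuning described above. Instance name, spfile
# usage, and environment setup are assumptions for illustration.
export ORACLE_SID=+ASM1

sqlplus -s "/ as sysdba" <<EOF
ALTER SYSTEM SET shared_pool_size = 67M SCOPE=SPFILE;
ALTER SYSTEM SET large_pool_size  = 67M SCOPE=SPFILE;
-- Restart the ASM instance (and its dependent database instances)
-- for the new values to take effect.
EXIT
EOF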
Smooth Sailing After That
Shown below are the cluster-level latency charts from Spotlight on RAC during our eight-node run. These indicated that the interconnect latency was well within expectations and on par with typical industry network latency numbers.

Full Steam Ahead!
As shown below, ASM was performing extremely well at this user load. Ten instances with over 5000 users showed excellent service times from ASM; in fact the I/O rate was quite high, topping 2500 I/Os per second.

Final Results
Other than some basic monitoring to make sure that all is well and the tests are working, there's really not very much to do while these tests run – so bring a good book to read. The final results are shown below.

Interpreting the Results
The results are quite interesting. As the previous graph clearly shows, Oracle RAC and ASM are very predictable and reliable in terms of scalability. Each successive node continues the near-linear line almost without issue.

There are 3 or 4 noticeable troughs in the graph for the 8 and 10 node test runs that seem out of place. Note that we had one database instance that was throwing numerous ORA-00600 [4194] errors related to its UNDO tablespace, and that one node took significantly longer to start up and shut down than all the other nodes combined. A search of Oracle's Metalink web site located references to a known problem that would require a database restore or rebuild. Since we were tight on time, we decided to ignore those couple of valleys in the graph, because it's pretty obvious from the overall results that smoothing over those few inconsistent points would yield a near-perfect graph – showing that RAC is truly reliable and predictable in terms of scalability.

Projected RAC Scalability
Using the 6-node graph results to project forward, the figure below shows a reasonable expectation in terms of realizable scalability – 17 nodes should approach 500 TPS and support about 10,000 concurrent users.

Next Steps …
• Since the first iteration of testing was limited by memory, we upgraded each database server from 4 to 8 GB RAM
• Now able to scale up to 50% more users per node
• Now doing zero paging and/or swapping
• But – now CPU bound
• Next step: replace each CPU with a dual-core Pentium
• Increase from 4 CPUs (2 real / 2 virtual) to 8 CPUs
• Should be able to double users again ???
• Will we now reach IO bandwidth limits ???
• We will write about those results in future Dell articles…

Conclusions …
There were a few minor hiccups in the initial round, where we tried to determine the optimal user load per node for the given hardware and processor configuration.

The scalability of the RAC cluster was outstanding. The addition of every node to the cluster showed steady, close to linear scalability – close to linear because of the small overhead that the cluster interconnect consumes during block transfers between instances.

The interconnect also performed very well. In this particular case the NIC pairing/bonding feature of Linux was implemented to provide load balancing across the redundant interconnects, which also helped provide availability should any one interconnect fail.

The Dell|EMC storage subsystem, which consisted of six ASM disk groups for the various data file types, also performed with high throughput, indicating high scalability.

EMC PowerPath provided IO load balancing and redundancy utilizing dual Fibre Channel host bus adapters on each server.

It is the unique architecture of RAC that makes this possible: irrespective of the number of instances in the cluster, the maximum number of hops performed before the requestor gets the requested block will not exceed three under any circumstances. This architecture removes limitations found in clustering technology available from other database vendors, giving maximum scalability. This was demonstrated through the tests above.
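As a closing aside, the interconnect behavior credited above can also be spot-checked from SQL*Plus rather than Spotlight on RAC. The hedged sketch below estimates the average consistent-read block receive time per instance from GV$SYSSTAT using the Oracle 10g statistic names (earlier releases use the longer "global cache ..." names); the receive time statistic is recorded in centiseconds, hence the multiplication by 10 to report milliseconds. The ORACLE_SID shown is an assumed example.

#!/bin/sh
# Hedged sketch: average CR block receive time per instance, roughly what
# the Spotlight on RAC cluster latency charts summarize. Not part of the
# original test harness.
export ORACLE_SID=racdb1   # any one instance; GV$ views span the cluster

sqlplus -s "/ as sysdba" <<EOF
SET PAGESIZE 100 LINESIZE 120
SELECT t.inst_id,
       ROUND(t.value * 10 / NULLIF(b.value, 0), 2) AS avg_cr_block_receive_ms
  FROM gv\$sysstat t, gv\$sysstat b
 WHERE t.inst_id = b.inst_id
   AND t.name = 'gc cr block receive time'
   AND b.name = 'gc cr blocks received'
 ORDER BY t.inst_id;
EXIT
EOF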
Oracle® 10g Real Application Clusters (RAC) running on standards-based Dell™ PowerEdge™ servers and Dell/EMC storage can provide a flexible, reliable platform for a database grid. In particular, Oracle 10g RAC databases on Dell hardware can easily be scaled out to provide the redundancy or additional capacity that the grid environment requires.

Questions …

Thanks for coming