hp partnering solutions
october 2002

technical white paper

Concurrency in an Oracle 9iR1 RAC Environment on HP Tru64 UNIX

summary
This document provides an informal account of CPU usage by RAC when performing certain operations in a clustered database environment. It also presents information that can be used to optimize performance under those conditions, describing changes made to the settings and parameters of the operating system and the Oracle 9iR1 RAC software, and their effects on performance, in sequence as each change is implemented.

The goal of the testing was to simulate and study the performance of bulk inserts in order to derive easy methods for optimizing the throughput of this database usage pattern in a RAC environment. Writing medium or small-sized records at a high rate into a database is a common activity and characteristic of billing or workflow applications. This type of workload is also resource-intensive and prone to causing contention in non-clustered as well as clustered database systems.

test results summary
Out-of-the-box performance has become easier to manage with the introduction of Oracle's 9iR1 RAC. However, using RAC in a clustered or complex environment still requires due care and planning before implementation in order to get the best performance possible. A summary of the test findings and issues follows:

• High CPU usage can significantly slow down global cache access speed and the processes that depend on it, even if the process runs on another node in the cluster.
• The use of RAC will impose some additional CPU overhead even when not performing concurrent processing.
• Consider application and/or data partitioning when deploying RAC.
• The 'gc_files_to_locks' setting should be carefully considered and managed when used on data files that are read-only, or on data files with low and predictable write/write or read/write access concurrency.
• Adding more nodes makes things more complex and consumes even more CPU resources. When nodes and remote users are added and the shared working set does not scale correspondingly, local buffer cache hit efficiency may drop because of an increase in the number of messages sent and buffers received from remote instances.
• Consider the load that Cache Fusion will impose on the interconnect before adding multiple nodes.

for more information
Contact the author by electronic mail at Mario.Broodbakker@hp.com. A list of publications related to this document is included in the References section.

trademark and date information
Oracle, Oracle 9i, and Oracle 9iR1 RAC are trademarks and/or registered trademarks of Oracle Corporation.

Copyright ©2002 Compaq Information Technologies Group, L.P. and Hewlett-Packard Company. All rights reserved. Printed in the U.S.A.

This performance report was prepared by the System and Applications Integrated Engineering Group of Hewlett-Packard Company. All brand names are trademarks of their respective owners. Technical information in this document is subject to change without notice.

© Copyright Hewlett-Packard Company 2002 10/02

introduction
This paper reports on testing of the effects of writer/writer concurrency and reader/writer concurrency in a RAC environment. It is not intended as an introduction to OPS or to the RAC architecture; it assumes that the reader has an understanding and working knowledge of the components involved in RAC processing. However, some concepts will be described in detail when necessary, and references will be made to existing books or whitepapers where applicable.
Most of these tests were performed on a 2-node AlphaServer ES40 cluster (4 CPUs at 500 MHz each) running Oracle 9.0.1.2. Additional tests were run on a 2-node AlphaServer ES45 cluster (4 CPUs at 1 GHz each) running Oracle 9.0.1.3 RAC [1]. Note that these tests were produced in a lab situation using a synthetic workload.

The goal of the testing was to simulate and study the performance of bulk inserts in order to derive easy methods for optimizing the throughput of this database usage pattern in a RAC environment. Writing medium or small-sized records at a high rate into a database is a common activity and characteristic of billing or workflow applications. This type of workload is also resource-intensive and prone to causing contention in non-clustered as well as clustered database systems.

Testing consisted of performing the following key operations:
• Concurrent inserts from two nodes using a randomly generated key
• Query clients joining just-inserted data with non-cached data read from disk

Carefully review the applicability and possible effects on your own workload before implementing any of the changes or recommendations made on the basis of these results.

In general RAC does a great job, far superior to its predecessor OPS, of making clustered solutions work efficiently and obtaining good scaling with minimal database administrative attention. To maintain consistency across nodes, a clustered solution involves cluster-wide global data consistency and coherence management. While the CPU overhead of this activity is normally reasonable, there are circumstances where the cost becomes relatively high. In such cases it may be worthwhile to consider application or data partitioning and/or coarse-grained global cache coherence management to achieve optimal performance.
This paper discusses such a special case, its effect on CPU utilization, and suggests a solution through specialized tuning techniques to improve on 'out of the box' scaling capabilities.

When doing performance tests to achieve high insert rates, many factors were found to influence the speed at which data could be inserted into the database. Almost every issue discovered was found and fixed using the event-based wait statistics and information found in a few books, among them Steve Adams' Oracle8i Internal Services and the OPS and RAC manuals. How and why to use the session wait and system event statistics is very well described in the Yapp paper by Anjo Kolk and Shari Yamaguchi, and in presentations by Cary Millsap and friends. This information can be found at the following web sites: www.oraperf.com and www.hotsos.com.

Basically the Yapp method says: I'm either working or waiting. When I work, it shows up as 'CPU used by this session' in v$sesstat (after joining with v$statname). Otherwise I wait, and what I am waiting for is shown in the 'event' column of v$session_wait, along with what I am waiting on in the parameter columns (p1(raw) to p3(raw)). Yapp also says not to bother looking for what you are waiting for if most of your time is spent consuming CPU. This doesn't mean, by the way, that there is no reason to investigate further why you are using a large amount of CPU time.

To get an indication of the resource consumption since database startup, get the script resource_waits.sql from Steve Adams' web site at www.ixora.com.au. This will give you information on resource usage on your system. The script is referred to and used as an example throughout this paper.

[1] Many of the tests, with only slight variations, were also performed on Oracle 8i (8.1.7) OPS.
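The working-versus-waiting split the Yapp method describes can be read directly from the dictionary views. The following queries are a minimal sketch (the views and statistic names are standard in Oracle 8i/9i; the idle-event filter is illustrative, not exhaustive):

```sql
-- 'Working' in Yapp terms: CPU consumed per session
SELECT s.sid, s.value AS cpu_cs            -- value is in centiseconds
  FROM v$sesstat s, v$statname n
 WHERE s.statistic# = n.statistic#
   AND n.name = 'CPU used by this session'
 ORDER BY s.value DESC;

-- 'Waiting' in Yapp terms: what each session is waiting for right now
SELECT sid, event, p1raw, p2, p3, wait_time, seconds_in_wait
  FROM v$session_wait
 WHERE event NOT IN ('SQL*Net message from client', 'pmon timer',
                     'smon timer', 'rdbms ipc message')   -- skip idle events
 ORDER BY seconds_in_wait DESC;
```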
Many items in the first part of this paper are also applicable to non-RAC situations and have a lot to do with storage options and sequences. The tests begin by focusing on inserts and later move on to issues associated with (concurrent) query performance. The influence of updates, modifies, and triggers in a RAC environment is not discussed.

In addition, there are many simple, documented recommendations for designing tables and indexes for high insert efficiency. These techniques include options such as freelist groups or automatic space management, multiple freelists, and locally managed tablespaces, and they are well documented in the Oracle manuals. All of these options and techniques were used during testing.

test methodology
The same database and test program were used for all tests: one order table that receives all the inserts, comprising small 65-byte records, with one index, on customer_id only. The customer_id was generated as a random number between 1 and 11 million. In addition, a unique order number was generated using an Oracle ordered sequence. Before each test the order table was truncated, and the instances were restarted between test runs for consistency.

some non-RAC notes on the configuration
Locally managed tablespaces were used for these tests (see the Oracle Administrator's manual for an explanation). Using dictionary-managed tablespaces in the earlier Oracle8i tests showed lots of 'ST' enqueue waits. Starting with Oracle9i, the default tablespace format is 'locally managed'. Bitmapped segments were not used but are planned for later tests. On the order table, freelist groups = 2 and freelists = 10 were used. A spin_count of 4,096 was used for both instances. Oracle9i's 'automatic undo' was configured as one undo tablespace/file per instance.
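In DDL, the storage settings described above might look as follows. This is a hedged reconstruction, not the paper's actual script: the table, tablespace, and column names are invented for illustration, and only the options the text mentions (locally managed tablespaces, freelist groups, freelists, an ordered sequence) come from the source:

```sql
CREATE TABLESPACE orders_ts DATAFILE '/db/orders01.dbf' SIZE 2000M
  EXTENT MANAGEMENT LOCAL UNIFORM SIZE 10M;   -- locally managed, as in the tests

CREATE TABLE orders (
  order_id    NUMBER,
  customer_id NUMBER,                          -- random value between 1 and 11 million
  filler      VARCHAR2(50)                     -- pads the row to roughly 65 bytes
) TABLESPACE orders_ts
  STORAGE (FREELIST GROUPS 2 FREELISTS 10);    -- settings used on the order table

CREATE INDEX orders_cust_idx ON orders (customer_id);

-- The unique order number came from an ordered sequence:
CREATE SEQUENCE order_seq ORDER CACHE 20;
```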
commits and arrays
When trying to achieve high insert rates it is important to minimize the commit overhead as much as possible. Therefore, commit as many rows at a time as the transaction design or application permits. Also, by inserting data in arrays of multiple rows, the number of network or IPC roundtrips can be dramatically reduced. Support for array inserts (and fetches) is present in all the Oracle pre-compilers, OCI, PL/SQL, and many standard applications.

For example, running 20 insert clients on only one node, inserting 20 * 10,000 rows and committing each row individually, showed an insert rate of about 733 rows per second. Using an array size of 100 and committing each array of 100 rows achieved a total of 2,450 inserted rows per second.

Using the response_time_breakdown script, it is easy to see where time is being spent in Oracle. Running the test resulted in the following response times:

MAJOR     MINOR          WAIT_EVENT                                 SECONDS     PCT
--------  -------------  ----------------------------------------   -------  ------
CPU time  parsing        n/a                                              2    .13%
          reloads        n/a                                              0    .02%
          execution      n/a                                            139   8.04%
disk I/O  normal I/O     db file sequential read                          1    .04%
          other I/O      control file heartbeat                           4    .24%
                         control file sequential read                    0    .00%
waits     enqueue locks  enqueue                                          1    .07%
          PCM locks      global cache cr request                          0    .01%
          other locks    latch free                                      88   5.09%
                         row cache lock                                   5    .26%
                         library cache pin                                3    .15%
                         buffer busy waits                                0    .03%
                         library cache load lock                          0    .02%
                         library cache lock                               0    .01%
latency   commits        log file sync                                    2    .13%
          network        name-service call wait                           7    .38%
                         SQL*Net more data from client                    0    .02%
                         SQL*Net message to client                        0    .00%
          process ctl    process startup                                  8    .48%
          global locks   DFS lock handle                               1465  84.48%
                         global cache open x                              0    .02%
                         global cache s to x                              0    .01%
                         global cache busy                                0    .01%
                         global cache bg acks                             0    .01%
                         global cache open s                              0    .00%
          misc           ges cgs registration                             1    .08%
                         contacting SCN server or SCN lock master         0    .02%
                         reliable message                                 0    .00%
                         refresh controlfile command                      0    .00%

Script Example 1: Response Time Breakdown

The overall system CPU usage for this series of tests was 55% and the insert rate was 2,450 rows per second. A quick look in v$session_wait shows the following waits consuming the most resources during the run:

Event                sid  p1raw             p2
-------------------  ---  ----------------  ----
DFS lock handle       36  0000000053560005  2987
latch free            35  0000000402C024A8    35
latch free            21  000000005800BE20    93

Script Example 3: v$session_wait — interpretation of the p1raw and p2 columns

The 'DFS lock handle' wait is a global enqueue wait. Enqueues exist in both OPS/RAC and non-OPS Oracle environments; in RAC they have a global presence. Enqueues, like the 'DFS lock handle', are locks that a process holds to achieve some goal, and that goal is encoded in the p1raw data. For example, the value 5356 (shown in Script Example 3 above) contained in p1raw is hexadecimal for the letters 'SV'. The Oracle8i Reference Manual (Appendix B: Enqueue and Lock Names) translates 'SV' as 'Sequence Number Value'. The p2 column is the object_id we are waiting for; using this object_id we can easily query dba_objects to find the name of the sequence causing these waits.

The 'latch free' waits show '35' and '93' as their p2 values. According to v$latchname these translate to the 'dlm resource hash list' and 'sequence cache' latches. Latch event waits are recorded when Oracle tries to get a latch and goes to sleep because some other process is holding it. Note that before going to sleep, Oracle will 'spin' on the latch (up to the _spin_count value) to try to acquire it before an expensive (in terms of time) sleep.

If you didn't have a chance to monitor v$session_wait during the run, you can use Adams' latch_sleep script (see the References section of this document for the web site) to get a breakdown of latch names and sleeping behavior.
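The translation from p1raw to an enqueue name, and from p2 to an object or latch name, takes only a couple of dictionary queries. A minimal sketch (the hex literal and the p2 values are taken from the listing above; dba_objects and v$latchname are standard views):

```sql
-- Decode the enqueue type hidden in p1raw: 0x53 = 'S', 0x56 = 'V'
SELECT CHR(TO_NUMBER(SUBSTR('0000000053560005',  9, 2), 'xx')) ||
       CHR(TO_NUMBER(SUBSTR('0000000053560005', 11, 2), 'xx')) AS enqueue_type
  FROM dual;                                   -- returns 'SV'

-- For the SV enqueue, p2 holds the object_id of the sequence being waited for
SELECT owner, object_name, object_type
  FROM dba_objects
 WHERE object_id = 2987;                       -- p2 value from the listing

-- For 'latch free' waits, p2 is the latch#; translate it via v$latchname
SELECT latch#, name FROM v$latchname WHERE latch# IN (35, 93);
```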
So, without even knowing the exact architecture and implementation, everything in our analysis thus far points to sequence-related issues.

sequences and RAC
One of the worst things that can happen to your table is needing a synthetic identifier that uniquely identifies each row. In the past, we might have used a row in some table holding a number that was read and updated all the time. This is a nightmare from a concurrency perspective, because everybody wants to read and update that row at the same time, resulting in long row waits and (potential) application deadlock problems. Unfortunately, this scenario is still a fact of life in many ERP/bookkeeping/general ledger types of applications.

Oracle introduced the sequence as a solution to that situation long ago. Specifying '<sequence name>.nextval' assigns a database-wide unique number. However, there are some problems with this solution if you are not aware of some basic background information.

A difficult scenario occurs when the business requires that the series of numbers cannot be interrupted: if there is a number 7 and a number 5, there must also be a number 6. In this case an Oracle sequence cannot be used, because when a transaction aborts, the number is not rolled back or reused.

An additional challenge occurs if the numbers need to be sequential on a time basis, meaning that a row containing number 8 is always stored after the row with sequence number 7, so that the sequence number defines an ordering in time. In this case the 'order' keyword must be used. Access to the sequence number is then coordinated cluster-wide by the SV enqueue. This affects scalability on a single-node system as well as in a multi-node system: if new ordered sequence numbers are required on multiple nodes, the inter-instance coordination can become a bottleneck.
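The difference between the two behaviors is only a keyword or two in the DDL. A hedged sketch (the sequence names are invented; the options are the ones discussed above):

```sql
-- Cluster-unfriendly: every nextval is coordinated cluster-wide
-- via the SV enqueue, and 'nocache' adds row cache lock waits.
CREATE SEQUENCE order_nr_strict ORDER NOCACHE;

-- RAC-friendly: each instance caches its own range of numbers,
-- at the price of time-ordering and of holes in the number range.
CREATE SEQUENCE order_nr_fast NOORDER CACHE 10000;
```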
The previous script example showed what happens when using the 'order' keyword. Using 'nocache' instead of the default 'cache 20' results in a similar performance degradation; in that case the degradation shows up as 'row cache lock' waits on the dc_sequences object. This is the primary reason to always use 'noorder' and a large 'cache' value whenever the business situation permits. In any case, always do your best to convince your users or designers of the necessity of implementing and using these values properly in order to get the best performance possible. Sequence caching, with cache values at least 10 times higher than the default (20), is always recommended unless the business requirements preclude the use of sequence caches as described above.

There are also drawbacks to consider when using a large 'cache' value, especially with RAC, where every instance maintains its own sequence cache of the size you specify and sequence numbers can be out of order with respect to time. Also, if one node uses more sequence numbers than the other, or if a node is shut down, large "holes" can appear in the sequence number range.

test continuation
After setting noorder and a cache value of 10,000, the tests were run again. At these new settings the CPU is overwhelmed; there is simply too much work for the system to complete. Therefore, the tests were continued using 4 insert processes instead of 20 (inserting 4*1000*100 rows).
This is what the resource consumption looks like after the testing runs to completion:

MAJOR     MINOR          WAIT_EVENT                                 SECONDS     PCT
--------  -------------  ----------------------------------------   -------  ------
CPU time  parsing        n/a                                              1    .71%
          reloads        n/a                                              0    .12%
          execution      n/a                                             88  62.89%
disk I/O  normal I/O     db file sequential read                          1    .37%
          full scans     db file scattered read                           0    .01%
          other I/O      control file heartbeat                           4   3.01%
                         control file sequential read                    0    .01%
waits     PCM locks      global cache cr request                          0    .30%
          other locks    latch free                                      15  11.06%
                         buffer busy waits                                3   2.25%
                         library cache pin                                1    .60%
                         row cache lock                                   0    .34%
                         library cache lock                               0    .04%
                         library cache load lock                          0    .02%
                         buffer deadlock                                  0    .01%
latency   commits        log file sync                                    9   6.49%
          network        SQL*Net more data from client                    0    .32%
                         SQL*Net message to client                        0    .06%
          process ctl    process startup                                  8   5.96%
          global locks   global cache open x                              2   1.64%
                         global cache bg acks                             0    .13%
                         global cache busy                                0    .07%
                         global cache open s                              0    .02%
                         global cache s to x                              0    .02%
          misc           ges cgs registration                             2   1.21%
                         contacting SCN server or SCN lock master         0    .32%

Script Example 4: Resource Consumption After 4 Insert Processes

With a rate of about 13,200 inserted rows per second this looks quite good: 63% CPU busy, 6.5% waiting for commits to finish, and about 14% waiting for mainly library cache latches and some sequence cache latches.

writer/writer concurrency
At this point it is time to use a second instance for inserting data and observe the results. The same test workload mix is used on the other node (node-2). You might expect to get a total of 26,400 (that is, 2 * 13,200) inserts per second. Instead, you get the following:

• Node-1: average 574 inserts per second
• Node-2: average 585 inserts per second

This comes to a little over 1,000 inserts per second, far below the expected level of performance.
This is what resource consumption looks like on test node-1 (results are about the same on test node-2):

MAJOR     MINOR          WAIT_EVENT                       SECONDS     PCT
--------  -------------  -------------------------------  -------  ------
CPU time  parsing        n/a                                    1    .05%
          reloads        n/a                                    0    .01%
          execution      n/a                                   82   5.30%
disk I/O  normal I/O     db file sequential read                1    .04%
          full scans     db file scattered read                 0    .00%
waits     enqueue locks  enqueue                               62   4.04%
          PCM locks      buffer busy due to global cache       57   3.73%
                         global cache cr request                3    .20%
          other locks    buffer busy waits                     87   5.63%
                         latch free                            11    .73%
latency   commits        log file sync                          1    .07%
          network        SQL*Net more data from client          0    .01%
                         SQL*Net message to client              0    .00%
          process ctl    process startup                        9    .58%
          global locks   global cache null to x              1046  67.99%
                         global cache busy                     65   4.24%
                         global cache s to x                   48   3.14%
                         global cache null to s                37   2.40%
                         global cache open x                   20   1.32%
                         global cache bg acks                   0    .01%
                         DFS lock handle                        0    .00%
                         global cache open s                    0    .00%

Script Example 6: Test Node-1 Results

Clearly, the wait event statistics show that most of the wait time is spent waiting for global cache events, and more precisely for 'global cache null to x'. The most probable outcome of this particular event is that a data block is shipped from the holding instance (the one that last modified the block) to the requesting instance. The duration of the wait is determined by the time it takes the request message to reach the instance holding the block, the processing time at the holding instance, and the time until the block reaches the requesting instance.
If the blocks being waited for are "hot" blocks (that is, frequently accessed by all instances in the cluster), the processing time can be increased by:

• The number of active transactions in the particular block
• The number of processes on the waiter list for this buffer at the holding instance
• Whether the changes for this block need to be written to the redo log before the buffer can be shipped

During the run, v$session_wait was queried and consistently showed results like the following:

Event                    sid  p1  p2     p3
-----------------------  ---  --  -----  ----------
global cache null to x    13  69  13355  1.7341E+10
global cache null to x    17  69  14993  1.7343E+10
global cache null to x    20  69  13908  1.7342E+10
global cache null to x    19  69  13378  1.7341E+10

Script Example 7: v$session_wait Statistics

For the 'global cache null to x' event the parameters are defined as follows:

• p1 is the file# (file number)
• p2 is the block# (block number)
• p3 is the global cache element number

Looking up file# 69 in v$datafile shows that this is the file where our customer_id indexes are stored. The block# (p2) changes every time we query, and the same goes for the global cache element number (p3). (In Script Example 7 it looks like p3 is always the same; however, p3raw shows the exact hex numbers, and they are different element numbers.)

In this case two instances want access to the same blocks: both are inserting into the same files, so they have to modify the same index blocks. Even though the index is on a randomly chosen customer_id, it still generates lots of conflicts, especially when starting with an empty table as used in this example. When the table is empty, meaning the index does not have any values yet, contention will be high on the root block of the index, which has to be accessed every time a new customer_id is inserted.
Moreover, once the root block fills up it will split, and the frequency of splits depends on how fast the leaf blocks fill up. If we continue running without truncating the table, a second run shows about 2*1,200 inserts per second, a third run 2*1,400, and a fourth run 2*1,800. In all probability the improvement occurs because the chance that node-1 hits the same block that node-2 "owns" keeps getting smaller, but the tree structure of indexes remains a possible bottleneck. Without the index, the insert rate is much higher.

By using a coarse-grained coherence model with data block contiguity (multiple adjacent blocks hash to the same global cache element, for example gc_files_to_locks with a !255 blocking factor), bumping the segment high water mark (_bump_highwater_mark_count) by this value, and/or pre-allocating extents, a very good insert rate can be achieved (see the Oracle RAC Deployment and Performance manual).

It would be nice if we could have multiple indexes on the same table and the same columns, ideally one for each node. This is possible if the table is partitioned (refer to the Oracle Concepts and Administrator manuals). In Oracle it is possible to build local indexes per partition. Each index covers one partition, so if we partition the table by node_id/instance_id and put a local index on customer_id, we end up with multiple indexes: one index partition per database instance. Even without adding a column for an instance id, Oracle range or hash partitioning can be used efficiently, avoiding an application code change. Using the partitioning option and local indexes for hot data will reduce the degree and rate of concurrent access by spreading the contention over multiple intra-table segments. A partitioned table basically looks like a table consisting of multiple tables.
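The partition-per-instance idea can be sketched in DDL as follows. This is an illustration under stated assumptions, not the paper's actual script: the table, tablespace, and index names are invented, and LIST partitioning as shown requires Oracle9i:

```sql
-- One partition per instance; each instance inserts only into its own
-- partition (node_id = inserting instance), so index contention stays local.
CREATE TABLE orders (
  order_id    NUMBER,
  node_id     NUMBER,          -- filled in with the inserting instance's id
  customer_id NUMBER,
  filler      VARCHAR2(50)
)
PARTITION BY LIST (node_id) (
  PARTITION orders_n1 VALUES (1) TABLESPACE orders_ts1,
  PARTITION orders_n2 VALUES (2) TABLESPACE orders_ts2
);

-- A LOCAL index is partitioned the same way: one index tree per partition,
-- so each instance maintains its own customer_id index segment.
CREATE INDEX orders_cust_idx ON orders (customer_id) LOCAL;

-- Alternative without an extra column, as the text notes: hash partitioning.
-- CREATE TABLE ... PARTITION BY HASH (customer_id) PARTITIONS 4;
```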
Every partition is in its own segment, in its own tablespace or data file, with its own segment header (where freelists and other housekeeping data are located). This makes partitions work very well with RAC. However, just like RAC, partitioning is an extra option that must be purchased in order to be used.

test continuation
After partitioning, and making sure every instance inserts into its own partition, the test results look like this: two (nodes) * 12,000 inserts per second = 24,000 inserts per second, which is what we had hoped to achieve in the first place.

The resource consumption under these test conditions was as follows:

MAJOR     MINOR          WAIT_EVENT                       SECONDS     PCT
--------  -------------  -------------------------------  -------  ------
CPU time  parsing        n/a                                    1    .75%
          execution      n/a                                   86  56.14%
disk I/O  normal I/O     db file sequential read                1    .37%
waits     DBWn writes    checkpoint completed                   0    .03%
          enqueue locks  enqueue                                5   3.30%
          PCM locks      global cache cr request                1    .35%
                         buffer busy due to global cache        0    .03%
          other locks    latch free                            18  11.86%
                         buffer busy waits                      3   2.17%
                         library cache pin                      2   1.25%
latency   commits        log file sync                         10   6.75%
          network        SQL*Net more data from client          0    .28%
                         SQL*Net message to client              0    .03%
          process ctl    process startup                        9   5.95%
          global locks   global cache open x                    5   3.17%
                         global cache open s                    0    .12%
                         global cache bg acks                   0    .11%
                         global cache null to x                 0    .09%
                         global cache s to x                    0    .02%
                         DFS lock handle                        0    .01%

Script Example 8: Resource Consumption

Note that the global cache coherence times have now vanished, except for the 'open' events, which are justifiable. What is left is mainly latch free, commit-related 'log file sync', and CPU execution time (latch time is still library cache and sequence latches).

reader/writer concurrency
Now let's see what happens when we combine queries and inserts on the same data. Node-1 still inserts at maximum speed, which should be about 12,500 inserts per second. On node-2 a small Pro*C program runs that performs a union between the two tables.
From a very large, separate 'OrderHistory' table it selects 10-20 rows randomly using an index; the size of the table forces these random reads to be physical reads. It also selects all the records for a randomly chosen customer_id, using the customer_id index on the order table. The query loops without pausing. This causes node-2 to read data that is being inserted on node-1, that is, reader/writer concurrency. Running 4 inserters on node-1 and 25 query clients on node-2 produced the following resource consumption:

MAJOR     MINOR          WAIT_EVENT                       SECONDS     PCT
--------  -------------  -------------------------------  -------  ------
CPU time  execution      n/a                                 1132  57.51%
disk I/O  normal I/O     db file sequential read                5    .25%
waits     enqueue locks  enqueue                               24   1.22%
          PCM locks      buffer busy due to global cache       15    .74%
                         global cache cr request                1    .04%
          other locks    latch free                           332  16.88%
                         buffer busy waits                     15    .75%
                         library cache pin                      2    .11%
latency   commits        log file sync                        155   7.89%
          network        SQL*Net more data from client          5    .28%
                         SQL*Net message to client              1    .05%
          process ctl    process startup                        9    .46%
          global locks   global cache open x                   91   4.63%
                         global cache s to x                   90   4.56%
                         global cache busy                     31   1.57%
                         global cache null to x                22   1.14%
                         global cache null to s                22   1.13%

Script Example 9: Four Inserters on Node-1 and 25 Query Clients on Node-2

CPU and latch percentages appear to be about the same. Latch time is a little higher, but the main consumers are still the cache buffers chains, library cache, and sequence latches. In addition, quite a few RAC-related latch waits are now registered. You can refer to v$latch/v$latch_children to see which latches are being slept on. Although there is also a 'wait_time' column in v$latch, it always seemed to be 0 on this platform and version [2].

About 12% of the wait time is consumed by 'global cache' events. This wait time is caused by the insert processes wanting exclusive access to database blocks for inserting data.
Hence the 'opens' and 'converts' from 's' (shared) to 'x' (exclusive). Oracle RAC maintains the global state of each data block in the Global Cache Service; the instance that maintains the global state of a block is known as its master. Depending on its address, a block is mastered on a particular node, and the resource master is determined when the block is accessed for the first time. With two instances, about half of the resources are maintained on the other instance. Depending on where the block is mastered and on the requested access mode, messages may or may not need to be sent.

By looking at v$sesstat for the inserting sessions, and dividing 'global cache get time' by 'global cache gets', you can calculate that it takes about 0.27 milliseconds to get a locally mastered block. To calculate the time it takes to get a remotely mastered block, use the same division but take the gets registered for the LMSx process (in Oracle8i: LMD0), which handles the lock when it is maintained on another node; this was observed to be around 1.3 milliseconds on a relatively idle system. Note that starting with Oracle 9.0.1.3 and all newer versions, the grant/ast goes directly to the Oracle foreground (client) process. This should reduce CPU and latency, so the calculations above will probably no longer hold.

With 4 inserters on node-1 and the 25 query clients on node-2, the insert speed for this test drops to around 9,900 inserts per second.
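The 0.27 ms figure above is straightforward arithmetic on v$sesstat. A hedged sketch of that calculation (the statistic names are the ones the text cites; restrict the sids to your insert sessions, or to the LMSx/LMD0 process for the remote case):

```sql
-- Average time (in ms) to get a globally managed block, per session.
-- 'global cache get time' is recorded in centiseconds, hence the *10.
SELECT s.sid,
       10 * MAX(DECODE(n.name, 'global cache get time', s.value)) /
       NULLIF(MAX(DECODE(n.name, 'global cache gets', s.value)), 0) AS avg_get_ms
  FROM v$sesstat s, v$statname n
 WHERE s.statistic# = n.statistic#
   AND n.name IN ('global cache gets', 'global cache get time')
 GROUP BY s.sid;
```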
On node-2 the resource consumption statistics now look like this:

MAJOR     MINOR          WAIT_EVENT                       SECONDS     PCT
--------  -------------  -------------------------------  -------  ------
CPU time  execution      n/a                                  833   6.44%
disk I/O  normal I/O     db file sequential read             9665  74.71%
waits     enqueue locks  enqueue                                8    .07%
          PCM locks      global cache cr request             2298  17.76%
                         buffer busy due to global cache        8    .06%
          other locks    latch free                            97    .75%
                         global cache freelist wait             4    .03%
latency   network        SQL*Net message to client              1    .01%
          process ctl    process startup                        5    .04%
          global locks   global cache s to x                    2    .01%
                         global cache bg acks                   0    .00%
                         global cache busy                      0    .00%
                         DFS lock handle                        0    .00%
                         global cache open s                    0    .00%
                         global cache null to x                 0    .00%
                         global cache open x                    0    .00%

Script Example 10: Resource Consumption Statistics

The queries hardly use any CPU time now: they are busy getting database blocks, either through physical reads (that is, random or db file sequential reads) or through the global cache. The requested consistent read blocks are served by the block server process, LMSx (in Oracle8i: BSPx). This process reads the block from the buffer cache and, if necessary (and available in the cache), applies undo data to build a consistent read version before the block is 'shipped' to the other node.

Starting with Oracle9i, Oracle can ship and maintain locks for 'current' blocks. This eliminates the need to write database blocks back to disk in order for another instance to read and modify them. This new facility is called Cache Fusion and, while it was designed to address scalability problems, it still consumes system time and resources. This process is explained and discussed later in this paper.

[2] Subsequently discovered that this problem is due to a port-specific bug.
By sampling v$sysstat and v$system_event at intervals during the test run, we can observe node-2 doing about 220 selects per second (the 'execs' column) while waiting on 1,500 physical reads per second (the 'dbpr' column), taking an average of 11.8 milliseconds per read, while 1,120 global cache consistent read requests per second were being waited for, averaging about 4.3 milliseconds per request. A total of about 340 consistent read blocks per second are received (the 'gc bl rec' column). Note, however, that when waiting for a cr request it is not known whether a block will be received or permission to read from disk will be granted.

time      execs    ms/rd     dbpr  ms/req  gc cr rds [3]  gc bl rec
--------  ------  ------  -------  ------  -------------  ---------
09:28:07  212.63   11.87  1549.21    4.12        1144.59     337.17
09:28:11  223.36   11.94  1520.47    4.27        1124.93     342.06
09:28:14  224.87   11.75  1515.04    4.68        1130.53     357.55
09:28:18  230.65   11.88  1516.41    4.36        1128.91     340.42
09:28:22  216.75   11.88  1523.56    4.43        1108.14     342.93
09:28:26  227.03   11.82  1540.16    4.21        1136.75     336.65

Script Example 11: v$sysstat and v$system_event Statistics

The number of 'global cache cr request' waits is much higher than the number of 'cr blocks received'. Not all global cache consistent block reads result in blocks being shipped from the other node; many can be read by the local instance, but the instance still needs to request every block it reads 'globally'. As we will see later, in the 'gc_files_to_locks' section, much depends on the global coherence scheme being utilized.

test continuation
Now, we will add 25 more query clients on node-2, for a total of 50 query clients.
Node-1 resource usage now looks like this after re-running the test:

MAJOR     MINOR          WAIT_EVENT                        SECONDS     PCT
--------  -------------  --------------------------------  -------  ------
CPU time  execution      n/a                                  2176  41.54%
disk I/O  normal I/O     db file sequential read                 5    .09%
waits     enqueue locks  enqueue                               169   3.23%
          PCM locks      buffer busy due to global cache        99   1.89%
                         global cache cr request                 1    .01%
          other locks    latch free                            411   7.85%
                         buffer busy waits                      36    .68%
                         buffer deadlock                         9    .17%
latency   commits        log file sync                         274   5.22%
          global locks   global cache s to x                   952  18.18%
                         global cache open x                   510   9.73%
                         global cache null to x                374   7.15%
                         global cache null to s                126   2.40%
                         global cache busy                      55   1.05%

Script Example 12: Node-1 Resource Usage for 50 Query Clients

3 v$sysstat and v$system_event statistics for 'gc cr rds'

After increasing the number of query clients to 50 on Node-2 and re-running the test, we are now achieving about 6,500 inserts per second.

MAJOR     MINOR          WAIT_EVENT                        SECONDS     PCT
--------  -------------  --------------------------------  -------  ------
CPU time  execution      n/a                                  2806   4.43%
disk I/O  normal I/O     db file sequential read             34610  54.67%
waits     enqueue locks  enqueue                                12    .02%
          PCM locks      global cache cr request             23565  37.22%
                         buffer busy due to global cache       152    .24%
          other locks    latch free                           2043   3.23%
                         buffer busy waits                      41    .07%
                         global cache freelist wait             23    .04%
latency   global locks   global cache s to x                    16    .03%
                         global cache busy                       6    .01%
                         global cache bg acks                    0    .00%

Script Example 13: Node-2 Resource Usage

The average global cache request time for all processes on Node-2 is about 7 milliseconds, while the cr block receive time is 9 milliseconds. This results in a query rate of 306 queries per second with an average response time of 0.18 seconds.
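Per-interval figures such as those in Script Examples 11 and 14 are obtained by sampling cumulative counters and differencing successive samples: the deltas divided by the interval give per-second rates, and time deltas divided by count deltas give average latencies. A hedged sketch of the underlying queries (statistic names as in Oracle9i; times are in centiseconds):

```sql
-- Cumulative counters behind the interval reports (Oracle9i names).
-- Sample both queries every few seconds and difference successive values.
select name, value
  from v$sysstat
 where name in ('execute count',
                'physical reads',
                'global cache cr blocks received',
                'global cache cr block receive time');

select event, total_waits, time_waited   -- TIME_WAITED in centiseconds
  from v$system_event
 where event in ('db file sequential read',
                 'global cache cr request');
```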
time       execs     dbpr   (ms)  gc cr rds   (ms)  gc bl rec
02:25:01  320.00  1967.20  13.69    1485.29  12.12     508.33
02:25:05  295.24  2012.93  13.57    1533.25  12.10     500.97
02:25:09  306.55  1961.65  13.37    1574.52  12.47     490.27
02:25:14  311.21  2002.86  13.42    1467.49  11.74     495.15
02:25:18  311.00  1969.27  13.21    1559.29  11.69     484.62
02:25:22  302.49  2022.17  13.28    1481.26  11.99     477.00
02:25:26  301.63  1986.34  13.49    1561.28  11.70     482.33
02:25:31  312.56  1998.83  13.49    1495.23  12.24     494.24
02:25:35  294.47  1980.05  13.32    1520.49  11.90     489.34
02:25:39  307.09  1956.79  13.45    1534.44  11.93     491.67

Script Example 14: v$sysstat and v$system_event Statistics

Adding 25 more clients, for a total of 75 query clients, and re-running the test results in the following resource usage:

MAJOR     MINOR          WAIT_EVENT                        SECONDS     PCT
--------  -------------  --------------------------------  -------  ------
CPU time  execution      n/a                                   570  26.52%
disk I/O  normal I/O     db file sequential read                 5    .22%
waits     enqueue locks  enqueue                                74   3.42%
          PCM locks      buffer busy due to global cache        68   3.15%
                         global cache cr request                 1    .04%
          other locks    latch free                             82   3.82%
                         buffer busy waits                      36   1.66%
latency   commits        log file sync                          48   2.25%
          global locks   global cache s to x                   609  28.33%
                         global cache null to x                316  14.71%
                         global cache open x                   203   9.46%
                         global cache null to s                 59   2.73%
                         global cache busy                      52   2.43%
                         global cache open s                     1    .05%

Script Example 15: Resource Usage on Node-1 Running 75 Query Clients on Node-2

The system (Node-1) is now doing 4,050 inserts per second, and resource usage for Node-2 is as follows:

MAJOR     MINOR          WAIT_EVENT                        SECONDS     PCT
--------  -------------  --------------------------------  -------  ------
CPU time  execution      n/a                                  1109   3.25%
disk I/O  normal I/O     db file sequential read             15221  44.68%
waits     enqueue locks  enqueue                                 7    .02%
          PCM locks      global cache cr request             16262  47.73%
                         buffer busy due to global cache       270    .79%
          other locks    latch free                           1063   3.12%
                         buffer busy waits                      60    .18%
                         global cache freelist wait             24    .07%
latency   global locks   global cache s to x                    14    .04%
                         global cache busy                       9    .03%
          misc           cr request retry                        4    .01%

Script Example 16: Node-2 Resource Usage

The average global cache wait time for all processes is about 13 milliseconds, with the cr block receive time averaging 13.3 milliseconds. The query rate is 324 queries per second, with an average response time of 0.25 seconds. The v$sysstat and v$system_event statistics are:

time       execs     dbpr   (ms)  gc cr rds   (ms)  gc bl rec
10:11:59  320.12  2130.22  13.92    1539.46  22.98     399.08
10:12:04  323.29  2096.26  14.43    1572.85  22.03     390.93
10:12:09  319.03  2115.86  14.02    1418.81  24.33     418.43
10:12:14  332.63  2050.21  14.18    1402.63  23.69     400.00
10:12:18  321.23  2058.60  13.89    1506.18  21.11     414.77
10:12:23  324.94  2172.81  14.19    1528.72  23.98     431.03

Script Example 17: v$sysstat and v$system_event Statistics

The number of waits for physical blocks and global cache cr blocks is about the same as in the previous 50-client run, but the wait time for the global cache cr blocks nearly doubles. Note also that the number of cr blocks received is at the same level as before. The resource breakdown shows that almost 50% of the time spent waiting is spent on those cr requests. Also notice that the global cache wait ratio rises to about 50% on Node-1, the insert node.

The following is a summary of results for all of the tests:

                               Number of Query Clients
                                   25      50      75
inserts/sec                      9850    6940    4050
queries/sec                       211     306     324
query response time (seconds)    0.13    0.18    0.25
cr blocks recvd/sec               340     490     400
#physical reads/sec              1550    2000    2000
readtime in ms                   11.5    13.5      14
gc cr requests                   1100    1500    1500
insert node:
  gc get time in ms              1.58    4.79    9.13
  gc convert time in ms           2.8    8.57   16.51
query node:
  gc get time in ms              2.13    7.02      13
  crbl rec time ms               3.26    9.02   13.33
Query node CPU busy %               *      95     100

*No data available.
Table 1: Performance Summary of Query Clients

It is easy to see why the global cache latency times go up: as described earlier, with a two-node cluster roughly half of the blocks are acquired on the local node, with the other half acquired from the other node through the interconnect. The times referenced in Table 1 are calculated using v$sysstat. If you use v$sesstat and look at the LMSx global lock statistics, you can see much higher numbers (>20 milliseconds) for the remotely mastered blocks.4

Rather than running into an interconnect bottleneck here, it appears that the inserting node loses speed mainly because the global cache coherence management takes too long, and the query node slows down for the same reason. In addition, the slowdown is also due to the block shipping taking longer to complete. The interconnect used on this cluster is a 100 megabyte per second, low latency, high speed interconnect. Instead of the default UDP protocol, the instances were configured to use the Reliable Datagram (RDG) protocol, a platform-specific option. In previous tests, RDG was proven to be the faster interconnect transport.

However, what was missing from this test was the CPU usage statistics: it appears that running 50 query clients had almost saturated the CPUs. We therefore built a new cluster with two 4-CPU 1 GHz nodes instead of the two 4-CPU 500 MHz nodes, twice as fast in terms of raw CPU power. When the tests were re-run, this configuration was able to insert almost twice the number of rows while performing almost twice the number of queries per second. With 4 insert processes and 25 query processes, the resource usage ratios stayed about the same.

4 Commencing with version 9.2, all ast messages are sent directly back to the requestor instead of going through LMS; therefore, the session statistics for LMS will no longer be collected.
By sending ast messages directly to the requesting foreground processes, context switches can be reduced, resulting in decreased demands on CPU resources and a reduction in latency to the consumer.

Adding users, going from 25 to 100 query clients, also resulted in different resource usage ratios, which are as follows:

                               Number of Query Clients
                                   25      50      75     100
inserts/sec                     19990   18460   18014   16890
queries/sec                       410     580     620     631
query response time in sec       0.06    0.09    0.12    0.16
cr blocks recvd/sec               700     890     900     900
#physical reads/sec              2250    3200    3200    3200
readtime in ms                      9      12      20      28
gc cr requests                   1800    2500    2500    2500
insert node:
  gc get time in ms              0.58    0.93    1.02     2.9
  gc convert time in ms          1.13    1.77    1.86    1.87
query node:
  gc get time in ms              0.88    1.59    1.74    1.74
  crbl rec time ms               1.27     2.1    2.26    2.24
Query node CPU busy %              55      80      80      80

Table 2: Test Re-run Resource Usage Ratios

The insert rate is fairly stable now, but the query rate levels off at around 50 query clients, and CPU usage does not rise further as more users are added. An interesting result is that although the number of physical reads is not increasing, the read times increase dramatically when more queries are added, which is an indication of an I/O bottleneck: under these conditions, the I/O subsystem appears unable to handle more than 3,200 random reads per second. The global cache and cr block timings also look quite healthy compared to the numbers achieved previously. Note the 900 cr blocks being shipped per second, almost 100% more blocks per second than achieved in any of the previous runs. With the higher number of users, the resource usage ratios for the inserting node were stable and about the same as in the earlier 25-query-client test. For the query node, the 'db file sequential read' ratio continued to climb.
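The rising read times can also be confirmed per data file from v$filestat, which records cumulative read counts and read times. A sketch (READTIM is in centiseconds, hence the factor of 10 to convert to milliseconds):

```sql
-- Average single-block read time per data file, in milliseconds (sketch).
select file#,
       phyrds,
       round(readtim * 10 / greatest(phyrds, 1), 2) avg_read_ms
  from v$filestat
 order by phyrds desc;
```

Sampling this at intervals, as with v$sysstat, shows whether latencies climb while the read rate stays flat, the signature of a saturated I/O subsystem.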
To prove that the limiting factor for the insert rate on Node-1 was the CPU usage on Node-2 and not the interconnect speed, the query used by the query clients was changed to do less physical I/O. This made them do more queries per second and thus placed a much higher load on the GCS processes and on cr block shipping. The results of that test are as follows:

                               Number of Query Clients
                                   50      75
inserts/sec                      9500    7200
queries/sec                      1517    1546
query response time in sec       0.03    0.05
cr blocks recvd/sec              1700    1700
#physical reads/sec              2500    2500
readtime in ms                    8.5       9
gc cr requests                   3100    3200
insert node:
  gc get time in ms              2.35     2.3
  gc convert time in ms          4.03     7.9
query node:
  gc get time in ms              3.55    7.32
  crbl rec time ms               3.72    4.91
Query node CPU busy %              90     100

Table 3: Query Resource Usage Ratios

This change results in a stable 2,500 physical reads per second and 1,700 cr blocks shipped per second. The test was designed specifically to generate more cr block traffic over the interconnect. The results point to the conclusion that the slowdown in the previous tests was not caused by a slow interconnect but was a direct result of the higher CPU usage, due to higher interconnect usage and more processing overhead for block shipping and global cache management. The insert rate has now dropped to 7,200 inserts per second. After initiating a few more CPU-bound queries on the query node, doing millions of logical I/Os with no physical I/O, the insert rate drops dramatically to just a few hundred per second.

conclusions

• When high CPU usage occurs in an Oracle RAC environment, it can significantly slow down global cache access speed and thus slow down the processes that depend on it, even if those processes run on another node in the cluster. Further testing is needed to determine the impact of manually giving higher CPU priority to key RAC processes.
• Even without concurrent processing on a system, the use of RAC will impose some additional CPU overhead.

• Simply applying RAC to an existing application that is known to frequently access certain data concurrently can cause performance problems. In these tests, we found a degradation from 12,500 inserts per second to a few hundred per second. The test results highlight the importance of considering application and/or data partitioning when deploying RAC: due care should be taken to understand how the data is actually used by the application(s) before implementing RAC.

• Although it is possible to run without 'gc_files_to_locks', it can still be beneficial to use this setting because requests for global cache management become less frequent. It should be used on data files that are known to be mostly read, or on data files with little and predictable write/write or read/write access concurrency. Its use has to be carefully considered because the 9i Cache Fusion protocol is turned off for such files, and there is a danger of severe performance degradation due to "false pinging". It should also be mentioned that coarse-grained global cache management ("gc_files_to_locks") is harder to administer.5

• These tests were run in a two-node cluster. Adding more nodes will certainly make things more complex and will probably consume even more CPU resources. When nodes and remote users are added and the shared working set does not scale correspondingly, there is a chance that locality of access is diminished, that is, the local buffer cache hit efficiency can be lower; hence more messages are sent and more buffers are received from remote instances.

• Consider the load that Cache Fusion will impose on the interconnects before adding multiple nodes.

• Since these tests were done with 9.0.1, the Oracle RAC kernel has been optimized to reduce overhead for certain operations.
The RDBMS release 9.2.0.1 incorporates many of these optimizations. However, this does not make the findings and tuning recommendations described here obsolete: as in a non-clustered system, minimizing contention and improving locality of access are still valid performance tuning goals. Usage of RAC in a clustered or complex environment will require due care and planning before implementation in order to get the best performance possible.

5 This conclusion is based on test information referenced in the Appendix section of this document.

appendix a
GC_FILES_TO_LOCKS, or coarse-grained global cache coherence management

All the tests in this paper were performed without setting gc_files_to_locks, so clearly 9iRAC works quite well without it. But things can be done faster, using less CPU, and as we have seen in this paper, high CPU usage on one node can have a serious impact on the performance of the other node. One of the more challenging aspects of working with Oracle8 OPS was finding the correct setting for the gc_files_to_locks parameter. There is a lot to be said about this parameter, and before using it you should fully understand its overall impact on Oracle performance. See the References section at the end of this paper for a few books that can help you understand and use this parameter setting; the Oracle OPS/RAC manuals also do a very good job of explaining it. Starting with Oracle9, the RAC Administration Guide states: "Oracle automatically controls resource assignments so gc locks are not needed" (fortunately, the RAC Deployment and Performance manual appendix gives a good explanation of when and how you should use gc_files_to_locks).
For example, this is what happens when only reads are performed on Node-2 and Node-1 is not being used:

time       execs     dbpr   (ms)  gc cr rds   (ms)  gc bl rec
11:49:40  340.88  2413.53  15.54    1156.09  10.13       0.00
11:49:44  348.87  2528.54  15.80    1231.43   9.05       0.00
11:49:48  360.78  2379.44  15.98    1232.71  11.37       0.22
11:49:53  357.78  2758.66  16.55    1285.69  15.23       0.00
11:49:58  361.25  2167.29  15.80    1189.82  16.64       0.00
11:50:03  365.85  2455.12  15.99    1227.81  18.89       0.00
11:50:08  382.98  2513.92  15.80    1261.37  22.37       0.00

Script Example 18: Using Node-2 Only--Read Statistics

Node-2 is reading from disk as fast as it can, but is still hindered by the I/O bottleneck, and we observe no cr blocks coming over the interconnect. Processor statistics ('ps') show about 10-15% overall CPU usage on the machine being used by the GCS processes. The idle node also shows a 15% CPU utilization rate for global cache management overhead.

You must realize that for every block read, a lock must be acquired. As we have seen in previous measurements, a lock get can take from about 0.01 milliseconds locally up to 1 millisecond if the lock resides on another node. Note that this 1 millisecond was measured during low usage; it can be 10 milliseconds or more if the system is very busy. So, with roughly half the locks mastered remotely, getting 50 blocks can take anywhere between 25 and 250 milliseconds or more just to acquire the locks. As you can see, it pays to think about your coherence scheme.

In this test case we chose to use only one lock per data file for the read-only data, and to use range locks (2,200 blocks per lock) for the insert data files. The indexes were stored in separate files using a 1:1 locking scheme. This method showed a 15-20% higher insert speed, mainly because a lock only has to be taken for each 2,200 blocks of data instead of for every block. This is even more true for queries: after acquiring one lock per read-only data file, no further lock gets are necessary for those files at all, saving a lock get for every block visited.
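The locking scheme described above could be expressed with an init.ora setting along the following lines. This is a sketch with invented file numbers, not the setting used in the tests; the general syntax is file_list=lock_count[!blocks][R][EACH], where a lock count of 0 means fine-grained (one releasable lock per block):

```ini
# Hypothetical init.ora fragment -- the file numbers are invented for
# illustration (map real ones via DBA_DATA_FILES.FILE_ID):
#   files 4-5  read-only data  -> one lock covering each whole file
#   files 6-7  insert data     -> fixed locks, each covering a 2,200-block range
#   file  8    index data      -> 0 = fine grain, one releasable lock per block
gc_files_to_locks = "4-5=1EACH:6-7=100!2200EACH:8=0"
```

The lock count for the range-locked files (100 here) must be sized against the file: with too few locks, the 2,200-block ranges wrap around and unrelated ranges share a lock.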
Note, however, that if processes from remote nodes are reading or updating data files that are subject to the coarse-grained cache management policy, chances are high that performance will be severely impacted.

I would like to finish with an OPS guru's (Anjo Kolk) recommendations:

Usage:   Low            High
Read     =3EACH         =3EACH
Write    =1000(R)EACH   =0

Refer to the OPS or RAC manuals for the appropriate GC_FILES_TO_LOCKS syntax.

references

Oracle8i Internal Services, by Steve Adams, published by O'Reilly. A must-read for everybody involved in Oracle performance tuning. He also has a good website and newsletter.

Scaling Oracle8i, by James Morle, published by Addison-Wesley. Not referenced in this performance report, but the best book available on general Oracle design and performance issues; it can also be used for OPS.

Oracle Parallel Processing, by Tushar Mahapatra and Sanjay Mishra, published by O'Reilly. Combines OPS with Parallel Query material; worth buying and reading as a good introduction to the OPS and PQ features.

And don't forget the official Oracle manuals, which do a very good job, especially the RAC Deployment and Performance manual and the Oracle Performance Tuning manual.

www.oraperf.com -- Anjo Kolk's site: the papers, YAPP, plus the ability to upload your statspack or utlstat report and get performance advice.

www.ixora.com.au -- Almost everything at this site is worth reading.

www.hotsos.com -- Oracle performance and problem diagnosis, including a paper on using wait events for performance tuning as well as other good material. Try to visit their seminar.