hp partnering solutions
october 2002
technical white paper
Concurrency in an Oracle 9iR1 RAC Environment on HP Tru64 UNIX
summary
This document provides an informal account of CPU usage by RAC when performing certain operations in a clustered
database environment.
It also presents information that can be used to optimize performance under those conditions, describing the changes
made to the settings and parameters of the operating system and the Oracle 9iR1 RAC software, and their effects on
performance, in sequence as each change was implemented.
The goal of the testing was to simulate and study the performance of bulk inserts in order to derive easy methods for
optimizing the throughput of this database usage pattern in a RAC environment. Writing medium or small-sized records
at a high rate into a database is a common activity and characteristic of billing or workflow applications. This type of
workload is also resource-intensive and prone to causing contention in non-clustered as well as clustered database
systems.
test results summary
The introduction of Oracle 9iR1 RAC has made it easier to get good performance out of the box. However, using RAC
in a clustered or complex environment still requires due care and planning before implementation in order to get the
best performance possible.
A summary of the test findings and issues is as follows:
• High CPU usage can significantly slow down global cache access and the processes that depend on it, even if the
process runs on another node in the cluster.
• The use of RAC imposes some additional CPU overhead even when not performing concurrent processing.
• Consider application and/or data partitioning when deploying RAC.
• The 'gc_files_to_locks' setting should be carefully considered and managed when used on data files that are
read-only or on data files with low and predictable write/write or read/write access concurrency.
• The addition of more nodes makes things more complex and results in the use of even more CPU resources. When
nodes and remote users are added and the shared working set does not scale correspondingly, local buffer cache
hit efficiency can drop due to an increase in the number of messages sent and buffers received from remote
instances.
• Consider the load that Cache Fusion will impose on the interconnect before adding multiple nodes.
for more information
Contact the author by electronic mail at Mario.Broodbakker@hp.com.
A list of publications related to this document is also included in the References section.
trademark and date information
Oracle, Oracle 9i, and Oracle 9iR1 RAC are trademarks and/or registered trademarks of Oracle Corporation.
Copyright ©2002 Compaq Information Technologies Group, L.P and Hewlett-Packard Company. All rights reserved. Printed in the U.S.A.
This performance report was prepared by the System and Applications Integrated Engineering Group of Hewlett-Packard Company.
All brand names are trademarks of their respective owners.
Technical information in this document is subject to change without notice.
© Copyright Hewlett-Packard Company 2002
10/02
introduction
This paper reports on testing of the effects of writer/writer concurrency and reader/writer concurrency in a RAC
environment. This paper is not intended as an introduction to OPS or to the RAC architecture. It assumes that the reader
has an understanding and working knowledge of the components involved in RAC processing. However, some
concepts will be described in detail when necessary and references will be made to existing books or whitepapers
where applicable.
Most of these tests were performed on a 2-node AlphaServer ES40 cluster (4 CPUs at 500 MHz each) running Oracle
9.0.1.2. Additional tests were also run on a 2-node AlphaServer ES45 cluster (4 CPUs at 1 GHz each) running Oracle
9.0.1.3 RAC [1].
Note that these tests were produced in a lab situation using a synthetic workload, with the test goal and workload
characteristics as described in the summary above.
Testing consisted of performing the following key operations:
• Concurrent inserts from two nodes using a randomly generated key
• Query clients joining just-inserted data with non-cached data read from disk
Carefully review the applicability and possible effects on your own workload before implementing any of the changes or
recommendations that were made based on these results.
In general RAC does a great job, far superior to its predecessor OPS, of making clustered solutions work efficiently and
of obtaining good scaling with minimal database administrative attention. To maintain consistency across nodes, a
clustered solution involves cluster-wide global data consistency and coherence management. While the CPU overhead
related to this type of activity is normally reasonable, there are circumstances where that cost becomes relatively high.
In such cases, it may be worthwhile to consider application or data partitioning and/or coarse-grained global cache
coherence management to achieve optimal performance. This paper discusses such a special case, its effect on CPU
utilization, and suggests a solution through specialized tuning techniques to improve on 'out of the box' scaling
capabilities.
When doing performance tests to achieve high insert rates, many factors were found to influence the speed at which
data could be inserted into the database. Almost every issue discovered was found and fixed using the event-based
wait statistics and information found in a few books, including Steve Adams' Oracle8i Internal Services and the OPS
and RAC manuals.
How and why to use the session waits and system event statistics is very well described in the Yapp paper by Anjo
Kolk and Shari Yamaguchi, and in presentations by Cary Millsap and friends. This information can be found at the
following web sites: www.oraperf.com and www.hotsos.com.
Basically the Yapp method says: I'm either working or waiting. When I work, the time shows up as 'CPU used by this
session' in v$sesstat (after joining with v$statname). Otherwise I wait, and what I am waiting for is shown in the 'event'
column of v$session_wait, along with the details of what I am waiting on in the parameter columns (p1(raw) to p3(raw)).
Yapp also says not to bother looking for what you are waiting for if most of what you are doing is consuming CPU. This
does not mean, by the way, that there is no reason to investigate further why you are using a large amount of CPU
time.
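As a sketch, the working/waiting check described above amounts to a pair of queries against the standard dynamic performance views:

```sql
-- "Working": CPU consumed so far by each session
SELECT ss.sid, ss.value AS cpu_centiseconds
FROM   v$sesstat ss, v$statname sn
WHERE  ss.statistic# = sn.statistic#
AND    sn.name = 'CPU used by this session';

-- "Waiting": what each session is waiting on right now
-- (wait_time = 0 means the session is currently in the wait)
SELECT sid, event, p1raw, p2, p3
FROM   v$session_wait
WHERE  wait_time = 0;
```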
[1] Many of the tests, with only slight variations, were also performed on Oracle 8i (8.1.7) OPS.
To get an indication of the resource consumption since database startup, get the script resource_waits.sql from Steve
Adams’ web site at www.ixora.com.au. This will give you information on resource usage on your system. The script is
referred to and used as an example throughout this paper.
Many items contained in the first part of this paper are also applicable to non-RAC situations and have a lot to do with
storage options and sequences. The tests begin by focusing on inserts and later move on to issues associated with
(concurrent) query performance. The influence of updates, modifications, and triggers in a RAC environment is not
discussed. In addition, there are many simple, documented recommendations that can be used to design tables and
indexes for high insert efficiency. These techniques include options such as freelist groups or automatic space
management, multiple freelists, and locally managed tablespaces, and are well documented in the Oracle manuals. All
of these options and techniques were used during testing.
test methodology
The same database and test program were used for all tests: one order table that receives all the inserts, consisting of
small 65-byte records, with a single index on customer_id. The customer_id was generated as a random number
between 1 and 11 million. In addition, a unique order number was generated using an Oracle ordered sequence.
Before each test the order table was truncated, and the instances were restarted between test runs for consistency.
some non-RAC notes on the configuration
Locally managed tablespaces were used for these tests (see the Oracle Administrator's manual for an explanation).
Using dictionary-managed tablespaces in previous Oracle8i tests showed lots of 'ST' enqueue waits. Starting with
Oracle9i, the default tablespace type is 'locally managed'. Bitmapped segments were not used but are planned for
later tests.
On the order table, freelist groups = 2 and freelists = 10 were used. A spin_count of 4,096 was used for both
instances.
Oracle9i’s ‘automatic undo’ was configured as one undo tablespace/file per instance.
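As an illustration, the storage settings above might look like this in DDL (the table, index, and tablespace names are hypothetical; the actual test scripts are not reproduced here):

```sql
CREATE TABLE orders (
  order_id    NUMBER,
  customer_id NUMBER,
  filler      CHAR(40)
)
TABLESPACE orders_ts                          -- a locally managed tablespace
STORAGE (FREELIST GROUPS 2 FREELISTS 10);     -- settings used in these tests

CREATE INDEX orders_cust_idx ON orders (customer_id)
TABLESPACE orders_idx_ts;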
commits and arrays
When trying to achieve high insert rates it is important to minimize the commit overhead as much as possible. Therefore,
commit as many rows per transaction as the transaction design or application permits. Also, by inserting data in arrays
of multiple rows, the number of network or IPC round trips can be dramatically reduced. Support for array inserts (and
fetches) is in all the Oracle pre-compilers, OCI, PL/SQL, and many standard applications.
For example, we found that running 20 insert clients on one node, inserting 20 * 10,000 rows and committing each
row individually, gave an insert rate of about 733 rows per second. Using an array size of 100 and committing after
each array of 100 rows achieved a total of 2,450 inserted rows per second.
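An array insert of 100 rows per round trip might look like this in PL/SQL (a sketch; the table, sequence, and column names are hypothetical):

```sql
DECLARE
  TYPE num_tab IS TABLE OF NUMBER INDEX BY BINARY_INTEGER;
  cust num_tab;
BEGIN
  FOR i IN 1 .. 100 LOOP
    cust(i) := TRUNC(DBMS_RANDOM.VALUE(1, 11000000));  -- random customer_id
  END LOOP;
  -- One FORALL statement sends the whole array in a single execution
  FORALL i IN 1 .. 100
    INSERT INTO orders (order_id, customer_id, filler)
    VALUES (order_seq.NEXTVAL, cust(i), 'x');
  COMMIT;                                              -- one commit per 100 rows
END;
/
```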
Using the response_time_breakdown script, it is easy to see where time is being spent in Oracle. Running the test
resulted in the following response times:
MAJOR    MINOR         WAIT_EVENT                                SECONDS    PCT
-------- ------------- ---------------------------------------- -------- ------
CPU time parsing       n/a                                             2   .13%
         reloads       n/a                                             0   .02%
         execution     n/a                                           139  8.04%
disk I/O normal I/O    db file sequential read                         1   .04%
         other I/O     control file heartbeat                          4   .24%
                       control file sequential read                    0   .00%
waits    enqueue locks enqueue                                         1   .07%
         PCM locks     global cache cr request                         0   .01%
         other locks   latch free                                     88  5.09%
                       row cache lock                                  5   .26%
                       library cache pin                               3   .15%
                       buffer busy waits                               0   .03%
                       library cache load lock                         0   .02%
                       library cache lock                              0   .01%
latency  commits       log file sync                                   2   .13%
         network       name-service call wait                          7   .38%
                       SQL*Net more data from client                   0   .02%
                       SQL*Net message to client                       0   .00%
         process ctl   process startup                                 8   .48%
         global locks  DFS lock handle                              1465 84.48%
                       global cache open x                             0   .02%
                       global cache s to x                             0   .01%
                       global cache busy                               0   .01%
                       global cache bg acks                            0   .01%
                       global cache open s                             0   .00%
         misc          ges cgs registration                            1   .08%
                       contacting SCN server or SCN lock master        0   .02%
                       reliable message                                0   .00%
                       refresh controlfile command                     0   .00%
Script Example 2: Response Time Breakdown
The overall system CPU usage for this series of tests was 55% and the insert rate was 2,450 insert rows per second.
A quick look in v$session_event shows the following waits consuming the most resources during the run:
Event                        sid p1raw               p2
---------------------------- --- ---------------- -----
DFS lock handle               36 0000000053560005  2987
latch free                    35 0000000402C024A8    35
latch free                    21 000000005800BE20    93
Script Example 3: v$session_event
interpretation of the p1raw and p2 columns
The 'DFS lock handle' wait is a global enqueue wait. In OPS/RAC and non-OPS Oracle environments, enqueues have a
global presence. Enqueues and the 'DFS lock handle' are locks that someone holds to achieve a goal, and that 'goal' is
encoded in the p1raw data. For example, the value of 5356 (highlighted in Script Example 3 above) contained in
p1raw is hexadecimal for the letters 'SV'. The Oracle8i Reference Manual (Appendix B: Enqueue and Lock Names)
translates 'SV' as 'Sequence Number Value'. The p2 value is the 'object_id' that we are waiting for. Using this object_id
we can easily query dba_objects to find the name of the sequence causing those waits.
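For example, with the p2 value of 2987 from Script Example 3, the lookup might be:

```sql
-- Find the sequence behind the SV enqueue waits (p2 = object_id)
SELECT owner, object_name, object_type
FROM   dba_objects
WHERE  object_id = 2987;
```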
The 'latch free' waits show '35' and '93' as the p2 values. According to v$latchname, these translate to the 'dlm
resource hash list' and 'sequence cache' latches. Latch event waits are recorded when Oracle tries to get a latch and
goes to sleep because some other process is holding that latch. Note that before going to sleep, Oracle will 'spin' on
the latch (up to the _spin_count value) trying to acquire it before resorting to an expensive (in terms of time) sleep.
If you didn’t have a chance to monitor the v$session_wait during the run then you can use the Adams’ latch_sleep script
(see the References section of this document for the web site) to get a breakdown on latch names and sleeping
behaviors.
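Alternatively, a simple query against v$latch gives a similar breakdown of sleeps per latch (a sketch, not Adams' script itself):

```sql
-- Latches that processes have actually slept on, worst first
SELECT name, gets, misses, sleeps
FROM   v$latch
WHERE  sleeps > 0
ORDER  BY sleeps DESC;
```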
So, without even knowing the exact architecture and implementation, everything in our analysis thus far points to
sequence related issues.
sequences and RAC
One of the worst things that can happen with your table is needing a synthetic identifier that uniquely identifies each
row. In the past, we might typically have used a row in some table holding a number that was read and updated all
the time. This is, however, a nightmare from a concurrency perspective, because everybody wants to read and update
that row at the same time, resulting in long row waits and (potential) application deadlock problems.
Unfortunately, the scenario above is still a fact of life in many ERP/bookkeeping/general ledger types of applications.
Oracle introduced the sequence number as a solution for that situation long ago. Specifying '<sequence
name>.nextval' allows you to assign a database-wide unique number.
However, there are some problems with this solution if you are not aware of some basic background information on the
topic. When using a sequence number, a difficult scenario occurs when the business requires that the sequence of
numbers cannot be interrupted: if there is a number 7 and a number 5, there must also be a number 6. In this case an
Oracle sequence cannot be used, because when a transaction aborts its number is not rolled back or reused.
An additional challenge occurs if the numbers need to be sequential on a time basis. This means that a row containing
number 8 is always stored after the row with sequence number 7, such that the sequence number is responsible for an
ordering in time. In this case the ‘order’ keyword must be used. The access to the sequence number in that case is
coordinated cluster-wide by the SV enqueue. This affects scalability on a single node system as well as in a multi-node
system. If new ordered sequence numbers are required on multiple nodes then the inter-instance coordination can
become a bottleneck.
The previous script example showed what happens when using the 'order' keyword. Using 'nocache' instead of the
default 'cache 20' results in a similar performance degradation; in that case, the degradation shows up as 'row cache
lock' waits on the dc_sequences object. This is the primary reason you should always use 'noorder' and a large 'cache'
value whenever the business situation permits. In any case, always do your best to convince your users or designers of
the necessity of properly implementing and using these values in order to get the best performance possible.
Caching sequences, with a cache value at least 10 times the default (20), is always recommended unless the business
requirements preclude the use of sequence caches as described above.
There are also drawbacks to consider when using a large 'cache' value. This is especially true with RAC, where every
instance maintains its own sequence cache of the size you specify and sequence numbers can be out of order with
respect to time. This also applies when one node uses more sequence numbers than the other, or when an event such
as the shutdown of one node leaves large "holes" in the sequence number range.
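Where the business rules allow it, a sequence definition along these lines (the name and values are illustrative) avoids both the SV enqueue and the row cache lock waits discussed above:

```sql
CREATE SEQUENCE order_seq
  START WITH 1
  INCREMENT BY 1
  CACHE 10000   -- large per-instance cache
  NOORDER;      -- no cluster-wide ordering coordination
```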
test continuation
After setting noorder and a cache value of 10,000, the tests were run again. At these new settings with 20 insert
processes, the CPU is overwhelmed and there is simply too much work for the system to complete. Therefore, the tests
were continued using 4 insert processes instead of 20 (inserting 4 * 1,000 * 100 rows).
This is what the resource consumption looks like now after the testing runs to completion:
MAJOR    MINOR         WAIT_EVENT                                SECONDS    PCT
-------- ------------- ---------------------------------------- -------- ------
CPU time parsing       n/a                                             1   .71%
         reloads       n/a                                             0   .12%
         execution     n/a                                            88 62.89%
disk I/O normal I/O    db file sequential read                         1   .37%
         full scans    db file scattered read                          0   .01%
         other I/O     control file heartbeat                          4  3.01%
                       control file sequential read                    0   .01%
waits    PCM locks     global cache cr request                         0   .30%
         other locks   latch free                                     15 11.06%
                       buffer busy waits                               3  2.25%
                       library cache pin                               1   .60%
                       row cache lock                                  0   .34%
                       library cache lock                              0   .04%
                       library cache load lock                         0   .02%
                       buffer deadlock                                 0   .01%
latency  commits       log file sync                                   9  6.49%
         network       SQL*Net more data from client                   0   .32%
                       SQL*Net message to client                       0   .06%
         process ctl   process startup                                 8  5.96%
         global locks  global cache open x                             2  1.64%
                       global cache bg acks                            0   .13%
                       global cache busy                               0   .07%
                       global cache open s                             0   .02%
                       global cache s to x                             0   .02%
         misc          ges cgs registration                            2  1.21%
                       contacting SCN server or SCN lock master        0   .32%
Script Example 5: Resource Consumption After 4 Insert Processes
With a rate of about 13,200 inserted rows per second this looks quite good with a 63% CPU busy rate, 6.5% waiting
for commits to finish, and about 14% waiting for mainly library cache latches and some sequence cache latches.
writer/writer concurrency
At this point, it's time to use a second instance for inserting data and observe the results. The same test workload mix is
used on the other node (node-2). You might expect to get a total of 26,400 (that is, 2 * 13,200) inserts per second.
Instead, you get the following:
• Node-1: average 574 inserts per second
• Node-2: average 585 inserts per second
This total of a little over 1,000 inserts per second was far below the expected level of performance.
This is what resource consumption looks like on test node-1 (results are about the same on test node-2):
MAJOR    MINOR         WAIT_EVENT                                SECONDS    PCT
-------- ------------- ---------------------------------------- -------- ------
CPU time parsing       n/a                                             1   .05%
         reloads       n/a                                             0   .01%
         execution     n/a                                            82  5.30%
disk I/O normal I/O    db file sequential read                         1   .04%
         full scans    db file scattered read                          0   .00%
waits    enqueue locks enqueue                                        62  4.04%
         PCM locks     buffer busy due to global cache                57  3.73%
                       global cache cr request                         3   .20%
         other locks   buffer busy waits                              87  5.63%
                       latch free                                     11   .73%
latency  commits       log file sync                                   1   .07%
         network       SQL*Net more data from client                   0   .01%
                       SQL*Net message to client                       0   .00%
         process ctl   process startup                                 9   .58%
         global locks  global cache null to x                       1046 67.99%
                       global cache busy                              65  4.24%
                       global cache s to x                            48  3.14%
                       global cache null to s                         37  2.40%
                       global cache open x                            20  1.32%
                       global cache bg acks                            0   .01%
                       DFS lock handle                                 0   .00%
                       global cache open s                             0   .00%
Script Example 6: Test Node 1 Results
Clearly, the wait event statistics show that most of the wait time is being spent waiting for global cache events, and to be
more exact, waiting for ‘global cache n to x’. The most probable outcome of this particular event is that a data block is
shipped from the holding instance (the one that last modified the block) to the requesting instance. The duration of the
wait is influenced by the time it takes the request message to reach the instance which is holding the block, the
processing time in the holding instance, and the time until the block reaches the requesting instance.
If the blocks being waited for are “hot” blocks (that is, frequently accessed by all instances in the cluster) then the
processing time can be increased by:
• The number of active transactions in the particular block
• The number of processes on the waiter list for this buffer at the holding instance
• Whether the changes for this block need to be written to the redo log before the buffer can be shipped
During the run, v$session_wait was queried and shows the following results on a consistent basis:
Event                        sid p1    p2         p3
---------------------------- --- -- ----- ----------
global cache null to x        13 69 13355 1.7341E+10
global cache null to x        17 69 14993 1.7343E+10
global cache null to x        20 69 13908 1.7342E+10
global cache null to x        19 69 13378 1.7341E+10
Script Example 7: v$session_wait Statistics
If we look at the definition of the ‘global cache null to x’, it shows that:
• p1 equates to file# (file number)
• p2 equates to block# (block number)
• p3 equates to the global cache element number
If we look up in v$datafile which data file is file# 69, we find that this is the file where our customer_id index is stored.
The block# (p2) changes every time we query, as does the global cache element number (p3). (In Script Example 7,
p3 looks the same for each row; however, if you look at p3raw, which shows the exact hex number, you will see they
are different element numbers.)
In this case, two instances want access to the same blocks because both instances are inserting into the same files and
therefore have to modify the same index blocks. Even though the index is on a randomly chosen customer_id, it still
generates lots of conflicts, especially when starting with an empty table as in this example. When the table is empty,
meaning the index does not have any values yet, contention will be high on the root block of the index, which has to
be accessed every time a new customer_id is inserted. Moreover, once the root block fills up it will split, and the
frequency of splits depends on how fast the leaf blocks fill up.
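The file lookup described above amounts to a query like this (69 being the p1 value observed in v$session_wait):

```sql
-- Map the p1 value (file#) from v$session_wait to a data file
SELECT file#, name
FROM   v$datafile
WHERE  file# = 69;
```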
If we continue running without truncating the table, a second run shows about 2 * 1,200 inserts per second, a third run
2 * 1,400, and a fourth run 2 * 1,800. In all probability the improvement comes because the chance that node-1 hits
the same block that node-2 "owns" is getting smaller and smaller, but the tree structure of indexes always remains a
possible bottleneck.
Without the index, the insert rate is much higher. By using a coarse-grained coherence model with data block contiguity
(multiple adjacent blocks hash to the same global cache element, for example gc_files_to_locks with !255), bumping the
segment high water mark (_bump_highwater_mark_count) by this value, and/or pre-allocating extents, a very good
insert rate can be achieved (see the Oracle RAC Deployment and Performance manual).
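A hypothetical init.ora fragment illustrating such a coarse-grained scheme might look like this (the file number and lock count are made up; check the RAC Deployment and Performance manual for the exact syntax and implications before using this parameter):

```
# Coarse-grained PCM locking for data file 69:
# 1000 locks, each covering 255 contiguous blocks (!255)
gc_files_to_locks = "69=1000!255"
```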
It would be nice if we could have multiple indexes on the same table and the same columns, ideally one for each
node. This is possible if the table is partitioned (refer to the Oracle Concepts and Administrator's manuals). In Oracle it
is possible to build local indexes per partition. Each index covers one partition, so if we partition a table by
node_id/instance_id and put a local index on customer_id, we end up with multiple indexes (one index partition per
database instance).
Even without adding a column for an instance id, Oracle range or hash partitions can be used efficiently, thus avoiding
an application code change. Using the partitioning option and local indexes for hot data reduces the degree and rate
of concurrent access by spreading the contention over multiple intra-table segments.
A partitioned table basically looks like a table consisting of multiple tables. Every partition is in its own segment, in its
own tablespace or data file, with its own segment header (where freelists and other housekeeping data are located).
This makes partitioning work very well with RAC. However, just like RAC, partitioning is an extra-cost option that must
be purchased in order to use it.
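A sketch of such a layout (all names are hypothetical): two range partitions, one per instance, each carrying its own local index partition.

```sql
CREATE TABLE orders (
  instance_id NUMBER,
  order_id    NUMBER,
  customer_id NUMBER,
  filler      CHAR(40)
)
PARTITION BY RANGE (instance_id) (
  PARTITION p_node1 VALUES LESS THAN (2) TABLESPACE orders_ts1,
  PARTITION p_node2 VALUES LESS THAN (3) TABLESPACE orders_ts2
);

-- One index partition per table partition, so each instance
-- maintains its own customer_id index tree
CREATE INDEX orders_cust_idx ON orders (customer_id) LOCAL;
```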
test continuation
After partitioning, and making sure every instance inserts into its own partition, the test results now look like this:
two (nodes) * 12,000 inserts per second = 24,000 inserts per second, which is what we had hoped to achieve in the
first place.
The resource consumption under these test conditions was as follows:
MAJOR    MINOR         WAIT_EVENT                                SECONDS    PCT
-------- ------------- ---------------------------------------- -------- ------
CPU time parsing       n/a                                             1   .75%
         execution     n/a                                            86 56.14%
disk I/O normal I/O    db file sequential read                         1   .37%
waits    DBWn writes   checkpoint completed                            0   .03%
         enqueue locks enqueue                                         5  3.30%
         PCM locks     global cache cr request                         1   .35%
                       buffer busy due to global cache                 0   .03%
         other locks   latch free                                     18 11.86%
                       buffer busy waits                               3  2.17%
                       library cache pin                               2  1.25%
latency  commits       log file sync                                  10  6.75%
         network       SQL*Net more data from client                   0   .28%
                       SQL*Net message to client                       0   .03%
         process ctl   process startup                                 9  5.95%
         global locks  global cache open x                             5  3.17%
                       global cache open s                             0   .12%
                       global cache bg acks                            0   .11%
                       global cache null to x                          0   .09%
                       global cache s to x                             0   .02%
                       DFS lock handle                                 0   .01%
Script Example 8: Resource Consumption
Note that the global cache coherence times have now vanished, except for the 'open' events, which are justifiable.
What is left is mainly latch free, commit-related 'log file sync', and CPU execution time (the latch time is still library
cache and sequence cache latches).
reader/writer concurrency
Now let's see what happens when we combine queries and inserts on the same data. Node-1 still inserts at maximum
speed, which should be about 12,500 inserts per second.
On Node-2 a small Pro*C program runs that performs a union between the two tables. From a very large and separate
'OrderHistory' table it selects 10-20 rows randomly using an index; the size of the table forces the random reads to be
physical random reads. It also selects all the records for a randomly chosen customer_id, using the customer_id index
on the order table. The query then loops without pausing. This causes Node-2 to read data that is being inserted on
Node-1; that is, reader/writer concurrency.
Running 4 inserters on Node-1 and 25 query clients on Node-2, the following resource consumption occurred:
MAJOR    MINOR         WAIT_EVENT                                SECONDS    PCT
-------- ------------- ---------------------------------------- -------- ------
CPU time execution     n/a                                          1132 57.51%
disk I/O normal I/O    db file sequential read                         5   .25%
waits    enqueue locks enqueue                                        24  1.22%
         PCM locks     buffer busy due to global cache                15   .74%
                       global cache cr request                         1   .04%
         other locks   latch free                                    332 16.88%
                       buffer busy waits                              15   .75%
                       library cache pin                               2   .11%
latency  commits       log file sync                                 155  7.89%
         network       SQL*Net more data from client                   5   .28%
                       SQL*Net message to client                       1   .05%
         process ctl   process startup                                 9   .46%
         global locks  global cache open x                            91  4.63%
                       global cache s to x                            90  4.56%
                       global cache busy                              31  1.57%
                       global cache null to x                         22  1.14%
                       global cache null to s                         22  1.13%
Script Example 9: Four Inserters on Node 1 and 25 Query Clients on Node 2
CPU and latch percentages appear to be about the same. Latch time is a little higher, but the main consumers are still
the cache buffers chains, library cache, and sequence cache latches. In addition, quite a few RAC-related latch waits
are now being registered. You can refer to v$latch/v$latch_children to see which latches are being slept on. Although
there is also a 'wait_time' column in v$latch, it always appeared to be 0 on this platform and version [2].
About 12% of the wait time is consumed by 'global cache' events. The wait time is caused by the insert processes
wanting exclusive access to database blocks for inserting data, hence the 'opens' and the converts from 's' (shared) to
'x' (exclusive).
Oracle RAC maintains the global state of each data block in the Global Cache Service; the instance that keeps this
state for a block is known as its master. Depending on its address, a block is mastered on a particular node, and the
resource master is determined when the block is accessed for the first time. With two instances, about half of the
resources are maintained at the other instance. Depending on where the block is mastered and the requested access
mode, messages may or may not need to be sent.
By looking at v$sesstat for the inserting sessions, you can calculate, by dividing 'global cache get time' by 'global
cache gets', that it takes about 0.27 milliseconds to get a locally mastered block. To calculate the time it takes to get a
remotely mastered block, use the same division but use the gets registered for the LMSx process (LMD0 in Oracle8i).
This process takes care of the lock if it is maintained on another node, and was observed to take around 1.3
milliseconds on a relatively idle system.
Note that starting with Oracle 9.0.1.3 and all newer versions, the grant/ast goes directly to the Oracle foreground
(client) process. This should reduce CPU usage and latency, and the calculations above will probably no longer hold.
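The local figure above can be derived with a query of this shape (statistic names as in Oracle9i; the statistic times are in centiseconds, hence the factor 10 to get milliseconds):

```sql
SELECT ss.sid,
       10 * SUM(DECODE(sn.name, 'global cache get time', ss.value, 0)) /
       NULLIF(SUM(DECODE(sn.name, 'global cache gets', ss.value, 0)), 0)
         AS avg_get_ms
FROM   v$sesstat ss, v$statname sn
WHERE  ss.statistic# = sn.statistic#
AND    sn.name IN ('global cache gets', 'global cache get time')
GROUP  BY ss.sid;
```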
With the 4 inserters on Node-1 and the 25 query clients on Node-2, the insert speed for this test drops to around
9,900 inserts per second.
On Node-2 the resource consumption statistics now look like this:
MAJOR    MINOR         WAIT_EVENT                                SECONDS    PCT
-------- ------------- ---------------------------------------- -------- ------
CPU time execution     n/a                                           833  6.44%
disk I/O normal I/O    db file sequential read                      9665 74.71%
waits    enqueue locks enqueue                                         8   .07%
         PCM locks     global cache cr request                      2298 17.76%
                       buffer busy due to global cache                 8   .06%
         other locks   latch free                                     97   .75%
                       global cache freelist wait                      4   .03%
latency  network       SQL*Net message to client                       1   .01%
         process ctl   process startup                                 5   .04%
         global locks  global cache s to x                             2   .01%
                       global cache bg acks                            0   .00%
                       global cache busy                               0   .00%
                       DFS lock handle                                 0   .00%
                       global cache open s                             0   .00%
                       global cache null to x                          0   .00%
                       global cache open x                             0   .00%
Script Example 10: Resource Consumption Statistics
The queries hardly use any CPU time now; they are busy trying to get database blocks, either through physical reads
(random, or 'db file sequential read') or through the global cache.
The requested consistent read blocks are served by the block server process, LMSx (BSPx in Oracle8i). The block is
read from the buffer cache by this process and, if necessary (and available in the cache), undo data is applied to build
a consistent read version before the block is 'shipped' to the other node.
[2] It was subsequently discovered that this problem is due to a port-specific bug.
Starting with Oracle9i, Oracle has the capability of shipping and maintaining locks for ‘current’ blocks. This eliminates
the need to write database blocks back to disk in order to read and modify them by another instance. This new facility
is called Cache Fusion and, while it was designed to address scalability problems, it is still a process that requires the
use of system time and resources. This process will be explained and discussed later on in this paper.
By sampling v$sysstat and v$system_event at intervals during the test run, we can observe Node-2 now doing about
220 selects per second ('execs' column) while waiting on about 1,500 physical reads per second ('dbpr' column),
taking an average of 11.8 milliseconds per read, while about 1,120 global cache consistent read requests per second
were waited for, averaging about 4.3 milliseconds per request. A total of about 340 consistent read blocks (the 'gc bl
rec' column) are received per second. Note, however, that when waiting for a cr request it is not known whether a
block will be received or a permission to read from disk will be granted.
time      execs   read ms     dbpr  cr req ms  gc cr rds3  gc bl rec
09:28:07  212.63    11.87  1549.21       4.12     1144.59     337.17
09:28:11  223.36    11.94  1520.47       4.27     1124.93     342.06
09:28:14  224.87    11.75  1515.04       4.68     1130.53     357.55
09:28:18  230.65    11.88  1516.41       4.36     1128.91     340.42
09:28:22  216.75    11.88  1523.56       4.43     1108.14     342.93
09:28:26  227.03    11.82  1540.16       4.21     1136.75     336.65
Script Example 11: V$sysstat and v$system_event Statistics
The number of ‘global cache cr requests’ waits is much higher than the number of ‘cr blocks received’. Not all global
cache consistent block reads result in blocks being shipped from the other node. Many can be read by the local instance
but the instance still needs to ‘globally’ request every block it reads. As we will see later, in the ‘gc_files_to_locks’
section, much depends on the global coherence scheme being utilized.
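The figures above come from sampling cumulative counters. As a sketch of how such samples might be taken (the statistic and event names below are standard Oracle9i names, but verify them against your release), the underlying counters can be read directly:

```sql
-- Sketch: cumulative counters behind the per-second rates shown above.
-- Sample twice and divide the deltas by the elapsed seconds.
SELECT name, value
  FROM v$sysstat
 WHERE name IN ('execute count',
                'physical reads',
                'global cache cr blocks received',
                'global cache cr block receive time');

-- Wait counts and total wait time (in centiseconds) for cr requests
-- and single-block reads.
SELECT event, total_waits, time_waited
  FROM v$system_event
 WHERE event IN ('global cache cr request', 'db file sequential read');
```

A small script that runs these every few seconds and reports the deltas produces output of the kind shown in the script examples.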
test continuation
Now, we will add 25 more query clients on Node-2 for a total of 50 query clients. Node-1 resource usage now looks
like this after re-running the test:
MAJOR    MINOR         WAIT_EVENT                               SECONDS    PCT
-------- ------------- ---------------------------------------- -------- ------
CPU time execution     n/a                                          2176 41.54%
disk I/O normal I/O    db file sequential read                         5   .09%
waits    enqueue locks enqueue                                       169  3.23%
         PCM locks     buffer busy due to global cache                99  1.89%
                       global cache cr request                         1   .01%
         other locks   latch free                                    411  7.85%
                       buffer busy waits                              36   .68%
                       buffer deadlock                                 9   .17%
latency  commits       log file sync                                 274  5.22%
         global locks  global cache s to x                           952 18.18%
                       global cache open x                           510  9.73%
                       global cache null to x                        374  7.15%
                       global cache null to s                        126  2.40%
                       global cache busy                              55  1.05%
Script Example 12: Node 1 Resource Usage for 50 Query Clients
3 V$sysstat and v$system_event statistics for 'gc cr rds'.
After increasing the number of query clients to 50 on Node-2 and re-running the test, we are now achieving about 6,500 inserts per second.
MAJOR    MINOR         WAIT_EVENT                               SECONDS    PCT
-------- ------------- ---------------------------------------- -------- ------
CPU time execution     n/a                                          2806  4.43%
disk I/O normal I/O    db file sequential read                     34610 54.67%
waits    enqueue locks enqueue                                        12   .02%
         PCM locks     global cache cr request                     23565 37.22%
                       buffer busy due to global cache               152   .24%
         other locks   latch free                                   2043  3.23%
                       buffer busy waits                              41   .07%
                       global cache freelist wait                     23   .04%
latency  global locks  global cache s to x                            16   .03%
                       global cache busy                               6   .01%
                       global cache bg acks                            0   .00%
Script Example 13: Node-2 Resource Usage
The average request time for all processes on Node-2 is about 7 milliseconds while the cr block receive time is 9
milliseconds. This results in a query rate of 306 queries per second with an average response time of 0.18 seconds.
time      execs   read ms     dbpr  cr req ms  gc cr rds  gc bl rec
02:25:01  320.00    13.69  1967.20      12.12    1485.29     508.33
02:25:05  295.24    13.57  2012.93      12.10    1533.25     500.97
02:25:09  306.55    13.37  1961.65      12.47    1574.52     490.27
02:25:14  311.21    13.42  2002.86      11.74    1467.49     495.15
02:25:18  311.00    13.21  1969.27      11.69    1559.29     484.62
02:25:22  302.49    13.28  2022.17      11.99    1481.26     477.00
02:25:26  301.63    13.49  1986.34      11.70    1561.28     482.33
02:25:31  312.56    13.49  1998.83      12.24    1495.23     494.24
02:25:35  294.47    13.32  1980.05      11.90    1520.49     489.34
02:25:39  307.09    13.45  1956.79      11.93    1534.44     491.67
Script Example 14: V$sysstat and v$system_event Statistics
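The query rate and response time are consistent with the client count. Applying Little's law (concurrency = throughput x response time) as a quick sanity check, hypothetically:

```sql
-- Sanity check (sketch): Little's law, N = X * R.
-- 306 queries/sec * 0.18 sec response time implies about 55 busy
-- sessions, in line with the 50 query clients driving the load
-- with little think time.
SELECT ROUND(306 * 0.18) AS implied_clients FROM dual;
```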
Adding 25 more clients, for a total of 75 query clients, and re-running the test results in the following resource usage:
MAJOR    MINOR         WAIT_EVENT                               SECONDS    PCT
-------- ------------- ---------------------------------------- -------- ------
CPU time execution     n/a                                           570 26.52%
disk I/O normal I/O    db file sequential read                         5   .22%
waits    enqueue locks enqueue                                        74  3.42%
         PCM locks     buffer busy due to global cache                68  3.15%
                       global cache cr request                         1   .04%
         other locks   latch free                                     82  3.82%
                       buffer busy waits                              36  1.66%
latency  commits       log file sync                                  48  2.25%
         global locks  global cache s to x                           609 28.33%
                       global cache null to x                        316 14.71%
                       global cache open x                           203  9.46%
                       global cache null to s                         59  2.73%
                       global cache busy                              52  2.43%
                       global cache open s                             1   .05%
Script Example 15: Resource Usage on Node-1 Running 75 Query Clients on Node-2
The system (Node 1) is now doing 4,050 inserts per second and resource usage for Node-2 is as follows:
MAJOR    MINOR         WAIT_EVENT                               SECONDS    PCT
-------- ------------- ---------------------------------------- -------- ------
CPU time execution     n/a                                          1109  3.25%
disk I/O normal I/O    db file sequential read                     15221 44.68%
waits    enqueue locks enqueue                                         7   .02%
         PCM locks     global cache cr request                     16262 47.73%
                       buffer busy due to global cache               270   .79%
         other locks   latch free                                   1063  3.12%
                       buffer busy waits                              60   .18%
                       global cache freelist wait                     24   .07%
latency  global locks  global cache s to x                            14   .04%
                       global cache busy                               9   .03%
         misc          cr request retry                                4   .01%
Script Example 16: Node-2 Resource Usage
The average global cache wait time for all processes is about 13 milliseconds, with the cr block receive time averaging 13.3 milliseconds. The query rate is 324 queries per second, with an average response time of 0.25 seconds.
The v$sysstat and v$system_event statistics are:
time      execs   read ms     dbpr  cr req ms  gc cr rds  gc bl rec
10:11:59  320.12    13.92  2130.22      22.98    1539.46     399.08
10:12:04  323.29    14.43  2096.26      22.03    1572.85     390.93
10:12:09  319.03    14.02  2115.86      24.33    1418.81     418.43
10:12:14  332.63    14.18  2050.21      23.69    1402.63     400.00
10:12:18  321.23    13.89  2058.60      21.11    1506.18     414.77
10:12:23  324.94    14.19  2172.81      23.98    1528.72     431.03
Script Example 17: v$sysstat and v$system_event Statistics
The number of waits for physical blocks and global cache cr blocks is about the same as in the previous 50-client run, but the wait time for the global cache cr blocks nearly doubles. Note also that the number of cr blocks received is at the same level as before. The resource breakdown also shows that almost 50% of the time spent waiting is for those cr requests. Also notice that the global cache wait ratio rises to about 50% on Node-1, the insert node.
The following is a summary of results for all of the tests:
                                Number of Query Clients
                                    25      50      75
inserts/sec                       9850    6940    4050
queries/sec                        211     306     324
query response time (seconds)     0.13    0.18    0.25
cr blocks recvd/sec                340     490     400
#physical reads/sec               1550    2000    2000
readtime in ms                    11.5    13.5      14
gc cr requests                    1100    1500    1500
insert node:
  gc get time in ms               1.58    4.79    9.13
  gc convert time in ms            2.8    8.57   16.51
query node:
  gc get time in ms               2.13    7.02      13
  crbl rec time ms                3.26    9.02   13.33
Query node CPU busy %                *      95     100
*No data available.
Table 1: Performance Summary of Query Clients
It is easy to see why the global cache latency times go up: as described earlier, half of the blocks (with a 2-node cluster) are acquired on the local node, with the other half acquired on the other node through the interconnect. The times referenced in Table 1 above are calculated using v$sysstat. If you use v$sesstat and look at the LMSx global lock statistics, you can see much higher numbers (>20 milliseconds) for the remotely mastered blocks.4
Rather than running into an interconnect bottleneck here, it appears that the inserting node loses speed mainly because global cache coherence management takes too long, and the query node slows down for the same reason. In addition, the slowdown is also due to block shipping taking longer to complete.
The interconnect used on this cluster is a 100 megabytes per second, low-latency, high-speed interconnect. Instead of the default UDP protocol, the instances were configured to use the Reliable Datagram Protocol (RDG), which is a platform-specific option. In previous tests, RDG was proven to be the faster method for interconnect traffic.
However, what was missing from this test was CPU usage statistics. It appears that running 50 query clients had almost saturated the CPUs. Using faster systems, we built a new cluster with 2 x 4-CPU x 1 GHz nodes instead of the 2 x 4-CPU x 500 MHz nodes. This configuration was twice as fast in terms of raw CPU power.
When the tests were re-run, the cluster showed an ability to insert almost twice the number of rows while performing almost twice the number of queries per second. With 4 insert processes and 25 query processes, the resource usage ratios stayed about the same.
4 Commencing with version 9.2, all ast messages are sent directly back to the requestor instead of going through LMS; therefore, the session statistics for LMS will no longer be collected. By sending ast messages directly to the requesting foreground processes, context switches can be reduced, resulting in decreased demands on CPU resources and a reduction in latency time to the consumer.
Adding users and going from 25 to 100 query clients also resulted in different resource usage ratios, as follows:
                                Number of Query Clients
                                    25      50      75     100
inserts/sec                      19990   18460   18014   16890
queries/sec                        410     580     620     631
query response time in sec        0.06    0.09    0.12    0.16
cr blocks recvd/sec                700     890     900     900
#physical reads/sec               2250    3200    3200    3200
readtime in ms                       9      12      20      28
gc cr requests                    1800    2500    2500    2500
insert node:
  gc get time in ms               0.58    0.93    1.02     2.9
  gc convert time in ms           1.13    1.77    1.86    1.87
query node:
  gc get time in ms               0.88    1.59    1.74    1.74
  crbl rec time ms                1.27     2.1    2.26    2.24
Query node CPU busy %               55      80      80      80
Table 2: Test Re-run Resource Usage Ratios
The insert rate is fairly stable now, but the query rate stabilizes at around 50 query clients. Also, CPU usage does not get any higher when more users are added. An interesting result is that although the number of physical reads is not increasing, the read times increase dramatically when adding more queries, which is an indication of an I/O bottleneck. Under these conditions it appears the I/O subsystem cannot handle more than 3,200 random reads per second.
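One way to confirm such an I/O ceiling is to watch per-file read counts and times. As a sketch (v$filestat reports times in centiseconds, and only when timed_statistics is enabled):

```sql
-- Sketch: per-data-file read rate and average single-block read time.
-- A read count that stops growing while readtim keeps climbing points
-- to a saturated I/O subsystem rather than a global cache problem.
SELECT df.name,
       fs.phyrds,
       fs.readtim,
       ROUND(fs.readtim * 10 / GREATEST(fs.phyrds, 1), 2) AS avg_read_ms
  FROM v$filestat fs, v$datafile df
 WHERE fs.file# = df.file#
 ORDER BY fs.phyrds DESC;
```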
The global cache and cr block timings also look quite healthy compared to the numbers achieved previously. Note the 900 cr blocks being shipped per second, nearly 100% more blocks per second than achieved in any of the previous runs.
With the higher number of users, the resource usage ratios for the inserting node were stable and about the same as for the earlier 25-query-client test. For the query node, the 'db file sequential read' ratio continued to go higher and higher.
To prove that the limiting factor for the insert rate on Node-1 was the CPU usage on Node-2 and not the interconnect speed, the query used by the query clients was changed to do less physical I/O. This caused them to do more queries per second and thus placed a much higher load on the GCS processes and cr block shipping.
The results of that test are as follows:
                                Number of Query Clients
                                    50      75
inserts/sec                       9500    7200
queries/sec                       1517    1546
query response time in sec        0.03    0.05
cr blocks recvd/sec               1700    1700
#physical reads/sec               2500    2500
readtime in ms                     8.5       9
gc cr requests                    3100    3200
insert node:
  gc get time in ms               2.35     2.3
  gc convert time in ms           4.03     7.9
query node:
  gc get time in ms               3.55    7.32
  crbl rec time ms                3.72    4.91
Query node CPU busy %               90     100
Table 3: Query Resource Usage Ratios
This change results in a stable 2,500 physical reads per second and 1,700 cr blocks shipped per second.
This test was designed specifically to generate more cr block traffic over the interconnect. The results point to the conclusion that the slowdown in the previous tests was not caused by a slow interconnect but was a direct result of the higher CPU usage due to higher interconnect usage and more processing overhead for block shipping and global cache management.
The insert rate has now dropped to 7,200 inserts per second. After initiating a few more CPU-bound queries on the query node, doing millions of logical I/Os with no physical I/Os, the insert rate goes down dramatically to just a few hundred per second.
conclusions
• When high CPU usage occurs in an Oracle RAC environment, it can significantly slow down global cache access speed and thus slow down those processes that depend on it, even if the process runs on another node in the cluster. Further testing is needed to determine the impact of manually giving higher CPU priority to key RAC processes. Even without performing concurrent processing on a system, the use of RAC will impose some additional CPU overhead.
• Just applying RAC to an existing application that is known to frequently access certain data concurrently can cause performance issues. In these tests, we found a degradation of performance from 12,500 inserts per second down to a few hundred per second. The test results highlight the importance of considering application and/or data partitioning when deploying RAC. Due care should therefore be exercised in examining how the data is actually used by the application(s) before implementing RAC.
• Although it is possible to run without 'gc_files_to_locks', it can still be beneficial to use this setting because requests for global cache management become less frequent. It should be used on data files that are known to be mostly read, or on data files with little and predictable write/write or read/write access concurrency. Its use has to be carefully considered because the 9i Cache Fusion protocol is turned off for such files and the danger of severe performance degradation due to "false pinging" exists. It should also be mentioned that coarse-grained global cache management ("gc_files_to_locks") is harder to administer.5
• Tests were run in a two-node cluster. Adding more nodes will certainly make things more complex and will probably result in the use of even more CPU resources. When nodes and remote users are added and the shared working set does not scale correspondingly, there is a chance that locality of access can be diminished, that is, the local buffer cache hit efficiency can be lower. Hence more messages are sent and more buffers are received from remote instances.
• Consider the load that Cache Fusion will impose on the interconnects before adding multiple nodes.
• Since the tests were done with 9.0.1, the Oracle RAC kernel has been optimized to reduce overhead for certain operations. The RDBMS release 9.2.0.1 incorporates many of these optimizations. However, this does not make the findings and tuning recommendations described here obsolete. As in a non-clustered system, minimizing contention and improving locality of access are still valid performance tuning goals.
Usage of RAC in a clustered or complex environment will require due care and planning before implementation in order
to get the best performance possible.
5 This conclusion is based on test information referenced in the Appendix section of this document.
appendix a
GC_FILES_TO_LOCKS or coarse-grained global cache coherence management
All the tests in this paper were performed without setting gc_files_to_locks, and 9iRAC works quite well that way. But things can be done faster, using less CPU, and as we have seen in this paper, high CPU usage on one node can have a serious impact on the performance of the other node.
One of the more challenging things about working with Oracle8 OPS was finding the correct setting for the gc_files_to_locks parameter. There is a lot to be said about this parameter, and before using it you should fully understand its overall impact on Oracle performance. See the references section at the end of this paper for a few books that can help you understand and use this parameter setting. The Oracle OPS/RAC manuals also do a very good job of explaining it.
Starting with Oracle9i, the RAC Administration Guide states: "Oracle automatically controls resource assignments so gc locks are not needed." Fortunately, the appendix of the RAC Deployment and Performance manual gives a good explanation of when and how you should use gc_files_to_locks.
For example, this is what happens when only performing reads on Node-2 while Node-1 is not being used:
time      execs   read ms     dbpr  cr req ms  gc cr rds  gc bl rec
11:49:40  340.88    15.54  2413.53      10.13    1156.09       0.00
11:49:44  348.87    15.80  2528.54       9.05    1231.43       0.00
11:49:48  360.78    15.98  2379.44      11.37    1232.71       0.22
11:49:53  357.78    16.55  2758.66      15.23    1285.69       0.00
11:49:58  361.25    15.80  2167.29      16.64    1189.82       0.00
11:50:03  365.85    15.99  2455.12      18.89    1227.81       0.00
11:50:08  382.98    15.80  2513.92      22.37    1261.37       0.00
Script Example 18: Using Node-2 Only -- Read Statistics
Node-2 is reading from disk as fast as it can, but is still hindered by the I/O bottleneck. We observe no cr blocks coming over the interconnect. Processor statistics ('ps') show about 10-15% overall CPU usage on the machine for the GCS processes. The idle node also shows a 15% CPU utilization rate for global cache management overhead.
You must realize that for every block read, a lock must be acquired. As we have seen in previous measurements, this can take between 0.01 milliseconds locally and up to 1 millisecond if the lock resides on another node. Note that this 1 millisecond was measured during low usage; it can be 10 milliseconds or more if the system is very busy. So getting 50 blocks can take anywhere between 25 and 250 milliseconds or more just to acquire the locks. As you can see, it can be beneficial to think about your coherence scheme.
In this test case we chose to use only one lock per data file for the read-only data and to use range locks (2,200 blocks per lock) for the insert data files. The indexes were stored in separate files using a 1:1 locking scheme.
Using this method showed a 15-20% higher insert speed, mainly because a lock has to be taken only for each 2,200 blocks of data instead of for every block.
This is even more true for queries. After acquiring one lock for each data file they read from, no more lock gets are necessary for the read-only data files. This saves a lock get for every block visited.
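The scheme above might be expressed in the init.ora along the following lines. This is an illustration only: the file numbers and lock counts are hypothetical, and the exact syntax should be checked against the OPS/RAC manuals for your release.

```
# Hypothetical sketch -- file numbers and lock counts are invented:
#   files 1-2: read-mostly data, one lock per file
#   files 3-4: insert data, fixed locks each covering 2,200 blocks
#   files 5-6: index files, default fine-grained (1:1) locking
GC_FILES_TO_LOCKS = "1-2=1EACH:3-4=50!2200EACH:5-6=0"
```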
Note, however, that if processes from remote nodes are reading or updating the data files that are subject to the coarse-grained cache management policy, chances are high that performance will be severely impacted.
I would like to finish with the recommendations of an OPS guru, Anjo Kolk:

Usage    Low             High
Read     =3EACH          =3EACH
Write    =1000(R)EACH    =0
Refer to the OPS or RAC manuals for the appropriate GC_FILES_TO_LOCKS syntax.
references
Oracle8i Internal Services, written by Steve Adams, published by O'Reilly. A must-read book for everybody involved in Oracle performance tuning. He also has a good website and newsletter.
Scaling Oracle8i, written by James Morle, published by Addison-Wesley. Not referenced in this performance report, but the best book available on general Oracle design and performance issues. Can also be used for OPS.
Oracle Parallel Processing, written by Tushar Mahapatra and Sanjay Mishra, published by O'Reilly. Combines OPS with Parallel Query material and is worth buying and reading; a good introduction to the OPS and PQ features.
And don't forget the official Oracle manuals, which do a very good job, especially the RAC Deployment and Performance guide and the Oracle Performance Tuning manual.
www.oraperf.com     Anjo Kolk's site: papers, YAPP, plus the ability to upload your statspack or utlstat report and get performance advice.
www.ixora.com.au    Almost everything at this site is worth reading.
www.hotsos.com      Oracle performance and problem diagnosis papers on using wait events for performance tuning, as well as other good material. Try to visit their seminars.