PARALLEL EXECUTION FACILITY CONFIGURATION AND USE

Reviewed by Oracle Certified Master Korea Community
( http://www.ocmkorea.com http://cafe.daum.net/oraclemanager )
Introduction
Beginning with Oracle Enterprise Server 7.1, it became possible to utilize more than a single server (shadow) process to
execute a query, which was available as the Parallel Query Option (PQO). The goal was to return results to the user more
quickly than with serial execution, at the expense of consuming additional system resources. This capability was greatly
expanded with each new release of Oracle Enterprise Server. In versions 8 and 9, an impressive array of parallel operations
is possible, including not only parallel query but also parallel DML, parallel DDL, parallel direct-load using SQL*Loader,
and numerous maintenance tasks. This expanded capability is now collectively referred to as Parallel Execution (PX).
The purpose of this paper is to provide practical knowledge for configuring PX and monitoring its performance once it has been
configured, and to highlight some of the most common uses in the data warehousing environment. The scope of this paper
includes the following topic areas to achieve that goal:
• Overview of Parallel Execution architecture
• Host and database design considerations
• Configuring Parallel Execution
• Implementing parallel operations
• Common parallel data warehouse operations
• Run-time performance monitoring
Many configuration parameters and recommended values are discussed. While the Oracle documentation is general in nature
and is applicable to all host vendors and configurations, this paper is geared towards the most common data warehouse
configurations encountered by the author as follows:
• Sun Microsystems or Hewlett-Packard enterprise class host that has between 4 and 64 CPUs
• High performance storage array such as EMC Symmetrix¹
• Single instance environment (i.e., Oracle Parallel Server or Real Application Clusters not implemented)
• Solaris² or HP-UX³ operating system
Overview of PX Architecture
Before learning about the configuration of the PX facility, it is necessary to achieve a good understanding of the process
architecture. To begin with a simpler case, consider non-parallel execution of a SQL statement. Several steps are
involved. The client process provides the interface to the end-user. The client (user) process may be SQL*Plus, or a custom
application written in Java, Perl, Visual Basic, or some other language. The client process may reside on the database server
or some other computer such as an end-user’s workstation. Each client process communicates with an Oracle server process.
The server process is responsible for interacting with the database on behalf of the client process. It accepts SQL statements
from the client process, parses and executes the statement against the database, and returns the results to the client process.
This is illustrated in figure 1.
¹ Symmetrix is a registered trademark of EMC Corporation
² Solaris is a registered trademark of Sun Microsystems, Inc.
³ HP-UX is a registered trademark of Hewlett-Packard Company
PX Facility Configuration and Use
Maresh
[Diagram: Client Process, Server Process, Database; SQL flows toward the database and results flow back to the client.]
Figure 1. Serial SQL Execution
Figure 2 illustrates the process configuration when a parallel query is executed. First, the server process that interacts with
the client process is promoted to the role of Query Coordinator (QC). The QC is primarily responsible for handling
communications with the client and managing a set of PX slave processes that perform most of the work against the database.
The QC is responsible for acquiring multiple PX slave processes, dividing up the workload between the PX slaves, and
passing the results back to the client.
[Diagram: Client Process, Query Coordinator, three PX Slave Processes, Database; SQL flows toward the database and results flow back to the client.]
Figure 2. Parallel SQL Execution with one PX slave per execution thread
When a parallel query is initiated by the client process, the QC determines how to divide the workload between the various
PX slave processes. It shouldn’t be surprising that the QC rewrites the SQL statement submitted by the client process into
one or more statements that are passed to the PX slaves for execution. In the above figure, there are three threads of
execution, each of which has one PX slave process. This is referred to as intra-operational parallelism.
When a parallel operation is initiated, PX slaves are borrowed from a common slave pool that is available to all users of the
database instance. Once the parallel operation has completed, the slaves are released back to the pool for other parallel
operations. The maximum number of intra-operational PX slaves that will be brought to bear on a particular query can be
controlled at the table or index level using the ALTER TABLE or ALTER INDEX directive respectively, or by SQL
statement hints. In certain cases, twice the degree of PX slaves will be used to satisfy a query. This configuration is
illustrated in Figure 3 where two PX slaves are utilized for each execution thread.
This configuration occurs when executing either a merge or hash join, or when sorting or other aggregation operations are
present in the original query. It will also occur when a parallel DML statement is executed with a correlated parallel
SELECT statement. In this case, each set of intra-operational PX slaves in an execution thread act in a producer-consumer
relationship. In the above case, the PX slaves that access the database directly produce data that are consumed by the second
set of intra-operational PX slaves. For example, in a sorting operation, the first set of PX slaves would retrieve rows from
the database and apply any limiting conditions that appear in the query. The resulting rows would be fed to the second set of
PX slaves for sorting. Each of the second set of PX slaves is responsible for sorting rows within a particular range. Each of
the PX slaves that accessed the data directly sends its results to the appropriate sorting slave according to the value of the sort key.
When two PX slaves communicate with each other within the same thread of execution, it is referred to as inter-operational
parallelism.
[Diagram: Client Process, QC, six PX Slaves arranged as two slaves per execution thread, Database.]
Figure 3. Parallel SQL Execution with two PX slaves per execution thread
The above example contains two parallel operations. The first operation retrieves rows from tables in the database while the
second operation sorts or aggregates the results. When executing complex joins, multiple inter-operational parallel
operations may occur. In this case, the cost-based optimizer will split the original SQL statement into the optimal number of
operations, each of which is executed in parallel in succession. Consider the execution plan below involving three tables that
are hash joined, followed by a sort operation.
SORT                   Step 4, PX slave set 2
  HASH JOIN            Step 3, PX slave set 1
    TABLE C            Step 3, PX slave set 1
    HASH JOIN          Step 2, PX slave set 2
      TABLE A          Step 1, PX slave set 1
      TABLE B          Step 2, PX slave set 2
The optimizer may choose to perform four parallel operations in the following order using two sets of PX slaves.
1. The first set of PX slaves produces rows from table A using intra-operational parallelism.
2. The results of step 1 are consumed by the second set of PX slaves, which also retrieve rows from table B using
intra-operational parallelism and perform the first hash join. Since steps 1 and 2 occur simultaneously, they represent
an inter-operational process.
3. The results produced in step 2 are consumed by the first set of PX slaves, which have completed executing step 1.
The same set of PX slaves also retrieves rows from table C using intra-operational parallelism and performs the
second hash join. Since steps 2 and 3 occur simultaneously, they represent an inter-operational process.
4. The results from step 3 are consumed by the second set of PX slaves, which have completed executing step 2. The PX
slaves perform the sort and return the result to the Query Coordinator.
This may begin to sound hopelessly complex, but a few patterns make it possible to understand clearly what is
occurring in each step. When SQL statements require more than two parallel operations to return the result, each set of PX
slaves is reused as necessary. In the absence of subqueries, union, intersect, or minus operations, the maximum number of
PX slaves required to satisfy the original statement is no more than twice the highest Degree of Parallelism specified on any
of the tables in the query.
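The reuse pattern can be sketched in a few lines of Python. This is only an illustration of the alternation described above, not a model of Oracle's internals; the function names `slave_set_for_step` and `max_px_slaves` are hypothetical.

```python
from typing import Sequence


def slave_set_for_step(step: int) -> int:
    """Return which of the two PX slave sets (1 or 2) runs a given step.

    Assumes the simple alternation described in the text: odd-numbered
    steps run on set 1, even-numbered steps on set 2.
    """
    if step < 1:
        raise ValueError("steps are numbered from 1")
    return 1 if step % 2 == 1 else 2


def max_px_slaves(table_dops: Sequence[int]) -> int:
    """Upper bound on PX slaves for a statement without subqueries or
    set operations: twice the highest Degree of Parallelism requested."""
    return 2 * max(table_dops)


if __name__ == "__main__":
    for step in range(1, 5):
        print(f"step {step}: PX slave set {slave_set_for_step(step)}")
    print("slave ceiling for DOPs 4, 8, 4:", max_px_slaves([4, 8, 4]))
```

For the three-table plan above, steps 1 through 4 map to slave sets 1, 2, 1, 2, which matches the step annotations in the plan.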
Host and Database Design Considerations
Nobody will argue with the fact that it is much easier to build a system correctly from the ground up rather than to redesign
and retrofit when performance problems present themselves. It is a frustrating experience to spend a lot of effort setting up a
data warehouse with the intention of heavily utilizing PX only to find that it’s not working as expected. Often this
happens because the DBAs and System Administrators designing the data warehouse have most of their experience with
OLTP systems, which have very different requirements from data warehouse systems.
The most common performance problems encountered when implementing PX have two sources. The first source is that the
hardware may not be suitable or configured properly for use as a data warehouse where parallel operations are heavily
utilized. The second source of problems is caused by a database physical design that is not properly optimized to support
parallel operations. The usual culprit in both cases is resource contention. When designing and configuring a host system
and database that will rely heavily on PX, a number of general design principles should be employed. Many of these design
points are only applicable if a new system will be procured, but others can be employed on an existing system.
• A RAID 1 (mirroring) disk configuration offers more tuning options than RAID 0+1 (striping+mirroring) because
objects can be manually striped across multiple physical devices. Arguably, RAID 1 takes a bit more administration
because of the manual striping effort, but the resulting performance is worth it.
• For a fixed quantity of disk capacity, many smaller mount points are likely to perform better than a few large ones.
For example, for a 700GB disk configuration, ten 70GB disks will perform better than four 180GB disks. Having
more disks reduces physical disk contention.
• Adding high throughput disk controllers (e.g., Fibre Channel) will increase throughput between the disk subsystem
and the host. PX utilizes Direct Path operations whenever possible. This typically occurs when performing full
table and partition scans, and when specified on INSERT statements. In the absence of the buffer cache bottleneck,
remarkably high I/O throughput rates can be achieved if the I/O subsystem is properly designed and configured.
• For a fixed quantity of CPU capacity, a larger number of less powerful CPUs is likely to perform better than a smaller
number of more powerful ones. For a fixed amount of performance (e.g., SPECint92, SPECbase, SPECrate), if the choice is between
a 4-CPU or a 12-CPU configuration, the 12-CPU configuration is likely to perform better by reducing CPU
contention. The assumption here is that there is adequate bus bandwidth to support the higher number of CPUs.
• When constructing tablespaces that will house large tables, many smaller datafiles will perform better than a few
large ones if they are striped over many mount points. This once again reduces disk contention. For example, if a
10GB tablespace is required, forty 256MB data files will perform better than five 2GB data files.
• The same paradigm holds true when constructing tablespaces to hold temporary and rollback segments. While the
amount of data sorted will usually remain the same whether or not a query utilizes PX, the disk sort activity at a
point in time when a parallel sort is running will be increased by a factor approximately equal to the Degree of
Parallelism of the query. Rollback segments will experience a similar increase in activity when a parallel DML
statement is executed.
• Continuing on the theme of reducing disk contention, the online redo and archive log destinations should be used
exclusively for those purposes. Although much space may remain unused, contention on the online redo and archive
log disks during parallel DML operations can be the bottleneck on an otherwise well designed system.
• When creating large tables or table partitions, many smaller extents will perform better than a few large ones. For
example, if a partitioned table will be created, 50 to 100 smaller extents per partition will perform better than
one or two large extents. Each new extent is allocated round robin style from tablespace datafiles that have space
available. If datafiles have been striped across many mount points, parallel full-table and partition scans will
perform better because the data is spread across many physical disks. Here, Locally Managed Tablespaces should
be employed.
• When creating partitioned tables, many smaller partitions will perform better than a few large ones. For example, if
a partitioned table will be created to house five years’ worth of data, a partitioned table created with 260 weekly
partitions or 60 monthly partitions will likely perform better than a table with 5 annual partitions. Choosing the
partition granularity also requires careful analysis of the types and volumes of SQL statements that will access the
table, data loading, and backup and recovery strategies.
In each of these examples, spreading the work over more objects reduces resource contention and potentially improves PX
throughput. If the optimizer determines that the specified Degree of Parallelism will degrade query performance because of
resource contention, or that there are too few objects over which to spread the work, it will reduce
the Degree of Parallelism behind the scenes, much to the chagrin of the user.
Configuring Parallel Execution
To this point, we’ve looked at an overview of the PX process architecture and its host requirements. In this section, more
details about the process and memory architecture will be discussed, and the relevant configuration parameters will be
explained. All of these parameters appear in the database parameter file (init.ora). Some may be changed dynamically using
ALTER SYSTEM and ALTER SESSION commands. Others require the database to be restarted to take effect. The first set
of parameters to be discussed affects the number of slaves in the slave pool and how they are used by
statements that utilize them.
PX Slave configuration
When the optimizer determines that an operation will run using PX, the Query Coordinator recruits slaves from the PX slave
pool. It can only use slaves that are currently idle, or inactive. Once idle slaves are found, each of the slaves that will be
used is marked as active, and no other query can use them until they are released when the current parallel operation has
completed. Since PX slaves can only be used by one Query Coordinator at a time, the number of slaves available in the slave
pool is one of the primary limiting factors on the number of concurrent parallel operations. Additionally, PX slaves are
borrowed from the slave pool on a first-come, first-served basis.
The PX slave pool consists of a number of PX slaves that are available to all operations that will utilize PX on the database
instance. The PARALLEL_MIN_SERVERS parameter is used to determine the minimum number of PX slaves in the pool
and PARALLEL_MAX_SERVERS determines the maximum number. When the database instance is started, Oracle creates
a PX slave pool with the minimum number of slaves. As various PX operations begin to execute, the number of slaves in the
pool that are idle eventually becomes depleted.
If there is an insufficient number of PX slaves available in the pool to support a parallel operation, the value of
PARALLEL_MAX_SERVERS is inspected. If the value is higher than the value of PARALLEL_MIN_SERVERS,
additional PX slave processes will be dynamically created (spawned), added to the pool, and recruited by parallel operations
that need them. Therefore, the maximum number of PX slaves that will ever reside in the pool is
PARALLEL_MAX_SERVERS.
As each Query Coordinator completes its parallel operations, the PX slaves become idle again and are now available for use
by other parallel operations. If there are more slaves than PARALLEL_MIN_SERVERS in the PX slave pool, the excess
slaves will be terminated after a five minute period of inactivity. For example, let’s say that
PARALLEL_MIN_SERVERS=16 and PARALLEL_MAX_SERVERS=64 and all 64 slaves are active in the PX slave pool
because many PX queries were running concurrently. After this period of high activity, no parallel statements are run for a
while. When this occurs, the number of PX slaves in the slave pool will be reduced from 64 to 16.
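The pool behavior described above can be mimicked with a small Python model. It is a simplified sketch, not Oracle's implementation: the class `PxSlavePool` and its method names are hypothetical, it tracks only slave counts, and the five minute idle timeout is reduced to a single `expire_idle` call.

```python
class PxSlavePool:
    """Toy model of the PX slave pool sizing rules described in the text."""

    def __init__(self, min_servers: int, max_servers: int) -> None:
        if min_servers > max_servers:
            raise ValueError("min_servers must not exceed max_servers")
        self.min_servers = min_servers  # PARALLEL_MIN_SERVERS
        self.max_servers = max_servers  # PARALLEL_MAX_SERVERS
        self.total = min_servers        # slaves currently in the pool
        self.active = 0                 # slaves held by a Query Coordinator

    def acquire(self, requested: int) -> int:
        """Hand out up to `requested` slaves, spawning new ones if needed.

        Returns the number actually granted, which may be fewer than
        requested once the pool has reached max_servers.
        """
        idle = self.total - self.active
        if requested > idle:
            # Spawn only as many as needed, never past max_servers.
            spawn = min(requested - idle, self.max_servers - self.total)
            self.total += spawn
            idle += spawn
        granted = min(requested, idle)
        self.active += granted
        return granted

    def release(self, count: int) -> None:
        """Return slaves to the pool when a parallel operation completes."""
        self.active -= min(count, self.active)

    def expire_idle(self) -> None:
        """Model the idle timeout: drop idle slaves above min_servers."""
        self.total = max(self.min_servers, self.active)


if __name__ == "__main__":
    pool = PxSlavePool(min_servers=16, max_servers=64)
    pool.acquire(64)      # busy period: the pool grows to 64 slaves
    pool.release(64)      # all parallel operations complete
    pool.expire_idle()    # after the idle period the pool shrinks
    print(pool.total)     # 16
```

With the values from the example, the pool grows from 16 to 64 slaves under load and shrinks back to 16 once the slaves have been idle.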
The value for PARALLEL_MIN_SERVERS should be set to a value such that there are enough slaves readily available in
the slave pool to cover the average PX operation load on the system. The value for PARALLEL_MAX_SERVERS should
be set to the maximum number of slaves that can be created, and is usually based upon practical CPU and I/O limits. Some
consideration should be given to choosing the two values. If the minimum number of servers is set too low, then some
queries will always be waiting for additional PX slaves to be created, causing them to run longer. Setting the value too high
simply causes more system memory to be consumed because the idle processes remain in the slave pool. If memory is not
much of an issue, then it is acceptable to set the two parameters to the same value so that the maximum number of PX slaves
will always be available for use.
A good reason why one might choose two different values is when the PX operation load varies greatly over time. Suppose
that most of the data are loaded nightly during a four-hour window. Many of the operations are parallelized including data
loads, materialized view and index rebuilds, and the creation of summary tables. There are few total sessions running on the
system during this period, but they are all active. This represents the peak-processing load on the system. However during
the day, there are hundreds of concurrent users on the system but only a few of them utilize PX for running parallel queries.
Here, it would be appropriate to choose a low value for PARALLEL_MIN_SERVERS to support the requirements during the
day and a much higher value for PARALLEL_MAX_SERVERS to support nightly processing requirements. During the day,
additional slaves would be dynamically added to the pool if required to support parallel queries, up to the value of
PARALLEL_MAX_SERVERS. Otherwise, the memory would be available to support the higher volume of concurrent
users. During the nightly load window, a much higher number of PX slaves will remain in the pool because of their frequent
use.
Now what happens as the number of slaves currently being used approaches the value of PARALLEL_MAX_SERVERS? If
too few PX slaves are available to satisfy the requirements of a particular parallel operation, the operation will
consider using whatever slaves are available. The value of the PARALLEL_MIN_PERCENT parameter is used to decide
whether or not an operation will proceed using PX when the requested number of PX slaves is not available. Consider a
query where 16 slaves have been requested, but only 10 are available. If the value of PARALLEL_MIN_PERCENT was 50,
meaning that at least 50% of the requested slaves must be available for the operation to proceed in parallel, the operation will
proceed if there are a minimum of 8 slaves available. All 10 slaves will be used in this case. If the value of
PARALLEL_MIN_PERCENT was 75, the minimum number of slaves required to proceed in parallel is 12. Now the
statement will fail to execute and produce the following error message.
ORA-12827: insufficient parallel query slaves available.
If PARALLEL_MIN_PERCENT is set to 0, queries will utilize whatever PX slaves are available and proceed in parallel. If
no PX slaves are available, the query will execute serially with no warning or other indication, other than a longer
elapsed time.
When PARALLEL_MIN_PERCENT is set to a value of 100, queries will not proceed in parallel unless all PX slaves
requested are actually acquired. Setting PARALLEL_MIN_PERCENT to a value of 100 assures that parallel operations will
behave more predictably. Since all PX slaves must be acquired for the operation to proceed in parallel, queries that run
successfully will do so with similar execution times.
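The PARALLEL_MIN_PERCENT rule can be expressed as a short Python function. This is a sketch of the decision as described above, not Oracle code: the function name `slaves_granted` is hypothetical, and rounding the required slave count up is an assumption, since the examples in the text divide evenly.

```python
def slaves_granted(requested: int, available: int, min_percent: int) -> int:
    """Return the number of PX slaves an operation would run with.

    Returns 0 when the operation falls back to serial execution, which
    can happen only when min_percent is 0. Raises RuntimeError to stand
    in for ORA-12827 when too few slaves are available.
    """
    if not 0 <= min_percent <= 100:
        raise ValueError("min_percent must be between 0 and 100")
    # Minimum slaves needed to proceed in parallel (integer ceiling;
    # the rounding direction is an assumption).
    required = (requested * min_percent + 99) // 100
    granted = min(requested, available)
    if granted < required:
        raise RuntimeError(
            "ORA-12827: insufficient parallel query slaves available"
        )
    return granted


if __name__ == "__main__":
    print(slaves_granted(16, 10, 50))  # 10: at least 8 needed, 10 available
    print(slaves_granted(16, 0, 0))    # 0: runs serially without warning
    try:
        slaves_granted(16, 10, 75)     # 12 needed, only 10 available
    except RuntimeError as err:
        print(err)
```

The three calls reproduce the cases discussed above: a threshold of 50 proceeds with all 10 available slaves, a threshold of 0 silently runs serially, and a threshold of 75 fails.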
The purpose of the PARALLEL_ADAPTIVE_MULTI_USER parameter is to help preserve overall database performance
during periods of high system load. It permits the requested Degree of Parallelism to be automatically reduced on new
parallel queries during periods of high database activity. If for example, the requested Degree of Parallelism is 16 but the
system load is high when the parallel operation begins to execute, the Degree of Parallelism might be reduced to 8. This
reduction will cause the query to run longer than if the original requested Degree of Parallelism were employed.
The PARALLEL_THREADS_PER_CPU parameter is used to adjust the load on each CPU when
PARALLEL_ADAPTIVE_MULTI_USER is enabled. The value represents the average number of PX slaves that each CPU
can process concurrently. The default value of 2 is usually adequate. If the host system has a few high-powered CPUs rather
than many lower performance CPUs, increasing the value may improve throughput. Likewise, if the host system has a
slower I/O subsystem, increasing the value may improve PX throughput.
The PARALLEL_BROADCAST_ENABLED parameter can significantly improve parallel query performance at the expense
of additional PGA memory. If a small row source is being joined to a large row source in either a merge or hash join, each of
the PX slaves processing the large row source may require the entire small row source for the join. For example, if the
DOP for a hash join is 8 and a small table must be scanned by each of the PX slaves, the small table would be scanned 8
times if PARALLEL_BROADCAST_ENABLED is set to false, the default value. If the parameter is set to true, the table
will be scanned once by the Query Coordinator and the results will be delivered to each of the 8 PX slaves. Enabling this
parameter results in more memory usage within the PGA of each PX slave process.
SGA Memory Configuration
The next set of parameters to be discussed affect the SGA memory structures associated with PX. Oracle uses message
buffers to pass data between the various processes during inter-operational parallel operations. These buffers reside in the
shared pool inside the SGA if PARALLEL_AUTOMATIC_TUNING is set to FALSE, and in the large pool if the parameter
value is set to a value of TRUE. A fixed number of buffers is required for each producer-consumer connection. The number
of connections for a particular query varies with the square of the maximum Degree of Parallelism of the query since each
producer has a connection to each consumer. If the maximum DOP of the query is 4, then there could be as many as 16
connections. If the maximum Degree of Parallelism were 8, there would be 64 connections. Based upon this relationship,
one can see that memory requirements for message buffers increase greatly as the maximum value of the DOP increases.
The system-wide amount of memory required for the message buffers also depends upon the number of concurrent queries
executing parallel statements, and the size of the message buffer. Based upon the variables, computing the buffer space is
quite inexact and is based upon a number of assumptions. We are simply interested in computing a ballpark number. We
want to know if the requirement is 20MB vs. 200MB, not 20.2MB vs. 20.5MB. The following formula from the Oracle
documentation can be used to compute the amount of memory required for PX message buffers on SMP systems:
buffer space = (3 x size x users x groups x connections)
where
size = PARALLEL_EXECUTION_MESSAGE_SIZE
users = The maximum number of sessions that will be concurrently issuing PX queries.
groups = The number of PX groups used per query. A value of 2 is a good nominal value.
connections = (DOP x DOP) + (2 x DOP), where DOP is the highest Degree of Parallelism that will be used for any query.
The PARALLEL_EXECUTION_MESSAGE_SIZE parameter controls the size of the messages and has a default value of
2KB if PARALLEL_AUTOMATIC_TUNING is set to FALSE, and 4KB if the parameter value is TRUE. Within certain
limits, increasing the size of the message buffer increases throughput between the producer and consumer PX slaves, which
in turn, reduces execution time. If machine memory is plentiful, increasing the value to 8KB may provide performance gains
if communication between inter-operational PX slaves is a bottleneck. If the value is increased, the SGA size must be
increased accordingly. If there is insufficient space in the PX message pool, parallel queries will fail with an ORA-4031
memory allocation failure message.
Here is an example computation for the PX message pool. Let’s use an 8KB PARALLEL_EXECUTION_MESSAGE_SIZE.
The value for size is 8192. A total of 20 users will be issuing parallel queries concurrently so the value of users is 20. Of
these 20 users, most will be using a Degree of Parallelism of 4, but several large queries use a degree of 8. So for the
computation of connections, the value of 8 for DOP should be used. The value of connections is:
connections = (8 x 8) + (2 x 8) = 80
Substituting these numbers into the above formula for buffer space yields
buffer space = (3 x 8192 x 20 x 2 x 80) = 78,643,200 bytes
Dividing by 1,048,576 to get megabytes yields 75MB. This is the minimum amount of memory that should be added directly
to the size of the shared pool or large pool depending upon the setting for PARALLEL_AUTOMATIC_TUNING. The
SHARED_POOL_SIZE and LARGE_POOL_SIZE parameters control the size of the shared pool and large pool respectively.
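The same arithmetic can be scripted so that different assumptions are easy to compare. The following Python sketch simply encodes the formula quoted above; the function names `px_connections` and `px_buffer_space` are hypothetical, and the result is a ballpark estimate only.

```python
def px_connections(dop: int) -> int:
    """Producer-consumer connections for the highest Degree of Parallelism."""
    return dop * dop + 2 * dop


def px_buffer_space(message_size: int, users: int, dop: int,
                    groups: int = 2) -> int:
    """Estimated PX message buffer space in bytes on an SMP system.

    message_size: PARALLEL_EXECUTION_MESSAGE_SIZE in bytes
    users:        sessions concurrently issuing PX queries
    dop:          highest Degree of Parallelism used by any query
    groups:       PX groups per query (2 is the nominal value)
    """
    return 3 * message_size * users * groups * px_connections(dop)


if __name__ == "__main__":
    total = px_buffer_space(message_size=8192, users=20, dop=8)
    print(f"{total:,} bytes = {total / 1_048_576:.0f} MB")
    # prints: 78,643,200 bytes = 75 MB
```

Because the connection count grows with the square of the DOP, rerunning the estimate with a DOP of 16 instead of 8 gives 288 connections and roughly 270MB for the same 20 users.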
It is desirable to locate the message buffers in the large pool because the shared pool is already heavily used for other
purposes, and is often a place for contention without further exacerbation. The large pool, on the other hand, has
considerably less activity because it has fewer uses. It is only used for the PX message pool, DB Writer I/O slaves, and
Multithreaded Server (MTS) if those features are used. Since all of the other features provided by setting
PARALLEL_AUTOMATIC_TUNING to a value of TRUE can be overridden with other parameter settings if desired, set it
to a value of TRUE so that the PX message pool will be located in the large pool. There is no other mechanism to specify
that the message pool should be located in the large pool. The above computation rests on a number of assumptions
and therefore should be treated as a starting point rather than an exact figure.
PX operations also cause more shared pool memory to be used than their serial counterparts. PX slaves are specialized
versions of the garden-variety server process so their operational characteristics are similar. So if 8 PX slaves will be used to
solve a particular query, the queries running on each slave will occupy space in the shared pool. Even if the PX message
pool resides in the large pool, the size of the shared pool should be increased to accommodate the additional memory
requirements caused by the additional SQL statements. A good starting point is to increase the size by 25% when
implementing PX on a well-tuned database, then monitor for contention to determine whether further adjustments are necessary.
Other Resource Considerations
In addition to the obvious database parameters required to configure PX, there are also a number of others that must be
reevaluated. When a SQL statement executes using PX, the Query Coordinator uses a two-phase commit strategy to commit
transactions. Consider a parallelized UPDATE statement that utilizes 8 PX slaves to update rows in a total of 30 table
partitions. This results in a total of nine transactions; one for each of the 8 PX slaves, and one for the Query Coordinator. If
the SQL statement utilizes two PX slaves per thread of execution, then the number of transactions would be 17; one for each
of the 16 PX slaves, and one for the Query Coordinator. Therefore, the parameter value of TRANSACTIONS should be
increased by the value of PARALLEL_MAX_SERVERS to handle the additional number of transactions caused by the PX
facility.
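The transaction arithmetic above is simple enough to capture in a few lines of Python. This is a sketch of the counting rule as stated in the text; the function names and the example starting value of 200 for TRANSACTIONS are hypothetical.

```python
def px_dml_transactions(dop: int, slaves_per_thread: int = 1) -> int:
    """Transactions used by one parallel DML statement.

    One transaction per PX slave plus one for the Query Coordinator.
    slaves_per_thread is 1 or 2, as in Figures 2 and 3.
    """
    if slaves_per_thread not in (1, 2):
        raise ValueError("slaves_per_thread must be 1 or 2")
    return dop * slaves_per_thread + 1


def adjusted_transactions(current: int, parallel_max_servers: int) -> int:
    """New TRANSACTIONS value after adding PARALLEL_MAX_SERVERS."""
    return current + parallel_max_servers


if __name__ == "__main__":
    print(px_dml_transactions(8))        # 9
    print(px_dml_transactions(8, 2))     # 17
    # 200 is an arbitrary illustrative starting value.
    print(adjusted_transactions(200, 64))  # 264
```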
The Query Coordinator will acquire DML locks on each of the partitions that are being processed. Each of the PX slaves will
also acquire DML locks on each of the partitions that they are processing. Therefore, the number of DML locks and enqueue
resources must both be increased considerably to handle the additional requirements imposed by PX.
If nondefault values for the PROCESSES and SESSIONS parameters are used, they must be increased by the value of
PARALLEL_MAX_SERVERS to accommodate the addition of PX slave processes.
Since the UPDATE statement modifies data, transaction information must be written to rollback segments in the event that
the transaction must be rolled back. Each slave process will utilize its own transaction space within a rollback segment since
each is treated as a separate transaction. The total amount of space used in rollback segments is not considerably higher than
that used by a serial transaction. But the overall load on the rollback segments while the statement is executing will be
considerably higher. This is caused because all of the work has been spread across multiple PX slaves to process the
statement within a shorter period of time. While parallel DML statements are running, verify that rollback segment
contention is minimal. If contention exists, then the rollback segments should be tuned accordingly. Performance on the
disks on which the rollback segments reside should also be monitored for contention.
Likewise, when parallel DML statements are executing, redo entries are being generated at a much faster rate. This will
affect all of the structures associated with the log writer (LGWR), including the log buffer in the SGA and the number
and size of the online redo logs. This facility should be monitored closely when parallel DML statements are executing to
determine if tuning is necessary to improve performance.
PGA Memory Issues
The next set of parameters affects the size of the Process Global Area (PGA) of the PX slave processes. The PGA is the private
memory area within each server or PX slave process. Sort and hash areas are located within this memory structure. Once
this memory has been allocated in the PGA, it remains allocated for the life of the server process.
Sort and hash memory areas can account for a considerable amount of overall memory usage on the database host,
particularly when PX is employed. If the database has been configured so that 10MB of memory sort area may be allocated,
then up to 10MB will be allocated for each sort in the query that runs serially. However, the same query running in parallel
with a degree of 16 may use 170MB of sort memory per sort because each PX slave process and the Query Coordinator are
each able to allocate 10MB. If the query performs multiple sorts, as in a merge join, then considerably more than 170MB of
memory will be allocated.
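The arithmetic behind the 170MB figure can be checked directly. This is only a worked calculation using the values from the paragraph above (a 10MB sort area, a degree of 16, and one Query Coordinator); the column alias is illustrative.

```sql
-- 16 PX slaves plus 1 Query Coordinator, each able to allocate 10MB per sort
SELECT 10 * (16 + 1) AS sort_mb_per_sort
  FROM dual;
-- returns 170
```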
Data are sorted any time an operation such as GROUP BY or ORDER BY is performed, and when merge joins
are performed. The SORT_AREA_SIZE parameter controls the amount of memory that can be allocated within the PGA for
a single sort. For small sorts, only the amount of memory required to perform the sort will be allocated. If a sort cannot be
performed entirely within memory, then the additional space required to perform the sort is allocated from temporary
segments on disk. With the high data volumes processed within a data warehouse, some sorts will be performed using
temporary segments on disk. When this occurs, multiple sort runs are performed in memory while the results of each run are
stored on disk. Even though disk sorts occur, sorting time will decrease as the size of the in-memory sort area increases
because fewer sort runs will be required to complete the sort. Query performance will be optimal if the sort is performed
entirely in memory.
The SORT_AREA_RETAINED_SIZE parameter controls the amount of memory retained in the PGA from a sort after it
completes. It affects the total amount of sort memory within the PGA when multiple sorts are performed within a single
query. In a merge join, for example, each of the two row sources must be sorted before they are merged. Sort memory up to
the value of SORT_AREA_SIZE is allocated to perform the first sort. After the first sort has completed, all of the data that
cannot fit within the value of SORT_AREA_RETAINED_SIZE are written to disk. For example, if SORT_AREA_SIZE is
1MB and SORT_AREA_RETAINED_SIZE is 512KB, after the first sort completes 512KB of the data will be written to disk
and the remaining 512KB will be retained in memory. To sort the second row source, an additional 1MB of memory is
allocated to perform the sort. At the end of the second sort, the total sort memory used is now 1.5MB: 512KB retained from
the first sort, and 1MB from the second sort. Recursive statements may allocate additional PGA memory for sorting.
Therefore, when budgeting additional host memory, allow at least the sum of SORT_AREA_SIZE and
SORT_AREA_RETAINED_SIZE for each of the maximum number of PX slaves that will be configured.
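As a rough sketch of that memory budget, the query below reads the current parameter values from V$PARAMETER and multiplies the sum of the two sort parameters by PARALLEL_MAX_SERVERS. It is an estimate under stated assumptions, not a sizing formula taken from the Oracle documentation: it assumes one sort area plus one retained area per slave, and it assumes that a SORT_AREA_RETAINED_SIZE value of 0 means the parameter is following SORT_AREA_SIZE. The column aliases are illustrative.

```sql
-- Estimated sort memory (MB) if every configured PX slave allocates
-- one full sort area plus one retained area.
SELECT ROUND((sort_area + DECODE(retained, 0, sort_area, retained))
             * max_servers / 1024 / 1024) AS px_sort_budget_mb
  FROM (SELECT MAX(DECODE(name, 'sort_area_size',          TO_NUMBER(value))) AS sort_area,
               MAX(DECODE(name, 'sort_area_retained_size', TO_NUMBER(value))) AS retained,
               MAX(DECODE(name, 'parallel_max_servers',    TO_NUMBER(value))) AS max_servers
          FROM v$parameter
         WHERE name IN ('sort_area_size',
                        'sort_area_retained_size',
                        'parallel_max_servers'));
```

Queries that perform several sorts or hash joins per slave can exceed this figure, so it is better treated as a floor than as a ceiling.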
The hash area used for hash joins has similar properties to the in-memory sort area. Whenever a query performs a hash join,
memory is allocated in the PGA to hold the hash table of the first row source in the join. The value of HASH_AREA_SIZE
determines the maximum amount of memory that can be allocated per session when a hash join is performed. Like
SORT_AREA_SIZE, hash join memory can be allocated by each PX slave that performs the hash join.
PARAMETER                       | DEFAULT               | ALTER SYSTEM?       | ALTER SESSION? | AUTO TUNING DEFAULT
--------------------------------+-----------------------+---------------------+----------------+----------------------
DML_LOCKS                       | 4 x TRANSACTIONS      | No                  | No             | 4 x TRANSACTIONS
ENQUEUE_RESOURCES               | Derived from SESSIONS | No                  | No             | Derived from SESSIONS
HASH_AREA_SIZE                  | 2 x SORT_AREA_SIZE    | Yes                 | Yes            | 2 x SORT_AREA_SIZE
LARGE_POOL_SIZE                 | 0                     | No in 8i, Yes in 9i | No             | Derived
PARALLEL_ADAPTIVE_MULTI_USER    | FALSE                 | Yes                 | No             | TRUE
PARALLEL_AUTOMATIC_TUNING       | FALSE                 | No                  | No             | N/A
PARALLEL_BROADCAST_ENABLED (8i) | FALSE                 | No                  | Yes            | FALSE
PARALLEL_EXECUTION_MESSAGE_SIZE | 2048                  | Yes                 | Yes            | 4096
PARALLEL_MAX_SERVERS            | 5                     | No                  | No             | CPU_COUNT x 10
PARALLEL_MIN_PERCENT            | 0                     | No                  | Yes            | 0
PARALLEL_MIN_SERVERS            | 0                     | No                  | No             | Derived
PARALLEL_THREADS_PER_CPU        | 2                     | Yes                 | No             | 2
PROCESSES                       | Derived               | No                  | No             | Derived
SESSIONS                        | 1.1 * PROCESSES + 5   | No                  | No             | 1.1 * PROCESSES + 5
SHARED_POOL_SIZE                | Derived               | No in 8i, Yes in 9i | No             | Derived
SORT_AREA_RETAINED_SIZE         | SORT_AREA_SIZE        | Yes                 | Yes            | SORT_AREA_SIZE
SORT_AREA_SIZE                  | 64KB                  | Yes                 | Yes            | 64KB
TRANSACTIONS                    | 1.1 x SESSIONS        | No                  | No             | 1.1 x SESSIONS
Table 1. PX Configuration Parameters
The total amount of memory that will be used for sorting and performing hash joins by all PX slave processes should be
accounted for when configuring PX. When configuring these parameters, choose conservative values and monitor overall
system memory usage during peak loads to verify that there is sufficient memory on the host machine to comfortably support
the load. If additional memory is available that could be used for sorting and performing hash joins, the parameter values can
be increased if additional performance is desired.
Enabling parallel broadcasting by setting the value of PARALLEL_BROADCAST_ENABLED to true will also cause more
PGA memory to be used. The additional memory usage will be proportional to the size of the row set being broadcast and
may range from a few kilobytes up to tens or hundreds of megabytes. It is best to enable this parameter after a stable PX
configuration has been established, so that it can be monitored in the absence of other changes.
Automatic PX Configuration
By this time, PX may appear difficult and complex to configure. However, to make configuration easier,
Oracle has provided the PARALLEL_AUTOMATIC_TUNING parameter. When set to a value of TRUE, all of the
parameters associated with PX are configured to a set of reasonable values. Table 1 shows all of the parameters relevant to
PX including the less obvious ones, and the effects of setting PARALLEL_AUTOMATIC_TUNING to a value of TRUE.
All of the parameters in the table should be considered when configuring and monitoring the PX facility, although some have
only an indirect effect on performance. The table can be used effectively as a checklist.
Recommended Starting Configuration
The following parameter settings usually provide a stable starting point when setting up PX on the host configurations mentioned in
the introduction of this article. They reflect a conservative approach that should produce good PX throughput without
overburdening the host system.
PARALLEL_AUTOMATIC_TUNING=TRUE
PARALLEL_MAX_SERVERS – Set to a value of CPU_COUNT x 3
PARALLEL_MIN_SERVERS – Set to a value of PARALLEL_MAX_SERVERS / 4
LARGE_POOL_SIZE – Use the value computed from the above formula
SHARED_POOL_SIZE – Increase the current value by 20% to accommodate PX SQL statements
In addition to reducing the complexity of PX configuration, enabling Parallel Automatic Tuning is the only way to locate the
PX message buffers in the large pool. If nondefault values for PROCESSES, SESSIONS, LARGE_POOL_SIZE, or
TRANSACTIONS are already in use, refer to the Oracle documentation listed in the Bibliography to adjust the values
accordingly.
Leaving PARALLEL_MAX_SERVERS at the default value of CPU_COUNT x 10 may overload the host, even on a
database host and database designed and configured specifically for PX. Setting the value of PARALLEL_MIN_SERVERS
to a value of PARALLEL_MAX_SERVERS / 4 leaves at least some servers in the slave pool ready for immediate use. After
monitoring the behavior over several days or weeks, begin to make incremental adjustments as necessary to tune performance,
and to take advantage of additional free host system resources.
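The two server-count recommendations can be derived from the CPU count that the instance reports. The query below is a small sketch of that arithmetic; it assumes the CPU_COUNT parameter reflects the CPUs actually available to the instance, and rounding down with TRUNC is a choice made here rather than something the recommendation specifies. On a hypothetical 8-CPU host it yields 24 and 6.

```sql
-- Starting values: PARALLEL_MAX_SERVERS = CPU_COUNT x 3,
--                  PARALLEL_MIN_SERVERS = PARALLEL_MAX_SERVERS / 4
SELECT TO_NUMBER(value)                AS cpu_count,
       TO_NUMBER(value) * 3            AS parallel_max_servers,
       TRUNC(TO_NUMBER(value) * 3 / 4) AS parallel_min_servers
  FROM v$parameter
 WHERE name = 'cpu_count';
```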
Implementing Parallel Operations
The overall goal in implementing PX is to improve throughput by utilizing excess resources available on the database
instance. Excess resources may be available at all times, or only at certain times of the day, such as the evening when
end-user loads are minimal. PX should not be implemented on instances that are already overburdened or that are near their
resource limit.
There are two methods of implementing Parallel Query and Parallel DML operations. The first method involves setting the
Degree of Parallelism on tables and indexes as in the following example.
SQL> alter table f_monthly_acctg parallel(degree 8);
Table altered.
The default value for degree is 1. Once the above statement is executed, any queries that access the f_monthly_acctg table
may utilize PX if the optimizer determines that it is appropriate. Certainly, this is the least invasive method of implementing
PX because in many cases, neither SQL statements nor applications must be changed. The risk here is that queries that
inappropriately perform full-table scans on large tables may cause significant performance degradation on the entire database
if they are executed simultaneously using PX by multiple users.
On partitioned tables with local indexes, PX can be used to perform index range scans. To utilize PX in this capacity, the
degree of the index must be altered as shown in the following example.
SQL> alter index f_monthly_acctg_ix03 parallel(degree 8);
Index altered.
Before implementing PX by altering the degree at the object level, it is prudent to review SQL statements running against the
candidate tables either through the V$SQL dynamic performance view, or by code review, to make sure that all queries are
optimized. PX should not be used as a quick fix for missing or inadequate indexes. PX implemented on poorly tuned queries
may result in disastrous performance.
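A review through V$SQL might start with a query along the following lines. It is only a sketch: the table name is the one used in the example above, ordering by DISK_READS is an arbitrary way of surfacing the most expensive statements first, and V$SQL shows only the statements that are still cached in the shared pool.

```sql
-- Statements in the shared pool that reference the candidate table,
-- most physical I/O first.
SELECT sql_text, executions, disk_reads, buffer_gets
  FROM v$sql
 WHERE UPPER(sql_text) LIKE '%F_MONTHLY_ACCTG%'
 ORDER BY disk_reads DESC;
```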
Also, determine the number of queries that will potentially use PX, and how long they will run to get an idea of the expected
increase in system load. Queries that run in the 5 second to 5 minute range are good candidates for implementing PX using
this method. Queries that run in less than 5 seconds will usually not benefit significantly because of the overhead associated
with PX. If queries run longer than 5 minutes, then there is a risk that many users will execute them simultaneously, causing
significant overall database performance degradation.
Some queries may still need to be tuned to utilize PX even though the table and/or index degree has been increased from a
value of one. Information about tuning queries to utilize PX is contained in the references listed in the Bibliography section.
The second, and preferred, approach is to modify the SQL statements to use PX through the use of hints as shown in the
statement below. While this may be more invasive to applications than the first method, PX operations occur in a predictable
and controllable manner.
SELECT /*+ PARALLEL(s,8) */ quarter_num, sum(acctg_num)
FROM s_daily_acctg s
WHERE year_num = 2002
AND acct_num = 70035
GROUP BY quarter_num;
DDL statements require special directives to utilize PX. Maintenance tasks such as gathering table statistics and rebuilding
indexes are performed under the control of Database Administrators so performance can be evaluated and controlled. Other
DDL statements that create temporary objects, as part of the data loading process, are usually performed by experienced
developers, possibly with the help of a Database Administrator. So here too, performance can be predicted and controlled.
In a production data warehouse environment, many of these tasks can be performed during off-peak hours when plenty of
excess database resources are available. Shortening the time of data loading windows is a particularly good use for PX.
So what is the best method? That depends upon when and where the extra performance is required. The following
implementation methodology lowers the risk of overall poor database performance when implementing PX.
1. For queries and DML statements that run during peak database load times
   a. First, verify that all of the queries accessing the tables of interest are optimized. If the queries are properly
      tuned and perform well, it may not be necessary to implement PX.
   b. After tuning, verify that there are sufficient database resources available to implement PX. There should be
      a considerable amount of CPU idle time available with few disk waits.
   c. If there are many different statements running against a particular table that could benefit from PX, then
      implement PX by altering DEGREE on the table and/or index. This is the more aggressive approach and
      performance should be monitored to verify that overall database performance is maintained.
   d. If there are few statements that could benefit from PX, tune and modify the queries using hints. This is a
      more conservative approach, but performance should still be monitored to verify that overall database
      performance is not impaired. These are likely to be long-running reports that run rather infrequently.
2. For all other tasks that run during off-peak hours
   a. Identify the tasks that are good candidates to utilize PX. These include data loading tasks as well as
      maintenance tasks that require considerable time to complete.
   b. Verify that all of the SQL statements are fully optimized. If the queries are properly tuned and perform
      well, it may not be necessary to implement PX.
   c. Tune SQL statements for PX and implement PX using hints.
   d. For other maintenance tasks, implement PX using the directives specific to each statement.
   e. Verify performance so that overall database performance is maintained.
The optimizer may reduce the run-time, or actual, Degree of Parallelism because certain conditions are not met. Here are
four of the most common reasons why this will occur, an example of each, and how it may be avoided.
•
Parallel operations on too small a table. A degree of 10 is specified for a nonpartitioned table that has a total of 500
data blocks. With today’s high performance disk storage subsystems, scanning 500 blocks may require less than a
second. The optimizer may determine that the overhead associated with parallelizing a query that scans 500 data
blocks will take more time than performing a scan without PX (serially). Here, the overhead associated with
parallelizing the query leads to the reduction in DOP. For this case, choose a Degree of Parallelism that is
appropriate for the size of the table.
•
Disk contention. A degree of 10 is specified for a nonpartitioned table that has a total of 20,000 data blocks. The
tablespace on which the table resides has two large datafiles. The DBAs thought it would be easier to create and
maintain a few large datafiles instead of many smaller ones. Additionally, the table has a total of only five extents.
In this case, the optimizer will determine that 10 PX slaves operating on two datafiles will cause disk contention and
will reduce the Degree of Parallelism. To avoid this problem, create many smaller tablespace datafiles across many
disk mount points. Set the next extent size for the table to a small enough value so that extents will be allocated
across many datafiles.
•
CPU contention. A degree of 20 is specified for a partitioned table that has a total of 20,000 data blocks. The table
has a total of 80 partitions and data is distributed fairly uniformly between all of the partitions (good job!). The
database is hosted on a single-instance system that has a total of 4 CPUs. Here, the optimizer will determine that 20
PX slaves operating on 4 CPUs will cause CPU contention and will reduce the Degree of Parallelism accordingly.
Values for the degree should be chosen that are reasonable for the number of CPUs on the host system. A good rule
of thumb is to choose a value for DEGREE that is no greater than the number of CPUs per node on the system. If
inter-operational parallelism occurs in the query, a total of 2 x CPU_COUNT slaves will be brought to bear on the
query.
•
Poorly partitioned table. A degree of 20 is specified for a partitioned table that has a total of 50,000 data blocks.
The table is range partitioned by month, and has a total of 36 partitions. A query that will produce a quarterly report
accesses data in a total of three partitions. Since partition level granularity will be used to divide the workload, the
optimizer will reduce the Degree of Parallelism to 3 since only one PX slave can operate in each partition when
partition level granularity is employed. To utilize the desired Degree of Parallelism, the table must be range
partitioned in shorter intervals so that the three months’ worth of data resides in at least 20 partitions, and preferably
a much higher number for PX to perform efficiently.
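One way to see whether the requested degree was actually granted is to query V$PQ_SESSTAT in the same session immediately after the parallel statement completes. This is a minimal sketch that assumes the standard columns of that view; comparing the LAST_QUERY value of a row such as Server Threads with the number of slaves expected for the requested degree should show whether a reduction took place.

```sql
-- Run in the same session, right after the parallel statement finishes.
-- LAST_QUERY reports the most recent statement; SESSION_TOTAL is cumulative.
SELECT statistic, last_query, session_total
  FROM v$pq_sesstat;
```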
Common Parallel Data Warehouse Operations
While there are many PX operations that can be performed, the ones discussed in this section typically account for over 90%
of all PX operations on a data warehouse, and they are easy to implement.
Parallel Query
Parallel Query (PQ) was one of the first parallel operations that was introduced in Oracle version 7. PQ has a number of very
good uses. On both partitioned and nonpartitioned tables, it can be used to perform full-table scans when all of the rows in
the table must be processed. This occurs when building drill across and summary tables, and materialized views. It is also
useful for scanning tables when nonselective report queries are executed.
On nonpartitioned tables, the scan is performed by splitting the table into multiple ranges of ROWIDs. Each PX process
operates on one or more ranges of ROWIDs. This is referred to as block range granularity. On partitioned tables, if one
partition is scanned, the same paradigm is employed. If multiple partitions are scanned, then each PX slave scans one or
more table partitions. This is referred to as partition level granularity.
If multiple partitions are accessed within a partitioned table, PX can be employed to perform index range scans on local
indexes. Here, each PX slave scans one or more index partitions. It soon becomes obvious that the design of partitioned tables
and the associated index strategy will have a major impact on how well PX will perform on them.
The example below shows a parallel query implemented using a query hint. To use the PARALLEL hint, specify the table
and the desired degree. If the table is aliased, as in the example below, the alias must be used in the hint. In this example, a
Degree of Parallelism of 8 is requested to process the s_daily_acctg table.
SELECT /*+ PARALLEL(s,8) */ quarter_num, sum(acctg_num)
FROM s_daily_acctg s
WHERE year_num = 2002
AND acct_num = 70035
GROUP BY quarter_num;
An optional third argument of the hint is available to specify the number of instances to use when Oracle Parallel Server
(OPS) or Real Application Clusters (RAC) is implemented. In a single instance environment, specifying instances as in the
following example has unintended results.
SELECT /*+ PARALLEL(s,8,8) */ quarter_num, sum(acctg_num)
FROM s_daily_acctg s
WHERE year_num = 2002
AND acct_num = 70035
GROUP BY quarter_num;
In a RAC environment, this query requests a total of 64 PX slaves, 8 slaves on each of 8 instances. In a single instance
environment, it will still result in 64 PX slaves (8 x 8) being requested! The instances argument of the PARALLEL hint
should not be used in single instance environments.
Parallel DML
Update and delete statements on partitioned tables can be parallelized when the operation includes multiple partitions. Each
PX slave processes one or more partitions. Parallel DML must explicitly be enabled at the session level before executing
parallel DML statements.
SQL> ALTER SESSION ENABLE PARALLEL DML;
Session altered.
Now, any DML statements executed within the session may utilize PX. Simple statements may be executed as in the
following example.
SQL> UPDATE /*+ PARALLEL(dl,8) */ ddl_log dl
2 SET status_txt = 'SUCCESS'
3 WHERE status_ind = 1;
107288 rows updated.
SQL> COMMIT;
Commit complete.
Because parallel DML statements utilize a two-phase commit strategy, either a commit or rollback statement must be
executed after each parallel DML statement completes. Any other statement executed after a parallel DML statement has
completed will produce an error.
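In a script, the pattern described above amounts to ending every parallel DML statement with its own COMMIT or ROLLBACK before anything else is issued. The sketch below reuses the ddl_log table from the earlier example; the DELETE predicate is hypothetical, and disabling parallel DML at the end is optional housekeeping rather than a requirement stated in this paper.

```sql
ALTER SESSION ENABLE PARALLEL DML;

UPDATE /*+ PARALLEL(dl,8) */ ddl_log dl
   SET status_txt = 'SUCCESS'
 WHERE status_ind = 1;
COMMIT;                      -- end the transaction before the next statement

DELETE /*+ PARALLEL(dl,8) */ FROM ddl_log dl
 WHERE status_ind = 0;       -- hypothetical predicate, for illustration only
COMMIT;

ALTER SESSION DISABLE PARALLEL DML;
```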
In compound statements, PX can be employed in the select statement, the DML statement, or both. Each is configured and
tuned separately as shown in the following example that aggregates daily fact data into a monthly fact table.
INSERT /*+ PARALLEL(s,8) */ INTO f_monthly_acctg_stage s
SELECT /*+ PARALLEL(f,8) */ period_id, store_id, location_id, account_id, SUM(acctg_nmbr)
FROM f_daily_acctg f
WHERE period_id = 1774
GROUP BY period_id, store_id, location_id, account_id;
First, the select statement is tuned and parallelized. Since the query includes an aggregation operation, two sets of eight PX
slaves each will be employed in an inter-operational parallel step. In the first intra-operational step, eight slaves will scan
multiple partitions of the f_daily_acctg table. These slaves will produce rows that will be consumed by the second set of
eight PX slaves that perform the aggregation. The second step, in turn, produces rows that will be consumed by the eight
slaves employed in the insert statement to populate the f_monthly_acctg_stage table. A total of 24 PX slaves will be used by
this statement.
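To observe how many slaves a running statement actually holds, V$PX_SESSION can be queried from another session while the statement executes. The grouping below is one possible sketch; it assumes that rows with a non-null SERVER_SET belong to PX slaves and that the Query Coordinator's own row has a null SERVER_SET.

```sql
-- One row per Query Coordinator and slave set, with the number of slaves
-- in each set and the granted versus requested degree.
SELECT qcsid,
       server_set,
       COUNT(*)        AS slaves,
       MAX(degree)     AS degree,
       MAX(req_degree) AS req_degree
  FROM v$px_session
 WHERE server_set IS NOT NULL
 GROUP BY qcsid, server_set
 ORDER BY qcsid, server_set;
```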
By default, parallel insert statements are performed using direct path operations. These operations bypass the SGA buffer
cache so throughput is much higher than conventional path inserts. However, there are two artifacts associated with direct
path inserts. Since entire data blocks are preformatted and inserted directly into datafiles, existing table blocks with free
space will not be used. Additionally, any free space left in data blocks inserted by direct path inserts will never be eligible
for subsequent conventional path inserts because they are never placed on the table or partition free lists. The most useful
place for direct path inserts is for loading intermediate tables that will be subsequently loaded into production tables using
conventional path. If the target table of a parallel insert statement is nonpartitioned, only direct path inserts can be performed.
If the table is partitioned, both direct and conventional path inserts can be performed. To use conventional path inserts on
partitioned tables, specify the NOAPPEND hint on the insert statement.
If the insert statement is loading data into a table that can easily be recreated in the event that the database crashes and
requires recovery, the NOLOGGING attribute can be set on the table to further increase data loading throughput. The nologging option
causes the amount of redo information to be significantly reduced. This option is useful for loading temporary tables that may
be part of the data warehouse data loading task.
Create Table As Select
Create Table As Select, available since Oracle 7, is another feature that is useful for manipulating table data during data
warehouse loading tasks. It combines the functionality of a create table and insert statement into a single step. The example below
illustrates use of the statement.
CREATE TABLE f_monthly_acctg_stage
TABLESPACE stage
PARALLEL 8
NOLOGGING
AS
SELECT /*+ PARALLEL(f,8) */
period_id,
store_id,
location_id,
account_id,
SUM(acctg_nmbr) acctg_nmbr
FROM f_daily_acctg f
GROUP BY
period_id,
store_id,
location_id,
account_id;
Once again, each of two operations is performed separately. The select statement will employ 16 PX slaves to perform the
rollup using inter-operational parallelism. The create table statement will both create the table and insert the rows produced
by the parallel query using direct path operations. Since this is an intermediate temporary table, the nologging clause has
also been used to improve throughput.
The Degree of Parallelism for the insert part of the statement is specified by the parallel 8 directive in the create table
statement. It is syntactically correct to not specify the DOP following the parallel directive. If a DOP does not follow the
directive, then the query will utilize all of the slaves available on the instance up to the value of
PARALLEL_MAX_SERVERS as an unintended consequence. Be sure to always specify the requested DOP after the
parallel directive in this statement.
Create Materialized View
Similar to the Create Table as Select statement, materialized view creation can also be parallelized. This is illustrated in the
statement below.
In this example, up to 12 PX threads will be exploited to read the source tables and populate the materialized view. In the
select statement, the optimizer will use parallelism on the tables that it determines can best benefit from it. In this case, it is
likely to be the fact table, f_daily_store_sales, since the remaining dimension tables are small.
PX will also be employed in the step that populates the materialized view in the same way that the table was populated in the
create table as select statement using direct path operations.
CREATE MATERIALIZED VIEW mv_daily_store_sales_state
TABLESPACE materialized_view
PARALLEL 12
PCTFREE 0
NOLOGGING
BUILD IMMEDIATE
REFRESH COMPLETE
ENABLE QUERY REWRITE
AS
SELECT
dp.production_dt,
pr.product_id,
pr.product_dsc,
pr.sic_cd,
pr.upc_cd,
pr.size_cd,
pr.product_category_cd,
ds.store_id,
ds.state,
SUM(fss.total_sales_dol) sales,
SUM(fss.total_return_dol) returns
FROM
f_daily_store_sales fss,
d_production_day dp,
d_product pr,
d_store ds
WHERE fss.production_day_id = dp.production_day_id
AND fss.product_id = pr.product_id
AND fss.store_id = ds.store_id
GROUP BY
dp.production_dt,
pr.product_id,
pr.product_dsc,
pr.sic_cd,
pr.upc_cd,
pr.size_cd,
pr.product_category_cd,
ds.store_id,
ds.state;
Create Index
Both local and global indexes may be efficiently created using PX on both nonpartitioned and partitioned tables. To build an
index using PX, simply add the PARALLEL directive to the create index statement and specify the desired Degree of
Parallelism.
CREATE BITMAP INDEX f_store_status_IX02
ON f_comp_status(store_id)
PARALLEL(DEGREE 4)
NOLOGGING;
Rebuild Index
Likewise, PX can be used to rebuild both nonpartitioned and partitioned indexes. Add the PARALLEL directive to the alter
index statement and specify the desired Degree of Parallelism.
ALTER INDEX f_daily_store_sales REBUILD PARALLEL 12 NOLOGGING;
When creating or rebuilding indexes, increasing the sort area size at the session level will improve performance since more of
the sorting process will occur in memory rather than on disk. Additionally, building indexes with the nologging feature
increases throughput. But remember that with the nologging feature, the index cannot be recovered in the event
that a database recovery is necessary.
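A session-level increase of the sort area before an index rebuild might look like the sketch below. The 50MB value is purely illustrative and is not a recommendation from this paper. Because each PX slave that sorts may allocate up to this amount, a rebuild at a degree of 12 could in principle use several hundred megabytes of sort memory, so the value should be chosen with the host's free memory in mind.

```sql
-- Illustrative value only: 50MB = 50 * 1024 * 1024 bytes
ALTER SESSION SET sort_area_size = 52428800;

ALTER INDEX f_daily_store_sales REBUILD PARALLEL 12 NOLOGGING;
```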
Another artifact of both the parallel index creation and rebuild operations is that the Degree of Parallelism on the index will
be set to the degree used for the index operation. Even if the degree at the table is set to a value of 1, a statement running
against the table that uses the index may use PX as an unintended consequence. To prevent this from occurring, always set
the Degree of Parallelism to the desired value after the parallel index maintenance operation has completed, as shown in the
following statement.
ALTER INDEX f_daily_store_sales PARALLEL(DEGREE 1 INSTANCES 1);
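To find indexes that were left with a degree other than 1 after maintenance, the data dictionary can be checked with a query such as the one below. It is a sketch that assumes the current schema owns the indexes; DEGREE is stored as a character string in USER_INDEXES, so TRIM is applied in case the value is padded, and rows showing DEFAULT are returned as well.

```sql
-- Indexes whose stored Degree of Parallelism is not 1
SELECT index_name, table_name, degree
  FROM user_indexes
 WHERE TRIM(degree) <> '1'
 ORDER BY table_name, index_name;
```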
Gather Object Statistics
Most readers are probably familiar with the ANALYZE statement used to collect statistics on tables and indexes. This
statement cannot utilize PX. While there haven’t been any recent significant enhancements to this statement, the
DBMS_STATS package continues to be enhanced. The DBMS_STATS package can be used to gather statistics at various
levels, and can utilize the PX facility. As a result, object statistics can be gathered in a much shorter period of time compared
with the ANALYZE statement. Many will find that running the DBMS_STATS procedures is more cumbersome than the
ANALYZE statement, but the inconvenience is worth the flexibility and performance gains.
There are many procedures within the DBMS_STATS package, but the following procedures are the basic ones for collecting
statistics. In all cases, the Degree of Parallelism is specified as an argument in the procedure call.
GATHER_INDEX_STATS gathers global and index statistics on an index or index partition.
GATHER_TABLE_STATS is used to gather global and object level statistics on partitioned and nonpartitioned tables. An
option is available to also gather statistics on all indexes and index partitions that exist on the specified table.
GATHER_SCHEMA_STATS gathers global and object level statistics on all tables and indexes in a specified schema.
Statistics should be gathered exclusively with either the ANALYZE statement or DBMS_STATS. When migrating from
ANALYZE to the DBMS_STATS method of statistics collection, drop all of the statistics collected with the ANALYZE
statement before using DBMS_STATS.
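A migration from ANALYZE to DBMS_STATS for a single table could be sketched as follows. The schema name DW, the 10 percent sample, and the degree of 8 are assumptions chosen for illustration; the table name is taken from the earlier examples.

```sql
-- Drop the statistics previously collected with ANALYZE
ANALYZE TABLE f_daily_acctg DELETE STATISTICS;

-- Gather table and index statistics in parallel with DBMS_STATS
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname          => 'DW',             -- hypothetical schema name
    tabname          => 'F_DAILY_ACCTG',
    estimate_percent => 10,               -- illustrative sample size
    degree           => 8,                -- Degree of Parallelism
    cascade          => TRUE);            -- also gather index statistics
END;
/
```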
Performance Monitoring
The final step in achieving optimal PX performance across the database instance is to monitor performance. Oracle provides
several dynamic performance views for this purpose. The older views begin with V$PQ while the newer ones begin with
V$PX. The older views date back to Oracle 7 which supported few parallel features, one of which included parallel query,
hence the V$PQ designation. When capabilities were added beginning with version 8 to include parallel execution of other
operations, the overall facility was named PX, and the V$PX views were added. Both sets of views provide useful
information. There is, however, some redundancy between several V$PQ and V$PX views.
Monitoring the PX Message Pool
After configuring PX for the first time on a database instance, or when configuration changes have been made, the size of the
PX message pool should be monitored. The following query indicates that the size of the message pool is approximately
40MB and is located in the large pool. From the query, we can also determine that the
PARALLEL_AUTOMATIC_TUNING configuration parameter is set to a value of true because the message pool is located
in the large pool. If the value of PARALLEL_AUTOMATIC_TUNING were set to false, then the query would have
indicated that the message pool was located in the shared pool.
SQL> break on pool skip 1
SQL> col bytes format 9,999,999,999;
SQL> col name format a25
SQL> SELECT pool, name, SUM(bytes) bytes
  2  FROM v$sgastat
  3  WHERE pool = 'large pool'
  4  GROUP BY ROLLUP(pool,name);

POOL        NAME                               BYTES
----------- ------------------------- --------------
large pool  PX msg pool                   41,179,660
            free memory                    1,299,988
                                          42,479,648
Message buffer usage can be monitored by querying the V$PX_PROCESS_SYSSTAT view as follows.
SQL> SELECT statistic, value bytes
  2  FROM v$px_process_sysstat
  3  WHERE statistic LIKE 'Buffers%';

STATISTIC                               BYTES
------------------------------ --------------
Buffers Allocated                   1,246,409
Buffers Freed                       1,246,409
Buffers Current                             0
Buffers HWM                             3,141
Multiplying the value for buffers HWM (high water mark) by the value of PARALLEL_EXECUTION_MESSAGE_SIZE
will yield the highest usage of PX message buffer space. Using an 8KB message size for this example, the highest amount
of memory usage by message buffers was 25MB out of roughly 40MB of total space available. After periodically monitoring
this view over days or weeks, resize the message pool if it proves to be substantially larger or smaller than required. It is prudent
to maintain at least a 25% margin between the highest message buffer space usage and the total amount of PX message pool
space allocated.
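The multiplication can be done in a single query by joining the statistic to the parameter value. The sketch below assumes that PARALLEL_EXECUTION_MESSAGE_SIZE is reported in bytes by V$PARAMETER; the column aliases are illustrative. With the figures from the example above (3,141 buffers of 8,192 bytes) it evaluates to roughly 24.5MB, and adding the 25% margin gives about 31MB, which fits within the 40MB pool.

```sql
-- Peak PX message buffer usage, and the same figure with a 25% margin
SELECT s.value                                                AS buffers_hwm,
       TO_NUMBER(p.value)                                     AS message_size,
       ROUND(s.value * TO_NUMBER(p.value) / 1024 / 1024, 1)   AS hwm_mb,
       ROUND(s.value * TO_NUMBER(p.value) * 1.25
             / 1024 / 1024, 1)                                AS hwm_plus_margin_mb
  FROM v$px_process_sysstat s,
       v$parameter p
 WHERE s.statistic LIKE 'Buffers HWM%'
   AND p.name = 'parallel_execution_message_size';
```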
Monitoring the Overall PX Facility
V$PX_PROCESS_SYSSTAT shows overall statistics for the PX facility. The server statistics show the number of PX slaves
that are in use and how many are available in the PX slave pool at a point in time. Here, the terms slave and server are
synonymous. Slaves that are in use are slaves that are currently under the command of a Query Coordinator that cannot be
used for new parallel operations. Slaves that are available are slaves that are idle in the slave pool that may be acquired for
parallel operations. The total number of slaves that are in use can be as large as the parameter value of
PARALLEL_MAX_SERVERS. If the sum of the values of Servers In Use and Servers Available does not equal the value of
PARALLEL_MAX_SERVERS, then it means that the value of PARALLEL_MIN_SERVERS is less than the value of
PARALLEL_MAX_SERVERS. In such a case, additional slaves will be dynamically created when a Query Coordinator
requests them and none are available in the slave pool.
SQL> SELECT *
  2  FROM v$px_process_sysstat
  3  WHERE statistic LIKE 'Servers%';

STATISTIC                           VALUE
------------------------------ ----------
Servers In Use                         38
Servers Available                       2
Servers Started                       820
Servers Shutdown                      386
Servers Highwater                      40
Servers Cleaned Up                      0
Server Sessions                    67,372
If the value for Servers Shutdown is high, it indicates that many slaves were dynamically destroyed to reduce the total
number of slaves in the slave pool to the value of PARALLEL_MIN_SERVERS. If there is sufficient memory available,
consider increasing the value of PARALLEL_MIN_SERVERS to avoid the overhead associated with dynamically creating
and destroying PX slaves.
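The comparison described above can be sketched as one query that places the current pool size next to the two parameters. It is an illustrative sketch that assumes the parameter values are read from V$PARAMETER. The column aliases are hypothetical names chosen for readability.

```sql
-- Current slave pool size next to PARALLEL_MIN_SERVERS / PARALLEL_MAX_SERVERS.
-- RTRIM guards against statistic names that are padded with trailing blanks.
SELECT s.in_use,
       s.available,
       s.in_use + s.available AS pool_size,
       s.servers_started,
       s.servers_shutdown,
       TO_NUMBER(mn.value)    AS parallel_min_servers,
       TO_NUMBER(mx.value)    AS parallel_max_servers
  FROM (SELECT MAX(DECODE(RTRIM(statistic), 'Servers In Use',    value)) AS in_use,
               MAX(DECODE(RTRIM(statistic), 'Servers Available', value)) AS available,
               MAX(DECODE(RTRIM(statistic), 'Servers Started',   value)) AS servers_started,
               MAX(DECODE(RTRIM(statistic), 'Servers Shutdown',  value)) AS servers_shutdown
          FROM v$px_process_sysstat) s,
       v$parameter mn,
       v$parameter mx
 WHERE mn.name = 'parallel_min_servers'
   AND mx.name = 'parallel_max_servers';
```

If SERVERS_STARTED and SERVERS_SHUTDOWN both grow quickly between samples while POOL_SIZE stays close to PARALLEL_MIN_SERVERS, slaves are being created and destroyed repeatedly, which is the situation the paragraph above recommends avoiding.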
V$SYSSTAT contains some useful statistics about the types and numbers of statements parallelized. The total number of
data flow operations parallelized is shown on the second line of the output. Typically, one inter-operational parallel
operation results in one data flow operation (DFO). If the number of DFOs is substantially higher than the sum of the
remaining three rows, then many of the parallelized queries contain multiple join operations, or are compound
statements such as INSERT INTO … SELECT, CREATE TABLE … AS SELECT, or CREATE MATERIALIZED VIEW.
SQL> SELECT name, value
       FROM v$sysstat
      WHERE name LIKE '%parallel%'
      ORDER BY name;

NAME                                VALUE
------------------------------ ----------
DDL statements parallelized           360
DFO trees parallelized               7425
DML statements parallelized            26
queries parallelized                 6998
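The comparison between DFO trees and the other three rows can be sketched directly in SQL. This is a minimal sketch, and the column aliases are hypothetical names introduced for illustration.

```sql
-- DFO trees versus the number of parallelized statements since startup.
SELECT dfo_trees,
       query_stmts + dml_stmts + ddl_stmts               AS statements,
       dfo_trees - (query_stmts + dml_stmts + ddl_stmts) AS additional_dfo_trees
  FROM (SELECT MAX(DECODE(name, 'DFO trees parallelized',      value)) AS dfo_trees,
               MAX(DECODE(name, 'queries parallelized',        value)) AS query_stmts,
               MAX(DECODE(name, 'DML statements parallelized', value)) AS dml_stmts,
               MAX(DECODE(name, 'DDL statements parallelized', value)) AS ddl_stmts
          FROM v$sysstat
         WHERE name LIKE '%parallelized');
```

With the figures in the listing above, the three statement counters add up to 360 + 26 + 6,998 = 7,384 against 7,425 DFO trees, a difference of only 41. In that example relatively few statements produced more than one DFO tree.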
The following query against V$SYSSTAT is useful for determining when parallel operations were unable to acquire the total
number of slaves that were requested. The number of slaves acquired at run-time will be less than the requested Degree of
Parallelism when too few slaves are available in the pool and the shortfall is within the limit permitted by
PARALLEL_MIN_PERCENT, or when PARALLEL_ADAPTIVE_MULTI_USER is set to a value of TRUE. Downgrading can often explain
erratic performance of the same query that executes many times over the course of a day. When the query requires a longer
time to execute, it is likely to have acquired fewer slaves than requested.
SQL> SELECT name, value
       FROM v$sysstat
      WHERE name LIKE 'Parallel%'
      ORDER BY name;

NAME                                               VALUE
--------------------------------------------- ----------
Parallel operations downgraded 1 to 25 pct            18
Parallel operations downgraded 25 to 50 pct            4
Parallel operations downgraded 50 to 75 pct            0
Parallel operations downgraded 75 to 99 pct            0
Parallel operations downgraded to serial             108
Parallel operations not downgraded                  7425
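To judge how often downgrading occurs, the six counters can be reduced to a single percentage. The following is a sketch under the assumption that every row whose name begins with 'Parallel operations' other than the "not downgraded" row counts as a downgrade. The column aliases are hypothetical.

```sql
-- Share of parallel operations that ran with fewer slaves than requested.
SELECT downgraded,
       not_downgraded,
       ROUND(100 * downgraded
             / DECODE(downgraded + not_downgraded, 0, NULL,   -- avoid divide by zero
                      downgraded + not_downgraded), 2) AS pct_downgraded
  FROM (SELECT SUM(DECODE(name, 'Parallel operations not downgraded', 0, value)) AS downgraded,
               SUM(DECODE(name, 'Parallel operations not downgraded', value, 0)) AS not_downgraded
          FROM v$sysstat
         WHERE name LIKE 'Parallel operations%');
```

For the listing above, 18 + 4 + 0 + 0 + 108 = 130 operations were downgraded out of 7,555 in total, or about 1.7%. Most of those (108) were downgraded all the way to serial execution, which is the case most likely to be noticed as a slow run.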
At the next lower level of the PX facility, V$PX_PROCESS shows the status of each PX slave. The number of slaves in the
pool will range from the value of PARALLEL_MIN_SERVERS to the value of PARALLEL_MAX_SERVERS. Note that
the SID and SERIAL# are only assigned when the slave is in use.
SQL> SELECT *
       FROM v$px_process
      ORDER BY server_name;

SERV STATUS           PID SPID             SID    SERIAL#
---- --------- ---------- --------- ---------- ----------
P000 IN USE            21 20559             34       9438
P001 IN USE            22 20561             50      24257
P002 IN USE            23 20563             84        540
P003 IN USE            24 20565             26      10661
P004 IN USE            25 20567             18      27455
…
P038 AVAILABLE         64 22264
P039 AVAILABLE         65 22266
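When the pool is large, a per-status count is easier to read than the full listing. A minimal sketch, using only the STATUS column shown above:

```sql
-- Number of PX slaves in each state. The total across all rows is the
-- current size of the slave pool.
SELECT status, COUNT(*) AS slaves
  FROM v$px_process
 GROUP BY status
 ORDER BY status;
```

The sum of the two counts should fall between PARALLEL_MIN_SERVERS and PARALLEL_MAX_SERVERS, as described above.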
Moving to the SQL statement level, V$PX_SESSION can be used to determine how many slaves have been acquired by each
Query Coordinator. The results of the following query indicate that there are 9 sessions associated with the first Query
Coordinator, and 17 sessions associated with each of the last two. Since V$PX_SESSION also includes a row for the Query
Coordinator itself, the number of PX slaves being used by each QC is one less than the number reported in the
COUNT(*) column: 8 slaves for the first QC and 16 for each of the other two.
SQL> SELECT qcsid, COUNT(*)
       FROM v$px_session
      GROUP BY qcsid;

     QCSID   COUNT(*)
---------- ----------
        45          9
        73         17
        74         17
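A variation of this query can report the slave count directly by excluding the Query Coordinator's own row. The sketch below assumes that the QC row is the one where SID equals QCSID, and that the DEGREE and REQ_DEGREE columns are present in V$PX_SESSION in the release being used. Neither assumption is demonstrated by the listing above.

```sql
-- Slaves per Query Coordinator, with requested versus actual degree.
SELECT qcsid,
       COUNT(*)        AS slaves,
       MAX(req_degree) AS requested_dop,
       MAX(degree)     AS actual_dop
  FROM v$px_session
 WHERE sid <> qcsid          -- skip the row that describes the QC itself
 GROUP BY qcsid
 ORDER BY qcsid;
```

A row where ACTUAL_DOP is lower than REQUESTED_DOP identifies a statement that is currently running downgraded, which complements the cumulative downgrade counters in V$SYSSTAT.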
Summary
PX architecture is a logical extension of the core database architecture. When a parallel SQL statement begins to execute, it
borrows PX slaves from a common pool available to all database users. When the statement completes, the slaves are
returned to the slave pool for use by other processes.
There are many considerations about the database and host that affect overall performance of the PX facility. Prior to setting
up PX, the database and host configuration should be optimized to take full advantage of PX. PX is one application that can
take full advantage of a well-designed high-throughput disk subsystem. Proper database physical design is also an important
factor that influences optimal PX performance.
There are many database configuration parameters that control the configuration and performance of the PX facility. Besides
the obvious parameters that begin with the word PARALLEL, there are many others that are often overlooked and that can
lead to less than optimal performance, or even statement failure.
Once the PX facility has been configured, it can generally be implemented by two methods. The first method is to set
the Degree of Parallelism on tables and indexes to a value greater than one. In this case, statements will automatically begin
using PX. The greatest risk of this approach is that some poorly performing statements may degrade overall database
performance. Additionally, some statements may require tuning to fully take advantage of PX. The second method, which is
the recommended approach, is to choose the statements that can best benefit from PX and manually tune them with query
hints. In either case, statements running against the candidate tables should be optimized prior to using PX. You may find
that performance is quite adequate without the use of PX.
The most common data warehouse operations that can take advantage of PX include parallel query, parallel DML, Create
Table as Select, and materialized view creation. Maintenance operations such as index creation and rebuilds, and object
statistics gathering, can also be parallelized.
Once the database is up and running on PX, it is important to monitor the overall health of the PX facility. The
V$PX_PROCESS_SYSSTAT view provides a high level view. One can monitor the size of the PX message pool and verify
that it resides in its intended location. The view also provides a summary of message buffer and PX slave allocation. The
V$SYSSTAT view is useful for determining the types of PX operations that have been run, as well as information about
operations that did not acquire all of the slaves that were requested. The V$PX_PROCESS and V$PX_SESSION views
show details about PX statements that are currently executing.
Once one has a good understanding of the PX architecture as it relates to the database and host environments, and has
properly configured and monitored PX operation on the database instance, PX can be used effectively and efficiently to
scale data warehouse throughput.