Oracle Architecture and Tuning on AIX

IBM Americas Advanced Technical Support
IBM Oracle Technical Brief
Oracle Architecture and Tuning on AIX
Damir Rubic
IBM SAP & Oracle Solutions Advanced Technical Skills
Version: 2.30
Date: 08 October 2012
© 2011 International Business Machines, Inc.
ACKNOWLEDGEMENTS
• DISCLAIMERS
• TRADEMARKS
• FEEDBACK
• VERSION UPDATES
INTRODUCTION
1. ORACLE DATABASE ARCHITECTURE
   1.1. DATABASE STRUCTURE
        1.1.1. Logical Structure
        1.1.2. Physical storage structures
   1.2. INSTANCE AND APPLICATION PROCESSES
   1.3. ORACLE MEMORY STRUCTURES
2. AIX CONFIGURATION & TUNING FOR ORACLE
   2.1. OVERVIEW OF THE AIX VIRTUAL MEMORY MANAGER (VMM)
        2.1.1. Real-Memory Management
        2.1.2. Persistent versus Working Segments
        2.1.3. Computational versus File Memory
        2.1.4. Page Replacement
   2.2. MEMORY AND PAGING
        2.2.1. AIX 6.1 and 7.1 Restricted Tunables concept
        2.2.2. AIX free memory
        2.2.3. AIX file system cache size
        2.2.4. Allocating sufficient paging space
        2.2.5. AIX Multi-page size support for Oracle & SGA Pinning
        2.2.6. AIX 1TB Segment Aliasing
        2.2.7. AIX Kernel Memory Pinning
        2.2.8. AIX Enhanced Memory Affinity
   2.3. I/O CONFIGURATION
        2.3.1. AIX LVM - Volume Groups
        2.3.2. AIX LVM - Logical Volumes
        2.3.3. AIX sequential read ahead
        2.3.4. I/O Buffer Tuning
        2.3.5. Disk I/O Tuning
        2.3.6. Disk I/O Pacing
        2.3.7. Asynchronous I/O
        2.3.8. JFS2 file system DIO/CIO mount options
        2.3.9. Oracle ASM & AIX Considerations
   2.4. CPU TUNING
        2.4.1. Simultaneous Multi-Threading (SMT)
        2.4.2. Logical Partitioning
        2.4.3. Micro-Partitioning
        2.4.4. Virtual Processor Folding
   2.5. NETWORK TUNING
3. ORACLE TUNING
APPENDIX: RELATED PUBLICATIONS
Table of Figures:
Figure 1 - Relationship between logical and physical storage
Figure 2 - Relationship between tablespaces, datafiles, tables and indexes
Figure 3 - Connection between a user and the database
Figure 4 - Oracle Process Architecture
Figure 5 - Oracle Dedicated Process Architecture
Figure 6 - Oracle Shared Process Architecture
Figure 7 - Oracle RAC Specific Instance Processes
Figure 8 - Oracle Memory structures
Figure 9 - Persistent and Working Storage Segments
Figure 10 - Legacy “page_steal_method=0” vs. List-based “page_steal_method=1”
Figure 11 - “vmstat –v” sample output
Figure 12 - AIX VMM Page Replacement Overview
Figure 13 - AIX Memory Page Sizes
Figure 14 - AIX Large Segment Aliasing
Figure 15 - Multi-Tier memory hierarchy of Power System 795
Figure 16 - Power7 Systems - Memory "distances"
Figure 17 - Power System 770 "lssrad -av" sample output
Figure 18 - "mpstat -d" sample output
Figure 19 - Volume Groups Configuration Limits
Figure 20 - Optimal Storage Layout
Figure 21 - AIX I/O Buffering
Figure 22 - Disk I/O Layers
Figure 23 - “lsattr –El hdisk2” sample output
Figure 24 - “lsattr –El fsc2” sample output
Figure 25 - “iostat” sample output
Figure 26 - “fcstat” sample output
Figure 27 - “filemon” sample output
Figure 28 - AIX Asynchronous I/O
Figure 29 - AIX Fastpath Asynchronous I/O
Figure 30 - AIX & Oracle Cached I/O Operations
Figure 31 - AIX & Oracle Direct I/O Operations
Figure 32 - Oracle ASM Architecture
Figure 33 - SAP recommended ASM configuration for very large databases
Figure 34 - AIX PVIDs
Figure 35 - AIX “rendev” command
Figure 36 - AIX “lkdev” command
Figure 37 - PowerVM key terms
Figure 38 - AIX VP Folding
Figure 39 - Suggested “no” tunable values
Acknowledgements
Thank you to Rebecca Ballough, Dan Braden, Jim Dilley, Dale Martin and Ralf Schmidt-Dannert for all
their help regarding this and many other Oracle & IBM Power Systems performance-related subjects.
• Disclaimers
While IBM has made every effort to verify the information contained in this paper, this paper may still
contain errors. IBM makes no warranties or representations with respect to the content hereof and
specifically disclaims any implied warranties of merchantability or fitness for any particular purpose. IBM
assumes no responsibility for any errors that may appear in this document. The information contained in
this document is subject to change without notice. IBM reserves the right to make any such changes
without obligation to notify any person of such revision or changes. IBM makes no commitment to keep
the information contained herein up to date.
• Trademarks
The following terms are registered trademarks of International Business Machines Corporation in the
United States and/or other countries: AIX, AIX/L, AIX/L (logo), IBM, IBM (logo), pSeries, Total
Storage, Power PC.
The following terms are registered trademarks of International Business Machines Corporation in the
United States and/or other countries: Power VM, Advanced Micro-Partitioning, AIX 5L, AIX 6 (logo),
Micro Partitioning, Power Architecture, POWER5, POWER6, POWER 7, Redbooks, System p, System
p5, System p6, System p7, System Storage.
A full list of U.S. trademarks owned by IBM may be found at:
http://www.ibm.com/legal/copytrade.shtml
Oracle, Java and all Java-based trademarks are registered trademarks of Oracle Corporation in the USA
and/or other countries.
UNIX is a registered trademark in the United States, other countries or both.
SAP R/3 is a registered trademark of SAP Corporation.
• Feedback
Please send comments or suggestions for changes to drubic@us.ibm.com.
• Version Updates
• Version 1.00 - initial version
• Version 1.10 - added architectural and tuning elements related to AIX 5.3
• Version 1.20 - added architectural and tuning elements related to AIX 6.1
• Version 1.30 - minor corrections/updates applied
• Version 1.31 - minor corrections applied
• Version 2.00
  o AIX 5.2 related content removed
  o versioning changed
  o added “Virtual Processor Folding” chapter
  o added “I/O Pacing” chapter
  o added “Oracle ASM” chapter
  o corrections/updates applied
• Version 2.10 - added architectural and tuning elements related to AIX 7.1
  o added “AIX LSA” chapter
  o corrections/updates applied
• Version 2.11
  o O_CIOR flag explained
  o corrections/updates applied
• Version 2.20
  o added “Disk I/O Tuning” chapter
  o corrections/updates applied
• Version 2.30
  o added “Oracle ASM & AIX Considerations” chapter
  o added “AIX Kernel Memory Pinning” chapter
  o added “AIX Enhanced Memory Affinity” chapter
  o corrections/updates applied
Introduction
This paper is intended for IBM Power Systems customers, IBM Technical Sales Specialists and
consultants who are interested in learning more about the requirements involved in building and tuning an
Oracle RDBMS system for optimal performance on the AIX platform.
This white paper contains best practices that my team colleagues and I have collected over the many years
we have spent working in Oracle RDBMS based environments. It is focused on AIX
versions 5.3, 6.1, 7.1 and Oracle 9i, 10g, 11g.
It begins with a short description of the most important Oracle DB architectural elements. It continues
with an overview of the AIX-related tuning elements that are most crucial for optimal DB activity.
This document can be expanded into many different OS or DB-related directions. Additional information
on related topics is included in the Appendix of this paper, as well as references to supporting
documentation. However, this paper is not focused on the application tuning area. Application
performance tuning is a subject too broad to be covered in a white paper of this length.
1. Oracle Database Architecture
1.1. Database Structure
An Oracle database is a collection of data treated as a unit. The purpose of a database is to store and
retrieve related information. A database server is the key to solving the problems of information
management. In general, a server reliably manages a large amount of data in a multi-user environment so
that many users can concurrently access the same data. All this is accomplished while delivering high
performance. A database server also prevents unauthorized access and provides efficient solutions for
failure recovery.
The database has logical structures and physical structures. Because the physical and logical structures
are separate, the physical storage of data can be managed without affecting the access to logical storage
structures.
1.1.1. Logical Structure
An Oracle database is made up of several logical storage structures, including data blocks, extents and
segments, tablespaces, and schema objects (Figure 1).
The actual physical storage space in the data files is logically allocated and deallocated in the form of
Oracle data blocks. The Database Block Size (DB_BLOCK_SIZE) is the smallest unit of I/O that can be
used for accessing data or index files in an Oracle database. Oracle reserves a portion of each block for
maintaining information, such as the address offset of all the rows contained in the block and the type of
information stored in the block.
An extent is a set of logically contiguous data blocks allocated for storing a specific type of information.
Physically, those blocks could reside in different file system extents and/or on different physical LUNs or drives. A table
is composed of one or more extents. The very first extent of a table is known as the initial extent. When
the data blocks of the initial extent become full, Oracle allocates an incremental extent. The incremental
extent does not have to be the same size (in bytes) as the initial extent.
A segment is the collection of extents that contain all of the data for a particular logical storage structure
in a tablespace, such as a table or index. There are four different types of segments, each corresponding to
a specific logical storage structure type:
• Data segments
• Index segments
• Rollback (Undo) segments
• Temporary segments
Data segments store all the data contained in a table, partition, or cluster. Likewise, index segments store
all the data contained in an index. For backward compatibility, rollback (undo) segments are used to hold
the previous contents of an Oracle data block prior to any change made by a particular transaction. If any
part of the transaction should not complete successfully, the information contained in the rollback (undo)
segments is used to restore the data to its previous state. In Oracle 9i, 10g and 11g this is achieved using
automatic undo and undo tablespaces, which allows better control and use of the server resources.
Figure 1 - Relationship between logical and physical storage
Rollback (undo) segments are also used to provide read-consistency. There are two different types of
read-consistency: statement-level and transaction-level.
Statement-level read consistency ensures that all of the data returned by an individual query comes from a
specific point in time: the point at which the query started. This guarantees that the query does not see
changes to the data made by other transactions that have committed since the query began. This is the
default level of read-consistency provided by Oracle.
In addition, Oracle offers the option of enforcing transaction-level read consistency. Transaction-level
read consistency ensures that all queries made within the same transaction do not see changes made by
queries outside of that transaction but can see changes made within the transaction itself. These are known
as serializable transactions.
Temporary segments are used as temporary workspaces during intermediate stages of a query’s execution.
They are typically used for sort operations that cannot be performed in memory.
The following types of queries may require a temporary segment:
• SELECT.....ORDER BY
• SELECT.....GROUP BY
• SELECT.....UNION
• SELECT.....INTERSECT
• SELECT.....MINUS
• SELECT DISTINCT.....
• CREATE INDEX....
Tablespaces are used to group related logical entities or objects together in order to simplify physical
management of the database. Tablespaces are the primary means of allocating and distributing database
data at the file system or physical disk level. Tablespaces are used to:
• Control the physical disk space allocation for the database
• Control the availability of the data by taking the tablespaces online or off-line
• Distribute database objects across different physical storage devices to improve performance
• Regulate space for individual database users
Every Oracle database contains a tablespace named SYSTEM, which Oracle creates automatically when the
database is created. The SYSTEM tablespace contains the data dictionary tables for the database used to
describe its structure. The SYSAUX tablespace is an auxiliary tablespace to the SYSTEM tablespace.
Many database components use the SYSAUX tablespace as their default location to store data. Therefore,
the SYSAUX tablespace is always created during database creation or database upgrade. The SYSAUX
tablespace provides a centralized location for database metadata that does not reside in the SYSTEM
tablespace. It reduces the number of tablespaces created by default, both in the seed database and in user-defined databases. When the SYSTEM tablespace is locally managed, you must define at least one
default temporary tablespace when creating a database. A locally managed SYSTEM tablespace cannot
be used for default temporary storage. When SYSTEM is dictionary managed and if you do not define a
default temporary tablespace when creating the database, then SYSTEM is still used for default temporary
storage.
Schema objects are the logical structures used to refer to the database’s data. A few examples of schema
objects would be tables, indexes, views, and stored procedures. Schema objects, and the relationships
between them, constitute the relational design of a database.
1.1.2. Physical storage structures
An Oracle database is made up of three different types of physical database files: data files, redo logs, and
control files.
An Oracle database must have one or more data files in order to operate. Data files contain the actual
database data logically represented in the form of tables or indexes. At the operating system level, data
files can be implemented in several different ways (JFS, JFS2, GPFS, Veritas FS and Oracle ASM). In
this document we will focus primarily on JFS or JFS2 based files. The data contained in the data files is
read from disk into the Oracle SGA memory regions.
An Oracle tablespace is comprised of one or more data files. A data file cannot be associated with more
than one tablespace, nor can it be used by more than one database (Figure 2).
At creation time, the physical disk space associated with a data file is pre-formatted, but does not contain
any user data. As data is loaded into the system, Oracle reserves space for data or indexes in the data file
in the form of extents.
Redo logs are used by Oracle to record all changes made to the database. Every Oracle database must have
at least two redo logs in order to function. The redo log files are written to in a circular fashion; when the
current online log fills up, Oracle switches to the next available log and, if needed, archives the previous
log. In the event of a failure, changes to the Oracle database can be reconstructed using the copy of the DB
(e.g. point in time backup) as the starting point for applying the redo logs. Due to their importance, Oracle
provides a facility for mirroring or multiplexing the redo logs so that two (or more) copies of the log are
available on disk.
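For illustration only, a second (multiplexed) member could be added to an existing online redo log group from SQL*Plus; the file name and group number below are hypothetical and simply show the general form:

# Hypothetical file name and group number - adjust to the actual redo log layout.
sqlplus -s / as sysdba <<'EOF'
ALTER DATABASE ADD LOGFILE MEMBER '/oralog2/redo01b.log' TO GROUP 1;
EOF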
The control file describes the physical structure of the database. It contains information such as the
database name, the date and time the database was created, and the names and locations of all the database
data files and redo logs. Like the redo logs, Oracle can have multiple copies of the control file to protect
against logical or physical corruption.
Figure 2 - Relationship between tablespaces, datafiles, tables and indexes
1.2. Instance and Application Processes
A process is defined as a thread of control used in an operating system to execute a particular task or
series of tasks. Oracle utilizes three different types of processes to accomplish these tasks:
• User or client processes
• Server processes
• Background processes
User processes are created to execute the code of a client application program. The user process is
responsible for managing the communication with the Oracle server process using a session. A session is a
specific connection of a user application program to an Oracle instance. The session lasts from the time
that the user or application connects to the database until the time the user disconnects from the database.
Figure 3 - Connection between a user and the database
Server processes are created by Oracle to service requests from connected user processes. They are
responsible for interfacing with the database to carry out the requests of user processes. The number of
user processes per server process is dependent on the configuration of Oracle. In a dedicated server
configuration, one server process is spawned for each connected user process. In a shared server
configuration, user processes are distributed among a pre-defined number of server processes. Oracle 11g
has introduced a new background process to support server-side connection pooling called Database
Resident Connection Pooling (DRCP). DRCP provides a connection pool in the database server for typical
Web application usage scenarios. DRCP pools dedicated servers, each consisting of a server foreground process
combined with a database session, to create pooled servers. A Web application typically acquires a
database connection, uses the connection for a short period, and then releases the connection. DRCP
enables multiple Web application threads and processes to share the pooled servers for their connection
needs.
Oracle background processes are created upon database startup or initialization. Some background
processes are necessary for normal operation of the system, while others are only used to perform certain
database maintenance or recovery related functions.
Figure 4 - Oracle Process Architecture
The Oracle background processes include:
• Database Writer (DBWn)
The database writer process is responsible for writing modified or dirty database buffers from the database
buffer cache to disk. It uses a Least Recently Used (LRU) algorithm to ensure that the user processes
always find free buffers in the database buffer cache. Dirty buffers are usually written to disk using single
block writes. Additional database writer processes can be configured to improve write performance if
necessary. On AIX, enabling Asynchronous I/O in most cases eliminates the need for multiple database
writer processes and should yield better performance.
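A quick way to check how an instance is currently configured in this respect is to display the relevant initialization parameters from SQL*Plus; this only shows the current settings and is not a tuning recommendation:

sqlplus -s / as sysdba <<'EOF'
SHOW PARAMETER db_writer_processes
SHOW PARAMETER disk_asynch_io
SHOW PARAMETER filesystemio_options
EOF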
• Log Writer (LGWR)
The log writer process is responsible for writing modified entries from the redo log buffer to the online
redo log files on disk. It may write as little as 512 bytes in a single write (check the agblksize requirements
when used with the AIX DIO/CIO feature). This occurs when one of the following conditions is met:
o Three seconds have elapsed since the last buffer write to disk.
o The redo log buffer is one-third full.
o The DBWn process needs to write modified data blocks to disk for which the
corresponding redo log buffer entries have not yet been written to disk.
o A transaction commits.
A commit record is placed in the redo log buffer when a user issues a COMMIT statement, at which point
the buffer is immediately written to disk. The redo log buffer entries are written in First-In-First-Out
(FIFO) order. This means that once the commit record has been written to disk, all of the other log records
associated with that transaction have also been written to disk. The actual modified database data blocks
are written to disk at a later time, a technique known as fast commit. A System Change Number (SCN)
defines a committed version of a database at a point in time. Each committed transaction is assigned a
unique SCN.
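For example, when the online redo logs are placed on a JFS2 file system that will be mounted with CIO, the file system is typically created with a 512-byte agblksize so that the small LGWR writes stay aligned. The logical volume and mount point names below are hypothetical:

# Hypothetical names - JFS2 file system with agblksize=512 for online redo logs.
crfs -v jfs2 -d redolv -m /oraredo -A yes -a agblksize=512
mount -o cio /oraredo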
• Checkpoint Process (CKPT)
The checkpoint process is responsible for notifying the DBWn process that the modified database blocks
in the SGA need to be written to the physical data files. It is also responsible for updating the headers of
all Oracle data files and the control_file(s) to record the occurrence of the most recent checkpoint.
• Archiver (ARCn)
This is an optional process as far as the database is concerned, but usually a required process for the
business. Without one or more ARCn processes (there can be from one to thirty, named ARC0, ARC1, and
so on) it is possible to lose data.
All change vectors applied to data blocks are written out to the log buffer (by the sessions making the
changes) and then to the online redo log files (by the LGWR). The online redo log files are of fixed size
and number: once they have been filled, LGWR will overwrite them with more redo data. The time that
must elapse before this happens is dependent on the size and number of the online log files, and the
amount of DML activity (and therefore the amount of redo generated) against the database. This means
that the online redo log only stores change vectors for recent activity. In order to preserve a complete
history of all changes applied to the data, the online log files must be copied as they are filled and before
they are reused. The ARCn is responsible for doing this. Provided that these copies, known as archive redo
log files, are available, it will always be possible to recover from any damage to the database by restoring
datafile backups and applying change vectors to them extracted from all the archive log files generated
since the backups were made. Then the final recovery, to bring the backup right up to date, will come
from the online redo log files. Most production transactional databases will run in archive log mode,
meaning that ARCn is started automatically and that LGWR is not permitted to overwrite an online log
file until ARCn has successfully archived it to an archive log file.
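As a minimal sketch, archive log mode is enabled from SQL*Plus while the database is mounted but not open:

sqlplus -s / as sysdba <<'EOF'
SHUTDOWN IMMEDIATE
STARTUP MOUNT
ALTER DATABASE ARCHIVELOG;
ALTER DATABASE OPEN;
ARCHIVE LOG LIST
EOF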
• System Monitor (SMON)
SMON initially has the task of mounting and opening a database. In brief, SMON mounts a database by
locating and validating the database controlfile. It then opens a database by locating and validating all the
datafiles and online log files. Once the database is opened and in use, SMON is responsible for various
housekeeping tasks, such as collating free space in datafiles.
• Process Monitor (PMON)
A user session is a user process that is connected to a server process. The server process is launched when
the session is created and destroyed when the session ends. An orderly exit from a session involves the
user logging off. When this occurs, any work he/she was doing will be completed in an orderly fashion,
and his/her server process will be terminated. If the session is terminated in a disorderly manner (perhaps
because the user’s PC is rebooted), then the session will be left in a state that must be cleared up. PMON
monitors all the server processes and detects any problems with the sessions. If a session has terminated
abnormally, PMON will destroy the server process, return its PGA memory to the operating system’s free
memory pool, and roll back any incomplete transaction that may have been in progress.
• Recoverer (RECO)
The recover process is responsible for recovering all in-doubt transactions that were initiated in a
distributed database environment. RECO contacts all other databases involved in the transaction to
remove any references associated with that particular transaction from the pending transaction table. The
RECO process is not present at instance startup unless the DISTRIBUTED_TRANSACTIONS parameter
is set to a value greater than zero and distributed transactions are allowed.
• Manageability Monitor (MMON) and Manageability Monitor Light (MMNL)
MMON is a process that was introduced with database release 10g and is the enabling process for many of
the self-monitoring and self-tuning capabilities of the database.
The database instance gathers a vast number of statistics about activity and performance. These statistics
are accumulated in the SGA, and their current values can be interrogated by issuing SQL queries. For
performance tuning and also for trend analysis and historical reporting, it is necessary to save these
statistics to long term storage. MMON regularly (by default, every hour) captures statistics from the SGA
and writes them to the data dictionary, where they can be stored indefinitely (though by default, they are
kept for only eight days). Every time MMON gathers a set of statistics (known as a snapshot), it also
launches the Automatic Database Diagnostic Monitor, the ADDM. The ADDM is a tool that analyses
database activity using an expert system developed over many years by many DBAs. As well as gathering
snapshots, MMON continuously monitors the database and the instance to check whether any alerts
should be raised. MMNL is a process that assists the MMON.
• Memory Manager (MMAN)
MMAN is a process that was introduced with database release 10g. It enables the automatic management
of memory allocations. Prior to release 9i of the database, memory management in the Oracle environment
was far from satisfactory. The PGA memory associated with session server processes was nontransferable: a server process would take memory from the operating system’s free memory pool and
never return it—even though it might only have been needed for a short time. The SGA memory
structures were static: defined at instance startup time and unchangeable unless the instance were shut
down and restarted. Release 9i changed that: PGAs can grow and shrink, with the server passing out
memory to sessions on demand while ensuring that the total PGA memory allocated stays within certain
limits. The SGA and the components within it (with the notable exception of the log buffer) can also be
resized, within certain limits. Release 10g automated the SGA resizing: MMAN monitors the demand for
SGA memory structures and can resize them as necessary. Release 11g takes memory management a step
further: all the DBA need do is set an overall target for memory usage, and MMAN will observe the
demand for PGA memory and SGA memory, and allocate memory to sessions and to SGA structures as
needed, while keeping the total allocated memory within a limit set by the DBA.
Figure 5 - Oracle Dedicated Process Architecture
Shared server architecture eliminates the need for a dedicated server process for each connection. A
dispatcher directs multiple incoming network session requests to a pool of shared server processes. An
idle shared server process from a shared pool of server processes picks up a request from a common
queue, which means a small number of shared servers can perform the same amount of processing as
many dedicated servers. Also, because the amount of memory required for each user is relatively small,
less memory and process management are required, and more users can be supported.
A number of different processes are needed in a shared server system:
o A network listener process that connects the user processes to dispatchers or dedicated
servers (the listener process is part of Oracle Net Services, not Oracle Database).
o One or more dispatcher processes
o One or more shared server processes
• Dispatcher (Dnnn)
The dispatcher processes support shared server configuration by allowing user processes to share a limited
number of server processes. Shared server can support a greater number of users, particularly in
client/server environments where the client application and server operate on different computers. You
can create multiple dispatcher processes for a single database instance. The database administrator starts
an optimal number of dispatcher processes depending on the operating system limitation and the number
of connections for each process, and can add and remove dispatcher processes while the instance runs.
• Shared Server Processes (Snnn)
Each shared server process serves multiple client requests in the shared server configuration. Shared
server processes and dedicated server processes provide the same functionality, except shared server
processes are not associated with a specific user process. Instead, a shared server process serves any client
request in the shared server configuration.
Figure 6 - Oracle Shared Process Architecture
The Oracle RAC Specific Instance Processes include (Figure 7):
• Global Cache Service Processes (LMSn)
LMSn processes control the flow of messages to remote instances and manage global data block access.
LMSn processes also transmit block images between the buffer caches of different instances. This
processing is part of the Cache Fusion feature. n ranges from 0 to 9 depending on the amount of
messaging traffic.
• Global Enqueue Service Monitor (LMON)
Monitors global enqueues and resources across the cluster and performs global enqueue recovery
operations. Enqueues are shared memory structures that serialize row updates.
• Global Enqueue Service Daemon (LMD)
Manages global enqueue and global resource access. Within each instance, the LMD process manages
incoming remote resource requests.
• Lock Process (LCK)
Manages non-Cache Fusion resource requests such as library and row cache requests.
• Diagnosability Daemon (DIAG)
Captures diagnostic data about process failures within instances. The operation of this daemon is
automated and it updates an alert log file to record the activity that it performs.
Figure 7 - Oracle RAC Specific Instance Processes
1.3. Oracle Memory Structures
Oracle utilizes several different types of memory structures for storing and retrieving data in the system.
These include the System Global Area (SGA) and Program Global Areas (PGA).
Figure 8 - Oracle Memory structures
The Oracle SGA is a shared memory region used to hold data and internal control structures of the
database. Shared memory region details can be displayed in AIX with the “ipcs -ma” command. The
Oracle SGA and its associated background processes are known as an Oracle instance. The instance
is identified by a short name, the SID, which is used by all Oracle processes connected to this
instance. To connect to a particular instance, set the shell variable $ORACLE_SID to the SID or specify it
in the connect command at SQL*Plus. The SGA memory region is allocated upon instance startup and deallocated when the instance is shut down and is unique to each database instance. The SGA is dynamically
managed by Oracle, while the instance is up.
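For example, the shared memory segments backing the SGA of a running instance can be listed with ipcs, and a session can be pointed at a particular instance by exporting ORACLE_SID first; the SID shown here is hypothetical:

# Hypothetical SID - list the SGA shared memory segments and connect to the instance.
ipcs -ma | grep oracle
export ORACLE_SID=PROD1
sqlplus / as sysdba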
The information contained in the SGA is logically separated into three to five different areas: the
database buffer cache, the redo log buffer, the shared pool, the Java pool (optional) and the large pool
(optional).
The database buffer cache consists of Oracle database data blocks or buffers that have been read from disk
and placed into memory. These buffers are classified as either free, dirty or pinned and are organized in
two lists: the write list and the LRU list. Free buffers are those that have not yet been modified and are
available for use. Dirty buffers contain data that has been modified but has not yet been written to disk and
are held in the write list. Lastly, pinned buffers are buffers that are currently being accessed.
Oracle uses a LRU algorithm to age the dirty data blocks from memory to disk. A LRU list is used to keep
track of all available buffers and their state (dirty, free, or pinned). Infrequently accessed data is moved to
the end of the LRU list where it will eventually be written to disk by the Oracle DBWn process should an
additional free buffer be requested but not be available. Blocks on the end of the LRU list are candidates
for being stolen when a new page needs to be read into buffer cache. Dirty pages are likely to have been
already written to disk well before they reach the end of the LRU list.
The size of the database buffer cache is determined by a combination of several Oracle initialization
parameters. It’s possible to use different block sizes in a database by specifying a non-default block size at
the tablespace level and using different cache sizes for each of these non-standard block sizes. The
DB_BLOCK_SIZE parameter specifies the default or standard block size and the DB_CACHE_SIZE
parameter defines the minimum size of the cache, allocated at instance startup using blocks of size
DB_BLOCK_SIZE. To specify different block sizes, use the initialization parameter
DB_nK_CACHE_SIZE, where n can be 2, 4, 8, 16 or 32 (KB). The DBA can change these parameters while
the instance is up and running, except for the DB_BLOCK_SIZE parameter, which requires the database
to be recreated. Carefully coordinate the values of these parameters with the prerequisites for the AIX
DIO/CIO mount options explained in chapter 2.3.8.
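As a sketch, a 16 KB buffer cache can be configured dynamically and then used by a tablespace created with a matching non-default block size; the cache size, tablespace name and data file path below are purely illustrative:

sqlplus -s / as sysdba <<'EOF'
ALTER SYSTEM SET db_16k_cache_size = 256M;
CREATE TABLESPACE ts_16k DATAFILE '/oradata/ts_16k_01.dbf' SIZE 1G BLOCKSIZE 16K;
EOF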
The Java pool is used to hold Java execution code and classes information if the Java option is enabled for
the specific Oracle instance.
The redo log buffer is used to store information about changes made to data in the database. This
information can be used to reapply or redo the changes made to the database should a database recovery
become necessary. The entries in the redo log buffer are written to the online redo logs by the LGWR
process. The size of the redo log buffer is determined by the LOG_BUFFER parameter in the Oracle
initialization file.
The shared pool area stores memory structures, such as the shared SQL areas, private SQL areas, and the
data dictionary cache and, if the large pool has not been allocated, the buffers for parallel execution messages.
Shared SQL areas contain the parse tree and execution plan for SQL statements. Identical SQL statements
share execution plans. Multiple identical Data Manipulation Language (DML) statements can share the
same memory region, thus saving memory. DML statements are SQL statements that are used to query
and manipulate data stored in the database, such as SELECT, UPDATE, INSERT, and DELETE.
Private SQL areas contain Oracle bind information and run-time buffers. The bind information contains
the actual data values of user variables contained in the SQL query.
The data dictionary cache is used to hold information pertaining to the Oracle data dictionary. The Oracle
data dictionary serves as a roadmap to the structure and layout of the database. The information contained
in the data dictionary is used during Oracle’s parsing of SQL statements.
The buffers for parallel execution hold information needed to synchronize all the parallel operations in the
database. This is allocated from the shared pool area only when the large pool is not configured.
The large pool is an optional memory area that can be used to address shared memory contention. It holds
information such as the Oracle XA interface (the interface that Oracle uses to enable distributed
transactions), I/O server processes and backup/restore operations. The large pool is enabled by setting the
LARGE_POOL_SIZE parameter to a value greater than 0.
Starting with Oracle 9i, a new feature called Dynamic SGA allows the instance to optimize its memory
usage. It also allows Oracle's use of physical memory to be increased or decreased with no
downtime, because the database administrator can issue ALTER SYSTEM commands to change the SGA
size while the instance is running. This feature is also available in Oracle 10g and 11g.
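For instance, with Automatic Shared Memory Management (10g and later) the overall SGA target can be resized online, up to the limit set by SGA_MAX_SIZE; the value shown is purely illustrative:

sqlplus -s / as sysdba <<'EOF'
SHOW PARAMETER sga_max_size
ALTER SYSTEM SET sga_target = 8G SCOPE=BOTH;
EOF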
The Oracle PGA is the collection of non-shared memory regions that each contain data and control
information for an individual server process. Whenever a server process is started, PGA memory is
allocated for that process. The PGA memory can only be used by the particular server process it has been
allocated for. In general, PGA contains a private SQL area (storing bind information and run-time
structures that are associated with a shared SQL area) and session memory that holds session specific
variables, such as logon information. If Shared Server (MTS) architecture is being used, the server process
and associated PGA memory may be "shared" by multiple client processes. The PGA may also be used as
a work area for complex queries making use of memory-intensive operators like the following ones:
• Sort operations, like ORDER BY, GROUP BY, ROLLUP…
• HASH JOINS
• BITMAP MERGE
• BITMAP CREATE
When a user session connects to a dedicated Oracle server, the PGA is automatically managed. The
database administrator just needs to set the PGA_AGGREGATE_TARGET initialization parameter to the
target amount of memory he/she wants Oracle to use for server processes. PGA_AGGREGATE_TARGET is
a target value only. In some situations, the actual aggregate PGA usage may be significantly higher than
the specified target value.
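A simple way to compare the target with the actual aggregate PGA usage is to query v$pgastat; the 4 GB target below is only an example:

sqlplus -s / as sysdba <<'EOF'
ALTER SYSTEM SET pga_aggregate_target = 4G;
SELECT name, round(value/1024/1024) AS mb
FROM v$pgastat
WHERE name IN ('aggregate PGA target parameter', 'total PGA allocated', 'maximum PGA allocated');
EOF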
2. AIX Configuration & Tuning for Oracle
Disclaimer: The suggestions presented here are considered to be basic configuration “starting
points” for general Oracle database workloads. Individual customer workloads will vary. Ongoing
performance monitoring and tuning is required to ensure that the configuration is optimal for the
customer's particular workload characteristics.
This section contains information on common AIX and system administration items related to Oracle
database environments. The items in this section are necessary to ensure a well-performing database
system, and should be considered and verified first before engaging in fine-grained, detailed database tuning.
Proper Virtual Memory Manager (VMM) configuration is a key to optimal Oracle performance, so we are
going to present the major VMM functions and how to tune them for Oracle.
2.1. Overview of the AIX Virtual Memory Manager (VMM)
The VMM services memory requests from the system and its applications. Virtual-memory segments are
partitioned in units called pages; each page is located in real physical memory (RAM), ‘paged out’ to disk
to make room for other pages or stored on disk until it is needed. AIX uses virtual memory to address
more memory than is physically available in the system. The management of memory pages in RAM or
on disk is handled by the VMM.
The virtual address space is partitioned into segments. A segment is a 256 MB, contiguous portion of the
virtual-memory address space into which a data object can be mapped. Process addressability to data is
managed at the segment (or object) level so that a segment can be shared between processes or maintained
as private. For example, processes can share code segments yet have separate and private data segments.
If 1TB Segment Aliasing is used, segment size for shared memory segments is 1TB. More details will be
provided in a later chapter.
2.1.1. Real-Memory Management
Virtual-memory segments are partitioned into fixed-size units called pages. AIX 5.3 ML04 (and later
versions of AIX) running on POWER5+ (and later versions of POWER processors) supports four page
sizes: 4KB, 64KB, 16MB and 16GB. The role of the VMM is to manage the allocation of real-memory
page frames and to resolve references by the program to virtual-memory pages that are not currently in
real memory or do not yet exist (for example, when a process makes the first reference to a page of its data
segment).
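The page sizes actually supported by a given AIX level and processor combination can be verified directly, for example:

# List the page sizes supported on this system (output depends on hardware and AIX level).
pagesize -a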
Because the amount of virtual memory that is in use at any given instant can be larger than real physical
memory, the VMM must store the surplus on disk.
From a performance viewpoint, the VMM has two, somewhat opposed, objectives:
• Minimize the overall processor-time and disk-bandwidth cost of the use of virtual memory
• Minimize the response-time cost of page faults
In pursuit of these objectives, the VMM maintains a free list of page frames that are available to satisfy a
request for memory. The VMM uses a page-replacement algorithm to determine which virtual-memory
pages currently in memory will have their page frames reassigned to the free list. The page-replacement
algorithm uses several mechanisms:
o Virtual-memory segments are classified as containing either computational or file memory.
o Virtual-memory pages whose access causes a page fault are tracked.
o Page faults are classified as new-page faults or as repage faults.
o Statistics are maintained on the rate of repage faults in each virtual-memory segment.
o User-tunable thresholds influence the page-replacement algorithm's decisions.
The following sections describe the free list and the page-replacement mechanisms in more detail.
2.1.2. Persistent versus Working Segments
The pages of a persistent segment have permanent storage locations on disk. Files containing data or
executable programs are mapped to persistent segments. Because each page of a persistent segment has a
permanent disk storage location, the VMM writes the page back to that location when the page has been
changed and can no longer be kept in real memory. If the page has not changed when selected for
placement on a free list, no I/O is required. If the page is referenced again later, it is read in from its
permanent disk-storage location.
Working segments, in contrast, are transitory: they exist only during their use by a process and have no
permanent disk-storage location. Process stack and data regions are mapped to working segments, as are
the kernel text segment, the kernel-extension text segments, as well as the shared-library text and data
segments. Pages of working segments must also have disk-storage locations to occupy when they cannot
be kept in real memory. The disk-paging space is used for this purpose.
The following illustration shows the relationship between some of the types of segments and the locations
of their pages on disk. It also shows the actual (arbitrary) locations of the pages when they are in real
memory.
Figure 9 - Persistent and Working Storage Segments.
Persistent-segment types are further classified.
Client segments are used for:
• mapping remote files (for example, files that are being accessed through NFS), including remote
executable programs. Pages from client segments are saved and restored over the network to their
permanent file location, not on the local-disk paging space.
• mapping the local files that are located on a JFS2 or Veritas based file system
Non-Client segments are used for:
• mapping the local files that are located on a JFS based file system
Journaled and deferred segments are persistent segments that must be atomically updated. If a page from a
journaled or deferred segment is selected to be removed from real memory (paged out), it must be written
to disk paging space unless it is in a state that allows it to be committed (written to its permanent file
location).
2.1.3. Computational versus File Memory
Computational memory, also known as computational pages, consists of the pages that belong to working-storage segments or program text (executable files) segments.
File memory (or file pages) consists of the remaining pages. These are usually pages from permanent data
files in persistent storage.
2.1.4. Page Replacement
When the number of available real memory frames on the free list becomes low, a page stealer is invoked.
A page stealer moves through the Page Frame Table (PFT), looking for pages to steal.
The PFT includes flags to signal which pages have been referenced and which have been modified. If the
page stealer encounters a page that has been referenced, it does not steal that page, but instead, resets the
reference flag for that page. The next time the clock hand (page stealer) passes that page and the reference
bit is still off, that page is stolen. A page that was not referenced in the first pass is immediately stolen.
The modify flag indicates that the data on that page has been changed since it was brought into memory.
When a page is to be stolen, and the modify flag is set, a pageout call is made before stealing the page.
Pages that are part of working segments are written to paging space; persistent segments are written to
disk.
In addition to the page-replacement, the algorithm keeps track of both new page faults (referenced for the
first time) and repage faults (referencing pages that have been paged out - either to paging space or to the
backing file to the disk), by using a history buffer that contains the IDs of the most recent page faults. It
then tries to balance file (persistent data) page outs with computational (working storage or program text)
page outs. With AIX 6.1 this "balancing" feature is disabled by default (lru_file_repage = 0). The
lru_file_repage tunable can be changed dynamically without requiring a system reboot. With AIX 7.1, the
lru_file_repage value is hardcoded to 0 and removed from the list of vmo tunables.
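On AIX 5.3, where the default is still 1, the tunable is commonly set to 0 both dynamically and persistently with vmo (on AIX 6.1 it is already the restricted default, and on AIX 7.1 the tunable no longer exists):

# AIX 5.3 example - prefer stealing file pages over computational pages.
vmo -p -o lru_file_repage=0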
When a process exits, its working storage is released immediately and its associated memory frames are
put back on the free list. However, any memory frames allocated to files opened by the terminated process
are not removed from memory under the assumption that another process could access them in the near
future.
On a multiprocessor system, page replacement is done through the lrud kernel process, which is
dispatched to a CPU when the minfree threshold has been reached. Starting with AIX 5L, the lrud kernel
process is multithreaded with one thread per memory pool. Physical memory is split into one or more
memory pools based on the number of CPUs and the amount of memory configured for the system
(LPAR). The number of memory pools on a system (LPAR) can be determined by running the
“vmstat -v” command.
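The memory pool count appears near the top of the “vmstat -v” output and can be extracted directly, for example:

# Show the number of memory pools configured for this LPAR.
vmstat -v | grep "memory pools"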
Prior to AIX 5.3, the page frame table method was the only method available, as explained before. AIX
5.3.0 introduced list-based LRU for AIX Workload Manager (WLM). It was enabled by default until list-based LRU was made available outside of WLM. AIX 5.3 TL04 introduced list-based LRU system-wide
without the requirement to enable WLM.
The list-based algorithm provides a list of pages to scan for each type of segment (working, persistent,
client and compressed). You can disable the list-based LRU feature and enable the original physical-address-based scanning with the page_steal_method parameter of the vmo command. The default value
in AIX 5.3 for the page_steal_method parameter is 0, which means that the list-based LRU feature is
disabled and physical-address-based scanning is used. If the page_steal_method parameter is set to 1
(default in AIX 6.1 and 7.1), the lists are used to scan pages (Figure 10). The value for the
page_steal_method parameter takes effect after a bosboot and reboot.
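As an illustration (a sketch only; page_steal_method is a restricted tunable on AIX 6.1/7.1 and should only be changed on IBM guidance), the current value can be displayed and a change staged for the next boot as follows:
# vmo -o page_steal_method
# vmo -r -o page_steal_method=1
# bosboot -a          # rebuild the boot image, then reboot for the change to take effect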
Figure 10 - Legacy “page_steal_method=0” vs. List-based “page_steal_method=1”
2.2. Memory and Paging
Several numerical thresholds define the objectives of the VMM. When one of these thresholds is
breached, the VMM takes appropriate action to bring the state of memory back within bounds. This
section discusses the thresholds that the system administrator can alter through the vmo command.
2.2.1. AIX 6.1 and 7.1 Restricted Tunables concept
Since AIX 5.3, six tuning commands (vmo, ioo, schedo, raso, no and nfso) are available to tune the
operating system for a specific workload. Beginning with AIX 6.1, some tunables are now classified as
restricted use tunables. The default values for restricted tunables are considered to be optimum for most
AIX environments and should only be modified at the recommendation of IBM support professionals.
As these parameters are not recommended for user modification, they are no longer displayed by default,
but can be displayed with the -F option (force). The -F option forces the display of restricted tunable
parameters when the options -a, -L, or -x are specified alone on the command line to list all tunables.
When -F is not specified, restricted tunables are not included in a display unless specifically named in
association with a display option.
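For example, the following commands (a minimal sketch) list all tunables including the restricted ones, and display a single restricted tunable by name (lru_file_repage exists as a restricted tunable on AIX 6.1; it is removed in 7.1):
# vmo -F -a
# vmo -L lru_file_repage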
2.2.2. AIX free memory
In AIX, memory is hierarchically represented by the data structures vmpool, mempool, and frameset. A
vmpool represents an affinity domain of memory. A vmpool is divided into multiple mempools, each
mempool being managed by a single page replacement least recently used (LRU) daemon. Each mempool
is further subdivided into one or more framesets that contain the free-frame lists, so as to improve the
scalability of free-frame allocators.
The number of page frames on the free list is controlled by the following thresholds, which are managed
individually for each memory pool:
• minfree
  o Minimum acceptable number of real-memory page frames in the free list. When the size of the
    free list falls below this number, the VMM begins stealing pages and continues until the size of
    the free list reaches maxfree.
• maxfree
  o Maximum size to which the free list will grow by VMM page-stealing. The size of the free list
    may exceed this number as a result of processes terminating and freeing their working-segment
    pages, or the deletion of files that have pages in memory.
Parameter values are specified in units of 4 KB memory pages.
In AIX 5.3, 6.1 and 7.1, minfree and maxfree thresholds apply to each mempool individually. Whenever
the number of free pages in a given mempool drops below minfree, the LRU daemon for that mempool
starts freeing pages in that mempool until the free page count in that mempool reaches maxfree.
Therefore, the number of pages on the system wide memory free list is normally >= minfree times the
number of memory pools.
The minfree and maxfree values are modified using the vmo command. Generally, minfree should be
large enough that there are always a few pages on the free list, so that a new memory page request never
has to wait for lrud, the process implementing the "page stealing", to run. In addition, the difference
between the maxfree and minfree settings should be large enough that lrud is not constantly starting and
stopping.
Recommended starting values for minfree and maxfree are the AIX 5.3, 6.1 and 7.1 default values.
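As an illustration (a sketch only; these are per-memory-pool thresholds and should normally be left at the AIX defaults), the values can be inspected and changed as follows:
# vmo -L minfree -L maxfree                  # show current, default and next-boot values
# vmo -p -o minfree=960 -o maxfree=1088      # example values only; -p makes the change persistent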
2.2.3. AIX file system cache size
With Oracle database workloads, we want to make sure that the computational pages used for Oracle
executable code, the Oracle SGA and Oracle PGA, etc. always stay resident in memory. And, since the
Oracle DB buffer cache already provides caching for Oracle .dbf file data, the AIX file system cache
provides a somewhat redundant feature. Therefore, we want to use the VMM policies that favor
computational pages over file system cache pages.
Optimal file system cache size depends mostly on the workload and I/O characteristics of your database
and on whether you are using a JFS(2) file system based database, raw devices or ASM. If the JFS2 DIO or CIO
options are active, no file system cache is used for Oracle .dbf and/or online redo log files.
Overallocating physical memory can generate excessive paging activity that decreases performance
substantially. In a memory constrained Oracle DB environment, we can reduce the effective file system
cache size (to reduce the physical memory requirement) without adversely affecting DB I/O performance.
In this situation, a large number of SGA data buffers might also have analogous journaled file system
buffers containing the most frequently referenced data. The behavior of the AIX file buffer cache manager
can have a significant impact on performance. On AIX, tuning buffer-cache paging activity is possible but
it must be done carefully and infrequently.
The following thresholds are expressed as percentages. They represent the fraction of the total real
memory of the machine that is occupied by file pages (pages for non-computational segments):
• minperm%
  o If the percentage of real memory occupied by file pages (numperm%, numclient%) falls below the
    MINPERM value, the page-replacement algorithm steals both file and computational pages,
    regardless of repage rates or the setting of 'lru_file_repage'.
• maxperm%
  o If the percentage of real memory occupied by file pages (numperm%) rises above the MAXPERM
    value, the page-replacement algorithm steals only file pages.
• maxclient%
  o If the percentage of real memory occupied by client file pages (numclient%) rises above the
    MAXCLIENT value, the page-replacement algorithm steals only client pages.
  o maxclient% is always <= maxperm%.
• minperm% <> maxperm%
  o If the percentage of real memory occupied by file pages (numperm%) is between the MINPERM
    and MAXPERM parameter values, the VMM normally steals only file pages, but if the repaging
    rate for file pages is higher than the repaging rate for computational pages, then computational
    pages are stolen as well. When lru_file_repage is set to zero (0), the repage rate is ignored and
    only file pages are stolen.
A simple way of remembering this is that AIX tries to keep the AIX buffer cache (numperm%) size
between the minperm% and the maxperm%/maxclient% percentages of memory. Use the vmo command
to determine and change the current minperm%, maxperm% and maxclient% values. The vmo
command shows these values as a percentage of real memory (minperm%, maxperm%, maxclient%) as
well as the total number of 4 KB pages (minperm, maxperm and maxclient). The "vmstat -v" command may
also be used to display the minperm, maxperm and maxclient settings on a percentage basis. In addition,
"vmstat -v" also shows the current number of file and client pages (numperm%, numclient%) in memory
as well as the percentage of memory these pages currently occupy.
Included is the sample output from the “vmstat –v” command displaying the percentage of the real
memory used for different page categories.
Figure 11 - “vmstat –v” sample output
Figure 12 - AIX VMM Page Replacement Overview
• Typical vmo settings for Oracle environments (a command sketch follows this list):
o lru_file_repage = 0 (the default is 1 in AIX 5.3 and 0 in AIX 6.1/ 7.1)
• The parameter was introduced in AIX 5.3 ML1
• It provides a hint to the VMM about whether repage counts should be
considered when determining what type of memory to steal
• With a value of 0, computational memory pages are protected and only file
memory pages are stolen,
• In AIX 7.1, lru_file_repage is hardcoded to 0 and removed from the list of
tunable parameters
o minperm% = 3 (the AIX 5.3 default is 20 and 3 in AIX 6.1/ 7.1)
• Target for minimum % of physical memory to be used for file system cache
• minperm% < numperm% (check output of the vmstat –v)
o maxperm% = 90 (the AIX 5.3 default is 80 and 90 in AIX 6.1/ 7.1)
• Target for maximum % of physical memory to be used for JFS file system
cache
• maxperm% >= maxclient% >= numclient% (check output of the vmstat –v)
o strict_maxperm = 0 (the default in AIX 5.3/6.1/7.1)
• Enables/disables enforcement of maxperm as a “hard” limit
• If this variable has to be changed, this should be done only under the close
supervision of the IBM AIX Support Team.
o maxclient% = 90 (the AIX 5.3 default is 80 and 90 in AIX 6.1/7.1)
• Target for maximum % of physical memory to be used for JFS2/NFS file
system cache
• maxclient% should normally be set equal to maxperm%. If maxperm% default
value is changed, maxclient% should be changed accordingly.
o strict_maxclient = 1 (the default in AIX 5.3/6.1/7.1)
• Enables/disables enforcement of the maxclient% as a “hard” limit (important:
value can be changed, but needs to be done very carefully in respect to the
memory utilization)
• If this variable has to be changed, this should be done only under the close
supervision of the IBM AIX Support team.
o page_steal_method = 1 (the default in AIX 6.1/7.1 and 0 in AIX 5.3)
• In AIX 5.3, the LRU algorithm can either use lists or the page frame table. Prior
to AIX 5.3, the page frame table method was the only method available. The
list-based algorithm provides a list of pages to scan for each type of segment.
o lru_poll_interval = 10 (the default in AIX 5.3/6.1/7.1)
• Indicates the time period (in milliseconds) after which lrud pauses and interrupts
can be serviced.
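The sketch below summarizes how the settings above could be applied on AIX 5.3 (on AIX 6.1/7.1 the listed values are already the defaults and several of these parameters are restricted tunables, so no change should normally be made there):
# vmo -p -o lru_file_repage=0 -o minperm%=3 -o maxperm%=90 -o maxclient%=90
# vmo -r -o page_steal_method=1      # takes effect after bosboot and reboot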
2.2.4. Allocating sufficient paging space
Inadequate paging space can result in a process being killed, a system hang or an AIX crash. On AIX,
paging space can be added and resized dynamically on raw disk partitions. The amount of paging space you should
configure depends on the amount of physical memory present and the paging space requirements of your
applications. A good starting point is '½ the physical memory + 4 GB', up to the size of a single
internal disk. If required, additional paging spaces can be added. To balance the distribution of paging
activity, paging spaces should be defined in equal sizes across multiple physical disks. Defining
multiple paging spaces on a single physical disk is not recommended. Use the lsps command to monitor
paging space use and the vmstat command to monitor system paging activity.
AIX 5L and higher use deferred paging, where paging space is not allocated until needed. The system uses
swap space only if it runs out of real memory. If the memory is sized correctly, there is no paging and the
page space can be small. Workloads where the demand for pages does not fluctuate significantly perform
well with a small paging space.
Even so, one should always plan for zero paging activity, with paging space used only as a protection
mechanism when physical memory is undersized. An additional element to be aware of is that
once a computational page is paged out, it continues to occupy paging space for as long as the
process exists, even if the page is subsequently paged back in. This means that even a minimal
amount of paging activity, caused by physical memory being "slightly" overallocated, will make paging
space use grow gradually over time, potentially up to the total virtual memory footprint of the
entire system. Constant and excessive paging indicates that real memory is over-committed. In
general, you should avoid paging altogether.
The following AIX commands provide paging status and statistics:
# vmstat –s
# lsps –a
# svmon –P (this command will list per process paging space, useful for tracking down paged-out
processes)
2.2.5. AIX Multi-page size support for Oracle & SGA Pinning
As explained in Chapter 2.1.1, AIX 5.3 ML04 (and later versions of AIX) running on POWER5+ (and
later POWER processors) supports four page sizes: 4KB (small page), 64KB (medium page),
16MB (large page) and 16GB. Usage of the 16GB page size is outside the scope of this
document, as it is not appropriate for use with Oracle databases.
Memory access intensive Oracle applications that use large amounts of virtual memory may obtain
performance improvements by using medium and large pages in conjunction with SGA pinning. The
medium and large page related performance improvements are attributable to reduced translation
lookaside buffer (TLB) misses. Medium and large pages also improve memory prefetching by eliminating
the need to restart prefetch operations on 4KB boundaries.
Oracle versions prior to 10.2.0.4 will allocate only two types of pages for the SGA, 4KB and 16MB. The
SGA initialization process during the startup of the instance will try to allocate 16MB pages for the shared
memory if the Oracle initialization parameter LOCK_SGA is set to TRUE. If the LOCK_SGA is set to
FALSE, the 4KB page will be used and no pinning will occur.
Even if not specifically requested, 64KB pages can be automatically coalesced from 4KB pages. This is
done by the AIX kernel process psmd (Page Size Management Daemon, implemented in AIX 6.1+ on
POWER6+).
Oracle 10g (10.2.0.4+) and 11g include explicit support for the use of 64KB pages. Oracle 10.2.0.4
and 10.2.0.5 require Oracle patch 7226548 to automatically allocate 64KB pages
for the SGA. If, during startup of the Oracle DB, LOCK_SGA = TRUE, v_pinshm = 1 and the
oracle user has the required capabilities set, the instance will initially try to map and pin 16MB pages for
the shared memory. If that fails, the SGA initialization process will fall back to 64KB pages automatically and no
additional configuration is required. 64KB pages are preferred over 16MB pages since 64KB pages
provide most of the benefit of 16MB pages without the special tuning requirements of 16MB pages.
More information is provided in Figure 13.
Figure 13 - AIX Memory Page Sizes
The automatic conversion of 16MB pages to 64KB pages, as shown in the table above, only occurs during
the initial startup of the Oracle database. Later and smaller memory requests for 64KB pages do not
trigger the automatic conversion. This can lead to situations where a significant amount of physical
memory is pinned, but unused, in 16MB pages and no free 4KB and/or 64KB pages are available to
support additional memory requests. In this case AIX will start paging to paging space to free up
additional 4KB and/or 64KB pages.
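The commands below (a minimal sketch) can be used to monitor per-page-size memory and pinning behaviour when diagnosing this situation:
# pagesize -a          # page sizes supported by this kernel/hardware
# vmstat -P ALL        # memory and paging statistics broken down by page size
# svmon -G             # system-wide memory usage, including pinned memory per page size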
64KB pages can also be used for the program data, text and thread stack memory
regions of the Oracle executable. To enable this feature, use one of the following options:
• use environment variable LDR_CNTRL before starting Oracle:
o export LDR_CNTRL=DATAPSIZE=64K@TEXTPSIZE=64K@STACKPSIZE=64K
• edit the Oracle XCOFF header file using the following command:
o ldedit –btextpsize=64K –bdatapsize=64K –bstackpsize=64K oracle
The primary motivation for considering the pinning of SGA memory is to prevent Oracle SGA from ever
being paged out. In a properly tuned Oracle on AIX environment there should not be any paging
activity to begin with, so SGA related pages should stay resident in physical memory even without
explicitly pinning them. In improperly configured or tuned environments where the demand for
computational pages exceeds the physical memory available to them, SGA pinning will not address the
underlying issue and will merely cause other computational pages (e.g. Oracle server or user processes) to
be paged out. This can potentially have as much or more impact on overall Oracle performance as the
paging of infrequently used SGA pages.
If not done properly, Oracle SGA pinning and/or the use of large pages can potentially result in
significant performance issues and/or system crashes. And, for many Oracle workloads, SGA
pinning is unlikely to provide significant additional benefits. It should therefore only be considered
where there is a known performance issue that could not be addressed through other options, such
as VMM parameter tuning.
• Pinning shared memory (a worked example follows this list)
  o AIX Parameters
    • vmo -p -o v_pinshm=1 (allow pinning of Shared Memory Segments)
    • leave maxpin% at the default of 80
  o Oracle Parameters
    • LOCK_SGA = TRUE
  o Enabling AIX Large Page Support
    • vmo -p -o lgpg_size=16777216 -o lgpg_regions=number_of_large_pages
    • input calculation: number_of_large_pages = INT[(SGA size - 1) / 16 MB] + 1
  o Allowing Oracle to use Large Pages
    • chuser capabilities=CAP_BYPASS_RAC_VMM,CAP_PROPAGATE oracle
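As a worked illustration (the 20 GB SGA size is hypothetical): number_of_large_pages = INT[(20480 MB - 1 MB) / 16 MB] + 1 = 1279 + 1 = 1280, so a command sketch would look like:
# vmo -p -o v_pinshm=1
# vmo -p -o lgpg_size=16777216 -o lgpg_regions=1280
# chuser capabilities=CAP_BYPASS_RAC_VMM,CAP_PROPAGATE oracle
with LOCK_SGA=TRUE set in the Oracle instance parameters.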
Following are some general guidelines for customers considering SGA pinning:
1. There should be a strong change control process in place and there should be good interlock
between DBA and AIX admin staff on proposed changes. For example, DBAs should not install
new Oracle instances, or modify Oracle SGA or PGA related parameters without coordinating with
AIX admin.
2. Prior to SGA pinning, system pinned page usage should be monitored (e.g. through svmon or vmstat)
to verify that pinning is not likely to result in pinned page demand exceeding maxpin%.
3. maxpin% should not be changed from its default setting of 80. Any exceptions should undergo
critical technical review regarding perceived requirements, workload profile and associated risks.
4. SGA pinning should not be considered if the aggregate SGA requirement (for all instances in the
LPAR) exceeds 60% of the total physical memory. This allows for 20% or more of the physical
memory to be used for non-SGA related pinning. Any exceptions should undergo critical technical
review regarding perceived requirements, workload profile and associated risks.
5. If large pages (16 MB) are used for SGA pinning, any DBA changes to SGA size should be tightly
coupled to corresponding AIX admin vmo lgpg_regions changes. An imbalance between SGA size
and the lgpg_regions allocation may result in sub-optimal performance.
6. Monitor pinned page usage (e.g. through svmon or vmstat) on an ongoing basis to ensure there is
always an adequate reserve of pinnable pages. Consider implementing automated alerts to notify in
the event the reserve drops below some specified threshold.
7. Application and system performance should be observed with and without SGA pinning (in a
stress test environment if possible). If there are no observable and quantifiable performance
improvements due to SGA pinning, SGA pinning should not be used.
2.2.6. AIX 1TB Segment Aliasing
In AIX 6.1 TL06 and AIX 7.1, 1 TB memory segments are a new autonomic operating system feature
designed to improve performance of 64-bit large memory applications. This feature is supported on all
POWER processors that are supported by the respective AIX release.
This enhancement optimizes performance when using shared memory regions that are created with the
shmat()/mmap() system calls. Oracle uses the shmat() system call to allocate its System Global Area (SGA)
memory region. 1 TB segment aliasing improves performance by using 1 TB segment translations on
shared memory regions built from 256 MB segments. If an application qualifies to have its shared memory
regions use 1 TB aliases, the AIX operating system uses 1 TB segment translations without changing the
application.
By default, LSA is disabled on AIX 6.1 TL06 and enabled in AIX 7.1. If LSA is enabled, by default 1TB
segments are used if the shared memory is > 3GB on POWER7 and > 11GB on POWER6 or earlier. In
the context of Oracle databases the key parameter “triggering” LSA is sga_max_size. When Oracle starts
up it allocates the full virtual address space specified in sga_max_size. If sga_max_size is not set then
sga_target is used.
The parameter changes to activate LSA are dynamic and do not require a reboot of AIX, but they do not
affect an already running Oracle database. If you make changes and want them to take effect for the
Oracle DB SGA, you need to restart the database.
New restricted vmo options are available to change the operating system policy (the VMM_CNTRL
environment variable is available to alter per-process behavior); a command sketch follows the list below:
• esid_allocator, VMM_CNTRL=ESID_ALLOCATOR=[0,1]
  o Default off in AIX 6.1 TL06, on in AIX 7.1. When on, indicates that LSA is in use,
    including aliasing capabilities and address space layout changes.
• shm_1tb_shared, VMM_CNTRL=SHM_1TB_SHARED=[0,4096]
  o Default set to 12 (256 MB segments) on POWER7, 44 on POWER6 and earlier. Controls the
    threshold at which a shared memory attachment will get a shared alias (3 GB for POWER7 and
    11 GB for POWER6).
• shm_1tb_unshared, VMM_CNTRL=SHM_1TB_UNSHARED=[0,4096]
  o Default set to 256. Controls the threshold at which multiple homogenous small shared
    memory regions will be promoted to an unshared alias. Conservatively set low.
• shm_1tb_unsh_enable
  o Default set to on. Determines whether unshared aliases will be used.
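As an illustration (a sketch only; esid_allocator is a restricted tunable and should be changed only on IBM guidance), LSA can be inspected and enabled on AIX 6.1 TL06 as follows:
# vmo -o esid_allocator              # display the current setting
# vmo -p -o esid_allocator=1         # dynamic change; restart the Oracle instance afterwards
Alternatively, the per-process override can be set in the environment of the user starting the instance, e.g. export VMM_CNTRL=ESID_ALLOCATOR=1.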
Figure 14 - AIX Large Segment Aliasing
The Appendix "Oracle Database and 1TB Segment Aliasing" provides more information on this topic.
2.2.7. AIX Kernel Memory Pinning
AIX 6.1 TL6 and 7.1 provide a facility to keep AIX kernel and kernel extension data in physical memory
for as long as possible. This feature is referred to as Kernel Memory Pinning or Locking. On systems
running with sufficiently large amounts of memory, locking avoids unnecessary kernel page faults,
thereby providing improved performance.
The following Virtual Memory Management (VMM) tunables were added or modified to support kernel
memory locking:
• maxpin - Kernel’s locked memory is treated like pinned memory. Therefore, the default maxpin%
is raised from 80% to 90% if kernel locking is enabled.
• vmm_klock_mode - New tunable to enable and disable kernel memory locking.
The following vmm_klock_mode options are available and are controlled using the vmo command:
  o 0 = kernel locking is disabled
  o 1 = kernel locking is enabled automatically if the Active Memory Expansion (AME) feature is also
    enabled. In this mode, only a subset of kernel memory is locked. This is the default in AIX 6.1.
  o 2 = kernel locking is enabled regardless of AME and all kernel data is eligible for locking. This
    is recommended for Oracle RAC environments and Oracle installations where EMC storage is
    used for AIX paging devices. This is the default in AIX 7.1.
  o 3 = only the kernel stacks of processes are locked in memory.
Enabling kernel locking has the most positive impact on systems that do some paging but not enough to
page out kernel data, or on systems that do not page at all. Note that modes 1, 2, and 3 are
only advisory: if a system runs low on free memory and performs extensive paging activity, kernel
locking is rendered ineffective by paging out kernel data. Kernel locking only affects pageable page sizes
in the system.
Before changing the vmm_klock_mode value on AIX 6.1, verify that the following is installed (a command
sketch follows):
• AIX 6.1 TL6 SP5+, plus APAR IZ95744.
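A minimal sketch, assuming that (as for other boot-time VMM tunables) the change is staged with -r and needs a bosboot and reboot:
# vmo -o vmm_klock_mode              # display the current value
# vmo -r -o vmm_klock_mode=2         # restricted tunable; change only on IBM Support advice
# bosboot -a                         # then reboot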
2.2.8. AIX Enhanced Memory Affinity
PowerVM hypervisor and AIX operating system (AIX 6.1 TL 5 and above) on Power7 have implemented
affinity enhancements to achieve optimized performance for workloads running in a virtualized shared
processor logical partition (SPLPAR) environment. Power7 architecture is more sensitive to application
memory affinity due to 4x the number of cores on a chip compared to the Power6 chip. Proper planning of
LPAR configuration would increase the probability of getting both CPU and memory in the same domain
in relation to the topology of a system.
The PowerVM Hypervisor selects the required processors and memory configured for an LPAR from the
system free resource pool. During this selection process, the hypervisor takes the topology of the system
(based on the “shared memory” model) into consideration and allocates as best as it can processors and
memory where both resources are in the close proximity. This ensures that the workload on a LPAR has
the lowest latency in accessing its memory.
High-end Power Systems use multiple building blocks (CECs) to scale system capacity. Each building
block has its own set of CPUs and memory chips. As an example, the largest POWER6-based system (595)
had 64 cores across eight CPU books. The largest POWER7-based system has 256 physical cores across
the same eight CPU books, each book containing four chips with 8 cores per POWER7 chip, for a total of
32 cores per book. This larger number of total threads increases demand on the memory fabric, so
CPU-to-memory affinity has a greater impact on large-system scalability and performance.
Figure 15 - Multi-Tier memory hierarchy of Power System 795
Figure 15 presents the multi-tier memory hierarchy of the Power System 795 processor books:
• "Local" memory access between a POWER7 chip and the memory that is directly attached to that
  chip's memory controller (MC1/MC2)
• "Near" memory access between two POWER7 chips over the direct intra-node communication
  paths built into the MCM
• "Far" memory access over inter-node communication paths. For 770/780 models this is a flex cable
  connecting two CEC drawers; for the 795 model it is a communication path built into the system
  backplane.
The same type of mapping applied to all current IBM Power7 based systems is presented here:
Figure 16 - Power7 Systems - Memory "distances"
From an AIX LPAR perspective, the lssrad command can be used to display CPU/Memory topology. The
lssrad syntax is: lssrad –av. Sample output for a 2-node Power System 770 is presented in Figure 17.
Figure 17 – Power System 770 "lssrad -av" sample output
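As an illustration (values hypothetical), output of the following form would indicate two scheduler resource affinity domains (REF1 0 and 1), each with its own SRAD, memory and CPU range; CPUs and memory grouped under the same REF1/SRAD indicate good locality:
# lssrad -av
REF1   SRAD        MEM      CPU
0
          0   31806.31     0-15
1
          1   31553.75     16-31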
After the initial machine state (reboot) and when all the resources are free, PowerVM will allocate
memory and cores as optimally as possible. If the LPAR configuration has a combination of near, local
and far memory allocated to it, SGA will be (more or less) evenly spread (striped) across all of it. The
greater the number of CECs involved, the greater the likelihood of remote memory accesses. Oracle PGA
for a given process tends to be allocated in the near memory of the processor that the process was running
on when its memory was allocated. The AIX dispatcher will attempt to maintain affinity between a given
process and the processor that process gets scheduled on. RSETs may (optionally) be used to force affinity
to a subset of available processors (e.g. those on a given chip, or within a given CEC), although this could
potentially cause dispatching delays in heavily loaded environments and is not generally recommended.
Oracle-based LPARs that require high performance can be given the best resources by activating the
critical LPAR first, prior to activating any other LPARs, including the VIO Server. The VIO Server hosts
I/O for all the LPARs, so it must be up to service the other LPARs; however, LPAR resource assignment
by the hypervisor occurs prior to any access to storage. Therefore, critical LPARs can be activated first, go
through hypervisor resource assignment, and then wait for the VIO Server to be activated before
proceeding further.
After the initial configuration, the setup may not stay static. There will be numerous operations that take
place such as:
1. Reconfiguration of existing LPARs with new profiles
2. Reactivating existing LPARs and replacing them with new LPARs
3. Adding and removing resources to LPAR dynamically (DLPAR operations)
Any of these changes may result in memory fragmentation where LPARs are spread across multiple
domains. Fragmentation due to frequent movement of memory or processors between partitions can be
avoided through proper planning ahead of time. DLPAR actions can be done in a controlled way so that
the performance impact of resource addition or deletion will be minimal. Planning for growth will help to
alleviate the fragmentation caused by DLPAR operations. Knowing which LPARs need to grow or shrink
dynamically, and placing them alongside LPARs that can tolerate nodal crossing latency (less critical
LPARs), is one approach to handling changes to critical LPARs dynamically. In such a configuration,
when growth is needed for the critical LPAR, the resources assigned to the non-critical LPAR can be
reduced so that the critical LPAR can grow. Some additional recommendations are included here:
• Don’t over-allocate CPUs
o If a given workload (LPAR) requires <= 16 processors (single CEC), don’t allocate more
than 16 processors (2 or more CECs)
o If all the LPARs in a given shared pool require (in aggregate) <= 32 processors (2 CECs),
don’t allocate more than 32 processors (3 or more CECs) to the shared pool
o For Shared Processor LPARs, don’t overallocate vCPUs relative to Entitled Capacity
• Don’t over-allocate memory
o May cause processors/memory to be allocated on additional CECs because there wasn’t
sufficient free memory available on the optimal CEC
• Help the Hypervisor do its job
o Stay current on Firmware (e.g. AM720_101 or later) to avoid any known CPU/memory
allocation or virtual processor dispatching issues
o Where appropriate, consider LPAR boot order to ensure high-priority LPARs get the optimal choice of
the available CPUs and memory
• Consider the use of RSETs
o For example, where heavy application (e.g. java) workload is co-located on DB LPAR
o Or, could potentially affinitize some of the Oracle processes, e.g. system processes, or DB
connections spawned by a given listener
Enhanced feature of the mpstat -d command can be used to display statistics on local, near and far
memory accesses for every logical CPU in an LPAR, as presented here:
Figure 18 - "mpstat -d" sample output
The svmon command provides options to display memory affinity at the process level or segment level,
as well as the SRAD allocation for each thread.
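For example (a sketch; the -O affinity option is available on recent AIX 6.1/7.1 levels and <oracle_pid> is a placeholder):
# mpstat -d 10 6                         # six 10-second intervals of local/near/far dispatch statistics
# svmon -P <oracle_pid> -O affinity=on   # per-process memory affinity breakdown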
There is a vmo tunable enhanced_affinity_private that can be tuned from 0 to 100 to improve application
memory locality. The vmo tunable enhanced_affinity_private is set to 40 by default on AIX 6.1 TL6 or
AIX 7.1 and beyond. It is set to 20 on AIX 6.1 TL5. A value of 40 indicates the percentage of application
data that is allocated locally (or “affinitized”) on the memory behind its home POWER7 socket. The rest
of the memory is striped behind all the sockets in its partition. In order to decide the need to implement
this tuning parameter, an analysis of the AIX kernel performance statistics and an analysis of the AIX
kernel trace are required to determine if tuning changes to the enhanced affinity tunable parameters can
help to improve application performance. Increasing the degree of memory locality can negatively affect
application performance on LPARs configured with low memory or with workloads that consist of
multiple applications with various memory locality requirements. Determine how to configure
enhanced_affinity_private or other memory affinity parameters in a production environment after you
have performed sufficient testing with the application.
Important: The enhanced_affinity_private and enhanced_affinity_vmpool_limit vmo tunables are
restricted tunables and must not be changed unless recommended by IBM Support.
2.3. I/O Configuration
The I/O subsystem is one of the most important system components supporting an Oracle RDBMS.
Application performance can be limited by a suboptimal configuration of the disk I/O subsystem
and/or by the huge amount of I/O generated by the application's SQL code. If the application spends
most of its time waiting for I/O to complete, it is said to be I/O bound. This means that the I/O
subsystem is not able to service I/O requests fast enough to keep the CPU busy.
To avoid this situation, a proper storage subsystem and data layout planning is a mandatory prerequisite
before deploying the Oracle database. The most general rule to achieve optimal performance is evenly
distributing I/O load across all available elements of the storage subsystem(s).
This chapter focuses only on the subset of AIX I/O performance tunables that directly affect the
activity of the Oracle database.
2.3.1. AIX LVM - Volume Groups
A Volume Group is a logical storage structure consisting of the physical volumes that AIX treats as a
contiguous, addressable disk region. AIX Logical Volume Manager (LVM) is used for allocation
management of the basic storage elements (volume groups, physical volumes, logical volumes, file
systems etc).
Three different Volume Group types are supported starting with AIX 5.3: normal, big and scalable
VGs. The following figure shows the configuration limits for the different VG types:
Figure 19 - Volume Groups Configuration Limits
Our recommendation is to use scalable VGs for allocation of the Oracle database (a command sketch
follows). The Appendix "Configuring IBM TotalStorage for Oracle OLTP Applications" provides more
information about this recommendation.
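A minimal sketch of creating a scalable VG for Oracle data (disk and VG names, and the 128 MB PP size, are examples only):
# mkvg -S -y oradatavg -s 128 hdisk2 hdisk3 hdisk4 hdisk5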
2.3.2. AIX LVM - Logical Volumes
A logical volume is a set of the logical partitions that AIX makes available as a single storage unit. The
LVM can stripe logical volumes across multiple disks to reduce disk contention. Effective use of the
striping features in the LVM allows you to spread I/O more evenly across disks, resulting in greater
overall I/O performance.
AIX LVM supports two striping techniques:
• PP (Physical Partition) Spreading
• LVM Striping
PP Spreading is initiated during the creation of the logical volume itself. The result is a logical volume
that is spread across multiple devices (hdisk, vpath, hdiskpower...) in the Volume Group. To create a LV
with PPs spread equally over a set of hdisks in a VG use the following AIX command:
# mklv –e x –y <lvname> <vgname> …
If LVM striping has to be implemented, our advice is to use large-grained stripes. The strip size should
be at least 1MB and not less than the value determined by the following formula:
DB_BLOCK_SIZE * DB_FILE_MULTIBLOCK_READ_COUNT.
In many cases db_block_size * db_file_multiblock_read_count will be considerably less than 1MB, in
which case the 1MB minimum applies. There may be rare exceptions where a smaller strip size is
warranted, e.g. where EXCEPTIONALLY HIGH single-threaded sequential read/write performance is
needed.
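As an illustration of LVM striping with large-grained strips (all names and sizes are examples only):
# mklv -y oradata_lv -t jfs2 -S 1M oradatavg 512 hdisk2 hdisk3 hdisk4 hdisk5
This creates a logical volume of 512 logical partitions striped across four disks with a 1 MB strip size, satisfying the formula above for typical db_block_size and db_file_multiblock_read_count values.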
An AIX logical volume created using the default options will concatenate the underlying physical partitions
serially and adjacently across physical volumes.
An important assumption/prerequisite for this approach is that the underlying storage LUNs have been created
using typical RAID technologies (RAID-10 or RAID-5/RAID-6). Using both of these techniques (software and
hardware striping) will provide an optimal storage layout for any Oracle database.
Figure 20 - Optimal Storage Layout
There is at least one clear advantage of PP spreading over LVM striping. If the LV created using the PP
spreading technique has to be extended, any number of disk drives (hdisks) can be added. Once the VG is
extended, optimal PP re-distribution of the new disk layout is achieved using the ‘reorgvg’ command and
can be done online.
Extending an LV that was created using LVM striping is a more complex process, potentially
requiring application downtime. The number of drives added to the VG must be equal to (or a multiple of)
the stripe width with which the LV was originally defined and created.
If raw devices are being used, the AIX Logical Volumes should be created with the Zero Offset Feature.
By default, when regular or Big Volume Groups are used, the first 4k bytes of each AIX Logical Volume
are reserved for the Logical Volume Control Block (LVCB), with Oracle data beginning at the 2nd 4k
block in the Logical Volume. Beginning with Oracle9i R2, it is possible to go to a zero offset format
where Oracle data begins at byte zero (0) of the Logical Volume. When a db_block_size > 4k is used,
using the zero offset option can improve I/O subsystem performance, particularly where some form of
striping (e.g. AIX LVM or hardware based RAID-5 or RAID-10) is used.
Execute the following steps to take advantage of the zero offset feature (a command sketch follows this list):
• create a “scalable” or “big” Volume Group using the mkvg –S or –B flag,
• If a “big” Volume Group is being used, create one or more Logical Volumes in that Volume
group using the mklv –T O (not 0) flag. The “-T O” option removes LVCB offset so Oracle
can safely use this Logical Volume.
• Logical Volumes created in Scalable Volume Groups don’t have a LVCB offset and therefore
do not require the use of the “-T O” flag
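A minimal command sketch of the steps above (VG, LV and disk names are examples only):
# mkvg -B -y oraclevg hdisk6 hdisk7        # "big" VG; use -S instead of -B for a scalable VG
# mklv -T O -y redo1_lv oraclevg 16 hdisk6 # zero LVCB offset, safe for raw Oracle use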
2.3.3. AIX sequential read ahead
The VMM observes the pattern in which a process accesses a file, and if the file is read sequentially, reads
additional data before the application requests it and places the data in file system cache. When the
process accesses two successive pages of the file, the VMM assumes that the program will continue to
access the file sequentially, and schedules additional sequential reads of the file. These reads overlap the
program processing and make data available to Oracle sooner.
Two VMM thresholds, implemented as kernel parameters, determine the number of pages it reads ahead:
• MINPGAHEAD (JFS) or j2_minPageReadAhead (JFS2) - the number of pages read ahead when the
  VMM first detects the sequential access pattern,
  o default: 2,
  o recommended starting value: MAX(2, DB_BLOCK_SIZE/4096),
  o example: ioo -p -o j2_minPageReadAhead=8
  o restricted tunable in AIX 6.1 and 7.1
• MAXPGAHEAD (JFS) or j2_maxPageReadAhead (JFS2) - the maximum number of pages that the
  VMM reads ahead in a sequential file,
  o default: 8 (JFS) or 128 (JFS2),
  o recommended value: equal to (or a multiple of) the size of the largest Oracle I/O request,
  o max[256, (DB_BLOCK_SIZE/4096) * DB_FILE_MULTIBLOCK_READ_COUNT]
  o example: ioo -p -o j2_maxPageReadAhead=256
  o restricted tunable in AIX 6.1 and 7.1
Set the MINPGAHEAD and MAXPGAHEAD (and equivalent jfs2) parameters to appropriate values for
your application (a power of two). Use the ioo command to change these values. You can use higher
values for the MAXPGAHEAD parameter in systems where the sequential performance of striped logical
volumes is of paramount importance. Be aware that too much read ahead has the disadvantage of using
system resources when the application only partly reads the file.
2.3.4. I/O Buffer Tuning
As I/Os traverse the I/O stack, AIX must keep track of outstanding I/Os and does so at several layers using
buffers (bufs). The different bufs are pinned memory buffers which are used to hold each individual I/O
(Figure 21). The LVM device driver uses pbufs, the file systems (JFS, JFS2, NFS or VxFS) use fsbufs,
and I/Os to page space use psbufs. The number of LVM pbufs is controlled using the lvmo command and
the pv_pbuf_count attribute. The number of buffers used for file systems varies depending on the type of file
system. The number of buffers for JFS is controlled by the numfsbufs attribute. JFS2 uses the
j2_nBufferPerPagerDevice and j2_dynamicBufferPreallocation attributes for tuning, and NFS uses the
nfs_vX_pdts and nfs_vX_vm_bufs attributes (where X depends on the version of NFS).
To clarify the related tuning steps we will use the sample result of the following command:
# vmstat –v | grep blocked
0 pending disk I/Os blocked with no pbuf
o for pbufs, use the "lvmo [-v VgName] -a" command to find which VGs have a non-zero
  pervg_blocked_io_count value, and then use lvmo to increase pv_pbuf_count, which is a
  dynamic change (see the command sketch after this list). max_vg_pbuf_count is the maximum
  number of pbufs that can be allocated for the volume group; for a change to this value to take
  effect, the volume group must be varied off and varied on again. If you increase the pbuf value
  too much, you might see degradation in performance or unexpected system behavior.
0 paging space I/Os blocked with no psbuf
o for psbufs, preferably stop paging, else add more paging spaces.
8755 filesystem I/Os blocked with no fsbuf -> JFS
o for fsbufs, increase numfsbufs using ioo; the default is 196, the recommended value is 1568,
  and a remount is required for the change to take effect.
0 client filesystem I/Os blocked with no fsbuf -> NFS/Veritas
o for client file system fsbufs, increase:
  • nfso's nfs_vX_pdts and nfs_vX_vm_bufs (where X depends on the NFS version
    being used: 2, 3 or 4)
  • nfs_vX_vm_bufs is dynamic but can only be increased; nfs_vX_pdts requires a remount
o restricted tunables in AIX 6.1 and 7.1
2365 external pager filesystem I/Os blocked with no fsbuf -> JFS2
o if there is a shortage of file system buffstructs for JFS2:
  • the j2_dynamicBufferPreallocation tunable should be tuned (increased) first before making any
    change to the j2_nBufferPerPagerDevice parameter; it is a dynamic parameter and no file
    system remount is required.
  • if j2_nBufferPerPagerDevice has to be increased, the recommended value is 2048; it is a
    restricted tunable in AIX 6.1 and 7.1
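A consolidated command sketch for the buffer-tuning steps above (the VG name is an example only):
# lvmo -v oradatavg -a                       # per-VG pbuf statistics (pervg_blocked_io_count)
# lvmo -v oradatavg -o pv_pbuf_count=1024    # dynamic increase of pbufs for that VG
# ioo -p -o numfsbufs=1568                   # JFS; takes effect after the file system is remounted
# ioo -o j2_dynamicBufferPreallocation       # JFS2; display current value, increase if needed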
Figure 21 - AIX I/O Buffering
It is important for the reader to understand that the exact reported values don’t matter here. What is
important is how quickly values change over time, so you really need to look at consecutive “vmstat –v”
results captured over some period of time – not a single “vmstat –v” output.
2.3.5. Disk I/O Tuning
The storage subsystem is one of the most important infrastructure elements affecting overall Oracle
database response time and throughput. The tuning recommendations included in this chapter are
primarily focused on optimizing the I/O related buffers and queues at the operating system level. Queues
or buffers are used to keep track of I/O requests at each layer, and also act as a limiting factor on the
number of in-flight I/Os at that layer.
Below is an overview of the most important layers in the I/O stack below the AIX VMM layer.
Figure 22 - Disk I/O Layers
I/O requests initiated by the Oracle database (once they have exited the layers discussed in the previous
chapter) are submitted to the storage subsystem, which is represented through different storage-related
entities. Each one of these elements contains its own queuing mechanism. Once an I/O request reaches the
first layer (usually a multipathing solution, which is an optional element), it is submitted to the queue to
wait for the resources required to service it. As long as the required resources are available to satisfy the
I/O request, the data is returned to the initiator of the I/O request promptly. If the resources are not
available, the I/O request is held in the appropriate queue until the resources become available.
Some multipath solutions (SDDPCM/MPIO etc.) provide their own queuing mechanism; others just pass
the I/O requests to the next layer. The only way to obtain a proper description of the queuing mechanism
associated with a specific solution is to check the documentation provided by the solution developer.
The next layer accepting the I/O request, based on the most optimal path defined by the multipath
solution, is the hdisk device driver. Even though the disk is connected to the adapter, the hdisk driver code
is utilized before the adapter code. Queue depth of the specific device (disk or adapter) defines how many
I/O requests can be in-flight at a certain point in time. This queue submission process (if the I/O is queued
and not submitted to the next layer right away) generates a delay in the response time presented to the
higher layers (AIX and Oracle). At the hdisk driver, two response times are recorded and used to check
the quality of the execution, queue wait time and service time (read I/O or write I/O related). Queue wait
time is the time the I/O request spent in the wait queue and the service time is how long it takes before the
I/O response is returned from the storage (either an acknowledgement for a write request or the data
returned as a result of the read request).
The size of the queues can be increased, but this change involves a general trade-off: while the response
time at the specific layer usually decreases, more I/Os are passed on to the next layer, where a bottleneck
could potentially occur.
The device attribute that controls the queue depth of the hdisk device is queue_depth. To check the value
of this attribute, execute the following command: lsattr -El hdiskN (Figure 23); a command sketch follows
the figure. If a VIO Server is used for storage provisioning, the hdisk queue_depth attribute value on the
LPARs (VIO clients) should be set to match the queue_depth of the mapped hdisk on the VIO Server.
Figure 23 - “lsattr –El hdisk2” sample output
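A command sketch for checking and adjusting the hdisk queue depth (hdisk2 and the value 32 are examples only; coordinate changes with the storage vendor):
# lsattr -El hdisk2 -a queue_depth           # display the current value
# chdev -l hdisk2 -a queue_depth=32 -P       # staged with -P; takes effect at the next reboot
                                             # (or change with the device varied off / not in use)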
The attribute defining the queue depth of the adapter is num_cmd_elems. To check the value of this
attribute, execute: lsattr –El fcsN (Figure 24).
Figure 24 - "lsattr -El fcs2" sample output
An additional storage adapter tuning attribute that can provide a potential performance benefit is
max_xfer_size. It controls the maximum I/O size the adapter device driver will handle and also controls
the memory area used by the adapter for data transfers. With the default value
(max_xfer_size=0x100000) the memory area is 16 MB in size. The value is sometimes changed to
0x200000, which increases the memory region to 128 MB and can potentially improve the adapter's
bandwidth. This applies to physical adapters, not NPIV virtual adapters. Depending on the hardware
configuration, changing this attribute can make the system unbootable (if booted from SAN) or make the
disk drives attached through this adapter unavailable.
Queue depth defaults are specified by the storage vendor and populated in the AIX ODM (Object Data
Manager) via file sets provided to support the storage.
To evaluate and/or verify the performance of the storage subsystem utilizing the device queues (adapters
and hdisks) the following commands can be utilized: iostat and filemon.
The iostat data collection process is usually executed during the peak load of the system (for a period of
1-2 hours, taking snapshots every 15-30 seconds); a sample collection command follows the option list
below. It is usually executed in parallel with additional data collection processes (AIX nmon, Oracle AWR
etc.) to establish performance data correlation across all the layers. This correlation process can be used to
find potential performance anomalies. Sample output from the "iostat -alDRT" command is included in the
screenshot below.
The most common set of options submitted with the iostat command to collect this information are:
• -a: adapter throughput report,
• -D: extended disk drive report
• -R: Specifies that the reset of min* and max* values should happen at each interval
• -l: output in long listing mode
• -T: specifies the time stamp
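A sample collection command (a sketch; interval, count and output path are examples only):
# iostat -alDRT 30 240 > /tmp/iostat_$(hostname).out     # 30-second snapshots for 2 hours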
The most important values to be examined in the iostat generated output are:
• Average read service time (read -> avgserv). This value depends on many different factors;
expected optimal average read service time should be lower than 10ms.
• Average write service time (write -> avgserv), with the basic assumption that the cache
implemented in the storage subsystem has enough empty space to support optimal I/O
execution, should average less than 3 ms. If the storage subsystem cache de-staging mechanism
is overloaded, these numbers will start growing substantially.
• Average queue time (avgtime) and the rate per second at which I/O requests are submitted to a
  full queue (sqfull) should be equal to zero. If these values are constantly above 0, the queue
  is overloaded and the queue_depth attribute should be increased, unless I/O service times are
  poor, indicating a disk bottleneck (in which case increasing queue_depth just results in I/Os
  waiting at the storage rather than in the queue).
Figure 25 - “iostat” sample output
Executing "fcstat adapter_name" will display the status of the adapter queue, as presented in this example
(Figure 26); a command sketch follows the figure. If the values for "No Adapter Elements Count" or
"No Command Elements Count" are non-zero, then increasing num_cmd_elems is warranted. If "No DMA
Resource Count" is non-zero, then changing max_xfer_size to 0x200000 is warranted (if not already at that
value) for physical FC adapters.
Figure 26 - “fcstat” sample output
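A command sketch for checking the adapter counters and staging the related changes (fcs0 and the value 1024 are examples only; coordinate with the storage vendor and/or IBM Support):
# fcstat fcs0 | egrep "No DMA Resource|No Adapter Elements|No Command Element"
# chdev -l fcs0 -a num_cmd_elems=1024 -a max_xfer_size=0x200000 -P    # takes effect at the next reboot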
Changing the value associated with any of the previously mentioned attributes (queue_depth and/or
num_cmd_elems) should be executed with the close supervision of the storage vendor representatives
and/or IBM Support Service.
The filemon command utilizes AIX trace functionality to monitor file-system and I/O-system events
and reports performance statistics for files, virtual memory segments, logical volumes and physical
volumes. This command is very useful when the Oracle database is believed to be disk-bound and there is
a known performance issue regarding I/O activity on the system. The filemon command is capable of
showing the load on different disks, logical volumes and files in great detail.
The filemon command monitors file and I/O activity at four different levels:
• Logical file system - monitors logical I/O operations on logical files. I/O statistics are kept on a
  per-file basis.
• Virtual memory system - monitors physical I/O operations (that is, paging) between segments and
  their images on disk. I/O statistics are kept on a per-segment basis.
• Logical volumes - monitors I/O operations on logical volumes. I/O statistics are kept on a
  per-logical-volume basis.
• Physical volumes - monitors I/O operations on physical volumes. At this level, physical resource
  utilizations are obtained. I/O statistics are kept on a per-physical-volume basis.
The data collection process is executed during the peak load of the system for a very short period of time
(rarely longer than 60 s). The command itself generates performance/trace data. Before the data is written
to the output file, it is stored in a circular kernel trace buffer whose size is defined when the command is
executed. If the load on the system is substantial, the records in the buffer may be overwritten, which
makes the entire data collection invalid (the error message "TRACEBUFFER WRAPAROUND" will be
included in the output file). In that case, the size of the buffer should be increased or the data collection
period should be decreased. The process is repeated until a valid data set is generated.
The filemon sample command most commonly used for the data collection process is:
filemon -u -v -o output.out -O all -T 100000000; sleep 60; trcstop.
The options included with the command can change with every major AIX release and should be validated
before they are used for the first time.
Options included in the filemon sample are:
• -u: reports on the files that are opened before the trace daemon is started
• -v: includes extra information in the report
• -o: defines the name of the output file (instead of stdout)
• -O levels: monitors specified levels of the file system (all 4 layers included in this sample)
• -T: size of the kernel trace buffer in bytes
• sleep 60: data collection process is active for 60s
• trcstop: command used to stop the tracing process
The snapshot below presents an extract of the output file generated using the filemon command:
Figure 27 - “filemon” sample output
The main focus of this investigation is the quality of the I/O activity on the most active volumes
(logical and/or physical). The main elements describing the level of I/O activity are the average read and
write service times. The rules of thumb used to judge these values are similar to those established for the
iostat command (average read service time should be <= 10 ms and average write service time should be
in the range of 2-3 ms).
There is a wide range of potential reasons for a suboptimal storage subsystem configuration, but the most
common are:
• initial build of the subsystem was not properly sized and configured
• data layout resulting in imbalances in the I/O rate to physical disks resulting in hot disks or
disk “hot spots”
• lack of capacity planning data that could be used to properly predict the growth of the
application and database load and its impact on the overall system performance.
• missing or improperly defined key performance indicators that could enable easier and faster
  recognition of potential performance anomalies, etc.
The procedures explained in this chapter can be used to eliminate common disk I/O performance
anomalies. However, they will not help achieve an optimal storage subsystem configuration if any of the
previously mentioned elements are missing.
2.3.6. Disk I/O Pacing
I/O pacing is intended to prevent programs that generate very large amounts of I/O from saturating the
system's I/O facilities and causing the response times of less-demanding programs to deteriorate. I/O
pacing enforces per-segment, or per-file, high- and low-water marks on the sum of all pending I/Os. When
a process tries to write to a file that already has high-water-mark pending writes, the process is put to sleep
until enough I/Os have completed to make the number of pending writes less than or equal to the
low-water mark. The logic of I/O-request handling does not change. The output from high-volume
processes is slowed down somewhat.
• The maxpout parameter specifies the number of pages that can be scheduled in the I/O state to a
  file before the threads are suspended.
• The minpout parameter specifies the minimum number of scheduled pages at which the threads
  are woken up from the suspended state.
The system-wide values for maxpout and minpout can be dynamically changed via SMIT or the
"chdev -l sys0" command. The system-wide settings may also be overridden at the individual file-system
level by using the maxpout and minpout mount options.
The best range for the maxpout and minpout parameters depends on the CPU speed and the I/O system.
Generally, the newer and faster the server, the larger the minpout and maxpout settings should be. For
current generation hardware, the AIX 6.1 and 7.1 defaults (maxpout=8193, minpout=4096) are usually a
good starting point. In AIX 5.3, I/O pacing is not enabled by default, so we recommend setting
maxpout=8193 and minpout=4096 for AIX 5.3 as well. This can be done with the following command:
# chdev -l sys0 -a maxpout=8193 -a minpout=4096
When considering setting lower maxpout and minpout values, it would be advisable to open a
performance PMR with IBM Support to obtain expert tuning advice for your environment. Generally,
maxpout should not be less than the value of the j2_nPagesPerWriteBehindCluster parameter (which
defaults to 32) and usually much higher values are appropriate. Be aware that there are a number of
industry "best practices" recommendations in circulation that were developed several years ago, and never
revised to reflect current generation technology.
I/O pacing may also be implemented for NFS-based filesystems using the nfs_iopace_pages parameter. The AIX 6.1 and 7.1 default (1024 pages) is also a good starting point for AIX 5.3. This can be set with the following command:
# nfso -o nfs_iopace_pages=1024
2.3.7. Asynchronous I/O
Oracle takes full advantage of Asynchronous I/O (AIO) provided by AIX, resulting in faster database
access. AIO allows Oracle to hand off I/O requests to AIX so Oracle can continue working (typically
submitting more I/Os) without having to wait synchronously for each I/O to complete; thus, improving
performance and throughput. Since AIX 5L, two AIO subsystems are available: legacy AIO and POSIX
AIO. All Oracle releases up to and including the current release, 11g, use the legacy AIO subsystem.
AIX 5L and higher versions support Asynchronous I/O (AIO) for database files created both on file
system partitions and on raw devices, with fast paths available for raw devices and CIO file system I/Os.
Fast paths offer reduced context switching, less CPU overhead and don’t require an AIO kernel process to
handle the I/O. When using AIO without the fast path, the kernel server processes (kproc) control each
request from the time a request is taken off the queue until it completes. The number of kproc servers
determines the number of AIO requests that can be executed in the system concurrently, so it is important
to tune the number of kproc processes when using file systems to store Oracle data files. With AIX versions prior to 6.1, if the database uses AIO, the corresponding subsystem must be activated by setting the autoconfig attribute of the aio0 device to available; changing this attribute requires a reboot of the system to take effect. In AIX 6.1 and 7.1, AIO is always started.
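For reference, the following commands can be used to check and activate the legacy AIO subsystem; the first two apply to AIX 5.3, while the third applies to AIX 6.1/7.1, where the settings are exposed as ioo tunables:
# lsattr -El aio0
# chdev -l aio0 -a autoconfig=available
# ioo -a | grep aio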
Figure 28 - AIX Asynchronous I/O
When using AIO on raw devices and disk drives in an Oracle ASM environment, I/O is done via the fastpath when the fastpath option is enabled (the default in AIX 5.3, 6.1 and 7.1). This hands off the I/O to the hdisk driver, reducing CPU context switching. Starting with AIX 5.3, fastpath functionality for CIO-accessed file systems can be enabled by running # aioo -o fsfastpath=1. This change is not persistent across a system reboot. In AIX 6.1 and 7.1, the corresponding restricted tunable, aio_fsfastpath, is set to 1 by default.
Figure 29 - AIX Fastpath Asynchronous I/O
The default minimum number of async I/O servers (kprocs) configured when async I/O is enabled is defined by the minservers attribute. There is also a maximum number of async I/O servers that can be created, which is controlled by the maxservers attribute. If the number of async I/O requests is high, increase the maxservers value to approximately the number of simultaneous I/Os there might be. In most cases, it is better to leave the minservers parameter at the default value because the AIO kernel extension will generate additional servers if needed. These parameters apply to filesystem I/O only; they do not apply to raw devices or files opened in CIO mode, which use the fastpath and therefore do not use kproc server processes.
In AIX 5.3, the default value for the minimum number of AIO servers is 1 (a system-wide value), while the default value for the maximum number of AIO servers is 10 (a per-CPU value). In AIX 6.1 and AIX 7.1 the scope and default values change: aio_minservers = 3 and aio_maxservers = 30, defined per logical CPU. All AIO parameters in AIX 6.1 and AIX 7.1 carry a leading aio_ prefix in the parameter name.
In addition to changing the tunables minservers and maxservers for AIO, the maxreqs tunable should also
be monitored and changed, if necessary. The maxreqs tunable controls the number of concurrent requests
the AIO system allows and it is not applicable for (fs)fastpath based I/O operations.
To display the number of AIO servers running, enter the following command as the root user:
# pstat -a | egrep ' aioserver' | wc -l
Over time, the number of AIO servers shown by pstat will approach the maximum number defined. This is expected behavior.
AIX 5.3 added a new option for the iostat command (-A), which displays AIO statistics for the specified interval and count (an example invocation follows the column list below). Starting with AIX 6.1, all AIO subsystem parameters became tunable via the ioo command (the aioo command is deprecated).
The Asynchronous I/O report has the following column headers:
• avgc: Average global AIO request count per second for the specified interval
• avfc: Average fastpath request count per second for the specified interval
• maxgc: Maximum global AIO request count since the last time this value was fetched
• maxfc: Maximum fastpath request count since the last time this value was fetched
• maxreqs: Maximum AIO requests allowed
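An example invocation (the interval and count values are arbitrary) that produces the report described above:
# iostat -A 5 3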
Recommended starting values:
• minservers:
o entry value for AIX 5.3 is defined per system; the starting value should be 3, and additional servers will automatically start based on the requirements of the system load
o entry values for AIX 6.1 and 7.1 are defined per logical CPU; the default value is 3 (aio_minservers)
• maxservers:
o entry value for AIX 5.3 is defined per CPU; the starting value should be 30. Database environments with heavy I/O requirements may need 2-4x this value.
o entry value for AIX 6.1 and 7.1 is defined per CPU; the starting value should be 30 for filesystem environments (aio_maxservers)
• maxreqs:
o a multiple of 4096 that is > 4 * number of disks * disk_queue_depth; a typical setting is 65536 for AIX 5.3 or AIX 6.1 and 131072 for AIX 7.1 (aio_maxreqs). A worked example follows the command list below.
If the value of the maxreqs or maxservers parameter is set too low, you might see the following error
messages repeated in the Oracle alert log:
Warning: lio_listio returned EAGAIN
Performance degradation may be seen.
Use one of the following commands to set the minimum and maximum number of servers and the maximum number of concurrent requests:
o in AIX 5.3:
• the "smit aio" AIX management panel
• chdev -P -l aio0 -a maxservers='m' -a minservers='n'
• chdev -P -l aio0 -a maxreqs='q'
o in AIX 6.1 and 7.1, the ioo command is used to change all AIO-related tunables:
• ioo -p -o aio_maxservers='m' -o aio_minservers='n'
• ioo -p -o aio_maxreqs='q'
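As a worked example (the disk count and queue depth are hypothetical), an LPAR with 40 hdisks at a queue depth of 16 yields 4 * 40 * 16 = 2560 potential outstanding requests, so any multiple of 4096 at or above that value satisfies the maxreqs rule; the typical value of 65536 leaves ample headroom:
# ioo -p -o aio_maxservers=30 -o aio_minservers=3
# ioo -p -o aio_maxreqs=65536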
Additionally, the following Oracle parameters should be checked. A detailed explanation can be found in "Chapter 3: Oracle Tuning". The default values are shown:
• DISK_ASYNCH_IO = TRUE
• FILESYSTEMIO_OPTIONS = (ASYNCH | SETALL)
• DB_WRITER_PROCESSES = usually left at the default value, i.e. not set in init.ora or spfile
2.3.8. JFS2 file system DIO/CIO mount options
o Direct I/O (DIO)
Certain classes of applications (and related DB activities) derive no benefit from the file buffer cache.
Databases normally manage data caching at the application level, so they do not need the file system to
implement this service for them. The use of a file buffer cache results in undesirable overhead in such
cases, since data is first moved from the disk to the file buffer cache and from there to the application
buffer. This “double-copying” of data results in additional CPU consumption. Also, the duplication of
application data within the file buffer cache increases the amount of memory used for the same data,
making less memory available for the application, resulting in additional system overhead due to memory
management. A simplified view of this activity is presented in Figure 30.
Figure 30 - AIX & Oracle Cached I/O Operations
For applications that wish to bypass the buffering of memory within the file system cache, Direct I/O is
provided as an option in JFS. When Direct I/O is used for a file, data is transferred directly from the disk
to the application buffer, without the use of the file buffer cache. While Direct I/O may benefit some
applications, its use disables any file system mechanism that depends on the use of file system cache, e.g.
sequential read-ahead. Applications that perform a significant amount of sequential read I/O may
experience degraded performance when Direct I/O is used. In some cases, it may be possible to
compensate for the loss of filesystem cache related mechanisms by increasing the size of Oracle DB
buffer cache and/or increasing the Oracle db_file_multiblock_read_count parameter. A simplified view of this activity is presented in Figure 31.
Figure 31 - AIX & Oracle Direct I/O Operations
JFS2 supports Direct I/O as well as Concurrent I/O (discussed below) options. The Concurrent I/O model
is built on top of the Direct I/O model. For JFS2 based environments, Concurrent I/O should always be
used (instead of Direct I/O) for those situations where the bypass of file system cache is appropriate.
JFS Direct I/O should only be used against Oracle data (.dbf) files for the environments where the
DB_BLOCK_SIZE is 4k or greater. Use of JFS Direct I/O on any other files, including redo logs or
control files is likely to result in a severe performance penalty due to violation of Direct I/O alignment
and/or I/O transfer size restrictions.
o Concurrent I/O (CIO)
JFS2 (like JFS) is a POSIX-compliant file system. As such, JFS2 (by default) employs I/O serialization mechanisms to ensure the integrity of data being updated. An inode lock imposes write serialization at the file level: it ensures that there is at most one outstanding write I/O to a file at any point in time, and concurrent reads are not allowed during a write because they might return stale data. An Oracle database implements its own I/O serialization mechanisms to ensure data integrity and consequently does not rely on the file system based (POSIX standard) I/O serialization mechanisms. In write-intensive environments, inode serialization hinders performance by unnecessarily serializing non-competing data accesses.
For applications such as Oracle Database that provide their own I/O serialization mechanisms, JFS2
(beginning with AIX 5.2.10) offers the Concurrent I/O (CIO) option. Under Concurrent I/O, multiple
threads may simultaneously perform reads and writes on a shared file. Applications that do not enforce
serialization for accesses to shared files (including operating system level utilities) should not use
Concurrent I/O, as this could result in data corruption due to competing accesses and/or severe performance penalties.
Concurrent I/O should only be used for Oracle .dbf files (data & index, rbs or undo, system and temp), online redo logs and/or control files. When used for online redo logs or control files, these files should be isolated in their own JFS2 file system(s) created with agblksize=512. File system(s) containing .dbf files should be created with agblksize=2048 if DB_BLOCK_SIZE=2k, or agblksize=4096 if DB_BLOCK_SIZE is 4k or larger. Failure to implement these agblksize guidelines is likely to result in a severe performance penalty. Do not, under any circumstances, use the CIO mount option for file systems containing the Oracle binaries. Additionally, do not use the DIO/CIO options for file systems containing archive logs or any other files not already discussed.
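To illustrate (the volume group, sizes and mount points are hypothetical, and the cio mount option applies only where appropriate for the Oracle release in use, as discussed below), separate JFS2 file systems for redo logs and for 8k-block datafiles could be created and mounted as follows:
# crfs -v jfs2 -g oravg -m /oraredo -a size=8G -a agblksize=512 -A yes
# crfs -v jfs2 -g oravg -m /oradata -a size=200G -a agblksize=4096 -A yes
# mount -o cio /oraredo
# mount -o cio /oradata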
For applications that wish to bypass the buffering of memory within the file system cache, Concurrent I/O
is the preferred option for JFS2. When Concurrent I/O is used for a file, data is transferred directly from
the disk to the application buffer, without the use of the file buffer cache. In addition, the POSIX based
I/O serialization mechanisms are disabled, providing better I/O concurrency for write intensive
applications. While Concurrent I/O may benefit some applications, its use disables any file system
mechanisms that depend on the use of file system cache, e.g. sequential read-ahead. Applications that
perform a significant amount of sequential read I/O may experience degraded performance when
Concurrent I/O is used.
Starting with Oracle 10g, the advanced I/O capabilities of JFS2 can be used by default from the Oracle
database server. If the filesystem type JFS2 is detected by the database server, all Oracle datafiles residing
on this filesystem are accessed using the CIO option. To use CIO with JFS2 for Oracle 10g / 11g together
with ASYNC I/O operation, set the initialization parameter FILESYSTEMIO_OPTIONS=SETALL in the
(s)pfile. Chapter 3 provides additional information about the FILESYSTEMIO_OPTIONS parameter.
If a datafile is opened in CIO mode by Oracle using O_CIO flag, all access to this datafile via the regular
filesystem cache is prohibited by the AIX OS. Tools like cp, dd, cpio, tar etc. do not open Oracle datafiles
with the CIO option. Therefore these tools will receive an error from the AIX operating system when
trying to access the Oracle datafiles if the database is open. Without corrective action, an attempt to
perform an online backup using these tools will fail.
To avoid the problem, with Oracle < 11.2.0.2, all JFS2 filesystems containing Oracle datafiles should be
mounted with the mount -o cio option. This enforces a bypass of the filesystem cache for all programs. If
Oracle redo log and control files are on JFS2 as well, the mount -o cio option must be used accordingly.
Note that access to the Oracle datafiles by all other programs which do not support CIO directly will probably be significantly slowed down, because caching, read-ahead, etc. are no longer possible.
AIX 6.1 and AIX 7.1 combined with Oracle >= 11.2.0.2 introduce a new open flag, O_CIOR, which is the same as O_CIO but allows subsequent open calls without CIO. The advantage of this enhancement is that other applications like cp, dd and cpio can access database files in read-only mode without having to open them with CIO. Starting with Oracle 11.2.0.2, when AIX 6.1 or AIX 7.1 is detected, Oracle uses the O_CIOR option to open files on JFS2. Therefore you should no longer mount these filesystems with the -o cio mount option.
Applications that use raw logical volumes or ASM for data storage don’t encounter inode lock contention
since they don’t access files.
2.3.9. Oracle ASM & AIX Considerations
Oracle Automatic Storage Manager (ASM) is Oracle’s recommended storage management solution that
provides an alternative to conventional volume managers, file systems and raw devices.
ASM requires a special type of Oracle instance to provide the interface between a traditional Oracle
instance and the storage elements presented to AIX. The Oracle ASM instance mounts disk groups to
make Oracle ASM files available to database instances. An Oracle ASM instance is built on the same
technology as an Oracle Database instance. The ASM software component is shipped with the Grid
Infrastructure software.
Figure 32 - Oracle ASM Architecture
Oracle ASM uses disk groups to store data files. A disk group consists of multiple disks and is the
fundamental object that Oracle ASM manages. Disk group components include disks, files, and allocation
units. For each ASM disk group, a level of redundancy is defined, which may be normal (mirrored), high
(3 mirrors), or external (no ASM mirroring). When a file is created within ASM, it is automatically striped across all disks allocated to the disk group, according to the stripe specification in V$ASM_TEMPLATE. The stripe width may be fine (a 128k stripe width) or coarse, indicating the stripe width of the ASM disk group, which is called the allocation unit size, or AU_SIZE (the default is 1M).
The raw device files given to ASM are organized into ASM disk groups. In a typical configuration, two
disk groups are created, one for data and one for a recovery area, but if LUNs are of different sizes or
performance characteristics, more disk groups should be configured, with only similar LUNs (similar in
terms of size, performance, and raid characteristics) comprising each disk group. For large database
environments, it may also be advisable to have more than one data disk group for manageability. Figure
33 provides a sample list of the ASM disk groups created for the standard SAP installation for a very large
database.
Figure 33 - SAP recommended ASM configuration for a very large database
Oracle ASM disks are the storage devices that are provisioned to Oracle ASM disk groups. The most commonly used storage objects mapped to ASM disks are AIX raw hdisks and AIX raw logical volumes. Both can be backed by storage LUNs or actual disk drives. Most AIX customers use raw hdisks, while raw logical volumes are primarily used in non-RAC environments. In an AIX environment, each of these storage elements is represented by a special file located in the /dev directory:
• raw hdisks as /dev/rhdisknn, or
• raw logical volumes as /dev/ASMDataLVnn
To properly present those devices to the Oracle ASM instance (the “asmca” configuration tool during or
after the initial ASM build), they must be owned by the Oracle user (chown oracle.dba /dev/rhdisknn) and
the associated file permission must be 660 (chmod 660 /dev/rhdisknn).
Raw hdisks cannot belong to any AIX volume group and should not have a Physical Volume Identifier
(PVID) defined. One or more raw logical volumes presented to the ASM instance could be created on the
hdisks belonging to the AIX volume group.
A PVID is an AIX storage element identifier that is used to track AIX Physical Volumes (PVs). It is
written on the disk (LUN) and registered in the AIX ODM database. The PVID is defined before the disk
(LUN) joins the volume group, and is used to preserve hdisk numbering across reboots and storage
reconfigurations. To display the PVIDs associated with hdisks, use the lspv command.
Figure 34 - AIX PVIDs
If an hdisk is not part of an AIX volume group, its PVID can be assigned or cleared using the chdev command:
# chdev -l hdiskn -a pv=yes      (assigns a PVID)
# chdev -l hdiskn -a pv=clear    (clears the PVID)
PVIDs are physically stored in the first 4k block of the hdisk, which happens to be where Oracle stores the ASM, OCR and/or Voting disk header. For ASM-managed disks, ASM maintains its own mapping between LUNs and ASM disk assignments, so preserving hdisk numbering isn't important. Some Oracle installation documentation recommends temporarily setting PVIDs during the install process (this is not the preferred method). Assigning or clearing a PVID on an existing ASM-managed disk will overwrite the ASM header, making the data unrecoverable without the use of KFED (see Metalink Note 353761.1).
AIX 5.3 TL07 (and later) includes a specific set of Oracle ASM related enhancements. The mkvg and extendvg commands now check for the presence of an ASM header before writing PVID information to an hdisk. The command will fail and return an error message if an ASM header signature is detected:
0516-1339 /usr/sbin/mkvg: Physical volume contains some 3rd party volume group.
0516-1397 /usr/sbin/mkvg: The physical volume hdisk3, will not be added to the volume group.
0516-862 /usr/sbin/mkvg: Unable to create volume group.
The force option (-f) will not work for an hdisk with an ASM header signature either.
If an hdisk formerly used by ASM needs to be repurposed, the ASM header area can be cleared using the AIX 'dd' command:
# dd if=/dev/zero of=/dev/rhdisk3 bs=4096 count=10
Note that the chdev utility with the pv=yes or pv=clear operation does not check for an ASM signature before setting or clearing the PVID area.
AIX 6.1 TL06 and AIX 7.1 introduced a rendev command that can be used for permanent renaming of the
AIX hdisks (Figure 35).
Figure 35 - AIX “rendev” command
Additionally, the lkdev command can be used to lock a device so that its attributes cannot be changed with the chdev command (Figure 36).
Figure 36 - AIX “lkdev” command
2.4. CPU Tuning
We consider CPU tuning an advanced tuning area. Most of the related activities are controlled using the AIX schedo command. Our recommendation is to leave all schedo parameters at their default values; changes should be applied only on a specific recommendation from IBM.
For that reason, we will focus primarily on the new features of the Power Systems technologies:
‘Simultaneous Multi-Threading (SMT)’, ‘Logical Partitioning’ and ‘Micro-Partitioning’. A basic overview of PowerVM key terms is provided in Figure 37. Additional information can be found in the IBM
Redbook "IBM PowerVM Virtualization Introduction and Configuration",
(http://www.redbooks.ibm.com/redpieces/abstracts/sg247940.html).
Figure 37 - PowerVM key terms
2.4.1. Simultaneous Multi-Threading (SMT)
Simultaneous MultiThreading (SMT) is a hardware design enhancement in POWER5 and later that allows
two (or four) separate instruction streams (threads) to execute simultaneously on the processor. AIX 5.3 or
later is also required to utilize this feature.
For most multi-threaded commercial workloads, SMT (vs. no-SMT) provides a significant increase in the
total workload throughput that is achievable given the same number of physical processors in a partition.
The use of SMT is generally recommended for all Oracle workloads. Some applications such as cache or
memory intensive High Performance Computing workloads that are tuned to optimize the use of processor
resources may actually see a decrease in performance due to increased contention for cache and memory.
Single threaded applications (running at relatively low system wide processor utilization) may also see
some response time degradation. SMT may be disabled for these cases.
Each hardware thread is supported as a separate logical processor by AIX 5.3, 6.1 and 7.1, independent of
the type of partition. When SMT-2 is enabled, a partition with one dedicated Power5(+) or Power6(+)
processor would have 2 logical processors, a shared partition with 2 virtual processors would have 4
logical processors, etc. All active threads running on the 2 logical processors, representing the 2 SMT
threads for a single dedicated or virtual processor, are always scheduled together by the Hypervisor onto a
dedicated or virtual processor.
An enhancement in the POWER7 processor is the addition of the SMT-4 mode to enable four instruction
threads to execute simultaneously in each POWER7 processor core (additionally, AIX 6.1 or 7.1 is also
required). Thus, the instruction thread execution modes of the POWER7 processor are as follows:
• SMT-1: single instruction execution thread per core
• SMT-2: two instruction execution threads per core
• SMT-4: four instruction execution threads per core
AIX versions supporting Power 7 include a new feature that controls the SMT mode for each CPU-core
individually. It is called “Intelligent SMT threading”. It monitors the number of processes occupying the
active SMT threads and changes the SMT mode based on the load requirement. The goal of the feature is
to optimize the thread use to achieve either the highest per-thread performance or maximum throughput,
depending on the workload requirement.
The SMT policy is controlled by the operating system, thus it is partition specific and can be modified for
a given partition using the smtctl command. The smtctl command can enable SMT, disable SMT, or change the number of SMT threads. It provides privileged users and applications with a means to adjust SMT for all processors in a partition, either immediately or on a subsequent boot of the system.
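Typical smtctl invocations are shown below as an illustration (the thread counts are examples; the -t flag is available only on AIX levels and hardware that support selecting the SMT mode):
# smtctl
# smtctl -t 4 -w boot
# smtctl -m off -w now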
2.4.2. Logical Partitioning
Logical partitioning on IBM Power Systems is the division of a computer's processors, memory, and
hardware resources into multiple environments so that each environment can be operated independently
with its own operating system and applications. The first release of LPAR support was static in nature,
that is, the reassignment of a resource from one LPAR to another LPAR cannot be made while AIX is
actively running. As of 2002, IBM Power Systems support dynamic logical partitioning that provides the
ability to logically attach and detach a managed system's resources to and from a logical partition's
operating system without rebooting.
Oracle 9i introduced dynamic memory management that, when combined with the DLPAR features,
partially simplified management of the Oracle instances. Oracle memory utilization primarily consists of
memory allocated by the SGA and PGA. The initial size of the SGA memory region is defined by utilizing
the variable SGA_MAX_SIZE. Once the database is started (and SGA initialized), certain memory areas
can be re-adjusted using the standard Oracle system commands. The management of the PGA area is
completely automatic as long as the initial size is defined by utilizing the variable
PGA_AGGREGATE_TARGET. The size of this memory area can also be changed dynamically. More
details can be found in Chapter 3.
AIX based DLPAR functionality, combined with the new dynamic memory management feature of Oracle
9i, 10g and 11g, provides greater flexibility in the system management of memory allocation.
DLPAR supports dynamic change of the number of CPUs in the system. The Oracle variable
CPU_COUNT reflects the number of the CPUs which are enabled in the LPAR. The value of the
CPU_COUNT variable directly impacts several different Oracle parameters. The most important are:
parallel_max_servers, fast_start_parallel_rollback, db_block_lru_latches, log_buffer etc… Unfortunately,
Oracle 9i is not capable of dynamically updating the value of the CPU_COUNT if the number of CPU(s)
in the LPAR was changed during a dynamic reconfiguration. However, the AIX scheduler can take
advantage of the additional processors. In the case of processor removal, as long as the processor being
removed is not bound by any process, it can be removed by the DLPAR operation.
Oracle 10g and 11g automatically update the value of the CPU_COUNT variable if the number of CPUs in the LPAR has changed since the database was started. In Oracle 11gR2 the value of CPU_COUNT can also be set manually (Oracle refers to this as “Database Instance Caging”); however, once set manually, CPU_COUNT no longer adjusts dynamically. The maximum CPU_COUNT value is limited to 3x the CPU_COUNT at instance startup.
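As a hedged sketch (the value 8 is arbitrary, and 11gR2 instance caging also requires an active Resource Manager plan), CPU_COUNT can be set manually from SQL*Plus:
$ sqlplus / as sysdba
SQL> ALTER SYSTEM SET cpu_count = 8 SCOPE=BOTH;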
2.4.3. Micro-Partitioning
Micro-Partitioning is a feature provided by PowerVM (previously known as Advanced Power Virtualization). It provides the ability to share physical processors between logical partitions. The amount of processor capacity that is allocated to a micro-partition (its Entitled Capacity) may range from ten percent (10%) of a physical processor up to the entire capacity of the shared processor pool. In POWER5-based System p servers, a single physical shared-processor pool is a set of physical processors that are not dedicated to any logical partition. In POWER6 and POWER7 based servers, the Multiple Shared-Processor Pools (MSPPs) capability allows a system administrator to create a set of micro-partitions with the purpose of controlling the processor capacity that can be consumed from the physical shared-processor pool.
The POWER Hypervisor, which manages this feature, is a firmware layer sitting between the hosted operating systems and the server hardware. It schedules the virtual CPUs (also called Virtual Processors, or VPs) on the physical processors. The shared processor pool can contain from one physical processor up to the total installed processor capacity of the system. CPUs can be added to the shared pool only in increments of one full CPU.
Micro-partitions can be designated as:
• Capped
A capped micro-partition has a defined processor entitled capacity which it is guaranteed to receive. There
are no circumstances under which a capped micro-partition will receive any more processor resource than
its entitled capacity.
• Uncapped
An uncapped micro-partition has a defined processor entitled capacity which it is guaranteed to receive.
However, under some circumstances it can receive additional processor resources. Typically, if the
uncapped micro-partition is runnable (has real work to execute) and there are unused processor cycles
within the physical shared processor pool, then additional cycles can be allocated to the micro-partition on
a weighted basis.
Capped and uncapped micro-partitions can coexist and both receive the processor resources from the
physical shared-processor pool.
Shared processor pools are rapidly becoming the de facto standard for Oracle on AIX implementations.
Micro-partitioning advantages not only include higher effective hardware utilization and better price
performance -- it can also improve overall Oracle performance (and potentially RAC cluster stability)
because of the Hypervisor's ability to dynamically move excess CPU capacity to those LPARs where CPU
demand exceeds the CPU entitlement. When dedicated processor LPARs are used, Oracle performance
will suffer when workload demand exceeds the number of dedicated processors.
Following are some general guidelines we suggest for customers considering running Oracle DB in the
Micro-Partitioning environment:
• SMT should be enabled.
• Virtual CPUs for a partition should not exceed the number of physical processors in the shared pool. The "Maximum VP" number can be higher so that VPs can be added dynamically as physical CPUs are added to the shared pool.
• Entitled capacity should be set to the minimum required to meet service level agreements,
based on the application workload sizing. Additionally, as the sum of all partitions entitlements
must be less than or equal to the size of the shared pool, over use of entitlement will reduce the
number of partitions that can be hosted.
• For capped partitions the number of Virtual CPUs should be set to the nearest integer >= the
capping limit. For example, if the capping limit is 3.4, the number of Virtual CPUs should be
4.
• For uncapped partitions, the number of Virtual CPUs should be set to the maximum anticipated
peak demand requirement for that partition. It may be appropriate to establish guidelines on the
maximum number of Virtual CPUs per CPU of entitled capacity (e.g. no more than 2 or 3
times). This guideline should be periodically reviewed and adjusted based on actual experience
as appropriate.
• Where possible, in order to get the highest possible hardware utilization, partitions defined within a single or multiple shared pool(s) should exhibit a mix of workload characteristics such as:
o Capacity requirements peak at different times (e.g. online vs. batch)
o Service level requirements vary (e.g. high priority production vs. low priority
development/test or response sensitive online vs. response insensitive batch)
• The shared processor pool should be sized so that peak shared pool utilization does not exceed 90%. Shared pool utilization should be monitored on an ongoing basis. lparstat -h shows, in the app column, the available physical processor resources in the shared pool (see the example after this list). Note: the app metric is only available if Pool Utilization Authority is enabled in the partition's configuration for processor resources via the Hardware Management Console.
• For batch only applications, a minimum of 2 concurrent batch threads should be run per Virtual
CPU (with SMT, there are 2/4 logical processors per virtual processor). Depending on the
application ratio of CPU to I/O wait time, more than 2 concurrent threads per Virtual CPU may
be required to utilize all the available CPU resources.
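For example, shared pool headroom (the app column) can be sampled with lparstat; the interval and count values are arbitrary:
# lparstat -h 10 6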
2.4.4. Virtual Processor Folding
Starting with AIX 5.3 ML3, AIX introduced virtual processor folding. It is a feature that enhances the
utilization of the shared processor pools by minimizing the use of idle VPs. This allows LPARs to be
configured with more VPs to take better advantage of the shared processor pool resources.
This allows idle virtual processors to sleep and to be awakened only when required to meet the workload demand. The physical CPU resources freed by folded VPs can then be redistributed on an as-needed basis to other virtual processors on client partitions in the shared processor pool. It also increases the average VP dispatch cycle onto a physical CPU. When virtual processors are deactivated, they are not
dynamically removed from the partition as with DLPAR. The virtual processor is no longer a candidate to
run on or receive unbound work. The number of online logical processors and online virtual processors
that are visible to the user or applications does not change. This results in better cache utilization due to
better dispatch affinity to physical resources and reduced workload in the Hypervisor, and allows Active Energy Management features in POWER6 and POWER7 systems to provide power savings. This feature
can be very useful for uncapped partitions because it allows the partition to be configured with a larger
number of VPs without significant performance issues.
The schedo vpm_xvcpus and vpm_fold_policy parameters can be used to modify or disable processor
folding behavior. Disabling processor folding can potentially degrade Oracle performance and should only
be done if it is warranted by specific workload conditions. As a general "best practice", we recommend
that these tuning parameters be left at their default values (vpm_xvcpus=0 and vpm_fold_policy=1). If the
environment exhibits sudden spikes in CPU utilization, it is recommended to set vpm_xvcpus to 1 or 2. This keeps more VPs unfolded, so spikes can be handled right away without the need to unfold a VP first. Such spikes happen frequently in OLTP environments.
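If such a change is warranted, it can be made persistently with schedo (the value 2 is only an example):
# schedo -p -o vpm_xvcpus=2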
Included below is an AIX nmon performance data sample showing the VP folding/unfolding process while the LPAR was under load.
Figure 38 - AIX VP Folding
2.5. Network Tuning
• UDP Tuning
Oracle 9i, 10g and 11g Real Application Clusters on AIX use the User Datagram Protocol (UDP) for interprocess communication. UDP kernel settings should be tuned to improve Oracle performance.
To do this, use the network options (no) command to change udp_sendspace and udp_recvspace
parameters.
o Set the value of the udp_sendspace parameter to [(DB_BLOCK_SIZE * DB_FILE_MULTIBLOCK_READ_COUNT) + 4096], but not less than 65536.
o Set the value of the udp_recvspace parameter to 10 * udp_sendspace.
o If necessary, increase the sb_max parameter (default value is 1048576), since sb_max must be >= udp_recvspace.
The value of the udp_recvspace parameter should be at least four to ten times the value of the
udp_sendspace parameter because UDP might not be able to send a packet to an application before
another packet arrives.
To determine the suitability of the udp_recvspace parameter settings, enter the following command:
o netstat -s | grep “socket buffer overflows”
If the number of overflows is not zero, increase the value of the udp_recvspace parameter.
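As a worked example (the block size and multiblock read count are hypothetical), with DB_BLOCK_SIZE=8k and DB_FILE_MULTIBLOCK_READ_COUNT=16, udp_sendspace = 8192 * 16 + 4096 = 135168 and udp_recvspace = 10 * 135168 = 1351680; because this exceeds the default sb_max, sb_max must be raised first:
# no -p -o sb_max=2097152
# no -p -o udp_sendspace=135168
# no -p -o udp_recvspace=1351680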
• TCP Tuning
There are several AIX tunable values that might impact TCP performance in the Oracle RDBMS
environment. Those are:
o tcp_recvspace - specifies how many bytes of data the receiving system can buffer in
the kernel on the receiving sockets queue. It is also used by the TCP protocol to set the
TCP window size, which TCP uses to limit how many bytes of data it will send to the
receiver to ensure that the receiver has enough space to buffer the data.
o tcp_sendspace - specifies how much data the sending application can buffer in the
kernel before the application is blocked on a send call.
o rfc1323 - enables the TCP window scaling option. The TCP window scaling option is a
TCP negotiated option, so it must be enabled on both endpoints of the TCP connection
to take effect.
o sb_max - sets an upper limit on the number of socket buffers queued to an individual
socket, which controls how much buffer space is consumed by buffers that are queued
to a sender’s socket or to a receiver’s socket.
To control the values of these tunables, the Interface-Specific Network Options (ISNO) should be used. It
allows IP network interfaces to be custom-tuned for best performance.
Values set for an individual interface take precedence over the system wide values set with the no
command. The feature is enabled (by default in AIX 5.3, 6.1 and 7.1) or disabled for the whole system with the no command's use_isno option.
The following figure provides a predefined set of tunable values for the most important network adapters/interfaces. These values are already optimized; if they have to be changed, it should be done only under the close supervision of IBM AIX Support.
Figure 39 - Suggested “no” tunable values
(1) It is suggested to use the default value of 1048576 for the sb_max tunable. The values shown in
the table are acceptable minimum values for the sb_max tunable.
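For reference (the interface name and buffer sizes are examples only, not the values from Figure 39), ISNO attributes can be set persistently per interface with chdev, or temporarily with ifconfig:
# chdev -l en0 -a tcp_sendspace=262144 -a tcp_recvspace=262144 -a rfc1323=1
# ifconfig en0 tcp_sendspace 262144 tcp_recvspace 262144 rfc1323 1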
3. Oracle Tuning
In this chapter, we cover the most important items that will make the Oracle database run well on the AIX platform. Beyond presenting a relatively small subset of the Oracle initialization parameters, we will not cover the details of internal Oracle tuning activities (index and optimizer related work, SQL explain plans, tracing and tuning, etc.).
If you are running an ISV packaged application, such as Oracle E-Business Suite or SAP R/3, the package
vendor may provide guidelines on basic Oracle setup and tuning. If so, please refer to the appropriate
vendor documentation first.
• Basic Oracle parameters
A limited number of Oracle parameters have a large impact on performance. Unless these are set correctly, Oracle cannot operate properly or give good performance, so they need to be checked before any further fine tuning. It is worth tuning the database further only once all of these top parameters are properly defined.
The parameters that make the biggest difference (and should, therefore, be investigated in this order) are
listed in the following sections.
• DB_BLOCK_SIZE
Specifies the default size of Oracle database blocks. This parameter cannot be changed after the database
has been created, so it is vital that the correct value is chosen at the beginning. Optimal
DB_BLOCK_SIZE values vary depending on the application. Typical values are 8KB for OLTP
workloads and 16KB to 32KB for DSS workloads. If you plan to use a 2KB DB_BLOCK_SIZE with JFS2
file systems, be sure to create the file system with agblksize=2048.
While most customers only use the default database block size, it is possible to use up to 5 different
database block sizes for different objects within the same database. Having multiple database block sizes
adds administrative complexity and (if poorly designed and implemented) can have adverse performance
consequences. Therefore, using multiple block sizes should only be done after careful planning and
performance evaluation.
Some possible block size considerations are as follows:
o Tables with a relatively small row size that are predominantly accessed 1 row at a time
may benefit from a smaller DB_BLOCK_SIZE, which requires a smaller I/O transfer
size to move a block between disk and memory, takes up less memory per block and
can potentially reduce block contention.
o Similarly, indexes (with small index entries) that are predominantly accessed via a
matching key may benefit from a smaller DB_BLOCK_SIZE.
o Tables with a large row size may benefit from a large DB_BLOCK_SIZE. A larger
DB_BLOCK_SIZE may allow the entire row to fit within a block and/or reduce the
amount of wasted space within the block.
o Tables or indexes that are accessed sequentially may benefit from a larger
DB_BLOCK_SIZE, because a larger block size results in a larger I/O transfer size and
allows data to be read more efficiently.
o Tables or indexes with a high locality of reference (the probability that once a particular
row/entry has been accessed, a nearby row/entry will subsequently be accessed) may
benefit from a larger DB_BLOCK_SIZE, since the larger the size of the block, the more
likely the nearby row/entry will be on the same block that was already read into
the database cache.
• DB_BLOCK_BUFFERS or DB_CACHE_SIZE
These parameters determine the size of the DB buffer cache used to cache database blocks of the default
size (DB_BLOCK_SIZE) in the SGA. If DB_BLOCK_BUFFERS is specified, the total size of the default
buffer cache is DB_BLOCK_BUFFERS x DB_BLOCK_SIZE. If DB_CACHE_SIZE is specified, it
indicates the total size of the default buffer cache.
If multiple different block sizes are used within a single database, the DB_nK_CACHE_SIZE parameter is
used to specify the size of additional DB cache areas used for blocks that are different from the default
DB_BLOCK_SIZE. For example, if DB_BLOCK_SIZE=8K and DB_16K_CACHE_SIZE=32M, this
indicates that a separate 32 Megabyte DB cache area will be created for use with 16K database blocks.
The primary purpose of the DB buffer cache area(s) is to cache frequently used data (or index) blocks in
memory in order to avoid or reduce physical I/Os to disk when satisfying client data requests. In general,
you want just enough DB buffer cache allocated to achieve the optimum buffer cache hit rate. Increasing
the size of the buffer cache beyond this point may actually degrade performance due to increased
overhead of managing the larger cache memory area. The "Buffer Pool Advisory" section of the Oracle
Statspack or Workload Repository (AWR) report provides statistics that may be useful in determining the
optimal DB buffer cache size.
• DISK_ASYNCH_IO
AIX fully supports Asynchronous I/O for file systems (JFS, JFS2, GPFS and Veritas) as well as raw
devices. Many sites do not know this fact and fail to use this feature. This parameter should always be set
to TRUE (the default value). In addition, the AIX Asynchronous I/O feature must be enabled and properly
tuned (in AIX 6.1 and AIX 7.1 enabled by default).
• FILESYSTEMIO_OPTIONS
FILESYSTEMIO_OPTIONS can be set to one of the following values:
o ASYNCH: use Asynchronous I/O with file system cache
o DIRECTIO: use Synchronous I/O with Direct I/O
o SETALL: use Asynchronous I/O with Direct I/O
o NONE: use Synchronous I/O with Direct I/O (JFS) or Concurrent I/O (JFS2)
To combine this parameter with the AIX mount options (DIO & CIO), follow these recommendations:
o to enable file system caching for JFS/JFS2 (tends to benefit heavily sequential workloads with low write content), use the default file system mount options and set Oracle FILESYSTEMIO_OPTIONS=ASYNCH
o to disable file system caching for JFS/JFS2 (CIO and DIO both benefit heavy random access workloads, and CIO tends to benefit heavily update workloads), follow the combinations shown in the tables below:
Oracle 9i
filesystemio_options | JFS (default mount)      | JFS (dio mount)   | JFS2 (default mount)     | JFS2 (cio mount)
asynch               | asynchronous, cached I/O | asynchronous, DIO | asynchronous, cached I/O | asynchronous, CIO
setall               | asynchronous, cached I/O | asynchronous, DIO | asynchronous, cached I/O | asynchronous, CIO
directio(*)          | synchronous, cached I/O  | synchronous, DIO  | synchronous, cached I/O  | synchronous, CIO
none(*)              | synchronous, cached I/O  | synchronous, DIO  | synchronous, cached I/O  | synchronous, CIO

Oracle 10g, 11g
filesystemio_options | JFS (default mount)      | JFS (dio mount)   | JFS2 (default mount)     | JFS2 (cio mount)
asynch               | asynchronous, cached I/O | asynchronous, DIO | asynchronous, cached I/O | asynchronous, CIO
setall               | asynchronous, DIO        | asynchronous, DIO | asynchronous, CIO        | asynchronous, CIO
directio(*)          | synchronous, DIO         | synchronous, DIO  | synchronous, CIO         | synchronous, CIO
none(*)              | synchronous, cached I/O  | synchronous, DIO  | synchronous, cached I/O  | synchronous, CIO
(*) - Since the DIRECTIO and NONE options disable the use of Asynchronous I/O, they should not be used.
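For instance, SETALL can be enabled as follows; the parameter is static, so the change is made in the spfile and requires an instance restart:
$ sqlplus / as sysdba
SQL> ALTER SYSTEM SET filesystemio_options = SETALL SCOPE=SPFILE;
SQL> SHUTDOWN IMMEDIATE
SQL> STARTUP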
• DB_WRITER_PROCESSES and DBWR_IO_SLAVES
These parameters specify how many database writer processes are used to update the database disks when
disk block buffers in database buffer cache are modified. Multiple database writers are often used to get
around the lack of Asynchronous I/O capabilities in some operating systems, although it still works with
operating systems that fully support Asynchronous I/O, such as AIX.
Normally, the default values for these parameters are acceptable and should only be overridden in order to
address very specific performance issues and/or at the recommendation of Oracle Support.
• SHARED_POOL_SIZE
An appropriate value for the SHARED_POOL_SIZE parameter is very hard to determine before statistics
are gathered about the actual use of the shared pool. The good news is that starting with Oracle 9i, it is
dynamic, and the upper limit of shared pool size is controlled by the SGA_MAX_SIZE parameter. So, if
you set the SHARED_POOL_SIZE to an initial value and you determine later that this value is too low,
you can change it to a higher one, up to the limit of SGA_MAX_SIZE. Remember that the shared pool
includes the data dictionary cache (the tables about the tables and indexes), the library cache (the SQL
statements and execution plans), and also the session data if the shared server is used. Thus, it is not
difficult to run out of space. Its size can vary from a few MB to very large, such as 20 GB or more,
depending on the applications’ use of SQL statements. Many application vendors will give guidelines on
the minimum size. It depends mostly on the number of tables in the database (the data dictionary will be larger when there are many tables) and on the number of different SQL statements that are active or used regularly.
• SGA_MAX_SIZE
Starting with Oracle 9i, the Oracle SGA size can be dynamically changed. It means the DBA just needs to
set the maximum amount of memory available to Oracle (SGA_MAX_SIZE) and the initial values of the
different pools: DB_CACHE_SIZE, SHARED_POOL_SIZE, LARGE_POOL_SIZE etc… The size of
these individual pools can then be increased or decreased dynamically using the ALTER SYSTEM
command, provided the total amount of memory used by the pools does not exceed SGA_MAX_SIZE. If
LOCK_SGA = TRUE, this parameter defines the amount of memory Oracle allocates at DB startup in one piece, and SGA_TARGET is ignored for the purpose of memory allocation in this case.
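A short sketch (the sizes are arbitrary examples): SGA_MAX_SIZE is static and takes effect at the next startup, while the individual pools can be resized online within that ceiling:
$ sqlplus / as sysdba
SQL> ALTER SYSTEM SET sga_max_size = 16G SCOPE=SPFILE;
SQL> ALTER SYSTEM SET db_cache_size = 8G SCOPE=BOTH;
SQL> ALTER SYSTEM SET shared_pool_size = 2G SCOPE=BOTH;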
• SGA_TARGET
SGA_TARGET specifies the total size of all SGA components. If SGA_TARGET is specified, then the following memory pools are automatically sized:
• Buffer cache (DB_CACHE_SIZE)
• Shared pool (SHARED_POOL_SIZE)
• Large pool (LARGE_POOL_SIZE)
• Java pool (JAVA_POOL_SIZE)
• Streams pool (STREAMS_POOL_SIZE)
If these automatically tuned memory pools are set to non-zero values, then those values are used as
minimum levels by Automatic Shared Memory Management. You would set minimum values if an
application component needs a minimum amount of memory to function properly.
The following pools are manually sized components and are not affected by Automatic Shared Memory
Management:
• Log buffer
• Other buffer caches, such as KEEP, RECYCLE, and other block sizes
• Fixed SGA and other internal allocations
The memory allocated to these pools is deducted from the total available for SGA_TARGET when
Automatic Shared Memory Management computes the values of the automatically tuned memory pools.
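For example (the values are illustrative), Automatic Shared Memory Management can be enabled with a minimum floor for the shared pool:
SQL> ALTER SYSTEM SET sga_target = 12G SCOPE=BOTH;
SQL> ALTER SYSTEM SET shared_pool_size = 2G SCOPE=BOTH;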
• MEMORY_TARGET, MEMORY_MAX_TARGET (11g)
MEMORY_TARGET specifies the Oracle system-wide usable memory. The database tunes memory to the
MEMORY_TARGET value, reducing or enlarging the SGA and PGA as needed. In a text-based
initialization parameter file, if you omit MEMORY_MAX_TARGET and include a value for
MEMORY_TARGET, then the database automatically sets MEMORY_MAX_TARGET to the value of
MEMORY_TARGET. If you omit the line for MEMORY_TARGET and include a value for
MEMORY_MAX_TARGET, the MEMORY_TARGET parameter defaults to zero. After startup, you can then
dynamically change MEMORY_TARGET to a nonzero value, provided that it does not exceed the value of
MEMORY_MAX_TARGET. If the MEMORY_TARGET parameter is used, SGA memory cannot be pinned. It is not recommended to use the MEMORY_TARGET parameter together with SGA_MAX_SIZE and SGA_TARGET.
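A minimal sketch (the sizes are examples): MEMORY_MAX_TARGET is static and requires a restart, after which MEMORY_TARGET can be adjusted dynamically up to that ceiling:
SQL> ALTER SYSTEM SET memory_max_target = 16G SCOPE=SPFILE;
SQL> ALTER SYSTEM SET memory_target = 12G SCOPE=SPFILE;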
• PGA_AGGREGATE_TARGET
PGA_AGGREGATE_TARGET specifies the target aggregate PGA memory available to all server processes
attached to the instance. Setting PGA_AGGREGATE_TARGET to a nonzero value has the effect of
automatically setting the WORKAREA_SIZE_POLICY parameter to AUTO. This means that SQL working
areas used by memory-intensive SQL operators (such as sort, group-by, hash-join, bitmap merge, and
bitmap create) will be automatically sized. A nonzero value for this parameter is the default since, unless
you specify otherwise, Oracle sets it to 20% of the SGA or 10 MB, whichever is greater.
In OLTP environments, sorting is not common or does not involve large numbers of rows. In Batch and
DSS workloads, this is a major task in the system, and larger in-memory sort areas are needed. Unlike the
other parameters, this space is allocated in the PGA.
The "PGA Memory Advisory" section of the Oracle Statspack or AWR report provides statistics that may be useful in determining the optimal PGA_AGGREGATE_TARGET value.
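As a worked example (the sizes are illustrative), with SGA_TARGET=12G the default would be max(20% * 12G, 10 MB), i.e. roughly 2.4G; a batch/DSS-heavy workload might set it explicitly higher:
SQL> ALTER SYSTEM SET pga_aggregate_target = 4G SCOPE=BOTH;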
Appendix: Related Publications
The publications listed in this section are considered particularly suitable for a more detailed discussion of
the topics covered in this white paper.
1. Publications
• "Diagnosing Oracle Database Performance on AIX Using IBM NMON and Oracle Statspack Reports", Dale Martin (IBM)
  http://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/WP101720
• "Configuring IBM TotalStorage for Oracle OLTP Applications", Dale Martin (IBM)
  http://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/WP100319
• "Oracle 11g ASM on AIX with Oracle Real Applications Clusters (RAC)", Rebecca Ballough (IBM)
  http://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/WP102169
• "Breaking the Oracle I/O Performance Bottleneck", Dale Martin (IBM)
  http://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/PRS3885
• "Oracle Database and 1TB Segment Aliasing", Ralf Schmidt-Dannert (IBM)
  http://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/TD105761
• "Tuning SAP R/3 with Oracle on pSeries", Mark Gordon (IBM)
  http://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/WP100377
• "AIX 5.3 Performance Management Guide", IBM
• "AIX 6.1 Performance Management Guide", IBM
• "AIX 7.1 Performance Management Guide", IBM
• "IBM AIX Version 6.1 Difference Guide", IBM Redbooks
  http://w3.itso.ibm.com/abstracts/sg247559.html?Open
• "IBM AIX Version 7.1 Difference Guide", IBM Redbooks
  http://w3.itso.ibm.com/abstracts/sg247910.html?Open
• "Improving DB performance with AIX Concurrent I/O", Kashyap/Olszewski/Hendrickson (IBM)
• "PowerVM Virtualization on IBM System p: Introduction and Configuration", IBM Redbooks
  http://www.redbooks.ibm.com/redpieces/abstracts/sg247940.html
• "PowerVM Virtualization on IBM System p: Managing and Monitoring", IBM Redbooks
  http://www.redbooks.ibm.com/redpieces/abstracts/sg247590.html
2. Online Resources
• IBM System p and AIX Information Center
  http://publib.boulder.ibm.com/infocenter/pseries/v5r3/index.jsp (AIX 5.3)
  http://publib.boulder.ibm.com/infocenter/systems/scope/aix/index.jsp (AIX 6.1)
  http://publib.boulder.ibm.com/infocenter/aix/v7r1/index.jsp (AIX 7.1)
• `My Oracle` (Metalink) Support web site:
  https://support.oracle.com
• Oracle 9i Documentation website:
  http://www.oracle.com/technology/documentation/oracle9i.html
• Oracle 10g R2 Documentation website:
  http://www.oracle.com/pls/db102/homepage
• Oracle 11g R2 Documentation website:
  http://www.oracle.com/pls/db112/homepage