
AIX Tuning for Oracle - ver Apr2009

IBM Advanced Technical Support - Americas
AIX Performance:
Configuration & Tuning for Oracle
Vijay Adik
vadik@us.ibm.com
ATS - Oracle Solutions Team
April 17, 2009
© 2009 IBM Corporation
IBM Advanced Technical Support - Americas
Legal information
The information in this presentation is provided by IBM on an "AS IS"
basis without any warranty, guarantee or assurance of any kind. IBM also
does not provide any warranty, guarantee or assurance that the
information in this paper is free from errors or omissions. Information is
believed to be accurate as of the date of publication. You should check
with the appropriate vendor to obtain current product information.
Any proposed use of claims in this presentation outside of the United
States must be reviewed by local IBM country counsel prior to such use.
IBM, RS/6000, System p, AIX, AIX 5L, GPFS, and Enterprise Storage
Server (ESS) are trademarks or registered trademarks of the
International Business Machines Corporation.
Oracle, Oracle9i and Oracle10g are trademarks or registered trademarks
of Oracle Corporation.
All other products or company names are used for identification
purposes only, and may be trademarks of their respective owners.
2
© 2009 IBM Corporation
April 17, 2009
IBM Advanced Technical Support - Americas
Agenda
AIX Configuration Best Practices for Oracle
– Memory
– CPU
– I/O
– Network
– Miscellaneous
3
© 2009 IBM Corporation
April 17, 2009
IBM Advanced Technical Support - Americas
AIX Configuration Best Practices for Oracle
The suggestions presented here are considered to
be basic configuration “starting points” for
general Oracle workloads
Your workloads may vary
Ongoing performance monitoring and tuning is
recommended to ensure that the configuration is
optimal for the particular workload characteristics
4
© 2009 IBM Corporation
April 17, 2009
IBM Advanced Technical Support - Americas
Performance Overview – Tuning Methodology
Iterative Tuning Process
• Understand the external view of system performance
The external view of system performance is the
observable event that is causing someone to say
the system is performing poorly – typically (1)
end-user response time, (2) application (or task)
response time, or (3) throughput. Do not use
internal system metrics alone to judge improvement.
• Performance only improves when the predominant
bottleneck is fixed
Fixing a secondary bottleneck will not improve
performance and typically results in overloading an
already overloaded predominant bottleneck.
• Monitor performance after a change – tuning is an
iterative process
Monitoring is required after making a change for
two reasons: (1) fixing the predominant bottleneck
typically uncovers another bottleneck, and (2) not
all changes yield a positive result. If possible, you
should have a repeatable test so that each change can
be accurately evaluated.
[Diagram: iterative tuning loop – stress the system (i.e., tune at peak workload) → monitor the sub-systems (CPU, Memory, Network, I/O) → identify the predominant bottleneck → tune the bottleneck → repeat]
• End-user response time is the elapsed time between when a user submits a request and receives a response.
• Application response time is the elapsed time required for one or more jobs to complete. Historically, these jobs have been called batch jobs.
• Throughput is the amount of work that can be accomplished per unit time. This metric is typically expressed in terms of transactions per minute.
5
© 2009 IBM Corporation
April 17, 2009
IBM Advanced Technical Support - Americas
Performance Monitoring and Tuning Tools
Monitor commands
– CPU: vmstat, topas, iostat, ps, mpstat, lparstat, sar, time/timex, emstat/alstat
– Memory: vmstat, topas, ps, lsps, ipcs
– I/O subsystem: vmstat, topas, iostat, lvmstat, lsps, lsattr/lsdev, lspv/lsvg/lslv
– Network: netstat, topas, atmstat, entstat, tokstat, fddistat, nfsstat, ifconfig
– Processes & threads: ps, pstat, topas, emstat/alstat
Status commands
– CPU: netpmon
– Memory: svmon, netpmon, filemon
– I/O subsystem: fileplace, filemon
– Network: netpmon, tcpdump
– Processes & threads: svmon, truss, kdb, dbx, gprof, fuser, prof
Trace-level commands
– CPU: tprof, curt, splat, trace, trcrpt
– Memory: trace, trcrpt
– I/O subsystem: trace, trcrpt
– Network: iptrace, ipreport, trace, trcrpt
– Processes & threads: truss, pprof, curt, splat, trace, trcrpt
Tuning tools
– CPU: schedo, fdpr, bindprocessor, bindintcpu
– Memory: vmo, rmss, fdpr, chps/mkps
– I/O subsystem: ioo, lvmo, chdev, migratepv, chlv, reorgvg
– Network: no, chdev, ifconfig, nfso
– Processes & threads: nice/renice, setpri
6
© 2009 IBM Corporation
April 17, 2009
IBM Advanced Technical Support - Americas
Agenda
AIX Configuration Best Practices for Oracle
– Memory
– CPU
– I/O
– Network
– Miscellaneous
7
© 2009 IBM Corporation
April 17, 2009
Advanced Technical Support – System p
AIX Memory Management Overview
The role of Virtual Memory Manager (VMM) is to provide the capability for programs to address
more memory locations than are actually available in physical memory.
On AIX this is accomplished using segments that are partitioned into fixed sizes called “pages”.
– A segment is 256M
– default page size 4K
– POWER 4+ and POWER5 can define large pages, which are 16M
The 32-bit or 64-bit effective address translates into a 52-bit or 80-bit virtual address
– 32-bit system: a 4-bit field of the address selects a segment register containing a 24-bit segment id, plus a 28-bit offset
• 24-bit segment id + 28-bit offset = 52-bit VA
– 64-bit system: a 36-bit field of the address selects a segment register containing a 52-bit segment id, plus a 28-bit offset
• 52-bit segment id + 28-bit offset = 80-bit VA
The VMM maintains a list of free frames that can be used to retrieve pages that need to be
brought into memory.
– The VMM replenishes the free list by removing some of the current pages from real memory (i.e., steal
memory).
– The process of moving data between memory and disk is called “paging”.
The VMM uses a Page Replacement Algorithm (implemented in the lrud kernel threads) to select
pages that will be removed from memory.
8
© 2009 IBM Corporation
April 17, 2009
Advanced Technical Support – System p
Virtual Memory Space – 64 Bits
– A 64-bit address: 36 bits select the segment register, 28 bits are the offset within the segment
– Each segment register contains a 52-bit segment ID
– 52-bit segment ID + 28-bit offset = 80-bit virtual address
– Each segment is 2^28 = 256 MB and is divided into 4096-byte chunks called pages (a maximum of 65,536 pages per segment)
– Segment ID 0 is the kernel segment (kernel heap, page space disk map)
– Total virtual memory: about 1 trillion terabytes (1 yottabyte)
9
© 2009 IBM Corporation
April 17, 2009
Advanced Technical Support – System p
Memory Tuning Overview
vmo -p -o <parameter name>=<new value>
(the -p flag also updates /etc/tunables/nextboot)
Memory tunables by category:
– Virtual Memory (general): minfree, maxfree, lru_file_repage, lru_poll_interval
– JFS: maxperm, strict_maxperm
– Enhanced JFS (JFS2): maxclient, strict_maxclient
– Large Pages (pinned memory): v_pinshm, lgpg_regions, lgpg_size
NAME               CUR   DEF   BOOT  MIN  MAX    UNIT          TYPE
--------------------------------------------------------------------
lru_file_repage    1     1     1     0    1      boolean       D
lru_poll_interval  0     0     0     0    60000  milliseconds  D
maxclient%         80    80    80    1    100    % memory      D
maxfree            1088  1088  1088  8    200K   4KB pages     D
maxperm%           80    80    80    1    100    % memory      D
minfree            960   960   960   8    200K   4KB pages     D
strict_maxclient   1     1     1     0    1      boolean       D
strict_maxperm     0     0     0     0    1      boolean       D
minperm%           20    20    20    1    100    % memory      D
10
© 2009 IBM Corporation
April 17, 2009
IBM Advanced Technical Support - Americas
Virtual Memory Manager (VMM) Tuning
The AIX “vmo” command provides for the display and/or
update of several parameters which influence the way AIX
manages physical memory
– The "-a" option displays current parameter settings
vmo -a
– The "-o" option is used to change parameter values
vmo -o minfree=1440
– The "-p" option is used to make changes persist across a reboot
vmo -p -o minfree=1440
On AIX 5.3, a number of the default "vmo" settings are not optimized for
database workloads and should be modified for Oracle environments
11
© 2009 IBM Corporation
April 17, 2009
IBM Advanced Technical Support - Americas
Kernel Parameter Tuning – AIX 6.1
AIX 6.1 is configured by default to be 'correct' for most
workloads.
Many tunables are classified as 'Restricted':
– Only change them if AIX Support says so
– Restricted parameters will not be displayed unless the '-F' option is used
with commands like vmo, no, ioo, etc.
When migrating from AIX 5.3 to 6.1, parameter override
settings from AIX 5.3 will be carried over into the AIX 6.1 environment
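For example, a minimal sketch (the -F flag is the one described above; the tunable name is only an illustration):
# list all tunables on AIX 6.1, including restricted ones
vmo -F -a
# check a specific restricted tunable
vmo -F -a | grep lru_file_repage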
12
© 2009 IBM Corporation
April 17, 2009
IBM Advanced Technical Support - Americas
General Memory Tuning
Two primary categories of memory pages: Computational and File System
AIX will always try to utilize all of the physical memory available (subject to
vmo parameter settings)
– What is not required to support current computational page demand will tend to be used for
filesystem cache
– Raw Devices and filesystems mounted (or individual files opened) in DIO/CIO mode do not
use filesystem cache
[Chart: Memory Use, 9/20/2007 – Process% and FScache% plotted over time (08:02 to 10:14), y-axis 0–100% of memory]
13
© 2009 IBM Corporation
April 17, 2009
IBM Advanced Technical Support - Americas
VMM Tuning
Definitions:
LRUD= VMM page stealing process (LRU Daemon) – 1 per Memory Pool
numperm, numclient = used fs buffer cache pages, seen in ‘vmstat –v’
minperm = target minimum number of pages for fs buffer cache
maxperm, maxclient = target maximum number of pages for fs buffer cache
Parameters:
MINPERM% = target min % real memory for fs buffer cache
MAXPERM%, MAXCLIENT% = target max % real memory for fs buffer cache
MINFREE = target minimum number of free memory pages
MAXFREE = target maximum number of free memory pages
When does LRUD start?
When total free pages (in a given memory pool) < MINFREE
When (maxclient pages - numclient) < MINFREE
When does LRUD stop?
When total free pages > MAXFREE
When (maxclient pages – numclient) > MAXFREE
14
© 2009 IBM Corporation
April 17, 2009
IBM Advanced Technical Support - Americas
VMM Tuning starting points (AIX 5.2ML4+ or later)
LRU_FILE_REPAGE=0   (default for 5.3 and 6.1)
tells LRUD to page out file pages (filesystem buffer cache) rather than
computational pages when numperm > minperm
LRU_POLL_INTERVAL=10   (default for 5.3 and 6.1)
indicates the time period (in milliseconds) after which LRUD pauses so that
interrupts can be serviced. A value of "0" means no preemption.
MINPERM%=3   (default for 6.1)
MAXPERM%, MAXCLIENT%=90*   (default for 6.1)
STRICT_MAXPERM=0*   (default for 5.3 and 6.1)
STRICT_MAXCLIENT=1   (default for 5.3 and 6.1)
* In AIX 5.2 environments with large physical memory, set MAXPERM%, MAXCLIENT% = (2400 / Phys
Memory (GB)) and STRICT_MAXPERM=1
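As a sketch only, the starting points listed above could be applied persistently with vmo (values as listed; on AIX 6.1 most of these are already the defaults and several are restricted tunables):
# VMM starting points for Oracle on AIX 5.3
vmo -p -o lru_file_repage=0 -o lru_poll_interval=10
vmo -p -o minperm%=3 -o maxperm%=90 -o maxclient%=90
vmo -p -o strict_maxperm=0 -o strict_maxclient=1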
15
© 2009 IBM Corporation
April 17, 2009
IBM Advanced Technical Support - Americas
Virtual Memory Management (VMM) Thresholds
Start stealing pages when free memory falls below minfree; stop stealing when free memory rises above maxfree
When numperm% > maxperm%: steal only file system pages
When minperm% < numperm% < maxperm%: steal file system or computational pages, depending on repage rate
When numperm% < minperm%: steal both file system and computational pages
[Chart: physical memory (0–100%) over time, plotting comp%, numperm% and Free% against the minperm%, maxperm%, minfree and maxfree thresholds]
18
© 2009 IBM Corporation
April 17, 2009
IBM Advanced Technical Support - Americas
VMM Page Stealing Thresholds
Minfree/maxfree values are per memory pool in AIX 5.3 and 6.1
– Total system minfree = minfree * # of memory pools
– Total system maxfree = maxfree * # of memory pools
Most workloads do not require minfree/maxfree changes in AIX 5.3 or 6.1
minfree
AIX 5.3/6.1: minfree = max(960, (120 x # logical CPUs) / # mem pools)
AIX 5.2:     minfree = 120 x # logical CPUs
Consider increasing it if the vmstat "fre" column frequently approaches zero or if "vmstat -s" shows significant "free
frame waits"
maxfree
AIX 5.3/6.1: maxfree = max(1088, minfree + (MAX(maxpgahead, j2_maxPageReadAhead) x # logical CPUs) / # mem pools)
AIX 5.2:     maxfree = minfree + (MAX(maxpgahead, j2_maxPageReadAhead) x # logical CPUs)
Example: for a 6-way LPAR with SMT enabled (12 logical CPUs), maxpgahead=8 and j2_maxPageReadAhead=8:
– minfree = 1440 = 120 x 6 x 2 (i.e. 360 per pool with 4 memory pools)
– maxfree = 1536 = 1440 + (max(8,8) x 6 x 2)
vmo -o minfree=1440 -o maxfree=1536 -p
19
© 2009 IBM Corporation
April 17, 2009
Advanced Technical Support – System p
Page Steal Method
Historically, AIX maintained a single LRU list which contains
both computational and filesystem pages.
– In environments with lots of computational pages that you want to
keep in memory, LRUD may have to spend a lot of time scanning the
LRU list to find an eligible filesystem page to steal
AIX 6.1 introduced the ability to maintain separate LRU lists
for computational vs. filesystem pages.
– Also backported to AIX 5.3
New page_steal_method parameter
– Enabled (1) by default in 6.1, disabled (0) by default in 5.3
– Requires a reboot to change
– Recommended for Oracle DB environments (both AIX 5.3 and 6.1)
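A minimal sketch of enabling it on AIX 5.3 (on 6.1 it is already the default; as noted above, the change only takes effect after a reboot):
# use separate LRU lists for computational and file pages
vmo -r -o page_steal_method=1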
21
© 2009 IBM Corporation
April 17, 2009
Advanced Technical Support – System p
Understanding Memory Pools
Memory cards are associated with every Multi-Chip Module (MCM), Dual Core Module (DCM) or Quad Core Module (QCM) in the server
– The Hypervisor assigns physical CPUs to a dedicated-CPU LPAR (or shared processor pool) from one or more MCMs, DCMs or QCMs
– For a given LPAR, there will normally be at least 1 memory pool for each MCM, DCM or QCM that has contributed processors to that LPAR or shared processor pool
By default, memory for a process is allocated from memory associated with the processor that caused the page fault.
Memory pool configuration is influenced by the vmo parameter "memory_affinity"
– memory_affinity=1 means configure memory pools based on the physical hardware configuration (DEFAULT)
– memory_affinity=0 means configure roughly uniform memory pools from any physical location
[Diagram: p590 / p595 MCM architecture]
The number of memory pools can be seen with 'vmstat -v | grep pools'; pool sizes can only be seen using KDB
LRUD operates per memory pool
22
© 2009 IBM Corporation
April 17, 2009
Advanced Technical Support – System p
Memory Affinity…
Not generally a benefit unless processes are bound to a particular processor
It can exacerbate any page replacement algorithm issues (e.g. system paging or excessive LRUD
scanning activity) if memory pool sizes are unbalanced
If there are paging or LRUD related issues, try basic vmo parameter or Oracle SGA/PGA tuning first
If issues remain, use 'kdb' to check whether the memory pool sizes are unbalanced (NB_PAGES = pages in the pool, NUMFRB = free pages):

KDB(1)> memp *
                  VMP  MEMP  NB_PAGES  FRAMESETS  NUMFRB
memp_frs+010000   00   000   00B1F9F4  000 001    00B073DE
memp_frs+010780   00   003   00001BBC  006 007    00000000
memp_frs+010280   01   001   00221C80  002 003    0021C3CB
memp_frs+010500   02   002   00221C80  004 005    0021CDDE
If the pool sizes are not balanced, consider disabling Memory Affinity:
# vmo -r -o memory_affinity=0
(requires a reboot)
– IY73792 required for 5300-01 and 5300-02
– Code changes in 5.3 TL5/TL6 solved most memory affinity issues
– Memory_affinity is also a “Restricted” tunable in AIX 6.1
23
© 2009 IBM Corporation
April 17, 2009
IBM Advanced Technical Support - Americas
AIX 5.3/6.1 Multiple Page Size Support
AIX 5.3 5300-04 introduces two new page sizes (64K and 16G), in addition to 4K and 16M:
– 64K
– 16M (large pages) (available since POWER4+)
– 16G (huge pages)
• Requires p5+ hardware
• Requires p5 System Release 240, Service Level 202 microcode
• 16GB support requires Version 5 Release 2 of the Hardware Management Console (HMC) machine code
The user/application must request its preferred page size
– The 64K page size is very promising, since 64K pages do not need to be configured/reserved in advance or pinned
• export LDR_CNTRL=DATAPSIZE=64K@TEXTPSIZE=64K@STACKPSIZE=64K before starting oracle* to use the 64K page size for stack, data & text
• Will require Oracle to explicitly request the page size (10.2.0.4 & up)
– If the preferred size is not available, the largest available smaller size will be used
– Current Oracle versions will end up using 64KB pages even if the SGA is not pinned
Refer to: http://www-03.ibm.com/systems/resources/systems_p_os_aix_whitepapers_multiple_page.pdf
24
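As a quick check, the page sizes supported by the LPAR and those actually used by a process can be displayed with standard AIX commands (the PID below is hypothetical):
# list the page sizes supported on this system (e.g. 4K, 64K, 16M, 16G)
pagesize -a
# show which page sizes a given Oracle process is using
svmon -P 123456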
© 2009 IBM Corporation
April 17, 2009
IBM Advanced Technical Support - Americas
Large Page Support (optional)
Pinning shared memory
AIX parameters:
• vmo -p -o v_pinshm=1
• Leave maxpin% at the default of 80% unless the SGA exceeds 77% of real memory
– vmo -p -o maxpin%=[(SGA size * 100 / total mem) + 3]
Oracle parameters:
• LOCK_SGA = TRUE
Enabling Large Page Support
vmo -r -o lgpg_size=16777216 -o lgpg_regions=(SGA size / 16 MB)
Allowing Oracle to use Large Pages
chuser capabilities=CAP_BYPASS_RAC_VMM,CAP_PROPAGATE oracle
Monitoring tools
svmon -G
svmon -P
Oracle Metalink note# 372157.1
Note: Pinning the SGA is not recommended as long as the VMM, SGA and PGA have been configured properly.
25
© 2009 IBM Corporation
April 17, 2009
IBM Advanced Technical Support - Americas
Determining SGA size
SGA Memory Summary for DB: test01  Instance: test01  Snaps: 1046-1047

SGA regions                      Size in Bytes
------------------------------   ----------------
Database Buffers                   16,928,210,944
Fixed Size                                768,448
Redo Buffers                            2,371,584
Variable Size                       1,241,513,984
                                 ----------------
sum                                18,172,864,960

lgpg_regions = 18,172,864,960 / 16,777,216 = 1084 (rounded up)
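The same rounding can be scripted; a minimal sketch, with the SGA size taken from the summary above (the variable name is only an illustration):
# SGA size in bytes
SGA_BYTES=18172864960
# lgpg_regions = SGA size / 16 MB, rounded up
echo $(( (SGA_BYTES + 16777215) / 16777216 ))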
26
© 2009 IBM Corporation
April 17, 2009
IBM Advanced Technical Support - Americas
AIX Paging Space
Allocate Paging Space:
Configure Server/LPAR with enough physical memory to satisfy memory requirements
With AIX demand paging, paging space does not have to be large
Provides a safety net to prevent system crashes when memory is overcommitted
Generally, keep paging space on internal drives or high-performing SAN storage
Monitor paging activity:
vmstat -s
sar -r
nmon
Resolve paging issues:
Reduce file system cache size (MAXPERM, MAXCLIENT)
Reduce Oracle SGA or PGA (9i or later) size
Add physical memory
Do not over commit real memory!
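For example, a quick monitoring sketch using the commands above plus lsps:
# paging space size and current utilization
lsps -a
# cumulative paging-space page-ins/outs since boot
vmstat -s | grep -i "paging space"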
27
© 2009 IBM Corporation
April 17, 2009
Advanced Technical Support – System p
Tuning and Improving System Performance
Adjust the VMM Tuning Parameters
– Key parameters listed on word document
Implement VMM related Mount Options
– DIO / CIO
– Release behind or read and/or write
Reduce Application Memory Requirements
Memory Model
– %Computational < 70% - Large Memory Model – Goal is to adjust tuning
parameters to prevent paging
• Multiple Memory pools
• Page Space smaller than Memory
• Must Tune VMM key parameters
– %Computational > 70% - Small Memory Model – Goal is to make paging as efficient
as possible
• Add multiple page spaces on different spindles
• Make all page spaces the same size to ensure round-robin scheduling
• PS = 1.5 x computational requirements
• Turn off DEFPS (deferred page space allocation)
• Memory Load Control
Add additional memory
28
© 2009 IBM Corporation
April 17, 2009
IBM Advanced Technical Support - Americas
Agenda
AIX Configuration Best Practices for Oracle
– Memory
– CPU
– I/O
– Network
– Miscellaneous
29
© 2009 IBM Corporation
April 17, 2009
Advanced Technical Support – System p
CPU Considerations
Oracle Parameters based on the # of CPUs
– DB_WRITER_PROCESSES
– Degree of Parallelism
• user level
• table level
• query level
– PARALLEL_MAX_SERVERS or PARALLEL_AUTOMATIC_TUNING (CPU_COUNT *
PARALLEL_THREADS_PER_CPU)
– CPU_COUNT
– FAST_START_PARALLEL_ROLLBACK – should be using UNDO
instead
– CBO – execution plan may be affected; check explain plan
30
© 2009 IBM Corporation
April 17, 2009
Advanced Technical Support – System p
Lparstat command
# lparstat -i
Node Name                              : erpcc8
Partition Name                         :
Partition Number                       :
Type                                   : Dedicated
Mode                                   : Capped
Entitled Capacity                      : 4.00
Partition Group-ID                     :
Shared Pool ID                         :
Online Virtual CPUs                    : 4
Maximum Virtual CPUs                   : 4
Minimum Virtual CPUs                   : 1
Online Memory                          : 8192 MB
Maximum Memory                         : 9216 MB
Minimum Memory                         : 128 MB
Variable Capacity Weight               :
Minimum Capacity                       : 1.00
Maximum Capacity                       : 4.00
Capacity Increment                     : 1.00
Maximum Physical CPUs in system        : 4
Active Physical CPUs in system         : 4
Active CPUs in Pool                    :
Unallocated Capacity                   :
Physical CPU Percentage                : 100.00%
Unallocated Weight                     :
32
© 2009 IBM Corporation
April 17, 2009
Advanced Technical Support – System p
CPU Considerations
Use SMT with AIX 5.3/Power5 (or later) environments
Micro-partitioning Guidelines
– Virtual CPUs <= physical processors in shared pool
CAPPED
– Virtual CPUs should be the nearest integer >= capping limit
UNCAPPED
– Virtual CPUs should be set to the maximum peak demand requirement
– Entitlement >= Virtual CPUs / 3
DLPAR considerations
Oracle 9i
– Oracle CPU_COUNT does not recognize change in # cpus
– AIX scheduler can still use the added CPUs
Oracle 10g
– Oracle CPU_COUNT recognizes change in # cpus
Max CPU_COUNT limited to 3x CPU_COUNT at instance startup
33
© 2009 IBM Corporation
April 17, 2009
IBM Advanced Technical Support - Americas
CPU: Compatibility Matrix
[Matrix: SMT, DLPAR and Micro-Partitioning support across AIX 5.2, 5.3 and 6.1 with Oracle 9i, 10g and 11g]
Note: Oracle RAC 10.2.0.3 on VIOS 1.3.1.1 & AIX 5.3 TL07 and higher are certified
35
© 2009 IBM Corporation
April 17, 2009
IBM Advanced Technical Support - Americas
Agenda
AIX Configuration Best Practices for Oracle
– Memory
– CPU
– I/O
– Network
– Miscellaneous
36
© 2009 IBM Corporation
April 17, 2009
IBM Advanced Technical Support - Americas
The AIX IO stack
[Diagram: the AIX I/O stack, top to bottom]
– Application (the application memory area caches data to avoid I/O)
– Logical file system: raw LVs, raw disks, JFS, JFS2, NFS, other
– VMM (JFS and JFS2 caching uses extra system RAM: JFS uses persistent pages for cache, JFS2 uses client pages; NFS caches file attributes and provides a cached filesystem for NFS clients)
– LVM (LVM device drivers)
– Multi-path I/O driver (optional)
– Disk device drivers (queues exist for both adapters and disks)
– Adapter device drivers (use DMA for I/O)
– Disk subsystem (optional; has read and write cache)
– Disk (has memory to store commands/data: write cache and read cache or a memory area used for I/O)
I/Os can be coalesced (good) or split up (bad) as they go through the I/O stack
37
© 2009 IBM Corporation
April 17, 2009
IBM Advanced Technical Support - Americas
AIX Filesystems Mount options
Journaled File System (JFS)
Better for lots of small file creates & deletes
– Buffer caching (default) provides sequential read-ahead, cached writes, etc.
– Direct I/O (DIO) mount/open option: no caching on reads
Enhanced JFS (JFS2)
Better for large files/filesystems
– Buffer caching (default) provides sequential read-ahead, cached writes, etc.
– Direct I/O (DIO) mount/open option: no caching on reads
– Concurrent I/O (CIO) mount/open option: DIO, with write serialization disabled
• Use for Oracle .dbf, control files and online redo logs only!!!
GPFS
Clustered filesystem – the IBM filesystem for RAC
– Non-cached, non-blocking I/Os (similar to JFS2 CIO) for all Oracle files
GPFS and JFS2 with CIO offer performance similar to raw devices
40
© 2009 IBM Corporation
April 17, 2009
IBM Advanced Technical Support - Americas
AIX Filesystems Mount options (Cont’d)
Direct I/O (DIO) – introduced in AIX 4.3
• Data is transferred directly from the disk to the
application buffer, bypassing the file buffer
cache and hence avoiding double caching
(filesystem cache + Oracle SGA).
• Emulates a raw-device implementation.
To mount a filesystem in DIO:
$ mount -o dio /data
Concurrent I/O (CIO) – introduced with JFS2 in
AIX 5.2 ML1
• Implicit use of DIO.
• No inode locking: multiple threads can
perform reads and writes on the same file at
the same time.
• Performance achieved using CIO is
comparable to raw devices.
To mount a filesystem in CIO:
$ mount -o cio /data
[Chart: benchmark throughput over run duration – higher tps indicates better performance]
41
© 2009 IBM Corporation
April 17, 2009
IBM Advanced Technical Support - Americas
CIO/DIO Implementation Advice
File type: with standard mount options vs. with optimized mount options
– Oracle bin and shared libraries: standard = mount -o rw (cached by AIX); optimized = mount -o rw (cached by AIX)
– Oracle datafiles: standard = mount -o rw (cached by Oracle and by AIX); optimized = mount -o cio *(1) (cached by Oracle only)
– Oracle redo logs: standard = mount -o rw (cached by Oracle and by AIX); optimized = mount -o cio, jfs2 with agblksize=512 (cached by Oracle only)
– Oracle archive logs: standard = mount -o rw (cached by AIX); optimized = mount -o rbrw (uses JFS2 write-behind, but pages are not kept in the AIX cache)
– Oracle control files: standard = mount -o rw (cached by AIX); optimized = mount -o rw (cached by AIX)
*(1): to avoid demoted I/O, set jfs2 agblksize = Oracle DB block size / n
43
© 2009 IBM Corporation
April 17, 2009
IBM Advanced Technical Support - Americas
CIO Demotion and Filesystem Block Size
Data Base Files (DBF)
If db_block_size = 2048 set agblksize=2048
If db_block_size >= 4096 set agblksize=4096
Online redolog files & control files
Set agblksize=512 and use CIO or DIO
Mount filesystems with the "noatime" option
AIX (like Linux) records when files were created and last modified, as well
as when they were last accessed. Maintaining last-access times can cause
significant I/O overhead on frequently accessed, frequently changing files,
such as the contents of the /var/spool directory.
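A minimal sketch of creating and mounting such filesystems (the volume group, mount points and sizes are hypothetical; agblksize can only be set at filesystem creation time):
# datafile filesystem: agblksize matching an 8K db_block_size
crfs -v jfs2 -g oraclevg -m /oradata -A yes -a size=20G -a agblksize=4096
# online redo log filesystem
crfs -v jfs2 -g oraclevg -m /oralog -A yes -a size=4G -a agblksize=512
# mount with CIO and without access-time updates
mount -o cio,noatime /oradata
mount -o cio,noatime /oralog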
44
© 2009 IBM Corporation
April 17, 2009
IBM Advanced Technical Support - Americas
I/O Tuning (ioo)
READ-AHEAD (Only applicable to JFS/JFS2 with caching enabled)
MINPGAHEAD (JFS) or j2_minPageReadAhead (JFS2)
– Default: 2
– Starting value: MAX(2,DB_BLOCK_SIZE / 4096)
MAXPGAHEAD (JFS) or j2_maxPageReadAhead (JFS2)
– Default: 8 (JFS), 128 (JFS2)
– Set equal to (or multiple of) size of largest Oracle I/O request
• DB_BLOCK_SIZE * DB_FILE_MULTI_BLOCK_READ_COUNT
Number of buffer structures per filesystem:
numfsbufs (JFS):
– Default: 196, Starting value: 568
j2_nBufferPerPagerDevice (JFS2; superseded by j2_dynamicBufferPreallocation in later levels):
– Default: 512, Starting value: 2048
Monitor with “vmstat –v”
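For example, a sketch assuming db_block_size=8K and db_file_multiblock_read_count=16 (a 128KB read, i.e. 32 x 4KB pages) on a JFS2 filesystem:
# read-ahead sized to the largest Oracle read
ioo -p -o j2_minPageReadAhead=2 -o j2_maxPageReadAhead=32
# more filesystem buffer structures (JFS)
ioo -p -o numfsbufs=568
# then watch the "filesystem I/Os blocked with no fsbuf" counters
vmstat -v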
45
© 2009 IBM Corporation
April 17, 2009
IBM Advanced Technical Support - Americas
Data Layout for Optimal I/O Performance
Stripe and mirror everything (SAME) approach:
Goal is to balance I/O activity across all disks, loops, adapters, etc...
Avoid/Eliminate I/O hotspots
Manual file-by-file data placement is time consuming, resource intensive and iterative
Use RAID-5 or RAID-10 to create striped LUNs (hdisks)
Create AIX Volume Group(s) (VG) with LUNs from multiple
arrays, striping on the front end as well for maximum
distribution
– Physical Partition spreading (mklv -e x), or large-grained LVM striping (>= 1 MB stripe size)
http://www-1.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/WP100319
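A minimal sketch of the PP-spreading alternative (the volume group, LV name and size are hypothetical):
# spread the logical volume's physical partitions across all hdisks in the VG
mklv -y oradata_lv -t jfs2 -e x oraclevg 256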
46
© 2009 IBM Corporation
April 17, 2009
IBM Advanced Technical Support - Americas
Data Layout cont’d…
Stripe using Logical Volume (LV) or Physical Partition (PP) striping
LV Striping
– Oracle recommends stripe width of a multiple of
• Db_block_size * db_file_multiblock_read_count
• Usually around 1 MB
– Valid LV Strip sizes:
• AIX 5.2: 4k, 8k, 16k, 32k, 64k, 128k, 256k, 512k, 1 MB
• AIX 5.3: AIX 5.2 Stripe sizes + 2M, 4M, 16 MB, 32M, 64M, 128M
– Use AIX Logical Volume 0 offset (9i Release 2 or later)
• Use Scalable Volume Groups (VGs), or use “mklv –T O” with Big VGs
• Requires AIX APAR IY36656 and Oracle patch (bug 2620053)
PP Striping
– Use minimum Physical Partition (PP) size (mklv -t, -s parms)
• Spread AIX Logical Volume (LV) PPs across multiple hdisks in VG
(mklv –e x)
47
© 2009 IBM Corporation
April 17, 2009
IBM Advanced Technical Support - Americas
Tuning and Improving System Performance
Adjust the key IOO Tuning Parameters
Adjust device specific tuning Parameters
Other I/O tuning Options
– DIO / CIO
– Release behind or read and/or write
– IO Pacing
– Write Behind
Improve the data layout
Add additional hardware resources
48
© 2009 IBM Corporation
April 17, 2009
IBM Advanced Technical Support - Americas
Other I/O Stack Tuning Options (PBUFS)
When vmstat –v shows increasing “pending disk I/Os blocked with
no pbuf” values
lvmo:
max_vg_pbuf_count (lvmo) = maximum number of pbufs that may be
used for the VG.
pv_pbuf_count (lvmo) = the # of pbufs that are added when a PV is
added to the VG.
ioo:
pv_min_pbuf = The minimum # of pbufs per PV used by LVM
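For example, a sketch for a hypothetical volume group named oradatavg:
# show pbuf settings and blocked-I/O counts for the VG
lvmo -v oradatavg -a
# add more pbufs per physical volume in that VG
lvmo -v oradatavg -o pv_pbuf_count=1024
# re-check the global "pending disk I/Os blocked with no pbuf" counter
vmstat -v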
50
© 2009 IBM Corporation
April 17, 2009
IBM Advanced Technical Support - Americas
Other I/O Stack Tuning Options (Device Level)
lsattr/chdev:
num_cmd_elems = maximum number of outstanding I/Os for an adapter.
queue_depth = the maximum # of outstanding I/Os for an hdisk.
Recommended/supported maximum is storage subsystem dependent.
max_xfer_size = the maximum allowable I/O transfer size (default is 0x100000 or 256k).
Maximum supported value is storage subsystem dependent. Increasing value (to at
least 0x200000) will also increase DMA size from 16 MB to 256 MB.
dyntrk = When set to yes (recommended), allows for immediate re-routing of I/O
requests to an alternative path when a device ID (N_PORT_ID) change has been
detected.
fc_err_recov = When set to “fast_fail” (recommended), if the driver receives an RSCN
notification from the switch, the driver will check to see if the device is still on the
fabric and will flush back outstanding I/Os if the device is no longer found.
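For example, a sketch with hypothetical device names (the -P flag defers the change to the next reboot, which avoids having to take busy devices offline):
chdev -l fcs0 -a num_cmd_elems=1024 -a max_xfer_size=0x200000 -P
chdev -l fscsi0 -a dyntrk=yes -a fc_err_recov=fast_fail -P
chdev -l hdisk10 -a queue_depth=32 -P
# verify after the reboot
lsattr -El hdisk10 -a queue_depth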
52
© 2009 IBM Corporation
April 17, 2009
IBM Advanced Technical Support - Americas
Asynchronous I/O for filesystem environments
AIX parameters
minservers = minimum # of AIO server processes (system wide)
– AIX 5.3 default = 1 (system wide), 6.1 default = 3 (per CPU)
maxservers = maximum # of AIO server processes
– AIX 5.3 default = 10 (per CPU), 6.1 default = 30 (per CPU)
maxreqs = maximum # of concurrent AIO requests
– AIX 5.3 default = 4096, 6.1 default = 65536
“enable” at system restart (Always enabled for 6.1)
Typical 5.3 settings: minservers=100, maxservers=200, maxreqs=65536
– Raw Devices or ASM environments use fastpath AIO
> above parameters do not apply
> lsattr -El aio0 and look for the value "fastpath", which should be enabled
– CIO uses fastpath AIO in AIX 6.1
– For CIO fastpath AIO in 5.3 TL5+, set fsfastpath=1
> not persistent across reboot
Oracle parameters
disk_asynch_io = TRUE
filesystemio_options = {ASYNCH | SETALL}
db_writer_processes (let default)
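A minimal sketch of applying the typical AIX 5.3 settings listed above (aio0 attribute changes take effect after a reboot):
chdev -l aio0 -a autoconfig=available -a minservers=100 -a maxservers=200 -a maxreqs=65536 -P
# verify
lsattr -El aio0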
54
© 2009 IBM Corporation
April 17, 2009
IBM Advanced Technical Support - Americas
IO : Asynchronous IO (AIO)
• Asynchronous I/O allows multiple requests to be issued without having to wait until the disk subsystem has
completed the physical I/O.
• Use of asynchronous I/O is strongly advised whatever the type of filesystem and mount
option implemented (JFS, JFS2, CIO, DIO).
[Diagram: application (1) → AIO queue (2) → aioserver kprocs (3) → disk (4)]
Posix vs. Legacy
Since AIX 5L V5.3, two types of AIO are available: Legacy and Posix. For the moment, the
Oracle code uses the Legacy AIO servers.
55
© 2009 IBM Corporation
April 17, 2009
IBM Advanced Technical Support - Americas
IO : Asynchronous IO (AIO) fastpath
With fast_path, I/Os are queued directly from the application into the LVM layer without any
aioserver kproc involvement.
Better performance compared to non-fast_path
No need to tune the min and max aioservers
No aioserver kprocs => "ps -k | grep aio | wc -l" is not relevant; use "iostat -A" instead
[Diagram: application (1) → AIX kernel (2) → disk (3), with no aioserver kprocs in the path]
• Raw Devices / ASM:
check the AIO configuration with: lsattr -El aio0
enable asynchronous I/O fast_path:
AIX 5L:  chdev -a fastpath=enable -l aio0   (default since AIX 5.3)
AIX 6.1: ioo -p -o aio_fastpath=1   (default setting)
• Filesystems with CIO/DIO and AIX 5.3 TL5+:
activate fsfast_path (comparable to fast_path, but for filesystems with CIO/DIO)
AIX 5L:  add the following line to /etc/inittab:  aioo:2:once:aioo -o fsfast_path=1
AIX 6.1: ioo -p -o aio_fsfastpath=1   (default setting)
56
© 2009 IBM Corporation
April 17, 2009
IBM Advanced Technical Support - Americas
Asynchronous I/O for filesystem environments…
Monitor Oracle usage:
• Watch the alert log and *.trc files in the BDUMP directory for the warning message:
"Warning: lio_listio returned EAGAIN"
If warning messages are found, increase maxreqs and/or maxservers
Monitor from AIX:
• “pstat –a | grep aios”
• Use “-A” option for NMON
• iostat –Aq (new in AIX 5.3)
57
© 2009 IBM Corporation
April 17, 2009
IBM Advanced Technical Support - Americas
GPFS I/O Related Tunables
Refer Metalink note 302806.1
Async I/O:
Oracle parameter filesystemio_options is ignored
Set Oracle parameter disk_asynch_io=TRUE
prefetchThreads = number of GPFS prefetch threads (exactly what the name says)
– Usually set prefetchThreads=64 (the default)
worker1Threads = threads used for GPFS asynchronous I/O
– Set worker1Threads = 550 - prefetchThreads
Set aio maxservers = (worker1Threads / #cpus) + 10
Other settings:
GPFS block size is configurable; most installations will use 512KB-1MB
pagepool – the GPFS filesystem buffer cache; not used for RAC data files, but may be used for
binaries. Default = 64M
mmchconfig pagepool=100M
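A sketch of the corresponding GPFS thread settings, using the values from this slide (prefetchThreads left at its default of 64):
mmchconfig prefetchThreads=64
# 550 - prefetchThreads
mmchconfig worker1Threads=486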
58
© 2009 IBM Corporation
April 17, 2009
IBM Advanced Technical Support - Americas
I/O Pacing
I/O pacing parameters can be used to prevent large I/O
streams from monopolizing system resources, for example during:
– System backups (mksysb)
– DB backups (RMAN, NetBackup)
– Software patch updates
When Oracle Clusterware is used, use the AIX 6.1 defaults:
– chdev -l sys0 -a maxpout=8193 -a minpout=4096   (AIX 6.1 defaults)
– nfso -o nfs_iopace_pages=1024   (AIX 6.1 default)
– On the Oracle Clusterware, set: crsctl set css diagwait 13 -force
• This delays the OPROCD reboot time from 0.5 seconds to 10 seconds during node
eviction/reboot – just enough time to write the log/trace files for later diagnosis. Metalink
note# 559365.1
59
© 2009 IBM Corporation
April 17, 2009
IBM Advanced Technical Support - Americas
ASM configurations
AIX parameters
– Async I/O needs to be enabled, but default values may be used
ASM instance parameters
– ASM_POWER_LIMIT=1
Makes ASM rebalancing a low-priority operation. May be changed dynamically. It is
common to set this value to 0, then increase to a higher value during maintenance
windows
– PROCESSES = 25 + 15n, where n = # of instances using ASM
DB instance parameters
– disk_asynch_io=TRUE
– filesystemio_options=ASYNCH
– Increase PROCESSES by 16
– Increase LARGE_POOL by 600k
– Increase SHARED_POOL by [(1M per 100GB of usable space) + 2M]
60
© 2009 IBM Corporation
April 17, 2009
IBM Advanced Technical Support - Americas
Agenda
AIX Configuration Best Practices for Oracle
– Memory
– CPU
– I/O
– Network
– Miscellaneous
61
© 2009 IBM Corporation
April 17, 2009
IBM Advanced Technical Support - Americas
Network Options (no) Parameters
– Set sb_max >= 1 MB (1048576)
– Set tcp_sendspace >= 262144
– Set tcp_recvspace >= 262144
– Set rfc1323 = 1
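For example, a minimal sketch applying the values above persistently (tune upward for your adapters and workload):
no -p -o sb_max=1048576
no -p -o tcp_sendspace=262144 -o tcp_recvspace=262144
no -p -o rfc1323=1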
If use_isno=1, check to see whether settings have been overridden at the network
interface level:
$ no -a | grep isno
         use_isno = 1
$ lsattr -E -l en0 -H
attribute      value  description
rfc1323               N/A
tcp_nodelay           N/A
tcp_sendspace         N/A
tcp_recvspace         N/A
tcp_mssdflt           N/A
62
© 2009 IBM Corporation
April 17, 2009
IBM Advanced Technical Support - Americas
Additional Network (no) Parameters for RAC:
Set udp_sendspace = db_block_size *
db_file_multiblock_read_count
(not less than 65536)
Set udp_recvspace = 10 * udp_sendspace
– Must be < sb_max
– Increase if buffer overflows occur
Set ipqmaxlen=512 for GPFS environments
Use Jumbo Frames if supported at the switch layer
Examples:
no -a | grep udp_sendspace
no -p -o udp_sendspace=65536
netstat -s | grep "socket buffer overflows"
63
© 2009 IBM Corporation
April 17, 2009
IBM Advanced Technical Support - Americas
Agenda
AIX Configuration Best Practices for Oracle
– Memory
– CPU
– I/O
– Network
– Miscellaneous
64
© 2009 IBM Corporation
April 17, 2009
IBM Advanced Technical Support - Americas
Miscellaneous parameters
User limits (smit chuser)
– Soft FILE size = -1 (unlimited)
– Soft CPU time = -1 (unlimited)
– Soft DATA segment = -1 (unlimited)
– Soft STACK size = -1 (unlimited)
– (stored in /etc/security/limits)
Maximum number of PROCESSES allowed per user (smit chgsys)
– maxuproc >= 2048
Environment variables:
– AIXTHREAD_SCOPE=S
– LDR_CNTRL=DATAPSIZE=64K@TEXTPSIZE=64K@STACKPSIZE=64K oracle*
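A minimal sketch of the equivalent commands (the user name oracle is assumed):
# unlimited soft limits for the oracle user
chuser fsize=-1 cpu=-1 data=-1 stack=-1 oracle
# raise the per-user process limit system-wide
chdev -l sys0 -a maxuproc=2048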
65
© 2009 IBM Corporation
April 17, 2009
IBM Advanced Technical Support - Americas
Q&A
71
© 2009 IBM Corporation
April 17, 2009
IBM Advanced Technical Support - Americas
Trademarks
The following are trademarks of the International Business Machines Corporation in the United States and/or other countries. For a complete list of IBM Trademarks, see www.ibm.com/legal/copytrade.shtml: AS/400,
DBE, e-business logo, ESCO, eServer, FICON, IBM, IBM Logo, iSeries, MVS, OS/390, pSeries, RS/6000, S/30, VM/ESA, VSE/ESA, Websphere, xSeries, z/OS, zSeries, z/VM
The following are trademarks or registered trademarks of other companies
Lotus, Notes, and Domino are trademarks or registered trademarks of Lotus Development Corporation
Java and all Java-related trademarks and logos are trademarks of Sun Microsystems, Inc., in the United States and other countries
LINUX is a registered trademark of Linus Torvalds
UNIX is a registered trademark of The Open Group in the United States and other countries.
Microsoft, Windows and Windows NT are registered trademarks of Microsoft Corporation.
SET and Secure Electronic Transaction are trademarks owned by SET Secure Electronic Transaction LLC.
Intel is a registered trademark of Intel Corporation
* All other products may be trademarks or registered trademarks of their respective companies.
NOTES:
Performance is in Internal Throughput Rate (ITR) ratio based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput that any user will experience will
vary depending upon considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be
given that an individual user will achieve throughput improvements equivalent to the performance ratios stated here.
IBM hardware products are manufactured from new parts, or new and serviceable used parts. Regardless, our warranty terms apply.
All customer examples cited or described in this presentation are presented as illustrations of the manner in which some customers have used IBM products and the results they may have achieved. Actual
environmental costs and performance characteristics will vary depending on individual customer configurations and conditions.
This publication was produced in the United States. IBM may not offer the products, services or features discussed in this document in other countries, and the information may be subject to change without notice.
Consult your local IBM business contact for information on the product or services available in your area.
All statements regarding IBM's future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.
Information about non-IBM products is obtained from the manufacturers of those products or their published announcements. IBM has not tested those products and cannot confirm the performance, compatibility, or
any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.
Prices subject to change without notice. Contact your IBM representative or Business Partner for the most current pricing in your geography.
References in this document to IBM products or services do not imply that IBM intends to make them available in every country.
Any proposed use of claims in this presentation outside of the United States must be reviewed by local IBM country counsel prior to such use.
The information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may
make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.
Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the
materials for this IBM product and use of those Web sites is at your own risk.
72
© 2009 IBM Corporation
April 17, 2009