RAC Performance – DOUG


DOUG – Technology Day

Tuning Seminar by Nitin Vengurlekar

Nitin Vengurlekar

• 18 Years with Oracle

– 6 years with Oracle Support

– 9 years with RAC Product Management

– 3 years as a “Private Database Cloud” Evangelist

• Worked with numerous customers on consolidation and rationalization planning.

• Taking these key customers to reference-ability

• Developed white papers and Best Practices for Application/Database High Availability and Consolidation

• Follow me on Twitter: dbcloudshifu


RAC FUNDAMENTALS AND INFRASTRUCTURE SUPPORT

Content Contributors

Thanks to all the past Content Contributors

• Michael Zoll

• Markus Michalewicz

• Barb Lundhild

• Saar Moaz

• John McHugh

• My “former self,” Nitin Vengurlekar (circa 2009)

Objectives

• Not gonna give you scripts or queries

– You can find that on the InterWeb

• Gonna cover basics of buffer/block management in RAC

– So you know what is happening when it happens

• Review key metrics/waits and their dependencies

– So you know where to start looking for causality

• Check out the next session on RAC Buffer Cache Internals for a deeper dive

Agenda

Understanding RAC Cache Fusion for Practical RAC Performance Analysis

• RAC Fundamentals and Infrastructure

• Common Problems and Symptoms

• Application and Database Design

• Diagnostics and Problem Determination

• Summary: Practical Performance Analysis

RAC Cluster 11gR2 Architecture

[Diagram: Nodes 1 through n, each running a Service, VIP (VIP1, VIP2, … VIPn), Listener, and DB Instance on top of ASM, Oracle Clusterware, and the Operating System; all nodes attach to the public network and to shared storage managed by ASM, which holds the redo/archive logs of all instances, the database/control files, and the OCR and voting disks]

Under the Covers

[Diagram: Instances 1 through n on Nodes 1 through n, connected by the cluster private high-speed network. Each instance’s SGA contains the Global Resource Directory, dictionary cache, library cache, log buffer, and buffer cache, with background processes LMON, LMD0, DIAG, LCK0, LMS0, LMHB, LGWR, SMON, DBW0, and PMON. Each instance has its own redo log files; all instances share the data files and control files]

Two Key Components – Cache Fusion

• Global Cache Service (GCS)

• Global Enqueue Services (GES)

• Global Resource Directory (GRD)

Global Enqueue Service (GES)

• GES maintains synchronization of the dictionary cache, library cache, transaction locks, and DDL locks.

– Easier just to say “GES manages enqueues other than data blocks” 

– LCK and LMD are processes that manage GES

• Maintains local and global enqueues

– V$ENQUEUE_STATISTICS displays enqueues with the highest impact.

– GV$LOCK – global view of local locks

– GV$GES_ENQUEUES – global view of global locks that are blocking or being blocked

– TX, TM, SQ, TA, US are typical enqueues
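
As a starting point, a minimal query along these lines (column names as in 11g; verify against your version) surfaces the highest-impact enqueues:

select eq_type, total_req#, total_wait#, cum_wait_time
from v$enqueue_statistics
order by cum_wait_time desc;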

Global Enqueue Service (GES) - Example

• A process is trying to acquire an HW enqueue

• A BAST (Blocking Asynchronous Trap) message is sent to the LCK process

• LCK constructs the message

– The message includes the lock pointer, resource pointer, and resource name

• If the resource is not available, then the LCK process sends a message to the lock holder asking for a lock downgrade.

– Can be seen as ‘DFS lock handle’ waits
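
A quick, hedged way to spot these on a live system (‘DFS lock handle’ is the exact event name; gv$session is standard):

select inst_id, sid, event, p1text, p1
from gv$session
where event = 'DFS lock handle';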

Global Cache Services (GCS)

• Guarantees cache coherency

– Ensures that instances acquire a resource cluster-wide before modifying or reading a database block

• Minimizes access time to data that is not in the local cache and would otherwise be read from disk or rolled back

– Synchronizes global cache access – PCM ;-)

• Implements direct memory access over interconnect

• Uses an efficient and scalable messaging protocol

– skgxp

Global Resource Directory (GRD)

• The GRD records information about the current status of data blocks, resources, and enqueues.

• GRD is managed and maintained by GES and GCS.

• Each running instance stores a portion of the directory.

• LMON recovers the GRD during instance recovery

Global Cache Resource Relationship

[Diagram: 8k on-disk block ↔ 8k buffer header (x$bh, ~200 bytes) ↔ Lock Element (LE, x$le) ↔ DLM Lock (x$kjbl) ↔ DLM Resource (x$kjbr)]

Three Players in this Chess[RAC]-Match

• Requestor

– The session (from an instance) that is making the request for the buffer/block

• Master (kjblmaster)

– Instance that has that buffer mastered

– Maintains grant and convert queues

– Buffer ranges are mastered by different instances, providing an even distribution of mastered locks

– Block mastering can change for various reasons (group membership changes, DRM, manually via event)

• Holder (kjblowner)

– Instance that has the buffer cached
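
To check which instance currently masters a given object, a sketch against v$gcspfmaster_info (populated when DRM/affinity is active; master numbers in it are zero-based; the DATA_OBJECT_ID below is a placeholder):

select data_object_id, current_master, previous_master, remaster_cnt
from v$gcspfmaster_info
where data_object_id = 12345;  -- assumption: your object's DATA_OBJECT_ID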

Buffer Blocks Basics

A word on Current and CR blocks

• Oracle includes a block multi-versioning architecture

– RAC extends that to multi-nodal multi-versioning

• A block can be either a current data block or a consistent read (CR) version of a block.

– The current block contains changes for all committed and uncommitted transactions. All DML accesses to a block are made in this mode

– A consistent read (CR) version of a block represents a consistent snapshot of the data at a previous point in time. Select (read) requests are made in this mode

• Oracle applies undo segments to current blocks to produce the appropriate CR version of a block – not RAC specific!

• Both the current and consistent read blocks are managed by the GCS.

Cache Fusion – 1-, 2-, and 3-way Transfers

• 1-way block transfer: No block transfer at all. Requestor, Holder, and Master are all the local instance. Close to single-instance performance.

• 2-way block transfer: Node A performs a hash/directory lookup and finds that another node, e.g. Node B, is the master of this block. Node A sends a message to B for that block. Node B finds that no other node holds the block and ships it back to Node A.

• 3-way block transfer: Node A requests the block and, via lookup, determines Node B is the master, but B finds that Node C currently holds the block. Node B forwards the request to Node C, and Node C sends the block to Node A. That’s 3-way.
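
A hedged sketch for watching the average round trip per transfer type (v$system_event is standard; the LIKE pattern assumes the 11g event names shown above):

select event, total_waits,
       round(time_waited_micro/total_waits) avg_us
from v$system_event
where event like 'gc c%-way'
order by time_waited_micro desc;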

GCS Coordination

Example 1

Assume data block C has been read and dirtied by InstanceA.

Only one copy of the block exists cluster-wide, represented by its SCN.

1. InstanceB, attempting to modify the block, submits a request to LMS.

2. LMS transmits the request to the holder, InstanceA.

3. InstanceA receives the message, flushes the redo associated with the dirtied buffer, and sends the block to InstanceB.

4. InstanceA retains the dirty buffer for recovery purposes. This dirty image of the block is called a past image (PI) of the block. A PI block cannot be modified further.

5. On receipt of the block, InstanceB informs the GCS that it holds the block.

Note: The data block is not written to disk before the resource is granted to InstanceB.

GCS Coordination Example 2

Write to Disk Coordination

In this scenario, assume that InstanceA, holding a past image (PI) buffer, requests that Oracle write the buffer to disk:

1. InstanceA sends a write request to the GCS.

2. The GCS forwards the request to InstanceB, the holder of the current version of the block.

3. InstanceB receives the write request and writes the block to disk.

4. InstanceB records the completion of the write operation with the GCS.

5. After receipt of the notification, the GCS orders all past image holders to discard their past images. These past images are no longer needed for recovery.

Note: In this case, only one I/O is performed to write the most current version of the block to disk.

GCS Coordination Example 2

Write to Disk Coordination

This scenario illustrates what happens when an instance invokes a checkpoint or must clean buffers due to free buffer requests.

Because multiple versions of the same data block with different changes can exist in the caches of instances in the cluster, a write protocol managed by the GCS ensures that only the most current version of the data is written to disk.

Disk block writes are only required for cache replacement. A past image (PI) of a block is kept in memory before the block is sent if it is a dirty block. In the event of failure, Oracle reconstructs the current version of the block by reading the PI blocks.

Key Layers that Affect RAC Performance

• Local disk block or buffer access is ‘same-ol-same-ol’

• Remote cache access is driven by round-trip time

• Latency variation (and CPU cost ) correlates with

– Block transfer (“wire time”)

– Block Contention

– Block Access Cost – block preparation

– Delayed log flushes

– CPU saturation/LMS scheduling

– IO latency

• Transfer Path Length

[Diagram: packets with headers cross the network path, then socket queues, then CPU (sys/user time), accumulating time in queues at each stage]

• Wire latency is very small

– ~50% of fixed overhead is in the kernel

– Protocol (e.g. UDP, RDS) dependent

• IPC queue lengths are variable

– Depends on incoming rate and service time

• Context switch and scheduling delay (CPU queue) are variable

– Depends on process concurrency & CPU load

• Hence: time in queues can vary under load

Performance of immediate message transfers depends practically on minimizing queue and context-switch time

[Diagram: block transfer time accumulates across wire → NIC → driver and IP stack → context switch → IPC → global cache layer (RDBMS)]

Interconnect and IPC processing – “Wire-Time”

[Diagram: the requestor initiates a send (message ~200 bytes) and waits; LMS on the holder receives, processes the block, and sends it back; the block itself (e.g. 8K) takes 8192 bytes / (1 Gb/sec) on the wire]

Total access time: e.g. ~360 microseconds (UDP over 10GbE)

Network propagation delay (“wire time”) is a minor factor in round-trip time (approx. 6%, vs. 52% in the OS and network stack)

Block Access Cost

Cost determined by

• Block server process load

• Message propagation delay

• Operating system scheduling

• IPC CPU

• Block Access Cost = message propagation delay + IPC CPU + operating system scheduling + block server load

Block Contention

• The contention-oriented waits occur when a session attempts to read or modify a globally cached buffer, but the buffer cannot be shipped immediately

• The following are possibilities:

– Buffer was pinned by a session on another node

– A change to the buffer had not been flushed to disk

– Too many other waiters on the grantors list, caused by frequent concurrent read and write accesses to the same data.

• Block Contention Wait Events

– gc current block busy

– gc cr block busy

– gc buffer busy acquire

– gc buffer busy release
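
To map these waits back to segments, a minimal sketch (v$segment_statistics is standard; check the exact statistic name against v$segstat_name on your version):

select * from (
  select owner, object_name, value
  from v$segment_statistics
  where statistic_name = 'gc buffer busy'
  order by value desc)
where rownum <= 10;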

Block Access Cost

Prepare, Build to ship

• Two key factors of Cache Fusion latency:

• CR block request time = build time + flush time + send time

• Current block request time = pin time + flush time + send time

• Always review the send times from the other instances as well

Infrastructure: Cache Fusion Latency

• Average Prepare Latency =

– Blocks Served Time/Blocks Served

• Blocks Served Time =

– gc cr block build time +

– gc cr block flush time +

– gc current block pin time +

– gc current block flush time +

– gc current block send time +

– gc cr block send time

• Blocks Served =

– gc cr blocks served + gc current blocks served
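
The raw inputs for this calculation come from v$sysstat; a hedged sketch (statistic names as in 11g; the times are reported in centiseconds, so scale accordingly):

select name, value
from v$sysstat
where name in ('gc cr block build time', 'gc cr block flush time',
               'gc cr block send time', 'gc current block pin time',
               'gc current block flush time', 'gc current block send time',
               'gc cr blocks served', 'gc current blocks served');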

Identifying Issues

No-contention Global Cache Access – how it looks in AWR

[AWR screenshot: accurate average of ~100 µsecs]

Factors affecting performance of immediate global cache access:

– Machine load: process concurrency for CPU, scheduling, CPU utilization

– Interconnect bandwidth: total bandwidth utilization for the database(s)

– LMS processes: real-time priority, CPU busy

• No application tuning required

• Latency in the cluster has a small impact

• Average performance is good

Identifying Issues

Global Cache Access with application contention – how it looks in AWR

[AWR screenshot callouts: “Block on the way from another instance”; index contention; transfer delayed by log flush on other node(s); impact of application contention]

Factors affecting performance with application contention on data:

– Log file IO latency (LGWR responsiveness)

• Schema tuning may be required

– If the application response time or throughput does not meet objectives

Identifying Issues

SQL and Schema Optimization: Identifying SQL incurring the highest Cluster Wait Time

[AWR screenshot: indexes with high contention, one accounting for 84%]
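
Outside of AWR, a minimal sketch against gv$sql finds the same offenders (cluster_wait_time is reported in microseconds):

select * from (
  select inst_id, sql_id, executions,
         round(cluster_wait_time/1e6, 1) cluster_secs,
         round(elapsed_time/1e6, 1) elapsed_secs
  from gv$sql
  order by cluster_wait_time desc)
where rownum <= 10;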

Identifying Issues

Cause and Effect are distributed – How to read the Global Impact

[Screenshot of two instances, racdb1_1 and racdb1_3: global cache wait events at ~35% of db time – significant, higher than expected; a block pinged out with sessions waiting for its return; local sessions waiting for transfer; transfer delayed by log flush on other node(s)]

Variance and outliers indicate that IO to the log file disk group affects performance in the cluster

Cluster Cache Efficiency – AWR

COMMON PROBLEMS AND SYMPTOMS

Misconfigured or Faulty Interconnect Can Cause:

• Dropped packets/fragments

• Buffer overflows

• Packet reassembly failures or timeouts

• Ethernet flow control kicks in

• TX/RX errors

“Lost blocks” at the RDBMS level are responsible for 64% of escalations

“Lost Blocks”: NIC Receive Errors

ifconfig -a:

eth0  Link encap:Ethernet  HWaddr 00:0B:DB:4B:A2:04
      inet addr:130.35.25.110  Bcast:130.35.27.255  Mask:255.255.252.0
      UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
      RX packets:21721236 errors:135 dropped:0 overruns:0 frame:95
      TX packets:273120 errors:0 dropped:1105 overruns:0 carrier:0

“Lost Blocks”: IP Packet Reassembly Failures

netstat -s

Ip:
    84884742 total packets received
    1201 fragments dropped after timeout
    3384 packet reassembles failed
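
Inside the database, the same problem shows up in the global cache statistics; a hedged sketch (statistic names as in 11g):

select inst_id, name, value
from gv$sysstat
where name in ('gc blocks lost', 'gc blocks corrupt')
order by inst_id, name;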

Finding a Problem with the Interconnect or IPC

Top 5 Timed Events
                                            Avg wait  %Total
Event               Waits      Time(s)     (ms)      Call Time  Wait Class
------------------  ---------  ----------  --------  ---------  ----------
log file sync       286,038    49,872      174       41.7       Commit
gc buffer busy      177,315    29,021      164       24.3       Cluster
gc cr block busy    110,348    5,703       52        4.8        Cluster
gc cr block lost    4,272      4,953       1159      4.1        Cluster
cr request retry    6,316      4,668       739       3.9        Other

gc cr block lost and cr request retry should never be here

CPU Saturation or Memory Depletion

Top 5 Timed Events
                                                       Avg wait  %Total
Event                       Waits      Time(s)  (ms)   Call Time  Wait Class
--------------------------  ---------  -------  -----  ---------  ----------
db file sequential read     1,312,840  21,590   16     21.8       User I/O
gc current block congested  275,004    21,054   77     21.3       Cluster
gc cr grant congested       177,044    13,495   76     13.6       Cluster
gc current block 2-way      1,192,113  9,931    8      10.0       Cluster
gc cr block congested       85,975     8,917    104    9.0        Cluster

Congested: LMS could not dequeue messages fast enough

Cause: long run queues and paging on the cluster nodes

Impact of IO Capacity Issues or Bad SQL Execution on RAC

• Log flush IO delays can cause “busy” buffers

• “Bad” queries on one node can saturate the link

• IO is issued from ALL nodes to shared storage (beware of one-node “myopia”)

The cluster-wide impact of IO or query plan issues is responsible for 23% of escalations

Summary

Look for:

• High impact of “lost blocks”, e.g. gc cr block lost

• IO capacity saturation, e.g. gc cr block busy

• Overload and memory depletion, e.g. gc current block congested

All events with these tags are potential issues if their % of db time is significant.

Compare with the lowest measured latency (target; c.f. session history reports or the event histogram view)
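
A hedged sketch for reading that lowest measured latency per event (v$event_histogram buckets wait counts by latency band):

select event, wait_time_milli, wait_count
from v$event_histogram
where event like 'gc cr block%' or event like 'gc current block%'
order by event, wait_time_milli;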

INFRASTRUCTURE BEST PRACTICES AND CONFIGURATION

Infrastructure: Interconnect Bandwidth

• Interconnect must be private (can be VLAN)

– Should not roll up to distribution switch (keep in layer 2 domain)

• Network Cards

– Use a fast interconnect: 10GbE (with jumbo frames) or InfiniBand

– Multiple NICs generally not required for performance and scalability

• Bandwidth requirements depend on

– CPU power per cluster node

– Application-driven data access frequency

– Number of nodes and size of the working set

– Data distribution between PQ slaves

Infrastructure: IPC configuration

Settings:

– Socket receive buffers (256 KB – 4 MB)

– Negotiated top bit rate and full duplex mode

– NIC ring buffers

– Ethernet flow control settings

– CPU(s) receiving network interrupts

Verify your setup:

– CVU does checking

– Load testing eliminates potential for problems

Infrastructure: IO Capacity

• Disk storage is shared by all nodes, i.e. the aggregate IO rate is important

• Log file IO latency can be important for block transfers

– Log file writes that exceed 500 ms are logged

• Parallel Execution across cluster nodes requires a well-scalable IO subsystem

– Disk configuration needs to be responsive and scalable

– Test with Calibrate I/O or Orion or SLOB2
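
A minimal Calibrate I/O sketch (the dbms_resource_manager.calibrate_io signature is standard; the disk count and latency target are assumptions to set for your storage):

set serveroutput on
declare
  l_iops    pls_integer;
  l_mbps    pls_integer;
  l_latency pls_integer;
begin
  dbms_resource_manager.calibrate_io(
    num_physical_disks => 24,  -- assumption: your spindle/LUN count
    max_latency        => 10,  -- assumption: acceptable latency in ms
    max_iops           => l_iops,
    max_mbps           => l_mbps,
    actual_latency     => l_latency);
  dbms_output.put_line('max_iops=' || l_iops || ' max_mbps=' || l_mbps ||
                       ' actual_latency=' || l_latency);
end;
/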

APPLICATION AND DATABASE DESIGN


Application Considerations

How to avoid resource contention in applications

[Diagram: two RAC nodes (hosts comet and vixen, instances racdb1_3 and racdb1_4, each running Oracle RAC on Oracle GI) served through a shared connection pool]

• Scheduling delays from high context-switch rates on busy systems may increase the variation in cluster traffic times

• Latch and mutex contention can cause priority-inversion issues for critical background processes

• More processes imply higher memory utilization and higher risk of paging

• Control the number of concurrent processes

– Use connection pooling

– Avoid connection storms

(pool and process limits)

• Ensure that load is well-balanced over nodes

Services

• Application Workloads can be defined as Services

– Workload Management

– Never, ever use instance names

• Individually managed and controlled

• On instance failure, automatic re-assignment

• Service Performance is individually tracked

• Finer Grained Control with Resource Manager

• Integrated with other tools – e.g. Scheduler, Streams

• Managed by Oracle Clusterware

• Several services created and managed by the database
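
Since service performance is tracked individually, a hedged sketch for comparing services across instances (gv$service_stats is standard; the statistic set it tracks varies by version):

select inst_id, service_name, stat_name, value
from gv$service_stats
where stat_name in ('DB time', 'gc cr blocks received', 'gc current blocks received')
order by service_name, inst_id, stat_name;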

Scalability Pitfalls

• Serializing contention on a small set of data/index blocks

– monotonically increasing key

– frequent updates of small cached tables

– segments without ASSM or Free List Groups (FLG)

• Full table scans

• Frequent hard parsing

• Concurrent DDL (e.g. truncate/drop)

Index Block Contention: Optimal Design

• Monotonically increasing sequence numbers

• Large sequence number caches; audit with:

select sequence_owner, sequence_name, increment_by, cache_size,
       order_flag, last_number
from dba_sequences
where sequence_owner not in (&OLIST1)
order by sequence_owner, cache_size, last_number;

• Hash or range partitioning

– Local indexes
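
Two hedged remediation sketches (app.order_seq and app.orders are hypothetical names; the sizes are illustrative):

-- enlarge the sequence cache; drop ordering if strict ordering is not required
alter sequence app.order_seq cache 10000 noorder;

-- hash-partition the hot index so inserts spread across leaf blocks
create index app.orders_pk_ix on app.orders (order_id)
  global partition by hash (order_id) partitions 16;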

Data Block Contention

Optimal Design

• Small tables with high row density and frequent updates and reads can become “globally hot” with serialization, e.g.

– Queue tables

– session/job status tables

– last trade lookup tables

• Higher PCTFREE for the table reduces the number of rows per block
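
For example, a hedged sketch (app.job_status is a hypothetical hot table; existing blocks are unaffected until the table is rebuilt):

-- leave most of each block empty so fewer rows share a block
alter table app.job_status pctfree 90;

-- rebuild so existing rows are redistributed (offline; rebuild indexes afterwards)
alter table app.job_status move;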

Summary

Look for:

• Indexes with right-growing characteristics

– Eliminate indexes which are not needed

• Frequent updates and reads of “small” tables

– “small” = fits into a single buffer cache

• SQL which scans large amounts of data

– Bad execution plan

– More efficient when parallelized


SUMMARY: PRACTICAL PERFORMANCE ANALYSIS

Global Cache Event Semantics

All global cache events follow this format:

gc <CR/current> <block/grant> <2-way/3-way/busy/congested>

• CR, current

– Buffers requested and received for read or write

• block, grant

– Received a block, or a grant to read from disk

• 2-way, 3-way

– Immediate response to a remote request after N hops

• busy

– Block or grant was held up because of contention

• congested

– Block or grant was delayed because LMS was busy or could not get the CPU

What to Look For

Look for:

• gc [current/cr] [2/3]-way – Monitor if the average wait is > 1 ms or close to disk I/O latency. Look at reducing latency

• gc [current/cr] grant 2-way – Permission to read from disk. Monitor disk I/O

• gc [current/cr] [block/grant] congested – Long block access cost. Review CPU/memory utilization

• gc [current/cr] block busy – Review block contention

• gc [current/cr] [failure/retry] – Review the private interconnect for network errors or hardware problems
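
A hedged sketch for a quick cluster-wait profile from ASH over the last 30 minutes (gv$active_session_history requires the Diagnostics Pack):

select event, count(*) samples
from gv$active_session_history
where event like 'gc%'
  and sample_time > sysdate - 30/1440
group by event
order by samples desc;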

Cluster Cache Coherency - EM

Cluster Database - EM

General RAC Principles

• Performance Monitoring & Diagnosis tools for RAC

– AWR captures data from all active instances of the cluster

– ADDM presents data in a cluster-wide perspective

– ASH reports statistics for all active sessions of all active instances

– Enterprise Manager is RAC-aware

• No fundamentally different design and coding practices for RAC

• Badly tuned SQL and schemas will not run better on RAC

• Serializing contention makes applications less scalable

• Standard SQL and schema tuning solves > 80% of performance problems

More detailed information is available at viscosityna.com or by talking to a real person at 469.444.1380
