Oracle Cache Fusion – In Operation

} Agenda
} Cache Fusion
– What is it?
– Cache Coherency vs. Cache Fusion
– Key components and terminology
} Cache Fusion in operation
– Lock Mastering & Resource Affinity
– Types of Contention
– Cache Fusion – I
– Cache Fusion – II
– Examples
} Instance Crash Recovery in RAC
– Key components in an instance crash
– I Pass recovery
– II Pass recovery

} Cache Fusion – What is it?
Oracle introduced a framework for sharing data over the private interconnect between the nodes, which previous versions used only for messaging. This protocol is Cache Fusion. Data blocks are shipped across the network much like messages, replacing the most expensive component of data transfer, disk I/O, with direct data sharing.
According to the manual:
– Global Cache Service (GCS): Process that implements Cache Fusion. It maintains the block mode for blocks in the global role. It is responsible for block transfers between instances. The Global Cache Service employs various background processes such as the Global Cache Service Processes (LMSn) and Global Enqueue Service Daemon (LMD).
– Cache Fusion: A diskless cache coherency mechanism in Oracle Real Application Clusters that provides copies of blocks directly from a holding instance's memory cache to a requesting instance's memory cache.

} Cache Coherency
} According to the manual: the synchronization of data in multiple caches so that reading a memory location through any cache will return the most recent data written to that location through any other cache. Sometimes called cache consistency.
} Put simply, cache coherency means maintaining the status of each resource (block). Two services together provide this, under the name Global Resource Directory (GRD):
– GCS (Global Cache Services)
– GES (Global Enqueue Services)

} Now both together
} The GCS manages all types of data blocks.
Cache coherency is maintained through the GCS by requiring that instances acquire a resource (a lock or enqueue on a block) cluster-wide before modifying or reading a database block. The GCS synchronizes global cache access, allowing only one instance to modify a block at any single point in time. Through the RAC-wide Global Resource Directory, the GCS ensures that the status of data blocks cached in any mode in the cluster is globally visible and maintained.
} Oracle RAC has a multi-versioning architecture that distinguishes between current data blocks and one or more consistent read (CR) versions of a block. A current block contains changes for all committed and yet-to-be-committed transactions. A CR version of a block represents a consistent snapshot of the data at a previous point in time. A data block can reside in many buffer caches under the auspices of shared resources.
} In Oracle9i RAC, applying rollback segment information to current blocks produces consistent read versions of a block. Both the current and consistent read blocks are managed by the GCS.
} To transfer data blocks among database caches, buffers are shipped over the high-speed IPC interconnect. Disk writes are required only for cache replacement. If a block is dirty (modified), a past image (PI) of it is kept in memory before the block is sent. In the event of failure, Oracle reconstructs the current version of the block by reading the PI blocks.

} Background processes and their roles
} LMSn – Global Cache Service Processes (GCS)
– Primarily responsible for shipping blocks between buffer caches.
– Creates a CR image whenever there is a cross-instance call for a dirty block.
– Must also check constantly with the LMD background process (the GES process) to pick up the lock requests placed by LMD.
– Parameter: GCS_SERVER_PROCESSES, up to 36 as of 10.2, min. cpu_count/2.
} LMON – Lock Monitor Process (GES)
– Manages the global locks and resources.
– Handles reconfiguration of locks and resources when an instance joins or leaves the cluster (LMON generates trace files during reconfiguration).
– Also provides cluster group services.
} LMD – Lock Manager Daemon (GES)
– Performs global deadlock detection, both local and remote.
– Also monitors for lock conversion timeouts.
– Maintains the lock queues and traverses the GES structures.
} LCK – Lock Process
– Manages instance resource requests and cross-instance calls for shared resources.
– During instance recovery, builds a list of invalid lock elements and validates lock elements.
} DIAG – Diagnostic Daemon
– A new background process in Oracle 10g, part of the new enhanced diagnosability framework.
– Regularly monitors the health of the instance.
– Also checks for instance hangs and deadlocks.

} History of Cache Fusion
– Prior to 8.1.5: OPS. OPS used disk-based pings.
– 8.1.5: Cache Fusion I, or Consistent Read Server. The consistent read version of the block is transferred over the interconnect.
– 9i: Cache Fusion II (write/write cache fusion). The current version of the block is transferred over the interconnect.
– 10g R1: Oracle Cluster Ready Services (CRS). CRS eliminates the need for third-party clusterware, though third-party clusterware can still be used.
– 10g R2: Oracle CRS for High Availability. CRS provides high availability for non-Oracle applications.

} Key Components in Cache Fusion
– Ping
The transfer of a data block from one instance's buffer cache to another instance's buffer cache is known as a ping. Whenever an instance needs a block, it sends a request to the lock master to obtain a lock in the desired mode. If another lock already exists on the same block, the master asks the current holder to downgrade or release its lock. This request is known as a blocking asynchronous trap (BAST).
When an instance receives a BAST, it downgrades the lock as soon as possible. However, before downgrading the lock, it might have to write the corresponding block to disk. This sequence of operations is known as a disk ping or hard ping.
– CR Fabrication
Whenever there is a consistent read request from another instance, the holding instance (LMS) has to create a consistent read image by applying the undo information to the current block. CR fabrication can be I/O expensive, because the undo has to be read into the buffer cache and then applied to the block.
– Past Image (PI) Blocks
PI blocks are copies of blocks in the local buffer cache. Whenever an instance has to send a block it has recently modified to another instance, it preserves a copy of that block and marks it as a PI. An instance is obliged to keep its PIs until the block is written to disk by the current owner of the block. PIs are discarded after the latest version of the block is written to disk. When a block is written to disk and is known to have a global role, indicating the presence of PIs in other instances' buffer caches, Global Cache Services (GCS) informs the instances holding the PIs to discard them. With Cache Fusion, a block is written to disk to satisfy checkpoint requests and the like, not to transfer the block from one instance to another via disk.
– Lock Mastering
The memory structure in which GCS keeps information about the usage of a data block (and other sharable resources) is known as the lock resource. The responsibility for tracking locks is distributed among all the instances, and the required memory also comes from the participating instances' System Global Area (SGA). Because ownership of the resources is distributed in this way, a master node exists for each lock resource. The master node maintains complete information about current users of and requestors for the lock resource. The master node also holds information about the PIs of the block.
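The distributed mastering described above can be pictured with a small sketch. It is only an illustration under stated assumptions: the hash-modulo placement and the names `LockResource`, `ResourceDirectory`, `master_of` and `request` are hypothetical and do not represent Oracle's actual algorithm or memory structures.

```python
from dataclasses import dataclass, field
import zlib


@dataclass
class LockResource:
    """What a master node tracks for one resource (hypothetical layout)."""
    holders: dict = field(default_factory=dict)     # instance id -> mode ("S" or "X")
    requestors: list = field(default_factory=list)  # queued (instance id, mode) requests
    pi_holders: set = field(default_factory=set)    # instances keeping a past image


class ResourceDirectory:
    """Toy directory: each resource is mastered on exactly one instance."""

    def __init__(self, instances):
        self.instances = sorted(instances)
        # One slice of the directory per instance, standing in for SGA memory.
        self.slices = {i: {} for i in self.instances}

    def master_of(self, resource_name):
        # Assumption: a stable hash modulo the instance count picks the master.
        h = zlib.crc32(resource_name.encode())
        return self.instances[h % len(self.instances)]

    def resource(self, resource_name):
        # The lock resource lives only in the master's slice of the directory.
        master = self.master_of(resource_name)
        return self.slices[master].setdefault(resource_name, LockResource())

    def request(self, resource_name, instance, mode):
        res = self.resource(resource_name)
        # Grant if the resource is free, or if all parties only want shared access.
        all_shared = mode == "S" and all(m == "S" for m in res.holders.values())
        if not res.holders or all_shared:
            res.holders[instance] = mode
            return "granted"
        # Otherwise the master queues the request (a BAST would go to the holders).
        res.requestors.append((instance, mode))
        return "queued"
```

In this sketch a requestor never needs to know who holds the block; it only needs to find the master, which keeps the holder list, the request queue and the PI holders in one place.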
} Resource Affinity and Dynamic Remastering
} Each block is mastered by exactly one instance at any given point in time.
} The resource master can change based on how frequently other instances request the block.
– If an instance requests a particular resource 50 times within a period of 10 minutes, the requesting instance becomes the master. This is called resource affinity (block mastering).
} In Oracle 9.2
– The documentation describes dynamic remastering.
– Not implemented in code.
} In Oracle 10.1
– Works at data file level.
– Very high threshold, so difficult to test.
– Does occur on some customer sites.
– May cause the LMON process to crash in 10.1.0.4.
  } Bug 3659289, patch available.
  } Fixed in 10.1.0.5/10.2.0.1.
} In Oracle 10.2
– Works at object level.
– Thresholds are relatively low.
– Object remastering is recorded in V$GCSPFMASTER_INFO.

} Cache Fusion – Possible Types of Contention
} Contention for a resource occurs when two or more instances want the same resource. If a resource such as a data block is in use by one instance and is needed by another instance at the same time, contention occurs. There are three types of contention for data blocks:
} Read/Read contention
Read/read contention is never a problem because of the shared disk system. A block read by one instance can be read by other instances without the intervention of GCS.
} Write/Read contention
Write/read contention was addressed in Oracle 8i by the consistent read server. The holding instance constructs the CR block and ships it to the requesting instance over the interconnect.
} Write/Write contention
Write/write contention is addressed by the Cache Fusion technology. Since Oracle 9i, the cluster interconnect is used in some cases to ship data blocks among the instances that need to modify the same data block simultaneously.
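The three contention types can be summarized as a small lookup. This is a minimal sketch for orientation only: the function and table names are hypothetical, and folding a read-holder/write-requestor pair into "write/read" is a simplifying assumption, since the slides list only three types.

```python
def contention_type(holder_access, requester_access):
    """Classify contention between a holding and a requesting instance.

    Each argument is "read" or "write".
    """
    for access in (holder_access, requester_access):
        if access not in ("read", "write"):
            raise ValueError(f"unknown access: {access!r}")
    if holder_access == "read" and requester_access == "read":
        return "read/read"
    if holder_access == "write" and requester_access == "write":
        return "write/write"
    # Assumption: any mixed pair is reported as write/read.
    return "write/read"


# How each type is handled, paraphrasing the slide text.
HANDLING = {
    "read/read": "no problem: shared disk, no GCS intervention needed",
    "write/read": "holder builds a CR block and ships it over the interconnect",
    "write/write": "current block is shipped over the interconnect (Cache Fusion)",
}


def how_resolved(holder_access, requester_access):
    """Return the handling note for the given pair of access types."""
    return HANDLING[contention_type(holder_access, requester_access)]
```

For example, `how_resolved("write", "read")` returns the consistent read server description.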
} Prior to Cache Fusion (before 8.1.5)
Figure: Write/read contention before Cache Fusion

} Cache Fusion – I, aka Consistent Read Server
Figure: Write/read contention, CR block transfer in Cache Fusion
Oracle introduced a background process called BSP (Block Server Process), which performs the CR fabrication in the holder's cache and ships the CR version of the block across the interconnect.

} Still need to address Write/Write Contention
Figure: Write/write contention before Cache Fusion – II (before 9i)

} So now – Cache Fusion – II, or Write/Write Cache Fusion
Figure: Cache Fusion current block transfer (from 9i R2)

} Buffer States in Cache Fusion

Mode/Role      Local   Global
Null: N        NL      NG
Shared: S      SL      SG
Exclusive: X   XL      XG

– SL: When an instance has a resource in SL form, it can serve a copy of the block to other instances and it can read the block from disk. Since the block is not modified, there is no need to write it to disk.
– XL: When an instance has a resource in XL form, it has sole ownership of and interest in that resource. It also has the exclusive right to modify the block. All changes to the block are in its local buffer cache, and it can write the block to disk. If another instance wants the block, it contacts the holding instance via GCS.
– NL: The NL form is used to protect consistent read blocks. If a block is held in SL mode and another instance wants it in X mode, the current instance sends the block to the requesting instance and downgrades its own role to NL.
– SG: In SG form, a block is present in one or more instances. An instance can read the block from disk and serve it to other instances.
– XG: In XG form, a block can have one or more PIs, indicating multiple copies of the block in several instances' buffer caches. The instance with the XG role has the latest copy of the block and is the most likely candidate to write the block to disk. GCS can ask the instance with the XG role to write the block to disk or to serve it to another instance.
– NG: After discarding PIs when instructed by GCS, the block is kept in the buffer cache with the NG role. It serves only as the CR copy of the block.

} Example 1: Reading a Block from Disk
} Example 2: Reading a Block from the Cache
} Example 3: Getting a (Cached) Clean Block for Update
} Example 4: Getting a (Cached) Modified Block for Update and Commit
} Example 5: Commit the Previously Modified Block and Select the Data
} Example 6: Write the Dirty Buffers to Disk Due to Checkpoint
} Example 7: Master Instance Crash

} Example 7: What the alert log says about reconfiguration

    List of nodes: 012
    Global Resource Directory frozen
    * dead instance detected - domain 0 invalid = TRUE
    Communication channels reestablished
    * domain 0 valid = 0 according to instance 0
    Wed Jun 21 23:22:22 2006
    Master broadcasted resource hash value bitmaps
    Non-local Process blocks cleaned out
    Wed Jun 21 23:22:22 2006
    LMS 0: 0 GCS shadows cancelled, 0 closed
    Wed Jun 21 23:22:22 2006
    LMS 2: 0 GCS shadows cancelled, 0 closed
    Wed Jun 21 23:22:22 2006
    LMS 3: 0 GCS shadows cancelled, 0 closed
    Wed Jun 21 23:22:22 2006
    LMS 1: 0 GCS shadows cancelled, 0 closed
    Set master node info
    Submitted all remote-enqueue requests
    Dwn-cvts replayed, VALBLKs dubious
    All grantable enqueues granted
    Wed Jun 21 23:22:22 2006
    LMS 0: 2189 GCS shadows traversed, 332 replayed
    Wed Jun 21 23:22:22 2006
    LMS 2: 2027 GCS shadows traversed, 364 replayed
    Wed Jun 21 23:22:22 2006
    LMS 3: 2098 GCS shadows traversed, 364 replayed
    Wed Jun 21 23:22:22 2006
    LMS 1: 2189 GCS shadows traversed, 343 replayed
    Wed Jun 21 23:22:22 2006
    Submitted all GCS remote-cache requests
    Fix write in gcs resources
    Reconfiguration complete

} Crash Recovery – Key Components
– Redo Threads and Streams
– Redo Records and Change Vectors
– Checkpoints
  } Thread Checkpoint or Local Checkpoint
  } Database Checkpoint or Global Checkpoint
  } Incremental Checkpoint
– Bounded Recovery
– Block Written Record (BWR)
– Past Image (PI)
– Checkpoints and PI
– I Pass Recovery
– II Pass Recovery
– Merge Threads

} Cache Fusion – Instance Crash Recovery
The steps for GRD reconfiguration are as follows:
– Instance death is detected by the cluster manager.
– Requests for PCM locks are frozen.
– Enqueues are reconfigured and made available.
– DLM recovery.
– GCS (PCM lock) is remastered.
– Pending writes and notifications are processed.
The steps for I Pass recovery are as follows:
– The instance recovery (IR) lock is acquired by SMON.
– The recovery set is prepared and built. Memory space is allocated in the SMON Program Global Area (PGA).
– SMON acquires locks on buffers that need recovery.
The steps for II Pass recovery are as follows:
– II Pass is initiated. The database is partially available.
– Blocks are made available as they are recovered.
– The IR lock is released by SMON. Recovery is complete.
– The system is available.

} Example 8: Select the Rows from Instance A

} Just for a clear understanding
} It's time to play

} Cross Instance Consistent Read
Figure: Session 15 on Instance 1 queries a block that Session 27 on Instance 2 is updating; LMS0 builds a read consistent version of block 42 from undo block 800777 (transaction xid 0005.018.4E7, undo records 800777.530.12 to 800777.530.14). In the figure the ENG row's runs column takes the values 340, 344, 350 and 352; the AUS row (99, 10) is unchanged.
– Session 15 (Instance 1): SELECT runs, wickets FROM score WHERE team = 'ENG';
– Session 27 (Instance 2): UPDATE score SET runs = runs + n WHERE team = 'ENG'; run three times, with the increments 6, 4 and 2 overlaid on the slide.

} Committed Block – Block on Disk
Figure: Session 15, LMS0 and Session 27 with block 42 (ENG, AUS 99) and its undo block (ENG 199, 200, 204); the current value is ENG 205.
– Session 15 (Instance 1): SELECT runs FROM score WHERE team = 'ENG';
– Session 27 (Instance 2):
    UPDATE score SET runs = 200 WHERE team = 'ENG';
    UPDATE score SET runs = 204 WHERE team = 'ENG';
    UPDATE score SET runs = 205 WHERE team = 'ENG';
    COMMIT;

} Committed Block – Block in Buffer Cache
Figure: same layout as the previous slide, with the committed block held in the buffer cache.
– Session 15 (Instance 1): SELECT runs FROM score WHERE team = 'ENG';
– Session 27 (Instance 2):
    UPDATE score SET runs = 200 WHERE team = 'ENG';
    UPDATE score SET runs = 204 WHERE team = 'ENG';
    UPDATE score SET runs = 205 WHERE team = 'ENG';
    COMMIT;

} Uncommitted Block – Block in Buffer Cache
Figure: block 42, a copy of block 42, and the undo block (ENG 199, 200, 204); the copy is rolled back to ENG 199 for the reader.
– Session 15 (Instance 1): SELECT runs FROM score WHERE team = 'ENG';
– Session 27 (Instance 2):
    UPDATE score SET runs = 200 WHERE team = 'ENG';
    UPDATE score SET runs = 204 WHERE team = 'ENG';
    UPDATE score SET runs = 205 WHERE team = 'ENG';

} Uncommitted Block – On Disk
Figure: block 42 (ENG, AUS 99) and its undo block (ENG 199, 200, 204), with the uncommitted block already written to disk.
– Session 15 (Instance 1): SELECT runs FROM score WHERE team = 'ENG';
– Session 27 (Instance 2):
    UPDATE score SET runs = 200 WHERE team = 'ENG';
    UPDATE score SET runs = 204 WHERE team = 'ENG';
    UPDATE score SET runs = 205 WHERE team = 'ENG';

} Q & A

} References
– Oracle 10g Real Application Clusters Handbook, K Gopalakrishnan
– Julian Dyke, RAC Presentation
– Oracle 10g RAC Administrator's Guide