A Closer Look inside Oracle ASM
UKOUG Conference 2007
Luca Canali, CERN IT
CERN - IT Department, CH-1211 Genève 23, Switzerland, www.cern.ch/it

Outline
• Oracle ASM for DBAs
  – Introduction and motivations
• ASM is not a black box
  – Investigation of ASM internals
  – Focus on practical methods and troubleshooting
• ASM and VLDB
  – Metadata, rebalancing and performance
• Lessons learned from CERN's production DB services

ASM
• Oracle Automatic Storage Management
  – Provides the functionality of a volume manager and filesystem for Oracle (DB) files
  – Works with RAC
• Oracle 10g feature aimed at simplifying storage management
  – Together with Oracle Managed Files and the Flash Recovery Area
• An implementation of the S.A.M.E. methodology
  – Goal of increasing performance and reducing cost

ASM for a Clustered Architecture
• Oracle architecture of redundant low-cost components: servers, SAN, storage

ASM Disk Groups
• Example: HW = 4 disk arrays with 8 disks each
• An ASM diskgroup is created using all available disks
  – The end result is similar to a file system on RAID 1+0
  – ASM allows mirroring across storage arrays
  – Oracle RDBMS processes directly access the storage
  – Raw disk access
[Diagram: ASM diskgroup striped within each of two failgroups, mirrored across the failgroups]

Files, Extents, and Failure Groups
[Diagrams: files and extent pointers; failgroups and ASM mirroring]

ASM Is not a Black Box
• ASM is implemented as an Oracle instance
  – Familiar operations for the DBA
  – Configured with SQL commands
  – Info in V$ views
  – Logs in udump and bdump
  – Some 'secret' details hidden in X$ tables and 'underscore' parameters

Selected V$ Views and X$ Tables

View Name          X$ Table            Description
V$ASM_DISKGROUP    X$KFGRP             performs disk discovery and lists diskgroups
V$ASM_DISK         X$KFDSK, X$KFKID    performs disk discovery, lists disks and their usage metrics
V$ASM_FILE         X$KFFIL             lists ASM files, including metadata
V$ASM_ALIAS        X$KFALS             lists ASM aliases, files and directories
V$ASM_TEMPLATE     X$KFTMTA            ASM templates and their properties
V$ASM_CLIENT       X$KFNCL             lists DB instances connected to ASM
V$ASM_OPERATION    X$KFGMG             lists current rebalancing operations
N.A.               X$KFKLIB            available libraries, includes asmlib
N.A.               X$KFDPARTNER        lists disk-to-partner relationships
N.A.               X$KFFXP             extent map table for all ASM files
N.A.               X$KFDAT             allocation table for all ASM disks

ASM Parameters
• Notable ASM instance parameters:

  *.asm_diskgroups='TEST1_DATADG1','TEST1_RECODG1'
  *.asm_diskstring='/dev/mpath/itstor*p*'
  *.asm_power_limit=5
  *.shared_pool_size=70M
  *.db_cache_size=50M
  *.large_pool_size=50M
  *.processes=100

More ASM Parameters
• Underscore parameters
  – Several undocumented parameters
  – Typically don't need tuning
  – Exception: _asm_ausize and _asm_stripesize may need tuning for VLDB in 10g
• New in 11g: diskgroup attributes
  – V$ASM_ATTRIBUTE; most notable: disk_repair_time and au_size
  – X$KFENV shows 'underscore' attributes

ASM Storage Internals
• ASM disks are divided into Allocation Units (AU)
  – Default size 1 MB (_asm_ausize)
  – Tunable diskgroup attribute in 11g
• ASM files are built as a series of extents
  – Extents are mapped to AUs using a file extent map
  – When using 'normal redundancy', 2 mirrored extents are allocated, each on a different failgroup
  – RDBMS read operations access only the primary extent of a mirrored pair (unless there is an IO error)
  – In 10g the ASM extent size = AU size

ASM Metadata Walkthrough
• Three examples follow of how to read data directly from ASM
• Motivations:
  – Build confidence in the technology, i.e. 'get a feeling' of how ASM works
  – It may turn out useful one day to troubleshoot a production issue

Example 1: Direct File Access 1/2
Goal: reading ASM files with OS tools, using metadata information from X$ tables
1. Example: find the 2 mirrored extents of the RDBMS spfile
2. Query the extent map:

  sys@+ASM1> select GROUP_KFFXP Group#, DISK_KFFXP Disk#,
                    AU_KFFXP AU#, XNUM_KFFXP Extent#
             from X$KFFXP
             where number_kffxp =
               (select file_number from v$asm_alias
                where name='spfiletest1.ora');

      GROUP#      DISK#        AU#    EXTENT#
  ---------- ---------- ---------- ----------
           1         16      17528          0
           1          4      14838          0

Example 1: Direct File Access 2/2
3. Find the disk path:

  sys@+ASM1> select disk_number, path from v$asm_disk
             where group_number=1 and disk_number in (16,4);

  DISK_NUMBER PATH
  ----------- ------------------------------------
            4 /dev/mpath/itstor417_1p1
           16 /dev/mpath/itstor419_6p1

4. Read data from disk using 'dd':

  dd if=/dev/mpath/itstor419_6p1 bs=1024k count=1 skip=17528 | strings

X$KFFXP

Column Name       Description
NUMBER_KFFXP      ASM file number. Join with v$asm_file and v$asm_alias
COMPOUND_KFFXP    File identifier. Join with compound_index in v$asm_file
INCARN_KFFXP      File incarnation id.
Join with incarnation in v$asm_file
XNUM_KFFXP        ASM file extent number (mirrored extent pairs have the same extent value)
PXN_KFFXP         Progressive file extent number
GROUP_KFFXP       ASM disk group number. Join with v$asm_disk and v$asm_diskgroup
DISK_KFFXP        ASM disk number. Join with v$asm_disk
AU_KFFXP          Relative position of the allocation unit from the beginning of the disk
LXN_KFFXP         0 -> primary extent, 1 -> mirror extent, 2 -> 2nd mirror copy (high redundancy and metadata)

Example 2: A Different Way
A different metadata table to reach the same goal of reading ASM files directly from the OS:

  sys@+ASM1> select GROUP_KFDAT Group#, NUMBER_KFDAT Disk#,
                    AUNUM_KFDAT AU#
             from X$KFDAT
             where fnum_kfdat =
               (select file_number from v$asm_alias
                where name='spfiletest1.ora');

      GROUP#      DISK#        AU#
  ---------- ---------- ----------
           1          4      14838
           1         16      17528

X$KFDAT

Column Name (subset)   Description
GROUP_KFDAT            Diskgroup number, join with v$asm_diskgroup
NUMBER_KFDAT           Disk number, join with v$asm_disk
COMPOUND_KFDAT         Disk compound_index, join with v$asm_disk
AUNUM_KFDAT            Disk allocation unit (relative position from the beginning of the disk), join with x$kffxp.au_kffxp
V_KFDAT                Flag: V = this allocation unit is used; F = AU is free
FNUM_KFDAT             File number, join with v$asm_file
XNUM_KFDAT             Progressive file extent number, join with x$kffxp.pxn_kffxp

Example 3: Yet Another Way
Using the internal package dbms_diskgroup:

  declare
    fileType varchar2(50);
    fileName varchar2(50);
    fileSz   number;
    blkSz    number;
    hdl      number;
    plkSz    number;
    data_buf raw(4096);
  begin
    fileName := '+TEST1_DATADG1/TEST1/spfiletest1.ora';
    dbms_diskgroup.getfileattr(fileName, fileType, fileSz, blkSz);
    dbms_diskgroup.open(fileName, 'r', fileType, blkSz, hdl, plkSz,
fileSz);
    dbms_diskgroup.read(hdl, 1, blkSz, data_buf);
    dbms_output.put_line(data_buf);
    dbms_diskgroup.close(hdl);  -- release the file handle (see procedure list below)
  end;
  /

DBMS_DISKGROUP
• Can be used to read/write ASM files directly
  – It's an Oracle internal package
  – Does not require a RDBMS instance
  – 11g's asmcmd cp command uses dbms_diskgroup

Procedure Name               Parameters
dbms_diskgroup.open          (:fileName, :openMode, :fileType, :blkSz, :hdl, :plkSz, :fileSz)
dbms_diskgroup.read          (:hdl, :offset, :blkSz, :data_buf)
dbms_diskgroup.createfile    (:fileName, :fileType, :blkSz, :fileSz, :hdl, :plkSz, :fileGenName)
dbms_diskgroup.close         (:hdl)
dbms_diskgroup.commitfile    (:handle)
dbms_diskgroup.resizefile    (:handle, :fsz)
dbms_diskgroup.remap         (:gnum, :fnum, :virt_extent_num)
dbms_diskgroup.getfileattr   (:fileName, :fileType, :fileSz, :blkSz)

File Transfer Between OS and ASM
• The supported tools (10g):
  – RMAN
  – DBMS_FILE_TRANSFER
  – FTP (XDB)
  – WebDAV (XDB)
  – They all require a RDBMS instance
• In 11g, all of the above plus asmcmd
  – cp command
  – Works directly with the ASM instance

Strace and ASM 1/3
Goal: understand strace output when using ASM storage
• Example:

  read64(15, "#33\0@\"..., 8192, 473128960) = 8192

• This is a read operation of 8 KB from file descriptor 15 at offset 473128960
• What is the segment name, type, file# and block#?

Strace and ASM 2/3
1. From /proc/<pid>/fd I find that FD 15 is /dev/mpath/itstor420_1p1
2. This is disk 20 of diskgroup 1 (from v$asm_disk)
3.
From x$kffxp I find the ASM file# and extent#:
  • Note: offset 473128960 = 451 MB + 27 * 8 KB

  sys@+ASM1> select number_kffxp, xnum_kffxp from x$kffxp
             where group_kffxp=1 and disk_kffxp=20 and au_kffxp=451;

  NUMBER_KFFXP XNUM_KFFXP
  ------------ ----------
           268         17

Strace and ASM 3/3
4. From v$asm_alias I find the file alias for file 268: USERS.268.612033477
5. From the v$datafile view I find the RDBMS file#: 9
6. From dba_extents I finally find the owner and segment name relative to the original IO operation (extent 17 of 1 MB = 128 blocks of 8 KB each, plus block 27 within the AU):

  sys@TEST1> select owner, segment_name, segment_type
             from dba_extents
             where file_id=9
             and 27 + 17*1024*1024/8192 between block_id and block_id+blocks;

  OWNER SEGMENT_NAME SEGMENT_TYPE
  ----- ------------ ------------
  SCOTT EMP          TABLE

Investigation of Fine Striping
• An application: finding the layout of fine-striped files
  – Explored using strace of an Oracle session executing 'alter system dump logfile ..'
  – Result: round-robin distribution over 8 x 1 MB extents
  – Fine striping size = 128 KB (1 MB / 8), AU = 1 MB
[Figure: 128 KB stripes A0..A7, B0..B7, ... laid out round-robin across 8 extents of 1 MB each]

Metadata Files
• ASM diskgroups contain 'hidden files'
  – Not listed in V$ASM_FILE (file# < 256)
  – Details are available in X$KFFIL
  – In addition, the first 2 AUs of each disk are marked as file#=0 in X$KFDAT
• Example (10g):

      GROUP#      FILE# FILESIZE_AFTER_MIRR RAW_FILE_SIZE
  ---------- ---------- ------------------- -------------
           1          1             2097152       6291456
           1          2             1048576       3145728
           1          3           264241152     795869184
           1          4             1392640       6291456
           1          5             1048576       3145728
           1          6             1048576       3145728

ASM Metadata 1/2
• File#0,
AU=0: disk header (disk name, etc.), Allocation Table (AT) and Free Space Table (FST)
• File#0, AU=1: Partner Status Table (PST)
• File#1: File Directory (files and their extent pointers)
• File#2: Disk Directory
• File#3: Active Change Directory (ACD)
  – The ACD is analogous to a redo log, where changes to the metadata are logged
  – Size = 42 MB * number of instances
Source: Oracle Automatic Storage Management, Oracle Press, Nov 2007; N. Vengurlekar, M. Vallath, R. Long

ASM Metadata 2/2
• File#4: Continuing Operations Directory (COD)
  – The COD is analogous to an undo tablespace. It maintains the state of active ASM operations such as disk or datafile drop/add. The COD log record is either committed or rolled back based on the success of the operation.
• File#5: Template directory
• File#6: Alias directory
• 11g, File#9: Attribute directory
• 11g, File#12: Staleness registry, created when needed to track offline disks

ASM Rebalancing
• Rebalancing is performed (and mandatory) after space management operations
  – Goal: balanced space allocation across disks
  – Not based on performance or utilization
  – ASM spreads every file across all disks in a diskgroup
• ASM instances are in charge of rebalancing
  – Extent pointer changes are communicated to the RDBMS
    • The RDBMS' ASMB process keeps an open connection to ASM
    • This can be observed by running strace against ASMB
  – In RAC, extra messages are passed between the cluster ASM instances
    • LMD0 of the ASM instances is very active during rebalance

ASM Rebalancing and VLDB
• Performance of rebalancing is important for VLDB
• An ASM instance can use parallel slaves
  – RBAL coordinates the rebalancing operations
  – ARBx processes
pick up 'chunks' of work; by default they log their activity in udump
• Does it scale?
  – In 10g serialization wait events can limit scalability
  – Even at maximum speed rebalancing is not always I/O bound

ASM Rebalancing Performance
• Tracing ASM rebalancing operations
  – 10046 trace of the arbx processes:
    • oradebug setospid …
    • oradebug event 10046 trace name context forever, level 12
  – Process log files (in bdump) with orasrp (tkprof will not work)
• Main wait events from my tests with RAC (6 nodes):
  – DFS lock handle: waiting for CI level 5 (cross-instance lock)
  – buffer busy wait
  – 'unaccounted for'
  – enq: AD - allocate AU
  – enq: AD - deallocate AU
  – log write(even), log write(odd)

ASM Single-Instance Rebalancing
• Single-instance rebalance
  – Faster in RAC if you can rebalance with only one node up (I have observed 20% to 100% speed improvement)
  – Buffer busy wait can be the main event
    • It seems to depend on the number of files in the diskgroup
    • Diskgroups with a small number of (large) files have more contention (arbx processes operate concurrently on the same file)
    • Only seen in tests with 10g
  – 11g has improvements regarding rebalancing contention

Rebalancing, an Example
[Figure: ASM rebalancing performance in RAC, rebalance rate (MB/min, up to ~7000) vs. diskgroup rebalance parallelism (0 to 12), Oracle 11g vs. Oracle 10g. Data: D. Wojcik, CERN IT]

Rebalancing Workload
• When ASM mirroring is used (e.g.
with normal redundancy), rebalancing operations can move more data than expected
• Example:
  – 5 TB allocated: ~100 disks, 200 GB each
  – A disk is replaced (diskgroup rebalance)
  – The total IO workload is 1.6 TB (8x the disk size!)
• How to see this: query v$asm_operation; the EST_WORK column keeps growing during rebalance
• The issue: excessive repartnering

ASM Disk Partners
• ASM diskgroup with normal redundancy
  – Two copies of each extent are written to different 'failgroups'
• Two ASM disks are partners when they have at least one extent set in common (they are the 2 sides of a mirror for some data)
• Each ASM disk has a limited number of partners
  – Typically 10 disk partners: see X$KFDPARTNER
  – Helps to reduce the risk associated with 2 simultaneous disk failures

Free and Usable Space
• When 'ASM mirroring' is used, not all the free space should be occupied
• V$ASM_DISKGROUP.USABLE_FILE_MB:
  – Amount of free space that can be safely utilized taking mirroring into account, while still being able to restore redundancy after a disk failure
  – It is calculated for the worst-case scenario; it is a best practice not to let it go negative (it can)
  – This can be a problem when deploying a small number of large LUNs and/or failgroups

Fast Mirror Resync
• ASM 10g with normal redundancy does not allow offlining part of the storage
  – A transient error in a storage array can cause several hours of rebalancing to drop and add disks
  – It is a limiting factor for scheduled maintenance
• 11g has the new feature 'fast mirror resync'
  – Redundant storage can be put offline for maintenance
  – Changes are accumulated in the staleness registry (file#12)
  – Changes are
applied when the storage is back online

Read Performance, Random I/O
• IOPS measured with SQL (synthetic test)
• Small random I/O (8 KB block, 32 GB probe table), 64 SATA HDs (4 arrays, 1 instance)
• 8675 IOPS at peak, i.e. ~130 IOPS per disk
• Destroking: only the external part of the disks is used
[Figure: small-read IOPS vs. workload (number of Oracle sessions, 2 to 350), saturating around 8675 IOPS]

Read Performance, Sequential I/O
• Sequential I/O throughput, 80 GB probe table, 64 SATA HDs (4 arrays and 4 RAC nodes)
• Measured with parallel query
• Throughput limited by the HBAs -> 4 x 2 Gb
[Figure: throughput (MB/sec, up to ~700) vs. number of parallel query slaves (1 to 40)]

Implementation Details
• Multipathing
  – Linux Device Mapper (2.6 kernel)
• Block devices
  – RHEL4 and 10gR2 allow skipping the raw devices mapping
  – External half of the disk used for data disk groups
• JBOD config
  – No HW RAID
  – ASM used to mirror across disk arrays
• HW:
  – Storage arrays (Infortrend): FC controller, SATA disks
  – FC (Qlogic): 4 Gb switch and HBAs (2 Gb in older HW)
  – Servers are 2x CPUs, 4 GB RAM, 10.2.0.3 on RHEL4, RAC of 4 to 8 nodes

Conclusions
• CERN deploys RAC and ASM on Linux on commodity HW
  – 2.5 years of production, 110 Oracle 10g RAC nodes and 300 TB of raw disk space (Dec 2007)
• ASM metadata
  – Most critical part, especially rebalancing
  – Knowledge of some ASM internals helps troubleshooting
• ASM on VLDB
  – Know and work around pitfalls in 10g
  – 11g has important
manageability and performance improvements

Q&A
• Links:
  – http://cern.ch/phydb
  – http://twiki.cern.ch/twiki/bin/view/PSSGroup/ASM_Internals
  – http://www.cern.ch/canali
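The offset arithmetic behind the strace walkthrough (offset 473128960 = 451 MB + 27 * 8 KB, with x$kffxp giving file extent 17) can be checked with a short script. This is an editor's sketch, not part of the original talk: the function name and structure are illustrative, and it assumes the 10g defaults used throughout the deck (1 MB allocation units, 8 KB DB blocks, ASM extent size = AU size).

```python
# Sketch: map a byte offset seen in an strace read() on a raw ASM disk
# back to (AU# on disk, block within the AU, RDBMS block_id).
# Assumes 10g defaults as in the talk: 1 MB AUs, 8 KB DB blocks.

AU_SIZE = 1024 * 1024   # _asm_ausize default (1 MB)
DB_BLOCK = 8192         # RDBMS block size (8 KB)

def locate(offset, file_extent):
    """offset      -- byte offset on the device, from strace
       file_extent -- XNUM_KFFXP for that AU, looked up in x$kffxp"""
    au = offset // AU_SIZE                        # AU_KFFXP to query in x$kffxp
    block_in_au = (offset % AU_SIZE) // DB_BLOCK  # block within the 1 MB AU
    # each 1 MB extent holds 128 blocks of 8 KB, so the file-relative
    # block_id for the dba_extents predicate is:
    block_id = file_extent * (AU_SIZE // DB_BLOCK) + block_in_au
    return au, block_in_au, block_id

# The values from the strace example in the talk:
print(locate(473128960, 17))  # -> (451, 27, 2203)
```

The resulting AU# (451) is what the slide queries in x$kffxp, and the block_id (27 + 17*128 = 2203) is the value the dba_extents lookup resolves to SCOTT.EMP.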