RAC on Linux Best Practices (Configuring Linux for RAC)

This paper explains how to configure Linux for a RAC environment and covers the relevant operating system enhancements.

RAC on Linux Overview

With its Unbreakable campaign for Linux (Red Hat AS and United Linux), Oracle provides a solid solution for the Linux platform. To maximize performance and availability in a Linux environment, some steps are required within Oracle and at the operating system level. This paper describes those steps.

· Using Asynchronous I/O

Asynchronous I/O can be used on RAWIO, EXT2, EXT3, NFS and REISERFS filesystems. At present, asynchronous I/O is not available for the Oracle Clustered File System (OCFS), because the Linux kernel does not expose the required API. We hope the API will be exposed in the Red Hat 3.0 release, so that OCFS can support asynchronous I/O.

To enable asynchronous I/O for Oracle9iR2 on Red Hat Linux Advanced Server 2.1 and United Linux 1.0, you must re-link Oracle with the asynchronous I/O library, libaio, as follows:

    cd $ORACLE_HOME/rdbms/lib
    make -f ins_rdbms.mk async_on
    make -f ins_rdbms.mk ioracle

Two parameters in the init.ora need to be set; add the following lines to the appropriate init.ora file.

Parameter settings in init.ora for raw devices:

    disk_asynch_io=true            (default value is true)

Parameter settings in the init.ora file or spfile.ora for filesystem files (make sure that all Oracle datafiles reside on filesystems that support asynchronous I/O, for example ext2 or ext3):

    disk_asynch_io=true            (default value is true)
    filesystemio_options=asynch

To get better I/O throughput:

For DSS workloads, /proc/sys/fs/aio-max-size has to be increased from the default of 131072 bytes to at least 1 MB. For OLTP workloads, the default of 131072 bytes suffices, because these workloads perform very small writes.
You can set this value by executing the following command as the root user:

    echo 1048576 > /proc/sys/fs/aio-max-size

On United Linux, aio-max-size cannot be larger than 524288 because of an OS limit; if this limit is exceeded, the LGWR process (the Oracle redo log writer) will crash.

· Configuring Linux for a Large Buffer Cache

Oracle9i can allocate and use more than 4 GB of memory for the database buffer cache. To use the extended buffer cache support on Linux, create an in-memory file system on the /dev/shm mount point that is at least as large as the amount of memory you intend to use for the database buffer cache:

    mount -t shm shmfs -o size=8g /dev/shm

To enable the extended buffer cache feature, set the following parameter in the init.ora:

    USE_INDIRECT_DATA_BUFFERS = true

Dynamic Cache Parameters

The following dynamic cache parameters cannot be used while the extended cache feature is enabled:

    DB_CACHE_SIZE
    DB_2K_CACHE_SIZE
    DB_4K_CACHE_SIZE
    DB_8K_CACHE_SIZE
    DB_16K_CACHE_SIZE
    DB_32K_CACHE_SIZE

If the extended cache feature is enabled, you must use the DB_BLOCK_BUFFERS parameter to specify the database cache size.

Limitations

The following limitations apply to the extended buffer cache feature on Linux:

- You cannot change the size of the buffer cache while the instance is running.
- You cannot create or use tablespaces with non-standard block sizes.

· Increasing Address Space

The currently shipped version of Oracle can use about 1.7 GB of address space for its SGA. To increase this size, Oracle needs to be re-linked with a lower SGA base, and Linux needs to have the mapped base lowered for processes running Oracle.
This solution is available on RH 2.1 and UnitedLinux 1.0.

First, shut down all instances of Oracle. Then lower the SGA base address by relinking Oracle as follows:

    cd $ORACLE_HOME/lib
    cp -a libserver9.a libserver9.a.org     (to make a backup copy)
    cd $ORACLE_HOME/rdbms/lib
    genksms -s 0x15000000 >ksms.s           (lower SGA base to 0x15000000)
    make -f ins_rdbms.mk ksms.o             (compile in new SGA base address)
    make -f ins_rdbms.mk ioracle            (relink)

After relinking, the Oracle executable can only run on Linux when the mapped_base is lowered. The shmmax kernel parameter has to be set to 3 GB:

    sysctl -w kernel.shmmax=3000000000

(See Metalink Note 200266.1 for details and a sample program.)

· Using UDP

Efficient interprocess communication is important for RAC, because cache fusion uses it to transfer messages and data buffers among the instances. In Oracle version 9.2.0.1 and onwards, UDP is the default protocol on Linux, so we recommend tuning the UDP send and receive space. We strongly recommend adjusting the send and receive buffer sizes to 256K. Tune the default and maximum window sizes with sysctl:

    sysctl -w net.core.rmem_max=262144
    sysctl -w net.core.wmem_max=262144
    sysctl -w net.core.rmem_default=262144
    sysctl -w net.core.wmem_default=262144

To check the values, read the following files:

    /proc/sys/net/core/rmem_default    - default receive window
    /proc/sys/net/core/rmem_max        - maximum receive window
    /proc/sys/net/core/wmem_default    - default send window
    /proc/sys/net/core/wmem_max        - maximum send window

Appendix A contains a small C program which can be used to determine the current send and receive buffer sizes.

· Logical Volume Manager (LVM)

Suse's United Linux 1.0 offers an LVM that can be used to format and configure the disk layout for use with RAW or OCFS. For complete documentation on how the LVM works and how to use it on Suse, please see the Suse web page:
    ftp://ftp.suse.com/pub/suse/i386/supplementary/commercial/Oracle/docs/lvm_whitepaper.pdf

There is still a problem with the LVM: it is not cluster aware. All volume creations and volume changes need to be performed on all nodes. LVM is not available on RH 3.0.

· Hangcheck-timer and Oracle Cluster Manager

Detaching watchdogd from the Cluster Manager (Bug 2495915)

The "watchdogd" daemon impacted system availability because it initiated system reboots under heavy workloads. For this reason, the watchdogd implementation has been removed from Oracle 9.2.0.2 and above. In place of watchdogd, the 9.2.0.2 and above versions of the oracm for Linux use a Linux kernel module called hangcheck-timer. This module is not required for oracm operation, but its use is highly recommended. The module monitors the Linux kernel for long operating system hangs that could affect the reliability of a RAC node and cause corruption of a RAC database. When such a hang occurs, the module reboots the node.

This approach offers the following three advantages over the watchdogd approach:

- Node resets are triggered from within the Linux kernel, making them much less affected by system load.
- The oracm on a RAC node can easily be stopped and reconfigured, because its operation is completely independent of the kernel module.
- The features provided by the hangcheck-timer module closely resemble features available in the implementation of the Cluster Manager for RAC on the Windows platform, on which the Cluster Manager on Linux was based.

Configuration Parameter Changes for the oracm on Linux

The deprecation and removal of the watchdogd in the 9.2.0.2 version (and above) of the Oracle Cluster Manager for Linux, together with the addition of the hangcheck-timer kernel module, requires several parameter changes in the configuration file. The file "$ORACLE_HOME/oracm/admin/cmcfg.ora" is used to configure the Oracle Cluster Manager.

1. The removal of the watchdogd means that the following parameters in the cmcfg.ora file used by the Oracle Cluster Manager are no longer valid:

       WatchdogTimerMargin
       WatchdogSafetyMargin

   These parameters should be removed from cmcfg.ora on all nodes in the cluster.

2. The hangcheck-timer introduces the following configuration parameter in cmcfg.ora, which tells the oracm the name of the hangcheck-timer kernel module so that it can determine whether the module is correctly loaded:

       KernelModuleName (see Appendix B)

   If the module named in KernelModuleName is correctly specified but not loaded, or is incorrectly specified, the oracm writes a series of error messages to the syslog system log (/var/log/messages). This does not, however, prevent the oracm process from running. The module must be loaded prior to oracm startup. It is strongly recommended that this parameter be set correctly on all nodes in the cluster.

3. Finally, the following configuration parameter changes from optional to mandatory:

       CMDiskFile (see Appendix B)

   This ensures that a CM quorum partition is used and allows the oracm to handle more reliably certain kinds of hardware and software errors that affect cluster participation.

4. The inclusion of the hangcheck-timer kernel module also introduces two new configuration parameters, which are supplied when the module is loaded:

   hangcheck_tick - an interval indicating how often the hangcheck-timer checks on the health of the system.

   hangcheck_margin - certain kernel activities may randomly introduce delays in the operation of the hangcheck-timer. hangcheck_margin provides a margin of error to prevent unnecessary system resets due to these delays.

   Taken together, these two parameters indicate how long a RAC node must hang before the hangcheck-timer module resets the system. A node reset occurs when the following is true:

       (system hang time) > (hangcheck_tick + hangcheck_margin)

Example of loading the hangcheck-timer
Put the following into the rc.local script on RH or the /etc/init.d/oracle script on UL:

    # load hangcheck-timer module for ORACM 9.2.0.2 (or higher)
    /sbin/insmod /lib/modules/2.4.19-4GB/kernel/drivers/char/hangcheck-timer.o hangcheck_tick=30 hangcheck_margin=180

Recommended Configuration Defaults

Oracle recommends that the hangcheck-timer module be loaded and the oracm be started with the following parameter values (in addition to recommendations made elsewhere in the Oracle RAC documentation):

    Parameter           Service            Value
    hangcheck_tick      hangcheck-timer    30 seconds
    hangcheck_margin    hangcheck-timer    180 seconds
    KernelModuleName    oracm              hangcheck-timer
    MissCount           oracm              > hangcheck_tick + hangcheck_margin (> 210 seconds)

· Linux Monitoring and Configuration Tools

    Overall tools                  sar, vmstat
    CPU                            /proc/cpuinfo, mpstat, top
    Memory                         /proc/meminfo, /proc/slabinfo, free
    Disk I/O                       iostat
    Network                        /proc/net/dev, netstat, mii-tool
    Kernel Version and Release     cat /proc/version
    Types of I/O Cards             lspci -vv
    Kernel Modules Loaded          lsmod, cat /proc/modules
    List all PCI devices (HW)      lspci -v
    Startup changes                /etc/sysctl.conf, /etc/rc.local
    Kernel messages                /var/log/messages, /var/log/dmesg
    OS error codes                 /usr/src/linux/include/asm/errno.h
    OS calls                       /usr/sbin/strace -p

Appendix A:

=========
cmcfg.ora
=========
HeartBeat=15000
ClusterName=Oracle Cluster Manager, version 9i
KernelModuleName=hangcheck-timer
PollInterval=1000
MissCount=250
PrivateNodeNames=mars-int venus-int
PublicNodeNames=mars venus
ServicePort=9998
CmDiskFile=/dev/quorum
HostName=venus-int

ocmargs.ora
==========
# Sample configuration file $ORACLE_HOME/oracm/admin/ocmargs.ora
oracm
norestart
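
Checking MissCount against the hangcheck-timer settings

The sample cmcfg.ora above sets MissCount=250, while the recommended defaults require MissCount to be greater than hangcheck_tick + hangcheck_margin. As a minimal sketch, the following shell fragment performs that comparison for a given set of values. The function names hangcheck_threshold and misscount_ok are hypothetical helpers introduced here for illustration; they are not part of oracm or of the hangcheck-timer module, and the script does not read cmcfg.ora or the loaded module's parameters.

```shell
#!/bin/sh
# Illustrative sketch only: the helper names below are assumptions,
# not commands shipped with Oracle or with the Linux kernel.

# Print the hang time (in seconds) after which the hangcheck-timer
# resets the node: hangcheck_tick + hangcheck_margin.
#   $1 = hangcheck_tick, $2 = hangcheck_margin
hangcheck_threshold() {
    echo $(( $1 + $2 ))
}

# Succeed only if MissCount is strictly greater than the reset threshold.
#   $1 = MissCount, $2 = hangcheck_tick, $3 = hangcheck_margin
misscount_ok() {
    [ "$1" -gt "$(hangcheck_threshold "$2" "$3")" ]
}

# Values taken from the recommended defaults and the sample cmcfg.ora.
tick=30
margin=180
misscount=250

echo "reset threshold: $(hangcheck_threshold "$tick" "$margin") seconds"
if misscount_ok "$misscount" "$tick" "$margin"; then
    echo "MissCount=$misscount is consistent with the hangcheck-timer settings"
else
    echo "MissCount=$misscount is too low for the hangcheck-timer settings"
fi
```

With the values used in this paper, the script reports a reset threshold of 210 seconds and accepts MissCount=250. A MissCount of exactly 210 would be rejected, because the recommendation is a strict inequality.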