Linux-HA Release 2 Hands-On Tutorial
Alan Robertson, Project Leader, Linux-HA project (alanr@unix.sh)
Dave Blaschke, IBM Linux Technology Center
LWCE – SF – August, 2005

Part I: Linux-HA Release 2 Overview
- What is High-Availability (HA) clustering?
- What can HA do for me?
- What is the Linux-HA project?
- Linux-HA applications
- Linux-HA customers
- Linux-HA release 1 capabilities
- Linux-HA release 2 capabilities
- Comparative architectures
- Release 2 details
- Thoughts about cluster security

What Is HA Clustering?
- Putting together a group of computers which trust each other to provide a service even when system components fail
- When one machine goes down, others take over its work
- This involves IP address takeover, service takeover, etc.
- New work comes to the "takeover" machine
- Not primarily designed for high performance

What Can HA Clustering Do For You?
- It cannot achieve 100% availability – nothing can
- HA clustering is designed to recover from single faults
- It can make your outages very short – from about a second to a few minutes
- It is like a magician's (illusionist's) trick: when it goes well, the hand is faster than the eye; when it goes not-so-well, it can be reasonably visible
- A good HA clustering system adds a "9" to your base availability: 99 -> 99.9, 99.9 -> 99.99, 99.99 -> 99.999, etc.
- Complexity is the enemy of reliability!

Lies, Damn Lies, and Statistics
Counting nines – downtime allowed per year (a quick arithmetic check follows the "Three R's" slide below):
  99.9999%   30 sec
  99.999%    5 min
  99.99%     52 min
  99.9%      9 hr
  99%        3.5 day

The Desire for HA Systems
- Who wants low-availability systems?
- Why are so few systems high-availability?

Why Isn't Everything HA?
- Cost
- Complexity

How Is HA Clustering Different from Disaster Recovery?
- HA: failover is cheap; failover times are measured in seconds; reliable inter-node communication
- DR: failover is expensive; failover times are often measured in hours; unreliable inter-node communication is assumed

How Does HA Work?
- Manage redundancy to improve service availability
- Like a cluster-wide super-init: even complex services can now be "respawned"
  - on node (computer) death
  - on "impairment" of nodes
  - on loss of connectivity
  - for services that aren't working (not necessarily stopped)
  - while managing very complex dependency relationships

Single Points of Failure (SPOFs)
- A single point of failure is a component whose failure will cause near-immediate failure of an entire system or service
- Good HA design adds redundancy and eliminates single points of failure

Non-Obvious SPOFs
- Replication links are rarely obvious single points of failure – the system may only fail when another failure happens
- Some disk controllers have SPOFs inside them which aren't obvious without schematics
- Redundant links buried in the same wire run share a common SPOF
- Non-obvious SPOFs can require deep expertise to spot

The "Three R's" of High-Availability
- Redundancy
- Redundancy
- Redundancy
- If this sounds redundant, that's probably appropriate...
- Most SPOFs are eliminated by redundancy
- HA clustering is a good way of providing and managing redundancy
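The "counting nines" numbers above follow directly from the fraction of a 525,600-minute (365-day) year a service may be down; the slide's figures are just rounded versions of this:

\[ \text{downtime per year} = (1 - A) \times 525{,}600 \ \text{min} \]
\[ A = 99.99\%:\ 10^{-4} \times 525{,}600 \approx 52.6 \ \text{min} \qquad A = 99.999\%:\ 10^{-5} \times 525{,}600 \approx 5.3 \ \text{min} \]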
Redundant Communications
- Intra-cluster communication is critical to HA system operation; most HA clustering systems provide mechanisms for redundant internal communication, for heartbeats, etc. (an ha.cf sketch follows "The Linux-HA Project" slide below)
- External communication is usually essential to provision of the service
- External communication redundancy is usually accomplished through routing tricks; having an expert in BGP or OSPF routing is a help

Fencing
- Fencing guarantees resource integrity in certain difficult cases (split-brain)
- Four common methods:
  - Fibre Channel switch lockouts
  - SCSI reserve/release (difficult to make reliable)
  - self-fencing (like IBM ServeRAID)
  - STONITH – Shoot The Other Node In The Head
- Linux-HA has native support for the last two

Redundant Data Access
- Replicated: copies of the data are kept updated on more than one computer in the cluster
- Shared: typically Fibre Channel disk (SAN), sometimes shared SCSI
- Back-end storage ("Somebody Else's Problem"): NFS, SMB, or a back-end database

Data Sharing – Replication
- Some applications provide their own replication: DNS, DHCP, LDAP, DB2, etc.
- Linux has excellent disk replication methods available; DRBD is my favorite
- DRBD-based HA clusters are shockingly cheap
- Some environments can live with less "precise" replication methods – rsync, etc.
- Generally does not support parallel access
- Fencing is usually required
- EXTREMELY cost-effective

Data Sharing – ServeRAID
- IBM ServeRAID disk is self-fencing
  - This helps integrity in failover environments
  - This makes cluster filesystems, etc. impossible – no Oracle RAC, no GPFS, etc.
- ServeRAID failover requires a script to perform volume handover; Linux-HA provides such a script in open source
- Linux-HA is ServerProven with ServeRAID

Data Sharing – Shared Disk
- The most classic data sharing mechanism – commonly Fibre Channel
- Allows for failover mode
- Allows for true parallel access: Oracle RAC, cluster filesystems, etc.
- Fencing is always required with shared disk
- In the hands-on portion, we will use non-Fibre-Channel shared virtual disks

Data Sharing – Back-End
- Network-attached storage can act as a data sharing method
- Existing back-end databases can also act as a data sharing mechanism
- Both make reliable and redundant data sharing Somebody Else's Problem (SEP) – if they did a good job, you can benefit from it
- Beware SPOFs in your local network

The Linux-HA Project
- Linux-HA is the oldest high-availability project for Linux, with the largest associated community
- The core piece of Linux-HA is called "heartbeat" (though it does much more than heartbeat)
- Linux-HA has been in production since 1999, and is currently in use at about ten thousand sites
- Linux-HA also runs on FreeBSD and Solaris, and is being ported to OpenBSD and others
- Linux-HA is shipped with every major Linux distribution except one
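To make the redundant-communications point concrete: heartbeat accepts several media directives in /etc/ha.d/ha.cf (the tutorial's actual ha.cf appears in Part II) and heartbeats over all of them in parallel. A minimal sketch, assuming a second NIC and a serial crossover cable that our lab machines do not actually have:

  # /etc/ha.d/ha.cf fragment - two independent heartbeat paths
  serial /dev/ttyS0   # null-modem crossover cable
  bcast  eth1         # dedicated back-to-back Ethernet link

If either path fails, heartbeats still flow over the other, so a single cable or NIC failure cannot by itself produce a split-brain.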
Linux-HA Release 1 Applications
- Database servers (DB2, Oracle, MySQL, others)
- Load balancers
- Web servers
- Custom applications
- Firewalls
- Retail point-of-sale solutions
- Authentication
- File servers
- Proxy servers
- Medical imaging
- Almost any type of server application you can think of – except SAP

Linux-HA Customers
- MAN Nutzfahrzeuge AG – the truck manufacturing division of MAN AG
- Karstadt and Circuit City use Linux-HA and databases, each in several hundred stores
- FedEx – truck location tracking
- BBC – Internet infrastructure
- Citysavings Bank in Munich (infrastructure)
- Bavarian radio station (Munich) – coverage of the 2002 Olympics in Salt Lake City
- The Weather Channel (weather.com)
- Sony (manufacturing)
- Emageon – medical imaging services
- Incredimail bases their mail service on Linux-HA on IBM hardware
- University of Toledo (US) – 20,000-student computer-aided instruction system
- ISO New England manages the power grid using 25 Linux-HA clusters

Linux-HA Release 1 Capabilities
- Supports 2-node clusters
- Can use serial, UDP bcast, mcast, and ucast communication
- Fails over on node failure
- Fails over on loss of IP connectivity
- Capability for failing over on loss of SAN connectivity
- Limited command-line administrative tools to fail over, query current status, etc.
- Active/active or active/passive
- Simple resource-group dependency model
- Requires an external tool for resource (service) monitoring
- SNMP monitoring

Linux-HA Release 2 Capabilities
- Built-in resource monitoring
- Support for the OCF resource standard
- Much larger clusters supported (>= 8 nodes)
- Sophisticated dependency model with rich constraint support (resources, groups, incarnations, master/slave) – needed for SAP
- XML-based resource configuration
- Coming in 2.0.1: configuration and monitoring GUI, support for the GFS cluster filesystem, multi-state (master/slave) resource support
- Initially: no IP or SAN monitoring

Linux-HA Release 1 Architecture
[architecture diagram]

Linux-HA Release 2 Architecture
[architecture diagram – the Release 1 components plus the TE and PE]

Resource Objects in Release 2
Release 2 supports "resource objects", which can be any of the following:
- Primitive resources: OCF, heartbeat-style, or LSB resource agent scripts
- Resource clones – need "n" of a resource object running somewhere
- Resource groups – a group of primitive resources with implied co-location and linear ordering constraints
- Multi-state resources (master/slave) – designed to model master/slave (replication) resources (DRBD, et al.)

Basic Dependencies in Release 2
- Ordering dependencies:
  - start before (normally implies stop after)
  - start after (normally implies stop before)
- Mandatory co-location dependencies:
  - must be co-located with
  - cannot be co-located with
(a CIB sketch of these constraints follows the next slide)

Resource Location Constraints
- Mandatory constraints: resource objects can be constrained to run on any selected subset of nodes; the default depends on the setting of symmetric_cluster
- Preferential constraints: resource objects can also be preferentially constrained to run on specified nodes by providing weightings for arbitrary logical conditions; the resource object is run on the node with the highest weight (score)
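A sketch of how the basic dependencies above are written in the release 2 CIB. The element and attribute names follow the 2.0-era DTD as best I recall it, so treat this as an illustration and check the DTD shipped with heartbeat; the resource ids are made up:

  <constraints>
    <!-- start my_ip after my_fs (and stop it before my_fs) -->
    <rsc_order id="ip_after_fs" from="my_ip" type="after" to="my_fs"/>
    <!-- my_ip must run on the same node as my_fs -->
    <rsc_colocation id="ip_with_fs" from="my_ip" to="my_fs" score="INFINITY"/>
  </constraints>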
Resource Clones
- Resource clones allow one to have a resource which runs multiple ("n") times in the cluster
- This is useful for managing:
  - load-balancing clusters where you want "n" slave servers
  - cluster filesystems
  - cluster alias IP addresses

Resource Groups
- Resource groups provide a shorthand for creating ordering and co-location dependencies
  - Each resource object in the group is declared to have a linear start-after ordering relationship
  - Each resource object in the group is declared to have a co-location dependency on the others
- This is an easy way of converting release 1 resource groups to release 2

Multi-State (Master/Slave) Resources
- Normal resources can be in one of two stable states: running and stopped
- Multi-state resources can have more than two stable states, for example: running-as-master, running-as-slave, stopped
- This is ideal for modeling replication resources like DRBD

Advanced Constraints
- Nodes can have arbitrary attributes associated with them in name=value form
- Attributes have types: int, string, version
- Constraint expressions can use these attributes, as well as node names, etc., in largely arbitrary ways
- Operators: =, !=, <, >, <=, >=, defined(attrname), undefined(attrname), colocated(resource id), not colocated(resource id)

Advanced Constraints (cont'd)
- Each constraint is associated with a particular resource and is evaluated in the context of a particular node
- A given constraint has a boolean predicate built from the expressions above, and is associated with a weight and a condition
- If the predicate is true, the condition is used to compute the weight associated with locating the given resource on the given node
- All conditions are given weights, positive or negative; additionally there are special values for modeling must-have conditions: +INFINITY and -INFINITY

DRBD – RAID1 over the LAN
- DRBD is a block-level replication technology
- Every time a block is written on the master side, it is copied over the LAN and written on the slave side
- Typically, a dedicated replication link is used
- It is extremely cost-effective – common with xSeries
- Worst case is around a 10% throughput loss
- Recent versions have a very fast "full" resync
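The deck does not show a DRBD configuration, so here is a minimal sketch of /etc/drbd.conf in the 0.7-style syntax current as of this tutorial. The hostnames are borrowed from our lab clusters, but the backing disks and the replication-link addresses are made up – the lab setup does not actually use DRBD:

  resource r0 {
    protocol C;                # synchronous: a write completes on both nodes
    on cl01lp1 {
      device    /dev/drbd0;    # replicated block device the application sees
      disk      /dev/sdb1;     # backing store on this node
      address   10.0.0.1:7789; # dedicated replication link
      meta-disk internal;
    }
    on cl01lp2 {
      device    /dev/drbd0;
      disk      /dev/sdb1;
      address   10.0.0.2:7789;
      meta-disk internal;
    }
  }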
Security Considerations
- Cluster: a computer whose backplane is the Internet
- If this isn't scary, you don't understand...
- You may think you have a secure cluster network
  - You're probably mistaken now
  - You certainly will be in the future

Secure Networks Are Difficult Because...
- Security is not often well understood by admins
- Security is well understood by "black hats"
- Network security is easy to breach accidentally: users bypass it, and hardware installers don't fully understand it
- Most security breaches come from "trusted" staff; staff turnover is often a big issue
- Virus/worm/P2P technologies will create new holes, especially for Windows machines

Security Advice
- Good HA software should be designed to assume insecure networks
- Not all HA software assumes insecure networks
- Good HA installation architects use dedicated (secure?) networks for intra-cluster HA communication
- Crossover cables are reasonably secure – all else is suspect ;-)

Part II – Hands-On Cluster Configuration
- Overview of hardware
- Overview of logical cluster setup
- Overview of services to be configured
- Detailed configuration:
  - ha.cf
  - authkeys
  - Cluster Information Base (CIB)

Overview of Cluster Hardware
Each cluster will consist of two logical computers (LPARs). Each logical computer consists of:
- One logical CPU
- One logical boot/root disk (sda)
- One logical disk sharable by the other LPAR (sd[bc])
- One logical disk shared from the other LPAR (sd[bc])
- One logical Ethernet connection (eth0)
- One administrative IP address plus a service address
The observant will note the lack of significant redundancy in this test environment.

Logical Cluster Setup – Part I
Communications:
- Heartbeat has unicast, broadcast, and multicast communication modes
- Broadcast communication will NOT work in this environment – all the tutorial clusters are on the same subnet
- Multicast communication is simplest for this environment
- Unicast communication is probably the best choice for this environment

Resources To Be Configured
For this cluster we will configure the following resources (services):
- IP address, using the IPaddr resource agent
- Filesystem mounts, using the Filesystem resource agent
- Apache web server, using the ha_apache resource agent
- SAMBA file sharing, using the smb and nmb LSB-style init scripts that come with SLES 9
- NFS file sharing, using the nfslock and nfsserver LSB-style init scripts that come with SLES 9

Shared Disk Access – Naming
- Earlier it was noted that the shared logical disks were numbered sd[bc]
- When configuring resources, a given resource must come up consistently on all nodes in a cluster
- To avoid problems, one can mount the shared filesystems by label rather than by /dev/sdb or /dev/sdc
- This is accomplished by using "-Llabel" as the device in the Filesystem resource. We do not do this in our sample files :-).
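As just noted, mounting by label only changes the device parameter of the Filesystem resource. A sketch of the two relevant nvpairs, following the slide's "-Llabel" convention – the label "webdata" and the mount point are made up, and the full primitive syntax appears later in the deck:

  <nvpair name="device" value="-Lwebdata"/>
  <nvpair name="directory" value="/local/apache"/>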
/etc/ha.d/ha.cf Configuration File

  use_logd true
  node cl<clusternumber>lp1 cl<clusternumber>lp2
  keepalive 2
  deadtime 30
  initdead 120
  crm on
  # next line is for testing
  respawn hacluster /usr/lib/heartbeat/cibmon -d
  ucast eth0 192.168.224.1xx
  ucast eth0 192.168.224.1xx+1

See also: http://linux-ha.org/ha.cf

/etc/ha.d/authkeys
This file provides authentication information:

  auth 1
  1 sha1 putYourUniqueSecretKeyHere

- The file MUST be mode 0600 or 0400
- See: http://linux-ha.org/authkeys
- Exercise for those who find time: http://linux-ha.org/GeneratingAuthkeysAutomatically

Cluster Information Base (CIB) Intro
The CIB is an XML file containing:
- Configuration information:
  - cluster node information
  - resource information
  - resource constraints
- Status information:
  - which nodes are up / down
  - which resources are running where
We only provide the configuration information.

An Empty CIB

  <cib>
    <configuration>
      <crm_config/>
      <nodes/>
      <resources/>
      <constraints/>
    </configuration>
    <status/>
  </cib>

The crm_config CIB Section

  <crm_config>
    <nvpair id="require_quorum" name="require_quorum" value="true"/>
    <nvpair id="symmetric_cluster" name="symmetric_cluster" value="true"/>
    <nvpair id="suppress_cib_writes" ... />
  </crm_config>

The nodes CIB Section
- We let heartbeat get the nodes information from the membership layer
- This makes things much easier on us :-)

The resources Section of the CIB
The resources section is one of the most important sections. It consists of a set of individual resource records; each record represents a single resource.

  <resources>
    <primitive/>
    <primitive/>
    ...
  </resources>
Classes of Resource Agents in R2
- OCF – Open Cluster Framework (http://opencf.org/)
  - Take parameters as name/value pairs through the environment
  - Can be monitored well by R2
- Heartbeat – R1-style heartbeat resources
  - Take parameters as command-line arguments
  - Can be monitored by a status action
- LSB – standard LSB init scripts
  - Take no parameters
  - Can be monitored by a status action
- STONITH – node reset capability
  - Very similar to OCF resources

IPaddr Resource Agent
Class: OCF
Parameters:
- ip – IP address to bring up
- nic – NIC to bring the address up on (optional)
- netmask – netmask for ip, in CIDR form (optional)
- broadcast – broadcast address (optional)

Filesystem Resource Agent
Class: OCF
Parameters:
- device – "devicename" to mount
- directory – where to mount the filesystem
- fstype – type of filesystem to mount
- options – mount options (optional)
This is essentially an /etc/fstab entry, expressed as a resource.

ClusterMon Resource Agent
Class: OCF
Parameters:
- htmlfile – name of the HTML file to write (required)
- update – how often to update the HTML file (required)
- user – who to run crm_mon as
- extra_options – extra options to pass to crm_mon (optional)
Notes: update must be given in seconds; htmlfile must be located in the Apache docroot. Suggested value for extra_options: "-n -r".

ha_apache Resource Agent
Class: OCF
Parameters:
- configfile – name of the Apache configuration file (required)
- port – the port the server is running on (optional)
- statusurl – URL to use in the monitor operation (optional)
Values for the optional parameters are deduced by reading the configuration file. The config file and HTML directories must go on shared media.

smb and nmb Resources
Class: LSB (i.e., normal init scripts)
- They take no parameters
- Must be started after the IP address resource is started
- Must be started after the filesystem they are exporting is started
- Their configuration files should go on the shared media

nfslock and nfsserver Resources
Class: LSB (i.e., normal init scripts)
- Neither takes any parameters
- NFS config and lock info must be on shared media
- NFS filesystem data must be on shared media
- Inodes of mount devices and all files must match (!)
- Must be started before the IP address is acquired
- Making the inodes of disks match can sometimes be a bit tricky (LVM and DRBD help here)

ibmhmc STONITH Resource
Class: stonith
Parameters:
- ip – IP address of the HMC controlling the node in question
This resource talks to the "management console" for the OpenPower machines we're using.

An OCF Primitive Object

  <primitive id="ebIP" class="ocf" type="IPaddr" provider="heartbeat">
    <instance_attributes>
      <attributes>
        <nvpair name="ip" value="192.168.224.1xx"/>
      </attributes>
    </instance_attributes>
  </primitive>

A STONITH Primitive Object

  <primitive id="st" class="stonith" type="ibmhmc" provider="heartbeat">
    <instance_attributes>
      <attributes>
        <nvpair name="ip" value="192.168.224.99"/>
      </attributes>
    </instance_attributes>
  </primitive>
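Release 2's built-in resource monitoring is configured per resource through an operations list inside the primitive. A hedged sketch combining the Filesystem parameters above with the <op> syntax used in the STONITH clone later in the deck – the ids, interval, timeout, device, and mount point are all made up:

  <primitive id="webfs" class="ocf" type="Filesystem" provider="heartbeat">
    <operations>
      <!-- re-run the agent's monitor action every 20s; failures trigger recovery -->
      <op id="webfs_mon" name="monitor" interval="20s" timeout="60s"/>
    </operations>
    <instance_attributes>
      <attributes>
        <nvpair name="device" value="/dev/sdb1"/>
        <nvpair name="directory" value="/local/apache"/>
        <nvpair name="fstype" value="ext3"/>
      </attributes>
    </instance_attributes>
  </primitive>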
An LSB Primitive Object (i.e., an init script)

  <primitive id="samba-smb" class="lsb" type="smb">
    <instance_attributes>
      <attributes/>
    </instance_attributes>
  </primitive>

Resource Groups
- Resources can be put together in groups, a lot like R1 resource groups
- Groups are simple to manage, but less powerful than individual resources with constraints

  <group id="webserver">
    <primitive/>
    <primitive/>
  </group>

Resource "clone" Units
If you want a resource to run in several places, you can "clone" the resource:

  <clone id="yI">
    <instance_attributes>
      <attributes/>
    </instance_attributes>
    <primitive>
      <operations/>
      <instance_attributes/>
    </primitive>
  </clone>

CIB Constraints

  <constraints>
    <rsc_location>
      <rule/>
      <rule/>
    </rsc_location>
  </constraints>

rsc_location Information
We prefer to run on host cl01lp2:

  <rsc_location id="run_webserver" group="webserver">
    <rule id="rule_webserver" score="100">
      <expression attribute="#uname" operation="eq" value="cl01lp2"/>
    </rule>
  </rsc_location>

Our Configuration: Two Resource Groups
- webserver: Filesystem, IPaddr, apache
- fileserver: Filesystem, NFS, IPaddr, smb (Samba), nmb (Samba)

STONITH "clone" Resource(s)

  <clone id="fencing">
    <instance_attributes>
      <attributes>
        <nvpair id="1" name="clone_max" value="2"/> <!-- necessary? -->
        <nvpair id="2" name="clone_node_max" value="1"/>
      </attributes>
    </instance_attributes>
    <primitive id="fencing_op" class="stonith" type="ibmhmc">
      <operations>
        <op id="1" name="monitor" interval="s" timeout="20s" prereq="nothing"/>
        <op id="2" name="start" timeout="20s" prereq="nothing"/>
      </operations>
      <instance_attributes>
        <attributes>
          <!-- do we really need the hostlist? -->
          <nvpair id="1" name="hostlist" value="cl01lp1 cl01lp2"/>
        </attributes>
      </instance_attributes>
    </primitive>
  </clone>

Detailed Handouts
The complete configuration is too detailed to present further in slides. The remaining details are provided, in smaller fonts, in the supplied paper handout.

References
- http://linux-ha.org/
- http://linux-ha.org/download/
- http://linux-ha.org/SuccessStories
- http://linux-ha.org/Certifications
- http://linux-ha.org/NewHeartbeatDesign
- http://www.linux-mag.com/2003-11/availability_01.html

Legal Statements
- IBM is a trademark of International Business Machines Corporation.
- Linux is a registered trademark of Linus Torvalds.
- Other company, product, and service names may be trademarks or service marks of others.
- This work represents the views of the author and does not necessarily reflect the views of the IBM Corporation.