Making Domino Clusters a Key Part of Your Enterprise Disaster Recovery Architecture Andy Pedisich Technotics © 2011 Wellesley Information Services. All rights reserved. In This Session … • • Clustering is one of my favorite Domino features Introduced with Release 4.5 in December of 1996 We have seen increased adoption of clustering Domino servers over the past two years Much of that was for disaster recovery Some enterprises have been using Domino clustering for disaster recovery for years This session shares as much information as possible about the topic based on what I’ve seen, heard about, and discovered creating Domino-based disaster recovery solutions And most importantly, the session speaks to the all-important issues around managing failovers 1 What We’ll Cover … • • • • • • • • Convincing management that DR clusters rock Exploring the choices for a clustered DR architecture Mastering cluster replication Setting up a private LAN for cluster traffic Managing cluster failover and load balancing Understanding the role of iNotes in your DR solution Reviewing the 7 important rules for configuring DR clusters Wrap-up 2 Setting a Baseline for Concepts • • • First, let’s make sure that everyone knows that when you see the letters DR, we’re talking about disaster recovery Disaster recovery is part of a larger concept referred to as business continuity That’s what we plan to do to keep our businesses running while disruptive events occur These events could be extremely local, like a power outage in a building Or they could be regional, like an earthquake or tornado, or a disaster caused by humans doing their thing Disaster recovery focuses on the technology that supports business operations 3 Why Use Domino for Disaster Recovery • • • Domino clustering has been accepted by many organizations as an important part of their DR infrastructure Although that’s not a total endorsement, as corporations do wacky things Clustering works with just about all things Lotus 
Notes/Domino such as email, calendaring, Traveler, BlackBerry, and roaming users Clustering is included as part of your enterprise server licenses Clusters clearly should be exploited in every enterprise And they should have a solid role in any DR solution for messaging and collaboration 4 Important Facts to Help You Sell Clustering for DR • • Here are some facts to back you up when you start making plans to use Domino clusters for DR Clustering is a shrink-wrapped solution It’s not new. As a matter of fact, it’s been burned in for years. It works really well It’s automatic It’s easy to set up and maintain It’s easy to test The biggest drawback for most companies using clusters is the increase in storage requirements, since their data size doubles Domino Attachment and Object Services (DAOS) can reduce that size by 30% to 40% 5 What We’ll Cover … • • • • • • • • Convincing management that DR clusters rock Exploring the choices for a clustered DR architecture Mastering cluster replication Setting up a private LAN for cluster traffic Managing cluster failover and load balancing Understanding the role of iNotes in your DR solution Reviewing the 7 important rules for configuring DR clusters Wrap-up 6 Basic Clustering Requirements • • All servers in the cluster must: Run the Domino Enterprise or Domino Utility server Be connected using a high-speed local area network (LAN) or a high-speed wide area network (WAN) Use TCP/IP and be on the same Notes named network Be in the same Domino domain Share a common Domino directory Have plenty of CPU power and memory It’s safe to say that clustered servers need more power and more disk resources than unclustered servers A server can be a member of only one cluster at a time 7 Technical Considerations for Disaster Recovery Clusters • • A few guidelines regarding configuring clusters, especially for clusters used for disaster recovery Servers in a cluster should not use the same power source Nor should they use the same network subnet 
or routers They should not use the same disk storage array They should not be in the same building Your clustered DR solution should ideally be in different cities Never in the same room 8 Keep Your Distance • • • • • Good DR cluster designs should take into account both local and regional problems Consider a financial company that had clustered servers in two separate buildings across the street from each other in Manhattan This firm now has primary servers in offices in New York City And failover servers are thousands of miles away Another firm has primary servers in Chicago With a failover server in the UK A college has primary and failover servers separated by 200 miles Another company we know is just starting out with DR and has servers 25 miles from each other A good start, but they really want more distance 9 Servers Accessed Over a Wide Area Network • • • If servers are as far away as they are supposed to be, there might be some latency in the synchronization of the data If this is the case, users might find that the failover server is not up to date with their primary server during a failover Everyone must be aware that this is a possibility Expectations must be set Or management needs to provide budgets for better networking Work out all of these details in advance with management Get them written down and approved so there are no surprises 10 Most Common DR Cluster Configuration • The most common DR cluster configuration is active/passive Servers on site are active Servers at the failover site are passive, waiting for failover Sometimes domains use this failover server as a place to do backups or journaling The number of servers in the cluster usually varies There could be 1 active and 1 passive Or 2 or 3 active and 1 passive 11 The Active/Passive Clustered Servers • • Active/Passive servers The cluster has two servers: one active, the other generally unused or used only for backups Very common disaster recovery setup Each server holds replicas of all the files from 
the other server During failover, all users flip to the other server 12 The 3 Server Cluster for DR • • Three or more servers in the cluster There are two replicas for each cluster-supported application or mail file Each primary server holds the mail files of the users assigned to that server Replicas of mail files from both primary servers are on the failover 13 3 Server Cluster with One Primary Server Down • If a primary server goes down, users from that server go to the failover server Easy to understand, and you save yourself a server You’ll still need twice the total disk space of Mail1 and Mail2 What happens when both Mail1 and Mail2 are unavailable? 14 Both Primary Servers Are Down • • If both primary servers are down, the last server in the cluster has to support everyone Remember that generally speaking only about 60% to 70% of users assigned to a server are on it concurrently Still, that has to be a pretty strong server with fast disks Some sites have remote servers as primary, and failover happens at the home office data center 15 What We’ll Cover … • • • • • • • • Convincing management that DR clusters rock Exploring the choices for a clustered DR architecture Mastering cluster replication Setting up a private LAN for cluster traffic Managing cluster failover and load balancing Understanding the role of iNotes in your DR solution Reviewing the 7 important rules for configuring DR clusters Wrap-up 16 Understanding Cluster Replication • • Cluster replication is event driven It doesn’t run on a schedule The cluster replicator detects a change in a database and immediately pushes the change to other replicas in the cluster If a server is down or there is significant network latency, the cluster replicator stores changes in memory so it can push them out when it can If a change to the same application happens before a previous change has been sent, the CLREPL gathers them and sends them all together 17 Streaming Cluster Replication • • R8 introduced Streaming 
Cluster Replication (SCR) This newer functionality reduces replicator overhead Provides a reduction in cluster replicator latency As changes occur to databases, they are captured and immediately queued to other replicas in the same cluster This makes cluster replication more efficient 18 SCR Only Works with R8 • • If one server in the cluster is R8 and another is not, Domino will attempt SCR first When that doesn’t work, it will fall back on standard cluster replication If you’d like to turn off SCR entirely to ensure compatibility, use this parameter DEBUG_SCR_DISABLED=1 This must be used on all cluster mates 19 Only One Cluster Replicator by Default • • When a cluster is created, each server has only a single cluster replicator instance If there have been a significant number of changes to many applications, a single cluster replicator can fall behind Database synchronization won’t be up to date If a server fails when database synch has fallen behind, users will think their mail file or app is “missing data” They won’t understand why all the meetings they made this morning are not there They think their information is gone forever! Users need their cluster insurance! 20 Condition Is Completely Manageable • • • • Adding a cluster replicator will help fix this problem You can load cluster replicators manually using the following console command Load CLREPL Note that a manually loaded cluster replicator will not survive a server restart Add cluster replicators permanently to a server Use this parameter in the NOTES.INI CLUSTER_REPLICATORS=# I always use at least two cluster replicators 21 When to Add Cluster Replicators • • • But how do you tell if there’s a potential problem? Do you let it fail and then wait for the phone to ring? No! 
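The replicator settings just described can be sketched as a NOTES.INI fragment plus the related console commands (a sketch using the standard Domino settings named in the slides; the value 2 reflects the speaker’s “at least two” starting point, not a rule):

```ini
; NOTES.INI on each cluster member -- permanent, survives restarts
CLUSTER_REPLICATORS=2

; Console commands:
;   > load clrepl    (adds a replicator now; does NOT survive a restart)
;   > show tasks     (confirm how many Cluster Replicator tasks are running)
```

Because a manually loaded replicator disappears at the next restart, the NOTES.INI parameter is the one to rely on for a permanent configuration.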
You look at the cluster stats and get the data you need to make an intelligent decision Adding too many will have a negative effect on server performance Here are some important statistics to watch 22 Key Stats for Vital Information About Cluster Replication Statistic | What It Tells You | Acceptable values Replica.Cluster.SecondsOnQueue | Total seconds that the last DB replicated spent on the work queue | < 15 sec – light load, < 30 sec – heavy Replica.Cluster.SecondsOnQueue.Avg | Average seconds a DB spent on the work queue | Use for trending Replica.Cluster.SecondsOnQueue.Max | Maximum seconds a DB spent on the work queue | Use for trending Replica.Cluster.WorkQueueDepth | Current number of databases awaiting cluster replication | Usually zero Replica.Cluster.WorkQueueDepth.Avg | Average work queue depth since the server started | Use for trending Replica.Cluster.WorkQueueDepth.Max | Maximum work queue depth since the server started | Use for trending 23 What to Do About Stats Over the Limit • • Acceptable Replica.Cluster.SecondsOnQueue The queue is checked every 15 seconds, so under light load the value should be less than 15 Under heavy load, if the number is larger than 30, another cluster replicator should be added If the above statistic is low and Replica.Cluster.WorkQueueDepth is constantly higher than 10 … Perhaps your network bandwidth is too low Consider setting up a private LAN for cluster replication traffic 24 Stats That Have Meaning but Have Gone Missing • There aren’t any views in the Lotus version of Statrep that let you see these important statistics Matter of fact, the Cluster view is pretty worthless 25 The Documents Have More Information • The cluster documents have much better information You can actually use the data in the docs But the views still lack key stats, though the stats are in each doc 26 Stats That Have Meaning but Have Gone Missing • But there is a view like that in the Technotics R8.5 Statrep.NTF It shows the key stats you need To help track and adjust your clusters It is included on the CD 
for this conference 27 My Column Additions to Statrep Column Title | Formula | Formatting Min on Q | Replica.Cluster.SecondsOnQueue / 60 | Fixed (One Decimal Place) Min/Q Av | Replica.Cluster.SecondsOnQueue.Avg / 60 | Fixed (One Decimal Place) Min/Q Mx | Replica.Cluster.SecondsOnQueue.Max / 60 | Fixed (One Decimal Place) WkrDpth | Replica.Cluster.WorkQueueDepth | General WD Av | Replica.Cluster.WorkQueueDepth.Avg | General WD Mx | Replica.Cluster.WorkQueueDepth.Max | General 28 Use a Scheduled Connection Document Also • Back up your cluster replication with a scheduled connection document between servers Have it replicate at least once per hour You’ll always be assured of having your servers in sync, even if one has been down for a few days And it replicates deletion stubs too! 29 What We’ll Cover … • • • • • • • • Convincing management that DR clusters rock Exploring the choices for a clustered DR architecture Mastering cluster replication Setting up a private LAN for cluster traffic Managing cluster failover and load balancing Understanding the role of iNotes in your DR solution Reviewing the 7 important rules for configuring DR clusters Wrap-up 30 Busy Clusters Might Require a Private LAN • • A private LAN separates the network traffic the cluster creates for replication and server probes And will probably leave more room on your primary LAN Start by installing an additional network interface card (NIC) in each server in the cluster Connect the NICs through a private hub or switch 31 Setting Up the Private LAN • • Assign a second IP address to the additional NIC Assign host names to the addresses in the local HOSTS file on each server Using DNS is a best practice 10.200.100.1 mail1_clu.domlab.com 10.200.100.2 mail2_clu.domlab.com Test by pinging the new hosts from each server 32 Modify Server Document • For each server in the cluster, edit the server document to enable the new port 33 Set Parameters so Servers Use Private LAN • Make your clusters use the private LAN for cluster traffic by 
establishing the ports in the server NOTES.INI with these parameters CLUSTER=TCP,0,15,0 PORTS=TCPIP,CLUSTER CLUSTER_TCPIPADDRESS=0,10.200.100.2:1352 You will use the address of your own NIC 34 Parameters to Make the Cluster Use the Port • • Use the following parameter to ensure Domino uses the port for cluster traffic SERVER_CLUSTER_DEFAULT_PORT=CLUSTER Use this parameter just in case the CLUSTER port you’ve configured isn’t available SERVER_CLUSTER_AUXILIARY_PORTS=* This allows clustering to use any port if the one you’ve defined isn’t available 35 Keep Users Off the Private LAN • • To keep users from grabbing on to the private LAN port, take the following steps Create a group called ClusterServers Add the servers in the cluster to this group Add the following parameter to the NOTES.INI of both servers It will keep users from connecting through the CLUSTER port Allow_Access_Cluster=ClusterServers 36 What We’ll Cover … • • • • • • • • Convincing management that DR clusters rock Exploring the choices for a clustered DR architecture Mastering cluster replication Setting up a private LAN for cluster traffic Managing cluster failover and load balancing Understanding the role of iNotes in your DR solution Reviewing the 7 important rules for configuring DR clusters Wrap-up 37 Respect the Users • • Clustering provides outstanding service levels for users But the process of failing over is sometimes hard on users Failover is actually the most difficult moment for users And sometimes errors in network configuration might prevent successful failover For example, the common name of the server should be listed as an alias in DNS to ensure users can easily open their applications on the servers If the server is not in DNS, the clients won’t know how to get to the failover servers 38 Best Practice for Cluster Management • • Best Practice: Don’t take a clustered server down during working hours unless it is absolutely necessary A non-planned server outage, such as a crash or 
power failure, is a legitimate reason to fail over Resist the urge to take a server down because you know it’s clustered You could probably do it, but the risk is that a hard failover will cause unwanted help desk calls 39 Easiest Cluster Configuration to Manage • • The Active/Passive model of clustering is by far the easiest to manage Use parameters in the NOTES.INI file on the servers in the cluster that allow users on the primary server But don’t allow them on the failover server 40 Check for Server Availability • The parameters we use are thresholds that check the cluster statistic Server.AvailabilityIndex (SAI) This statistic shows how busy the server is 100 means it’s not busy at all 0 means it’s crazy busy 41 Adjusting the Threshold • • Setting the parameter Server_availability_threshold controls whether users can access the server 50 means that if the SAI falls below 50, users fail over to another server A setting like this can be used for load balancing 100 means the SAI must be 100, which means the server must be 100% available This translates into “nobody is allowed on the server” 0 means that load balancing and checking the SAI is turned off These thresholds can come in handy 42 Setting Up Active/Passive Servers in a Cluster • • • • Let’s look at the following scenario Mail1 is the active primary; Mail2 is the passive failover To allow users to access their primary server, use this parameter in the NOTES.INI of Mail1 Server_availability_threshold = 0 Use this console command: Set config server_availability_threshold=0 To prevent users from accessing failover server Mail2, use this parameter Server_availability_threshold = 100 Use this console command: Set config server_availability_threshold=100 Administrators will still be able to access this server 43 Mail1 Is Crashing • • • If Mail1 crashes, the Notes client will disregard our setting of 100 on Mail2 and users will be permitted to connect To help stabilize the system, use this parameter on Mail2 
Server_availability_threshold = 0 Let all users aboard While Mail1 is down, enter this parameter into the NOTES.INI to prevent users from connecting: Server_restricted=2 2 will keep the setting after a server restart Setting it to 1 also keeps them off, but after a restart the server will behave as if it were set to 0 44 Recovering After a Crash • • • • When Mail1 is brought back up after the crash, no one will be permitted to access it except administrators That’s because of the Server_restricted=2 setting Leave it that way until the end of the day The ugliest part about failing over is the client part Clients are working just fine on Mail2 By the way, IMAP and POP3 users still have access to Mail1 At the day’s end, switch the Server_availability_threshold back to 100 on Mail2 and 0 on Mail1 Issue this console command on Mail2 Drop all 45 Taking a Clustered Server Down Voluntarily • • • If you must take the clustered Mail1 server down, set Server_restricted=2, then drop all at the console Remember that POP3 and IMAP users won’t like you very much Set Server_availability_threshold to 0 on Mail2 Don’t forget to set Server_restricted back to 0 when the crisis has passed I know someone who forgot to do this a couple of times This person could access the server and work because he was in an administrator’s group However, nobody else could get on the server, and he made all the users very angry 46 Triggering Failover • • You can set the maximum number of concurrent NRPC users allowed to connect to a server Server_MaxUsers NOTES.INI variable Set the variable to a number determined in the planning stage Set the variable using a console command Or use the NOTES.INI tab in the server configuration document Set config Server_MaxUsers = desired maximum number of active concurrent users Additional users will fail over to the other members of the cluster 47 Load Balancing • • If you’d like to load balance your servers, determine what your comfort range is for how busy your servers are and set the 
Server_availability_threshold accordingly Perhaps start with a value of 60 Users should fail over when the SAI goes below 60 Pay close attention to the SAI in the STATREP.NSF, which is listed under the Av Inx column Some hardware can produce inaccurate SAI readings and cause users to fail over when it’s not necessary 48 SAI Is Unreliable in Some Cases • Note how in this case, the Server Availability Index never seems to get much above 50 consistently Users would be failing over constantly And if both servers had the issue, users would be bouncing back and forth between the clustered servers 49 The Expansion Factor • Cluster members determine their workload based on the expansion factor This is calculated based on response times for recent requests The server compares the recent response time to the minimum response time that the server has recorded Example: The server currently averages 12ms for DBOpen requests; the minimum time was 4ms Expansion factor = 3 (current time/fastest time) This is averaged over different types of transactions The fastest time is stored in memory and in LOADMON.NCF LOADMON.NCF is read each time the server starts 50 The Expansion Factor (cont.) • • But sometimes Domino has a difficult time calculating the expansion factor The result is that the Server.AvailabilityIndex is not a reliable measure of how busy the server is This can happen with extremely high performing servers If you see a very low Server.AvailabilityIndex at a time you know servers are supposed to be idle and you are trying to load balance, there is something you can do to correct it And Domino can help! 
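As a rough illustration of the arithmetic just described — a simplified model for intuition only, not IBM’s actual implementation; the function name and data layout are invented for the example — the expansion factor is the recent average response time divided by the fastest recorded time, averaged over transaction types:

```python
# Simplified sketch of the expansion factor idea. Domino persists its
# per-transaction fastest times in LOADMON.NCF; here both recent and
# fastest times are plain dicts keyed by transaction type.

def expansion_factor(recent_ms, fastest_ms):
    """Average of per-transaction-type (recent / fastest) ratios."""
    ratios = [recent_ms[t] / fastest_ms[t] for t in recent_ms]
    return sum(ratios) / len(ratios)

# Example from the slide: DBOpen currently averages 12 ms, fastest was 4 ms.
ef = expansion_factor({"DBOpen": 12.0}, {"DBOpen": 4.0})
print(ef)  # → 3.0
```

A higher expansion factor means the server is responding more slowly relative to its best-case performance, which in turn drags down the availability index.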
51 Changing Expansion Factor Calculation • • Use this parameter to change how the Expansion Factor is calculated SERVER_TRANSINFO_RANGE=n To determine the optimal value for this variable: After the server has experienced heavy usage, use this console command: Show AI This means, show the availability index calculation It has nothing to do with that 2001 Steven Spielberg movie about the robot that looks like a child and tries to become a real boy 52 An Easy Way to Find the Parameter Value • Show AI is a console command that has been around since Domino Release 6 It runs some computations on the server And suggests a SERVER_TRANSINFO_RANGE for you 53 Events to Monitor When Using Clusters • Use event monitoring to look for certain problems that could ruin your clustering by preventing replication Look for the following phrases “cannot replicate database” “folder is larger than supported” They both mean that users are going to hate you in the event of a failover Because their databases will not be in sync 54 Disaster Recovery Is a Special Case • • Many enterprises configure their clusters for manual failover They don’t want users to fail over unless they permit it To keep users off of the failover servers, use the following parameters in the NOTES.INI of the server Server_restricted=2 This will keep all users off until you set it to zero Server_availability_threshold=100 The server looks busy, and this keeps everyone on the primary Set it to zero to allow users on the server Keep in mind, if the primary servers are down, the users won’t have a server to work on unless you manually reset these parameters 55 What We’ll Cover … • • • • • • • • Convincing management that DR clusters rock Exploring the choices for a clustered DR architecture Mastering cluster replication Setting up a private LAN for cluster traffic Managing cluster failover and load balancing Understanding the role of iNotes in your DR solution Reviewing the 7 important rules for configuring DR clusters 
Wrap-up 56 Things That Do Not Cluster with Domino Clustering • • • Only traffic over NRPC port 1352 will fail over Other TCP/IP services will not fail over, including: IMAP POP3 HTTP This means iNotes users will not fail over But that doesn’t mean you can’t make it work In the aftermath of Katrina, some companies found that browser-based email was a savior 57 High Availability Is Not Disaster Recovery • • • IBM/Lotus has several recommendations for achieving high availability with iNotes For example, Internet Cluster Manager This can redirect the initial URL for an application to one of several back-end mail servers However, if the mail server becomes unavailable after a session is started, there is no automatic way to recover and switch to another server that has a replica of the mail file IBM/Lotus also recommends an HTTP load balancer to help with high availability, and with DR to a certain extent The issue is, what happens when the load balancer is unavailable? 58 iNotes Users Accessing a Hub Server • • The IWAREDIR.NSF is an application that helps direct browser requests to a user’s mail server Any user on either server Mail1 or Mail2 connects to an iNotes server that is deployed using the IWAREDIR.NSF The IWAREDIR.NSF looks up their name in the Domino directory and “redirects” them to their mail server But it is not designed to take them to a failover server However, it is possible to have a second version of this file with code changes that will take a user to their failover mail server 59 Open the IWAREDIR.NSF Using the Designer Client • Open the AutoLogin form using the Notes Designer Client At the very bottom, you’ll find a field called $$HTMLHead This contains the code that discovers the user’s mail server and mail file name 60 Section of Code That Can Be Modified • In this field is code that can be modified to point the user to the failover server rather than to their primary mail server There’s a link at the end of this presentation to some extreme 
programming you can do with this field 61 Code to Point Users to Failover Server • If the address book says the user’s mail is on Mail1, but you want to take them to server Mail1F, the failover box, the code can be changed to something like this This is an @If statement that essentially says, if the directory states that the MailServer for the user is Mail1, take them to Mail1F, the failover server This failover version of the IWAREDIR.NSF could be called IWAREDIRF.NSF 62 During Normal Operation • During a normal day with no need for failover, the IWAREDIR.NSF would be specified as the home URL for the server Users would be taken to their home servers and their mail files automatically 63 During a Failover Situation • When required, this configuration can be changed to use the failover version, the IWAREDIRF.NSF After this change is made, HTTP must be restarted 64 Restarting the HTTP Task • • Although the HTTP task can be restarted, the most reliable method is to shut down HTTP and load it again From the Domino console use these commands Tell HTTP quit Load HTTP 65 iNotes Is There When You Need It • When there are no fat Notes clients available, iNotes can have extreme value and maintain communications in your organization It’s worth planning it out so that when users call the help desk during an emergency, the help desk can point them to any DR server that is configured to use the IWAREDIRF.NSF That’s the failover version of your redirector 66 What We’ll Cover … • • • • • • • • Convincing management that DR clusters rock Exploring the choices for a clustered DR architecture Mastering cluster replication Setting up a private LAN for cluster traffic Managing cluster failover and load balancing Understanding the role of iNotes in your DR solution Reviewing the 7 important rules for configuring DR clusters Wrap-up 67 7 Rules for Clustering Domino Servers for Disaster Recovery 1. 2. 3. 4. 5. 6. 7. 
Get agreement from all parties before you start configuring Be sure everyone agrees on what failover means and what kind of disaster you are preparing for, such as local or regional outages Keep primary and failover servers as far apart as possible Actually test the failover scenarios so that there is no doubt that the configuration works Consider manual failover to prevent users from accessing servers over a wide area network or slow connections Review cluster statistics regularly to ensure there are enough cluster replicators Review the CLUDBDIR.NSF to make sure there is a failover replica for each database on the primary server 68 What We’ll Cover … • • • • • • • • Convincing management that DR clusters rock Exploring the choices for a clustered DR architecture Mastering cluster replication Setting up a private LAN for cluster traffic Managing cluster failover and load balancing Understanding the role of iNotes in your DR solution Reviewing the 7 important rules for configuring DR clusters Wrap-up 69 Resources • • • Achieving high availability with IBM Lotus iNotes www10.lotus.com/ldd/dominowiki.nsf/dx/Achieving_high_availability_with_IBM_Lotus_iNotes ServersLookup Form for IWAReder to return User replicas for Load Balancers and Reverse Proxies to use www10.lotus.com/ldd/dominowiki.nsf/dx/ServersLookup_Form_for_IWAReder_to_return_User_replicas_for_Load_Balancers_and_Reverse_Proxies_to_use Understanding IBM Lotus Domino server clustering www.ibm.com/developerworks/lotus/documentation/d-lsdominoclustering/index.html 70 Resources (cont.) 
• How to test cluster failover on one Domino database (application) www304.ibm.com/support/docview.wss?rs=899&uid=swg21280021 71 7 Key Points to Take Home • • • • Disaster recovery is part of the bigger concept known as business continuity The biggest drawback when using DR clusters is disk space consumption, which can be decreased by implementing Domino Attachment and Object Services A three-server cluster, with two primary servers and one failover they share, is an excellent way to conserve server resources and licensing costs Make sure your clustered servers are configured with scheduled connection documents, because deletion stubs don’t replicate via cluster replication 72 7 Key Points to Take Home (cont.) • • • Use the Allow_Access_Portname parameter to prevent users from accidentally using the private LAN between clustered servers Failing over is the hardest part for users iNotes is an excellent way to keep users connected during a disaster 73 Your Turn! How to contact me: Andy Pedisich Andyp@technotics.com HTTP://www.andypedisich.com 74