Disaster Recovery of Oracle Fusion Middleware and Oracle Database Server with EMC RecoverPoint
Applied Technology

Abstract
Oracle database and application administrators face many challenges in managing the application and storage resources necessary for Oracle operations. This white paper outlines how EMC® RecoverPoint provides cost-effective local and remote replication of Oracle Fusion Middleware and Oracle Database Server as part of a disaster recovery solution.

August 2009

Copyright © 2009 EMC Corporation. All rights reserved.

EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without notice.

THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” EMC CORPORATION MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Use, copying, and distribution of any EMC software described in this publication requires an applicable software license.

For the most up-to-date listing of EMC product names, see EMC Corporation Trademarks on EMC.com.

All other trademarks used herein are the property of their respective owners.

Part Number h6450

Table of Contents
Executive summary
    Background
Introduction
    Audience
RecoverPoint
    Advantages of RecoverPoint
    Local and remote recovery
    Replication modes
    Oracle database protection with RecoverPoint
    Federated environments and consistency groups
Use case objectives
Oracle/EMC environment block diagram
RecoverPoint validation architecture
CLARiiON array-side preparations
System configuration and storage
    Hardware configuration
    Software configuration
    Oracle Fusion Middleware, web tier, and Database volume layout
    RecoverPoint CRR consistency groups
    RecoverPoint group sets and parallel bookmarks
OCFS2 configuration
    High availability for persistent stores
    Use high-availability storage for state data
Network configuration
Planning for disasters and planned downtime
    Initiate the replication process
        RecoverPoint software installation and configuration
        Starting replication
    Switchover procedures
    Switchback procedures
    Failover procedures
    Failback procedures
General recommendations
    Setting snapshots or manual bookmarks based on requirements
    Periodic DR testing
    Event notification
    General I/O and sizing
Conclusion
References and resources
    Oracle
    EMC
Appendix
    Oracle DR terminology
    RecoverPoint terminology
    RecoverPoint write splitters
    RecoverPoint image access modes
        Virtual access (instant)
        Virtual access (instant) with Roll image in background
        Logged access (physical)
        Disable Image Access

Executive summary
Oracle’s family of middleware products is comprehensive, standards-based application infrastructure software, ranging from the leading Java application server to SOA and Enterprise 2.0 portals. Pre-integration with Oracle Applications, Database, and Enterprise Manager speeds implementation and lowers costs.

EMC® RecoverPoint is an important advancement in the area of data replication. RecoverPoint provides local and remote replication for heterogeneous servers and storage and enables multiple applications to have replication consistency with fine-grained control over local and remote recovery.

This white paper describes an application topology supporting core services provided by Oracle Fusion Middleware and Oracle Database Server. The data replication and disaster recovery products represented can address your specific application topology as it expands beyond the core represented here. This disaster recovery validation test incorporates the use of Oracle Application Server SOA Suite and Oracle Database Server 10gR2.

Oracle SOA Suite is a comprehensive, hot-pluggable software suite for the building, deployment, and management of a service-oriented architecture. This includes service-oriented application development, service-oriented applications and IT systems integration, and service-oriented management of business processes. It plugs in to a heterogeneous IT infrastructure and enables enterprises to adopt SOA incrementally.

These applications rely on hundreds of gigabytes or even terabytes of data, and they share one common factor: they need a well-designed recovery plan in case of disaster. Oracle and EMC deliver key solutions that address the protection of mission-critical applications.
Enterprise deployments need protection from unforeseen disasters and natural calamities. One protection solution involves setting up a disaster recovery (DR) site at a geographically different location from the production site. The DR site may have equal or fewer services and resources compared to the production site. Application data, metadata, configuration data, and security data are replicated to the DR site on a periodic basis. The DR site is normally in a passive mode; it is started when the production site is not available. This deployment model is sometimes referred to as an active/passive model. This model is normally adopted when the two sites are connected over a WAN and network latency does not allow clustering across the two sites.

Background
This white paper describes the setup and testing environment deployed to validate EMC RecoverPoint 3.1 with both Oracle Fusion Middleware and an Oracle database. This test configuration validates that the entire environment can be restored at a secondary site (DR site) in the event of a major production site failure. This test validates third-party replication using EMC RecoverPoint and Oracle Database Server as a joint solution that protects both database and non-database artifacts.

The white paper Disaster Recovery of Oracle Fusion Middleware with EMC RecoverPoint outlines an alternative solution approach where Oracle Data Guard is used to protect Oracle database files.

Introduction
This white paper is a follow-on paper to Disaster Recovery of Oracle Fusion Middleware with EMC RecoverPoint. It addresses and explains the benefits of using EMC RecoverPoint local and remote replication to provide operational and disaster recovery for the Oracle Fusion Middleware and database environments. RecoverPoint provides application-consistent recovery points that can be utilized in response to a number of possible scenarios, enhancing the native availability of Oracle environments.
Audience
The intended audience for this paper includes storage administrators and database administrators seeking to understand best practices for failover and recovery of both Oracle Database Server and Oracle Fusion Middleware environments. The reader will gain an understanding of the importance of using an integrated solution to ensure effective recovery of all the data files needed to restart not only Oracle databases, but also the applications they support. Detailed best practices are provided based upon EMC and Oracle joint testing, giving IT organizations guidelines for developing a complete disaster recovery solution specific to their business.

RecoverPoint
RecoverPoint is a proven technology for high-availability Oracle environments, providing both local and remote protection across SAN storage against many possible disaster scenarios. This type of environment provides resiliency against failures within the data center infrastructure. It can also help improve recovery from a regional disaster, with the added benefit of immediate application recovery.

Oracle products are inherently highly available and provide enterprise-class reliability without compromising security, performance, or scalability. To enhance the built-in availability features for Oracle, consider the following requirements for a data protection solution:
• Protection from infrastructure failure (such as a storage array or SAN switch)
• Protection from local or regional disaster
• Protection from data corruption

Many companies are deploying continuous data protection (CDP) as a way to meet their recovery time objectives (RTO) and recovery point objectives (RPO). A true CDP implementation ensures that all changes to an application’s data are tracked and retained consistently.
In effect, CDP creates an electronic journal of application data snapshots, one for every instant in time that a modification occurs.

Advantages of RecoverPoint
RecoverPoint CDP preserves a record of the write transactions that take place within its environment, providing crash-consistent or application-consistent recovery points. For local replication, RecoverPoint captures every write with I/O splitter technology (see the “RecoverPoint write splitters” section in the Appendix) and preserves the writes in a local journal. For remote replication, transactions are grouped based on user-specified policies, with significant write changes preserved in a journal at the DR site.

This preservation of writes ensures that if data is lost or corrupted, for example by a server failure, virus, Trojan horse, software error, or end-user error, it is always possible to recover a clean copy of the affected data. Another advantage of RecoverPoint is that this data recovery can be performed at either the local or the remote location. These recovery points can be accessed immediately and mounted back to production environments in seconds, much less time than is needed with disk-based snapshots, tape backups, or archives.

Local and remote recovery
EMC RecoverPoint is a comprehensive data-protection solution providing concurrent local and remote (CLR) data protection. CLR integrates continuous remote replication (CRR) and local replication (CDP), allowing users to recover applications to any point in time and protecting data against catastrophic events that can bring entire data centers to a standstill.

RecoverPoint delivers superior data protection by allowing both local and remote replication with no application performance degradation. As a result, organizations can deploy geographically dispersed data centers for maximum protection from local or regional failure or disaster.
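The idea of splitting every write into a journal and later recovering to a chosen point in time can be illustrated with a small model. The following is a minimal sketch only, not RecoverPoint's journal format or splitter implementation: the `Journal`, `JournalEntry`, and `Splitter` names are hypothetical, the volume is modeled as a dictionary of blocks, and time is a simple counter.

```python
from dataclasses import dataclass, field


@dataclass
class JournalEntry:
    timestamp: int   # logical time of the write (hypothetical counter)
    block: int       # block address on the protected volume
    data: bytes      # contents written at that time


@dataclass
class Journal:
    entries: list = field(default_factory=list)

    def record(self, timestamp, block, data):
        # Entries are appended in write order, so they stay sorted by time.
        self.entries.append(JournalEntry(timestamp, block, data))

    def image_at(self, point_in_time):
        """Rebuild the volume image as of point_in_time by replaying writes in order."""
        image = {}
        for entry in self.entries:
            if entry.timestamp > point_in_time:
                break
            image[entry.block] = entry.data
        return image


class Splitter:
    """Toy write splitter: each write goes to the volume and a copy goes to the journal."""

    def __init__(self, volume, journal):
        self.volume = volume
        self.journal = journal
        self.clock = 0

    def write(self, block, data):
        self.clock += 1
        self.volume[block] = data                      # original write
        self.journal.record(self.clock, block, data)   # journaled copy
        return self.clock


# Usage: a later write corrupts block 0, and the journal recovers the earlier image.
volume, journal = {}, Journal()
splitter = Splitter(volume, journal)
splitter.write(0, b"good")
t_clean = splitter.write(1, b"rows")
splitter.write(0, b"corrupt")
recovered = journal.image_at(t_clean)
```

In this model the production volume holds the corrupted block, while `recovered` holds the image as it was at `t_clean`. A real journal is bounded in size, so the recoverable window depends on journal capacity and write activity.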
RecoverPoint CDP, CRR, and CLR use the same write-splitting methods, journaling technologies, and appliance platform. All three make use of consistency groups, which let the user define sets of volumes for which the inter-write order must be retained during replication and recovery. This ensures that data at any point in time will be fully self-consistent.

Instead of compressing the data and sending it over an IP network to a remote volume, RecoverPoint CDP writes the data to a local journal and then to a local volume. As there is no IP network involved, and hence no latency concern, RecoverPoint CDP can synchronously track every write in the local journal and distribute the write to the target volume, all without impacting application-server performance.

RecoverPoint CRR transfers significant writes, based on an application’s RPO/RTO, to a remote site where they are saved in a history journal. Once the appliance receives a write, it bundles the write with others into a package. Redundant blocks are eliminated from the package, and the remaining writes are sequenced and stored with their corresponding timestamp and bookmark information. The package is then compressed, and an MD5 checksum is generated for the package. The package is then scheduled for delivery across the IP network to the remote appliance. Once the package is received, the remote appliance verifies the checksum to ensure the package was not corrupted in transmission. The data is then uncompressed and written to the journal volume. Once the data has been written to the journal volume, it is distributed to the remote volumes, ensuring that write-order sequence is preserved.

RecoverPoint CRR enables the user to perform backups at the remote site, eliminating the need to take the production file systems offline.
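The CRR packaging steps described above (bundling, elimination of redundant blocks, sequencing, compression, MD5 checksum, and verification at the remote side) can be sketched as follows. This is an illustration under stated assumptions rather than RecoverPoint's wire format: the `Write` record, the byte layout, and the function names are hypothetical, "redundant blocks" is interpreted here as earlier writes superseded by a later write to the same block, and `zlib` stands in for whatever compression the appliance uses.

```python
import hashlib
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    sequence: int   # position in the original write order
    block: int      # block address
    data: bytes     # payload


def build_package(writes):
    """Bundle writes: drop superseded blocks, keep write order, compress, checksum."""
    latest = {}
    for w in sorted(writes, key=lambda w: w.sequence):
        latest[w.block] = w   # a later write to the same block supersedes earlier ones
    ordered = sorted(latest.values(), key=lambda w: w.sequence)
    payload = b"".join(
        w.sequence.to_bytes(8, "big")
        + w.block.to_bytes(8, "big")
        + len(w.data).to_bytes(4, "big")
        + w.data
        for w in ordered
    )
    compressed = zlib.compress(payload)
    checksum = hashlib.md5(compressed).hexdigest()
    return compressed, checksum


def receive_package(compressed, checksum):
    """Verify the checksum, decompress, and return the writes in sequence order."""
    if hashlib.md5(compressed).hexdigest() != checksum:
        raise ValueError("package corrupted in transit")
    payload = zlib.decompress(compressed)
    writes, offset = [], 0
    while offset < len(payload):
        sequence = int.from_bytes(payload[offset:offset + 8], "big")
        block = int.from_bytes(payload[offset + 8:offset + 16], "big")
        length = int.from_bytes(payload[offset + 16:offset + 20], "big")
        data = payload[offset + 20:offset + 20 + length]
        writes.append(Write(sequence, block, data))
        offset += 20 + length
    return writes


# Usage: block 10 is written twice, so only its latest write travels in the package.
package, digest = build_package(
    [Write(1, 10, b"a"), Write(2, 11, b"b"), Write(3, 10, b"c")]
)
delivered = receive_package(package, digest)
```

The receiver applies `delivered` in sequence order, which is what preserves write-order fidelity in this toy model. MD5 is used only because the text names it; it detects accidental corruption and is not a security control.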
With RecoverPoint, data recovery can be performed locally and/or remotely by rewinding the target volumes back to a selected point in time, using earlier versions of data saved in the journal. In this validation, RecoverPoint CRR provided protection for Oracle Fusion Middleware data, and the testing validated that an entire environment could be restored at a secondary or disaster recovery site in the event of a major production site failure.

Replication modes
When transferring data from the local site to the remote site, the RecoverPoint system automatically switches between replication modes on the fly so that it always uses the method that best fits the current conditions, including the application load, throughput capacity, and replication policy. The modes are:
• Continuous synchronous mode
• Continuous asynchronous mode
• Snapshot mode

Regardless of the replication mode, RecoverPoint is unique in its ability to guarantee a consistent replica at the target side under all circumstances, and in its ability to retain write-order fidelity in multi-host heterogeneous SAN environments.

For more information regarding replication modes, refer to the EMC RecoverPoint Release 3.1 Administrator’s Guide.

Oracle database protection with RecoverPoint
EMC RecoverPoint supports both single-instance Oracle databases and Oracle Real Application Clusters (RAC), providing local and remote replication of the SAN-attached Oracle volumes. The Oracle LUNs are grouped into a single consistency group, and replication sets are created to map the production LUNs to the remote and/or local copy LUNs. RecoverPoint then processes the consistency group based on the type of recovery required.
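The relationship between a consistency group and its replication sets can be modeled as a small data structure. This is a minimal sketch for orientation only, not the RecoverPoint management API: the class names, field names, and LUN names below are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class ReplicationSet:
    """Maps one production LUN to its local and/or remote copy LUN."""
    production_lun: str
    local_copy_lun: Optional[str] = None
    remote_copy_lun: Optional[str] = None

    def __post_init__(self):
        if self.local_copy_lun is None and self.remote_copy_lun is None:
            raise ValueError(f"{self.production_lun}: needs a local and/or remote copy")


@dataclass
class ConsistencyGroup:
    """Logical grouping of replication sets that must stay write-order consistent."""
    name: str
    replication_sets: list = field(default_factory=list)

    def add(self, replication_set):
        if replication_set.production_lun in self.production_luns():
            raise ValueError(f"{replication_set.production_lun} is already in {self.name}")
        self.replication_sets.append(replication_set)

    def production_luns(self):
        return [rs.production_lun for rs in self.replication_sets]


# Usage with hypothetical LUN names: all database LUNs go into one group.
oracle_group = ConsistencyGroup("oracle_db")
oracle_group.add(ReplicationSet("prod_datafiles", remote_copy_lun="dr_datafiles"))
oracle_group.add(ReplicationSet("prod_redo", remote_copy_lun="dr_redo"))
```

Keeping every LUN of one database in the same group is what lets a single recovery point cover data files and redo logs together.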
The following are among the options Oracle provides for protecting its databases, which can be integrated with RecoverPoint replication.

• Application-consistent recovery from a shutdown (also known as “cold” backup)
  The user creates a consistency group that represents the Oracle application. The consistency group contains all volumes for the application, including data files, online redo log files, configuration files, and optionally control files. This method produces a copy from which you can restore the database with 100 percent reliability. Because normal operations must be halted while the “cold” backup is being created, this method is not appropriate for systems that must operate on a 24 x 7 basis. In addition, any changes protected by RecoverPoint that are made before or after the “cold” backup will not be available as an application-consistent recovery point, but they can be recovered as a crash-consistent recovery point (see below).
  When the database is shut down, the user will create a RecoverPoint bookmark for the specific consistency group, identifying the image as a “cold” backup image. This bookmark can be used to identify a point-in-time recovery image that represents a fully restorable and restartable Oracle database image.

• Crash-consistent recovery during operations
  This process enables the creation of crash-consistent images without requiring system shutdown. This process is performed by default by RecoverPoint for all applications as part of the RecoverPoint write-splitting operations. As a write is issued to the production volumes, RecoverPoint splitters intercept it and send a copy of the write to the RecoverPoint appliance for further processing. These captured writes represent the on-disk consistent data, which is the same data that remains on external storage even when an application crashes.
  When Oracle is restarted from a server crash, an instance must first open the database and then execute recovery operations. The instance automatically uses the redo log to recover the committed data in the database buffers that was lost when the instance failed. Oracle also undoes any transactions that were in progress on the failed instance when it crashed, then clears any locks held by the crashed instance after recovery is complete. When Oracle is restarted from a RecoverPoint crash-consistent image, it will perform the same recovery procedures.

• Application-consistent recovery during operation (also known as “hot” or “fuzzy” backup)
  This process enables the creation of application-consistent images without requiring system shutdown. All data files belonging to the relevant tablespaces, archive log files, and control files must be flushed from the server’s in-memory buffers to disk. To ensure that Oracle can recover from these images, Oracle must write additional information to the log file, information that is not required when crash-consistent images are sufficient.
  This feature in RecoverPoint requires that the user script several commands both to the Oracle Server and to the RecoverPoint appliance. The procedure entails placing the appropriate tablespace or database into backup mode (for example, ALTER TABLESPACE <tablespace_name> BEGIN BACKUP or ALTER DATABASE BEGIN BACKUP). While in “hot” backup mode, Oracle writes the entire block to the online redo logs whenever a database block is modified; in normal operations, only the changed bytes are written. In addition, the data file headers are not updated with the SCN when a checkpoint is performed.
  Once Oracle backup mode is set, the script creates a RecoverPoint bookmark for the specific consistency group to identify the image as an application-consistent image. Archived redo logs can be used against this image.
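The ordering of the “hot” backup script described above can be sketched as a small driver. This is a minimal sketch under stated assumptions, not a tested procedure: `run_sql` and `create_bookmark` are hypothetical callables that the administrator would supply (for example, wrappers around a SQL client and the RecoverPoint management interface), and the group and label names are placeholders. The END BACKUP and ARCHIVE LOG CURRENT steps are not spelled out in the text above; they are included here as the usual way to close a hot backup window.

```python
def hot_backup_bookmark(run_sql, create_bookmark, group, label):
    """Bookmark a consistency group while the database is in backup mode.

    run_sql(statement) and create_bookmark(group, label) are hypothetical
    callables supplied by the caller; this function only fixes the ordering.
    """
    run_sql("ALTER DATABASE BEGIN BACKUP")
    try:
        create_bookmark(group, label)
    finally:
        # Always leave backup mode, even if the bookmark step fails.
        run_sql("ALTER DATABASE END BACKUP")
    # Assumption: archive the current redo log so the redo generated during
    # the backup window is available as an archived log for recovery.
    run_sql("ALTER SYSTEM ARCHIVE LOG CURRENT")


# Usage with stand-in callables that only record the order of operations.
steps = []
hot_backup_bookmark(
    run_sql=steps.append,
    create_bookmark=lambda group, label: steps.append(f"bookmark {group} {label}"),
    group="db_group",
    label="hot_backup",
)
```

The `try`/`finally` is the main design point: a database left in backup mode keeps writing full block images to redo, so the sketch ends backup mode regardless of whether the bookmark succeeded.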
Federated environments and consistency groups
Federated environments are related applications that span multiple servers and storage arrays at the same site. Each application has its own RPO and RTO policies that govern the protection and recoverability of the application’s data. In RecoverPoint terminology, each application becomes a RecoverPoint consistency group with its own policy, journal, and replication set. For a successful recovery of the federated environment, the customer must ensure that the individual consistency groups have a common recovery point across all of the applications.

To get a common recovery point, the user creates a “group set” that defines the RPO across all of the RecoverPoint consistency groups that make up the federated environment. This is done through the use of a common bookmark in each journal for the consistency groups that are part of the “group set.”

When the user recovers each of the consistency groups that make up the federated environment and selects the common bookmark, all of the applications’ data is recovered to the exact same point in time. This enables the data to be used for such advanced features as building a testing and development environment, creating a federated backup image, data mining, and so forth.
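Selecting a recovery point that exists in every consistency group of a federated environment amounts to intersecting the bookmark names found in each journal. The following is a minimal sketch, assuming each journal can be summarized as a mapping from bookmark name to timestamp; the function name, group names, bookmark names, and timestamps are all hypothetical.

```python
def latest_common_bookmark(journals):
    """Return the newest bookmark name present in every group's journal, or None.

    journals: {group_name: {bookmark_name: timestamp}} (hypothetical summary
    of each consistency group's journal).
    """
    if not journals:
        return None
    per_group = list(journals.values())
    common = set(per_group[0])
    for bookmarks in per_group[1:]:
        common &= set(bookmarks)
    if not common:
        return None
    # Rank shared bookmarks by the latest time at which any group recorded them.
    return max(common, key=lambda name: max(b[name] for b in per_group))


# Usage: "app_only" exists in one group only, so it cannot be a federated recovery point.
journals = {
    "web_tier": {"nightly": 100, "pre_patch": 250},
    "app_tier": {"nightly": 100, "pre_patch": 251, "app_only": 300},
    "db_tier": {"nightly": 101, "pre_patch": 250},
}
recovery_point = latest_common_bookmark(journals)
```

Recovering every group to `recovery_point` gives the same logical point in time across tiers, which is the purpose of a parallel bookmark applied through a group set.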
Use of federated consistency groups provides the following:
• Allows application recovery to be tiered by service level
    - Multiple volumes per group
    - Mixed recovery point objectives within the same infrastructure
• Provides independent replication controls
    - Recover by group, either locally or remotely
    - Start/stop by group
• Enables grouping of optimization
    - Importance
    - Resource usage
    - Recovery point and recovery time objectives
• Each tier can have different service level agreements
    - Consistency groups per tier
    - Operational recovery of tier
• Enforces consistency across tiers
    - Federated environments
    - Recover to a known point for all applications
    - Disaster recovery for tier or application
    - Spans operating systems, applications, storage, and servers
• Enables advanced functions
    - Full environment cloning
    - Application upgrade testing
    - Data mining
    - Consistent production rebuild

The federated environment topology consists of three separate tiers: the web tier, the Oracle Fusion Middleware SOA Suite application tier, and the database tier. For successful recovery, consistency must be maintained between the middleware application tier and the database tier.

Use case objectives

| Objective | Details |
|---|---|
| Validate the test system architecture deploying key Oracle and EMC technologies. | The architecture consists of SAN-based storage with multiple Linux servers deployed in a production/DR site mode. |
| Utilize EMC RecoverPoint to protect all tiers, that is, web, Oracle Fusion Middleware, and database. | Describe how RecoverPoint is set up and administered to protect the Oracle WebLogic application server, configuration information, the WebLogic persistent stores (transaction logs), the SOA Suite binaries and configuration, and the Oracle database. |
| Ensure a very small to zero RTO and a very small RPO. | Detail how RecoverPoint CLR is used to create simultaneous local/remote replicas and bookmarks. |
| Provide no single point of failure. | Accomplish with clustered WebLogic Application servers, EMC storage, and highly available replication appliance (RPA) configurations. |
| Support the replication of WebLogic configuration and transaction logs (TLogs). | Achieve through EMC RecoverPoint CLR for local and remote replication. |
| Set up procedures to demonstrate planned switchover and switchback and unplanned failover and failback. | Show these procedures using EMC RecoverPoint. |
| Demonstrate ease of use in WebLogic and Oracle database environment failover and failback procedures. | Document these procedures with easy-to-follow failover and failback instructions. |
| Run the Oracle validation test suite, verifying failover data integrity. | Execute test program and procedures that validate RecoverPoint as a DR solution. |

This paper describes a test environment that is representative of DR protection of a production environment. The actual validation tests are not addressed here.

Oracle/EMC environment block diagram
Figure 1 depicts the major components and the relationships between the servers and applications on the primary/production and disaster recovery sites. The configuration consists of a single web host and two application servers with Oracle Fusion Middleware. The application servers are clustered using Oracle WebLogic Server clustering. EMC RecoverPoint protects all objects, including the Oracle Home binaries, Oracle Fusion Middleware, and the single-instance database.
[Figure: Clients connect to the Primary Site and the Standby Site. Labels at each site: Load Balancer, Web Host (Apache Server), App Host #1, App Host #2, Admin Server, Cluster, BPEL + ESB + OWSM, RecoverPoint Manager, RecoverPoint Appliance, Database. Links between sites: RecoverPoint (RP) CRR (Fusion Middleware & Logs), Data Guard Replication (Database).]
Figure 1. Topology diagram of an SOA Suite Application and RecoverPoint use case environment

RecoverPoint validation architecture
The RecoverPoint architecture for the middleware disaster recovery consists of four RecoverPoint appliances (RPAs) attached to both primary and secondary storage through Fibre Channel, with two paths to each appliance. The connections are managed through a Brocade DCX and 7600 switch. The appliances have an IP connection on a private subnet through a GigE D-Link switch to simulate a WAN link; this is in addition to the management IP connection that runs through a Cisco Catalyst switch. Both storage arrays, the EMC CLARiiON® CX4-480 and CX4-960, have the integrated RecoverPoint write-splitting technology. Each storage array has a storage group created specifically to mask the LUNs to the RecoverPoint appliances as well as the middleware hosts.

Figure 2. RecoverPoint use case architecture

All the hosts are zoned through the Brocade switches to the CX4-480 and CX4-960, using the switch explorer GUI. The RecoverPoint appliances were also zoned to the storage in the same manner. For each LUN masked to the RecoverPoint appliances, a replication set is created linking the source LUN to a target LUN.
These replication sets are all placed within the middleware consistency group. The CX splitter is integrated with each storage processor (SP) of the CLARiiON array; it sends one copy of each write to the storage array and the other to the RecoverPoint appliance.

The following types of storage volumes are required for RecoverPoint configuration:
• Repository volume: This volume holds the configuration and marking information during replication. At least one repository volume is required per site, and it must be accessible from all the RPAs at the site.
• Journal volume: This volume is used to store all the modifications. The application-specific bookmarks and timestamp details are written to the journal volume. The size of the journal depends on the write activity to the protected LUN(s) and the RPO required for the data. Best practices information on the sizing and configuration of journal volumes is available in the EMC RecoverPoint Installation Guide.
• Replication set: The association created between the source volume and the local and/or remote target volumes is called the replication set. A consistency group contains one or more replication sets.
• Consistency group: The logical grouping of replication sets identified for replication is called a consistency group. Consistency groups ensure that the updates to the associated volumes are always consistent, with write order preserved, and that they can be used to restore the database to any point in time.

The following shows the Middleware consistency group and its assigned replication sets from the RecoverPoint Management Application GUI. The names of the volumes in the Middleware and DR columns are derived from the “friendly” names of the LUNs returned by the CLARiiON array during the SCSI discovery operations. The number in parentheses is the CX LUN ID number.

Figure 3.
Consistency group and replication set definitions

The following RecoverPoint Management Application GUI screenshot shows the Middleware consistency group storage volumes replicated by RecoverPoint (defined by replication sets).

Figure 4. Storage volumes in replication sets

CLARiiON array-side preparations
The integrated RecoverPoint write-splitting technology is enabled on each storage processor (SP) of the CLARiiON array. The integrated write-splitter will intercept every write to a protected LUN and will send a copy of the write to the RecoverPoint appliance and send the original to the protected LUN. This setup has a CX4-480 as primary storage and a CX4-960 at the DR site. RecoverPoint 3.1 is installed and configured to utilize the integrated RecoverPoint write-splitting technology in the CX4 arrays.

System configuration and storage

Hardware configuration

Table 1. Server types used

| Dell PowerEdge 2950 | Dell PowerEdge 900 |
|---|---|
| 2 Quad-Core CPUs, 2.50 GHz, 16 GB RAM, Brocade 8 GB HBAs | 4 Quad-Core CPUs, 2.93 GHz, 32 GB RAM, Brocade 8 GB HBAs |

Table 2. Fibre Channel switch types

| Primary | DR |
|---|---|
| Brocade DCX | Brocade 7600 |

Table 3. Storage types used

| Primary | DR |
|---|---|
| CLARiiON CX4-480 FLARE 28 | CLARiiON CX4-960 FLARE 28 |

Table 4.
Server functions and types
Server function | Server type
Production Site – Apache Web Server | Dell 2950
Production Site – Oracle WebLogic Server Cluster 1 | Dell 2950
Production Site – Oracle WebLogic Server Cluster 2 | Dell 2950
Production Site – Oracle Database Server | Dell 900
DR Site – Apache Web Server | Dell 2950
DR Site – Oracle WebLogic Server Cluster 1 | Dell 2950
DR Site – Oracle WebLogic Server Cluster 2 | Dell 2950
DR Site – Oracle Database Server | Dell 900

Software configuration

Table 5. Software versions and configuration
Software Version Configuration
OEL Enterprise Linux Server (Carthage) Release 5.2 x86_64
EMC PowerPath® 5.1 SP 2
EMC RecoverPoint CLR 3.1
Oracle Database 10g Release 2 10.2.0.1 Linux X86_64
Oracle Database Server Patch Set 10.2.0.4 Patch Set 6810189 Linux X86_64
Oracle Enterprise Manager 10g Grid Control 10.2.0.4
Oracle Enterprise Manager 10g Grid Control Agent 10.2.0.4 Linux x86_64
Oracle WebLogic Server 9.2 MP3 Linux x86
Oracle Application Server SOA Suite 10g Release 3 10.1.3.1 Linux x86
Oracle Application Server Patch Set 10.1.3.4 Patch Set 7272722
Oracle SOA Suite 10.1.3.4 for WebLogic Server 9.2 MP3 Patch Patch 7490612
Oracle Patch – Resolve Prereq Installation Issue 10.1.3.1
Apache HTTP Server 2.2.3
Oracle JDeveloper 10g Release 3 Studio Version 10.1.3
Apache ANT 1.7.1
Patch 633905 Linux x86_64
Windows XP – Service Pack 2

Oracle Fusion Middleware, web tier, and Database volume layout

Table 6.
Volume layout
Volume name | Tier | Size | Mounted on nodes | Mount point | Notes/Comments
VolWeb | Web Tier | 20G | webhost | /u01/app/oracle | Volume for Apache Install
VolAdmin | App Tier | 20G | apphost1 | /u01/app/oracle/wls/soaDomain/admin | Volume for Admin server Instances
VolWLS1 | App Tier | 20G | apphost1 | /u01/app/oracle/wls/soaDomain/mng1 | Volume for Managed Server Instance
VolWLS2 | App Tier | 20G | apphost2 | /u01/app/oracle/wls/soaDomain/mng2 | Volume for Managed Server Instance
VolData | App Tier | 20G | apphost1, apphost2 | /u01/app/oracle/data | Volume for TLogs and JMS Data
VolOrcl1 | App Tier | 20G | apphost1 | /u01/app/oracle/product | Volume for binaries, both Oracle and WebLogic
VolOrcl1 | App Tier | 20G | apphost2 | /u01/app/oracle/product | Volume for binaries, both Oracle and WebLogic
Oradata_1 | DB Tier | 100G | dbhost | /u01/app/oracle | Volume for Oracle binaries and Flash Recovery Area
Oradata_2 | DB Tier | 100G | dbhost | /u01/oradata/orclsoa/cg1_dbf_undo | Database Files and Undo
Oradata_3 | DB Tier | 50G | dbhost | /u01/oradata/orclsoa/cg2_redo | Database Online Redo Logs
Oradata_4 | DB Tier | 100G | dbhost | /u01/oradata/orclsoa/cg3_arch_ctl | Database Archive Logs and Controlfiles
Journal | RP | 50G | RP appliance | Raw Volume - Fusion Middleware Consistency Group Production | RecoverPoint metadata
Journal | RP | 50G | RP appliance | Raw Volume - Fusion Middleware Consistency Group CRR Replica | RecoverPoint metadata
Journal | RP | 50G | RP appliance | Raw Volume - Database Consistency Group, cg1_dbf_undo Production | RecoverPoint metadata
Journal | RP | 50G | RP appliance | Raw Volume - Database Consistency Group, cg1_dbf_undo CRR Replica | RecoverPoint metadata
Journal | RP | 50G | RP appliance | Raw Volume - Database Consistency Group, cg2_redo Production | RecoverPoint metadata
Journal | RP | 50G | RP appliance | Raw Volume - Database Consistency Group, cg2_redo CRR Replica | RecoverPoint metadata
Journal | RP | 50G | RP appliance | Raw Volume - Database Consistency Group, cg3_arch_ctl Production | RecoverPoint metadata
Journal | RP | 50G | RP appliance | Raw Volume - Database Consistency Group, cg3_arch_ctl CRR Replica | RecoverPoint metadata

Please note that the VolData volume is mounted simultaneously on both application hosts. This volume contains the WebLogic persistent file-based store. File-based stores must be configured on shared storage. The OCFS2 clustered file system is configured for the shared storage.

All server root volumes are installed on local disk. Although not used in this validation, RecoverPoint does support the replication of root volumes installed in the SAN (boot from SAN).

RecoverPoint CRR consistency groups

For this use case, three additional consistency groups were created for the database tier. The Oracle Fusion Middleware tier consists of one consistency group, as previously configured. The white paper Disaster Recovery of Oracle Fusion Middleware with EMC RecoverPoint has further detail.

The Oracle Fusion Middleware has two types of data that are being replicated. The first is the SOA Suite, WebLogic binaries, and configuration files. The second is the persistent file-based store used for the Java Message Services (JMS) and Transaction Logs (TLogs). The persistent file-based store is used to recover transactions.

The RecoverPoint consistency groups used for the database tier will be provisioned for 1) data files and undo logs, 2) online redo logs, and 3) archived redo logs plus control files. This supports the various disaster recovery strategies an organization's business needs may call for, that is, application-consistent recovery, crash-consistent recovery, and crash-and-application-consistent recovery.

The online redo logs change frequently.
In a RecoverPoint configuration, placing the online redo logs in a separate, lower-priority consistency group and giving priority to the data and archive redo log replication sets (crash-consistent strategy) may improve replication performance and slightly decrease the time between snapshots (point-in-time recovery objective). The actual impact for all disaster recovery strategies depends on the bandwidth and the percentage of writes being sent over the link to the replica site. Please refer to the Replicating Oracle with EMC RecoverPoint Technical Notes.

Each consistency group has settings, such as the consistency group name, preferred RPA, and reservation support, and policies, such as compression, bandwidth limits, and maximum lag, which govern the replication process. In the configuration described above, the maximum lag will need to be defined. This is the maximum offset between writing data to production storage and writing it to the RPA or journal at the replication site. The EMC RecoverPoint Release 3.1 Administrator’s Guide has further detail on the settings and policies.

RecoverPoint group sets and parallel bookmarks

RecoverPoint group sets are used throughout a federated environment. Federated environments are related applications that may span multiple servers and storage arrays. Each application has its own RPO and RTO policies that govern the protection and recoverability of the application’s data.

After creating each consistency group with its appropriate settings and policies, create a group set to ensure that the individual consistency groups have a common consistency point across all of the applications. The group set allows you to automatically bookmark a set of consistency groups so that the bookmark represents the same recovery point of each consistency group in the group set.
This allows you to define consistent recovery points for consistency groups that are distributed across different RPAs.

The automatic periodic bookmark consists of the name you specified for the group set and an automatically incremented number. Numbers start at 0, are incremented up to 65535, and then begin again at 0. The same bookmark name is used across all consistency groups. To apply automatic bookmarks, the sources must be at the same site (replicating in the same direction) and transfer must be enabled for each consistency group included in the group set.

The Group Set Details dialog box in the RecoverPoint Management Application allows you to create, edit, or remove group sets. When creating the group set, enter a name for the automatic bookmarks. Select the consistency groups to be in the group set, and specify the bookmarking frequency. This will enable the parallel bookmarks. It is recommended that the interval between automatic bookmarks not be less than 30 seconds.

If you prefer to take a bookmark at a specific time other than the automated times, or choose to take the parallel bookmarks manually, this can also be done through the RecoverPoint Management Application. In the Navigation pane, select consistency groups. In the Component pane, select all the consistency groups to bookmark simultaneously. All selected consistency groups must be enabled and transfer must be active. Click the parallel bookmarks icon in the upper right corner of the Component pane. When prompted, enter the name of the bookmark. Do not use the name “latest”, as it is a reserved word in RecoverPoint.

The group set can be perceived as a single entity, but each consistency group functions as a separate unit within the group set. The user recovers each of the consistency groups that make up the federated environment and selects the common bookmark.
This will ensure all of the application’s data is recovered to the exact same point in time. This can be executed through the RecoverPoint Management Application or using the command line interface (CLI).

Maintaining the autonomy of the consistency group in a group set provides the flexibility for each application to maintain its own policies and settings to govern the protection and recoverability of the application’s data. It enables the data to be used for such advanced features as building a testing and development environment, creating a federated backup image, and data mining.

OCFS2 configuration

High availability for persistent stores

The WebLogic application servers are usually clustered for high availability. For the local site high availability of the SOA Suite topology, a persistent file-based store is used for the Java Message Services (JMS) and Transaction Logs (TLogs). This file store needs to reside on shared disk that is accessible by all members of the cluster, that is, by apphost1 and apphost2. The persistent file-based store can be migrated along with its parent server as part of the whole server migration feature that provides both automatic and manual migration.

There are two methods for configuring a synchronous write policy: the Cache-Flush and the Direct-Write policies. The Cache-Flush policy improves performance, but the downside is the possibility of losing sent messages or generating duplicate messages in the event of an operating system crash or hardware failure. This is because transactions are considered complete as soon as the writes are cached in memory, instead of waiting for acknowledgement that the writes are written to disk.

Use high-availability storage for state data

The server migration process moves or “migrates” services. Some state information associated with the work in process at the time of failure is persisted to storage.
To ensure high availability, it is critical that such state information remains available to the server instance and the services it hosts after migration. It should be stored in a shared storage system that is accessible to any potential machine to which a failed migratable server might be migrated. For highest reliability, use a shared storage array solution (like EMC CLARiiON) that is itself highly available and a SAN designed for high availability.

If independent local file systems resided on a shared LUN, there would be no means of cache synchronization, and the file systems would eventually corrupt each other. Therefore, the shared storage solution in a SAN as deployed in this white paper uses the Oracle Cluster File System (OCFS2) as the host-based clustered or shared file system technology. OCFS2 is a symmetric shared disk cluster file system that allows each node to read and write both metadata and data directly to the SAN. OCFS2 is used here, but any shared file system technology can be used, such as Red Hat's GFS, Veritas VxCFS, or IBM GPFS.

Network configuration

The EMC RecoverPoint system has been designed as a secure platform for CDP (local), CRR (remote), and CLR (local and remote) replication. EMC has invested in ensuring security for all aspects of its RecoverPoint system, including the operating system, networking, and RecoverPoint software.
Security settings are divided into the following categories:
• RecoverPoint appliance (RPA) operating system and networking
• Access control settings to limit access by end users or by external product components
• Log settings related to the logging of events
• Communication security settings related to security for the product network communications
• Data security settings available to ensure protection of the data handled by the product

Secure serviceability is available to ensure control of service operations performed on the products by EMC or its service partners. Other security considerations list known issues, including false positive findings that appear when scanning the product for vulnerabilities.

For further discussion and implementation detail, please refer to the EMC RecoverPoint Release 3.1 and Service Pack Releases Security Configuration Guide.

Planning for disasters and planned downtime

Initiate the replication process

RecoverPoint is an intelligent data protection and recovery solution that runs on out-of-band appliances attached to the SAN and the IP network. RecoverPoint uses your enterprise’s existing network and storage systems. To prepare for installation, you should be familiar with how the RecoverPoint system integrates with your existing systems.

In preparation for RecoverPoint installation:
1. The RPAs connect to the hosts and storage subsystems using a Fibre Channel SAN. Before installing RecoverPoint, these subsystems should be in place.
2. RPAs are linked to the WAN interface using an Ethernet/IP connection (eth0).
3. Before you install RPAs and define consistency groups, ensure that sufficient volumes are available on the SAN-attached storage at each site for use by RecoverPoint. As part of the installation process, you must create a repository volume and journal volumes at both the primary and secondary sites.
The journal is maintained on one or more SAN-attached storage volumes for each copy of a consistency group; that is, the production copy, together with a local copy or a remote copy, or both. To configure a consistency group, you must designate one or more volumes on the production storage to be replicated, together with corresponding volumes on the copy (or copies). RecoverPoint operation requires proper storage mapping (LUN masking) configuration.
4. The RecoverPoint system uses the Network Time Protocol (NTP) to synchronize the clocks across all of the machines that are attached to a given installation. It is highly recommended that you configure an external NTP server, which runs on Linux machines, for use by the RecoverPoint system prior to RecoverPoint installation.
5. Zoning on the Fibre Channel switch must be determined. It allows for communication between the host, storage, and appliances. Zoning instructions vary according to the type of HBA ports installed in your RPAs. Without the proper zoning, replication cannot be performed properly.
6. The RPAs should be installed independently at the primary and secondary sites.

RecoverPoint software installation and configuration

RecoverPoint is installed with a GUI wizard that helps you install new RecoverPoint clusters from a single management point. At the end of the wizard, you will be able to start building a replication configuration.

Before you begin using the RecoverPoint Installer, ensure you have completed the following operations:
1. Identified the type of installation (one site or two sites).
2. Confirmed that all RPAs are available at the site(s).
3. Confirmed that at least two ports are available on each of two fabrics for each RPA.
4. Completed the integration with installed networks and storage.
5. Completed the Customer Site Planning Sheet.
6. Unpacked and physically installed the RPAs.
7.
Assigned Management IP addresses for the RPAs (done when logging in as the user “boxmgmt” with the password “boxmgmt”). The IP addresses are used by the RecoverPoint Installer to complete the installation.

For detailed information, please refer to the EMC RecoverPoint Release 3.1 Installation Guide. After completing system installation for the first time and verifying that all clocks are synchronized, verify that you can access the management interface using both the RecoverPoint Management Application GUI and the CLI.

Starting replication

It is assumed that host-based, fabric-based, and array-based splitters have been installed as needed. If not, please refer to the EMC RecoverPoint Release 3.1 Installation Guide.

Before you begin, add splitters to the RecoverPoint system.

Create the consistency groups using the Add New Group Wizard. Define the settings and policies for the consistency group.

Configure the copies. Enter the values for the General Settings, Protection Settings, and Advanced Settings for the production, local, and remote replicas. If you are using the Create New Consistency Group Wizard, the Production Copy Settings dialog box appears.

In the Replication Set and Journal Volumes Configuration dialog box, click the Add New Replication Set button to define a new replication set. A replication set consists of a source (normally production) volume and a corresponding volume for each replica. Each consistency group contains as many replication sets as there are source volumes in the consistency group.

Add the volumes to the splitters that have been installed. When the volumes are added, they will be automatically attached to all of the splitters that have access to that volume.

Now that the consistency groups and their journals are defined and configured, splitters have been added, and volumes attached to splitters, enable the groups.
Enabling the groups will initiate the transfer (replication and synchronization) of the production volumes to the replica volumes at the disaster recovery site.

When a consistency group is initialized for the first time, the RecoverPoint system must complete full synchronization of all designated volumes. The volumes at the local and remote site can be initialized while host applications are running. Alternatively, the current production middleware installation can be backed up and manually transferred to the remote site. This is possible because the RecoverPoint system can efficiently determine which blocks are different between the production and replica copies. It sends only the data for those blocks to the replica storage, as the initialization snapshot.

For detailed procedures and further explanation, please refer to the EMC RecoverPoint Release 3.1 Administrator’s Guide.

Switchover procedures

Switchovers are planned operations, performed for testing and validation, that result in a smooth transition of services and applications from one site to another. A trigger may be a maintenance window or a regulatory requirement to validate disaster recovery functionality. Switchovers can provide a mechanism to establish, test, and prove SLA, RPO, and RTO requirements.

During a switchover, the current production site becomes the disaster recovery site, and the disaster recovery site becomes the current production site. The initial state of the production site is presumed to be up and functioning. The procedures begin with the shutdown of the production site.

The procedures to execute the switchover are as follows.
1. The decision is made to initiate switchover to the disaster recovery site, and all participants in this process are advised.
2. Shut down the Oracle Fusion Middleware, that is, SOA Suite and WebLogic.
a) Shut down the WebLogic Managed Servers. Log in to the WebLogic Console. Choose SOADomain > Control. Select the managed servers, and click the Shutdown tab.
Choose the option Force Shutdown Now.
b) Shut down the WebLogic processes on the application servers. Log in to the WebLogic Administration Server application host. Using the command line, manually shut down the Node Manager process first and then shut down the Administration Server process. Log in to the other nodes in the cluster and manually shut down the Node Manager process. In this configuration, the WebLogic Administration Server is only configured on one server.
3. Log in to the webhost server(s). Shut down the Apache httpd processes.
4. Unmount the file systems on all of the application and webhost servers.
5. Log in to the database server.
a) Cleanly shut down the database and Listener. If the database is managed by Oracle Enterprise Manager 10g Grid Control, then log in to Oracle Grid Control and shut down the database and Listener. If not, use any variation of the standard Oracle commands, that is, “shutdown immediate” and “lsnrctl stop”.
b) Unmount the database file systems.
6. If necessary, perform any network or DNS updates, or modifications to /etc/hosts, at the disaster recovery site.
7. Log in to the RecoverPoint Management Application.
a) Confirm current image selection on the production site: All applications are now shut down. Take a parallel bookmark of the production site. The parallel bookmark is needed to identify the point across all of the consistency groups that represents the point in time at which the shutdown completed. It causes any data held at the production site to be flushed to the remote site. Therefore, if a recovery of the production site is needed prior to switchover, the parallel bookmark is guaranteed to have the latest data prior to the switchover.
For detailed procedures to create the parallel bookmark, please refer to the EMC RecoverPoint 3.1 Administrator’s Guide, in the section “Applying bookmarks to multiple groups simultaneously.” A journal entry reflecting the parallel bookmark will be entered into each consistency group’s journal. In this configuration, there are five consistency groups, and therefore five journal entries. To view the bookmark for each consistency group, in the Navigation pane, click on the consistency group name. Then choose the production site and click the Journal tab. When you see the journal entry for the bookmark, you can be assured that all the data has flowed through the pipe and has been replicated to the DR site.
b) On the DR site, for each consistency group in the group set: Enable access to the replicated images and applications at the DR site. Click on the first icon under the name of the DR site. Choose Enable Image Access. Choose Select an image from the list. From the list that appears, select the named parallel bookmark. Choose the type of access mode. The production site will be recovered from the DR site. Therefore, select Virtual access with instantaneous access to the image, and Roll image in the background to enable the recovery process. Confirm the current image selection.
8. Log in to the database server.
a) Mount the database file systems.
b) Start the database and Listener. If the database is managed by Oracle Enterprise Manager 10g Grid Control, then log in to Oracle Grid Control and start the database and Listener. If not, use any variation of the standard Oracle commands, that is, “startup” and “lsnrctl start”.
9. Mount the file systems on all the application and webhost servers.
10. Log in to the webhost server(s). Start the Apache httpd processes.
11. Start the Oracle Fusion Middleware, that is, SOA Suite and WebLogic.
a) Start the WebLogic processes on the application servers. Log in to the WebLogic Administration Server application host. Using the command line, manually start the Administration Server process first and then start the Node Manager process. Log in to the other nodes in the cluster and manually start the Node Manager process. In this configuration, the WebLogic Administration Server is only configured on one server.
b) Start the WebLogic Managed Servers. Log in to the WebLogic Console. Choose SOADomain > Control. Select the managed servers, and click the Start tab.
12. The site is ready to perform work. Applications are started, and the database has been switched over.

Switchback procedures

Testing and validation of the disaster recovery plan have been completed. SLA, RPO, and RTO requirements have been verified, and compliance with regulatory requirements has been met and proven. Planned maintenance is done. The next tasks are to prove the procedures to switch back from the disaster recovery site to the primary site and, in this test scenario, to preserve all changes and propagate them to the primary site.

During a switchback, the disaster recovery site, now acting as the primary site, returns to its initial function as the DR site. The initial state of the DR site is up and functioning. The production site is down. We begin by shutting down all processes on the DR site in preparation to recover the production site from the DR site images with RecoverPoint.

The procedures to execute the switchback are as follows.
1. The decision is made to initiate switchback to the production site, and all participants in this process are advised.
2. Shut down the Oracle Fusion Middleware, that is, SOA Suite and WebLogic.
a) Shut down the WebLogic Managed Servers. Log in to the WebLogic Console. Choose SOA Domain > Control. Select the managed servers, and click the Shutdown tab. Choose the option Force Shutdown Now.
b) Shut down the WebLogic processes on the application servers.
Log in to the WebLogic Administration Server application host. Using the command line, manually shut down the Node Manager process first and then shut down the Administration Server process. Log in to the other nodes in the cluster and manually shut down the Node Manager process. In this configuration, the WebLogic Administration Server is only configured on one server.
3. Log in to the webhost server(s). Shut down the Apache httpd processes.
4. Unmount the file systems on all of the application and webhost servers.
5. Log in to the database server.
a) Cleanly shut down the database and Listener. If the database is managed by Oracle Enterprise Manager 10g Grid Control, then log in to Oracle Grid Control and shut down the database and Listener. If not, use any variation of the standard Oracle commands, that is, “shutdown immediate” and “lsnrctl stop”.
b) Unmount the database file systems.
6. If necessary, perform any network or DNS updates, or modifications to /etc/hosts, at the disaster recovery site.
7. Log in to the RecoverPoint Management Application.
a) On the DR site: For each consistency group in the group set, to switch back to the production site, choose the option Recover production from the DR site. RecoverPoint will automatically take a snapshot of the latest image. The journal entry for the bookmark will be referred to as the “Pre-replication Image”. This is the image restored on the production site. When the Recover production option is chosen, several prompts will ask you to confirm that you want to perform the production restore and that you want to continue. Respond Yes to these questions. During the initial phases of transfer, the connection will be paused for reconfiguration. This can be seen in the Component pane of the RecoverPoint Management Application.
When the value Transfer changes from Paused to Active, you can move to the production site to complete the switchback recovery.
b) On the production site: Replication will continue from the DR site to the production site until all data is transferred. The time needed to complete the transfer depends on the amount of changes and the available bandwidth. In the Component pane for the selected consistency group, the image status of the production site will be “Distributing Pre-replication image” and the role will be “Production (being restored)”. When the Pre-replication image is recorded on the production site, all data has been replicated and syncing is complete. To verify the status of the transfer, click on the entry for the production site under each consistency group. Choose the Journal tab. The bookmark entry will indicate the replication status. When it states “Synchronization completed (Primary)”, replication is complete and you can continue with the next step.
For each consistency group, enable image access. Choose logged access (physical) when recovering the site on a permanent basis. Virtual images are temporary, as their name indicates. After confirming the image, choose Resume production for each consistency group on the production site. You will be asked whether you want to continue with this action; respond Yes. There will be a pause while reconfiguration occurs.
8. When “Resume production” is completed at the production site, note that the direction of the replication flow reverts, again originating from the production site and going to the DR site for each consistency group. The volumes are now visible on the production site and are mountable as read/write. The RecoverPoint splitter is enabled, and all new writes will be sent to the appliance.
Although not required, it is a best practice to create a parallel bookmark for the group set. This should be done before restarting the applications on the production site. It provides an audit trail and a bookmark of all applications at a consistent point in time, in case there is an unexpected situation and you must again switch over immediately to the DR site.
Log in to the database server.
a) Mount the database file systems.
b) Start the database and Listener. If the database is managed by Oracle Enterprise Manager 10g Grid Control, then log in to Oracle Grid Control and start the database and Listener. If not, use any variation of the standard Oracle commands, that is, “startup” and “lsnrctl start”.
9. Mount the file systems on all the application and webhost servers.
10. Log in to the webhost server(s). Start the Apache httpd processes.
11. Start the Oracle Fusion Middleware, that is, SOA Suite and WebLogic.
a) Start the WebLogic processes on the application servers. Log in to the WebLogic Administration Server application host. Using the command line, manually start the Administration Server process first and then start the Node Manager process. Log in to the other nodes in the cluster and manually start the Node Manager process. In this configuration, the WebLogic Administration Server is only configured on one server.
b) Start the WebLogic Managed Servers. Log in to the WebLogic Console. Choose SOA Domain > Control. Select the managed servers, and click the Start tab.
12. The production site is recovered. Data is being replicated to the DR site.

Failover procedures

The extent of the disaster and the anticipated length of time at the disaster recovery site will directly affect which types of failover procedures are required. If the failover was due to a catastrophe, such as fire, flooding, or earthquake, then migration to the disaster recovery site is the likely scenario. In this case, the personality of the sites changes.
The disaster recovery site becomes the permanent production site until the previous production site is rebuilt or repaired. With this type of failover, when the production site becomes available, a resynchronization of all data and applications will be executed. For the RecoverPoint procedures to execute a migration, please refer to the EMC RecoverPoint 3.1 Administrator’s Guide.

The following scenario addresses failover due to a temporary loss of the site, unplanned downtime, or a system crash. The procedures are very similar to those detailed in the previous “Switchover procedures” and “Switchback procedures” sections. The major difference is that we begin with no access to the production site, and no ability to gracefully shut down all applications and cleanly switch over the database.

The procedures to execute the failover are as follows:
1. Log in to the RecoverPoint Management Application. Enable access to the replicated images, as per step 7b in the “Switchover procedures” section that starts on page 22. The wizard will ask which image to choose. For this exercise, we choose the named parallel bookmark image taken prior to switchover from the list of bookmarks.
2. Start the database and Listener. If the database is managed by Oracle Enterprise Manager 10g Grid Control, then log in to Oracle Grid Control and start the database and Listener. If not, use any variation of the standard Oracle commands, that is, “startup” and “lsnrctl start”.
3. Continue with steps 9 through 11 of the “Switchover procedures” section, that is, mount and start the Apache HTTP application.
4. Perform any network or DNS updates, or modifications to /etc/hosts, if necessary at the disaster recovery site.
5. The site is ready for work.
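
The failover steps above can be written down as an ordered runbook so that the sequence is checked rather than remembered. The following is a minimal sketch in Python, not part of the validated procedure: the `Step` type, the `failover_runbook` function, and the wording of each action are hypothetical, and only the “startup” and “lsnrctl start” commands come from this paper. It only describes the steps; it does not call RecoverPoint or Oracle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One runbook entry (hypothetical structure, for illustration only)."""
    order: int
    where: str
    action: str


def failover_runbook(db_managed_by_grid_control: bool) -> list[Step]:
    """Return the failover steps in the order given in this section."""
    if db_managed_by_grid_control:
        db_action = "Start the database and Listener from Oracle Grid Control"
    else:
        # Standard Oracle commands named in the paper.
        db_action = 'Run "startup", then "lsnrctl start"'
    return [
        Step(1, "RecoverPoint Management Application",
             "Enable image access on the named parallel bookmark"),
        Step(2, "database server", db_action),
        Step(3, "application and webhost servers",
             "Mount file systems and start the applications "
             "(Switchover steps 9 through 11)"),
        Step(4, "disaster recovery site network",
             "Update network, DNS, or /etc/hosts if necessary"),
        Step(5, "disaster recovery site", "Site is ready for work"),
    ]


if __name__ == "__main__":
    for step in failover_runbook(db_managed_by_grid_control=False):
        print(f"{step.order}. [{step.where}] {step.action}")
```

Keeping the database step ahead of the application step mirrors the order in the procedure, where the database and Listener are started before the web and middleware tiers.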
Failback procedures

The systems on the production site are now available and a decision is made to return to the production site. The applications and database at the disaster recovery site must be shut down, failed back, or re-instantiated and restarted.

The procedures to execute the failback are as follows:
1. Shut down the WebLogic Managed Servers, WebLogic processes, and Apache httpd services, and unmount the file systems, as per steps 2 through 4 of the “Switchback procedures” section of this paper starting on page 24.
2. Cleanly shut down the database and Listener. If the database is managed by Oracle Enterprise Manager 10g Grid Control, then log in to Oracle Grid Control and shut down the database and Listener. If not, use any variation of the standard Oracle commands, that is, “shutdown immediate” and “lsnrctl stop”. Unmount the database file systems.
3. In this implementation, the failover was temporary. Therefore, to fail back, log in to the RecoverPoint Management Application and repeat steps 7a and 7b of the “Switchback procedures” section.
4. There is no direct relationship between having to re-instantiate or recover the database and RecoverPoint. Depending on the bandwidth and the size of the database, it may be better to recover from backup and then synchronize only the changes with RecoverPoint. To implement this procedure, please refer to the EMC RecoverPoint Release 3.1 Administrator’s Guide. If reinstantiation is not required, mount the database file systems and start the database and Listener. If the database is managed by Oracle Enterprise Manager 10g Grid Control, then log in to Oracle Grid Control and start the database and Listener. If not, use any variation of the standard Oracle commands, that is, “startup” and “lsnrctl start”.
5. After completion of these steps, the original production site will be available to resume work. To start the applications, repeat steps 9 through 11 of the “Switchover procedures” section.
6.
Perform any network or DNS updates, or modifications to /etc/hosts, if necessary at the production site. The production site is recovered. Data is being replicated to the DR site.

General recommendations

Setting snapshots or manual bookmarks based on requirements

The RPO policy for the Oracle Fusion Middleware binaries has been set based on size, number of writes, and time, with a configured lag time of 12 hours. The lag time of 12 hours was chosen because binaries are expected to change rarely. The way to think about this policy is that you are guaranteed a recovery point within 12 hours.

Suppose the binaries are scheduled to be patched. We would like to ensure that the patched binaries are replicated immediately after the patch is applied, rather than waiting for the configured lag time to elapse, so that if an event causes failover to the DR site, the correct version of the code is available there. To ensure the configuration changes are replicated, a bookmark is created. A bookmark is a named snapshot (image) that uniquely identifies the image at a point in time. The bookmark creates a transactionally consistent snapshot that is transferred to the replica. When the bookmark is created, it forces all data held at the production side in the RPA’s buffers to be flushed to the remote site. At the remote site, a corresponding bookmark is recorded when replication is completed.

In this situation, it is best practice to create a bookmark prior to patching the binaries. If a recovery of the binaries prior to patching is needed, the “pre-patch” bookmark will guarantee the latest data prior to the patch. After patching the binaries, create a “post-patch” bookmark, so a recovery using this bookmark is guaranteed to have the latest data image after the patch.
By creating both a “pre-patch” and a “post-patch” bookmark, it becomes simple to choose either bookmark, depending on the point-in-time recovery required.

To create the bookmark, in the Navigation pane of the RecoverPoint Management Application, select the consistency group. Verify on the Status tab that it is active. Click the Bookmark button and enter a descriptive name. Image snapshots will continue independently of the manually initiated bookmark, per the configured RPO policy.

Periodic DR testing

To satisfy governance and regulatory requirements, and to ensure that failover to the disaster recovery site will be successful in times of catastrophe, it is good practice to verify that the replicas can be used to restore data, recover from a disaster, or seamlessly take over. In most cases, while testing a replica, applications can continue to run on the production servers, and replication can continue. The writes will be stored in the replica journal until testing is completed. Upon completion of testing, write access to the replica is disabled, which results in any writes made during testing being rolled back by RecoverPoint. However, any concurrent writes from the production applications will be automatically distributed from the journal to the replica. This entire process can be completed without application downtime and without loss of data at the replica.

In the RecoverPoint Management Application at the DR site:
1. From the Image Access menu, select Enable Image Access. If you are only testing the image, and do not expect a high rate of modification or change to the data, selecting Virtual Image Access without Roll Image in Background is appropriate. If you expect to do more testing or forensics over an extended period of time, or need maximum performance while testing, select Logged Image Access (physical). Virtual access is instantaneous, while physical access will require more time before the image is available for use.
For further discussion of the differences in the image access modes, please refer to the “Appendix” section of this paper.
2. At the host, mount the replica volume you wish to access. If the volume is in a volume group managed by a logical volume manager, import the volume group.
3. If desired, run “fsck” (“chkdsk” on Windows) on the replica volumes. This is optional.
4. Access the volumes and test as desired.
5. When testing is completed, unmount the replica volumes from the host. If using logical disk management, deport the volume groups. Then select Disable Image Access at the replica. The writes to the replica will automatically be undone.

Event notification

Complete event logging is provided for all RecoverPoint management operations and status changes. Events are stored within the RecoverPoint System Log, which is accessible from the management application. Event logging includes auditing information such as commands, command errors, events, and the System Log. RecoverPoint also supports the following types of event notification: e-mail, SNMP, Syslogs, System Reports, and System Alerts.

General I/O and sizing

The transfer rate of a single RecoverPoint appliance (RPA) is approximately 60 MB per second. A tool such as iostat can be used to measure the data change rate. This data is relevant in determining the number of RPAs needed to meet RTO and Service Level Agreements. It is also useful in determining the size of the journals and retention time, bandwidth, and degree of compression.

Group sets should be configured for consistency groups that are dependent on one another or that must work together as a single unit, that is, in a federated environment. A group set provides the capability to apply parallel bookmarks at a user-defined frequency. In this implementation, the data is being replicated from both the web server and application hosts.
On the application hosts, the rate of change varies for different volumes. The WebLogic persistent store must be replicated at a higher rate than the SOA and WebLogic binaries and configuration data. To implement group sets, please refer to the EMC RecoverPoint 3.1 Administrator’s Guide.

Conclusion

Enterprise Oracle deployments need protection from unforeseen disasters and natural calamities. Oracle provides Data Guard as a technology to remotely replicate Oracle database files synchronously or asynchronously to allow for recovery of the Oracle database. However, protecting the database alone is not enough to protect the business itself. Full application-level recovery is required to bring the business back to a production-ready state. To that end, Oracle works with partners to validate enterprise replication technologies to protect the Fusion Middleware environments on which Oracle enterprise business applications run.

This white paper summarizes the results of a joint EMC-Oracle engineering effort to utilize EMC RecoverPoint for local and remote replication of Oracle Fusion Middleware and Oracle Database Server information. It highlights key concepts, along with setup and administration examples, showing how to deliver application-aware recovery to specific points in time and provide continuous data protection, while also incorporating additional features such as bandwidth compression to reduce overall TCO.
References and resources

Oracle

OCFS2 A Cluster File System for Linux, User’s Guide for Release 1.4
http://oss.oracle.com/projects/ocfs2/dist/documentation/v1.4/ocfs2-1_4-usersguide.pdf

Oracle Cluster File System 2 (OCFS2) User’s Guide
OCFS2 – Frequently Asked Questions
http://oss.oracle.com/projects/ocfs2/documentation/v1.2/

Oracle WebLogic Server 9.2 documentation on the Oracle website
http://e-generation.beasys.com/wls/docs92/admin.html

SOA Architect Center page on the Oracle website
http://www.oracle.com/technology/tech/soa/index.html

Oracle Data Guard Concepts and Administration 10g Release 2 (10.2)
http://download.oracle.com/docs/cd/B19306_01/server.102/b14239/toc.htm

EMC

The following white papers can be found on EMC.com. For a selection of other EMC RecoverPoint white papers, go to our resource library.
• Introduction to EMC RecoverPoint 3.1: New Features and Functions
http://www.emc.com/collateral/software/white-papers/h2781-emc-recoverpoint-3-new-features.pdf
• Using EMC RecoverPoint Concurrent Local and Remote for Operational Disaster Recovery
http://www.emc.com/collateral/software/white-papers/h4175-recoverpoint-concurrent-local-remote-operdisaster-recovery-wp.pdf
• EMC RecoverPoint Family Overview
http://www.emc.com/collateral/software/white-papers/h2346-recoverpoint-ov.pdf

The following are available on Powerlink, EMC’s password-protected website for customers and partners:
• Enhancing Oracle Database Recovery with EMC RecoverPoint—Applied Technology
• Disaster Recovery of Oracle Fusion Middleware with EMC RecoverPoint
• Replicating Oracle with EMC RecoverPoint Technical Notes
• EMC RecoverPoint Release 3.1 Administrator’s Guide
• EMC RecoverPoint Release 3.1 and Service Pack Releases Security Configuration Guide
• EMC RecoverPoint Release 3.1 Installation Guide

Appendix

Oracle DR terminology

This appendix defines the following Oracle disaster recovery terminology:
• Application Server host name: This paper differentiates between the terms Application Server host name and network host name. The Application Server host name is the host name that Oracle Application Server uses for the host when Oracle Application Server is configured on the host. During installation, the installer automatically retrieves the Application Server host name from the current host and stores it in the Oracle Application Server configuration metadata on disk. A host can have only one Application Server host name. See also network host name later in this section.
• Asymmetric topology: A disaster recovery configuration that is different across tiers on the production site and standby site. In an asymmetric topology, the standby site can use less hardware (for example, the production site could include four hosts with four Application Server instances while the standby site includes two hosts with four Application Server instances). Or, in a different asymmetric topology, the standby site can use fewer Application Server instances (for example, the production site could include four Application Server instances while the standby site includes two Application Server instances). Another asymmetric topology might include a different configuration for a database (for example, using a Real Application Clusters database at the production site and a single instance database at the standby site).
• Disaster recovery: The ability to safeguard against natural or unplanned outages at a production site by having a recovery strategy for applications and data to a geographically separate standby site.
• Network host name: A host name assigned to an IP address that is resolved through DNS resolution. The network host name is the host name by which a particular host is known within the host's network.
A host can have the same network host name and Application Server host name. A host can have only one Application Server host name, but it can have multiple network host names. See also Application Server host name earlier in this section. • Production site setup: The process of creating the production site. To create the production site using the procedure described in this manual, you must plan and create Application Server host names and network host names, create mount points and links on the hosts to the Oracle home directories on the shared storage where the Oracle Application Server instances will be installed, install the binaries and instances, and deploy the applications. • Site failover: The process of making the current standby site the new production site after the production site becomes unexpectedly unavailable (for example, due to a disaster at the production site). This paper also uses the term "failover" to refer to a site failover. • Site switchover: The process of reversing the roles of the production site and standby site. Switchovers are planned operations done for periodic validation or to perform planned maintenance on the current production site. During a switchover, the current standby site becomes the new production site, and the current production site becomes the new standby site. This paper also uses the term "switchover" to refer to a site switchover. • Site synchronization: The process of applying changes made to the production site at the standby site. For example, when a new application is deployed at the production site, you should perform synchronization so that the same application will be deployed at the standby site, also. • Standby site setup: The process of creating the standby site. 
To create the standby site using the procedure described in this paper, you must plan and create Application Server host names and network host names, perform a switchover operation (which replicates the Oracle home directories and installations from the production site shared storage to the standby site shared storage), and create mount points and links to the Oracle home directories on the standby shared storage.
• Symmetric topology: An Oracle Application Server Disaster Recovery configuration that is completely identical across tiers on the production site and standby site. In a symmetric topology, the production site and standby site have the identical number of hosts, load balancers, instances, and applications. The same ports are used for both sites. The systems are configured identically and the applications access the same data. This paper describes how to set up a symmetric Oracle Application Server disaster recovery topology for an enterprise configuration.
• Topology: The production site and standby site hardware and software components that comprise an Oracle Application Server disaster recovery solution.

RecoverPoint terminology

• Bookmarks: A bookmark is a named snapshot. The bookmark uniquely identifies an image. Bookmarks can be set and named manually; they can also be created automatically by the system either at regular intervals or in response to a system event. Bookmarked images are listed by name.
• Consistency group: A consistency group is a logical grouping of replication volumes that must be consistent across one another. The need for consistency across these volumes could be due to the volumes being used by the same application or needing to have the data on the volumes at the same point in time when recovered due to data dependencies.
A consistency group is also used to determine replication direction and policies on a set of replication volumes. Each consistency group is an independent entity and can have different replication direction and policies than other consistency groups. This allows for synchronous and asynchronous replication as well as bi-directional replication to exist in the same environment. Consistency groups are a technology that groups together various objects, either on a single system or across systems, so that when they are moved or copied, they’re seen as a group. Remember, you either get all of the data or none of it—you don't want to get part of the data. • Continuous data protection (CDP): Local replication across heterogeneous environments. At the local site, the CDP engine captures every I/O into the local CDP journal with I/O bookmarking to capture application events. CDP provides instantaneous or on-demand any-point-in-time recovery regardless of the array type. • Continuous local and remote data protection (CLR): Simultaneous block-level local replication and asynchronous block level remote replication for LUNs with one copy residing locally in the same SAN, and the second copy residing remotely in a different SAN. Locally, every write is journaled. With remote replication significant groups of writes are journaled (bandwidth efficiencies). With CLR, users are enabled to independently recover from local or remote sites. Recovery of one copy locally or remotely can occur without affecting the other copy. This ability to fail over to a local copy of data without impacting a remote site extends RecoverPoint disaster recovery to encompass local as well as regional events. • Continuous remote replication (CRR): Remote replication, with remote site recovery. It provides heterogeneous replication with policy-based bandwidth reduction and efficiencies in asynchronous or synchronous replication environments. 
CRR implements bi-directional, heterogeneous, block-level replication across any distance using asynchronous, synchronous, and snapshot modes.
• Journal: Provides time-stamped recovery points with application-consistent bookmarks. It also correlates system-wide events (port failure, system error, and so on) with potential corruption events, which is very useful when performing root-cause analysis. These application and system bookmarks are automatic, but users can also enter their own bookmarks into the system. The RecoverPoint journal, an important component of the protection process, provides the following capabilities:
  • Tracks all data changes to every protected LUN. It saves each write so that it represents an any-point-in-time image of a protected LUN.
  • Utilizes bookmarks for application-aware recovery. Maintains all of the application, user, and environmental bookmarks associated with specific point-in-time images.
  • Repository for live data updates. Maintains a reserved space, called the target-side processing space, which is used to store changes to an image that has been recovered.
  • Provisioned from existing SAN LUNs. Can be configured from any SAN-accessible LUN, or from a collection of “concatenated” LUNs. The size of the journal can be dynamically increased by adding a new LUN without discarding the existing history contents.
  • Dynamically compressed, which saves storage. The data is stored in a compressed format that can be used to roll a protected LUN back to any point in time.
• RecoverPoint appliance (RPA): Based on a standard Dell 1U server running a customized Linux kernel. The appliance has four 4 Gb/s Fibre Channel ports that are used to attach into a single- or dual-node (A/B) fabric. Each RPA also has two Ethernet ports: one is used as the management-control network and one is used to communicate with a remote RecoverPoint appliance cluster.
It is designed for high availability with redundant power and cooling. The RecoverPoint application software provides core functionality and management for the system. Appliances are deployed in a two- to eight-node cluster configuration that allows active-active failover between the nodes. The RecoverPoint software is designed to avoid the split-brain issues that can arise with traditional clustering technologies. All RecoverPoint appliances are in constant communication and use a shared private SAN volume to maintain metadata state. If a RecoverPoint appliance fails, one of the other RecoverPoint appliances will take over without interrupting any in-progress CDP or CRR operations. • Replication volumes: Volumes with data to be replicated. If the source and target replication volumes differ in size, then the source must be the smaller of the two volumes. This is typical in heterogeneous storage environments as well as some environments where different versions of storage management software are used. Any excess size will not be replicated and will be hidden from the host servers. • Replication set: The association created between the source volume and the local and/or remote target volumes is called the replication set. A consistency group contains one or more replication sets. • Repository volume: This volume holds the configuration and marking information during the replication. At least one repository volume is required per site and is accessible from all RPAs at the site. • Snapshot: A snapshot is the difference between one consistent image of stored data and the next. Snapshots are taken seconds apart. The application writes to storage; at the same time, the splitter provides a second copy of the writes to the RecoverPoint appliance. In asynchronous replication, the appliance gathers several writes into a single snapshot. 
The exact time for closing the snapshot is determined dynamically depending on replication policies and the journal of the consistency group. In synchronous replication, each write is a snapshot. When the snapshot is distributed to a replica, it is stored in the journal volume, so that it is possible to revert to previous images by using the stored snapshots.

RecoverPoint write splitters

The function of the splitter is to mirror writes from the application server to LUNs being protected by RecoverPoint. When a write is requested from the application server, it is split and sent to the RecoverPoint appliance in one of three ways.
• The first method utilizes a host splitter/driver. This host splitter is a lightweight driver that resides in the I/O stack, below any file system and volume manager, and just above any multipath driver (such as EMC PowerPath). The splitter looks at the destination for the write packet. If the write is to a LUN that RecoverPoint protects, the splitter will send a copy of the write packet to the RecoverPoint appliance. It does this by rewriting the target address inside the packet to redirect it to the RecoverPoint appliance’s pseudo-LUN, and reissuing the write down the stack.
• The second method is through an intelligent fabric switch, either the Brocade Connectrix® AP-7600B or Connectrix ED-48000B with the SAS APIs, or one of the Cisco Connectrix MDS-9000 series switches with the SANTap API. The switch intercepts all writes to LUNs being protected by RecoverPoint, and sends a copy of that write to the RecoverPoint appliance.
• The third method is through a CLARiiON array-based splitter, which is supported on CLARiiON CX3 arrays with FLARE 26 patch code, and CLARiiON CX4 arrays with FLARE 28 patch code.
The array intercepts all writes to the LUNs being protected by RecoverPoint, and sends a copy to the RecoverPoint appliance.

In all cases, the original write travels through its normal path to the production LUN. When the copy of the write is received by the RecoverPoint appliance, it is acknowledged back (ACK). This ACK is received by the splitter (the host, CLARiiON, or fabric splitter), and held until the ACK is received back from the production LUN. With both ACKs received, the ACK is sent back to the host, and I/O continues normally. Once the appliance has acknowledged the write, it will move the data into the local journal volume, along with a timestamp and any application-, event-, or user-generated bookmarks for the write. When the data is safely in the journal, it is then distributed to the target replica volumes, with care taken to ensure that write order is preserved during this distribution.

RecoverPoint image access modes

To test and verify that the replica at the disaster recovery site is a reliable and consistent image of the production site, it is necessary to access the image, as referenced in the procedures in this paper. Image access is required to restore production from the disaster recovery site, and to roll back to a previous state of the data. It is also required to temporarily operate systems from a replicated copy while maintenance work is carried out on the production site, and to fail over to the replica. When image access is enabled, host applications at the copy site can access the replica. The following is a discussion of the image access strategies that were pertinent in this implementation, that is, Logged, Virtual, Virtual with Roll, and Disable Image access.

Virtual access (instant)

In Virtual access, the system creates the selected image in a separate virtual LUN within the RecoverPoint appliance.
Performance is constrained by the RecoverPoint appliance; however, access to the point-in-time image is nearly instantaneous. The image can be used in the same way as logged access (physical), but again, all data changes are temporary and are stored in a special place on the local journal. Generally, this type of image access is chosen because the user is not sure which image, or point in time, is needed. The user must access several images to conduct forensics and determine which replica is required. As stated above, the image is accessed through the RecoverPoint appliance; virtual access is not recommended for heavy workloads or production work. You will not be able to recover the production site from a virtual image. By definition, the image is temporary. Generally, when work is completed, the choice is made to disable image access. If it is determined that the image should be maintained, then access must be changed to Logged access using Roll To Image. This can be done on the fly using a pull-down menu in the GUI or through the RecoverPoint command line interface. When you disable image access, the virtual LUN and all writes to it are discarded.

Virtual access (instant) with Roll image in background

In Virtual access with Roll image in background, the system first creates the image in a virtual volume managed by the RecoverPoint appliance. This provides very fast access to the image, the same as in Virtual access. Simultaneously, in the background, the system rolls to the physical image. Once the system has completed this action, the virtual volume is discarded, and the physical volume takes its place. At this point, the system continues to function as if you had chosen Logged image access initially.

The virtual volume and the physical volume have the same SCSI ID.
The virtual LUNs owned by the RecoverPoint appliance look like physical target LUNs but were dynamically created by the RPA. If you execute a SCSI inquiry against the virtual LUNs, the data returned will be the same as if the inquiry was invoked against the physical LUNs. Zoning and all other characteristics will appear to have the same configuration. The virtual LUNs can be mounted to the disaster recovery server and are seen by the operating system as physical LUNs. The switch from virtual to physical will be transparent to the servers and applications. The user will not see any difference in access. Once this occurs, changes are read from the physical volume instead of being performed by the RecoverPoint appliance. If you disable image access, the writes to the volume while image access was enabled will be rolled back (undone). Then distribution to storage will continue from the accessed image forward.

This type of access is recommended when the decision is made to move from the current production site to the disaster recovery site, for example after a catastrophe, and immediate access to the image is required. If the intention is to roll back in time, but you need immediate access to images to determine which image is valid, this is also the best option. This option is also viable for a heavy workload. Lastly, as mentioned, production cannot be recovered from a virtual-only image; either this type of access or logged access is necessary.

Logged access (physical)

In Logged access, the system rolls backward (or forward) to the snapshot (point in time) you select to access. There will be a delay while the successive snapshots are applied to the replica image to create the image you selected. The length of the delay depends on how far the selected snapshot is from the snapshot currently being distributed to storage.
Once access is enabled, hosts have direct access to the replica volumes, and the RPA does not; that is, distribution of snapshots from the journal to storage is paused. When you disable image access, the writes made to the volume while image access was enabled are rolled back (undone). Distribution to storage then continues from the accessed snapshot forward. Logged access is the preferred image access for production. When recovering production from the disaster recovery site, this type of image access should be enabled.

Disable Image Access

Choosing to disable image access means that all changes to the replica are discarded. It does not matter what type of access was initiated (Logged or another type), or whether the image chosen was the latest or an image back in time. When the splitter is disabled, the LUN is masked off. The operating system still sees the LUNs as mounted, but if you try to access data or rescan the disks using a SCSI command, the servers report errors. Disabling the image is like pulling the LUNs out from underneath the application. Applications usually report errors if data is still cached or writes are waiting to be flushed from the operating system. The cleanest way to ensure there are no errors on the disaster recovery site is to first shut down the applications, then unmount the file systems, and then disable the image.

Disabling image access restores the storage state to No access. Changes to the replica recorded in the image access log are automatically undone, so that the replica is restored to the state it was in before it was accessed. Using Disable Image Access effectively says that the work done at the disaster recovery site is no longer needed. Possible reasons are that the point-in-time image chosen was not the correct image, or that the information sought was obtained and propagated by another means.
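The undo behavior of Disable Image Access can be pictured with a minimal sketch. The class ReplicaImage and its methods are hypothetical names chosen for illustration, and the in-memory list is only a stand-in for the image access log; the real mechanism inside RecoverPoint is not shown in this paper. The sketch assumes that every host write made during image access records the previous contents of the block, and that disabling access replays those records in reverse.

```python
_MISSING = object()  # marker for a block that did not exist before the write


class ReplicaImage:
    """Toy replica whose host writes are undone when access is disabled."""

    def __init__(self, blocks):
        self.blocks = dict(blocks)
        self.access_enabled = False   # corresponds to the "No access" state
        self._access_log = []         # (block, previous contents) per write

    def enable_image_access(self):
        self.access_enabled = True

    def write(self, block, value):
        """Host write; allowed only while image access is enabled."""
        if not self.access_enabled:
            raise PermissionError("replica is in the No access state")
        # Record the prior contents so the write can be undone later.
        self._access_log.append((block, self.blocks.get(block, _MISSING)))
        self.blocks[block] = value

    def disable_image_access(self):
        """Undo all logged writes, newest first, then mask the replica."""
        while self._access_log:
            block, previous = self._access_log.pop()
            if previous is _MISSING:
                self.blocks.pop(block, None)
            else:
                self.blocks[block] = previous
        self.access_enabled = False
```

In this model, any writes made after enable_image_access() disappear once disable_image_access() is called, and further writes are rejected, which mirrors the advice above to shut down applications and unmount file systems before disabling the image.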