Uploaded by ram kuruva

HA and DR with the SAP HANA Platform - Part II

High Availability and Disaster Recovery with
the SAP HANA Platform
PART - II
HA/DR – Support
System Replication
Overview
• System replication is a solution for both high availability and disaster recovery.
• It is compatible with all SAP HANA hardware partner solutions.
• The secondary system can be located near the primary system to serve as a rapid failover solution for planned downtime.
• Alternatively, a secondary system can be installed at a remote site for disaster recovery purposes.
• System replication replicates and persists data and logs, and finally loads the data into memory.
• In the alternative configuration, system replication without data preload, the secondary system does not preload the data and hence consumes very little memory. This allows the host of the secondary system to serve dual purposes.
https://www.linkedin.com/in/priti-prasanna-33698050/
System Replication - options:
Synchronous in memory (default):
• This is the default replication mode.
• The primary system commits the transaction after it receives a reply confirming that the log was received by the secondary system, but before the log has been persisted there.
• The transaction delay on the primary system is shorter in this case because it includes only the data transmission time.
Synchronous with full sync :
• In this option, a log write is successful only when the log buffer has been written to the log file of both the primary and the secondary instance.
• In addition, when the secondary system is disconnected (e.g. because of a network failure), the primary system suspends transaction processing until the connection to the secondary system is re-established. No data loss occurs in this scenario.
Synchronous:
• In this option, the primary system does not commit a transaction until it receives confirmation that the log has been persisted on the secondary system.
• This mode guarantees immediate consistency between the primary and the secondary systems. However, the transaction is delayed by the time it takes to transmit the data and persist it on the secondary system.
Asynchronous
• In this mode, the primary system sends the redo log buffers to the secondary system asynchronously.
• It does not wait for a response from the secondary system.
• The primary system commits a transaction once it has been written to the log file of the primary system and sent to the secondary system over the network.
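The four replication modes differ only in what the primary waits for before it commits. The following is a minimal sketch of that rule as described above, not SAP HANA code. The mode labels ("syncmem", "sync", "fullsync", "async"), the LogState fields and the function name are chosen for illustration. The sketch assumes that every mode also requires the local log write, and it models disconnect handling only for full sync, because that is the only case the text describes.

```python
from dataclasses import dataclass


@dataclass
class LogState:
    """Where a transaction's redo log buffer currently stands (illustrative model)."""
    persisted_on_primary: bool     # written to the primary's log file
    sent_to_secondary: bool        # handed to the network
    received_by_secondary: bool    # secondary confirmed receipt (in memory)
    persisted_on_secondary: bool   # secondary confirmed the write to its log file
    secondary_connected: bool = True


def can_commit(mode: str, state: LogState) -> bool:
    """Return True if the primary may commit under the given replication mode."""
    if not state.persisted_on_primary:
        # Assumption: the local log write is a precondition in every mode.
        return False
    if mode == "async":
        return state.sent_to_secondary
    if mode == "syncmem":
        return state.received_by_secondary
    if mode == "sync":
        return state.persisted_on_secondary
    if mode == "fullsync":
        # Full sync suspends transaction processing while the secondary is disconnected.
        return state.secondary_connected and state.persisted_on_secondary
    raise ValueError(f"unknown replication mode: {mode}")
```

With a log buffer that the secondary has received but not yet persisted, the sketch allows a commit for "async" and "syncmem" but not for "sync" or "fullsync".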
System Replication – Operation mode - options:
Delta data shipping:
• In this operation mode, the secondary system persists the received data but does not immediately replay the received log.
• To avoid an ever-growing list of logs, incremental data snapshots are transmitted asynchronously from time to time from the primary to the secondary system.
Logreplay (available from SPS 11)
• In this operation mode, the received log entries are replayed immediately on the secondary system. This reduces the takeover time and the network traffic between the primary and secondary sites, because no delta data shipping is required.
• If the secondary system has to take over, only the part of the log that represents the changes made after the most recent data snapshot needs to be replayed.
System Replication - Configuration options:
Minimal setup in one data center for fast takeovers :
Advantages with this configuration option are as follows:
• Data is continuously loaded into memory on the secondary site.
• The switchover from primary to secondary is faster than with mirroring (e.g. 1-2 minutes).
• During the takeover to the secondary site, only a roll forward from the latest savepoint is required.
Delta log shipping:
Connect to primary system:
Connect to secondary system:
Cluster across data centers with DB controller transfer:
• A HANA cluster is set up between the primary and secondary data centers with DB controller transfer.
Advantages:
• Data is continuously loaded into memory on the secondary site in preparation for a possible takeover, which occupies resources there.
• The switchover is faster than with storage replication/mirroring (approx. 25 minutes).
• During the takeover to the secondary site, only a roll forward from the latest synchronization point is necessary.
• The performance ramp-up is very short: minutes, rather than the hours needed without preparation.
Disadvantages:
• No QA/DEV systems can be operated on the secondary site.
• Hardware such as memory and CPU is actively used on the secondary site for the standby/shadow processes.
Cluster across data center with QA & DEV on second site
Advantages:
• QA and DEV systems can be operated on the secondary site, and both synchronous and asynchronous solutions are available.
• The impact of the synchronous solution on the primary is about 0%-10%, in contrast to about 25% with storage replication, based on measurements done with SAP business systems.
• The transfer process from primary to secondary is optimized, and less data has to be transferred than with storage replication.
• During the takeover to the secondary site, only a roll forward from the latest data synchronization point is required.
Disadvantages:
• Tables and column data cannot be continuously loaded into memory on the secondary site.
• Hardware such as memory and CPU is actively used to run the QA/DEV applications and partly for the standby/shadow processes.
• About 10-20 GB is necessary for the shadow operation of the production bridge, based on measurements with SAP systems.
• The takeover time is similar to that of storage mirroring, e.g. 10-15 minutes at best.
• Some hardware partners even restart the secondary site with storage replication, which adds another 10-15 minutes for getting the operating system up and running.
SAP HANA Multitenant Database Containers:
Continuous log Replay Replication
(operation mode as of SPS 11)
Continuous Log Replay (pure log-based transfer)
• The new operation mode "logreplay" was released as part of SPS 11.
• The redo log is processed immediately on the secondary instance.
• Delta transfers to the secondary instance are no longer necessary.
• This reduces takeover time and network traffic significantly.
• Automatic initial data transfer and multitier system replication are still supported.
• Continuous log replication is the foundation for active/active operations (planned beyond SPS 11).
Steps for Configuration of continuous log replication
Using the wizard tool available in SAP HANA studio
The steps involved are as follows:
• Create an initial data backup (native or storage snapshot).
• Enable the primary system for system replication.
• Register the secondary system, selecting "logreplay" as the operation mode.
• Start the secondary system.
Using the command line tool hdbnsutil for configuring log replay replication
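The original slide showed the hdbnsutil calls as a screenshot that is not preserved here. As a rough sketch only, the snippet below assembles the two calls that correspond to the "enable primary" and "register secondary" steps. The option names (-sr_enable, -sr_register, --remoteHost, --remoteInstance, --replicationMode, --operationMode, --name) are given as recalled from SAP's documentation and should be checked against the hdbnsutil help of your revision. The site names, host name and instance number are hypothetical, and the helper functions are not part of any SAP tool. Nothing is executed; the commands are only printed.

```python
import shlex


def sr_enable_cmd(site_name: str) -> list[str]:
    """Command to enable system replication on the primary (run as the <sid>adm user)."""
    return ["hdbnsutil", "-sr_enable", f"--name={site_name}"]


def sr_register_cmd(remote_host: str, remote_instance: str, site_name: str,
                    replication_mode: str = "sync",
                    operation_mode: str = "logreplay") -> list[str]:
    """Command to register a stopped secondary against the primary."""
    return [
        "hdbnsutil", "-sr_register",
        f"--remoteHost={remote_host}",
        f"--remoteInstance={remote_instance}",
        f"--replicationMode={replication_mode}",
        f"--operationMode={operation_mode}",
        f"--name={site_name}",
    ]


if __name__ == "__main__":
    # Hypothetical site and host names, for illustration only.
    print(shlex.join(sr_enable_cmd("SITEA")))
    print(shlex.join(sr_register_cmd("hana-primary", "00", "SITEB")))
```

Building the argument list separately from running it makes it easy to review the exact command line before anything touches a productive system.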
Continuous log replication – new parameter
logshipping_max_retention_size (soft limit): <> 0 versus = 0
• The default value of the logshipping_max_retention_size parameter is 1 TB.
• With operation mode logreplay, log segments can be marked as retained so that they can be used to sync a secondary system after a disconnect.
• The secondary site uses only the log from the online log area of the primary SAP HANA system for syncing.
• The log must therefore be retained for a longer time period than before to be able to sync the secondary site.
• If syncing via delta log shipping does not work, for example because the log has been reused, a full data shipping becomes necessary. To avoid this situation, the concept of log retention has been introduced.
• The parameter logshipping_max_retention_size can be used to specify how the SAP HANA system behaves when many log segments of the type "retained free" are created.
• If logshipping_max_retention_size is set to a value other than 0 and no secondary is connected:
o Log segments are not reused, even if they are truncated and backed up, until the maximum size limit has been reached or the system runs into a log-full situation.
o If the maximum size limit is reached, or in a log-full situation, segments that are kept only for syncing the secondary site are reused.
• This setting prevents the primary site from hanging because too many log segments are held for syncing the secondary site.
• With this setting the primary is kept running, with the drawback that the secondary can no longer sync.
• If logshipping_max_retention_size is set to 0:
o The log segments required for syncing the secondary are not reused.
o A log-full situation results in a system standstill on the primary site until log writing can continue.
• This setting gives the ability to sync the secondary a higher priority than avoiding a standstill on the primary.
• Once the reason for the log-full situation has been resolved on the primary or the secondary site, transaction processing can continue.
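The two settings of logshipping_max_retention_size can be summarised as a small decision function. This is a simplified sketch of the behaviour described above for a primary whose secondary is disconnected, not SAP HANA code. The function name and the returned labels are made up for illustration, and sizes are assumed to be expressed in MB (so the 1 TB default is 1048576).

```python
DEFAULT_MAX_RETENTION_MB = 1024 * 1024  # 1 TB, assuming the value is given in MB


def primary_action(max_retention_size_mb: int, retained_mb: int, log_full: bool) -> str:
    """Decide what happens to log segments retained only for syncing the secondary.

    Returns one of:
      "retain"     - keep the segments so the secondary can sync later
      "reuse"      - free the segments; the secondary can no longer sync from the log
      "standstill" - the primary stops until log writing can continue
    """
    if max_retention_size_mb == 0:
        # Syncing the secondary has priority over keeping the primary running.
        return "standstill" if log_full else "retain"
    if log_full or retained_mb >= max_retention_size_mb:
        # Keeping the primary running has priority; the retained log is given up.
        return "reuse"
    return "retain"
```

With the default limit, a primary that has retained 10 MB keeps retaining. Once the retained volume reaches the limit, or the log area is full, the segments are reused. With the parameter set to 0, the same log-full situation instead leads to a standstill.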
Continuous log replication- demo
Connect to primary system:
Click Next
Connect to secondary system:
Reference:
Differences between 2 operation modes:
Zero Downtime Procedure
Zero downtime – Planned
• In a planned situation, for example a software upgrade:
o Once the connectivity between the primary and secondary sites is suspended, the cluster manager tool initiates the takeover and moves the virtual IP addresses to the secondary host.
o Once the takeover is completed, the former primary instance can be updated and reconfigured with the software upgrades. Finally, the resynchronization between the new secondary and the primary instance is initiated.
Zero Downtime – Unplanned
• The steps involved between the primary and the secondary sites in an unplanned outage situation are:
• The initial situation: SAP NetWeaver connects to SAP HANA via the Database Shared Library (DBSL), using a virtual IP address to reach whichever host the database instance is currently running on. Alternatively, the Domain Name System (DNS) can offer virtual hostnames. SAP HANA system replication is working and the secondary is in a synchronous or asynchronous state with the primary SAP HANA instance.
• Once an incident happens, the takeover is executed.
• A cluster manager checks the operational state of the setup and takes action if a failure has happened.
• In case of a failure, the cluster manager isolates the box (it moves the virtual IP addresses away or even sends a STONITH command) to prevent any further usage of the primary host. The cluster manager also initiates the takeover, waits for the secondary to report a fully operational state, and finally moves the virtual IP addresses to the secondary host. With this move of the virtual IP addresses, there is a live system again behind this interface, and SAP NetWeaver sessions with work processes can be revived on the new database instance.
• Next, follow up and re-initialize SAP HANA system replication in the reverse direction, so that every committed transaction and its related changes are available again on the takeover system.
• Afterward, the HA or DR setup has to be recreated: rebuild the hardware, then either reinstall a blank installation or revive the SAP HANA instance, and reconfigure it as a secondary system replication host.
• Finally, initiate the resynchronization between the new secondary and the primary instance.
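The order of the cluster manager's actions matters: the failed host is isolated before the takeover, and the virtual IP moves only after the secondary is operational. The sketch below replays that order on a toy in-memory model. It is an illustration under stated assumptions, not the logic of any particular cluster manager; the Cluster class, its fields and the host names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy model of a two-site setup (all names are illustrative)."""
    primary: str
    secondary: str
    vip_owner: str                        # host currently holding the virtual IP
    fenced: set = field(default_factory=set)
    log: list = field(default_factory=list)


def handle_primary_failure(cluster: Cluster) -> Cluster:
    """Replay the unplanned-takeover steps in the order described above."""
    failed, survivor = cluster.primary, cluster.secondary

    # 1. Isolate the failed host so it cannot be used any further.
    cluster.fenced.add(failed)
    cluster.log.append(f"isolate {failed}")

    # 2. Initiate the takeover and wait until the secondary is fully operational.
    cluster.log.append(f"takeover on {survivor}")
    cluster.log.append(f"wait for {survivor} to be operational")

    # 3. Only now move the virtual IP so clients reach a live system.
    cluster.vip_owner = survivor
    cluster.log.append(f"move virtual IP to {survivor}")

    # 4. Roles are swapped; the old primary is rebuilt and re-registered later.
    cluster.primary, cluster.secondary = survivor, failed
    cluster.log.append(f"re-register {failed} as secondary after rebuild")
    return cluster
```

Calling handle_primary_failure(Cluster("hostA", "hostB", "hostA")) leaves hostB as primary and owner of the virtual IP, with hostA fenced and queued for re-registration.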
Host Auto Failover/Scale out , Storage Replication
Scale-out with host failover
Scale-out clusters address two problems.
o The first problem is scaling to a setup larger than one host.
o The second problem is offering an easy high availability option by having one or more hosts as spare or standby hosts.
Host Auto Failover
• In the SAP HANA platform, the cluster is managed by the name service inside the platform, which regularly checks whether the cluster members are still active.
• In case of failures, the system initiates a fully automated takeover to the standby instance.
• SAP HANA scale-out systems use SAN storage with Fibre Channel adapters via the Storage Connector API, which makes it possible to remount the necessary file systems on standby hosts for persistency.
Storage for persistent data:
Host Auto Failover
• Host auto-failover is a local fault recovery solution that is part of the SAP HANA platform.
• It is used as an alternative measure to the system replication solution.
• In this solution, one or more hosts are added to the SAP HANA system and configured to work in standby mode.
• An important point is that as long as the hosts are in standby mode, the databases on these hosts do not contain any data and do not accept requests or queries.
Host Auto Failover – what happens after recovery during host auto-failover
• In this case, a standby host automatically takes over the operations of any failed active host, so it must be placed where it can access all the database volumes.
• This is accomplished with a shared networked storage server using a distributed file system, or with a vendor-specific solution that uses an SAP HANA programmatic interface such as the Storage Connector API to dynamically detach and attach networked storage upon failover, for example block storage via Fibre Channel.
Host Auto Failover – how to recover connections from SAP HANA clients
• In this scenario, the connections that are configured to reach the original host need to be "diverted" to the standby host after host auto-failover. There are a couple of approaches available to recover the connections.
o The first approach is the HTTP load balancer (HLB), a network-based approach using IP addresses or DNS names.
o The second approach applies to SQL/MDX clients. The client connection code (in this case an ODBC or JDBC connection) uses a round-robin approach to reconnect, which ensures that these clients can reach the SAP HANA database even after failover.
• To support HTTP Web clients, which use the SAP HANA XS application services, it is recommended to install an external HTTP load balancer such as SAP Web Dispatcher or a similar product from another vendor. The HTTP load balancers are configured to monitor the Web servers on all the hosts at both the primary and secondary sites.
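The round-robin reconnect used by SQL/MDX clients, mentioned above, can be pictured as cycling through a configured host list until one host accepts the connection. The sketch below shows only that idea. It is not the ODBC or JDBC driver's actual code; the function name, the try_connect callback and the max_rounds limit are assumptions made for illustration.

```python
from typing import Callable, Sequence


def connect_round_robin(hosts: Sequence[str],
                        try_connect: Callable[[str], bool],
                        start: int = 0,
                        max_rounds: int = 2) -> str:
    """Try each configured host in turn and return the first one that accepts.

    try_connect is a caller-supplied function that returns True on success.
    """
    if not hosts:
        raise ValueError("host list must not be empty")
    for attempt in range(len(hosts) * max_rounds):
        host = hosts[(start + attempt) % len(hosts)]
        if try_connect(host):
            return host
    raise ConnectionError("no configured host accepted the connection")
```

For example, with hosts ["hana1", "hana2", "hana3"] and only "hana3" reachable after a failover, the function skips the first two hosts and returns "hana3".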
Host Auto Failover – how the SAP HANA platform leverages host auto-failover and the methods used to provide host auto-failover support
Heartbeat
• The first method is heartbeats. "A heartbeat is a periodic signal generated by software or hardware to indicate normal operation or to synchronize other parts of a system." The following types of heartbeats are used with the SAP HANA platform to check whether another host is active as a master before starting the current host as master or performing a failover.
o The first heartbeat type is TCP communication-based heartbeats:
▪ a ping from name server to name server using the SAP HANA internal communication protocol,
▪ or a ping from name server to hdbdaemon using the SAP HANA internal communication protocol.
o The second type is storage-based heartbeats, where the current master name server periodically updates heartbeat files stored on a different storage level.
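A storage-based heartbeat boils down to one side refreshing a timestamp in a shared file and the other side checking how old it is. The sketch below illustrates that mechanism only. The file format, the timeout value and the function names are assumptions, not the format SAP HANA uses for its heartbeat files.

```python
import time
from typing import Optional


def write_heartbeat(path: str, now: Optional[float] = None) -> None:
    """Called periodically by the current master: record the current time."""
    timestamp = time.time() if now is None else now
    with open(path, "w") as handle:
        handle.write(repr(timestamp))


def master_is_alive(path: str, timeout_s: float, now: Optional[float] = None) -> bool:
    """Called by another host: the master counts as alive if the heartbeat is fresh."""
    current = time.time() if now is None else now
    try:
        with open(path) as handle:
            last_beat = float(handle.read())
    except (OSError, ValueError):
        # A missing or unreadable heartbeat file is treated as "not alive".
        return False
    return (current - last_beat) <= timeout_s
```

A host that wants to become master would call master_is_alive first and proceed only if it returns False. The optional now argument exists so the check can be exercised without waiting for real time to pass.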
Fencing
• The next method that the SAP HANA platform leverages for host auto-failover is fencing, where inbound and outbound (I/O) fencing ensures that the other side no longer accesses the data or the log storage.
• The SAP HANA Storage Connector API allows different types of storage and network architectures to be used to ensure proper inbound/outbound fencing, such as:
o SAN storage, where the SAP HANA Fibre Channel storage connector uses SCSI-3 persistent reservations.
o NFSv3, which is used without file locking but with a storage connector provided by certified storage vendors that implements a STONITH call to reboot a failed host.
o NFSv4 or cluster-like file systems such as GPFS, which use file locks.
Scenario where heartbeats cannot detect if another host is alive
• One example is a split-brain scenario, where no communication is possible between the hosts due to network errors.
• In this split-brain scenario, the master name server is the only entity that makes a failover decision, and the HTTP load balancer redirects the HTTP clients to the correct server upon HANA instance failure.
• The HTTP clients are configured to use the IP address of the load balancer itself and remain unaware of any HANA failover activity.
Storage Replication
The third replication option that SAP HANA platform supports is storage replication
which provides continuous replication of all persisted data and prevents any potential
loss of data from the time of the last backup to the time of the failure.
• Storage replication delivers a backup of the volumes or file system to a remote networked storage system, where the persisted transaction log is replicated remotely via synchronous storage replication.
• Synchronous storage replication can be used only when the distance between the primary and the secondary backup site is no more than 100 kilometers.
An important note for this replication option is that the SAP HANA disk area is controlled by the storage technology vendor.
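A rough calculation shows why synchronous replication is sensitive to distance: every synchronous write has to wait for at least one network round trip. The sketch below estimates only the propagation delay in optical fibre, assuming a signal speed of roughly 200,000 km/s (about two thirds of the speed of light in vacuum). It ignores switching, protocol and storage latency, so real delays are higher. The constant and the function name are illustrative.

```python
FIBRE_SPEED_KM_PER_S = 200_000  # approximate signal speed in optical fibre (assumption)


def round_trip_ms(distance_km: float) -> float:
    """Minimum round-trip propagation delay in milliseconds over the given distance."""
    one_way_seconds = distance_km / FIBRE_SPEED_KM_PER_S
    return 2 * one_way_seconds * 1000


if __name__ == "__main__":
    for km in (10, 100, 1000):
        print(f"{km:>5} km: at least {round_trip_ms(km):.2f} ms added per synchronous write")
```

At the 100-kilometer limit mentioned above, propagation alone adds about 1 ms to every synchronous log write. At 1000 km it would be about 10 ms.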
Storage replication- Cluster across data center
• In this option, a cluster is set up across different data centers between the primary and the secondary.
• Storage mirroring is offered at the storage system level together with the appliance as a special offering by SAP hardware partners.
• When synchronous mirroring is activated, a performance impact is to be expected on data-changing operations. Its size depends on many external factors such as the distance and the connection between the data centers.
• The synchronous writing of the log with the concluding COMMITs is the crucial part here.
• In an emergency, the primary data center is no longer available and a takeover process must be initiated.
o A manual takeover process is available in addition to the automated process, which has to be implemented here.
o The takeover process officially ends the mirroring, mounts the disks to the already installed HANA software instance, and starts up the secondary database side of the cluster.
o If the hostnames and instance names on both sides of the cluster are identical, no further steps are necessary.
• In the storage replication scenario where QA and dev systems are running on the secondary cluster hardware:
o The takeover stops the dev and QA instances and mounts the production disks to the hosts.
o This requires an additional set of disks for the dev and QA instances to operate. To offer this additional option, an active-active operation capability is being planned with the hardware partners for the near future.
Refer:
SAP HANA System Replication - Deltadatalog shipping demo (SAP HANA Academy):
https://www.youtube.com/watch?v=58vrPJBcZSQ
SAP HANA System Replication - Log based replication demo (SAP HANA Academy):
https://www.youtube.com/watch?v=RYocesxR6Q8