
VMware vSphere 5.x, 6.x and 7.x support with NetApp MetroCluster (2031038)

Last Updated: 7/17/2020
Categories: Informational
Purpose
This article provides information about deploying a vSphere Metro Storage Cluster (vMSC) across two
data centers or sites using the NetApp MetroCluster solution with vSphere and ESXi 5.5, 6.x, and 7.0. This article
applies to FC, FCoE, iSCSI, and NFS implementations of MetroCluster FC and MetroCluster IP on ONTAP 9.0
through 9.7.
Resolution
What is vMSC?
A VMware vSphere Metro Storage Cluster configuration is a VMware vSphere certified solution that
combines synchronous replication with array-based clustering. These solutions are implemented with the goal
of reaping the same benefits that high-availability clusters provide to a local site, but in a geographically
dispersed model with two data centers in different locations. At its core, a VMware vMSC infrastructure is a
stretched cluster. The architecture is built on the idea of extending what is defined as local in terms of network
and storage. This enables these subsystems to span geographies, presenting a single and common base
infrastructure set of resources to the vSphere cluster at both sites. In essence, it stretches network and storage
between sites. All supported storage devices are listed on the VMware Storage Compatibility Guide.
What is a NetApp MetroCluster™?
NetApp MetroCluster™ software is a solution that combines array-based clustering with synchronous
replication to deliver continuous availability and zero data loss at the lowest cost. Administration of the array-based
cluster is simpler because the dependencies and complexity normally associated with host-based
clustering are eliminated. MetroCluster immediately duplicates all your mission-critical data on a transaction-by-transaction
basis, providing uninterrupted access to your applications and data. And unlike standard data
replication solutions, MetroCluster works seamlessly with your host environment to provide continuous data
availability while eliminating the need to create and maintain complicated failover scripts.
NetApp MetroCluster™ enhances the built-in high availability and nondisruptive operations of NetApp
hardware and ONTAP storage software, providing an additional layer of protection for the entire storage and
host environment. Whether your environment is composed of standalone servers, high-availability server
clusters, or virtualized servers, MetroCluster seamlessly maintains application availability in the face of a total
storage outage. Such an outage can result from loss of power, cooling, or network connectivity; a storage array
shutdown; or operational error. This is because MetroCluster is an array-based active-active clustered
solution, eliminating the need for complex failover scripts, server reboots, or application restarts. In fact, most
MetroCluster customers report that their users experience no application interruption when a cluster recovery
takes place.
What is ONTAP Mediator?
ONTAP Mediator works with MetroCluster IP in ONTAP 9.7 and later. The Mediator extends ONTAP's
ability to detect and handle failures. The Mediator runs on a RHEL or CentOS physical or virtual machine. The
MetroCluster nodes use the Mediator to determine site status if the sites lose connectivity with each
other or if a site failure occurs.
With the ONTAP Mediator in place, MetroCluster IP can perform an Automatic Unplanned Switchover
(AUSO). For more information, see the MetroCluster IP Installation Guide in the ONTAP documentation.
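To confirm that the Mediator is configured and reachable from the cluster, its status can be checked from the ONTAP CLI. The following is a minimal sketch that runs the ONTAP 9.7 command "metrocluster configuration-settings mediator show" over SSH; the cluster address and the assumption of key-based SSH access are hypothetical placeholders, so verify the command against the documentation for your release.

import subprocess

# Hypothetical cluster management LIF; assumes key-based SSH access for the admin user.
CLUSTER_MGMT = "admin@cluster-a.example.com"

# ONTAP 9.7 command that reports the Mediator address and connection status.
result = subprocess.run(
    ["ssh", CLUSTER_MGMT, "metrocluster configuration-settings mediator show"],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)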
What is MetroCluster TieBreaker?
NetApp MetroCluster Tiebreaker software checks the reachability of the nodes and the cluster in a MetroCluster
configuration to determine whether a site failure has occurred, and it triggers an alert under certain conditions.
The Tiebreaker software runs on a RHEL physical or virtual server, and it is best deployed at a third location
for independent visibility to all MetroCluster nodes.
The Tiebreaker software monitors each controller in the MetroCluster configuration by establishing redundant
connections through multiple paths to a node management LIF and to the cluster management LIF, both hosted
on the IP network.
The Tiebreaker software monitors the following components in the MetroCluster configuration:
- Nodes, through local node interfaces
- The cluster, through the cluster-designated interfaces
- The surviving cluster, to evaluate whether it has connectivity to the disaster site (FCVI, storage, and intercluster peering)
When the Tiebreaker software loses its connection to all of the nodes in a cluster and to the cluster itself, it
declares that cluster "not reachable". Detecting a connection failure takes approximately three to five seconds.
If a cluster is unreachable from the Tiebreaker software, the surviving cluster (the cluster that is still reachable)
must indicate that all links to the partner cluster are severed before the Tiebreaker software triggers an alert.
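The following is a minimal Python sketch of the reachability logic described above: it checks TCP connectivity to the node management LIFs and the cluster management LIF and treats the cluster as unreachable only when every check fails. The addresses, port, and timeout are hypothetical placeholders; this illustrates the monitoring concept only and is not the Tiebreaker implementation.

import socket

# Hypothetical management LIF addresses; replace with real node/cluster LIFs.
CLUSTER_MGMT_LIF = "192.0.2.10"
NODE_MGMT_LIFS = ["192.0.2.11", "192.0.2.12"]
TIMEOUT_SECONDS = 5  # roughly matches the three-to-five-second detection window described above

def lif_reachable(address, port=443):
    """Return True if a TCP connection to the LIF succeeds within the timeout."""
    try:
        with socket.create_connection((address, port), timeout=TIMEOUT_SECONDS):
            return True
    except OSError:
        return False

def cluster_reachable():
    """Declare the cluster 'not reachable' only when the cluster LIF and every node management LIF fail."""
    targets = [CLUSTER_MGMT_LIF] + NODE_MGMT_LIFS
    return any(lif_reachable(addr) for addr in targets)

if __name__ == "__main__":
    print("reachable" if cluster_reachable() else "not reachable")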
Automatic Unplanned Switchover in MetroCluster
In a MetroCluster configuration, the system can perform an automatic unplanned switchover (AUSO) in the
event of a site failure to provide nondisruptive operations.
In a four-node or eight-node MetroCluster configuration, an AUSO can be triggered if all nodes at a site fail
for any of the following reasons:
- Node power down
- Node power loss
- Node panic
Because there is no local HA failover available in a two-node MetroCluster configuration, the system performs
an AUSO to provide continued operation after a controller failure. This functionality is similar to the HA
takeover capability in an HA pair. In a two-node MetroCluster configuration, an AUSO can be triggered in the
following scenarios:
- Node power down
- Node power loss
- Node panic
- Node reboot
In the event of an AUSO, disk ownership for the impaired node's pool0 and pool1 disks is changed to the
disaster recovery (DR) partner. This ownership change prevents the aggregates from going into a degraded
state after the switchover.
After the automatic switchover, you must manually proceed through the healing and switchback operations to
return the controller to normal operation.
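The healing and switchback sequence is driven from the ONTAP CLI on the surviving cluster. The sketch below is a minimal illustration that simply issues the documented command sequence (heal the data aggregates, heal the root aggregates, then switch back) over SSH with paramiko. The cluster address and credentials are hypothetical, the exact heal phases required depend on the MetroCluster type and ONTAP release, and in practice each phase should be verified before the next one is started.

import paramiko

# Hypothetical management LIF and credentials for the surviving cluster.
SURVIVING_CLUSTER = "cluster-b.example.com"
USERNAME = "admin"
PASSWORD = "changeme"

# Documented ONTAP command sequence to return to normal operation after a switchover.
HEAL_AND_SWITCHBACK = [
    "metrocluster heal -phase aggregates",
    "metrocluster heal -phase root-aggregates",
    "metrocluster switchback",
    "metrocluster operation show",  # confirm the result of the last operation
]

def run(ssh, command):
    """Run one ONTAP CLI command and return its output, raising on CLI errors."""
    _, stdout, stderr = ssh.exec_command(command)
    output, errors = stdout.read().decode(), stderr.read().decode()
    if errors:
        raise RuntimeError(f"{command!r} failed: {errors}")
    return output

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect(SURVIVING_CLUSTER, username=USERNAME, password=PASSWORD)
try:
    for cmd in HEAL_AND_SWITCHBACK:
        print(f"==> {cmd}")
        print(run(client, cmd))
finally:
    client.close()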
Configuration Requirements
MetroCluster is a combined hardware and software solution, with deployment options ranging from two to
eight nodes. Specific hardware is required to create the shared storage fabric and the inter-site links; consult
the NetApp Interoperability Matrix Tool (IMT) and the NetApp Hardware Universe for supported hardware. On the
software side, MetroCluster is completely integrated into Data ONTAP: no additional tools or licenses are required
to use MetroCluster. After the MetroCluster relationships are established, data and configuration are replicated
between the sites automatically and continuously; no manual effort is required to establish replication for
newly provisioned workloads. This not only reduces administrative effort, it also eliminates the
possibility of forgetting to replicate critical workloads.
These requirements must be satisfied to support this configuration:
- The maximum distance between the two sites is 300 km for MetroCluster FC and 700 km for MetroCluster IP.
- The maximum round-trip latency for Ethernet networks between the two sites must be less than 10 ms.
- The storage network must provide a minimum of 4 Gbps throughput between the two sites for ISL connectivity; refer to the NetApp IMT for specific controller and minimum ISL requirements.
- ESXi hosts in the vMSC configuration should be configured with at least two different IP networks: one for storage, and the other for management and virtual machine traffic. The storage network handles NFS and iSCSI traffic between the ESXi hosts and the NetApp controllers. The second network (VM Network) supports virtual machine traffic as well as management functions for the ESXi hosts. End users can choose to configure additional networks for other functionality, such as vMotion and Fault Tolerance; VMware recommends this as a best practice, but it is not a strict requirement for a vMSC configuration.
- FC switches are used for vMSC configurations where datastores are accessed through the FC protocol, with ESXi management traffic on an IP network. End users can choose to configure additional networks for other functionality, such as vMotion and Fault Tolerance; this is recommended as a best practice but is not a strict requirement for a vMSC configuration.
- For NFS/iSCSI configurations, a minimum of two uplinks per controller must be used, and an interface group (ifgroup) should be created from the two uplinks in a multimode configuration (see the provisioning sketch after this list).
- The VMware datastores and NFS volumes configured for the ESXi hosts must be provisioned on mirrored aggregates.
- vCenter Server must be able to connect to the ESXi hosts at both sites.
- An HA cluster must not exceed 32 hosts.
- There is no specific license for MetroCluster functionality, including SyncMirror; it is included in the basic Data ONTAP license. Protocols and other features such as SnapMirror require licenses if used in the cluster. Licenses must be symmetrical across both sites; for example, SMB, if used, must be licensed in both clusters. Switchover does not work unless both sites have the same licenses.
- All nodes should be licensed for the same node-locked features.
- All four nodes in a DR group must be the same AFF or FAS model (for example, four AFF400 or four FAS9000 controllers). Mixing FAS and FlexArray controllers (even of the same model number) in the same MetroCluster DR group is not supported.
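As referenced in the ifgroup requirement above, the following is a minimal sketch of how the interface group and a mirrored data aggregate might be provisioned from the ONTAP CLI over SSH. The node name, port names, aggregate name, disk count, and the assumption of key-based SSH access are hypothetical; consult the ONTAP network management and MetroCluster documentation for the options appropriate to your controllers.

import subprocess

# Hypothetical cluster management LIF and node; assumes key-based SSH access.
CLUSTER_MGMT = "admin@cluster-a.example.com"
NODE = "cluster-a-01"

# Build a multimode interface group from two uplinks, then create a data
# aggregate with SyncMirror (-mirror true) so datastores land on a mirrored aggregate.
commands = [
    f"network port ifgrp create -node {NODE} -ifgrp a0a -distr-func ip -mode multimode",
    f"network port ifgrp add-port -node {NODE} -ifgrp a0a -port e0c",
    f"network port ifgrp add-port -node {NODE} -ifgrp a0a -port e0d",
    f"storage aggregate create -aggregate aggr_vmsc_data -node {NODE} -diskcount 10 -mirror true",
]

for cmd in commands:
    print(f"==> {cmd}")
    subprocess.run(["ssh", CLUSTER_MGMT, cmd], check=True)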
Notes:
- A MetroCluster TieBreaker or ONTAP Mediator should be deployed in a third location, and it must be able to access the storage controllers in Site A and Site B to initiate an HA takeover event in case of an entire site failure.
- For more information on NetApp MetroCluster design and implementation, see the MetroCluster configuration section of the ONTAP 9 Documentation Center.
- For information on deploying vSphere 6 on NetApp MetroCluster, see vSphere 6 on NetApp MetroCluster 8.3.
- For information about NetApp in a vSphere environment, see NetApp Storage Best Practices for VMware vSphere.
- For more information on MetroCluster, see the following technical reports:
  - TR-4705: NetApp MetroCluster Solution Architecture and Design
  - TR-4375: NetApp MetroCluster FC
  - TR-4689: MetroCluster IP Solution Architecture and Design

Note: The preceding links were correct as of August 3, 2017. If you find a broken link, provide feedback and a VMware employee will update it.
Solution Overview
MetroCluster configurations protect data by using two physically separated, mirrored clusters, separated by a
distance of up to 700 km. Each cluster synchronously mirrors the root and data aggregates. When a disaster
occurs at one site, an administrator can trigger a failover and begin serving the mirrored data from the
surviving site. In Data ONTAP 9.2, MetroCluster consists of a two-node HA pair at each site, allowing the
majority of planned and unplanned events to be handled by a simple failover and giveback within the local
cluster. Only in the event of a disaster (or for testing purposes) is a full switchover to the other site required.
The switchover and the corresponding switchback operations transfer the entire clustered workload between the
sites.
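Before relying on switchover, administrators typically verify the health of the MetroCluster configuration. The sketch below is a minimal illustration that runs the standard ONTAP verification commands (metrocluster show, metrocluster check run, and metrocluster check show) over SSH; the cluster address is a hypothetical placeholder and key-based SSH access is assumed.

import subprocess

# Hypothetical cluster management LIF; assumes key-based SSH access for the admin user.
CLUSTER_MGMT = "admin@cluster-a.example.com"

# Standard ONTAP commands for verifying MetroCluster configuration health.
for cmd in ["metrocluster show", "metrocluster check run", "metrocluster check show"]:
    print(f"==> {cmd}")
    result = subprocess.run(["ssh", CLUSTER_MGMT, cmd], capture_output=True, text=True, check=True)
    print(result.stdout)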
This figure shows the basic MetroCluster configuration. There are two data centers, A and B, separated by a
distance of up to 300 km, with ISLs running over dedicated fibre links. Each site has a cluster in place, consisting
of two nodes in an HA pair. In this example, cluster A at site A consists of nodes A1 and A2, and cluster B at site
B consists of nodes B1 and B2. The two clusters and sites are connected through two separate networks, which
provide the replication transport. The cluster peering network is an IP network used to replicate cluster
configuration information between the sites. The shared storage fabric is an FC connection and is used for
storage and NVRAM synchronous replication between the two clusters. All storage is visible to all controllers
through the shared storage fabric.
Note: This illustration is a simplified representation and does not indicate the redundant front-end components,
such as Ethernet and fibre channel switches.
The vMSC configuration used in this certification program was configured in Uniform Host Access mode.
In this configuration, the ESXi hosts at each site are configured to access storage from both sites.
In cases where RDMs are configured for virtual machines residing on NFS volumes, a separate LUN must be
configured to hold the RDM mapping files. Ensure that this LUN is presented to all of the ESXi hosts.
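In a uniform host access layout, each ESXi host should see paths to the datastore devices through both the local and the remote controllers. The following is a minimal sketch, with hypothetical host names and credentials, that collects the output of "esxcli storage nmp device list" from each host over SSH so that the working paths per device can be reviewed.

import paramiko

# Hypothetical ESXi host names; include hosts from both sites.
ESXI_HOSTS = ["esxi-site-a-01.example.com", "esxi-site-b-01.example.com"]

def device_path_report(host):
    """Return the NMP device and path listing from one ESXi host."""
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(host, username="root", password="changeme")
    try:
        # Lists each storage device with its path selection policy and working paths.
        _, stdout, _ = ssh.exec_command("esxcli storage nmp device list")
        return stdout.read().decode()
    finally:
        ssh.close()

for host in ESXI_HOSTS:
    print(f"===== {host} =====")
    print(device_path_report(host))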
vMSC test scenarios
This table outlines vMSC test scenarios:
Scenario: Controller single path failure
NetApp Controllers Behavior: Controller path failover occurs. All LUNs and volumes remain connected. For FC datastores, path failover is triggered from the host and the next available path to the same controller becomes active. All ESXi iSCSI/NFS sessions remain active in multimode configurations of two or more network interfaces.
VMware HA Behavior: No impact

Scenario: ESXi single storage path failure
NetApp Controllers Behavior: No impact on LUN and volume availability. The ESXi storage path fails over to the alternative path. All sessions remain active.
VMware HA Behavior: No impact

Scenario: Site 1 or Site 2 single storage node failure
NetApp Controllers Behavior: Because there is an HA pair at each site, a failure of one node transparently and automatically triggers failover to the other node.
VMware HA Behavior: No impact

Scenario: MCTB (MetroCluster Tiebreaker) VM failure
NetApp Controllers Behavior: No impact on LUN and volume availability. All sessions remain active.
VMware HA Behavior: No impact

Scenario: MCTB VM single link failure
NetApp Controllers Behavior: No impact. Controllers continue to function normally.
VMware HA Behavior: No impact

Scenario: Complete Site 1 failure, including ESXi and controller
NetApp Controllers Behavior: In the case of a site-wide issue, the MetroCluster switchover operation allows immediate resumption of service by moving storage and client access from the Site 1 cluster to Site 2. The Site 2 partner nodes begin serving data from the mirrored plexes and the sync destination Storage Virtual Machine (SVM).
VMware HA Behavior: Virtual machines on failed Site 1 ESXi nodes fail. HA restarts the failed virtual machines on ESXi hosts at Site 2.

Scenario: Complete Site 2 failure, including ESXi and controller
NetApp Controllers Behavior: In the case of a site-wide issue, the MetroCluster switchover operation allows immediate resumption of service by moving storage and client access from the Site 2 cluster to Site 1. The Site 1 partner nodes begin serving data from the mirrored plexes and the sync destination Storage Virtual Machine (SVM).
VMware HA Behavior: Virtual machines on failed Site 2 ESXi nodes fail. HA restarts the failed virtual machines on ESXi hosts at Site 1.

Scenario: Single ESXi failure (shutdown)
NetApp Controllers Behavior: No impact. Controllers continue to function normally.
VMware HA Behavior: Virtual machines on the failed ESXi node fail. HA restarts the failed virtual machines on the surviving ESXi hosts.

Scenario: Multiple ESXi host management network failure
NetApp Controllers Behavior: No impact. Controllers continue to function normally.
VMware HA Behavior: A new Primary is selected within the network partition. Virtual machines remain running; there is no need to restart virtual machines.

Scenario: Site 1 and Site 2 simultaneous failure (shutdown) and restoration
NetApp Controllers Behavior: Controllers boot up and resync. All LUNs and volumes become available. All iSCSI sessions and FC paths to ESXi hosts are re-established and virtual machines are restarted successfully. As a best practice, NetApp controllers should be powered on first and allowed to make the LUNs/volumes available before the ESXi hosts are powered on.
VMware HA Behavior: No impact

Scenario: ESXi management network all ISL links failure
NetApp Controllers Behavior: No impact to controllers. LUNs and volumes remain available.
VMware HA Behavior: Partitioned hosts at the site that does not have a Fault Domain Manager elect a new Primary. If the HA host isolation response is set to Leave Powered On, virtual machines at each site continue to run because the storage heartbeat is still active.

Scenario: All storage ISL links failure
NetApp Controllers Behavior: No impact to controllers. LUNs and volumes remain available. When the ISL links are back online, the aggregates resync.
VMware HA Behavior: No impact

Scenario: System Manager management server failure
NetApp Controllers Behavior: No impact. Controllers continue to function normally. NetApp controllers can be managed using the command line.
VMware HA Behavior: No impact

Scenario: vCenter Server failure
NetApp Controllers Behavior: No impact. Controllers continue to function normally.
VMware HA Behavior: No impact on HA. However, the DRS rules cannot be applied.
Related Information
vSphere Storage DRS or vSphere Storage I/O Control support in a vSphere Metro Storage Cluster environment
VMware vSphere 5.x and 6.0 support with NetApp MetroCluster (Japanese)
VMware vSphere 5.x and 6.x support with NetApp MetroCluster (Simplified Chinese)