
VNX Unified Storage Solutions Design - Student Guide

VNX Unified Storage Solutions
Design - Student Guide
Education Services
October 2012
Welcome to VNX Unified Storage Solutions Design.
Copyright © 1996, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012 EMC Corporation. All Rights
Reserved. EMC believes the information in this publication is accurate as of its publication date. The information is subject to
change without notice.
THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” EMC CORPORATION MAKES NO REPRESENTATIONS OR WARRANTIES
OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Use, copying, and distribution of any EMC software described in this publication requires an applicable software license.
EMC2, EMC, Data Domain, RSA, EMC Centera, EMC ControlCenter, EMC LifeLine, EMC OnCourse, EMC Proven, EMC Snap, EMC
SourceOne, EMC Storage Administrator, Acartus, Access Logix, AdvantEdge, AlphaStor, ApplicationXtender, ArchiveXtender, Atmos,
Authentica, Authentic Problems, Automated Resource Manager, AutoStart, AutoSwap, AVALONidm, Avamar, Captiva, Catalog
Solution, C-Clip, Celerra, Celerra Replicator, Centera, CenterStage, CentraStar, ClaimPack, ClaimsEditor, CLARiiON, ClientPak,
Codebook Correlation Technology, Common Information Model, Configuration Intelligence, Configuresoft, Connectrix, CopyCross,
CopyPoint, Dantz, DatabaseXtender, Direct Matrix Architecture, DiskXtender, DiskXtender 2000, Document Sciences, Documentum,
elnput, E-Lab, EmailXaminer, EmailXtender, Enginuity, eRoom, Event Explorer, FarPoint, FirstPass, FLARE, FormWare, Geosynchrony,
Global File Virtualization, Graphic Visualization, Greenplum, HighRoad, HomeBase, InfoMover, Infoscape, Infra, InputAccel,
InputAccel Express, Invista, Ionix, ISIS, Max Retriever, MediaStor, MirrorView, Navisphere, NetWorker, nLayers, OnAlert, OpenScale,
PixTools, Powerlink, PowerPath, PowerSnap, QuickScan, Rainfinity, RepliCare, RepliStor, ResourcePak, Retrospect, RSA, the RSA
logo, SafeLine, SAN Advisor, SAN Copy, SAN Manager, Smarts, SnapImage, SnapSure, SnapView, SRDF, StorageScope, SupportMate,
SymmAPI, SymmEnabler, Symmetrix, Symmetrix DMX, Symmetrix VMAX, TimeFinder, UltraFlex, UltraPoint, UltraScale, Unisphere,
VMAX, Vblock, Viewlets, Virtual Matrix, Virtual Matrix Architecture, Virtual Provisioning, VisualSAN, VisualSRM, Voyence, VPLEX,
VSAM-Assist, WebXtender, xPression, xPresso, YottaYotta, the EMC logo, and where information lives, are registered trademarks or
trademarks of EMC Corporation in the United States and other countries.
All other trademarks used herein are the property of their respective owners.
© Copyright 2012 EMC Corporation. All rights reserved. Published in the USA.
Revision Date: October 2012
Revision Number: MR-7CP-VNXUNISDTA 5.32/7.1 v1.5
Copyright © 2012 EMC Corporation. All rights reserved
Course Introduction
1
This slide gives an overview of the course, the audience at whom it is aimed, and pre-requisite
courses.
Copyright © 2012 EMC Corporation. All rights reserved
Course Introduction
2
Upon completion of this course, you should be able to gather relevant information, analyze it, and
use the result of that analysis to design a VNX solution.
Copyright © 2012 EMC Corporation. All rights reserved
Course Introduction
3
This is the agenda for Day 1 of the class.
Copyright © 2012 EMC Corporation. All rights reserved
Course Introduction
4
This is the agenda for Days 2 and 3 of the class.
Copyright © 2012 EMC Corporation. All rights reserved
Course Introduction
5
This is the agenda for Day 4 and Day 5 of the class.
Copyright © 2012 EMC Corporation. All rights reserved
Course Introduction
6
This module focuses on gathering the proper data in preparation for designing a VNX solution.
Copyright © 2012 EMC Corporation. All rights reserved
Module 1: Data Gathering
1
This lesson covers the theory and methodology of documenting the current hardware environment
and requirements prior to the design phase.
Copyright © 2012 EMC Corporation. All rights reserved
Module 1: Data Gathering
2
To model and validate that a storage solution meets the stated business requirements, supporting capacity, connectivity, and performance data must be collected from the environment and modeled using certified toolsets. The modeling of this data provides the criteria needed to validate a solution. Without first gathering the appropriate data, there is no credible basis for claiming that a solution adequately meets the customer’s business requirements. The type of data and the collection methods vary with the solution.
Copyright © 2012 EMC Corporation. All rights reserved
Module 1: Data Gathering
3
As we look at designing VNX storage solutions, it is important to understand the customer’s environment. Knowing the applications in use, how the data is accessed, what data (and how much of it) needs to be replicated, which logical volumes and physical spindles it resides on, and what the replication requirements are is critical to designing VNX storage solutions.
Copyright © 2012 EMC Corporation. All rights reserved
Module 1: Data Gathering
4
Once we understand which devices make up the environment, we need to look at how things are
utilized across the servers, storage subsystems, SAN and WAN. We need to understand application,
storage, and network performance. If replication is used, current bandwidth and latency are key
considerations for future designs.
Copyright © 2012 EMC Corporation. All rights reserved
Module 1: Data Gathering
5
Gathering performance data during peak workloads is critical to ensure proper sizing of the environment. To focus on the relevant subset of the hour, day, or week, we ask the customer to identify the peak production times and then confirm them against host- or storage-side production statistics. Properly determining collection intervals is also critical: smaller intervals are preferred over larger ones, provided the collection does not affect the production environment.
Copyright © 2012 EMC Corporation. All rights reserved
Module 1: Data Gathering
6
Hosts that share logical volumes, or whose logical volumes share the same physical spindles, may unknowingly contend for the same resources and degrade each other’s performance. Performance analysis should be performed at both the host and the storage level: storage-side statistics may not show host-side issues, and host-side statistics may not identify storage hot spots. Determining current hot spots will help prevent hot spots in the VNX environment.
Copyright © 2012 EMC Corporation. All rights reserved
Module 1: Data Gathering
7
All proposed or recommended configurations and solutions must come from actual customer data
points. If we don’t have the data, we can’t make a calculated recommendation.
If there are changes or if the analysis is outdated we need to revisit the configuration to make sure
the proposed configuration still meets the customer’s objectives.
Copyright © 2012 EMC Corporation. All rights reserved
Module 1: Data Gathering
8
Most IT environments are dynamic. Changes in the environment can occur frequently and unless
they are taken into account, the final solution may be based on stale data. Care must be taken to
work with all parties to ensure that any changes are accounted for in the final solution.
Implementing properly timed freezes will help ensure successful implementations/migrations.
An example of this would be a migration scheduled for August 28. Say we gather the data on August 1 and design the future environment with this data. If on August 20 there is an addition to the current environment that is not accounted for in the future environment design, the migration attempted on August 28 will fail.
Copyright © 2012 EMC Corporation. All rights reserved
Module 1: Data Gathering
9
Verifying the scope of the project will ensure that all the proper data is collected and no time will
be wasted on resources outside of the scope. Be sure to cover the servers, network, storage, applications, backup and recovery, replication, and any other in-scope resources.
It is also important to identify high profile assets. Planning the solution for high profile data needs
special care. The applications can be very sensitive to performance and the business may be
significantly affected if these applications perform poorly.
After data collection a thorough analysis needs to be performed. Current and final location of data,
network architecture, backup and recovery, and replication requirements all need to be considered.
After analysis of the current environment, conversation with the Customer is critical to ensure that
unnecessary data is not kept and that growth will be accommodated. The final design can then be
created and verified with the solution. Implementation and validation need to be performed before
completion of the project.
Copyright © 2012 EMC Corporation. All rights reserved
Module 1: Data Gathering
10
It is very important to determine the final list of servers as early as possible in the project. Scope creep can greatly affect the timeliness and efficiency of a project. Determining all servers in scope of the project should be one of the first tasks in designing the solution. Gathering EMC Grab (emcgrab) output from those servers is the next step; the data gathering tools are discussed in the next module. It is critical to account for all current storage and storage requirements and to ensure that no needed data is left behind in the case of migrations.
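As a purely illustrative sketch (the kit file name shown is a placeholder, and the script switches vary by EMC Grab version and host OS, so consult the README shipped with the kit), a UNIX host collection typically looks like this:

    # Copy the EMC Grab kit to the host, extract it, and run the collection script
    tar -xf emcgrab_Linux_v4.x.tar    # placeholder archive name
    cd emcgrab
    ./emcgrab.sh                      # add -autoexec, if your version supports it, to skip the interactive prompts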
Copyright © 2012 EMC Corporation. All rights reserved
Module 1: Data Gathering
11
Gathering the current storage configuration is also one of the first steps in any design. Tools vary
depending on the current storage platform. Some of these tools will be discussed in the next
module. Once the data is gathered, correlate this information with the emcgrab information. This
will help ensure that no storage is unaccounted for. Discussions with the Customer will also
determine whether additional storage needs to be allocated for growth.
Copyright © 2012 EMC Corporation. All rights reserved
Module 1: Data Gathering
12
VNX storage is accessed via Block and/or File protocols. A clear distinction should be made between the storage used for each protocol. Proper design requires different performance considerations for File and Block data, as discussed later in this course.
Gathering File storage data means including the underlying storage, the connectivity between the
NAS servers/Data Movers and the backend storage, as well as the network shares and NFS exports
information. For environments with Active Directory, LDAP domains, and NIS, these environmental
details must be documented and incorporated in the final design.
Copyright © 2012 EMC Corporation. All rights reserved
Module 1: Data Gathering
13
Connectivity between hosts, storage, and remote sites is critical to a proper design. Ethernet
networks are often used for the management of the environment, iSCSI, CIFS, and NFS traffic. Fibre
Channel (FC) networks will carry the Fibre Channel data. Converged networks will account for both
Ethernet and FC data. Documenting converged networks still requires the same due diligence in
gathering both the Ethernet and FC configurations. VLANs, VSANs, and any special networking
should be noted and incorporated into the final design. Proper design of remote replication
solutions often requires a network assessment to determine bandwidth, latency, and packet loss. If
this is required for the environment it is good to schedule this early in the project.
Creating a diagram of the current configuration will help ensure a good understanding of the
environment and is a good first step to designing a network solution.
Copyright © 2012 EMC Corporation. All rights reserved
Module 1: Data Gathering
14
Backups are an often overlooked part of VNX designs. Determining requirements and discussing
any current issues is critical to a successful design. Double check whether backup configuration is in
scope of the project as this can be contentious.
Archiving may also be part of the scope. Setting up proper connectivity and implementing the desired retention policies requires that all requirements be gathered and documented.
Designing replication can be simple or complicated depending on the requirements. All volumes needing replication must be accounted for. Consistency requirements among volumes need to be documented and designed for, and replication methodologies must be chosen to properly meet the RTOs and RPOs.
Copyright © 2012 EMC Corporation. All rights reserved
Module 1: Data Gathering
15
E-Lab Advisor provides a simple method of processing the gathered configuration data. The outputs can be used for design planning, migration plans, and remediation.
Copyright © 2012 EMC Corporation. All rights reserved
Module 1: Data Gathering
16
E-Lab Advisor is a web based tool which combines the functionality of the HEAT, SWAT, Celerra
Health Check, SYMAPI Log Analyzer, and SANsummary tools. HTML or PDF formatted reports can
be produced via the upload of data collections (EMCGrab for UNIX, EMCREPORTS for Windows,
supportshow for Brocade, show tech-support details for Cisco, product information or Connectrix
data collection for McData). These reports can be produced in a few seconds (excluding upload
time) saving an average of two hours per report compared to the previous methods.
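For reference, the switch-side collections named above are captured from the switch CLI, with the session output saved to a text file for upload to E-Lab Advisor; for example:

    # Brocade (FOS) switch
    supportshow
    # Cisco MDS switch
    show tech-support details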
E-Lab Advisor is a complete rewrite of a number of decommissioned tools. Existing tool
functionality was retained and new functionality was added.
E-Lab Advisor allows users to:
• Automate high level health checks for host and fibre switch environments
• Compare actual host configuration against EMC e-Lab Support Matrix (ESM)
• Compare actual host configuration against EMC Simple Support Matrix (ESSM)
• Send XML file to CCA5 for Windows, Solaris and HP-UX environments
• Generate SANsummary XLS reports to document customer environments
• Generate SANsummary XLS reports for host and switch import into GSSD tools
E-Lab Advisor provides dashboard functionality which allows the user to have a high level view of
the severity of issues before opening a report. SANsummary reporting has been integrated into the
E-Lab Advisor web application, removing the need to maintain locally installed software. By using
SANsummary's Excel formatted spreadsheets, users can also import hosts and fibre channel switch
data into EMC Migration Planner (EMP) and Networked Storage Designer (NSD-U), thereby
reducing manual data entry. (Continued)
Copyright © 2012 EMC Corporation. All rights reserved
Module 1: Data Gathering
17
Major benefits for EMC technical personnel include:
• Standardized reports for customer deliverables
• Automation for E-Lab Navigator supportability checks
• Simplification via a single integrated utility
Major benefits for customers include:
• Improved Web Interface merges multiple existing tools/URLs into a single interface
• Standardized formatting for all reports allows cut and paste from multiple reports without
having to reformat, saving remediation time
• Single upload page simplifies the end-user experience
https://elabadvisor.emc.com
Copyright © 2012 EMC Corporation. All rights reserved
Module 1: Data Gathering
18
Any necessary remediation of the environment should be communicated to the customer as early
as possible. Remediation, such as switch code upgrades, can be difficult to schedule and can
lengthen a project timeline if not quickly addressed. The output of E-Lab Advisor should help in producing the remediation document to give to the customer.
Copyright © 2012 EMC Corporation. All rights reserved
Module 1: Data Gathering
19
This lesson covers interviewing key personnel.
Copyright © 2012 EMC Corporation. All rights reserved
Module 1: Data Gathering
20
Covering all bases requires discussion with all vested parties. Anyone who will use, manage,
support, and base business decisions on the VNX storage environment should be interviewed to
ensure that all aspects of storage design have been accounted for.
People and companies do not generally seek out technology; they seek out solutions to
their needs or pain points. Technology for the sake of technology is not a great marketing
strategy. When customers buy from us, they are really buying outcomes, feelings, results,
and solutions.
In many cases, personnel at the CxO level will view objectives from a business point of view, while administrators are likely to be more technical in nature. It is important to bear in mind that rivalries
often exist between various disciplines, and that employees may regard certain procedures as
being part of their exclusive responsibility. They may work actively to preserve what they perceive
as their personal domain, and may be hostile to, or even try to sabotage, attempts to move
responsibilities to other areas.
Recent EMC software, e.g. the plugins for VMware, allows server or virtualization administrators to provision their own storage, previously the sole preserve of storage administrators.
Copyright © 2012 EMC Corporation. All rights reserved
Module 1: Data Gathering
21
IT often involves both intangible and complex sales. Solutions are often invisible—they run in the
background, the hardware is seldom seen, and the majority of personnel may not even know they
are there.
Storage is a vital part of all solutions, so most customers are, to some extent, in the high
technology and storage business.
People from vastly different roles are involved in the purchase of high tech solutions. For instance,
if you were selling storage management software to a customer, the following personnel could be
involved in the decision making and evaluation process:
• CEO, CIO, CTO roles
• Finance/Accounting
• The training department that will train users
• IT personnel who will help integrate and support the software
Each of the roles needs to have the solution explained differently:
• CxO roles want to know the big picture, the bottom-line and ROI. They will not want
technical details; they should not be asked questions related to those details.
• Finance and accounting will be concerned about capital costs, warranties, process
management, risk and on-going fees.
• The training department will want to know about implementation, training tools and
materials, and how it will impact their ability to educate the users.
• The IT department will want to know specifics on scalability, ease of use, and many other
technical issues.
Copyright © 2012 EMC Corporation. All rights reserved
Module 1: Data Gathering
22
Most people are comfortable communicating with one or two of these types of stakeholders. Good
architects and salespeople know how to address the needs of each of these groups, and
communicate the benefits in a way that is unique and pertinent to each stakeholder type.
There are several other important attributes as well:
Knowing the customer market
Understanding the customer market and niche is important. This includes the customer’s unique
circumstances, competitive environment, and business processes. If this knowledge is at hand, it is
possible to easily identify the core business challenges our solution can address.
Knowing the specific customer
Each customer will have unique business challenges and processes that need the support of
technology in a special way.
Factors that affect the solution needed will depend on their stage of business growth, existing
business processes, corporate goals, immediate and long term pain points, as well as management
and operations philosophy.
Copyright © 2012 EMC Corporation. All rights reserved
Module 1: Data Gathering
23
Be aware of the solution strengths and limitations
In order to help the customer and become a true resource and problem solver, we must
understand the weaknesses and limitations of the solution in addition to its strengths and
capabilities.
Customers may have unreasonable or ill-defined expectations — by understanding our limitations
and communicating them effectively we can dispel any misconceptions. Small misunderstandings
at the beginning of a project may lead to major dissatisfaction later. On the other hand, by being
aware of the benefits of the solution we maximize revenues and customer satisfaction.
Aim to solve the customer problem
This is an integral part of selling technology. New technologies often stem from problems for which
there is no solution.
Become a trusted advisor
Being seen as a trusted advisor can be a real advantage. To be seen as a consultant, one needs
familiarity with needs analysis, must be a subject matter expert, have a high level of rapport with
customers, and look for ideal solutions — not the boilerplate solution to client problems.
Most large high-tech deals require a number of people to help close the sale. Technical and support
staff may need to interact with key staff in the customer company. Once the sale is closed, there
will be a need to continue to monitor these interactions to ensure fulfillment of commitments
made to the customer.
Copyright © 2012 EMC Corporation. All rights reserved
Module 1: Data Gathering
24
Costs difficult to justify: Some ideas may work out to be too expensive. The only solution may be to add a touch of realism to make the solution affordable.
Implementation: How will it work in practice? How will it interface with existing business
processes? Is it new and untested?
Benefits difficult to measure: New ideas often demand new skills and new thinking. That can make
costing them difficult. Training and education costs may also have to be considered.
Risky solutions: New ideas often depend on new and possibly untested technologies. This will
lengthen the implementation time, which delays the benefits and increases the costs. The
customer may also want performance and operational guarantees which could add cost to the
solution.
Copyright © 2012 EMC Corporation. All rights reserved
Module 1: Data Gathering
25
This lesson covers documenting expectations.
Copyright © 2012 EMC Corporation. All rights reserved
Module 1: Data Gathering
26
Use server and storage data along with the information learned from interviewing key personnel to
document the capacity expectations.
Capacity should be documented on a per LUN basis. Final LUN size should be the current capacity
plus any growth the Customer wants to build into the design. Growth can also be accomplished by
adding brand new LUNs from the new storage. The application, environment, and migration
methods will all play a part in determining whether growth will be accomplished by growing
current LUNs or adding new LUNs to the server(s).
Expandability is an important design criterion and should be discussed with all key personnel. The ability to grow and shrink data usage will significantly affect Customer satisfaction. The network is also an important area to consider for growth: new servers will require switch ports and bandwidth. Documenting current headroom in the environment and working with the Customer to determine future growth will help determine the final design.
Determine performance requirements on a server-by-server basis. If performance needs to be
increased, clearly document this so that the final solution will meet the expected goals. Often the
performance requirement is for the final performance to be equal or better than the current
performance. As such, ensure good baselines of current performance. By comparing the current
performance to the final performance, there will be a clear indication of whether the performance
expectations were met. If the Customer has baselining tools, these are good methods of gathering
application level performance.
Copyright © 2012 EMC Corporation. All rights reserved
Module 1: Data Gathering
27
Availability expectations will determine the storage, network, and server design. Ensure that all
parties have indicated their availability needs on a storage, server, network, and application level.
Security is also an important part of VNX design. Clearly documenting the level of security required
will allow proper design of administration and support access to the environment.
Management expectations are also critical for Customer satisfaction. Determining who will manage the devices, from where, and what access they will need enables the proper setup of the management environment.
Copyright © 2012 EMC Corporation. All rights reserved
Module 1: Data Gathering
28
This lesson covers how user data will be accessed.
Copyright © 2012 EMC Corporation. All rights reserved
Module 1: Data Gathering
29
When designing migrations, interviews with key personnel should include discussions of data
access methods. Often the data access method will be maintained after the migration. If a change
in access method is required, care must be taken to ensure proper connectivity and performance.
An example of changing access methods would be a migration from a CX3 to a VNX array where the servers originally used Fibre Channel to access the CX3 and will use FCoE to access the VNX. For new implementations, data access methods must be determined through discussions with the Customer.
Copyright © 2012 EMC Corporation. All rights reserved
Module 1: Data Gathering
30
This lesson covers determining how data is protected.
Copyright © 2012 EMC Corporation. All rights reserved
Module 1: Data Gathering
31
For environments with data archiving, primary and secondary storage must be properly designed
and configured for smooth operations. Gathering current archiving information plus any desired
changes will start the archiving design portion. Connectivity between primary and secondary
storage must account for redundancy and bandwidth requirements. Migrating current archived
data may or may not require recalling the archived data before the migration. Check the
documentation of the specific platforms to determine migration paths and any special migration
considerations. Shown here is an example architecture for EMC CTA archiving to an EMC Centera or Atmos system.
Copyright © 2012 EMC Corporation. All rights reserved
Module 1: Data Gathering
32
Although the VNX does not provide backup software, it is important to consider the backup and restore environment when designing a VNX solution. Physical spindles may need to be dedicated to backup servers or archival locations. Software such as SnapView may need to be considered to offload the backup workload to dedicated backup servers. Replication is also a means of protecting information by providing remote-site copies of the data. The VNX series offers all the replication methodologies needed to keep data available and secure.
These include VNX SnapView for block and VNX SnapSure for file for local protection; VNX SAN Copy, offering both push and pull, incremental and full, local and remote copies from and to VNX, CX, and NS Series arrays; VNX MirrorView, both Synchronous and Asynchronous, from and to VNX, CX, and NS Series arrays; and VNX Replicator for file system level replication from and to VNX and NS Series arrays. VNX Replicator is capable of 1-to-1, 1-to-many, many-to-1, and cascading replications.
Replicated copies managed by RecoverPoint/SE or other standard EMC replication utilities are not
only vital for DR and data protection solutions, they can also be very useful for application testing,
offloading backups, reporting, and performing software upgrade tests. Replication Manager fully
integrates with RecoverPoint/SE and these replication utilities to enable automated, application
consistent copies. Combine this with the visibility provided by Data Protection Advisor for
Replication to monitor and alert on recovery gaps for remediation. One can also prove compliance
to protection policies with integrated reporting.
Note: Data Protection Advisor does not support VNX Replicator or SnapSure.
Gathering current backup, archiving, and replication methodologies along with discussions with key
personnel will provide the basis for the final data protection design. Details of designing data
protection will be covered later in this course.
Copyright © 2012 EMC Corporation. All rights reserved
Module 1: Data Gathering
33
The VNX series offers all the replication methodologies needed to keep data available and secure.
Integrated with Unisphere, RecoverPoint/SE offers several data protection strategies.
RecoverPoint/SE Continuous data protection (CDP) enables local protection for block
environments.
RecoverPoint/SE Concurrent local and remote (CLR) replication enables concurrent local and
remote replication for block environments.
RecoverPoint/SE Continuous remote replication (CRR) enables block protection as well as a file-level site DR solution. This enables failover and failback of both block and file data from one VNX to
another. Data can be synchronously or asynchronously replicated. During file failover, one or more
stand-by Data Movers at the DR site come online and take over for the primary location. After the
primary site has been brought online, failback would allow the primary site to resume operations
as normal. Configuration can be active/passive as shown here or active/active where each array is
the DR site for the other. Deduplication and Compression allow for efficient network utilization,
reducing WAN bandwidth by up to 90%, enabling very large amounts of data to be protected
without requiring large WAN bandwidth. RecoverPoint/SE is also integrated with both vCenter and
Site Recovery Manager (SRM). SRM integration is block only.
Copyright © 2012 EMC Corporation. All rights reserved
Module 1: Data Gathering
34
Copyright © 2012 EMC Corporation. All rights reserved
Module 1: Data Gathering
35
Copyright © 2012 EMC Corporation. All rights reserved
Module 1: Data Gathering
36
Copyright © 2012 EMC Corporation. All rights reserved
Module 1: Data Gathering
37
Copyright © 2012 EMC Corporation. All rights reserved
Module 1: Data Gathering
38
This module covered gathering data to prepare for the analysis phase of a VNX design.
Copyright © 2012 EMC Corporation. All rights reserved
Module 1: Data Gathering
39
This module focuses on tools used to gather data and validate an environment.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
1
This lesson covers VNX utilities and host utilities used to gather data.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
2
These are the VNX utilities.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
3
The Unisphere GUI displays real-time information about a number of object types. This information
cannot be saved to disk, and is displayed in text form on the dialogs where it appears. These statistics
are only displayed if Statistics Logging is enabled at the storage system level, and are accessible
through the Properties page of the object (or the Status page for Incremental SAN Copy).
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
4
Most of these statistics have names that are self-explanatory. Some additional notes follow:
Cache Hit Ratios are calculated values that divide the number of cache hits by the number of reads or
writes. They are a measure of how well the cache subsystem is performing.
The Number of Unused Blocks Prefetched is a measure of how efficient prefetches are. Lower
numbers are better.
The Write Cache Flush Ratio is calculated by dividing writes that caused a flush by the total number of
writes – essentially a look at whether or not this LUN is causing forced flushes.
Utilization is a calculated number that determines how busy an object is. This parameter may not
always give an accurate picture; if it is incorrect, it is likely to be too high rather than too low. Note
that a LUN will be regarded as busy if 1 or more of its disks are busy with a task related to that LUN.
Stripe Crossings and Disk Crossings (Analyzer only) indicate that an I/O has spanned 2 or more disks,
or spanned 2 or more stripes. This may be an indication that data is misaligned. See the previous slide
that discussed disk and stripe crossings. Note that the concept of stripe crossings is meaningless for a
Pool LUN, and the option is not displayed.
The Number of Trespass parameter counts the number of times a LUN has been trespassed.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
5
These new LUN parameters show I/O through the owning and non-owning SP for a LUN, and the
number of times that I/O has been rerouted. These parameters are displayed on the LUN Statistics tab
only if Failover Mode 4 (the mode that supports ALUA) is selected.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
6
These statistics are similar to the equivalent statistics for ordinary LUNs. Note that, as with ordinary
LUNs, ALUA statistics will be displayed if the host supports failover mode 4.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
7
Most of these statistics have names which are self-explanatory.
The Number of Write Cache Pages is the number of pages allocated to this SP (initially half of the total
number of write cache pages).
Dirty Pages is a measure of how many write cache pages contain modified data which has not yet
been flushed to disk.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
8
These statistics are self-explanatory. Note that, as is the case with some other counters, utilization
may be incorrect if the disk is being used by both SPs. VNX OE for Block, and its predecessor FLARE,
have used different methods of calculating utilization when objects are used by both SPs; generally, the object will be at least as busy as reported by the SP showing the higher Utilization number.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
9
These statistics are identical to those available from Unisphere Analyzer.
One can determine at a glance what impact the Session will have on the Source LUN by looking at the
reads from the Source LUN. These show additional reads performed as a result of the session running
on the Source LUN. The Writes Larger than Reserved LUN Pool Entry Size measures the number of
writes that exceed 64 kB in size (the size of a chunk).
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
10
Though this information may be regarded as not being true performance information, it does give one
an idea of how much data is being copied, and at what rate the copy is proceeding.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
11
A number of Navisphere Secure CLI commands return the current values of running counters.
These raw values may be massaged into a usable state. Some of those are discussed in following
slides.
Note that, unless the security file has been created, all commands will contain the username,
password and scope.
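For example (a sketch only; the SP address and credentials are placeholders, and switch spelling should be checked against the Secure CLI reference for your release):

    # Without a security file, every command carries credentials and scope
    naviseccli -h 10.127.25.10 -User sysadmin -Password sysadmin -Scope 0 getagent

    # Create a security file once for the current OS user...
    naviseccli -AddUserSecurity -User sysadmin -Password sysadmin -Scope 0

    # ...after which credentials can be omitted
    naviseccli -h 10.127.25.10 getagent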
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
12
Note the ‘Prct Cache Pages Owned’ field – SPA owns 51% of the write cache pages, so has been
marginally busier than SPB. Page assignment, which is checked every 10 minutes, has been
modified to give the busiest SP more of the available cache.
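The output shown on the slide is of the kind returned by the cache query command; as a sketch (placeholder SP address, security file assumed):

    # Display SP cache settings and counters, including Prct Cache Pages Owned
    naviseccli -h 10.127.25.10 getcache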
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
13
The counters shown here, such as Total Reads and Total Writes, are cumulative counters. They can be
reset with a CLI command to enable a clean start.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
14
The getdisk command displays disk information and status, as well as several cumulative counters
related to reads, writes, and errors. The Private line shows the block number at which the LUN begins:
LUN 500, for example, starts at block 69704 – the preceding blocks (~ 34 MB) are used internally by
VNX OE for Block.
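A sketch of the command (the SP address and disk ID are placeholders; disks are addressed in Bus_Enclosure_Disk form):

    # Display status and cumulative read/write/error counters for one disk
    naviseccli -h 10.127.25.10 getdisk 0_0_5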
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
15
The getlun command displays a wealth of information about a LUN or LUNs. Included are the
cumulative counters for reads and writes, broken down by I/O size; these are the same statistics
displayed by Analyzer’s I/O Size Distribution views.
Note that data is not broken down by optimal/non-optimal path.
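A sketch of the command (the SP address and LUN number are placeholders):

    # Display the cumulative counters for LUN 500, including the read/write I/O size breakdown
    naviseccli -h 10.127.25.10 getlun 500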
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
16
Navisphere Secure CLI returns MirrorView and SAN Copy data similar to that returned by Unisphere.
This information falls into the category of status information rather than performance information.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
17
This lesson covers VNX utilities and host utilities used to gather data.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
18
The basics of Unisphere Analyzer should already be familiar to you. In the next few slides we’ll be
looking at the performance parameters monitored by Analyzer, and defining several of them.
Some counters include information broken down by optimal and non-optimal paths. Though the main
object types for performance analysis are the disk, LUN and SP, note that other objects may also be
selected. The RAID Group, host and Storage Group objects allow a subset of the VNX LUNs to be
selected and displayed. Viewing a RAID Group will show cumulative values for the disks and LUNs in
that RAID Group. Counters for SP front-end ports allow more precise determination of where I/O is
going on the SP; these counters include a queue full counter, designed to aid troubleshooting in very
large environments with very busy hosts.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
19
These are the performance parameters that Analyzer will display for a Traditional LUN. Items marked
with an asterisk (*) are only displayed when the Advanced checkbox is checked. Items marked +
display the parameter, and the parameter for the optimal as well as non-optimal paths. The
optimal/non-optimal information is only displayed if the Advanced checkbox is checked.
Some of these parameters have names that are self-explanatory. Those that do not, and those that
have functions that are not obvious, are discussed below.
• Used Prefetches – if any part of the prefetched data is used, it is counted as a used prefetch.
High numbers are better
• Read Cache Hits/s – read requests satisfied from either read cache or write cache, broken
down into Reads from Write Cache/s and Reads from Read Cache/s. The former gives an
indication of how many writes are being re-read; the latter gives an idea of prefetching
efficiency
• Write Cache Hits/s – writes that do not require physical disk access. These may be ‘new’ writes
that did not cause forced flushes, or writes made to data still in cache (rehits)
• Write Cache Rehits/s – writes made to data still in write cache, and which has not yet been
flushed to disk. This gives an idea of the locality of reference of writes
• Average Busy Queue Length – counts the length of the queue, but only when the component is
already busy. This helps to indicate how bursty the I/O is
• Service Time (ms) – time taken to service a request. This does not include the time spent
waiting in the queue
Note that there are separate categories for SP cache (RAM-based) and FAST Cache (Flash drive
based).
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
20
These are the performance parameters that Analyzer will display for a Pool LUN. Items marked with an
asterisk (*) are only displayed when the Advanced checkbox is checked. Items marked + display the
parameter, and the parameter for the optimal as well as non-optimal paths. The optimal/non-optimal
information is only displayed if the Advanced checkbox is checked.
Some of these parameters have names that are self-explanatory. Those that do not, and those that
have functions that are not obvious, are discussed below.
• Average Busy Queue Length – counts the length of the queue, but only when the component is
already busy. This helps to indicate how bursty the I/O is
• Service Time (ms) – time taken to service a request. This does not include the time spent
waiting in the queue
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
21
These are the performance parameters that Analyzer will display for a metaLUN. Items marked with
an asterisk (*) are only displayed when the Advanced checkbox is checked. Items marked + display the
parameter, and the parameter for the optimal as well as non-optimal paths. The optimal/non-optimal
information is only displayed if the Advanced checkbox is checked.
Most of these parameters are identical to those used for standard LUNs. There are 2 parameters not
used for Traditional LUNs – LUN Read Crossings/s and LUN Write Crossings/s. These are similar in
concept to the Disk Crossings/s used for standard LUNs, except that in these cases, an I/O has
accessed more than one LUN. Bear in mind that, as previously mentioned, a metaLUN is a large RAID 0
LUN composed of other LUNs, which will often be RAID 5, RAID 6, or RAID 1/0.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
22
These are the performance parameters that Analyzer will display for an SP. Items marked with an
asterisk (*) are only displayed when the Advanced checkbox is checked.
Some of these parameters have names that are self-explanatory. Those that do not, and those that
have functions that are not obvious, are discussed below.
• SP Cache Dirty Pages (%) – the percentage of write cache pages owned by this SP that were
modified since last being read from disk or written to disk
• SP Cache Flush Ratio – the ratio of flush operations to write operations, or, put differently, the
ratio of back-end write operations to front-end write operations. Low numbers mean better
performance
• SP Cache High Water Flush On – the number of times in a sample period that dirty pages
reached the High Watermark. This is a measure of front-end activity
• SP Cache Idle Flush On – the number of times in the last sample period that idle flushing has
been used to flush data to LUNs. This is an indication that one or more LUNs are idle
• SP Cache Low Water Flush Off – the number of times that watermark processing was turned
off because the number of dirty pages reached the Low Watermark. This number should be
close to the value for High Water Flush On. If it is not, it is an indication that a high level of
front-end I/O activity is preventing the cache from being flushed
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
23
These are the performance parameters that Analyzer will display for an SP port. Items marked with an
asterisk (*) are only displayed when the Advanced checkbox is checked.
Some of these parameters have names that are self-explanatory. Those that do not, and those that
have functions that are not obvious, are discussed below.
• Queue Full Count – the number of Queue Full events that occurred for a particular front-end
port during a polling interval. A queue full response is sent to a host when the port receives
more I/O requests than it can handle.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
24
These are the performance parameters that Analyzer will display for a disk. Because a RAID Group is
simply a collection of disks, the same parameters are displayed for a RAID Group. Items marked with
an asterisk (*) are only displayed when the Advanced checkbox is checked.
Some of these parameters have names that are self-explanatory. Those that do not, and those that
have functions that are not obvious, are discussed below.
Average Seek Distance (GB) – a measure of randomness of the I/O pattern to the disk. Low numbers
mean that data is more sequential. Note that this value should be compared to the physical disk size
to be meaningful. Note also that this parameter is disk-specific, not LUN-specific, so for Storage Pools
with multiple LUNs it is not possible to accurately determine the randomness of a single LUN. For a
Pool, because of the way data is distributed, this data is meaningless at the LUN level.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
25
These are the performance parameters that Analyzer will display for a Pool. Items marked with an
asterisk (*) are only displayed when the Advanced checkbox is checked.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
26
These are the performance parameters that Analyzer will display for a SnapView Session. Items
marked with an asterisk (*) are only displayed when the Advanced checkbox is checked.
Note that Analyzer refers to the Snapshot Cache, where Unisphere and the CLI refer to the Reserved
LUN Pool.
Some of these parameters have names that are self-explanatory. Those that do not, and those that
have functions that are not obvious, are discussed below.
• Writes Larger Than Cache Chunk Size – strictly, writes to the Source LUN that resulted in 2 or
more Reserved LUN chunks being used.
• Chunks Used in Snapshot Copy Session – the number of Reserved LUN chunks used by this
Session
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
27
These are the performance parameters that Analyzer will display for a MirrorView/Asynchronous
mirror. Items marked with an asterisk (*) are only displayed when the Advanced checkbox is checked.
Some of these parameters have names that are self-explanatory. Those that do not, and those that
have functions that are not obvious, are discussed below.
• Total Bandwidth (MB/s) and Total Throughput (I/O/sec) – these refer to bandwidth and
throughput of the updates made to the secondary image from the primary. They are therefore
a measurement of MirrorView traffic across the link between the storage systems
• Average Transfer Size (KB) – average size of update I/Os sent from primary image to secondary
image
• Time Lag (min) – a measure of how far the secondary image is, in time, behind the primary
• Data Lag (MB) – a measure of how much data on the primary image is different to that on the
secondary image
• Cycle Count - the number of updates that completed during the polling interval
• Average Cycle Time (min) - the average duration of all updates that finished within the polling
interval
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
28
This summarizes information with which you should already be familiar. Note that Analyzer menu
options also appear when a host or Storage Group is right-clicked; these options may be used to filter
the LUNs being displayed.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
29
The Analyzer CLI commands allow scripted control over the Analyzer polling interval, and the starting
and stopping of logging. These Secure CLI commands require that a security file be created, or each
command will require username, password and scope.
The standalone archiveretrieve command allows scripted retrieval of log data from a storage system.
Once the data is stored on the host as a NAR file, the archive dump utility will produce a report,
formatted as per user choice, from the raw data, and write it to a file. The archivemerge command
may be used to merge 2 NAR files into a single one for viewing; the original files are unaltered.
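As a sketch of the logging controls (placeholder SP address; confirm the exact switch names in the Secure CLI reference for your VNX OE release):

    # Set the Analyzer archive (logging) interval to 120 seconds
    naviseccli -h 10.127.25.10 analyzer -set -narinterval 120

    # Start, and later stop, Analyzer logging
    naviseccli -h 10.127.25.10 analyzer -start
    naviseccli -h 10.127.25.10 analyzer -stop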
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
30
The -status command displays the Analyzer logging status (started or stopped).
The archive -list command lists all archive files, and ignores all other switches. The -path switch allows a folder to be specified; by default archives are saved in the current folder. The -delete switch allows deletion of archive files.
The archive -new command starts a new archive (if more than 10 samples have been collected), or returns the name of the newest archive file. The -statusnew switch returns the status of the newly created archive.
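For example (placeholder SP address; switch grouping may differ slightly by release):

    # Report whether Analyzer logging is running
    naviseccli -h 10.127.25.10 analyzer -status

    # List the archive (NAR) files held on the storage system
    naviseccli -h 10.127.25.10 analyzer -archive -list

    # Start a new archive from the samples collected so far
    naviseccli -h 10.127.25.10 analyzer -archive -new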
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
31
The -archiveretrieve command allows scripted retrieval of log data from a storage system. Once the
data is stored on the host as a NAR file, the -archivedump command can be used to produce a report,
formatted as per user choice, from the raw data, and write it to a file. The archivemerge command
may be used to merge 2 NAR files into a single one for viewing; the original files are unaltered.
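A sketch of the retrieve, dump, and merge workflow described above (file names, paths, and the SP address are placeholders; verify the exact switches in the Secure CLI reference for your release):

    # Retrieve a NAR archive from the storage system to the local host
    naviseccli -h 10.127.25.10 analyzer -archiveretrieve -file archive_SPA.nar -location C:\nar

    # Dump the retrieved NAR file to a CSV report for offline analysis
    naviseccli analyzer -archivedump -data C:\nar\archive_SPA.nar -out C:\nar\archive_SPA.csv

    # Merge two NAR files into one for viewing; the originals are unaltered
    naviseccli analyzer -archivemerge -data archive1.nar archive2.nar -out merged.nar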
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
32
This lesson covers VNX utilities and host utilities used to gather data.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
33
Unisphere Service Manager (USM) is a GUI-based tool used to perform services tasks on the VNX
Series systems. USM is a java-based application that runs on a Windows PC and communicates
over the Internet to Powerlink and over a local network to the VNX system.
The USM Software Upgrade tool can download File and Block software packages from Powerlink.
Also, the USM Software Download wizards guide the administrator, step by step, through the process of upgrading the VNX software, including File, Block, or Unified.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
34
The Unisphere Service Manager (USM) can be launched from the Service Tasks section under the System tab of Unisphere. If USM is not installed on the management station that you are using, Install Anywhere can automatically prompt you to install it; however, you must have a valid Powerlink account. USM is also available on Powerlink or a VNX Installation Toolbox CD. Once the installation is completed, USM will open, or you can click the Unisphere Service Manager icon on the desktop of the management station.
Note: If started from the “Launch USM” option within Unisphere, USM is automatically logged in using the same authentication and privileges as in Unisphere.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
35
In the USM Home screen, the top menu shows four options: (1) the Home tab, which is the screen you are currently on; (2) the Advisories tab, which provides any advisories for all system models; (3) the Download tab, which provides a direct link to Powerlink to download the necessary software files; and (4) the Support tab, which, as in Unisphere, makes your service and maintenance experience simple.
The login options allow you to log in to either a VNX series platform, a CLARiiON storage system, or a Celerra system running at least DART 5.6. The Reports section makes it easier to view the already-downloaded repository files. When requested, you can also use it to submit a system configuration report to EMC Support personnel. You can also use the Reports section to generate a new system configuration file for your system.
Without logging into USM, let’s see what else you can do.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
36
The screen shown here is the Downloads tab without logging in to any system. You are provided with four selections: (1) Download VNX Software Updates, (2) Download CLARiiON Software Updates, (3) Download Celerra Update Files, and (4) Download Disk Firmware Packages. Each of these selections allows you to download software or file updates for the specified system type, and the Download Disk Firmware Packages selection allows you to download selected disk firmware packages.
For these selections to work, you need Powerlink access, so you may need to authenticate on the management station.
Now that you have seen the capabilities of USM without logging in, let’s go through the login process to see what else is available.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
37
Besides providing the IP address, username and password, there are three different scopes
available when logging into USM: global, local, LDAP.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
38
Once you are successfully authenticated, you will automatically see the System view of USM. You have five options at the top of the page: the System, Hardware, Software, Diagnostics, and Support tabs. Under System you get instant access to: install or replace hardware components using the Hardware selection, install or update block/file code using the Software selection, and verify the storage system and gather diagnostic data using the Diagnostics selection.
On the right-side panel, you can verify the system’s information, log out of the system, and submit or view the system configuration through the System Reporting tool. User authentication is not needed to view an existing configuration file using the System Reporting tool. On the lower left side of the screen, you will find an Advisory icon and a Certificate icon. The Certificate icon has a counter that displays how many certificates are currently active for that specific system through USM. On the lower right side of USM, you should be able to see the username that is currently logged on to USM, which is sysadmin in the example shown here.
Overall, USM is a GUI-based tool that can be used to maintain the VNX system.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
39
USM can be used to maintain multiple platforms across the EMC portfolio. Here we see a table of all the current wizards available from within USM and the platforms that are supported. Now that you have seen USM as a whole, let’s look at some of the wizards and tools available.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
40
The first utility you will learn about is the System Reporting Tool. It allows you to view the system configuration, easily submit the configuration to EMC, and view the Repository. Once you open USM, the System Reporting Tool is located in the lower right-hand corner under a section titled Reports. Selecting View System Configuration opens the System Reporting Tool.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
41
As the System Reporting window opens, you can see that you can customize it. The System Configuration Wizard allows you two options:
Generate system configuration: You need to be authenticated in order to access this option. The wizard collects the system configuration from the VNX system and puts it all in one file, in one of two format choices.
Existing system configuration: There is no need to authenticate, because the tool uses an existing (xml or zip) file to output the system configuration file in the specified format of your choice.
You can choose to add additional content to the report by selecting the checkbox next to Configuration Analysis to have the report analyzed.
This wizard also allows you two output choices: HTML and/or Microsoft Excel. Although the Microsoft Excel option requires that Excel be installed on the management station you are using, you can open the file once the tool completes the process.
Once the tool has compiled the configuration report in the specified format, it places the file in the repository and provides a button to view it.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
42
The System Report output is an easy-to-read document that provides an overview of the VNX system’s configuration. Within the selected area are the system’s serial number, the name of the configuration, and the version of the software currently installed on the system. You also get:
- Selected general information about the system under the General tab
- Selected hardware-based information under the Hardware tab
- Detailed information about the available software and enablers installed on the system under the Software tab
- An overview of the user-created storage groups under the Storage tab
- Selected information about all the hosts, initiators, and virtual machines connected to the system under the SAN tab
Each one of the tabs has multiple sub-selections.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
43
Now, you will learn about the Hardware tab within Unisphere Service Manager. It includes the tools to increase connectivity and storage, and to replace a faulted disk.
You are provided with two options under the Hardware tab of USM. As the name suggests, the Hardware tab takes care of selected VNX hardware components. These two options are Hardware Installation and Hardware Replacement.
As of this release, Hardware Installation allows you to: (1) increase the total storage capacity by installing an additional Disk Array Enclosure in the system, and (2) add additional connectivity by installing additional I/O Modules and/or SFPs. Hardware Replacement allows you to replace a faulted disk.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
44
One of the capabilities of the VNX series platform is the ability to grow in storage capacity and connectivity as long as the maximum threshold for each has not been met. USM facilitates this through its many wizards. Two of them are presented under the Hardware Installation options:
- Install Disk Array Enclosure: USM helps determine whether or not you can expand the
capacity of the system by adding more DAEs. If approved, it also advises on bus and
enclosure location for the DAE.
- Install I/O Module and/or SFPs: USM helps determine whether or not you can install I/O
Modules and/or SFPs into the system. Although it does not advise on which ports to install
the I/O Modules and/or SFPs, it is best practice to verify with a configuration guide or the
Pocket Reference before designating a slot for a specified I/O Module.
For more information about I/O Module assignment, please check with the VNX Procedure
Generator.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
45
While Hardware Installation allows you to add more storage capacity, Hardware Replacement gives you the capability to replace a faulted disk. USM first verifies that a faulted disk actually exists before issuing any replacement command to the system. Selecting Replace Faulted Disk opens the Disk Replacement Wizard (DRU). The Disk Replacement Wizard is a self-explanatory program that provides a step-by-step procedure for identifying whether a faulted disk can be replaced or not.
For more information on how to replace a faulted disk, check the VNX Series Platform Maintenance
and Troubleshooting course in the Education Services database.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
46
Now you will learn about the Software tab within Unisphere Service Manager. It includes the wizards used to download system upgrade files and packages, perform disk firmware upgrades, and other tools to service a VNX system and/or other legacy systems.
As a service tool, USM also has the capability to update the VNX system. Under the Software tab, there are three options: System Software, Disk Firmware, and Downloads. The Disk Firmware option opens the Online Disk Firmware Wizard. If an update is available and you do not have it in the repository, the wizard allows you to perform the update via the Install Update option. Note: this option requires that the management station is connected to the internet. If an update is available and you do have it in the repository, you can select Install from Local Repository.
Whether you choose to update the disks from the repository or from online, the disks are updated online with minimal impact on an active environment.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
47
The second option under the Software tab is Download. With the Download button, and a valid
Powerlink account, USM can download packages for the VNX system. Upon selecting the Download
button, you have the option to do one of two things: (1) Download VNX Software Updates or (2)
Download Disk Firmware Packages.
The Download VNX Software Updates option lets you download all the software necessary for the VNX system to the local repository. The list of available software includes the Unisphere Client, the Unisphere Initialization Tool, and Navisphere Secure CLI.
The Download Disk Firmware Packages option lets you download disk firmware packages for later use. It does not let you update disk firmware; for that you must use the Disk Firmware option under the Software tab.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
48
From time to time, every system needs to be updated. The VNX system is no different, and USM plays a major part in the process. As the only GUI-based tool for servicing the VNX system, USM’s System Software option can update all types and models of VNX system. Besides updating the VNX system, it can also be used to install additional enablers on the system using the Install Software option.
Alternatively, there are separate command-line methods for updating the VNX for block or VNX for file system software; however, USM is the recommended tool for updating all models and types of the VNX Series family.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
49
Once you select Install Software, this window pops up. You need to verify that you are logged in to the system you intend to update. Then you need to specify what type of VNX system you are updating: VNX for file, VNX for block, or Install VNX OE (both). If you select to install both, the system updates VNX for file first, then VNX for block. If you select VNX for block, you need to check the EMC Support Matrix for compatibility first.
Although you are given the option to select either VNX for file or VNX for block, you should never update VNX for block before updating VNX for file. Doing so could have a major impact on the environment, including the Control Station losing access to the Storage Processors. The best approach is always to update VNX for file first, or to check with the VNX Procedure Generator before proceeding with the update.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
50
Now you will learn about the Diagnostic tab within Unisphere Service Manager. It includes the steps to collect diagnostic data, including SPCollects and log collection.
USM’s Diagnostic tab helps you keep an eye on the system’s status by checking the functionality of every Field Replaceable Unit (FRU) and Customer Replaceable Unit (CRU), and the logs. The Verify Storage System button also checks back-end functionality to determine whether the system is functioning normally. Once the verification is complete, most faults can be taken care of right on the screen.
The Verify Storage System button helps you determine the system’s operational status with results on the screen, but if you want to gather information about the system, including SPCollects and log collection from the Control Station, you need the Capture Diagnostic Data button. This button opens the Diagnostic Data Capture wizard, which initiates the SPCollect process on each SP and the log collection process on the Control Station. Once the data is captured, the wizard transfers it to the local repository or a location of your choice.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
51
This lesson covers VNX utilities and host utilities used to gather data.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
52
The Control Station CLI commands allow management, configuration, and monitoring of the VNX
system.
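As an illustrative sketch, here are a few standard VNX for File Control Station commands commonly used to gather configuration data; options and output vary by release, and the file system name is a placeholder:
# List the Data Movers and their states
nas_server -list
# List file systems, then show details (including dVols) for one of them
nas_fs -list
nas_fs -info <fs_name>
# Report mounted file system capacity and usage from a Data Mover
server_df server_2
# List the storage pools available to AVM
nas_pool -list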
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
53
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
54
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
55
These are the categories of available tools in the VNX space.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
56
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
57
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
58
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
59
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
60
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
61
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
62
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
63
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
64
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
65
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
66
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
67
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
68
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
69
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
70
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
71
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
72
This slide shows the workflow in a typical VNX sales and deployment cycle. Specific tools are listed
in the phases where they are used.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
73
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
74
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
75
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
76
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
77
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
78
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
79
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
80
This lesson covers VNX utilities and host utilities used to gather data.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
81
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
82
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
83
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
84
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
85
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
86
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
87
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
88
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
89
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
90
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
91
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
92
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
93
The following slide series will step through the setup of perfmon, and the display of various
performance parameters.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
94
This shows the menu structure for Perfmon. Most of the management will take place via the tool
buttons.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
95
The first group of tool buttons are displayed here. From left to right they are:
• New Counter Set – remove all counters, and start over
• Clear Display – clear current chart, and restart the display
• View Current Activity – view real-time information
• View Log Data – get performance information from a log file
• View Graph – view data in chart form (usually the most useful)
• View Histogram – view data in histogram form
• View Report – view data as a text report
The next group of buttons are:
• Add – add a counter to the current set
• Delete – remove a counter from the current set
• Highlight – highlight the line or bar for a specific counter
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
96
The final group of buttons are:
• Copy Properties – copy counter data to the Clipboard
• Paste Counter List – paste data from the Clipboard into another instance of Perfmon
• Properties – view properties of this instance of Perfmon
• Freeze Display – stop updates to the display
• Update Data – restart updates to the display
• Help – show Perfmon help
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
97
Here is a perfmon graph view, with several counters chosen for each of 2 physical disks. Note that one
of the counters, Disk Bytes/sec on PhysicalDisk 19, has been highlighted by clicking the Highlight tool
button.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
98
Select the system to gather information from (normally the local host), the objects to be monitored,
and the counters to use for those objects.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
99
The new log is displayed in the counter logs container.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
100
The esxtop utility allows monitoring of ESX Server performance. The result of the monitoring may be
sent to a file, but is often used in real-time, interactive mode, as discussed here.
Other modes are: Batch mode, which allows output to be captured to a file, and Replay mode, which
allows replaying (viewing) of previously captured performance information.
Disk parameters are likely to be of more interest here than the CPU and NIC parameters. Memory use can have an effect on disk performance: a shortage of memory causes the swap area on disk to be used.
The options that allow manipulation of the displayed fields are useful; there are often too many parameters to fit on an 80-column display, and some of the parameters displayed by default are less important than those in hidden columns.
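For reference, a minimal sketch of the non-interactive modes mentioned above; the interval, sample count, and snapshot path are illustrative placeholders:
# Batch mode: capture 60 samples at 15-second intervals to a CSV file for later analysis
esxtop -b -d 15 -n 60 > esxtop_capture.csv
# Replay mode: view performance data previously collected with vm-support
esxtop -R /path/to/vm-support-snapshot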
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
101
The CPU display page shows CPU usage for various components of the ESX Server environment, including the VMs. Pressing ‘e’ expands the display to show more detail for components of interest.
The VM IDs have been highlighted in the display; these IDs are useful when interpreting the disk
display shown next.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
102
This slide shows the unexpanded display for the storage, or disk, statistics. Note that the display
shows HBAs, and does not differentiate between internal and external HBAs. In this slide, vmhba0 is
the internal HBA, while vmhba1 and vmhba2 are FC HBAs attached to a CLARiiON.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
103
The display has been expanded by means of the c, t and l options to show more detail for controller,
target and LUN. LUN 0 (the HLU is 0) is a shared VMFS LUN, used by all 3 VMs. The VM ID shows how
each VM is using the LUN. Other LUNs have not been expanded.
This view allows a finer granularity than that used by the CLARiiON; events are allocated to the VM
that caused them. A comparison of this data and the data generated by Navisphere Analyzer will help
clarify where potential performance issues exist.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
104
While memory affects the overall performance of the ESX Server, it has very little direct effect on the
performance of the storage subsystem. A high incidence of swapping may cause a path or SP port to
be overloaded if the swap file is located on SAN storage. The 3 VMs displayed here are running disk
tests on CLARiiON LUNs; though the amount of I/O generated is reasonably high, the memory use of
the VMs is low.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
105
This display shows the network information for the ESX Server. It is broken down by NIC (vmnic0 is the
only active NIC here), by virtual NIC (vswif0) and by individual VM.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
106
The performance monitoring that can be performed from the VI Client displays a reasonable amount
of detail about all aspects of ESX Server and VM performance. The disk monitoring includes
bandwidth, throughput, bus resets and aborted commands.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
107
The Performance Chart displayed by VI Client allows display of CPU, Disk, Memory, Network and
System performance parameters. Charts may be displayed as line graphs, stacked graphs or stacked
graphs per VM as shown in following slides. A number of different performance counters may be
displayed; for the disk subsystem they include bandwidth and throughput counters.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
108
The information presented here is similar to that seen in Navisphere Analyzer. Note, though, that
some of the values shown are for a polling interval (20s default), and are not displayed as ‘per second’
values.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
109
This example shows the stacked chart option. Only one performance parameter may be displayed, though multiple objects can be chosen, as shown here.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
110
This view displays parameters as they relate to the individual VMs, rather than being system-wide.
Only one performance parameter may be displayed on a chart.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
111
This lesson covers tools used to validate the VNX environment.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
112
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
113
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
114
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
115
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
116
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
117
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
118
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
119
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
120
This example of a Release Note (RN) is for FAST Cache on VNX systems running VNX OE for Block 5.31. Note the sections in the list of topics – these are common to many RNs. Useful information includes a product description, known problems, fixed problems, and troubleshooting.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
121
An example product description from a Release Note.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
122
Questions 1 to 3
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
123
Questions 1 to 3
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
124
Questions 4 and 5
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
125
Questions 4 and 5
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
126
This module covered tools used to gather data, and tools used to validate the environment.
Copyright © 2012 EMC Corporation. All rights reserved
Module 2: Tools
127
This module focuses on analyzing collected data.
Copyright © 2012 EMC Corporation. All rights reserved
Module 3: Analysis
1
This lesson covers analyzing data collected from a customer environment.
Copyright © 2012 EMC Corporation. All rights reserved
Module 3: Analysis
2
Capacity analysis is typically simpler than performance analysis. If the VNX system is replacing
another storage system, or replacing local storage on production servers, then the storage
requirement is already known. For a new installation, the sizes of databases, email inboxes, etc can
usually be estimated fairly well. The level of protection must be chosen carefully, and this will affect
the number of disks for a given capacity. If growth is expected, Virtual Provisioning should be
considered. Expansion of LUNs is direct and simple with Pool LUNs; metaLUNs may be used if more
predictable performance is desired, but may be too restrictive since only traditional LUNs may be
used for metaLUNs.
Copyright © 2012 EMC Corporation. All rights reserved
Module 3: Analysis
3
A Unisphere performance analysis is easy if the Survey is used as the starting point; LUNs with
potential issues are flagged, and are easy to identify.
More information may be obtained by looking at SP, LUN and disk parameters in the Performance
Detail view.
Copyright © 2012 EMC Corporation. All rights reserved
Module 3: Analysis
4
The Survey view identifies LUNs with Utilization, Response Time or Queue Length which exceed the
predetermined thresholds. Note that Bandwidth and Throughput are not useful here.
Any LUNs with one or more red borders around a pane need to be investigated. Expanding the LUN
in the Performance Detail view will show the disks on which the LUN was created, as well as the SP
that owns the LUN.
Copyright © 2012 EMC Corporation. All rights reserved
Module 3: Analysis
5
We need to know whether cache is enabled; if it is not, performance will be much lower than if it were. In some cases, especially if cache has been enabled or disabled during the capture interval, the checkbox may be empty and grayed out. In that case, further investigation may be required to determine whether or not the caches are enabled. As an example, if we see write cache hits for a LUN, the cache is enabled. Note that the absence of write cache hits does not, by itself, indicate that write cache is disabled.
SP utilization consistently higher than 70% may indicate a performance problem, and will almost
certainly contra-indicate the addition of VNX replication software.
We also need to look at the dirty pages to get an idea of write cache performance. Ideally, dirty
pages should remain within the watermarks, with few excursions above the High Watermark, and
few or none to the 100% mark.
Copyright © 2012 EMC Corporation. All rights reserved
Module 3: Analysis
6
Trespassed LUNs are an indication that there may be a hardware problem of some sort. Other LUNs
on the same RAID Group should be checked to see if they are trespassed as well.
Other LUN performance attributes should also be checked – the forced flush rate, read and write
rate and I/O size, and the measure of burstiness of the I/O. Based on what we know about the
access pattern, we can set expectations for the performance of the LUN involved.
Copyright © 2012 EMC Corporation. All rights reserved
Module 3: Analysis
7
We look at the disks in much the same way as we look at the LUNs – utilization is important, and
usually much more so than LUN utilization. We also need to look at the throughput and bandwidth
of the individual disks, and compare with our rule of thumb values. Disk service time and queue
length will determine disk response time, while average seek distance may give us an idea of how
random the I/O is. Comparing disk and LUN I/O sizes will show if coalescing is taking place – a sign
that at least some I/O is sequential. In the case of RAID 1/0, we need to compare the I/O rate of
the primary and secondary disks in a pair; they will not always show identical workloads, but will
often be fairly similar.
Copyright © 2012 EMC Corporation. All rights reserved
Module 3: Analysis
8
If our predictions and reality don’t match, there are several possible reasons. Those need to be
investigated more fully.
This first slide mentions some reasons why random I/O workloads may not perform as expected.
Copyright © 2012 EMC Corporation. All rights reserved
Module 3: Analysis
9
On this slide we look at a few reasons why sequential workloads may not perform as expected. We
also need to verify whether or not the workload is balanced across all available resources.
Copyright © 2012 EMC Corporation. All rights reserved
Module 3: Analysis
10
This lesson covers analyzing data collected from a customer environment.
Copyright © 2012 EMC Corporation. All rights reserved
Module 3: Analysis
11
This slide introduces a scenario used to illustrate the analysis process.
Copyright © 2012 EMC Corporation. All rights reserved
Module 3: Analysis
12
The Performance Survey is the view that opens by default when an archive is opened. We can
check for red flags to see which LUNs are potential problems. LUNs 600 and 601 are both flagged.
Each shows Utilization over 70% in the latter part of the capture, and each shows response times
consistently over 20 ms during the latter portion of the capture. Because the LUNs look the same,
we’ll concentrate on only one of them – LUN 600.
Copyright © 2012 EMC Corporation. All rights reserved
Module 3: Analysis
13
We note that LUNs 600 and 601 are on the same RAID Group, and that both are owned by SPB. As
noted previously, the performance of the LUNs is identical.
Copyright © 2012 EMC Corporation. All rights reserved
Module 3: Analysis
14
SPB Utilization shows a change at around the 18:00 mark, but is consistently below 50% for the
entire period under observation. The watermarks are set to 49%/70%, and dirty pages fall inside
these limits at all times.
Copyright © 2012 EMC Corporation. All rights reserved
Module 3: Analysis
15
Read cache and write cache are enabled. Read cache size is probably adequate for this
environment – further investigation will show whether or not it should be increased. Write cache
size is 2048 MiB, and could be increased for this system. Page size is at the default value.
Copyright © 2012 EMC Corporation. All rights reserved
Module 3: Analysis
16
LUN Properties, when viewed from Analyzer, may not always accurately reflect the state of the LUN at any given time. In the example shown, the LUN has read and write cache enabled, but Analyzer does not reflect this. You may need to look at the read and write cache hit ratios to determine whether cache is enabled.
The LUN is not trespassed, and all prefetch settings are at their default values. No problems with
the LUNs can be identified here.
Copyright © 2012 EMC Corporation. All rights reserved
Module 3: Analysis
17
LUN bandwidth is low. This could be because the LUN is not heavily utilized, or because I/O sizes
are small. We’ll need to look at throughput, and I/O sizes, to check.
Copyright © 2012 EMC Corporation. All rights reserved
Module 3: Analysis
18
Throughput appears to be moderate for a 6-disk R6 LUN (though, of course, other LUNs are also
present on the RAID Group). We’ll check for forced flushes, and also check prefetch behavior.
Copyright © 2012 EMC Corporation. All rights reserved
Module 3: Analysis
19
If write cache is enabled for the LUN, then the lack of forced flushes indicates that caching is
working well for this LUN, and that the cache subsystem has adequate headroom.
Copyright © 2012 EMC Corporation. All rights reserved
Module 3: Analysis
20
The lack of read cache hits may indicate that reads are very random, that prefetching is not being
triggered, or that read cache is turned off.
Copyright © 2012 EMC Corporation. All rights reserved
Module 3: Analysis
21
Though this slide shows only a subset of the available read and write sizes, all were checked. Only
the 512 B size shows any activity at all. These small I/Os will not bypass write cache, and if they are
random, as seems to be the case, will not cause disk coalescing.
Copyright © 2012 EMC Corporation. All rights reserved
Module 3: Analysis
22
Average seek distance for the initial period under observation (up to 18:00 or so) is fairly low, at 5
GiB or so (disks are 300 GB FC). After that, the average seek distance changes to around 23 GiB, still
moderately low.
Disk utilization varies between 50 and 65 percent, and is fairly consistent for the entire interval.
Copyright © 2012 EMC Corporation. All rights reserved
Module 3: Analysis
23
Disk and LUN write sizes are identical until the 18:00 mark, after which disk write sizes increase
fourfold. The I/O access pattern to LUNs 600 and 601 did not change over the entire period of
observation, so some other factor is causing the increased disk write sizes. Additional LUNs use this
set of disks, and their access patterns will need to be investigated.
Copyright © 2012 EMC Corporation. All rights reserved
Module 3: Analysis
24
Response times are reasonable; service time, not shown, is consistent.
Copyright © 2012 EMC Corporation. All rights reserved
Module 3: Analysis
25
The disks are running at over 50% utilization. They will allow additional load, but not a doubling of
the workload.
Copyright © 2012 EMC Corporation. All rights reserved
Module 3: Analysis
26
Here are suggestions to improve the performance of the system.
Copyright © 2012 EMC Corporation. All rights reserved
Module 3: Analysis
27
The Survey configuration should use reasonable values; the Utilization and Queue Length values are the recommended ones. The Response Time setting is more dependent on the specific environment – 30 ms may be reasonable in environments that require low latency.
Copyright © 2012 EMC Corporation. All rights reserved
Module 3: Analysis
28
This example shows the queue length over a period of 10 polling intervals. Based on the curve
shown, what are the values for average queue length, average busy queue length, and utilization?
Average queue length = ( 6 + 0 + 0 + 4 + 4 + 4 + 0 + 0 + 0 + 0 ) / 10 = 1.8
Average busy queue length = ( 6 + 4 + 4 + 4 ) / 4 = 4.5
Utilization = ( 1 + 0 + 0 + 1 + 1 + 1 + 0 + 0 + 0 + 0 ) / 10 = 0.4 = 40%
Note what the calculated values tell you – the Queue Length is a reasonable measure of how busy
this object is. The Average Busy Queue Length is calculated only from polling intervals where the
queue value is 1 or more – empty points are ignored. The ABQL is therefore a measure of the
burstiness of the I/O. Utilization is calculated by giving any polling interval where there is at least
one I/O in the queue a value of 1, and intervals where the queue is empty a value of 0. Note that,
as in the slide, where I/O is very bursty, the utilization may be displayed as higher than the actual
‘busy’ time for the object.
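The same arithmetic can be scripted; here is a small sketch using the queue-length samples from the example above:
# Queue length sampled over 10 polling intervals
echo "6 0 0 4 4 4 0 0 0 0" | awk '{
    for (i = 1; i <= NF; i++) { sum += $i; if ($i > 0) { busy += $i; n++ } }
    printf "Average queue length      : %.1f\n", sum / NF         # 1.8
    printf "Average busy queue length : %.1f\n", busy / n         # 4.5
    printf "Utilization               : %.0f%%\n", 100 * n / NF   # 40%
}'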
Copyright © 2012 EMC Corporation. All rights reserved
Module 3: Analysis
29
This lesson covers analyzing data collected from a customer environment.
Copyright © 2012 EMC Corporation. All rights reserved
Module 3: Analysis
30
Next we’ll look at performance analysis on file-based storage.
Copyright © 2012 EMC Corporation. All rights reserved
Module 3: Analysis
31
In order to demonstrate how the server_stats tool can be used when troubleshooting a
performance issue, we will look at a case study regarding a certain business process. This business
process has been having throughput issues and the administrator would like to increase throughput
by at least 20%.
Most performance issues can be summarized by four questions:
1. What am I getting?
2. Is that what I should be getting?
3. If not, why not?
4. What, if anything, can I do about it?
Copyright © 2012 EMC Corporation. All rights reserved
Module 3: Analysis
32
The first step in any performance analysis is to find out what type of workload your environment or
application is generating. We need to find out the operation rate (IOPS), I/O size, and access
pattern (read or write).
Copyright © 2012 EMC Corporation. All rights reserved
Module 3: Analysis
33
Let‘s begin by looking at what the host application is sending to the VNX. From the cifs-std and nfs-std compound stats we can see that the host is doing only NFS transactions. The workload from the host is small-block – 8 KiB, random, and 100% write oriented, at about 20,000 write IOPS.
Some columns in the above output are omitted. To get the full output, consider running the command for CIFS and NFS in separate windows, or redirect the output to a .csv file.
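A hedged sketch of the kind of server_stats invocation behind this output; the statpath and option names below follow the standard syntax but should be verified against your VNX OE for File release:
# Sample the standard NFS and CIFS statistic groups every 30 seconds, 10 samples
server_stats server_2 -monitor nfs-std,cifs-std -interval 30 -count 10
# Redirect a longer capture to a file for later spreadsheet import
server_stats server_2 -monitor nfs-std -interval 30 -count 120 > nfs_std_capture.txt
The same pattern applies to the nfsOps-std and diskvolume-std statpaths used on the following slides.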
Copyright © 2012 EMC Corporation. All rights reserved
Module 3: Analysis
34
Next, we will use the nfsOps-std statPath to determine the response time for NFS I/Os. The
command displays a response time of 6.5 to 7 ms, which is a little bit higher than expected but not
a real bottleneck. Average response time for a drive should be between 2 and 5.6 ms depending
on the drive’s rotational speed.
Copyright © 2012 EMC Corporation. All rights reserved
Module 3: Analysis
35
With the diskvolume-std stat we get statistics on the dVols. Here we are able to verify whether the dVols are well balanced. In this scenario, the write I/O rates are a bit high and the dVols are not as well balanced as they should be. For these dVols to be well balanced, the write IOPS should be in the 800s to 900s. Some of them are in the 1000s, which is not a significant difference. There would be a problem if some of these write IOPS were double, in the 1600s. If that were the case, it would be a good indication that AVM was not used in the creation of the file system.
Copyright © 2012 EMC Corporation. All rights reserved
Module 3: Analysis
36
To find out which LUNs are being used by the file system, run nas_fs -i to determine which dVols are being used, and then run nas_disk -i to find the corresponding LUNs on the back-end.
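A minimal sketch of that mapping; the file system and dVol names are placeholders:
# Show the file system's properties, including the dVols (disk volumes) it is built on
nas_fs -info <fs_name>
# Show a dVol's properties, including the backing storage-system LUN
nas_disk -info <dvol_name>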
Copyright © 2012 EMC Corporation. All rights reserved
Module 3: Analysis
37
With the naviseccli getlun command we are able to verify the I/O size being used to send data to the LUN. To calculate the I/O size, divide the blocks-written value by the number of write requests; each block is 512 bytes. In this case, 16 blocks are being written per write request, which translates to an 8 KiB I/O size.
Copyright © 2012 EMC Corporation. All rights reserved
Module 3: Analysis
38
Shown here is the I/O size being used to send data to disk. To calculate I/O size to disk, divide the
amount of Kbytes written by the amount of write requests. For this example, the disks are seeing
about 32 KiB sized I/Os.
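Both calculations can be sketched with hypothetical counter values; the numbers below are illustrative, chosen to reproduce the 8 KiB and 32 KiB results, and real values come from the getlun and disk statistics output:
# LUN level: blocks are 512 bytes, so I/O size = (blocks written / write requests) * 512
blocks_written=1600000; write_requests=100000
echo "$blocks_written $write_requests" | awk '{ printf "LUN write size : %.0f KiB\n", ($1 / $2) * 512 / 1024 }'   # 8 KiB
# Disk level: I/O size = Kbytes written / write requests
kbytes_written=3200000; disk_write_requests=100000
echo "$kbytes_written $disk_write_requests" | awk '{ printf "Disk write size: %.0f KiB\n", $1 / $2 }'             # 32 KiB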
Copyright © 2012 EMC Corporation. All rights reserved
Module 3: Analysis
39
Using server_stats, we came up with the following results: the host application is sending data in 8 KiB I/Os, and the dVols and LUNs are both seeing 8 KiB I/Os as well. The only difference is at the disk, which is seeing 32 KiB I/Os. The reason is that the data leaving the host was logically contiguous in 32 KiB increments. Once this data reaches the storage system cache, the 8 KiB I/Os are coalesced into 32 KiB I/Os and sent to disk. One idea would be to have the host send data in 32 KiB I/Os instead of 8 KiB I/Os to improve efficiency.
Copyright © 2012 EMC Corporation. All rights reserved
Module 3: Analysis
40
Looking at the file system from the host’s perspective we find out that the application file system is
mounted with rsize and wsize of 8KiB. The rsize and wsize parameters cause the NFS client to try
to negotiate a buffer size up to the size specified. A large buffer size does improve performance,
but both the server and client have to support it. In the case where one of these does not support
the size specified, the size negotiated will be the largest that both support.
In this example, the largest I/O size that the host would be able to use is only 8KiB.
Copyright © 2012 EMC Corporation. All rights reserved
Module 3: Analysis
41
Needless to say, we can’t change the I/O size from the NAS front-end to the Block back-end (dVol or
LUN), but we can change the host I/O size. Here is the command to mount the file system with
rsize and wsize of 32KiB.
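A sketch of the remount on a Linux NFS client; the Data Mover, export, and mount-point names are placeholders, and rsize/wsize are given in bytes:
# Unmount and remount the application file system with 32 KiB read and write sizes
umount /mnt/appfs
mount -t nfs -o rsize=32768,wsize=32768 datamover:/appfs_export /mnt/appfs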
Copyright © 2012 EMC Corporation. All rights reserved
Module 3: Analysis
42
Once we changed the mount read and write size on the host, we were able to see an increase in
throughput and a decrease in IOPS, making the business process more efficient in transferring data.
Copyright © 2012 EMC Corporation. All rights reserved
Module 3: Analysis
43
HAVT (High Availability Verification Test) will report deficiencies in the HA setup for a host. Because
ESXi has no service console, the test is run from an external host – a Windows host.
SAN zoning should be checked for path redundancy. The host failover software should also be checked, and an ALUA-supported configuration used where appropriate.
Systems with file-level access allow redundancy in the LAN configuration, and this should be
verified. Aggregation protocols allow for failover and load balancing; this configuration should also
be checked.
Copyright © 2012 EMC Corporation. All rights reserved
Module 3: Analysis
44
The choice of RAID type is important for performance and availability; a compromise may have to
be made, especially when pools are used. The backend port (strictly not a bus in the SAS
environment) selection should be made to maximize availability, especially for FAST Cache drives.
Bear in mind that if a RAID Group has all disks but one in the Vault enclosure, the disk could be
marked for rebuilding when power is removed.
Ensure that hot spares of the correct speed, size and disk type are available.
Copyright © 2012 EMC Corporation. All rights reserved
Module 3: Analysis
45
Copyright © 2012 EMC Corporation. All rights reserved
Module 3: Analysis
46
Copyright © 2012 EMC Corporation. All rights reserved
Module 3: Analysis
47
Copyright © 2012 EMC Corporation. All rights reserved
Module 3: Analysis
48
Copyright © 2012 EMC Corporation. All rights reserved
Module 3: Analysis
49
This module covered analysis of data collected from the environment.
Copyright © 2012 EMC Corporation. All rights reserved
Module 3: Analysis
50
This module focuses on design best practices for VNX systems in unified environments.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
1
This lesson covers the introduction to storage design, specifically the terminology used and
environmental considerations.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
2
The storage system does not operate alone; other devices in the environment will play an
important role as well. A designer will need to know the abilities and limitations of those devices.
On occasion the goals for the environment will conflict, and a measure of compromise will be
required. An example is the potential conflict between performance and availability; in the choice
of RAID types, for example, RAID 6 would be the optimal choice for availability, whereas RAID 1/0
may be the best choice for performance in the specific environment.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
3
Units used in the design of storage environments can be confusing. In the SI system of
measurement, multipliers are decimal, and mega, for example, means 10^6, or 1,000,000. Some
measurements used in IT, though, are based on the binary equivalents, which are somewhat larger
than the decimal units. Note, for example, that at the TB/TiB level the binary unit is almost 10%
larger than the decimal unit.
Disk manufacturers specify disk sizes in decimal units, but use a sector size of 512 bytes (a binary
value) when discussing formatted sizes. The VNX Block systems, and their CLARiiON predecessors,
use 520 byte sectors, and this must be taken into account as well. Note that cache sizes, LUN sizes
and file system sizes are specified in binary units.
The binary units are a relatively new standard, established by the IEC in 2000. The standard has
been accepted by all major standards organizations including the IEEE and NIST. See the Wikipedia
article at http://en.wikipedia.org/wiki/Binary_prefix for more detail.
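A quick illustration of the size of the difference, using awk for the arithmetic:
# Decimal (SI) versus binary (IEC) units
awk 'BEGIN {
    tb  = 10^12            # 1 TB  (decimal)
    tib = 2^40             # 1 TiB (binary)
    printf "1 TiB / 1 TB = %.3f (roughly 10%% larger)\n", tib / tb
    # A disk sold as "600 GB", expressed in binary units
    printf "600 GB = %.1f GiB\n", 600 * 10^9 / 2^30
}'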
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
4
This slide shows the terminology used to specify the amount of space used by Pools. Note that some terms
have changed in the 05.32 VNX OE for Block code release.
Because Thin LUNs allow oversubscription, and because VNX Snapshots are thinly provisioned, the
consumed capacity and user capacity may not be the same for the Pool.
In the Physical Capacity pane, the system reports:
Total - Total usable space in the Pool (disk space minus RAID overhead)
Free - Total amount of available space in the Pool
Percent Full - Percentage of consumed Pool user capacity
Total Allocation (visible when VNX Snapshot enabler is installed) - Amount of space in the pool allocated for
all data
Snapshot Allocation (visible when VNX Snapshot enabler is installed) - Amount of space in the pool
allocated to VNX Snapshot LUNs
In the Virtual Capacity pane, the system reports:
Total Subscription - Total amount of LUN user capacity configured in the pool and (potentially) presented to
attached hosts
Snapshot Subscription (visible when VNX Snapshot enabler is installed) - Total potential size required by
VNX Snapshots if their primary data was completely overwritten, plus the size of any writable (attached)
VNX Snapshots
Percent Subscribed - Percentage of Pool capacity that has been assigned to LUNs. This includes primary and
snapshot capacity
Oversubscribed By - Amount of subscribed capacity that exceeds the usable capacity of the pool. If this
value is less than or equal to zero, the option is grayed out
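A small worked sketch of how these values relate, using hypothetical pool numbers (10 TiB usable, 6 TiB consumed, 14 TiB of Thin LUN capacity presented to hosts):
awk 'BEGIN {
    total = 10; consumed = 6; subscribed = 14                           # all values in TiB
    printf "Percent Full       : %.0f%%\n", 100 * consumed / total      # 60%
    printf "Percent Subscribed : %.0f%%\n", 100 * subscribed / total    # 140%
    over = subscribed - total
    if (over > 0) printf "Oversubscribed By  : %.0f TiB\n", over        # 4 TiB
}'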
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
5
This slide shows the LUN types currently available on VNX systems, and looks at some of the
terminology used to specify the amount of space used by those LUNs. Thick and Thin LUNs both
use a mapping mechanism to keep track of data, and are sometimes referred to as Mapped LUNs,
or MLUs. The driver for this LUN type may be described as the MLU driver in documentation and
White Papers.
Because Thin LUNs allow oversubscription, the consumed capacity and user capacity may not be
the same for the LUN. Thick LUNs have space allocated at creation time, and use additional space
for metadata, so their consumed capacity will always be higher than their user capacity, though
consumed capacity will not be displayed in the LUN Properties dialog. RAID Groups allocate space
to LUNs at LUN creation time, so the LUN size and the consumed capacity will always be identical
(or within 1 stripe of identical).
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
6
Storage system performance, and planning for that performance, depends on the sizes of the I/O
used in the environment. In VNX Block environments, I/O sizes up to 16 KiB in size are regarded as
small, while I/O sizes above 64 KiB are regarded as large. Various terms exist to describe I/O access
patterns; I/O can generally be classified as either random or sequential. Some I/O patterns are
single-threaded, where one operation will finish before the next begins; in other cases, multiple
operations may be active simultaneously, and this pattern is therefore described as multi-threaded,
or as exhibiting concurrent I/O. In some cases, I/O accesses are made to areas of the data surface
which are very close to each other, and this nearness is described as locality, or more correctly
spatial locality. Spatial locality refers to disk addresses, usually described in terms of logical block
addresses, or LBAs. Temporal locality refers to nearness of access on a time basis, and is important
when dealing with, for example, FAST VP.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
7
VNX Block systems have OE code which is designed to optimize the use of disk storage. These
optimizations will typically try to make disk accesses as sequential as possible, and will try to make
I/O sizes as large as possible. Other features deal with the operation of cache.
Coalescing refers to the grouping of smaller writes with contiguous LBAs into a larger write. This will be especially efficient if the writes can fill a stripe in parity-protected RAID environments; a full-stripe write, sometimes also known as an MR3 write, will then occur. The RAID write penalty associated with small-block parity-protected writes does not apply to full-stripe writes. As an example, the small block write penalty for a 4+1 R5 LUN is 4, whereas the full-stripe write penalty for the same LUN is 1.25 – even less than R1/0. Where reads are concerned, multiple contiguous accesses will cause prefetching to be triggered if it is enabled.
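A quick sketch of the arithmetic behind those figures; the RAID 1/0 value of 2 is the commonly quoted small-write penalty, shown here only for comparison:
# Random small write to parity RAID 5: read data + read parity + write data + write parity = 4 disk I/Os
# Full-stripe (MR3) write to 4+1 R5: 4 data writes from the host become 5 disk writes (4 data + 1 parity)
awk 'BEGIN {
    printf "Small-block 4+1 R5 write penalty : %d\n", 4
    printf "Full-stripe 4+1 R5 write penalty : %.2f\n", 5 / 4   # 1.25
    printf "RAID 1/0 small-write penalty     : %d\n", 2
}'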
Cache terminology includes destaging or flushing, the regular copying of data from cache to disk.
This activity is controlled by the watermarks, HWM and LWM, and by the idle flushing
configuration. Dumping of cache to the vault only occurs on failures or power down of the storage
system.
Disk crossings, reported by Unisphere Analyzer, occur when a single host I/O touches 2 or more
physical disks. This will always occur for I/O of > 64 KiB; for smaller I/Os, it may be an indication
that data is misaligned. Data alignment will be discussed later.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
8
FAST Cache and FAST VP are topics which will be covered in more detail later. The terminology is
often seen elsewhere, though, so is mentioned here.
An application may have a vast amount of data storage allocated to it, but may be actively using
only a small portion of that data space. The portion in use is the active data, or the working data
set. Having some parts of the data more active than others is the basis of skew, which both FAST
Cache and FAST VP rely on for their operation. Skew is discussed more fully in a later slide. FAST
Cache and FAST VP copy or move data, and those data movements are known as promotions,
write-backs, or relocations. FAST Cache and FAST VP take some time to gather statistics and
perform the data movement; in FAST Cache, this time is known as the warm-up time.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
9
This lesson covers storage design best practices in block-only environments. Note, though, that
many of these best practices will also apply to file-only and unified environments due to the
relationship between the file front-end and the block back-end.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
10
Some general best practices for all design activity are shown here.
Be familiar with the documentation for the systems and software. Documentation will describe
features and limitations that are important to bear in mind. Also, make sure to use the latest
supported code, and verify code interoperability across systems.
Understanding the workload is a very important part of the design process. Different applications
have vastly different access patterns and I/O sizes; look at the vendor documentation and white
papers to get more information. Remember also that with disk sizes increasing while I/O capability
remains about the same, designing for performance first and then looking at capacity is the
accepted practice. In many cases, storage system (and host, etc) default values will perform well
over a wide range of environments. Changing these values may improve performance in some
cases, but can easily lead to performance degradation. Be careful when moving away from default
values, and understand the consequences of your actions.
VNX storage systems generate informational, warning, and alert messages of various kinds when
potential issues exist. Note these messages carefully, and work speedily to resolve the issues.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
11
Hosts connected to VNX systems benefit from multipathing. Direct-attach multipathing requires at least two HBAs for attachment to the 2 SPs; four HBAs allow failover with a lessened chance of LUNs needing to trespass. SAN multipathing also requires at least two HBAs, with each HBA zoned to more than one SP port. Though a multiport HBA may be used to provide 2 HBA ports, and therefore may simplify failover, the HBA may become a single point of failure.
The advantages of multipathing are:
• Failover from port to port on the same SP, maintaining an even system load and minimizing LUN trespassing
• Port load balancing across SP ports and host HBAs
• Higher bandwidth attach from host to storage system (assuming the host has as many HBAs as paths used)
While PowerPath offers failover and load balancing across all available active paths, this comes at some cost:
• Some host CPU resources are used during both normal operations and failover
• Every active and passive path from the host requires an initiator record; VNX systems allow only a finite number of initiators
• Active paths increase time to fail over in some situations. (PowerPath tries several paths before trespassing a LUN from one SP to the other.)
Microsoft Multi-Path I/O (MPIO) as implemented by MS Windows Server versions provides a similar, but more limited, multipathing capability than PowerPath. Features found in MPIO include failover, failback, Round Robin Pathing, weighted Pathing, and I/O Queue Depth management.
Consult the Microsoft documentation for information on MPIO features and implementation.
Linux MPIO is implemented by Device Mapper (dm). Multipathing capability is similar to PowerPath, though more limited. The MPIO features found in Device Mapper are dependent on the Linux release and the revision.
Review the Native Multipath Failover Based on DM-MPIO for v2.6.x Linux Kernel and EMC Storage Arrays Configuration Guide available on Powerlink for more details.
VMware Native Multi-Pathing (NMP) allows failover similar to that of PowerPath, with very limited load balancing.
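As a quick sketch, the PowerPath CLI can be used to confirm that the expected paths are present and balanced; output fields and sub-commands vary by PowerPath release:
# Show every managed device with its paths, owning SP, and path state
powermt display dev=all
# Summarize paths per storage-system port
powermt display ports
# Persist the verified configuration
powermt save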
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
12
PowerPath versions 5.1 and later are ALUA-compliant releases.
PowerPath load balances across optimized paths, and only uses non-optimized paths if all the
optimized paths have failed. For example, if an optimized path to the original owning SP fails, I/O
will be sent across remaining optimized paths. If all optimized paths fail, for example as the result
of a storage processor failure, I/O is sent to the peer SP. If the I/O count to the non-optimized path
exceeds a preset value, PowerPath initiates a trespass to change LUN ownership. The non-optimized paths then become the optimized paths, and the optimized paths become the non-optimized paths.
Not all multipathing applications or revisions are ALUA compliant; verify that your native host-based failover application can interoperate with ALUA.
When configuring PowerPath on hosts that can use ALUA, the default storage system failover mode
is Failover Mode 4. This configures the VNX for asymmetric Active/Active operation. This has the
advantage of allowing I/O to be sent to a LUN regardless of LUN ownership. Details on the
separate failover modes 1 through 4 can be found in the EMC CLARiiON Asymmetric Active/Active
Feature — A Detailed Review white paper, available on Powerlink.
To take advantage of ALUA features, the host operating system also needs to be ALUA-compliant.
Several operating systems support native failover with Active/Passive (A/P) controllers. However,
there are exceptions. Refer to the appropriate support guide for O/S support. For example, ALUA
supported Linux operating systems would be found in the EMC® Host Connectivity Guide for Linux.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
13
High availability requires at least two HBA connections to provide redundant paths to the SAN or storage
system.
It is a best practice to have redundant HBAs. Using more than one single-port HBA enables port- and path-failure isolation, and may provide performance benefits. Using a multiport HBA provides a component cost
savings and efficient port management. Multiport HBAs are useful for hosts with few available I/O bus slots,
but represent a single point of failure for several ports. With a single-ported HBA, a failure would affect only
one port.
HBAs should also be placed on separate host buses for performance and availability. This may not be
possible on hosts that have a single bus or a limited number of bus slots. In this case, multiport HBAs are the
only option.
Always use an HBA that equals or exceeds the bandwidth of the storage network, e.g. do not use 2 Gb/s or
slower HBAs for connections to 4 Gb/s SANs.
FC SANs reduce the speed of the network path to the HBA’s speed either as far as the first connected
switch, or to the storage system’s front-end port if directly connected. This may cause a bottleneck when
the intention is to optimize bandwidth.
Finally, using the most current HBA firmware and driver from the manufacturer is always recommended.
This software may be found, and should be downloaded from, the vendor (EMC) area of the HBA
manufacturer’s website. The Unified Procedure Generator (installation available through Powerlink)
provides instructions and the configuration settings for HBAs specific to your storage system.
<Continued>
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
14
iSCSI environments may make use of NICs, TOE cards or iSCSI HBAs. The differences include cost,
host CPU utilization, and features such as security. The same server cannot use NICs and HBAs to
connect to the same VNX storage system.
NICs are the typical way of connecting a host to an Ethernet network, and are supported by
software iSCSI initiators.
Ethernet networks will auto-negotiate down to the lowest common device speed; a slower NIC
may bottleneck the storage network’s bandwidth. Do not use legacy 10/100 Mb/s NICs for iSCSI
connections to 1 Gb/s or higher Ethernet networks. A TCP Offload Engine (TOE) NIC is a faster type
of NIC. A TOE has on-board processors that offload TCP packet segmentation, checksum
calculations, and optionally IPSec from the host CPU to themselves. This allows the host CPU(s) to
be used exclusively for application processing.
Redundant NICs, iSCSI HBAs, and TOEs should be used for availability. NICs may be either single or
multiported. A host with a multiported NIC or more than one NIC is called a multihomed host.
Typically, each NIC or NIC port is configured to be on a separate subnet. Ideally, when more than
one NIC is provisioned, they should also be placed on separate host buses. Note this may not be
possible on smaller hosts having a single bus or a limited number of bus slots, or when the onboard host NIC is used.
All NICs do not have the same level of performance. This is particularly true of host motherboard
NICs, 10 Gb/s NICs, and 10 Gb/s HBAs. For the most up-to-date compatibility information, check
the E-Lab Interoperability Navigator at: http://elabnavigator.EMC.com.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
15
At least two paths between the hosts and the storage system are required for high availability.
Ideally, the cabling for these paths should be physically separated. In addition paths should be
handled by separate switching, if not directly connecting hosts and storage systems. This includes
redundant, separate HBAs, and attachment to both of the storage system’s storage processors.
Path management software such as PowerPath and dynamic multipathing software on hosts (to
enable failover to alternate paths and load balancing) are recommended.
For device fan-in, connect low-bandwidth devices such as tape, and low utilized and older, slower
hosts to edge switches or director blades.
Contact an EMC USPEED Professional (or your EMC Sales representative if a partner) for assistance
with FCoE performance.
For additional information on FCoE, see the Fibre Channel over Ethernet (FCoE) TechBook available
on Powerlink.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
16
iSCSI SANs on Ethernet do not have the same reliability and built-in protocol availability as Fibre Channel SANs; their advantages are that they handle longer transmission distances and are less expensive to set up and maintain.
If you require the highest availability, for a SAN under 500m (1640 ft.) a Fibre Channel SAN is
recommended.
Note that the number of VLANs that may be active per iSCSI port is dependent on the LAN’s
bandwidth. A 10 GigE network can support a greater number.
Ideally, separate Ethernet networks should be created to ensure redundant communications
between hosts and storage systems. The cabling for the networks should be physically as widely
separated as is practical. In addition, paths should be handled by separate switching if direct connections are not used. If you do not use a dedicated storage network, iSCSI traffic should be separated either onto separate LAN segments or onto a virtual LAN (VLAN). VLANs allow the creation of
multiple virtual LANs, as opposed to multiple physical LANs in your Ethernet infrastructure. This
allows more than one logical network to share the same physical network while maintaining
separation of the data.
Ethernet connections to the storage system should use separate subnets depending on if they are
workload or storage system management related.
<Continued>
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
17
Separate the storage processor management 10/100 Mb/s ports into separate subnets from the
iSCSI front-end network ports. It is also prudent to separate the front-end iSCSI ports of each
storage processor onto a separate subnet.
Do this by placing each port from SPA on a different subnet. Place the corresponding ports from
SPB on the same set of subnets. The 10.x.x.x or 172.16.0.0 through 172.31.255.255 private
network addresses are completely available.
For example, a typical configuration for the iSCSI ports on a storage system, with two iSCSI ports
per SP would be:
A0: 10.168.10.10 (Subnet mask 255.255.255.0; Gateway 10.168.10.1)
A1: 10.168.11.10 (Subnet mask 255.255.255.0; Gateway 10.168.11.1)
B0: 10.168.10.11 (Subnet mask 255.255.255.0; Gateway 10.168.10.1)
B1: 10.168.11.11 (Subnet mask 255.255.255.0; Gateway 10.168.11.1)
A host with two NICs should have its connections configured similar to the following in the iSCSI
initiator to allow for load balancing and failover:
NIC1 (for example, 10.168.10.180) - SP A0 and SP B0 iSCSI connections
NIC2 (for example, 10.168.11.180) - SP A1 and SP B1 iSCSI connections
Note that 128.221.0.0/16 should never be used because the management service ports are hard-configured for this subnet.
There is also a restriction on the 192.168.0.0/16 subnets, related to the configuration of the PPP ports. The only restricted addresses are 192.168.1.1 and 192.168.1.2; the rest of the 192.168.x.x address space is usable with no problems.
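To illustrate these addressing rules, the following is a minimal Python sketch that checks the example layout above; the port names, the /24 prefix, and the check itself are assumptions for illustration only, not output from any VNX tool.
import ipaddress
# Hypothetical front-end iSCSI port assignments, mirroring the example above.
ports = {
    "A0": "10.168.10.10", "A1": "10.168.11.10",
    "B0": "10.168.10.11", "B1": "10.168.11.11",
}
RESERVED = [ipaddress.ip_network("128.221.0.0/16")]          # management service ports
RESTRICTED = {ipaddress.ip_address("192.168.1.1"),
              ipaddress.ip_address("192.168.1.2")}           # PPP ports
def subnet(addr, prefix=24):
    """Return the /24 subnet an address belongs to."""
    return ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
for name, addr in ports.items():
    ip = ipaddress.ip_address(addr)
    assert not any(ip in net for net in RESERVED), f"{name}: reserved subnet"
    assert ip not in RESTRICTED, f"{name}: restricted PPP address"
# Corresponding SPA/SPB ports (A0/B0, A1/B1) share a subnet,
# while A0 and A1 sit on different subnets, as described above.
assert subnet(ports["A0"]) == subnet(ports["B0"])
assert subnet(ports["A1"]) == subnet(ports["B1"])
assert subnet(ports["A0"]) != subnet(ports["A1"])
print("iSCSI port addressing follows the layout described above")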
For more information about VLANs and VLAN tagging, please refer to the VLAN Tagging and
Routing on EMC CLARiiON white paper available on Powerlink.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
18
Availability refers to the storage system's ability to provide user access to data in the case of a hardware or
software fault. Midrange systems like the VNX-series are classified as highly available because they provide
access to data without any single point-of-failure. Performance in degraded mode is typically lower than
during normal operation. The following configuration settings can improve performance under degraded
mode scenarios.
Single DAE Provisioning is the practice of restricting the placement of a RAID group within a single enclosure.
This is sometimes called horizontal provisioning. Single DAE provisioning is the default method of
provisioning RAID groups, and, because of its convenience and High Availability attributes, is the most
commonly used method.
In Multiple DAE Provisioning, two or more enclosures are used. An example of multiple DAE provisioning
requirement is where drives are selected from one or more additional DAEs because there are not enough
drives remaining in one enclosure to fully configure a desired RAID Group. Another example is SAS back-end
port balancing. The resulting configuration may or may not span back-end ports depending on the storage
system model and the drive-to-enclosure placement.
An LCC connects the drives in a DAE to one SP’s SAS back-end port; the peer LCC connects the DAE’s drives
to the peer SP. In a single DAE LCC failure, the peer storage processor still has access to all the drives in the
DAE, and RAID group rebuilds are avoided. The storage system automatically uses its lower director
capability to re-route around the failed LCC and through the peer SP. The peer SP experiences an increase in
its bus loading while this redirection is in use. The storage system is in a degraded state until the failed LCC
is replaced. When direct connectivity is restored between the owning SP and its LUNs, data integrity is
maintained by a background verify (BV) operation.
Because request forwarding provides these data protection and availability advantages, horizontal provisioning is recommended. In addition, horizontal provisioning requires less planning and labor.
If vertical provisioning is used for compelling performance reasons, provision drives within RAID groups to
take advantage of request forwarding. This is done as follows:
RAID 5: At least two (2) drives per SAS back-end port in the same DAE.
RAID 6: At least three (3) drives per back-end port in the same DAE.
RAID 1/0: Both drives of a mirrored pair on separate backend ports.
<Continued>
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
19
FAST Cache
It is required that flash drives be provisioned as hot spares for FAST Cache drives. Hot sparing for
FAST Cache works in a similar fashion to hot sparing for traditional LUNs made up of flash drives.
However, the FAST Cache feature’s RAID 1 provisioning affects the result.
If a FAST Cache Flash drive indicates potential failure, proactive hot sparing attempts to initiate a
repair with a copy to an available flash drive hot spare before the actual failure. An outright failure
results in a repair with a RAID group rebuild.
If a flash drive hot spare is not available, then FAST Cache goes into degraded mode with the failed
drive. In degraded mode, the cache page cleaning algorithm increases the rate of cleaning and the
FAST Cache is read-only.
A double failure within a FAST Cache RAID group may cause data loss. Note that double failures are
extremely rare. Data loss will only occur if there are any dirty cache pages in the FAST cache at the
moment both drives of the mirrored pair in the RAID group fail. It is possible that flash drive data
can be recovered through a service diagnostics procedure.
The first four drives, 0 through 3, in a DPE, or in the DAE-OS of SPE-based VNX models, are the system drives. The system drives may be referred to as the Vault drives. On SPE-based storage systems, the DAE housing the system drives may be referenced as either DAE0 or DAE-OS. Only SAS drives may be provisioned as system drives on the VNX series. These drives contain files and file space needed for the following:
Saved write cache in the event of a failure
Storage system’s operating system files
Persistent Storage Manager (PSM)
Operating Environment (OE) -- Configuration database
The remaining capacity of system drives not used for system files can be used for user data. This is
done by creating RAID Groups (Pools cannot use system drives) on this unused capacity.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
20
Hot spares are important, though not mandatory, for both RAID Groups and Pools. Note that the
hot spare algorithm will first look for the smallest hot spare that can accommodate the used
capacity on the failed or failing drive – this means that, under the right circumstances, a 300 GB
drive could spare for a 600 GB drive, for example.
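As a minimal sketch of the capacity rule just described (the smallest spare whose capacity covers the used capacity of the failing drive is chosen), consider the following Python example; the capacities and the function name are illustrative assumptions, and real sparing also considers drive type.
def pick_hot_spare(used_capacity_gb, spares_gb):
    # Smallest spare that can hold the used capacity of the failing drive.
    candidates = [s for s in spares_gb if s >= used_capacity_gb]
    return min(candidates) if candidates else None
spares = [300, 600, 900]          # available hot spares, in GB
failing_drive_used = 250          # only 250 GB of a 600 GB drive is in use
print(pick_hot_spare(failing_drive_used, spares))   # -> 300: a 300 GB drive spares for a 600 GB drive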
RAID Groups or Pool tiers that use NL-SAS drives should be configured with RAID 6, especially in the
case of larger RAID Groups. Note that the NL-SAS tier in Pools now has 2 different recommended
configurations – 6+2 and 14+2. The latter configuration allows for a Private RAID Group capacity of
over 25 TiB when using 2 TB disks, and rebuild times will be lengthy.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
21
RAID-level data protection
All the LUNs bound within a Pool will suffer loss of availability, and may suffer data loss, from a complete failure of a
Pool RAID group. The larger the number of private RAID groups within the pool, the bigger the effect of a failure.
It is important to choose levels of protection for the Pool in line with the value of the Pool contents.
Three levels of data protection are available for pools.
RAID 5 has good data availability. If one drive of a private RAID group fails, no data is lost. RAID 5 is appropriate for
small to moderate sized homogeneous Pools. It may also be used in small to large Pool tiers provisioned with SAS and
Flash drives which have high availability.
RAID 6 provides the highest data availability. With RAID 6, up to two drives may fail in a private RAID group and result
in no data loss. Note that this is true double-disk failure protection. RAID 6 is appropriate for any size Pool or Pool
tier, including the largest possible, and highly recommended for NL-SAS tiers.
RAID 1/0 has high data availability. A single disk failure in a private RAID group results in no data loss. Multiple disk
failures within a RAID group may be survived. However, a primary and its mirror cannot fail together, or data will be
lost. Note, this is not double-disk failure protection. RAID 1/0 is appropriate for small to moderate sized Pools or Pool
tiers.
A user needs to determine whether the priority is: availability, performance, or capacity utilization.
If the priority is availability, RAID 6 is the recommendation.
Number of RAID groups
A fault domain refers to data availability. A Pool is made-up of one or more private RAID groups. A Pool fault domain is
a single Pool private RAID group. That is, the availability of a pool is the availability of any single private RAID group.
Unless RAID 6 is the level of protection for the entire Pool, avoid creating Pools with a very large number of RAID
groups.
Rebuild Time and other MTTR functions
A failure in a Pool-based architecture may affect a greater number of LUNs than in a Traditional LUN architecture.
Quickly restoring RAID Groups from degraded mode to normal operation becomes important for the overall operation
of the storage system.
Always have hot spares of the appropriate type available. The action of proactive hot sparing will reduce the adverse
performance effect a Rebuild would have on backend performance. In addition, always replace failed drives as quickly
as possible to maintain the number of available hot spares.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
22
Avoiding iSCSI network congestion is the primary consideration for achieving iSCSI LAN performance. It is important to
take into account network latency and the potential for port oversubscription when configuring your network.
Network congestion is usually the result of an ill-suited network configuration or improper network settings. An ill-suited configuration might be a legacy CAT5 cable in use on a GigE link. Network settings include the IP overhead and protocol configuration of the network's elements.
For example, a common problem is a switch in the data path into the storage system that is fragmenting frames.
As a minimum, the following recommendations should be reviewed to ensure the best performance.
Simple network topology
Both bandwidth and throughput rates are subject to network conditions and latency.
It is common for network contention, routing inefficiency, and errors in LAN and VLAN configuration to adversely affect iSCSI performance. It is important to profile and periodically monitor the network carrying iSCSI traffic to ensure consistently high Ethernet network performance.
In general, the simplest network topologies offer the best performance. Minimize the length of cable runs, and the
number of cables, while still maintaining physically separated redundant connections between hosts and the storage
system(s).
Avoid routing iSCSI traffic, as this will introduce latency. Ideally, the host and the iSCSI front-end port are on the same subnet and there are no gateways defined on the iSCSI ports. If they are not on the same subnet, users should define static routes. This can be done per target or subnet using naviseccli connection -route.
Network latency can contribute substantially to an iSCSI-based storage system's response time. As the distance from the host to the storage system increases, a latency of about 1 millisecond per 200 kilometers (125 miles) is introduced. This latency has a noticeable effect on WANs supporting sequential I/O workloads.
For example, a 40 MB/s 64 KB single stream would average 25 MB/s over a 200 km distance. EMC recommends
increasing the number of streams to maintain the highest bandwidth with these long-distance, sequential I/O
workloads.
Bandwidth-balanced configuration
A balanced bandwidth iSCSI configuration is when the host iSCSI initiator’s bandwidth is greater than or equal to the
bandwidth of its connected storage system’s ports. Generally, configure each host NIC or HBA port to only two storage
system ports (one per SP). One storage system port should be configured as active, and the other to standby. This
avoids oversubscribing a host’s ports.
<Continued>
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
23
Network settings
Manually override auto-negotiation on the host NIC or HBA and network switches for the following settings. These
settings improve flow control on the iSCSI network:
• Jumbo frames
• Pause frames
• TCP Delayed ACK
Jumbo frames
Using jumbo frames can improve iSCSI network bandwidth by up to 50 percent. When supported by the network, we
recommend using jumbo frames to increase bandwidth.
Jumbo frames can contain more iSCSI commands and a larger iSCSI payload than normal frames without fragmenting
or with less fragmenting depending on the payload size. On a standard Ethernet network the frame size is 1500 bytes.
Jumbo frames allow packets configurable up to 9,000 bytes in length.
The VNX series supports MTUs of 4,000, 4,080, or 4,470 bytes for its front-end iSCSI ports. It is not recommended to set your storage network's jumbo frames any larger than these.
If using jumbo frames, all switches and routers in the paths to the storage system must support jumbo frames and be configured for them. For example, if the host and the storage system's iSCSI ports can handle 4,470-byte frames, but an intervening switch can only handle 4,000 bytes, then the host and the storage system's ports should be set to 4,000 bytes.
Note that the File Data Mover has a different Jumbo frame MTU than the VNX front-end ports. The larger Data Mover
frame setting should be used.
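The rule above reduces to configuring every element for the smallest jumbo-frame MTU supported anywhere in the path. A minimal Python sketch follows; the element names and MTU values are illustrative assumptions.
# Every element in the path must be configured for the smallest jumbo-frame MTU
# that any element in that path supports.
path_mtus = {
    "host NIC": 4470,
    "access switch": 4000,   # the limiting element in this example
    "core switch": 9000,
    "VNX iSCSI port": 4470,
}
effective_mtu = min(path_mtus.values())
print(f"Configure all elements for an MTU of {effective_mtu} bytes")   # 4000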
<Continued>
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
24
Pause frames
Pause frames are an optional flow-control feature that permits the host to temporarily stop all traffic from the storage system. Pause frames are intended to enable the host's NIC or HBA, and the switch, to control the transmit rate.
Due to the characteristic flow of iSCSI traffic, pause frames should be disabled on the iSCSI network used for storage. They may delay traffic unrelated to the specific host-port-to-storage-system links.
TCP Delayed ACK
On MS Windows and ESX-based hosts, TCP Delayed ACK delays the acknowledgement of received packets.
TCP Delayed ACK should be disabled on the iSCSI network used for storage.
When enabled, an acknowledgment is delayed up to 0.5 seconds or until two packets are received. Storage
applications may time out during this delay. A host sending an acknowledgment to a storage system after the
maximum of 0.5 seconds is possible on a congested network. Because there was no communication between the host
computer and the storage system during that 0.5 seconds, the host computer issues Inquiry commands to the storage
system for all LUNs based on the delayed ACK. During periods of congestion and recovery of dropped packets, delayed
ACK can slow down the recovery considerably, resulting in further performance degradation.
Note that delayed ACK cannot be disabled on Linux hosts.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
25
Performance estimate procedure
The steps required to perform a ROM performance estimate are as follows:
Determine the workload.
Determine the I/O drive load.
Determine the number of drives required for Performance.
Determine the number of drives required for Capacity.
Analysis
The steps need to be executed in sequence; the output of the previous step is the input to the next
step.
Determining the workload
This is often one of the most difficult parts of the estimation. Many people do not know what the
existing loads are, let alone load for proposed systems. Yet it is crucial for you to make a forecast as
accurately as possible. An estimate must be made.
The estimate must include not only the total IOPS or bandwidth, but also what percentage of the
load is reads and what percentage is writes. Additionally, the predominant I/O size must be
determined.
Determine the I/O drive load
This step requires the use of drive IOPS. To determine the number of drive IOPS implied by a host
I/O load, adjust as follows for parity or mirroring operations:
Parity RAID 5: Drive IOPS = Read IOPS + 4*Write IOPS
Parity RAID 6: Drive IOPS = Read IOPS + 6*Write IOPS
<Continued>
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
26
Mirrored RAID 1/0: Drive IOPS = Read IOPS + 2*Write IOPS
As an example, the default private RAID group of a RAID 1/0 pool is a 4+4. Assume a homogeneous pool with six private RAID groups. For simplicity, a single LUN is bound to the pool. Further assume the I/O mix is 50 percent random reads and 50 percent random writes with a total host IOPS of 10,000:
IOPS = (0.5 * 10,000 + 2 * (0.5 * 10,000))
IOPS = 15,000
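The same arithmetic can be expressed as a minimal Python sketch. The write multipliers (4 for RAID 5, 6 for RAID 6, 2 for RAID 1/0) come from the formulas above; the function name and structure are illustrative, not an EMC tool.
WRITE_MULTIPLIER = {"RAID5": 4, "RAID6": 6, "RAID10": 2}
def drive_iops(host_iops, read_fraction, raid_type):
    # Back-end drive IOPS implied by a host I/O load, per the formulas above.
    reads = host_iops * read_fraction
    writes = host_iops * (1 - read_fraction)
    return reads + WRITE_MULTIPLIER[raid_type] * writes
# The RAID 1/0 example above: 10,000 host IOPS, 50% reads / 50% writes.
print(drive_iops(10_000, 0.5, "RAID10"))   # -> 15000.0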
For bandwidth calculations, when large or sequential I/O is expected to fill LUN stripes, use the following
approaches, where the write load is increased by a RAID multiplier:
Parity RAID 5: Drive MB/s = Read MB/s + Write MB/s * (1 + (1/ (number of user data drives in group)))
Parity RAID 6: Drive MB/s = Read MB/s + Write MB/s * (1 + (2/ (number of user data drives in group)))
Mirrored RAID 1/0: Drive MB/s = Read MB/s + Write MB/s * 2
For example, the default private RAID group of a RAID 5 pool is a 5-drive 4+1 (four user data drives in the group). Assume the read load is 100 MB/s, and the write load is 40 MB/s:
Drive MB/s = 100 MB/s + 40 MB/s * (1 + (1/4))
Drive MB/s = 150 MB/s
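A companion Python sketch for the bandwidth formulas follows, again only as an illustration of the arithmetic above; data_drives is the number of user data drives in the private RAID group.
def drive_mbps(read_mbps, write_mbps, raid_type, data_drives):
    # Back-end bandwidth implied by a host load, per the formulas above.
    if raid_type == "RAID5":
        return read_mbps + write_mbps * (1 + 1 / data_drives)
    if raid_type == "RAID6":
        return read_mbps + write_mbps * (1 + 2 / data_drives)
    if raid_type == "RAID10":
        return read_mbps + write_mbps * 2
    raise ValueError(raid_type)
# The RAID 5 example above: 4+1 private RAID group, 100 MB/s reads, 40 MB/s writes.
print(drive_mbps(100, 40, "RAID5", data_drives=4))   # -> 150.0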
Determine the number of drives required for Performance
Make a performance calculation to determine the number of drives needed in the storage system.
Divide the total IOPS (or bandwidth) by the per-drive IOPS value provided in Table 9 for small-block random
I/O and Table 29 for large-block random I/O.
The result is the approximate number of drives needed to service the proposed I/O load. If performing
random I/O with a predominant I/O size larger than 16 KB (up to 32 KB), but less than 64 KB, increase the
drive count by 20 percent. Random I/O with a block size greater than 64 KB must address bandwidth limits
as well. This is best done with the assistance of an EMC USPEED professional.
<Continued>
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
27
Determine the number of drives required for Capacity
Calculate the number of drives required to meet the storage capacity requirement.
Typically, the number of drives needed to meet the required capacity is fewer than the number
needed for performance.
Remember, the formatted capacity of a drive is smaller than its raw capacity. Add the capacity
required for a Virtual provisioning pool to maintain the pool’s file system. This is the pool’s
metadata overhead.
Furthermore, the system requires four system drives, and it is prudent to add one hot spare drive per 30 drives (rounded to the nearest integer) to the drive count. Do not include the system drives and hot spare drives in the performance calculation when calculating the operational performance.
Analysis
Ideally, the number of drives needed for the proposed I/O load is the same as the number of drives
needed to satisfy the storage capacity requirement. Use the larger number of drives from the
performance and storage capacity estimates for the storage environment.
Total performance drives
Total Approximate Drives = RAID Group IOPS / (Hard Drive Type IOPS) + Large Random I/O
adjustment + Hot Spares + System Drives
For example, if an application was previously calculated to execute 4,000 IOPS, the I/O is 16 KB
random requests, and the hard drives specified for the group are 15K RPM SAS drives (see
Table 9 Small block random I/O performance by drive type):
Total Approximate Drives = 4,000 / 180 + 0 + ((4,000 / 180) / 30) + 5
Total Approximate Drives = 28
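The same estimate can be reproduced with a minimal Python sketch; the per-drive IOPS value (180 for a 15K RPM SAS drive) and the system-drive count of 5 follow the worked example above, and the helper itself is an illustration, not a sizing tool.
import math
def total_drives(workload_iops, per_drive_iops, large_io_adjustment=0, system_drives=5):
    performance_drives = workload_iops / per_drive_iops
    hot_spares = performance_drives / 30            # one hot spare per 30 drives
    total = performance_drives + large_io_adjustment + hot_spares + system_drives
    return math.ceil(total)
print(total_drives(4_000, 180))   # -> 28, matching the worked example above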
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
28
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
29
VNX systems allow the creation of two types of (block) storage pool – RAID Groups and Pools.
RAID Groups (RGs) are the traditional way to group disks into sets. Rules regarding the number of
disks allowed in a RG, and the minimum/maximum number for a given RAID type are enforced by
the system. Supported RAID types are RAID 1, RAID 1/0, RAID 3, RAID 5 and RAID 6. RAID Groups
can be created with a single disk or as unprotected RAID 0 groups, though this is uncommon. Hot
Spares consist of a single-disk RAID Group with a single LUN automatically created on it. Note that
only Traditional LUNs can be created on a RG.
Pools are required for FAST VP (auto-tiering), and may have mixed disk types (Flash, SAS, and NL-SAS). The number of disks in a Pool depends on the VNX model, and is the maximum number of
disks in the storage system less 4. As an example, the VNX5700 has a maximum capacity of 500
disks, and a maximum Pool size of 496 disks. The remaining 4 disks are system drives, which cannot
be part of a Pool. At present, only RAID 5, RAID 6 and RAID 1/0 are supported in Pools, and each
tier will be one RAID type.
Pools have metadata associated with them, and that Pool metadata decreases the amount of
available space in the Pool. In the uppermost screenshot, a 4+1 RAID 5 Pool has been created with
600 GB SAS drives. Note that 5 GiB is allocated (and therefore unusable by LUNs) even though the
Pool currently has no LUNs created on it. In the lower screenshot, 8 NL-SAS drives of 2 TB have
been added as a 6+2 RAID 6 tier. Note that the 5 GiB of space is still consumed – Pool overhead.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
30
Traditional LUNs, sometimes referred to in documentation as FLUs (FLARE Logical Unit, the legacy
term), are created on RGs. They exhibit the highest level of performance of any LUN type, and are
recommended where predictable performance is required.
All LUNs in a RG will be of the same RAID type.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
31
Two different types of LUNs may be created on Pools – Thick LUNs and Thin LUNs. There are
significant differences between them in terms of both operation and performance.
When a Thick LUN is created, the entire space that will be used for the LUN is allocated; if there is
insufficient space in the Pool, the Thick LUN will not be created. The slices that make up the Thick
LUN each contain 1 GiB of contiguous Logical Block Addresses (LBAs). Because tracking happens at
a granularity of 1 GiB, the amount of metadata is relatively low, and the lookups that are required
to find the location of the slice in the Pool are fast. Because lookups are required, Thick LUN
accesses will be slower than accesses to Traditional LUNs.
Thin LUNs allocate 1 GiB slices when space is needed, but the granularity inside those slices is at
the 8 KiB block level. Any 1 GiB slice will be allocated to only 1 Thin LUN, but the 8 KiB blocks will
not necessarily be from contiguous LBAs. Oversubscription is allowed, so the total size of the Thin
LUNs in a Pool can exceed the size of the available physical data space. Monitoring is required to
ensure that out of space conditions do not occur. There is appreciably more overhead associated
with Thin LUNs than with Thick LUNs and Traditional LUNs, and performance is substantially
reduced as a result.
The Pool LUNs in the screenshots were created, with size set to MAX, on the Pools shown 2 slides
back. Note that the user space of the Pool LUN is not equal to the free space in the Pool – the
mapping of slices and blocks consumes disk space.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
32
As mentioned, metadata is associated with the use of both Thick LUNs and Thin LUNs. The
metadata is used to track the location of the data on the private LUNs used in the Pool structure.
The amount of metadata depends on the size of the LUN, and may be slightly higher
(proportionally) for smaller LUNs – those under around 250 GiB.
Thin LUNs will consume around 1 GiB more space than a Thick LUN of the same size.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
33
Thin LUNs should be positioned in Block environments where space saving and storage efficiency
outweigh performance as the main goals. Areas where storage space is traditionally over allocated,
and where the Thin LUN “allocate space on demand” functionality would be an advantage, include
user home directories and shared data space.
If FAST VP is a requirement, and Pool LUNs are being proposed for that reason, it is important to
remember that Thick LUNs achieve better performance than Thin LUNs.
Be aware that Thin LUNs are not recommended in certain environments. Among these are
Exchange 2010, and VNX file systems.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
34
Space is assigned to Thin LUNs at a granularity of 8 KiB (inside a 1 GiB slice). The implication here is
that tracking is required for each 8 KiB piece of data saved on a Thin LUN, and that tracking involves
capacity overhead in the form of metadata. In addition, since the location of any 8 KiB piece of data
cannot be predicted, each data access to a Thin LUN requires a lookup to determine the data
location. If the metadata is not currently memory-resident, a disk access will be required, and an
extended response time will result. This makes Thin LUNs appreciably slower than Traditional LUNs,
and slower than Thick LUNs. If a Pool with Thin LUNs has a Flash tier, metadata will be relocated to
Flash, and LUN performance will improve.
Because Thin LUNs make use of this additional metadata, recovery of Thin LUNs after certain types
of failure (e.g. cache dirty faults) will take appreciably longer than recovery for Thick LUNs or
Traditional LUNs. A strong recommendation, therefore, is to place mission critical applications on
Thick LUNs or Traditional LUNs.
In some environments – those with a high locality of data reference – FAST Cache may help to
reduce the performance impact of the metadata lookup.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
35
In summary, note that:
The use of Thin LUNs is contra-indicated in some environments. One of those environments is VNX
File, where Thin LUNs should not be used.
Thin LUNs should never be used where high performance is an important goal.
Pool space should be monitored carefully (Thin LUNs allow Pool oversubscription whereas Thick LUNs do not). The system issues an alert when the consumption of any pool reaches a user-selectable limit. By default, this limit is 70%, which allows ample time for the user to take any corrective action required.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
36
In the next few slides, we'll take a look at metaLUNs in detail.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
37
A metaLUN is seen by the host as a single SCSI LUN, whereas the individual LUNs that make up a Volume Group will each be seen as a SCSI LUN. Using only one SCSI LUN may be an advantage. If the host renumbers LUNs when new LUNs are added, and especially if a system restart is required (possibly after a kernel rebuild), then a metaLUN may be a better choice. If the host does not support a Volume Manager, or does not support a Volume Manager used in conjunction with PowerPath, metaLUNs may fit the bill. VNX Replication Software sees a metaLUN as a single LUN, so any replication will be simpler on a metaLUN than on a Volume Group.
As noted before, Volume Managers will automatically multithread large I/O requests to a Volume Group. metaLUNs will not, so use a Volume Manager if this feature is a requirement. Dedicated LUNs may be a better choice when LUN I/O patterns differ widely; mixing random and sequential I/O on the same physical disks is never optimal.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
38
Three metaLUN examples are shown here.
metaLUN 1 is a striped metaLUN, made up of 4 identical LUNs. Data is striped across all 4. Note the
order of the data on the diagram. The striping will take a while to complete if LUN 0 is already
populated.
metaLUN 2 is a concatenated metaLUN. Though all 4 LUNs are shown as the same size, there is no requirement that they be the same size; in this case they happen to be. Data fills LUN 0, then LUN 1, LUN 2, and finally LUN 3. Expansion by concatenation is immediate, but
may produce suboptimal results.
metaLUN 3 is a hybrid metaLUN, made up of two striped components concatenated together. LUNs 0
and 1 are striped, so are therefore the same size and RAID type. Similarly, LUNs 2 and 3 are striped,
and will be the same size and RAID type, though there is no requirement that they be the same size or
RAID type as LUNs 0 and 1. The 2 pairs of LUNs are then concatenated together, an instantaneous
process.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
39
This slide shows an example of 4 5-disk RAID 5 LUNs striped into a single metaLUN.
The Base LUN, LUN 0, is shown on top, with each 64 kB data element shown as Data 00, Data 01, etc.
The parity element for each stripe is also shown, though it doesn’t contribute to the calculation of
stripe size. There are 4 data elements per stripe, of 64 kB each, the default, for a data stripe size of
256 kB. Only 4 stripes are shown for the sake of clarity.
Because the Base LUN is a 5-disk RAID 5 LUN, the metaLUN that uses it should have the Element Size
Multiplier set to 4. This will mean that 1 MB (256 kB times the multiplier of 4) of data will be written
to each LUN before data is written to the next LUN in turn. As shown at the bottom of the slide, 16
elements, Data 00 through Data 15, are written to LUN 4095 (remember that component LUNs are
renumbered when creating a metaLUN), then the next 16 elements are written to LUN 4094, and so
on. 16 elements are 1 MB of data.
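The layout just described can be sketched in Python as a simplified model: 64 kB elements, 4 data elements per component-LUN stripe (256 kB), and an Element Size Multiplier of 4, so 16 elements (1 MB) land on each component LUN before the metaLUN moves to the next. The function and the mapping it returns are illustrative assumptions, not the array's internal algorithm.
ELEMENT_KB = 64
ELEMENTS_PER_STRIPE = 4
MULTIPLIER = 4
ELEMENTS_PER_SEGMENT = ELEMENTS_PER_STRIPE * MULTIPLIER     # 16 elements = 1 MB
COMPONENT_LUNS = [4095, 4094, 4093, 4092]                   # renumbered component LUNs
def locate(element_index):
    """Return (component LUN, element offset within that LUN) for a data element."""
    segment = element_index // ELEMENTS_PER_SEGMENT
    lun = COMPONENT_LUNS[segment % len(COMPONENT_LUNS)]
    rotation = segment // len(COMPONENT_LUNS)
    offset = rotation * ELEMENTS_PER_SEGMENT + element_index % ELEMENTS_PER_SEGMENT
    return lun, offset
print(locate(0))    # (4095, 0)  -> Data 00 goes to LUN 4095
print(locate(16))   # (4094, 0)  -> Data 16 starts LUN 4094
print(locate(64))   # (4095, 16) -> after one full rotation, back to LUN 4095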
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
40
Component LUNs should be selected carefully for use with metaLUNs; some recommendations are
shown above.
metaLUNs may be of 3 different types. The next slide shows examples.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
41
The metaLUN stripe segment size is the largest I/O that will be sent to a Component LUN. Setting the
stripe element size multiplier is a compromise between the need for large stripes (for bandwidth) and
small stripes (to distribute bursty I/O across all Component LUNs). The multiplier needs to be large
enough that if the RAID Groups on which Component LUNs are built are expanded (by adding physical
disks), I/O can still fill a stripe to allow MR3 writes.
Using metaLUNs with slower drives is not generally recommended. If it must be done, remember to
keep the RAID Groups small, bear rebuild times in mind, and avoid random I/O.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
42
Striping of metaLUNs will typically produce better performance than concatenation. Where possible,
the component LUNs should have equal-sized RAID groups of disks of the same speed. If
concatenation is required, the best practice will be to concatenate striped components, to form a
hybrid metaLUN.
If metaLUNs are made up of LUNs from the same RAID Groups, the base LUN may be selected in a
rotating manner, to further spread the load evenly across the component LUNs.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
43
A typical Flash drive has components similar to these. The core of the drive is the Flash controller
and the array of Flash RAM components. Because there are multiple paths to the Flash RAM,
multiple operations can be performed simultaneously. This is especially true of reads; writes
require additional processing because of the wear leveling or write leveling (two terms with the same meaning) feature implemented on these drives. As a result of the multiple channels, multithreaded accesses are especially efficient on Flash drives. The internal data organization of Flash drives is different from that of electromechanical disks; that organization, and the internal operation of the drives, makes them very different from traditional disks.
It is especially important to note that while mechanical disks are regarded as random-access
devices, some accesses are more expensive than others. Reads or writes involving long seeks take
longer than the same operations with short seeks. This is not true of Flash drives, which are true
random-access devices; performance is uniform over the entire data space.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
44
This example of Flash drive data organization shows that data is divided into 4 KiB pages (some
drives will have pages of a different size, e.g. 16 KiB), where a page is the smallest amount of data
that a Flash drive can read or write, compared to the 512 B for an electromechanical drive. Writing
to a page can only occur when the page is clean (contains no data); modifying an existing page
requires that the data be written elsewhere (implemented by write leveling), or that the entire 512
KiB block be erased. A block is the smallest amount of space that can be erased. As the Flash drive
fills, it will need to make space for new writes by erasing unused blocks. This block erasure is slow
(typically milliseconds), and contributes to the slowdown in performance as Flash drives become
full.
A garbage collection routine runs on the drive to collect and clean previously used pages; this may
also involve consolidation of partly full blocks. Excessive writes directly to the Flash RAM are
reduced by consolidating data into DRAM on the drive. This DRAM is made persistent by backing it
up with a battery and super-capacitors.
Note that blocks are further combined into planes, and that there may be multiple planes on a
single die, and potentially multiple dies on a single IC package; this level of organization does not
concern us here, though.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
45
Flash drives may be provisioned in enclosures together with any other type of drive. In addition, there
are no restrictions on the number of Flash drives allowed in any VNX storage system. The drives are
high-performance devices, though, and can easily saturate a backend port because of their superior
bandwidth and throughput. It is recommended that no more than 12 Flash drives be allocated to a
backend port when high bandwidth is the goal, and no more than 5 Flash drives be allocated to a
backend port when high throughput is the goal. Note also that RAID 5 gives the best overall ratio of
user data to data protection, and is recommended for use where appropriate.
Because Flash drive reads are so much faster than writes, the best performance improvement,
compared to HDDs, will be seen in environments where the ratio of reads to writes is high.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
46
Flash drives have vastly better random read performance than SAS drives due to the multiple paths
to the Flash RAM. Throughput is particularly high with I/O sizes of 32 KB or less; it decreases as the
block size becomes larger than the Flash drive page size.
With four or more threads of large-block reads, Flash drives have up to twice the bandwidth of SAS
HDDs, due to the absence of seek time.
Flash drives have superior random write performance to SAS HDDs; throughput decreases with
increasing block size.
When writes are not cached by SP RAM cache, Flash drives are somewhat slower than SAS HDDs with single-threaded writes; bandwidth improves with thread count. Note that if Flash drives have SP write cache enabled, small-block sequential performance will improve due to cache coalescing. This will potentially allow full-stripe writes to the Flash LUNs.
Avoid deploying Flash drives on small-block workloads such as data logging.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
47
Flash drives are indicated for use where HDDs do not have sufficient performance to meet the
demands of the environment, particularly where low response times are a requirement.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
48
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
49
This slide covers the use of Flash drives in RAID Groups or dedicated, homogeneous Pools. This
mode of employment ensures peak performance from the drives; note, though, that the benefit is
limited to the individual RG or Pool. If only a small number of Flash drives is available, they are
often better employed in FAST Cache configurations.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
50
No discussion of FAST Cache and FAST VP is complete without a discussion of data skew, usually
simply referred to as skew.
Skew is the percentage of total load at the percentage of total capacity where the sum of those
values is 100%. The dashed line shows a perfectly linear distribution of workload over the available
data area. It is clear that the total load would be 50% at the 50% capacity mark, so the skew would
be 50%. In the case of the bold, solid line, 90% of the workload is distributed over 10% of the
capacity, for a skew of 90%. Skew values lower than 50% are meaningless (and are the same as
100% minus that value). Data with skew values around the 50% mark is regarded as data with no
skew.
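The definition above can be illustrated with a minimal Python sketch that finds the point on a cumulative workload curve where load % plus capacity % equals 100; the sample curves and interpolation are illustrative assumptions, not the EMC tooling mentioned below.
def skew(curve):
    """curve: sorted list of (capacity_pct, cumulative_load_pct) points."""
    for (c1, l1), (c2, l2) in zip(curve, curve[1:]):
        if (c1 + l1 - 100) * (c2 + l2 - 100) <= 0:           # crossing of load + capacity = 100
            # linear interpolation between the two surrounding points
            t = (100 - c1 - l1) / ((c2 + l2) - (c1 + l1))
            return l1 + t * (l2 - l1)
    return None
no_skew   = [(0, 0), (50, 50), (100, 100)]                   # linear distribution -> skew 50
high_skew = [(0, 0), (10, 90), (100, 100)]                   # 90% of load on 10% of capacity
print(skew(no_skew), skew(high_skew))                        # -> 50.0 90.0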
The calculation of skew, at the LUN or slice level, uses tools available to EMC employees only.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
51
FAST Cache manages a map of pages located on Flash drives. Data is cached by promoting data from hard disk drives to Flash drives, which improves response time and avoids cache misses, especially for reads. Dirty pages are asynchronously written back to the hard disk drives when cleaning (demotion) takes place; this optimizes the writes. FAST Cache provides a much larger, scalable second-level cache. The available capacity for FAST Cache is divided equally between SPA and SPB.
For example, if 400 GB of FAST Cache is configured (4x 200 GB drives, or 8x 100 GB drives), the
available capacity will be around 366 GiB, and SPA and SPB will each be allocated 183 GiB. Unlike
FAST VP, FAST Cache works equally well with Pool LUNs and RAID Group LUNs. Bear in mind,
though, that enabling and disabling FAST Cache occurs at the LUN level for RAID Group LUNs, but at
the Pool level for other LUNs.
Disable FAST Cache for private LUNs, except metaLUN components. The clone private LUNs (CPL) and write intent log (WIL) LUNs already have optimizations that keep them cache-resident, and reserved LUN pool (RLP) LUNs are unlikely to benefit from FAST Cache.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
52
The benefits of FAST Cache will not be evident in all environments. This slide lists some of the
factors to be aware of when proposing FAST Cache.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
53
FAST Cache is a system-wide resource; it is easy to set up and can allocate up to 2 TB to cache (read or read/write).
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
54
FAST Cache and/or FAST VP should not be proposed if:
• Customer data has no skew. Skew means that I/Os to a LUN are not evenly distributed over
the entire data area. Some areas are very busy, while others may be accessed very
infrequently. An example of skew is when 5% of the customer data generates 95% of all
I/Os.
• The customer cannot tolerate a false positive. A false positive means that critical data was
placed on a slower tier when it was needed on the fastest tier.
• The customer has unrealistic expectations. For example, customer data has traditionally
been spread across 100s of 15K drives, and the goal is to replace this with 2 Flash drives in
FAST Cache.
In cases such as these, homogeneous pools, or the “Highest Available Tier” policy for
heterogeneous pools, should be proposed.
Finally, in cases where there is skew, do NOT undersize the Flash tier. If 5% of all data is responsible for 95% of all I/Os, have at least 1% in FAST Cache and then 5-7% as a Flash tier. Tools such as Pool Sizer and Tier Advisor can help. Where the tools are limited to use by USPEED members, contact your local USPEED member for help.
FAST VP should be given time to learn the environment. This is an important expectation to set with customers, especially when data is migrated from a high disk-count environment. There is a huge initial difference between a 100% 15K system and a 5%-20%-75% VNX with FAST VP. It can take FAST VP several days to learn which data is hot and get those slices moved to the higher tiers.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
55
Production workloads often show spatial and/or temporal locality, and thus lend themselves to the use of
FAST Cache and/or FAST VP. This is not generally true of benchmark data, and FAST Cache and FAST VP may
show little benefit in benchmark environments. This is true of both file and block environments, and makes
prediction of performance improvement, as well as demonstration of the benefit of the feature, a difficult
task. In addition, there are insufficient tools available to the general field community to aid in the design,
implementation and troubleshooting of advanced VNX features such as FAST VP, FAST Cache and
compression.
As is the case with FAST VP, the “learning” process takes time. The warm-up phase for FAST Cache can take
from several minutes to several hours, and this should be taken into account.
Workloads that perform sequential activity, especially where I/O sizes are small, are poor candidates for
FAST Cache. Be aware that some “housekeeping” activities performed by applications may match this
profile, and may cause pollution of FAST Cache. Some of this cache pollution is avoided by the sequential
data detection added to FAST Cache in the 05.32 VNX OE for Block release.
For improved data availability, allocate drives that make up a FAST Cache RAID 1 pair to different back-end
ports. This may also have a positive effect on FAST Cache performance. Note that, as a general rule, Flash
drives should be spread across multiple buses when the number of Flash drives in use demands it. More
than 5 Flash drives can saturate a back-end port, so consider drive placement carefully when using FAST
Cache, FAST VP, or even Flash-only storage pools. Using Flash drives on Bus 0 is acceptable; do not, however, place only one Flash drive of a RAID 1 pair in Enclosure 0,0.
FAST Cache does not proactively flush dirty chunks back to the disks; it only flushes when it needs capacity for future promotions. This can cause problems where a workload changes rapidly and FAST Cache cannot react quickly enough. If a customer needs to disable or resize the FAST Cache, it may take a considerable length of time to de-stage the dirty chunks before FAST Cache can be disabled.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
56
The amount of memory (RAM) per SP, and the maximum SP write cache size, differs for the
different VNX models. Advanced software features, such as FAST Cache, FAST VP, Thin LUNs, and
Data Compression, consume SP RAM and therefore reduce the amount of available SP write cache.
The amount of RAM consumed depends on the VNX model, the features implemented, and the
size of FAST Cache, and will be up to 29% for FAST Cache, and between 23% and 37% for any
combination (one or more) of the other advanced features.
It should be noted that the reduction in available SP write cache is not an indicator that performance will decrease; typically, it is more than offset by the increased performance and efficiency offered by the features that are implemented. For example, although not included in the system cache size metric, significant amounts of the memory consumed by these features are themselves used to cache data for those features.
The rate of cache flushing is a more important metric than write cache size. If a system is already 'on the edge' with its cache, the right thing to do in any case is to adjust the design to reduce that cache pressure: identify the offending LUNs, and migrate them to better RAID types or faster drives before the advanced features are added to the system. This is in accordance with our best practice recommendation of keeping saturation under 60% before attempting to add FAST Cache, as one example.
The VNX7500, configured with 48 GiB memory per SP, does not see a major impact in cache size
when FAST Cache or FAST VP are used – the larger memory was added to improve performance
with those features.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
57
The user can adjust write cache to the previous free page limit by adjusting the watermarks. Both
HWM and LWM will need to be adjusted.
As an example, if the previous write cache size was 10,000 MiB and the HWM/LWM were
configured to 80% and 60%, the write cache headroom was 2,000 MiB. If adding a software feature
reduces write cache size by 30%, the new size is 7,000 MiB. To maintain 2,000 MiB headroom, the
HWM will have to change to 5,000/7,000 MiB, which is 71%. The LWM can then be configured as
around 20% lower, in the 51% region. In the lower end of the VNX model range, where the
available write cache is a smaller quantity, the difference between the HWM and the LWM may
need to be kept smaller (around 10%) to force cache to be flushed more frequently.
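The watermark adjustment above is simple arithmetic, sketched here in Python for convenience; the helper and its defaults are illustrative, and the values follow the worked example.
def new_watermarks(old_cache_mib, old_hwm_pct, new_cache_mib, lwm_gap_pct=20):
    # Keep the same headroom (in MiB) after the write cache shrinks,
    # then set the LWM roughly 20% below the new HWM.
    headroom_mib = old_cache_mib * (1 - old_hwm_pct / 100)   # 10,000 MiB * 20% = 2,000 MiB
    new_hwm_pct = (new_cache_mib - headroom_mib) / new_cache_mib * 100
    return round(new_hwm_pct), round(new_hwm_pct) - lwm_gap_pct
print(new_watermarks(10_000, 80, 7_000))    # -> (71, 51), matching the example above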
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
58
FAST Cache and/or FAST VP are supported for all applications. Though their use will typically lead to
improved performance for the application, this may not always be the case. Some applications do
not exhibit data skew, and some will have background activity which interferes with the FAST Cache
and FAST VP statistics gathering.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
59
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
60
This lesson covers designing for File-only environments.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
61
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
62
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
63
When selecting dVols (LUNs) from Traditional RAID groups for use in a pool entry, AVM selects disk
volumes such that all LUNs must be from the same storage system and from different RAID groups.
Having LUNs come from different RAID groups increases the number of spindles in use and avoids
head contention on the disks. All LUNs must also have the same RAID configuration (for example,
4+1 R5) and the same size, thus helping to ensure that all parts of the file system have the same
performance potential as well as availability characteristics.
If more LUNs are available than needed, the RAID groups with the least utilized LUNs are used first.
Utilization is defined as the number of LUNs used by the VNX OE for File in the RAID group divided
by the number of LUNs visible to VNX OE for File in the RAID group.
After selection based on utilization, if more LUNs are available than are needed, LUNs are chosen in
a way that dVols will come from different back-end buses, then SP balancing will be used. If more
LUNs are available than are needed, LUNs are chosen such that dVols with higher IDs are selected
first in order to avoid the dVols with the lowest IDs.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
64
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
65
Continuing with the methodology described in the previous slide, AVM attempts to create pool
entries using 4 dVols. If AVM cannot create a four dVol pool entry, it will attempt to create a three
dVol pool entry, then a two, and then finally use a single dVol.
AVM will fully consume pool volumes created from four dVols before considering a pool volume
containing three disk volumes. This is another method that AVM uses to distribute the load among
equally sized pool entries instead of “stacking” file systems on the first available pool entry. EFDs
have multiple internal channels that can simultaneously service up to 16 concurrent I/Os. To reach
peak EFD performance, all of these channels must be kept busy. This requires that both VNX SPs
have access to the EFDs, and that there are multiple I/O queues in front of the EFDs. To accomplish
this, you should bind multiple LUNs within an EFD RAID Group, and balance ownership of these
LUNs across the SPs.
It is acceptable to stripe across all dVols from the same EFD RAID Group (RG), because of the
physical structure of Flash drives. AVM will stripe together all dVols of the same size from up to two
RGs to create a pool entry.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
66
File systems impose unique performance requirements on the LUNs on which they are built. In
order to optimize the performance of file systems, it is recommended that they be built on
Traditional (RAID Group) LUNs or Thick (Pool) LUNs only. The use of Thin LUNs is strongly
discouraged.
Where Pool LUNs are used, it is recommended that the entire pool be used for file system storage,
and not shared between File and Block access. To keep performance consistent, and to allow
support for slice volumes, the tiering policy should be identical on all Thick LUNs in a pool used for
File storage. Additional recommendations are that there should be 1 Thick LUN for each 4 physical
disks in the pool, and that the LUN count should be divisible by 10 in order to make striping easier,
and balance LUNs across SPs.
AVM can make use of Pools and RAID Groups. The storage provisioning activity needs to be
performed manually – creating Pool LUNs, assigning them to the Storage Group, and
running a Rescan. After these provisioning steps, AVM can be used.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
67
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
68
When volumes are created from Pool LUNs, there are a number of recommendations:
• Stripe LUNs rather than using concatenation
• Use 5 LUNs per stripe (LUN count in pool divisible by 10 as noted on previous slide)
• Use a stripe size of 256 KiB
• Choose stripe LUNs in such a way that SP ownership of the first LUN alternates in stripes (see the sketch below)
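As a minimal sketch of the last recommendation, the following Python example orders Pool LUNs so that consecutive 5-LUN stripes start on alternating SPs; the LUN names, ownership values, and the ordering helper are illustrative assumptions, not AVM's internal algorithm.
luns = [("LUN0", "SPA"), ("LUN1", "SPB"), ("LUN2", "SPA"), ("LUN3", "SPB"),
        ("LUN4", "SPA"), ("LUN5", "SPB"), ("LUN6", "SPA"), ("LUN7", "SPB"),
        ("LUN8", "SPA"), ("LUN9", "SPB")]
def build_stripes(luns, width=5):
    spa = [l for l in luns if l[1] == "SPA"]
    spb = [l for l in luns if l[1] == "SPB"]
    stripes, start_on_a = [], True
    while spa or spb:
        order = (spa, spb) if start_on_a else (spb, spa)   # first LUN's SP alternates per stripe
        stripe = []
        while len(stripe) < width and (spa or spb):
            pool = order[len(stripe) % 2] or order[(len(stripe) + 1) % 2]
            stripe.append(pool.pop(0))
        stripes.append(stripe)
        start_on_a = not start_on_a
    return stripes
for stripe in build_stripes(luns):
    print([name for name, _ in stripe])   # first stripe starts on an SPA LUN, the next on an SPB LUN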
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
69
At present there are performance issues when file systems are created on Pool LUNs. The general
recommendation is to use Traditional LUNs for file data unless FAST VP is a requirement. A
Unisphere wizard makes the creation of a file storage pool simpler if Traditional LUNs are used.
If Pool LUNs are used, use Thick LUNs rather than Thin LUNs, and configure those Thick LUNs to the
same tiering policy.
Thin LUNs are not recommended for VNX File data. If thin provisioning is a requirement for File
storage, use the auto extension feature on file systems built on top of Traditional or Thick LUNs.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
70
The guideline, when using advanced features such as compression, is to implement them at the layer that
has the most knowledge of the data structure. The user or administrator will have to choose whether to use
file-based or block-based compression based on knowledge of the data, and whether it is repetitive or likely
compressible in nature.
If the data is shared with end users via an NFS export or CIFS share, then file-based compression should be
used. Note that the name “File Compression and Deduplication” is the VNX name for the product referred to
on the previous platform as “Celerra Deduplication”.
If the data is assigned to hosts through Fibre Channel, iSCSI, or FCoE connections, then block level
compression should be used. Remember that block level compression converts Traditional LUNs and Thick
LUNs to Thin LUNs via an internal migration, as part of the compression process. Since Thin LUNs are not
recommended for file system use, block level compression should be avoided when using file systems. The
same will apply to environments using applications where Thin LUNs are not recommended.
Certain types of block level data lend themselves to compression:
• Sharepoint BLOB (Binary Large OBject) externalization to block storage
• Atmos/VE, where Atmos is the archival storage target, and its storage is provisioned on compressed
LUNs
• VMware VM template repositories, which are read-mostly structures
• Archive files
In block environments with compression, small random reads achieve the best performance, while small,
random writes are most expensive. The latter causes data to be decompressed and then overwritten, and
the process will add to the response time of the write. The data access pattern expected from archive-type
environments involves small, random reads.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
71
As noted before, LUNs that have the compression feature turned on are migrated to Thin LUNs,
with the associated performance impact noted previously. Decompression of a compressed LUN is
possible; note, though, that a Traditional LUN which is compressed and then decompressed (by
disabling compression) does not return to being a Traditional LUN; instead, it is a fully-provisioned
Thin LUN. Returning to a Traditional LUN will require additional steps, e.g. a LUN migration.
Areas of a compressed LUN that are accessed must be decompressed for reads and/or writes to
take place. This adds additional performance overhead. Note that this is a decompression, in
memory, of only the accessed portion of the data, not a full decompression of the entire LUN.
Improvements in the way compression is performed on VNX systems have led to performance that exceeds that found on the previous-generation CX systems. This performance improvement is achieved in both throughput-oriented (typically small-block random) and bandwidth-oriented (typically large-block sequential) environments. Thick LUNs will still perform better than compressed LUNs.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
72
In summary:
Implement compression at the level that the data is used – file or block.
Be aware of the data type and the data access pattern before implementing compression.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
73
The restrictions applicable to AVM are:
Create a file system by using only one storage pool. If you need to extend a file system, extend it by
using either the same storage pool or by using another compatible storage pool. Do not extend a
file system across storage systems unless it is absolutely necessary.
File systems might reside on multiple disk volumes. Ensure that all disk volumes used by a file
system reside on the same storage system for file system creation and extension. This is to protect
against storage system and data unavailability.
LUNs that have been added to the file-based storage group are discovered during the normal
storage discovery (diskmark) and mapped to their corresponding storage pools on the VNX for file.
If a pool is encountered with the same name as an existing user-defined pool or system-defined
pool from the same VNX for block system, diskmark will fail. It is possible to have duplicate pool
names on different VNX for block systems, but not on the same VNX for block system.
Names of pools mapped from a VNX for block system to a VNX for file cannot be modified.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
74
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
75
Automatic file system extension does not work on MGFS, which is the EMC file system type used
while performing data migration from either CIFS or NFS to the VNX system by using VNX File
System Migration (also known as CDMS).
Automatic extension is not supported on file systems created with manual volume management.
You can enable automatic file system extension on the file system only if it is created or extended
by using an AVM storage pool.
Automatic extension is not supported on file systems used with TimeFinder/FS NearCopy or
FarCopy.
While automatic file system extension is running, the Control Station blocks all other commands
that apply to this file system. When the extension is complete, the Control Station allows the
commands to run.
The Control Station must be running and operating properly for automatic file system extension, or
any other VNX feature, to work correctly.
Automatic extension cannot be used for any file system that is part of a remote data facility (RDF)
configuration. Do not use the nas_fs command with the -auto_extend option for file systems
associated with RDF configurations. Doing so generates the error message: Error 4121: operation
not supported for file systems of type EMC SRDF®.
The options associated with automatic extension can be modified only on file systems mounted
with read/write permission. If the file system is mounted read-only, you must remount the file
system as read/write before modifying the automatic file system extension, HWM, or maximum
size options.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
76
Enabling automatic file system extension and thin provisioning does not automatically reserve the
space from the storage pool for that file system. Administrators must ensure that adequate storage
space exists, so that the automatic extension operation can succeed. When there is not enough
storage space available to extend the file system to the requested size, the file system extends to
use all the available storage. For example, if automatic extension requires 6 GB but only 3 GB are
available, the file system automatically extends to 3 GB. Although the file system was partially
extended, an error message appears to indicate that there was not enough storage space available
to perform automatic extension. When there is no available storage, automatic extension fails. You
must manually extend the file system to recover from this issue.
Automatic file system extension is supported with EMC VNX Replicator. Enable automatic extension
only on the source file system in a replication scenario. The destination file system synchronizes
with the source file system and extends automatically. Do not enable automatic extension on the
destination file system.
When using automatic extension and thin provisioning, you can create replicated copies of
extendible file systems, but to do so, use slice volumes (slice=y).
iSCSI virtually provisioned LUNs are supported on file systems with automatic extension enabled.
Automatic extension is not supported on the root file system of a Data Mover or on the root file
system of a Virtual Data Mover (VDM).
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
77
With thin provisioning enabled, the NFS, CIFS, and FTP clients see the actual size of the VNX
Replicator destination file system while they see the virtually provisioned maximum size of the
source file system.
Thin provisioning is supported on the primary file system, but not supported with primary file
system checkpoints. NFS, CIFS, and FTP clients cannot see the virtually provisioned maximum size
of any EMC SnapSure™ checkpoint file system.
If a file system is created by using a virtual storage pool, the -thin option of the nas_fs command
cannot be enabled. VNX for file thin provisioning and VNX for block thin provisioning cannot coexist
on a file system.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
78
Use RAID group-based LUNs instead of pool-based LUNs to create system control LUNs. Pool-based
LUNs can be created as thin LUNs or converted to thin LUNs at any time. A thin control LUN could
run out of space and lead to a Data Mover panic.
VNX for block mapped pools support only RAID 5, RAID 6, and RAID 1/0 (a quick drive-count check is
sketched after this list):
• RAID 5 is the default RAID type, with a minimum of three drives (2+1). Use multiples of five or nine drives.
• RAID 6 has a minimum of four drives (2+2). Use multiples of eight or sixteen drives.
• RAID 1/0 has a minimum of two drives (1+1). Eight drives per group are recommended.
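As a quick sanity check of the drive counts above, the following Python sketch validates a proposed count against the recommendations; it is a hypothetical helper, not part of Unisphere or any EMC tool.

def check_pool_drive_count(raid_type, drives):
    """Return True if the drive count matches the recommendations listed above."""
    if raid_type == 'RAID5':
        return drives >= 3 and (drives % 5 == 0 or drives % 9 == 0)
    if raid_type == 'RAID6':
        return drives >= 4 and (drives % 8 == 0 or drives % 16 == 0)
    if raid_type == 'RAID10':
        # Minimum of two drives (1+1); eight drives per group are recommended.
        return drives >= 2 and drives % 2 == 0
    raise ValueError('mapped pools support only RAID 5, RAID 6, and RAID 1/0')

print(check_pool_drive_count('RAID5', 10))   # True  (a multiple of five)
print(check_pool_drive_count('RAID6', 12))   # False (not a multiple of eight)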
EMC Unisphere™ is required to provision virtual devices (thin and thick LUNs) on the VNX for block
system. Any platforms that do not provide Unisphere access cannot use this feature.
You cannot mix mirrored and non-mirrored LUNs in the same VNX for block system pool. You must
separate mirrored and non-mirrored LUNs into different storage pools on VNX for block systems. If
diskmark discovers both mirrored and non-mirrored LUNs, diskmark will fail.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
79
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
80
This lesson covers designing for mixed Block and File environments.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
81
Designing for a mixed environment is little different from designing for a file-only environment.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
82
This lesson covers the design of environments for specific applications.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
83
Exchange 2010 was designed to use large, slow drives, and minimizes access to the physical disks.
As a result, FAST Cache is only useful if very high levels of performance are required. Jetstress, used
in testing during Exchange deployment, has poor data locality, so FAST Cache is not likely to provide
any deterministic performance improvement with Exchange 2010.
BDM (Background Database Maintenance), a regular part of Exchange implementations, pollutes FAST VP
statistics collection and ranking. Homogeneous Pools, or Traditional LUNs, will not exhibit this effect, and
are recommended. The use of “Highest Available Tier” data placement for Exchange data may reduce the
effect of BDM on FAST VP LUNs.
The use of Thin Pool LUNs should be avoided with Exchange 2010. If Thick Pool LUNs are used, be aware
that Jetstress allocates data to LUNs unevenly, causing initial performance on some LUNs to be poorer than
on others. This could cause Jetstress to report a failure. To work around this, engineering has developed a
utility called “SOAPTool” which forces an even distribution of the data. If using Thick Pool LUNs, use the
SOAPTool utility to ensure optimal performance. Alternatively, Traditional (RAID Group) LUNs may be used
instead.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
84
Test configuration:
• 20 SAS spindles (16 for DB, 4 for Log, in a RAID 1/0 configuration)
• 4 x 100 GB Flash drives in FAST Cache (50% of the working set)
Maximum supported TPS and average response time (Ravg) figures; the next sample breached the gating
metric of a 2-second average response time:
• SAS only – 1448 TPS, 0.8 s Ravg
• SAS + FAST Cache – 5778 TPS, 1.95 s Ravg
A VNX5700 was used in this testing, and was configured as shown in the slide. The chart shows SQL
Server performance when running on SAS drives only, and compares this to an environment where
FAST Cache has been added. Performance levels off as the system saturation point is reached
(vertical blue lines in the chart), but at a point which is considerably higher for the FAST Cache-enabled tests.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
85
FAST Cache requires a warm-up period before it will show optimal performance improvement.
Once the warm-up period has elapsed, the benefit is persistent across server or SP reboots. The SP
write cache will need some time to reach optimal efficiency (in terms of rehits, etc) after a reboot.
In OLTP environments, Pool LUNs show a reduction in performance of around 14% when compared
to Traditional LUNs. FAST Cache will improve performance significantly in this environment.
Note that if best performance is required from FAST VP, initial data placement should be set to
“Highest Available Tier”.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
86
In virtual desktop environments, FAST Cache has demonstrated a significant performance increase
over the use of magnetic disk drives alone. This means that slower, less costly disks can be used in
this environment. FAST Cache will service much of the I/O load in the boot phase, and will absorb
much of the post-boot load. This type of environment, with high data locality, lends itself to the use
of FAST Cache.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
87
This slide shows the throughput of the LUNs (on Flash drives) used to hold the operating system
image. LUN names are EFD_Replica LUN 1 and EFD_Replica LUN 2. Note the high level of activity
during the boot phase.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
88
With FAST Cache implemented, I/O activity caused by booting and steady-state user load is
serviced largely from FAST Cache.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
89
Here is a look at a different virtual desktop environment. Note that the I/O pattern is very similar to
that seen in previous slides, with a very high level of disk activity at boot time, and much less at
steady state.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
90
In environments such as this one, there is likely to be very high locality of data reference due to
common user boot configurations. That is apparent here – the first boot storm loads the boot
replica into FAST Cache very quickly, thereby minimizing the FAST Cache warm-up time.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
91
SharePoint data is typically stored in a SQL database. For structured data, this is efficient, and
makes good use of the SQL query mechanism. For other, unstructured data, such as files, this is not
an optimal use of a SQL Server database.
SharePoint therefore allows BLOB (Binary Large OBject) data to be stored external to the SQL
Server database. An External BLOB Store (EBS) Provider must be installed on each application Web
server in the farm; this Provider allows the use of external file-based storage for up to 80% of the
data in typical environments.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
92
For the most frequent operation – browsing – externalized BLOB storage is faster than regular
database storage. Note also that there is only a slight difference between the performance of SAS
and NL-SAS drives in this environment.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
93
This slide compares a traditional (SQL Server) and an externalized implementation of BLOB storage.
Note the saving in disk count, as well as the opportunity to use fewer disks for the same workload.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
94
Historically, DSS workloads have not been a sweet spot for the CX4, except possibly for the CX4-960. The VNX changes this position.
This solution does not leverage Flash drives or the FAST Suite, as large sequential workloads do not lend
themselves to this technology. The huge improvements in total throughput (particularly in the lower-end
platforms) can drive up to 4.5x the bandwidth on an apples-to-apples basis. The CX4-120 can achieve
around 750 MB/s, while the VNX5300 can achieve around 3,500 MB/s. The cost of the comparable
configuration (a Block-only VNX5300) is 84% higher than the CX4; however, achieving that throughput with
CX4 would require a CX4-960 platform, which would be considerably more expensive than the VNX5300.
The slide shows the VNX5300 being able to scale performance 4.5x compared to the upper limit of
its predecessor, the CX4-120.
VNX systems are ideal for Oracle DSS environments.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
95
This solution leverages Flash drives and the FAST Suite – the workload lends itself to this
technology.
The slide shows the typical improvement that can be achieved – 5x the performance from a
platform that is marginally more expensive (as configured) than the previous generation system it is
compared with.
As was the case with Oracle DSS, VNX systems are ideal for Oracle OLTP environments.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
96
This slide summarizes cases where Pools, FAST Cache and FAST VP may be beneficial.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
97
This lesson covers storage design for virtualized environments.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
98
Virtual environments behave in much the same way as physical environments, especially at the VM
level. Guidelines for virtual environments are mentioned.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices
99
Questions 1 to 3.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices 100
Questions 1 to 3.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices 101
Questions 4 and 5.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices 102
Questions 4 and 5.
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices 103
This module covered best practices for VNX system designs, and best practices for specific host and
application environments.
See the following documents for more information:
H10938 EMC VNX Unified Best Practices for Performance – Applied Best Practices Guide (Aug
2012)
Copyright © 2012 EMC Corporation. All rights reserved
Module 4: Storage Design Best Practices 104
This module focuses on BC design best practices.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
1
This lesson covers local replication.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
2
VNX SnapView snapshots make efficient use of space – the technology is pointer-based, and only
chunks that have changed occupy space in the Reserved LUN Pool. Those chunks are copied by the
Copy on First Write (COFW) process, which can have a significant performance impact on the
source LUN.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
3
The existence of a Snapshot has a direct effect on the read performance of the Source LUN. COFW
activity reads Source LUN data in 64 KiB chunks; this COFW may be caused by a host write to the
Source LUN, or by a secondary host write to the Snapshot. Reads directed at the Snapshot are
satisfied by the Source LUN if they have not yet been copied to the RLP as a result of COFW activity.
Snapshot activity does not cause writes to the Source LUN. Host writes to the Source LUN are, however,
indirectly affected by Snapshot activity: the COFW adds significant latency to a host write, especially when
the write cache is saturated. The next slide discusses this added latency.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
4
Snapshot activity, as expected, has a direct effect on the RLP LUN(s). Snapshot reads hit the RLP if
data has previously been copied there as a result of a COFW. A write by a secondary host to a virgin
Snapshot chunk causes a 64 KiB write to the RLP (as part of the COFW). Subsequent secondary host
writes to that chunk are the size of the host write.
Normal COFW activity causes 4 writes to the RLP – three of them are related to the map, and may
be 8 KiB or 64 KiB in size, depending on the operation being performed, and the remaining write is
the 64 KiB data chunk.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
5
Key factors that influence Snap performance are listed on the slide.
• Application I/O reports can be used to profile the application I/O activity.
- Very active Source LUNs cause more COFW activity, especially if the application I/O
profile uses small-block, write-intensive, random I/O. In this case, bandwidth to the RLP
LUNs far exceeds bandwidth to the Source LUN. The larger the change rate, the less
efficient Snapshot operation becomes. Remember the rule of thumb is that Snapshots
work better on Source LUNs that experience less than 30 percent change rates over the
life of the Session.
• The number of concurrent Sessions may have a dramatic effect on the COFW activity and
the amount of RLP space used.
• The duration of the Sessions determines how long COFW operations continue, and how
much data is stored in the RLP. If Sessions run indefinitely, eventually the RLP holds all of the
original data from the Source LUN, and uses slightly more space than the Source LUN.
• The Snapshot I/O profile is an important factor to consider. Writes to the Snapshot can
affect the Source LUN, due to COFW reads, as well as add extra I/O load to the RLP.
• The RLP behaves like any ordinary FLARE LUN, and must be configured carefully – size and
performance are both important. RLP LUNs should be spread across multiple RAID Groups
for best performance. The RAID type used should be chosen carefully, as well as the disk
type. ATA drives do not perform well when used for RLP LUNs.
• Each VNX model has different performance characteristics and limits.
• If host data was aligned with the native LUN Offset, then SnapView operations cause disk
crossings, and performance is degraded even further.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
6
Very small I/Os that cause a COFW appear to be affected more than larger I/Os. If a 512 B host write causes
a COFW, the ratio of host data : RLP data is between 1 : 160 (1 x 64 kB write, and 2 x 8 kB writes) and 1 : 384
(3 x 64 kB writes). If a 64 kB host I/O causes a COFW, then the ratio is 1 : 3 at worst, and it appears as though
the performance impact is less.
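The ratios quoted above can be reproduced with a few lines of arithmetic. This is a minimal Python sketch: the caller supplies the individual RLP writes (in KiB) generated by one COFW, as described on the slide, and the function returns the RLP-to-host byte ratio.

KiB = 1024

def host_to_rlp_ratio(host_write_bytes, rlp_writes_kib):
    """RLP bytes written per host byte written, for a single COFW.

    rlp_writes_kib: sizes (in KiB) of the writes made to the RLP for this
    COFW, e.g. [64, 8, 8] in the usual case or [64, 64, 64] at worst.
    """
    rlp_bytes = sum(w * KiB for w in rlp_writes_kib)
    return rlp_bytes / host_write_bytes

print(host_to_rlp_ratio(512, [64, 8, 8]))         # 160.0 -> 1 : 160
print(host_to_rlp_ratio(512, [64, 64, 64]))       # 384.0 -> 1 : 384
print(host_to_rlp_ratio(64 * KiB, [64, 64, 64]))  # 3.0   -> 1 : 3 at worst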
Random data is more troublesome with Snapshots than is sequential data. Sequential writes, no matter how
small, eventually fill a chunk, and the host data : RLP data ratio is close to optimal. In addition, random data
is more likely to trigger a COFW, and the performance impact is more severe.
The total number of I/Os, especially writes, is significant. If a LUN is lightly loaded, the extra I/Os caused by
COFW activity may not be noticed. As the I/O load increases, whether caused by host reads or writes, the
VNX becomes busier, and COFW activity makes noticeable changes to response time. If the host application
is very write-intensive, then the COFW load is particularly severe.
The number of writes, if used alone, only gives us part of the picture. We need to know, or calculate as best
we can, the number of writes that are made to the same chunk. The write cache rehit ratio gives a rough
idea; VNX has no native tools or commands to measure write activity onto specific disk areas. The ktrace
utility can help here; it is, however, an EMC proprietary tool. Note that customer estimates are likely to be
too low, often by an order of magnitude or more. A customer may assume that if a LUN contains 1,000,000
blocks, and 1,000 of them are changed, that this represents a change of 0.1%. This is correct in the absolute
sense; for design purposes, however, we need to determine where those writes took place, and how many
chunks were touched. If each changed block was on a unique chunk, then 12.8% of the data would have
been changed as seen by a Snapshot – 128 times more than the customer estimate.
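The 0.1% versus 12.8% example can be reproduced directly. This is a minimal sketch, assuming 512-byte blocks and 64 KiB SnapView chunks as in the text, and the worst case where every changed block lands on a different chunk.

BLOCK = 512            # bytes per host block
CHUNK = 64 * 1024      # bytes per SnapView chunk

def chunk_change_rate(total_blocks, changed_blocks):
    """Worst-case fraction of the Source LUN copied by COFW activity."""
    blocks_per_chunk = CHUNK // BLOCK                  # 128
    total_chunks = total_blocks / blocks_per_chunk
    touched_chunks = min(changed_blocks, total_chunks)
    return touched_chunks / total_chunks

rate = chunk_change_rate(1000000, 1000)
print(round(rate * 100, 1))   # 12.8 (%), against the customer's 0.1% block-level estimate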
<Continued>
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
7
If we have no reliable data, then our planning procedure must assume that each host write touches a new
chunk.
A factor which is often overlooked when sizing the RLP is the I/O profile of the Snapshots.
Snapshots are often used for backups or similar activities; the number of writes made to the Snapshot in
those cases is low. If the secondary host performs write-intensive activity on the Snapshot, then we have to
be aware that primary host writes and secondary host writes cause COFW activity. This is particularly severe
if the secondary host is allowed to start writing soon after the Session has been started – the impact of the
primary host’s COFW is still near its peak.
Other factors which are relevant to the I/O profile of the Source LUN are also relevant here, and mentioned
in the slide.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
8
If the Sessions on a Source LUN are started at the same time, or close to the same time, then most of
the chunks that are copied to the RLP are shared among the Snapshots, and only one copy of any
chunk is in the RLP. If Sessions are started with long intervals between them, it is likely that some,
or much, of the disk data has changed between Session starts. Fewer chunks are shared; not only does
the RLP take up more disk space, but the number and impact of COFWs are greater.
A single Session writes data to the RLP (to the chunk storage area, to be more accurate) in a
sequential manner, even though the data may come from very different areas of the Source LUN.
As the number of concurrent Sessions increases, writes to the RLP still remain sequential. Once we
start to terminate and restart Sessions, space freed up (in random locations) by previous Sessions is
used, and the access pattern becomes more random.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
9
As noted previously, the load on the RLP LUNs can be very high – the number of IOPs can be larger
than the writes/s on the Source LUN, and the bandwidth can be very much higher than that of the
Source LUN. The performance of the RLs is important because of the impact on host applications.
As a result, the disks should be SAS disks, and we should limit the number of RLs per RAID Group.
Treating RLs like regular LUNs, and applying performance best practices to them, improves the
chance of designing a successful Snapshot implementation.
Sizing the RLP LUNs for capacity is a compromise between efficient use of disk space, and efficient
use of LUN numbers. In many cases, making the RLs 10% of the average size of the Source LUNs,
and allocating 2 RLs per Source LUN, works well.
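That rule of thumb is easy to turn into numbers. The sketch below is a minimal Python illustration, assuming LUN sizes are given in GiB; the function name and defaults are hypothetical.

def rlp_rule_of_thumb(source_lun_sizes_gib, rls_per_source=2, fraction=0.10):
    """Suggest a Reserved LUN size and count from the rule of thumb above."""
    average_size = sum(source_lun_sizes_gib) / len(source_lun_sizes_gib)
    rl_size = average_size * fraction              # 10% of the average Source LUN size
    rl_count = rls_per_source * len(source_lun_sizes_gib)
    return rl_size, rl_count

size_gib, count = rlp_rule_of_thumb([200, 400, 600])   # three Source LUNs
print(size_gib, count)   # 40.0 GiB per Reserved LUN, 6 Reserved LUNs in total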
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
10
Performance is a consideration when you position SnapView. It’s important that you and the
customer discuss the implications, in order to ensure a successful configuration.
SnapView Snapshots typically underperform SnapView Clones in like situations, although at low I/O
loads, the impact may not be significant.
SnapView Snapshots affect the read performance of the source device; Clones do not, unless they
are synchronizing. Write performance to the Source LUN is affected by an active Session, and by a
non-fractured Clone. Reads from a Snapshot are slower than reads from Source LUNs or fractured
Clones.
Avoid environments with small-block random writes whenever possible; that scenario has the most
dramatic impact on performance.
Snapshots can increase the Source LUN response time significantly. Additional snapshots, however,
have a minor impact on response time. The impact continues as long as the SnapView session is
active. However, the impact decreases over time, as more chunks are copied to the RLP.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
11
Problems found during modeling must be addressed before the solution can be implemented.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
12
The correct configuration and placement of RLP LUNs is the most important factor when planning
for Snapshots.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
13
Here, we briefly take a look at 3 examples of sizing the RLP for performance; a small calculation sketch follows the examples.
Note that, in a purely random environment, each host write (at least initially) causes at least one
COFW, no matter the size of the host I/O. If the host I/O size is larger than a chunk, then each host
write causes multiple COFWs.
In purely sequential environments, the ratio of host I/Os to COFWs is the same as the ratio of chunk
size to host I/O size.
• For 4 KiB host I/Os, we see a COFW for every 16 host writes.
• For 256 KiB host I/Os, we see 4 COFWs for each host write.
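The two cases above generalize as follows. This is an illustrative Python sketch only, assuming a 64 KiB chunk size and a purely sequential or purely random workload.

CHUNK_KIB = 64

def cofws_per_host_write(io_size_kib, pattern):
    """Approximate COFWs generated per host write, per the examples above."""
    if pattern == 'sequential':
        # Sequential writes eventually fill whole chunks.
        return io_size_kib / CHUNK_KIB
    if pattern == 'random':
        # Initially, each random write touches at least one new chunk.
        return max(1.0, io_size_kib / CHUNK_KIB)
    raise ValueError('pattern must be "sequential" or "random"')

print(cofws_per_host_write(4, 'sequential'))    # 0.0625 -> one COFW per 16 host writes
print(cofws_per_host_write(256, 'sequential'))  # 4.0    -> 4 COFWs per host write
print(cofws_per_host_write(4, 'random'))        # 1.0    -> at least one COFW per host write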
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
14
As noted earlier, an understanding of the COFW process, and the role played by the RLP, is vital to
understanding the performance of, and planning for, Snapshots.
A LUN which has been added to the Reserved LUN Pool, and then allocated to a Source LUN, has data
stored in 3 distinct areas:
• A bitmap, found at the beginning of the LUN. This bitmap tracks chunk usage on the Reserved
LUN.
• A chunk index and status area, which points to the location of data on the LUN, and keeps the
status of chunks on the Source LUN. It is indexed by chunk number on the Source LUN, and
therefore has a size which is related to Source LUN size.
• The chunk storage area. COFW data is saved here. This area occupies the rest of the Reserved
LUN.
These areas are contiguous on disk. The gaps between them on the slide are there to simplify the
illustration.
The required size of Area 2 is only known once the Reserved LUN is assigned to a Source LUN. The first
COFW to the Reserved LUN triggers creation of the map area, resulting in a much higher level of write
activity than is the case for subsequent COFWs.
<Continued>
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
15
Each of these areas is subject to SnapView paging, meaning that its data can be paged into SP memory
or out to disk as required. This paging operates in 64 kB ‘pages’, so writes to these SnapView areas will
always be 64 kB in size when paging, but will be 8 KiB in size in most other circumstances. Similarly,
the COFW process writes data in 64 KiB chunks only.
When a COFW is performed, each of these areas is updated. A single host write of 64 KiB or less can
therefore cause 4 writes of 64 KiB (at worst – usually one 64 KiB and three 8 KiB writes) to be made to
the RLP, in addition to the 64 KiB read made from the Source LUN. This has performance implications,
especially in environments where access to the Source LUN involves random, small-block writes.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
16
These screenshots show activity on an RLP LUN just after a Session was started.
Writes, shown in red, take place to metadata areas on the RL; moves, shown in green, are the
actual 64 kB data chunks on the RL. The Source LUN used here is 50 GB in size; had it been larger,
the sequential writes seen at around block 200,000 would have appeared elsewhere. As an
example, for a 200 GB Source LUN, the line would appear just above the 800,000 block address.
The areas highlighted by oval outlines will be expanded in subsequent screenshots.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
17
The index area is accessed largely randomly. This random activity affects the performance of the
RLP LUN, and should act as a guideline when selecting disk type and RAID type for RLP LUNs. Note
that random reads as well as random writes are occurring in the index area.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
18
Write activity (shown in green – these are actually moves performed by the Data Mover Layer) to
the chunk storage area is sequential in nature. Note that this screenshot consists of writes only –
the Snapshot was not being read (or, in fact, accessed) by the secondary host.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
19
We’ll take a look at 3 brief examples of sizing the RLP for capacity; these are the same as the
examples used when sizing for performance. A rough capacity estimate is sketched after the examples.
Note that, in a purely random environment, each host write (at least initially) causes at least one
COFW, no matter the size of the host I/O. If the host I/O size is larger than a chunk, then each host
write causes multiple COFWs.
In purely sequential environments, the ratio of host I/Os to COFWs is the same as the ratio of chunk
size to host I/O size. For 4 KiB host I/Os, we see a COFW for every 16 host writes, and for 256 KiB
host I/Os, we see 4 COFWs for each host write.
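For capacity, the dominant consumer of RLP space is the 64 KiB chunk saved per COFW; the map areas add a small amount on top. The sketch below is a rough Python estimate only, taking the expected chunk-level change rate as input and ignoring map overhead.

def rlp_capacity_gib(source_lun_gib, chunk_change_fraction):
    """Rough RLP capacity for one Session on one Source LUN (map overhead ignored).

    chunk_change_fraction: fraction of the Source LUN's 64 KiB chunks expected
    to be touched over the life of the Session (rule of thumb: Snapshots work
    best below roughly 0.30).
    """
    return source_lun_gib * chunk_change_fraction

print(rlp_capacity_gib(500, 0.10))   # 50.0 GiB, matching the 10% rule of thumb
print(rlp_capacity_gib(500, 0.30))   # 150.0 GiB at the 30% change-rate guideline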
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
20
For optimal performance, Reserved LUNs should be created on the number of disks, and the RAID
type, that match the expected I/O profile. Bear in mind, particularly in environments where the
Source LUN has a small-block, write-intensive load, that the RLP may need to handle much more
bandwidth than the Source LUN.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
21
This slide is a brief overview of SnapView Clone operations. The 2 major issues associated with
Clones are the performance effect of synchronization (related to the size of the Source LUN), and
the correct placement of the Clone. The RAID type and disk type chosen for the Clone is also very
important.
The Clone Private LUN (CPL) – strictly 2 LUNs – contains all the bitmaps used by SnapView Clones.
Each LUN must be a minimum of 1 GiB in size.
Because a Clone is a full copy, there is no COFW mechanism. While a Clone is synchronized, each
write to the Source LUN causes a write to be made to the Clone, which places an additional load on
the write cache. Synchronization can be costly because of the tracking granularity associated with
Clones.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
22
Clone synchronization reads data from the Source LUN and writes it to one or more Clones, which
are independent LUNs. Because of their independence, separate writes go to each, thereby adding
additional load to the write cache and disks.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
23
Clone synchronization reads data from the Source LUN and writes it to one or more Clones, which
are independent LUNs. Because of their independence, separate writes go to each, thereby adding
additional load to the write cache and disks.
Remember that when synchronizing or reverse synchronizing, the entire extent is copied if any
changes occurred to the extent.
The easy way to calculate the extent size for a Clone is to use the LUN size in GiB as the extent size in
blocks; e.g., a 256 GiB LUN has an extent size of 256 blocks = 128 KiB, a 400 GiB LUN has an extent
size of 200 KiB, etc.
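A minimal Python sketch of that extent-size rule, assuming 512-byte blocks:

def clone_extent_kib(lun_size_gib):
    """Extent size in KiB: the LUN size in GiB equals the extent size in 512-byte blocks."""
    extent_blocks = lun_size_gib
    return extent_blocks * 512 / 1024

print(clone_extent_kib(256))   # 128.0 KiB
print(clone_extent_kib(400))   # 200.0 KiB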
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
24
No more than 8 clones per Source LUN.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
25
The major items to be considered are the number of LUNs to be replicated, how they will be
replicated, and the total data size. The use of LVMs and clusters may add an additional level of
complexity.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
26
Decisions about expiration of backup volumes are important because they affect the total amount
of space required for the local replication solution.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
27
Within mission-critical data, some subset of that data may be more time sensitive than others.
Subcategories of critical data may be needed, with higher-priority data being protected more
rigorously.
History shows that data errors tend to recur as a result of consistently recurring events.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
28
To summarize, local replication requirements include, but are not limited to:
• Volumes belonging to volume groups
• Frequency of replications required
• Expiration schedule of replicated copies (stopping Sessions, or removing Clones from Clone
Group)
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
29
Note that these are LUN IOPs (as seen from the host). Disk IOPs must take account of the RAID type
used for the LUNs.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
30
Note that these are LUN IOPs (as seen from the host). Disk IOPs must take account of the RAID type
used for the LUNs.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
31
This lesson covers local replication.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
32
VNX Snapshots address limitations of copy on first write (COFW) SnapView Snapshots. The VNX
Snapshot technology is redirect on write (ROW). VNX Snapshots are limited to Pool-based LUNs
(i.e. not RAID Group LUNs). Up to 256 writeable VNX Snapshots can be associated
with any Primary LUN, though only 255 are user visible. Because a VNX Snapshot uses
pointers rather than a full copy of the LUN, it is space-efficient, and can be created almost
instantaneously. The ROW mechanism does not use a read from the Primary LUN as part of its
operation, and thus eliminates the most costly (in performance terms) part of the process.
A Reserved LUN Pool is not required for VNX Snapshots - VNX Snapshots use space from the
same Pool as their Primary LUN. Management options allow limits to be placed on the amount
of space used for VNX Snapshots in a Pool.
VNX Snapshots allow replicas of replicas; this includes Snapshots of VNX Snapshots,
Snapshots of attached VNX Snapshot Mount Points, and Snapshots of VNX Snapshot
Consistency Groups. VNX Snapshots can coexist with SnapView snapshots and clones, and are
supported by RecoverPoint.
If all VNX Snapshots are removed from a Thick LUN, the driver will detect this and begin the
defragmentation process. This converts Thick LUN slices back to contiguous 1 GiB addresses.
The process runs in the background and can take a significant amount of time. The user cannot
disable this conversion process directly; however, it can be prevented by keeping at least one
VNX Snapshot of the Thick LUN.
Note: while a delete process is running, the Snapshot name remains in use. So, if you need to
create a new Snapshot with the same name, it is advisable to rename the old Snapshot prior to
deleting it.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
33
A VNX Snapshot Mount Point (SMP) is a container that holds SCSI attributes
• WWN
• Name
• Storage Group LUN ID, etc.
An SMP is similar to a Snapshot LUN in the SnapView Snapshot environment. It is independent of
the VNX Snapshot (though it is tied to the Primary LUN), and can therefore exist without a VNX
Snapshot attached to it. Because it behaves like a LUN, it can be migrated to another host and
retain its WWN. In order for the host to see the point in time data, the SMP must have a VNX
Snapshot attached to it. Once the Snapshot is attached, the host will see the LUN as online and
accessible. If the Snapshot is detached, and then another Snapshot is attached, the host will see
the new point in time data without the need for a rescan of the bus.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
34
The VNX Snapshot Consistency Group allows Snapshots to be taken at the same point in time on
multiple Primary LUNs. If individual Snapshots were made of the Primary LUNs, it is possible that
updates to one or more Primary LUNs could take place between the time of the Snapshot on the
first Primary LUN and the time of the Snapshot on the last Primary LUN. This causes inconsistency
in the Snapshot data for the set of LUNs. The user can ensure consistency by quiescing the
application but this is unacceptable in many environments.
A Consistency Group can have a Snapshot taken of it, and can have members added or removed.
Restore operations can only be performed on Groups that have the same members as the
Snapshot. This may require modifying Group membership prior to a restore.
When a Snapshot is made of a Group, updates to all members are held until the operation
completes. This has the same effect as a quiesce of the I/O to the members, but is performed on
the storage system rather than on the host.
VNX Snapshot Set – a group of all Snapshots from all LUNs in a Consistency Group. For
simplicity, it is referred to as a CG Snap throughout this material.
VNX Snapshot Family – a group of Snapshots from the same Primary LUN.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
35
This slide, and the following slide, compare the two VNX Snapshot technologies.
In this slide, the processes involved in a new host write to the source LUN (primary LUN) are
compared. In the familiar SnapView Snapshot environment, the COFW process reads the original
64 KiB data chunk from the source LUN, writes that chunk to the Reserved LUN, and updates the
pointers in the Reserved LUN map area. Once these steps complete, the host write to the Source
LUN is allowed to proceed, and the host will receive an acknowledgement that the write is
complete. If a SnapView Snapshot is deleted, data in the RLP is simply removed, and no processing
takes place on the Source LUN.
In the case of a VNX Snapshot, a new host write is simply written to a new location (redirected)
inside the Pool. The original data remains where it is, and is untouched by the ROW process. The
granularity of Thin LUNs is 8 KiB, and this is the granularity used for VNX Snapshots.
If the last VNX Snapshot is removed from a Thick LUN, the defragmentation process moves
the new data to the original locations on disk, and freed space is returned to the Pool.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
36
In this slide, the processes involved in a secondary host read of a Snapshot are compared. In the
familiar SnapView Snapshot environment, data which has not yet been modified is read from the
source LUN, while data that has been modified since the start of the SnapView Session is read from
the Reserved LUN. SnapView always needs to perform a lookup to determine whether data is on
the Source LUN or Reserved LUN, which causes Snapshot reads to be slower than Source LUN
reads.
In the case of a VNX Snapshot, the original data remains where it is, and is therefore read from the
original location on the Primary LUN. That location will be discovered by a lookup which is no
different to that performed on a Thin LUN which does not have a VNX Snapshot, so the
performance is largely unchanged.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
37
VNX Snapshot management is performed from the Data Protection tab in the top navigation bar.
An option under Wizards is the Snapshot Mount Point Configuration Wizard, while the
Consistency Group area has the Create Snapshot Consistency Group link.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
38
When a Pool is configured, the Advanced tab allows the selection of parameters related to the use
of VNX Snapshots.
The upper checkbox, selected by default, modifies VNX Snapshot behavior based on total Pool
utilization; the default behavior will start to delete the oldest VNX Snapshots when the Pool
becomes 95% full, and will continue with the deletion until the Pool is 85% full.
The lower checkbox, deselected by default, modifies VNX Snapshot behavior based on the amount
of Pool space occupied by VNX Snapshots; if it is selected, the default behavior is to start to delete
the oldest VNX Snapshots when 25% of the total Pool space is being used by VNX Snapshots, and to
continue with the deletion until the total Pool space used by VNX Snapshots reaches 20%.
Note that these options are not mutually exclusive. Auto-deletion may be paused by the user at any
time that it is running, and may be resumed at any later time.
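The two auto-delete triggers can be summarized as a simple check. The sketch below is illustrative only, using the default thresholds named above; it is not how the array implements auto-deletion internally.

def should_start_auto_delete(pool_used_pct, snap_used_pct,
                             pool_full_enabled=True, snap_space_enabled=False,
                             pool_full_hwm=95, snap_space_hwm=25):
    """Return True if the oldest VNX Snapshots would start to be deleted."""
    if pool_full_enabled and pool_used_pct >= pool_full_hwm:
        return True   # deletion continues until the Pool drops to 85% full
    if snap_space_enabled and snap_used_pct >= snap_space_hwm:
        return True   # deletion continues until Snapshot space drops to 20%
    return False

print(should_start_auto_delete(pool_used_pct=96, snap_used_pct=10))    # True
print(should_start_auto_delete(pool_used_pct=80, snap_used_pct=30,
                               snap_space_enabled=True))               # True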
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
39
VNX Snapshots may be taken of individual Primary LUNs, VNX Snapshots, or Consistency Groups. A
VNX Snapshot of a Consistency Group implies that VNX Snapshots have been taken of the member
Primary LUNs at the same point in time. This can be seen in the Creation Time column for the VNX
Snapshots.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
40
This lesson covers local replication.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
41
VNX SnapSure saves disk space and time by allowing multiple snapshot versions of a VNX file
system. These logical views are called checkpoints. SnapSure checkpoints can be read-only or
read/write.
SnapSure is not a discrete copy product and does not maintain a mirror relationship between
source and target volumes. It maintains pointers to track changes to the primary file system and
reads data from either the primary file system or from a specified copy area. The copy area is
referred to as a savVol, and is defined as a VNX File Metavolume.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
42
PFS
The PFS is any typical VNX file system. Applications that require access to the PFS are referred to as
“PFS Applications”.
Checkpoint
A point-in-time view of the PFS. SnapSure uses a combination of live PFS data and saved data to
display what the file system looked like at a particular point-in-time. A checkpoint is thus
dependent on the PFS and is not a disaster recovery solution. It is NOT a copy of a file system.
SavVol
Each PFS with a checkpoint has an associated save volume, or SavVol. The first change made to
each PFS data block triggers SnapSure to copy that data block to the SavVol. It also holds the
changes made to a writeable checkpoint.
Bitmap
SnapSure maintains a bitmap of every data block in the PFS, which identifies whether the data block has
changed. Each PFS with a checkpoint has one bitmap that always refers to the most recent
checkpoint. The only exception is when a PFS has a writeable checkpoint; in that case an individual
bitmap is created for each writeable checkpoint to track the changes made to it.
Blockmap
A blockmap of the SavVol is maintained to record the address in the SavVol of each saved data
block. Each checkpoint, read-only or writeable, has its own blockmap.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
43
In the next several slides, we will see how SnapSure works in capturing data from file system
modifications and providing data to users and applications.
Displayed on this slide is a PFS with data blocks containing the letters A through F. When the first
file system checkpoint is created, a SavVol is also created on disk to hold the bitmap, the original
data from the PFS, and that particular checkpoint’s blockmap. Each bit of the bitmap will reference
a block on the PFS.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
44
Next, we are going to have a user or application make some modifications to the PFS. In this case,
we are writing an “H” in the place of the “B”, and a “K” in the place of the “E”.
Before these writes can take place, SnapSure will place a hold on the I/Os and copy the “B” and “E”
to the SavVol. Then the blockmap will be updated with the location of the data in the SavVol. In this
example, the first column of the blockmap refers to the block address in the PFS, and the second
column refers to the block address in the SavVol. Next, the bitmap is updated with “1”s wherever a
block has changed in the PFS. A “0” means that there were no changes for that block.
After this process takes place, SnapSure releases the hold and the writes can proceed. If these
same two blocks are modified once again, the writes go through and nothing is saved in the SavVol.
This is because the original data from that point in time has already been saved, and anything after
that is not Ckpt1’s responsibility.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
45
When a second checkpoint is created, another blockmap will be created in the same SavVol as the
old checkpoint. The bitmap that used to refer to the Ckpt1, now refers to Ckpt2, which is the
newest checkpoint. The bitmap is then reset to all “0”s waiting for the next PFS modification. Any
writes from now on will be monitored by Ckpt2 since it’s the newest checkpoint.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
46
For our next example, we have an application modifying the first, second, and sixth blocks in the
PFS with the letters “J”, “L”, and “S”. SnapSure will hold these writes, copy the original data to the
SavVol, and update the bitmap and Ckpt2’s blockmap.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
47
If a user or application decides to access Ckpt2, SnapSure will check the bitmap for any PFS blocks
that were modified. In this case, the first, second, and sixth blocks were modified. The data for
these blocks will come from the SavVol. Everything else will be read from the PFS.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
48
When an older checkpoint is accessed, like Ckpt1 for example, SnapSure cannot use the bitmap
because it refers to the newest checkpoint. SnapSure will have to access the desired checkpoint’s
blockmap to check for any data that has been copied to the SavVol.
In this case, we’ll take the first and second blocks from the SavVol and use the data to fill the
second and fifth blocks of Ckpt1. SnapSure will continue to read all of the old blockmaps as it
makes its way to the newest checkpoint.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
49
Once SnapSure arrives at the newest checkpoint, the bitmap can then be utilized to determine any
other blocks that have changed in the PFS. In the example above, the bitmap says that the first,
second, and sixth blocks have changed in the PFS. Notice that we already have the second block in
the PFS accounted for with a “B” from the previous slide due to blockmap1. If SnapSure finds
blocks that have already been accounted for, it simply skips them and goes to the next block that is
represented in the bitmap.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
50
SnapSure requires a SavVol to hold data when you create the first checkpoint of a PFS. AVM
algorithms determine the selection of disks used for a SavVol. AVM tries to match the storage pool
for the SavVol with that of the PFS whenever possible. If the storage pool is a system-defined pool,
and it is too small for the SavVol, AVM will auto-extend it. User-defined storage pools cannot be
auto-extended.
The system allows SavVols to be created and extended until the sum of the space consumed by all
SavVols on the system exceeds 20% (default) of the total space available. This is tunable in the
/nas/sys/nas_param file.
When the SavVol High Water Mark (HWM) is reached, SnapSure will extend the SavVol based on
the size of the file system (a small sizing sketch follows the list):
• If PFS < 64 MiB, then extension = 64 MiB.
• If PFS < 20 GiB, then extension = PFS size.
• If PFS > 200 GiB, then extension = 10 percent of the PFS size.
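A minimal Python sketch of those extension rules (sizes in MiB). The slide does not state a rule for file systems between 20 GiB and 200 GiB, so that range is flagged rather than guessed.

MiB = 1
GiB = 1024  # MiB per GiB

def savvol_extension_mib(pfs_size_mib):
    """SavVol extension size on HWM, per the rules listed above."""
    if pfs_size_mib < 64 * MiB:
        return 64 * MiB
    if pfs_size_mib < 20 * GiB:
        return pfs_size_mib
    if pfs_size_mib > 200 * GiB:
        return pfs_size_mib * 0.10
    raise ValueError('the 20 GiB to 200 GiB range is not covered by the slide')

print(savvol_extension_mib(10 * GiB))    # 10240   (extension equals the PFS size)
print(savvol_extension_mib(500 * GiB))   # 51200.0 (10 percent of the PFS size)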
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
51
VNX for File allows up to 1 GiB of physical RAM per Data Mover for paging the bitmap and blockmaps
of all the PFSs that have checkpoints. The 1 GiB also ensures that sufficient Data Mover memory is
available for VNX Replicator. For systems with less than 4 GiB of memory, a total of 512 MiB of
physical RAM per Data Mover is allocated for blockmap storage.
Each time a checkpoint is read, the system queries it to find the required data block’s location. For
any checkpoint, blockmap entries needed by the system, but not resident in main memory are
paged in from the SavVol. The entries stay in main memory until system memory consumption
requires them to be purged.
A bitmap will consume 1 bit for every block (8 KiB) in the PFS. The blockmap will consume 8 bytes
for every block (8 KiB) in the checkpoint. Once a second checkpoint is created, still one bitmap
exists. There is one bitmap for the most recent checkpoint only. Blockmaps exist for every
checkpoint created. The server_sysstat command with the -blockmap switch will provide the Data
Mover memory space currently being consumed by all of the blockmaps.
The following slide shows an example of calculating the SnapSure memory requirements for a 10
GiB PFS that has two checkpoints created.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
52
The VNX checks the amount of Data Mover memory available before it allows a checkpoint file system to be
created, extended, or mounted. If that amount exceeds the predefined limit of the Data Mover’s total
memory allotted for checkpoint (or VNX Replicator) SavVols, an error message is sent to the
/nas/log/sys_log file. At this point, you can delete any unused SavVols, upgrade to a Data Mover with
more memory, or use another Data Mover with more memory.
To avoid this situation, plan memory consumption carefully. Use the calculations displayed here to
determine your specific memory requirements. These calculations support the “rule of thumb” that says for
every 1 GiB of savVol space consumed, one MiB of memory will be required.
The scenario says that a 10 GiB PFS has 2 checkpoints created. One checkpoint has 10% of the PFS in the
SavVol, and the other checkpoint has 1% of the PFS in the SavVol. Checkpoint 2 is the only one that will
have a bitmap associated with it since it is the most recent checkpoint.
Since each bitmap requires 1 bit for every block in the PFS, you must calculate the number of blocks in the
PFS. The calculation on the slide shows that the PFS has 1,310,720 8 KiB blocks in the 10 GiB PFS. Each
block requires 1 bit, which calculates to 160 KiB. Checkpoint 2 also has a blockmap which will consume 8
bytes for every block in the checkpoint. The checkpoint is 1% of the PFS. The calculation on the slide shows
that there are 13,107.2 blocks. Multiply this by 8 bytes per block to get about 102 KiB.
Checkpoint 1 does not require a bitmap and 1024 KiB is required for the blockmap.
As a result, the total memory utilization is 1,024 KiB (checkpoint 1 blockmap) + 102 KiB (checkpoint 2
blockmap) + 160 KiB (checkpoint 2 bitmap) = 1,286 KiB.
Using the rule of thumb: the checkpoints have 11% of the PFS in the SavVol, or 1.1 GiB. Therefore, that
would require ~1 MiB of memory when rounding up.
Note: Remember that VNX Replicator internal checkpoints need to be taken into account.
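The worked example above follows from two sizing rules: the bitmap uses 1 bit per 8 KiB PFS block, and each blockmap uses 8 bytes per 8 KiB block held in the SavVol. A minimal Python sketch of that arithmetic:

KiB, GiB = 1024, 1024 ** 3
BLOCK = 8 * KiB

def snapsure_memory_kib(pfs_bytes, checkpoint_fractions):
    """Estimated Data Mover memory (in KiB) for a PFS and its checkpoints.

    checkpoint_fractions: fraction of the PFS held in the SavVol by each
    checkpoint; only the most recent checkpoint owns the bitmap.
    """
    pfs_blocks = pfs_bytes // BLOCK
    bitmap_bytes = pfs_blocks / 8                               # 1 bit per PFS block
    blockmap_bytes = sum(f * pfs_blocks * 8 for f in checkpoint_fractions)
    return (bitmap_bytes + blockmap_bytes) / KiB

# 10 GiB PFS; checkpoint 1 holds 10% of the PFS, checkpoint 2 holds 1%.
print(round(snapsure_memory_kib(10 * GiB, [0.10, 0.01])))   # 1286 (KiB), as above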
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
53
SnapSure operations typically cause a decrease in performance.
Creating a checkpoint requires the PFS to be paused. Therefore, PFS write activity is suspended, but
read activity continues while the system creates the checkpoint. The pause time depends on the
amount of data in the cache, but it is typically one second or less. SnapSure needs time to create
the SavVol for the file system if the checkpoint is the first one.
The PFS will see performance degradation every time a block is modified for the first time only.
This is known as the Copy On First Write (COFW) penalty. Once that particular block is modified,
any other modifications to the same block will not impact performance.
Deleting a checkpoint requires the PFS to be paused. Therefore, PFS write activity is suspended
momentarily, but read activity continues while the system deletes the checkpoint.
Restoring a PFS from a checkpoint requires the PFS to be frozen. Therefore, all PFS activities are
suspended during the restore initialization process.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
54
Refreshing a checkpoint requires the checkpoint to be frozen. Checkpoint read activity is suspended
while the system refreshes the checkpoint. During a refresh, the checkpoint is deleted and another
one is created with the same checkpoint name. Clients attempting to access the checkpoint during
a refresh process experience the following:
• NFS clients — The system retries the connection continuously. When the system
thaws, the file system automatically remounts.
• CIFS clients — Depending on the application running on Windows, or if the system freezes
for more than 45 seconds, the Windows application might drop the link. The share might
need to be remounted and remapped.
If a checkpoint becomes inactive for any reason, read/write activity on the PFS continues
uninterrupted.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
55
NAS Engineering has done some SnapSure performance testing. The results will be shown on the
following slides. The test environment was CIFS only. NFS has been shown to perform better under
all the conditions that will be described here. CIFS results vary by workload.
The testing was done with “pure” workloads including sequential read, random read, sequential
write, and random write. Most workloads typically include a combination of all (or some) of these
workloads.
The throughput was tested on an uncheckpointed PFS, as well as a PFS with a full checkpoint and
an empty checkpoint. A full checkpoint means that 100% of the PFS blocks have already been
saved in the SavVol. In this case, all writes to existing PFS data were guaranteed to be re-writes and
do not affect the checkpoint (the COFW was already performed). All reads from the full checkpoint
are satisfied by the SavVol.
An empty checkpoint means that there is no data in the SavVol; therefore, any write to the PFS
requires a write to the SavVol (COFW). All read requests to the empty checkpoint are satisfied by
the PFS.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
56
The results clearly show that when SnapSure performs its expected additional work, the COFW
activity, throughput decreases. When writing to the PFS with an empty checkpoint, we
see almost a 50% decrease in throughput. Since every write to a block would be the first write, the
block would first have to be written to the SavVol. Random Write shows a similar degradation.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
57
When the checkpoint is empty, reading from the checkpoint is much the same as reading from a
file system without any checkpoints.
When the checkpoint is fully populated, reading from the checkpoint shows a 90% degradation for
a sequential read. This is caused by SnapSure processing required to check the bitmap/blockmap
to determine if the block has been changed and whether it should be read from the PFS or the
SavVol. When performing a random read on the full checkpoint, the performance is a bit better
than sequential. This is because SavVol reads are random in nature.
When reading from a checkpoint, the size of the checkpoint determines how long it takes to decide
whether a block should be read from the SavVol or the PFS, and how long the read itself takes.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
58
Writes to a single SavVol are purely sequential. NL-SAS drives have very good sequential I/O
performance. On the other hand, reads from a SavVol are nearly always random where SAS drives
perform better. Workload analysis is important in determining if NL-SAS drives are appropriate for
SavVols.
Many SnapSure checkpoints are never read from at all; or, if they are, the reads are infrequent and
nonperformance-sensitive. In these cases, NL-SAS drives could be used for SavVols. If checkpoints
are used for testing, data mining and data sharing, and experience periods of heavy read access,
then SAS drives are a better choice.
Be careful when placing multiple SavVols on a single set of NL-SAS drives, since the I/O at the disk
level becomes more random, a pattern where SAS drives perform better.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
59
Migration activity produces significant changes in the file system. Therefore, it is best to complete
migration tasks before using SnapSure. This avoids consuming SnapSure resources to capture
information that is unnecessary to the checkpoint.
If you have a choice, read from the most current (active) checkpoint. When the latest checkpoint is
accessed by clients, SnapSure queries its bitmap for the existence of the needed block. Access
through the bitmap is faster than access through the blockmap. Therefore, read performance will
be slightly better from the most recent checkpoint than from older checkpoints where blockmaps
will need to be read.
When multiple checkpoints are active, additional Data Mover resources (memory and CPU) are
required. Therefore, less Data Mover memory is available for read cache and other operations.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
60
This lesson covers remote replication on VNX systems.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
61
Full copy mode copies all data from source to target, and does not track changes to the source
while the Session is running.
The number of links used and their capacity is decided by the amount of data needed to move
inside a given time.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
62
Incremental mode copies all data from source to target initially, and performs incremental updates
thereafter.
The number of links used and their capacity is decided by the amount of data needed to move
inside a given time.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
63
Incremental SAN Copy (ISC) allows the transfer of changed chunks only, from source to destination.
ISC copies all changes made until a user-defined point in time, and uses SnapView Snapshot
technology as required to keep track of where those changes are. The changed chunks are then
copied from source to destination and a checkpoint mechanism tracks the progress of the transfer.
The Source LUN is available to the host at all times. The Target LUN is only of use to an attached
host once the transfer is completed. At that point, the Target LUN will be a consistent, restartable,
but previous point-in-time copy of the Source LUN.
The numbered steps in the graphic shown here are:
1. Primary host writes to Source LUN
2. COFW invoked if needed
3. Acknowledgement from local storage system
4. Trigger event
5. Chunks copied from local to remote storage system
6. Acknowledgement from remote storage system
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
64
ISC is not designed to be a DR product, though it is often used in that manner. Important
considerations are mentioned here.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
65
Locality of Reference: the write I/O rate does NOT always equal the number of COFWs.
 1,000 I/O per second
 20% write = 200 writes per second, which does NOT always cause 200 COFWs per
second; usually some blocks are ‘re-hit’, often significantly.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
66
In some cases, the necessary data is not available or the customer is not prepared to provide it. In
these cases, it is almost impossible to predict whether the SAN Copy cycle time matches the
desired RPO. As such, customer assumptions are always listed as the lowest preference and quite
obviously present the most risk to successful delivery of the SAN Copy solution.
The change rate and write activity are the two most important factors here.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
67
The low, medium, and high profiles above indicate the change rates that are likely to
be seen. If you map your starting reference point to these change profiles, then the table can help
define how the logarithmic curve applies to your customer environment.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
68
Time taken to move the data depends on a number of factors, discussed on this slide.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
69
The key point here is that more resources allow a more even spread of I/O across LUNs. Remember
to factor in any Clone operations.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
70
Here we are doing a simple sizing. As we will see, the simple math is an insufficient way to
determine cycle times.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
71
We are looking at synchronizing 12.5 GB/hour. Our simple calculations determine we need less
than a T3 to propagate the changes over the link for a one hour cycle.
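A quick way to sanity-check that statement is the short Python sketch below; the 12.5 GB/hour figure
comes from the slide, and the factor of 10 bits per byte (to cover protocol overhead) follows the
convention used later in this module.

# Does 12.5 GB/hour of changed data fit on a T3 link (about 45 Mb/s)?
changed_gb_per_hour = 12.5
mb_per_s = changed_gb_per_hour * 1000 / 3600    # ~3.5 MB/s of changed data
mbit_per_s = mb_per_s * 10                      # ~35 Mb/s using 10 bits/byte for overhead
print(round(mbit_per_s, 1), mbit_per_s < 45)    # ~34.7 Mb/s, True: less than a T3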
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
72
Now we walk through a sample sizing effort.
The estimated changed data is the total amount of data multiplied by the 12% logarithmic change
rate. We are using a high change rate for a conservative estimate.
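As a sketch of that estimate, consider the calculation below. The 500 GB total is a hypothetical figure,
not from the slide; only the 12% high change rate comes from the text.

# Estimated changed data per cycle = total data x logarithmic change rate
total_gb = 500          # hypothetical total protected capacity
change_rate = 0.12      # high (conservative) change rate from the profile table
changed_gb = total_gb * change_rate
print(changed_gb)       # 60 GB of changed data to move per cycle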
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
73
The simple method only gets us into the ‘ballpark’ for estimating changed data. We need to know
the peak number of changes for a given cycle.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
74
There are a number of ways that you can improve cycle times with a SAN Copy configuration. Here
is a list of a few of them.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
75
A key factor is the front-end port utilization. If you can’t get in the front door, the work certainly is
not going to get done.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
76
Note that ISC uses 64 KiB transfers for the initial synchronization, whereas full SAN Copy can use up to 1 MiB.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
77
Application data may need processing before it becomes usable; this processing, while strictly part
of the RTO, is ignored here.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
78
This lesson covers local replication.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
79
The VNX series offers all the replication methodologies needed to keep data available and secure.
Integrated with Unisphere, RecoverPoint/SE offers several data protection strategies.
RecoverPoint/SE Continuous data protection (CDP) enables local protection for block
environments.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
80
There are two types of splitters that can be employed in a RecoverPoint/SE solution:
1. K drivers are low level operating system kernel drivers that split the I/O – for Windows hosts
only (with RecoverPoint/SE)
2. VNX splitter drivers where the driver runs on the SPs
Choosing the type of splitter is one of the main design concerns in architecting a RecoverPoint/SE
solution. Consider the following when choosing a splitter type:
• When LUN size > 2 TB needs to be supported, VNX splitter is currently the only choice.
• VNX splitters are easy to deploy and manage when scalability is not a big concern and should
be chosen whenever there is an option to choose a VNX splitter.
• Host-based splitters are ideal for small installations where a small hit on the host CPU
performance is acceptable. Large numbers of hosts mean that manageability will be
cumbersome.
• When performance is critical, array-based splitters are a good choice.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
81
To deploy RecoverPoint/SE release 3.4 and later with VNX splitters, you need:
• The latest VNX OE bundle.
• The latest RecoverPoint splitter engine (driver). Verify whether this is required based on the
VNX OE bundle version.
• The RecoverPoint splitter enabler.
• All RPA ports must be zoned with all available VNX SP ports
RecoverPoint/SE version 3.4 supports the VNX array-based splitter. This splitter runs in each
storage processor of a VNX array and will split all writes to a VNX LUN, sending one copy to the
original target and the other copy to the RecoverPoint appliance.
Refer to the EMC RecoverPoint Deployment Manager Release Notes for the latest version
information.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
82
For sizing journal volumes, use the following equation:
Journal size = (changed data rate in Mib/s) x (required rollback time in seconds) / (fraction of the
journal available after the target-side log reservation) x (system factor, usually 1.05 to allow 5% for
internal system needs)
For example: Journal size = 5 Mib/s x 86,400 seconds (24 hours) / 0.8 (80% available) x 1.05 =
567,000 Mib (~71 GiB). The minimum size for a journal volume is 5 GB.
Note: RecoverPoint field implementers recommend 20% of the data you are replicating for sizing
the journal volume.
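The same calculation, expressed as a minimal Python sketch using the example numbers above (5 Mib/s
change rate, 24-hour rollback window, 20% reserved for image access, 5% system factor):

# Journal size = change rate x rollback window / available fraction x system factor
change_rate_mib_s = 5              # changed data rate in Mib/s (from the example)
rollback_s = 24 * 60 * 60          # 24-hour required rollback window, in seconds
available_fraction = 0.8           # 80% usable after the target-side log reservation
system_factor = 1.05               # ~5% extra for internal system needs

journal_mib = change_rate_mib_s * rollback_s / available_fraction * system_factor
journal_gib = max(journal_mib / 8 / 1024, 5)       # megabits to GiB; never below the 5 GB minimum
print(round(journal_mib), round(journal_gib, 1))   # 567000 Mib, roughly 70 GiB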
The Snapshot consolidation feature allows the snapshots in the copy journal to be consolidated to
allow for the storage of a longer history of data. For most customers, the granularity of snapshots
becomes less important over time. Snapshot consolidation allows us to retain the crucial per write
or per second data of write transactions for a specified period of time (for example, the last 24
hours) and only then to start gradually decreasing the granularity of older snapshots, at preset
intervals (for example, to create daily, then weekly, and then monthly snapshots).
Journal volume sizing when utilizing ‘snapshot consolidation’ must account for the incremental
change of data over the consolidation period.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
83
A single RecoverPoint appliance can sustain an average of 75 MiB/s write I/Os and up to peaks of
110 MiB/s. This throughput figure should be used to calculate the number of appliances required
for the desired replication. A minimum of two RPAs are required for redundancy in any
RecoverPoint solution. The maximum sustainable incoming throughput for a single cluster is 600
MiB/s.
Always refer to the release notes for the most up to date information.
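A sketch of the appliance-count estimate under the published figures above; the 120 MiB/s aggregate
write rate used here is a hypothetical input, not from the slide.

import math

# Size the RPA cluster from the sustained write rate (per-appliance and cluster figures from the slide)
RPA_SUSTAINED_MIB_S = 75        # average write throughput per appliance
CLUSTER_MAX_MIB_S = 600         # maximum sustainable incoming throughput per cluster
MIN_RPAS = 2                    # minimum for redundancy

avg_write_mib_s = 120           # hypothetical environment-wide average write rate

rpas_needed = max(MIN_RPAS, math.ceil(avg_write_mib_s / RPA_SUSTAINED_MIB_S))
within_cluster_limit = avg_write_mib_s <= CLUSTER_MAX_MIB_S
print(rpas_needed, within_cluster_limit)   # 2 appliances, within the cluster limit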
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
84
Note that in this discussion I/O refers to write I/O only. There are three potential bottlenecks in a
RecoverPoint/SE environment that may limit the amount of I/O the system can sustain. The
bottlenecks are the WAN pipe between sites, the performance of the remote storage, and the
performance of the appliances themselves.
The system will be able to sustain the load dictated by the weakest link.
If the system cannot sustain the load, it enters a high load condition. High load occurs
when internal buffers on the appliance fill up. This may happen when the appliances themselves
cannot sustain the load or the available WAN bandwidth is insufficient. When the remote storage is
the bottleneck, behavior is different. See the Storage Performance section for details.
Under high load, the system keeps tracking the I/O. Once the high load condition is over, the system
resynchronizes the missed I/O. When sizing, keep in mind that occasional high loads
are not problematic for the end user. Hence, the system should be sized to sustain the
average load, not with the goal of sustaining the maximum peaks.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
85
During normal operations, RecoverPoint/SE has no impact on the production storage. During
initialization of a consistency group, the RP appliance reads from the production storage. This may
entail a performance impact on the source storage and theoretically impact production
applications.
It is possible to configure RP to throttle the read rate from the production storage although it is
difficult to find the optimal rate without impacting the production performance. The RecoverPoint
recommended best practice for first time initialization in large environments is to do it one
consistency group at a time.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
86
Review the system and locate errors or warnings using the RecoverPoint Management Application.
Log files, the system panel, system traffic, and the consistency group status are all useful in finding
problems.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
87
This lesson covers remote replication on VNX systems.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
88
The sequence of events when using MirrorView/A is as follows:
1. The host sends a write to the SP that owns the Primary Image LUN.
2. The Reserved LUN is updated via the COFW process (if required).
3. The Primary Image is updated.
4. A ‘write complete’ acknowledgement is passed to the host.
5. A Snapshot is taken of the Secondary Image LUN before the update cycle starts.
6. The changed data, tracked by a Snapshot, is sent to the remote SP that owns the Secondary
Image when the update is due.
7. Data being updated causes a COFW, which puts original data in the Reserved LUN.
8. The Secondary Image is updated.
9. Acknowledgements are passed to the Primary Image’s owning SP.
10. If required, the Secondary Image may be rolled back to a previous known consistent state.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
89
MirrorView/A uses SnapView Snapshots and SAN Copy as underlying technology.
The traditional SnapView 64 KiB chunk is still copied from Source LUN to the RLP. SnapView flags
the 2 KiB ‘sub-chunks’ that actually changed, and transfers only those 2 KiB pieces across the
network. This helps improve network performance, but does not affect the performance impact on
the Source LUN as a result of COFWs.
Note that latency of a MirrorView/A transfer does not affect performance of the primary image,
except that the SnapView Session runs for a longer time.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
90
With MV/A we need to understand the Workload to determine the appropriate amount of
bandwidth and RLP space. The desired Recovery Point Objective dictates the update interval
chosen.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
91
The effect of MirrorView/A on VNX performance cannot be ignored. The chart above shows a typical trace
of MirrorView/A performance.
In region A, the source LUN is operating normally and is not yet a MirrorView/A image. Response time
depends on a number of factors, but is typically around 1 ms for cached writes. Note that we are most
concerned about the effect of MirrorView/A on host writes to the production LUN (primary image); the
impact on reads is indirect, and typically much less severe.
In region B, the mirror is being updated at the specified intervals. Response times peak each time an update
cycle is started (because of COFW activity), then decrease over time, until the next update cycle starts.
Response times here also depend on various factors; the response time for writes is likely to be at least 3
times what it was previously, plus the response time for an uncached read; we expect to see a peak of at
least 15 ms and, under severe operating conditions, peak response times of 50 ms or more. The line drawn
midway through the sawtooth portion shows the average response time when update cycles are running. It
is important to note that COFW activity only occurs if data chunks are modified two or more times once the
underlying ISC Session has been marked – if there is no rewriting of data chunks (no locality of reference for
writes), then only tracking activity takes place.
Region C shows the response time curve that is produced by normal COFW activity. Note that this is shown
for illustration only – MirrorView/A Sessions are typically not kept active for long enough to see the full
decay curve. The time taken to return to the original response time (before the Session started) is known as
the recovery time; its duration depends on the size of the Source LUN, the granularity of tracking, the
number of writes per second, and the randomness of the data access pattern.
<Continued >
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
92
Be aware, then, that sizing a MirrorView/A solution does not consist simply of making sure that
there is enough space in the RLP – RLP performance dramatically affects the performance of the
primary image.
Factors to consider are therefore:
• Source LUN: size, RAID type, number of disks, data access pattern, R/W ratio, IOPs
• RLP LUNs: size, RAID type, number of disks, expected data access pattern, expected R/W ratio,
projected IOPs
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
93
This chart shows 6 update cycles on a MV/A mirror (LUN 50) with update interval set to 15 minutes
from start of last update. Immediately after the 5th update cycle ended, the I/O rate to the LUN was
doubled. Things to note:
• Effect of MV/A activity on LUN 51, which is not a mirror image, but is on the same RG as LUN
50
• Effect on response time
• Shape of response time curve during update
• Length of time taken for update cycle 6 to complete – what does this show?
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
94
Note that there is a performance impact even when MV/A is not actively transferring data.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
95
The sequence of events when using MirrorView/S is as follows:
1. The host sends a write to the SP that owns the Primary Image LUN.
2. (Optional) The WIL is updated.
3. The Primary Image is updated.
4. The data is sent to the Secondary VNX.
5. The Secondary Image is updated.
6. The Secondary VNX sends an acknowledgement to the Primary VNX.
7. A ‘write complete’ acknowledgement is sent to the host.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
96
The size of the I/O matters as it is sent across the link. Smaller pipes have a tougher time with
larger block sizes.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
97
Here we can see the effect of various block sizes on the transfer time. As an example, let’s use a T3
line.
T3: Speed = 45 Mb/s, approximately 4.5 MB/s
4 KiB block: 4 KiB / 4.5 MB/s = 0.9 ms
32 KiB block: 32 KiB / 4.5 MB/s = 7.3 ms
256 KiB block: 256 KiB / 4.5 MB/s = 58.3 ms
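The figures above can be reproduced with the small Python sketch below; it assumes the effective T3
rate of roughly 4.5 MB/s quoted on the slide (45 Mb/s divided by 10 bits per byte to cover overhead).

# One-way transfer time for a single write of a given block size over a T3 link
link_bytes_per_s = 4.5e6                        # ~45 Mb/s / 10 bits per byte
for block_kib in (4, 32, 256):
    transfer_ms = block_kib * 1024 / link_bytes_per_s * 1000
    print(block_kib, round(transfer_ms, 1))     # 4 -> 0.9 ms, 32 -> 7.3 ms, 256 -> 58.3 ms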
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
98
If data is not required for remote restart, it should not be mirrored. Find out from the customer
which data is not required. Examples might include paging space, test filesystems, filesystems used
for database reorgs, temporary files, etc.
Once you have your base figure, multiply the number of writes per second by the average blocksize.
This gives you the bandwidth requirement before compression.
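For example, the sketch below shows the arithmetic; the 250 writes/s and 8 KiB average block size are
hypothetical inputs chosen only for illustration.

# Bandwidth requirement before compression = writes per second x average block size
writes_per_s = 250                      # hypothetical write rate of the mirrored LUNs
avg_block_kib = 8                       # hypothetical average write size
kib_per_s = writes_per_s * avg_block_kib
mbit_per_s = kib_per_s * 10 / 1000      # 10 bits/byte for overhead, as used later in this module
print(kib_per_s, mbit_per_s)            # 2,000 KiB/s, about 20 Mb/s before compression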
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
99
When we size we must account for the round trip. In synchronous transfers an I/O is not complete
until the acknowledgement is sent.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
100
This slide helps you understand write distribution. Write I/Os are usually concentrated on a few
volumes.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
101
IP networks require qualification to ensure they have sufficient quality to carry data replication
workloads without significant packet loss or inconsistent packet arrival.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
102
Here are some things to watch out for.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
103
Before there were modeling tools, there were individual spreadsheets that all used iterations of the
same formulas, applying basic math to model a MirrorView/S solution. The problem with these
models was that they were often prone to errors and miscalculations.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
104
MV Link Service Time = Protocol converter delay + Signal Propagation Delay + Data Transfer Time
To estimate the impact of implementing MV on the expected write I/O response times, first we need
to calculate the amount of time it takes to send the I/O across the MV link to the Secondary and get
confirmation back that the data has been received. This is important because, in Synchronous mode,
the write I/O is not confirmed as complete to the host until it has been sent to the Secondary
and an acknowledgement has been received back from the Target VNX. Therefore, the host response
time for a local write I/O is extended.
With MV over Fibre Channel and DWDM implementations, the protocol converter delay for switches
or multiplexors is considered to be insignificant. However, careful consideration should be given to
ensuring sufficient buffer-to-buffer credits for MV over Fibre Channel implementations.
Signal Propagation Delay: The speed of light through a fiber optic cable is constant; however, a small
delay is incurred. This can be calculated as 1 millisecond per 125 miles. Remember to take into
account the number of round trips needed when calculating this figure.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
105
MV Link Service Time = Protocol converter delay + Signal Propagation Delay + Data Transfer Time
“Data Transfer Time” is the amount of time it takes for a single write I/O to be transmitted across
the MV link in one direction. It depends on the block size and the bandwidth of a single available MV
link.
Even though multiple MV links should be configured for resiliency, a single I/O can only be
transmitted down one link; therefore, data transfer time cannot be improved by increasing the
number of links. To improve data transfer time (and therefore the host write I/O response time), it
is necessary to use a faster link such as Fibre Channel.
The MV Link Service Time can now be calculated: protocol converter delay + signal propagation
delay + data transfer time = MV Link Service Time.
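A Python sketch that puts the three components together, following the 1 ms per 125 miles rule from the
previous slide. The distance, converter delay, round-trip count, and block size are hypothetical inputs;
the transfer-time arithmetic matches the T3 example earlier in this lesson.

# MV Link Service Time = protocol converter delay + signal propagation delay + data transfer time
distance_miles = 250                 # hypothetical one-way distance between sites
converter_delay_ms = 0.0             # negligible for FC/DWDM per the slide
block_kib = 8                        # hypothetical write size
link_bytes_per_s = 4.5e6             # single T3-class link, ~4.5 MB/s effective

one_way_ms = distance_miles / 125                # 1 ms per 125 miles of fiber
round_trips = 1                                  # assumed round trips per write I/O
propagation_ms = one_way_ms * 2 * round_trips    # out and back
transfer_ms = block_kib * 1024 / link_bytes_per_s * 1000
link_service_ms = converter_delay_ms + propagation_ms + transfer_ms
print(round(link_service_ms, 1))                 # about 5.8 ms under these assumptions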
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
106
MV Response Time = Current Response Time x 2 + Link Service Time
Now we can calculate the average expected host response time after implementing MV.
The MV write I/O response time is calculated from the formula above: twice the current response
time (or the new expected non-MV response time) plus the Link Service Time.
The read I/O response time is unaffected by implementing MV.
The average expected response time for read and write I/Os can be calculated by using the formula:
(Read IO Response Time * Read Ratio) + (Write IO Response Time * Write Ratio)
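For example, the sketch below applies the slide's formulas under hypothetical figures (2 ms reads, a 1 ms
current cached write, a 6 ms link service time, and a 3:1 read/write ratio):

# Average expected response time = (read RT x read ratio) + (write RT x write ratio)
read_rt_ms = 2.0                     # hypothetical read response time
current_write_rt_ms = 1.0            # hypothetical current cached-write response time
link_service_ms = 6.0                # hypothetical MV link service time
mv_write_rt_ms = current_write_rt_ms * 2 + link_service_ms   # formula from the slide above

read_ratio, write_ratio = 0.75, 0.25                         # 3:1 read/write mix
avg_rt_ms = read_rt_ms * read_ratio + mv_write_rt_ms * write_ratio
print(mv_write_rt_ms, avg_rt_ms)                             # 8.0 ms writes, 3.5 ms average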
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
107
With the introduction of corporately sponsored sizing tools, the risk associated with synchronous
design is reduced to the data that is input into the tool. Since the tool models with host or VNX
performance data, the risk is reduced to the data collection sample.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
108
Note that business processes performed on data after it is recovered are ignored when addressing
RTO here.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
109
Secondary MirrorView/S images should have write cache enabled; if they don’t, they’ll slow down
the synchronization, and even regular updates caused by host I/O to the primary image.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
110
In environments that use synchronous replication, such as MirrorView/S environments, planning
the network link is very important. The link must be sized to carry not only the regular writes, but
must also have enough headroom to handle traffic generated after an event such as a mirror
fracture. If the link is sized to carry regular writes only, it may be impossible to resynchronize the
mirror after loss of the link for any appreciable length of time.
The size of MV/S (and SnapView Clone) extents can cause the amount of data to be replicated to
increase dramatically as the fracture duration increases. Bear in mind that once an extent is marked
as dirty, by changing even a single block of data, the entire extent will be copied to the replica.
This example asks two questions: how much data is marked dirty during a fracture, and how much
bandwidth is required to copy that dirty data to the replica in a reasonable (usually
customer-specified) time?
Bandwidth required by regular writes can be calculated by multiplying the number of writes in a
second by the size of the writes. Here we have 1,000 IOPs with a R/W ratio of 3:1, giving 250
writes/s. Each write is 4 KiB in size, giving a required bandwidth of 250 x 4 KiB = 1,000 KiB/s. To
convert KiB/s to Kb/s, multiply by 10 (there are 8 bits in a byte, but using 10 compensates for
protocol overhead and the binary-decimal conversion). We therefore have 10,000 Kb/s, or (dividing
by 1,000 since this is a serial link) 10 Mb/s.
The amount of data marked dirty during a fracture will be the size of the extent multiplied by the
number of writes that occurred during the fracture, assuming random I/O. In this case, the extent size
is 256 blocks (for a 256 GiB LUN) = 128 KiB, and the number of writes is 250 writes/s x 5 minutes x
60 seconds/minute = 75,000 writes, giving 75,000 x 128 KiB = 9,600,000 KiB of dirty data. If this
data must be copied to the replica in 75 minutes (the
resynchronization time), then the bandwidth requirement for the synchronization traffic only is
9,600,000 KiB / (75 minutes x 60 seconds/minute) = 2,133 KiB/s, which can be converted to 21.3
Mb/s. Adding the 10 Mb/s required by regular writes gives a total requirement of 31.3 Mb/s.
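The whole example can be reproduced with the Python sketch below; every number comes from the
paragraph above (1,000 IOPS, a 3:1 read/write ratio, 4 KiB writes, 128 KiB extents, a 5-minute fracture,
a 75-minute resynchronization window, and the x10 bits-per-byte convention).

# Link sizing for MV/S: regular writes plus resynchronization after a 5-minute fracture
iops, write_fraction, write_kib = 1000, 0.25, 4
extent_kib = 128                      # 256-block extent for a 256 GiB LUN
fracture_s = 5 * 60
resync_s = 75 * 60

writes_per_s = iops * write_fraction                      # 250 writes/s
regular_mbit_s = writes_per_s * write_kib * 10 / 1000     # 10 Mb/s of steady-state writes

dirty_kib = writes_per_s * fracture_s * extent_kib        # 9,600,000 KiB marked dirty
resync_mbit_s = dirty_kib / resync_s * 10 / 1000          # ~21.3 Mb/s to catch up in time

print(round(regular_mbit_s + resync_mbit_s, 1))           # ~31.3 Mb/s total link requirement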
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
111
This lesson covers remote replication on VNX systems.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
112
VNX Replicator is an IP-based replication solution that produces a read-only, point-in-time copy of a
file system, iSCSI LUN, or VDM. The VNX Replication service periodically updates this copy, making
it consistent with the production object. Replicator uses internal checkpoints to ensure availability
of the most recent point-in-time copy. These internal checkpoints are based on SnapSure
technology.
This read-only replica can be used by a Data Mover in the same VNX cabinet (local and loopback
replication), or a Data Mover at a remote site (remote replication) for content distribution, backup,
and application testing.
Replication is an asynchronous process. The target side may be a certain number of minutes out of
sync with the source side. The default is 10 minutes. When a replication session is first started, a
full backup is performed. After initial synchronization, Replicator only sends changed data over IP.
In the event that the primary site becomes unavailable for processing, VNX Replicator enables you
to failover to the remote site for production. When the primary site becomes available, you can
use VNX Replicator to synchronize the primary site with the remote site, and then failback to the
primary site for production. You can also use the switchover/reverse features to perform
maintenance at the primary site or testing at the remote site.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
113
Replicator can set the amount of data to be sent across the IP network before an acknowledgement
is required from the receiving side. Replicator performs TCP window auto-sizing on a per-session,
per-transfer basis, so it is not necessary to tune these values.
The use of jumbo frames has not been shown to improve source-to-destination transfer rates for
Replicator, so it is not critical to enable jumbo frames on the entire replication data path. When
transferring data over a WAN, there is a high risk of dropped packets due to the latency and
instability of the medium. Using regular frames will allow Replicator to resend packets faster and
more efficiently.
Depending on the nature of the data being transferred, external network compression devices
should be able to decrease the packet sizes and improve replication transfer rates for environments
with limited WAN bandwidth.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
114
The bandwidth schedule controls throttle bandwidth by specifying bandwidth limits on specific
periods of time. A bandwidth schedule allocates the interconnect bandwidth used on the source
and destination sites for specific days and times, instead of using all available bandwidth at all
times for the replication. For example, during work hours 40% of the bandwidth can be allocated to
Replicator and then changed to 100% during off hours. Each side of a Data Mover interconnect
should have the same bandwidth schedule for all replication sessions using that interconnect. By
default, an interconnect provides all available bandwidth at all times for the interconnect.
Displayed here is a bandwidth schedule of 10 MB/s from 7 in the morning to 6 at night for all
weekdays. On the weekends, Replicator will use all available bandwidth.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
115
When possible, dedicate separate spindles for source, destination, and SavVol. The pattern of I/O
for each of these is distinct, and mixing them on the same physical spindles can lead to contention.
Also, a given set of disks should perform most efficiently under a consistent workload.
In addition, Replicator (V2) has several patterns of linked or concurrent I/O. For instance, because
there is a checkpoint of the PFS, new writes to the PFS will result in additional reads from the PFS
and subsequent writes to the SavVol. If the PFS and SavVol share the same spindles, there will be
additional latency for these operations as the disk head positions for the PFS read, and then seeks
to position for the SavVol write. Likewise, during the transfer of the delta set, data is read from the
source object (primarily) and written immediately to the destination object. If source and
destination share spindles, the seek latency will again be introduced.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
116
NAS Engineering has done some Replicator (V2) performance testing. The results will be shown on
the following slides.
The test environment consists of CIFS clients accessing (2) NS80s with CX3-80 back-ends and (2) NS-960s with CX4-960 back-ends. All file systems being replicated are 20 GiB in size and created on
450 GB 15K RPM FC drives. The network used in the testing is a GigE LAN.
We will be looking at a remote replication. The first result will compare the NS80 and NS-960 full
copy transfer rates. The data reported in this test represents a maximum; the systems tested in this
characterization were optimized to provide greatest possible transfer rates.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
117
The session counts in these results are the number of sessions that are actively transferring data,
not necessarily the number of configured sessions.
On a per-session basis, the NS-960 provides twice the Replicator performance of the NS80.
Overall, the NS-960 offers all-around improved performance over the NS80, and as you can see,
this improvement also applies to Replicator (V2).
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
118
NL-SAS drives are useful because they maximize available storage while minimizing cost. However,
their performance characteristics must be considered before determining that they are appropriate
for a specific purpose. Generally, because Replicator SavVol activity is highly sequential, you can use
NL-SAS disks for the SavVol without performance impact, especially if the SavVol is built on spindles
that are not shared with data file systems.
It may also be tempting to use NL-SAS drives for the replication destination for cost efficiency.
NL-SAS drives should only be considered appropriate for use by the destination object under one of
the following scenarios:
• If it is determined that NL-SAS drives will be able to meet the demands of the anticipated
workload on the source object, and that they will further be able to meet the additional I/O
requirements of replication, then using NL-SAS for the destination should also provide
adequate performance.
• NL-SAS drives provide poor random write performance when compared to SAS drives, but
their sequential write performance is nearly equal to SAS. If it is determined that the write
workload to the source system will be highly sequential, then it is likely that NL-SAS drives
would provide the appropriate performance to receive the Replicator updates.
NL-SAS drives should typically not be used for the destination object under any of the
following scenarios:
• If other objects receiving client traffic will be sharing physical spindles with the destination
object, the additional workloads could interact with the replication workload to create a
more random and intensive workload than either separately.
• If the destination object will serve as the source for a cascaded replication, then the
additional workload should be factored in.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
119
Replicator has policies to control how often the destination object is refreshed by using the
max_time_out_of_sync (SLA) setting, which is configured on a per-session basis. This value
determines the longest amount of time that will elapse between Replicator transfers. The default
SLA value is 10 minutes.
The SLA can be set between 1 and 1440 minutes. This value should align with the customer’s
requested recovery point objective.
Use of a larger SLA can be valuable if any of the following are true:
• If the source object is prone to bursts of write activity combined with longer periods of
relative inactivity, use of a small SLA will cause Replicator to add additional I/O to the
spindles when they are already at their busiest (during the burst). However, if the SLA is set
higher, then it is more likely that the source disks will have a chance to absorb the burst of
activity and return to a quieter level of activity before the transfer of the changes.
• Write activity to the source object tends to overwrite the same block of data multiple times.
With a one-minute SLA, data that is written to the source object is marked for transfer
almost immediately. For most workloads in this scenario, there is very little chance that a
subsequent write intended for the same block will occur before the next transfer is initiated.
As the value of the SLA increases, there is a greater chance for these re-hits, which decrease
the amount of data that needs to be transferred.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
120
This lesson covers remote replication on VNX systems.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
121
The VNX series offers all the replication methodologies needed to keep data available and secure.
Integrated with Unisphere, RecoverPoint/SE offers several data protection strategies.
RecoverPoint/SE Continuous data protection (CDP) enables local protection for block
environments.
RecoverPoint/SE Concurrent local and remote (CLR) replication enables concurrent local and
remote replication for block environments.
RecoverPoint/SE Continuous remote replication (CRR) enables block protection as well as a file
Cabinet DR solution. This enables failover and failback of both block and file data from one VNX to
another. Data can be synchronously or asynchronously replicated. During file failover, one or more
stand-by Data Movers at the DR site come online and take over for the primary location. After the
primary site has been brought online, failback would allow the primary site to resume operations
as normal. Configuration can be active/passive as shown here or active/active where each array is
the DR site for the other. Deduplication and Compression allow for efficient network utilization,
reducing WAN bandwidth by up to 90%, enabling very large amounts of data to be protected
without requiring large WAN bandwidth. RecoverPoint/SE is also integrated with both vCenter and
Site Recovery Manager (SRM). SRM integration is block only.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
122
There are four major components to a RecoverPoint installation.
• RecoverPoint Appliances (RPA) – These appliances are Linux based boxes that accept the “split” data
and route the data to the appropriate destination volume, either via IP or Fibre Channel. The RPA
also acts as the sole management interface to the RecoverPoint installation.
• RecoverPoint Journal Volumes – Journal volumes are dedicated LUNs on both Production and Target
sides that are used to stage small aperture, incremental snapshots of the host data. As the
personality of production and target can change during failover and failback scenarios, Journal
volumes are required on all sides of Replication (production, CDP and CRR).
• Splitter – RecoverPoint splitter driver is a use-specific, small footprint software that enables
continuous data protection (CDP) and continuous remote replication (CRR). The splitter driver can
be loaded on a host, or on a VNX/CLARiiON array.
• Remote Replication – Two RecoverPoint Appliance (RPA) clusters can be connected via TCP/IP or FC
in order to perform replication to a remote location. RPA clusters connected via TCP/IP for remote
communication will transfer "split" data via IP to the remote cluster. The target cluster's distance
from the source is only limited by the physical limitations of TCP/IP. RPA clusters can also be
connected remotely via Fibre Channel. They can reside on the same fabric or on different fabrics, as
long as the two clusters can be zoned together. The target cluster's distance from the source is only
limited by the physical limitations of FC. RPA clusters can support distance extension hardware (i.e.,
DWDM) to extend the distance between clusters.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
123
A fully redundant and high fidelity network with minimal packet loss would greatly improve the
replication performance.
Gateway Load Balancing Protocol (GLBP) or Virtual Router Redundancy Protocol (VRRP) can be
used to configure redundant paths in a WAN environment.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
124
The RecoverPoint WAN network, in the case of remote replication, must be well engineered, with no
packet loss or duplication, as these lead to undesirable retransmissions. While planning the network,
care must be taken to ensure that the average utilized throughput does not exceed the available
bandwidth. Oversubscribing available bandwidth leads to network congestion, which causes
dropped packets and leads to TCP slow start. Network congestion must be considered between
switches as well as between the switch and the end device.
To determine the bandwidth required to meet end-user requirements and user RPO requirements,
the I/O fluctuations should be understood and taken into consideration. In order to size the WAN
pipe, the relevant data is:
• Average incoming I/O for a representative window in MB/s (24 hrs/7 days/30 days)
• Compression level achievable on the data (This is often hard to obtain and depends on the
compressibility of the data. The rule is 2x to 6x.)
Best practice is to dedicate a segment or pipe for the replication traffic or implement an external
QOS system to ensure bandwidth allocated to replication is available to meet the required recover
point objectives (RPO).
From these numbers, compute the minimal bandwidth requirement of the environment by dividing
the average incoming data rate by the estimated compression ratio. It should be noted that allocating
this bandwidth for replication does not provide any guarantee on RPO or the frequency of high loads
because the I/O rate can fluctuate throughout the representative window.
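A sketch of that calculation in Python; the 15 MB/s average and the 3x compression factor are
hypothetical inputs, with the compression factor chosen from the 2x to 6x range quoted above.

# Minimal WAN bandwidth estimate for RecoverPoint remote replication
avg_incoming_mb_s = 15            # hypothetical average incoming write rate over the window
compression_ratio = 3             # hypothetical, within the 2x to 6x rule of thumb

min_wan_mb_s = avg_incoming_mb_s / compression_ratio
min_wan_mbit_s = min_wan_mb_s * 8
print(min_wan_mb_s, min_wan_mbit_s)   # 5 MB/s, i.e. about 40 Mb/s of dedicated WAN bandwidth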
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
125
Deduplication is a consistency group policy that eliminates the transfer of repetitive data to a
remote site, saving bandwidth. When deduplication is enabled for a consistency group, every
new block of data is stored in both the local and the remote RPAs. Each block that arrives at
the local RPA is analyzed, and whenever duplicate information is detected, a request is sent
to the remote RPA to deliver the data directly to the remote storage, instead of sending the
data to the remote site.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
126
Compression is CPU intensive; therefore, for best performance of RecoverPoint, it is important to
configure compression levels correctly. Setting a strong compression causes CPU congestion and
high loads because of the inability of the RPA to compress all data.
Similarly, when the WAN setting is limited to low compression, it causes high loads because the RPA
will try to transmit more data. The general rules for compression are as follows (all rates in MiB/s):
• When used with low bandwidth (2 MiB/s) or with low upper bound on the write-rate, (in
scale of 5 MiB/s) use the maximal compression level. Note that this is only a general rule,
and when data is very compressible, we can handle 10 MiB of data with highest
compression.
• When used with higher bandwidth (5 MiB/s) or with upper bound of about 10-15 MiB/s on
the write rate, use one of the mid-range compression levels.
• In any other case, use the minimal compression level.
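The selection rules above can be captured in a small helper function; this is only a sketch of the
guidance on this slide, not a RecoverPoint API, and the thresholds simply restate the bullets above.

# Rough compression-level picker following the slide's guidance (all rates in MiB/s)
def pick_compression_level(wan_bandwidth_mib_s, peak_write_rate_mib_s):
    if wan_bandwidth_mib_s <= 2 or peak_write_rate_mib_s <= 5:
        return "maximum"        # low bandwidth or low write rate: compress hard
    if wan_bandwidth_mib_s <= 5 or peak_write_rate_mib_s <= 15:
        return "medium"         # moderate bandwidth or 10-15 MiB/s writes
    return "minimum"            # otherwise keep the RPA CPU load low

print(pick_compression_level(2, 4))     # maximum
print(pick_compression_level(5, 12))    # medium
print(pick_compression_level(20, 40))   # minimum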
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
127
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
128
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
129
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
130
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
131
Test the knowledge acquired through this training by answering the questions in this slide.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
132
Displayed here are the answers from the previous slide. Please take a moment to review them.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
133
Test the knowledge acquired through this training by answering the questions in this slide.
Continue to the next page for the answer key.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
134
Displayed here are the answers from the previous slide. Please take a moment to review them.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
135
This module covered BC Design on VNX systems.
Copyright © 2012 EMC Corporation. All rights reserved
Module 5: BC Design Best Practices
136
This module focuses on a case study and design exercises.
Copyright © 2012 EMC Corporation. All rights reserved
Module 6: Case Studies
1
Copyright © 2012 EMC Corporation. All rights reserved
Module 6: Case Studies
2
This module covered a case study and design exercises.
Copyright © 2012 EMC Corporation. All rights reserved
Module 6: Case Studies
3
This course covered gathering relevant information, analyzing it, and using the result of that
analysis to design a VNX solution.
Copyright © 2012 EMC Corporation. All rights reserved
Course Introduction
4
This concludes the training. Thank you for your participation.
Please remember to complete the course evaluation available from your instructor.
Copyright © 2012 EMC Corporation. All rights reserved