Archive Solutions for the Enterprise with EMC Isilon Scale

ARCHIVE SOLUTIONS WITH EMC ISILON
SCALE-OUT NAS
ABSTRACT
This white paper outlines the principles and concepts involved in implementing an
enterprise-class archive repository using EMC Isilon scale-out storage. It also
describes key Isilon technology and features used to implement an efficient and
secure archive solution.
October 2015
WHITE PAPER
To learn more about how EMC products, services, and solutions can help solve your business and IT challenges, contact your local
representative or authorized reseller, visit www.emc.com, or explore and compare products in the EMC Store
Copyright © 2014 EMC Corporation. All Rights Reserved.
EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without
notice.
The information in this publication is provided “as is.” EMC Corporation makes no representations or warranties of any kind with
respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a
particular purpose.
Use, copying, and distribution of any EMC software described in this publication requires an applicable software license.
For the most up-to-date listing of EMC product names, see EMC Corporation Trademarks on EMC.com.
EMC², EMC, the EMC logo, Isilon, OneFS, FlexProtect, SmartConnect, SmartLock, SmartPools, SnapshotIQ, and SyncIQ are registered
trademarks or trademarks of EMC Corporation in the United States and other countries. All other trademarks used herein are the
property of their respective owners.
Part Number H11224.1
TABLE OF CONTENTS
1.0 EXECUTIVE SUMMARY
1.1 About this white paper
1.2 Assumptions
1.3 Industry trends
2.0 ARCHIVE VS. BACKUP
2.1 Archiving objectives
2.2 Archive technologies
3.0 EMC ISILON ARCHIVE SOLUTIONS ARCHITECTURE
3.1 Hardware
3.2 Software
3.3 Archiving with Automated Storage Tiering
3.4 Archive server load balancing with Isilon SmartConnect
3.5 High availability, data protection and security
4.0 ARCHIVE SIZING
4.1 What to understand about sizing an archive
4.2 Sizing examples
4.3 Scaling the estimate for future growth and planning
4.4 Other sizing considerations
4.4.1 Information on the maximum number of connections
4.4.2 Maximum number of total files in the file system
4.4.3 Isilon SyncIQ for archive data replication
5.0 CONCLUSION
TAKE THE NEXT STEP
6.0 APPENDIX
6.1 Archive application file system ACL requirements
6.2 References
About EMC
1.0 EXECUTIVE SUMMARY
As data continues to grow, simply adding more capacity to primary storage no longer makes sense. It’s expensive, inefficient, and
doesn’t meet the growing need to correctly classify, manage, and protect data in a way that allows quick, efficient access. As a
result, archive systems are an increasingly important business requirement in modern enterprise environments.
Archiving applications can efficiently manage growing data stores to contain storage costs and satisfy both compliance and e-discovery requirements. EMC® Isilon® scale-out network-attached storage (NAS) provides a disk-based archive that scales from terabytes (TB) to petabytes (PB) of capacity.
With Isilon, archival information benefits from massive scalability and the robustness of the EMC Isilon OneFS® operating system,
with proven capabilities for protecting data and optimizing the flow of information within an organization.
This white paper outlines the principles and concepts for deploying an Isilon storage cluster as an enterprise-class archive repository.
It includes architectural explanations of the technologies and features associated with providing NAS-based file services, as well as
technical recommendations for sizing Isilon storage clusters based on appropriate cluster limits for archive applications.
1.1 ABOUT THIS WHITE PAPER
This guide provides technical considerations when planning archive systems based on Isilon storage in an enterprise environment. It
assumes the use of third-party archive software.
1.2 ASSUMPTIONS
This guide assumes the reader has an understanding and working knowledge of
the following:
• NAS storage protocols
• EMC Isilon scale-out storage architecture and the OneFS operating system
• File-system management concepts and practices
While this guide is intended to provide a consolidated reference point for systems administrators and managers looking to deploy
archive software on an Isilon storage cluster, it is not intended to be the authoritative source of information on the technologies and
features used to provide and support a file-services platform.
1.3 INDUSTRY TRENDS
Data archiving is the process of identifying and moving inactive data out of current production (primary storage) systems and into
long-term archival storage systems. Implementing and maintaining an efficient and secure data archive is increasingly important for
organizations across a wide range of industries and public sector agencies.
The rapid growth of unstructured data that most enterprises are experiencing is straining primary storage resources and pressuring
IT budgets. Moving inactive data out of primary storage optimizes the performance of primary storage resources. Archival
systems store that information much more cost-effectively while still keeping the data online, which avoids the retrieval cost
associated with tape media.
Many enterprises are subject to varying regulatory requirements that mandate long-term data retention. Some organizations are
also looking to preserve and protect content for historical purposes. Archive retention periods of 10 to 15 years are not uncommon for
many businesses, while some industries are mandated to retain specific data upwards of 100 years.
In certain cases, archive data is purposely not deduplicated, compressed, or changed in any manner to effectively meet stringent
compliance requirements or to avoid potential litigation issues. Archiving can help e-discovery solutions and is an important way to
comply with regulatory and corporate governance requirements.
2.0 ARCHIVE VS. BACKUP
The term “data archiving” is sometimes confused with or used interchangeably with “data backup.” This is often the result of
assuming that a backup can be a substitute for an archive if needed. In reality, these two data retention strategies have distinct data
protection objectives.
Backups, whether to disk or tape, typically have relatively short lifecycles—measured in days, weeks, or months between full copies.
They are primarily used to restore data that may have been lost, corrupted, or destroyed. A data backup must usually be
reconstituted from a proprietary backup format to other media at a different location before it can be used.
Unlike a backup that is used to restore recently lost or deleted data, archiving is a systematic approach to providing structure to
unstructured data, usually laying it out in a predictable directory format, based on repeatable application algorithms. It enables the
storing, management, retrieval, and eventual deletion of data as appropriate throughout its entire lifecycle. Table 1 summarizes the
differences between archive and backup.
Archive | Backup
A primary copy of information | A secondary copy of information
Used for information retrieval | Used for operational recovery
Improves operational efficiency by removing fixed content and duplicate data from the operational environment | Improves availability by enabling applications to be restored to specific points in time
Typically long term (years, decades, or forever) | Typically short term (weeks or months)
Data typically maintained for analysis or compliance as a managed repository | Data typically overwritten on a periodic basis (daily, weekly, monthly, etc.) and stored in an altered state which needs to be restored
Useful for compliance: enforces retention policies by typically storing data in its native form | Does not meet the needs of regulatory compliance, though some are forced to use it this way
Table 1. Archive and Backup Comparison
2.1 ARCHIVING OBJECTIVES
Many enterprises are seeing 50% growth in data each year, but little or no growth in storage hardware budgets. Not all data is
of equal value, and its value can change over time. Many organizations find that their managed data is too performance-sensitive to move to
tape storage, but not time-critical enough to justify the cost of keeping it on high-performance storage. As an organization’s data
footprint grows, managed data becomes more heterogeneous, with different data subsets having divergent performance and
protection requirements.
Data archiving objectives will vary by organization but will typically include one or more of the following:
• Reclaim expensive existing primary storage
• Reduce on-going primary storage acquisition costs
• Move static data out of the recurring backup process
• Secure and protect data for long-term retention
• Satisfy applicable regulatory and governance requirements
• Provide reliable availability of data when needed
Data archive solutions can be used to manage growing data stores efficiently, optimize the use of primary storage resources, protect
data for long-term retention, and greatly reduce overall storage costs.
2.2 ARCHIVE TECHNOLOGIES
An effective archiving solution is typically comprised of a mix of software that will likely include:
• Archiving software that automates the movement of data from primary storage to archival storage based on policies
established in the data classification and rationalization process. Data classification and rationalization (policy tuning) are
functions of the organization’s data mining requirements and legal compliance needs. Archive software can delete files at the
end of their retention period. It may also include features such as data compression or single-instance storage to maximize disk
utilization. Archiving application software generally utilizes NFS or SMB protocols to move content from primary storage to an
archive repository.
• E-discovery software, using archiving software as a base, employs advanced search features that enable users and
administrators to quickly search all files, emails, texts, and other data related to a specific topic, for use in data mining services
or in response to legal inquiries. Some applications combine e-discovery and archive into purpose-built platforms such as an
email archive solution or document and records management solutions.
Data retention policies can be established and enforced through archiving and e-discovery software by ensuring that all files of
significance are flagged appropriately to prevent them from being altered or deleted.
3.0 EMC ISILON ARCHIVE SOLUTIONS ARCHITECTURE
EMC Isilon archive solutions combine Isilon scale-out NAS storage platforms, Isilon software and, in some cases, other EMC and
third-party software.
Figure 1. Isilon Archive Solutions Architecture
Each element of the Isilon archive solution is described in the following sections.
3.1 HARDWARE
EMC Isilon offers four tiers of hardware storage platforms that can be combined to deliver the right blend of performance and capacity
for a wide range of workloads, including file shares, home directories, archiving, and Big Data analytics.
For primary storage workloads and applications, Isilon offers:
• EMC Isilon S-Series: High performance NAS storage platform built for high transactional and IOPS-intensive applications. The
Isilon S-Series can scale performance to more than 3.75 million IOPS in a single cluster.
• EMC Isilon X-Series: Highly versatile NAS storage platform that strikes a balance between high-performance and large
capacity storage to support a wide range of workloads. Isilon X-Series solutions scale from 18 TB to over 20 PB in a single
namespace, with an aggregate throughput of up to 200 GB per second.
For nearline storage, active archiving, and high density, deep archiving, Isilon offers:
• EMC Isilon NL-Series: Designed for highly efficient, large-capacity storage and is ideal for nearline storage and active
archiving workloads. The Isilon NL-Series product family includes the Isilon NL410, a 4U platform that offers a flexible mix of
configuration options and has a capacity that ranges from 36 TB to 210 TB per node.
• EMC Isilon HD-Series: High density storage platform that is an ideal choice for deep archive and archive use cases with a
capacity requirement of over 2.5 PB. The Isilon HD-Series product family includes the Isilon HD400, a 4U platform with a node
capacity of 354 TB. The Isilon HD400 can store up to 3.2 PB in a single rack and reduce data center operating costs, including
power, cooling and floor space, by up to 50%.
Table 2 provides a summary of design elements and considerations to determine the relative suitability of Isilon NL-Series and Isilon
HD-Series nodes.
Design Element | Isilon NL-Series more suitable | Isilon HD-Series more suitable
Type of archive | Active archive - data at rest but accessed periodically | Active archive or deep archive - long-term data retention with minimal access
Archive size | < 2.5 PB | > 2.5 PB
Rack considerations | If normal depth racks preferred over deep racks | If deep racks preferred over normal racks
Weight consideration | When weight of the nodes is a concern | When weight of nodes is not a concern
Power consideration | Low line or high line power requirements | Only high line power is acceptable
Table 2. Isilon NL-Series and Isilon HD-Series Comparison
Isilon NL-Series and Isilon HD-Series nodes can be combined into a single cluster that includes Isilon primary storage platforms
(Isilon S-Series and Isilon X-Series nodes). In this way, primary storage workloads and archive workloads can be supported in a
single Isilon cluster. This provides an opportunity for implementation of a policy-based, automated storage tiering solution that
automatically moves data to the appropriate storage tier and helps to optimize storage resources and lower costs.
3.2 SOFTWARE
All nodes in an Isilon cluster are powered by the EMC Isilon OneFS® operating system. OneFS combines the three layers of
traditional storage architectures—file system, volume manager, and data protection—into one unified software layer, creating a
single intelligent file system that spans all nodes within the cluster. This allows the Isilon cluster to facilitate the archive process
automatically through policy-based file placement on specified storage tiers. With its modular, single file system architecture, OneFS
enables Isilon storage systems to provide a massively scalable platform that is highly efficient and simple to manage.
Isilon storage solutions can scale from as small as 18 TB to over 50 PB in a single file system. To increase capacity, new nodes can
be added to an existing Isilon cluster within a minute and without disruption. With the Isilon AutoBalance™ feature of OneFS,
there is no need to move data manually – it is done automatically and transparently to system users.
The Isilon OneFS operating system enables:
• Independent, linear scalability of performance and capacity
• A single point of management for large and rapidly growing data repositories
• High reliability and high availability with state-of-the-art data protection.
The OneFS operating system manages aggregation of multiple Isilon node types into one namespace and the combination of all
available resources from every node in the cluster. An internal, InfiniBand-based network functions as a virtual backplane between
the nodes and enables any node in the cluster to service data for any client in the cluster (i.e., any other node), regardless of where
the data actually resides. A cluster may comprise different types of Isilon nodes (including S-Series, X-Series, NL-Series, and HD-Series), with each type of node suited to a particular workflow participating in different activities within the same, unified OneFS file
system. This mixed node approach is often the preferred choice for environments in which different workloads will coexist on the
same storage array. It also enables enterprises to optimize storage resources by implementing a policy-based, automated tiered
storage strategy using EMC Isilon SmartPools™ software.
OneFS also provides symmetric multiprocessing (SMP) capabilities that enable the system to move tasks between processors, which
results in extremely efficient workload balancing. The combination of these features also results in very high overall bandwidth and
performance capabilities for the cluster—an important capability when the Isilon cluster is supporting primary and archive workloads
simultaneously.
EMC Isilon software used for management and data protection purposes is summarized in Table 3.
DATA PROTECTION SOFTWARE
EMC ISILON SNAPSHOTIQ™ | Protect data efficiently and reliably with efficient snapshots, while incurring little to no performance overhead. Fast data recovery with near-immediate, on-demand snapshot restores.
EMC ISILON SYNCIQ® | Replicate and distribute large mission-critical data sets to multiple shared storage systems in multiple sites for disaster recovery protection. Simple failover and failback to increase data availability.
EMC ISILON SMARTLOCK® | Protect data against accidental, premature, or malicious alteration or deletion with Isilon’s software-based approach to WORM. Also helps meet stringent compliance and governance needs, such as SEC 17a-4 requirements.
MANAGEMENT SOFTWARE
EMC ISILON SMARTPOOLS® | Implement a highly efficient, automated tiered storage strategy to optimize storage performance and efficiency.
EMC ISILON SMARTCONNECT™ | Enable client connection load balancing and the dynamic NFS failover and failback of client connections across storage nodes to optimize the use of cluster resources.
EMC ISILON SMARTDEDUPE™ | Increase efficiency and reduce storage capacity requirements by up to 35 percent with deduplication of redundant data across multiple sources.
EMC ISILON INSIGHTIQ™ | Performance monitoring and reporting tools to maximize the performance of your Isilon cluster.
Table 3. Isilon Data Protection and Management Software
3.3 ARCHIVING WITH AUTOMATED STORAGE TIERING
Many enterprise environments have workloads or applications that require the use of active data while archiving older data. A
mechanism is required to migrate stale primary data to archive storage, which should happen without any disruption to users and
applications.
Isilon SmartPools software simplifies management and lowers storage costs with a transparent, policy-based, automated tiering
approach. It lets organizations optimize storage resources and automatically move older, unused data to economical archive storage.
SmartPools is integrated with the Isilon OneFS operating system to allow a single point of management, with a single scalable file
system that offers multiple tiers of performance, depending on the data.
With SmartPools, storage administrators can automatically match storage resources with specific data and application requirements.
It also simplifies management by eliminating the need for manual data migrations. SmartPools moves data among tiers based on the
enterprise requirements without sacrificing data protection, application performance, or uptime. Administrators can also use defined
policies to move data based on age, type, owner, location, or other criteria, from one tier to another.
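The idea of policy-based placement can be illustrated with a minimal Python sketch. This is not the OneFS or SmartPools interface: the tier names, the file paths, the 180-day threshold, the `.log` rule, and the `choose_tier` helper are all hypothetical and only show how criteria such as age and file type could map a file to a tier.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FileInfo:
    path: str
    last_modified: datetime
    owner: str


def choose_tier(info: FileInfo, now: datetime, archive_after_days: int = 180) -> str:
    """Pick a tier name for a file from its type and age (hypothetical rules)."""
    # Rule 1 (assumed): log files go to the archive tier regardless of age.
    if info.path.endswith(".log"):
        return "archive"
    # Rule 2 (assumed): anything not modified within the threshold is archived.
    if now - info.last_modified > timedelta(days=archive_after_days):
        return "archive"
    # Everything else stays on the primary (performance) tier.
    return "performance"


now = datetime(2015, 10, 1)
old_report = FileInfo("/ifs/data/report.doc", datetime(2014, 1, 15), "alice")
new_report = FileInfo("/ifs/data/draft.doc", datetime(2015, 9, 20), "bob")
print(choose_tier(old_report, now))  # archive
print(choose_tier(new_report, now))  # performance
```

In a real cluster the equivalent rules would be defined as policies by the administrator and applied by the storage system itself, so applications keep using the same paths after data moves.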
SmartPools is tightly integrated with Isilon OneFS, so all data, regardless of physical location, are in the same single file system. This
means that SmartPools data movements are completely transparent to the end user application, removing management, backup, and
other issues related to stub-based tiering architectures such as those present in hierarchical storage management (HSM)
implementations.
With Isilon SmartPools, data protection levels can be set per storage tier, per directory, or even per file. Regardless of which
type of EMC Isilon storage nodes are used in the cluster, SmartPools can be used to control the data protection level. SmartPools also
provides an option to configure performance profiles, so different types of data are actually laid out on disk differently to optimize
performance for different types of workloads.
Figure 2 illustrates a mixed-node, single file system deployment in an Isilon cluster for online and archive data using SmartPools.
Figure 2. Isilon cluster with SmartPools for automated tiered storage
In summary, tiering provides the following advantages:
• Simplifies management with automatic, policy-based data movement within a single namespace, single file system without
complex links, stubs, or manual data migration.
• Enables storage consolidation by allowing the support of multiple applications and workloads with varying performance
requirements on a single storage system.
• Optimizes storage resources by automatically aligning them with application needs.
• Adapts seamlessly to workflow changes.
• Provides workflow isolation.
3.4 ARCHIVE SERVER LOAD BALANCING WITH ISILON SMARTCONNECT
Enterprise archive environments will typically engage several if not dozens of archive servers. These servers are often set to move
data to the archive vault based on certain file criteria, usually last modified time. While the working sets that the archive software or
policies monitor are not likely to all coalesce simultaneously, it is possible. It is important to make sure you take this into
consideration based on the number of archive servers that will be deployed. EMC Isilon SmartConnect™ software optimizes network
throughput by enabling intelligent client-connection load balancing and is used to automatically manage Isilon cluster access.
SmartConnect manages client connection load balancing through a single host name to the cluster nodes. This provides optimal
utilization of the cluster’s available network interfaces and network system resources. By leveraging an organization’s existing DNS
infrastructure, SmartConnect provides universal compatibility with all client types, eliminating the need for complicated connection
management on the client side.
To a client system, the cluster appears as a single network element. SmartConnect automatically balances incoming client
connections across all available interfaces on the Isilon storage cluster. This improves performance on the cluster by distributing the
workload evenly across multiple network paths and multiple nodes.
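The effect of spreading incoming connections across nodes can be sketched in a few lines of Python. The sketch below assumes a simple round-robin rotation purely for illustration; it does not model how SmartConnect answers DNS queries, and the server and node names and the `assign_connections` helper are hypothetical.

```python
from collections import Counter
from itertools import cycle


def assign_connections(clients, node_interfaces):
    """Hand each client the next interface in a fixed rotation (round robin)."""
    if not node_interfaces:
        raise ValueError("at least one node interface is required")
    rotation = cycle(node_interfaces)
    return {client: next(rotation) for client in clients}


# Six hypothetical archive servers connecting to a three-node cluster.
clients = [f"archive-server-{i}" for i in range(1, 7)]
interfaces = ["node1", "node2", "node3"]

assignment = assign_connections(clients, interfaces)
load = Counter(assignment.values())
print(load)  # each node ends up with two connections
```

With an even rotation, no single node interface carries more than its share of archive server connections, which is the property the sizing sections rely on.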
For an EMC Isilon storage cluster that hosts multiple concurrent workloads in addition to the organization’s archive, SmartConnect
gives administrators the ability to partition workloads, by type, across the available node interfaces in a cluster. By maintaining
multiple SmartConnect pools and minimizing the number of pools that overlap on a particular node interface, administrators can
maintain sufficient network bandwidth for critical workloads on dedicated interface connections.
3.5 HIGH AVAILABILITY, DATA PROTECTION AND SECURITY
The Isilon OneFS operating system provides scale-out data protection through its EMC Isilon FlexProtect™ feature.
FlexProtect utilizes advanced technology to provide redundancy and availability capabilities far beyond those of
traditional RAID. FlexProtect uses Forward Error Correction (FEC) to create an n-way, redundant fabric that scales as nodes are
added to the cluster, providing 100 percent data availability even with up to four simultaneous node failures. This goes far beyond
the maximum level of RAID commonly in use today, which is the double-failure protection of RAID 6. Additional details on this are
provided in the High Availability & Data Protection with EMC Isilon Scale-Out NAS white paper.
For data backup and disaster recovery protection, you can easily copy and replicate data to remote sites. EMC Isilon SnapshotIQ
software enables fast and efficient data backup and allows you to take point-in-time copies of data with a choice of snapshot
intervals and RPO time options. For multi-site disaster recovery protection, EMC Isilon SyncIQ can be used to replicate data to your
choice of local and remote sites. SyncIQ supports both LAN and WAN networks to replicate over short or long distances, providing
protection from both site-specific and regional disasters.
Security and compliance options from Isilon include:
• Role-based access control (RBAC) and secure access zones to limit data access.
• Immutable storage through the write once, read many (WORM) locking capability of Isilon SmartLock software. In
this way, SmartLock protects your archived data against accidental, premature, or malicious alteration or deletion.
• Data-at-rest encryption options from Isilon storage platforms that include self-encrypting drives to prevent data theft.
• Integrated file system auditing to identify unauthorized data access attempts.
• Security Technical Implementation Guide (STIG) hardening, CAC/PIV Smartcard authentication, and FIPS OpenSSL
support.
Together, these capabilities provide a comprehensive data protection and security solution for your data archive.
4.0 ARCHIVE SIZING
There are many factors to consider in sizing a cluster appropriately for an archive workload. Archive environments will often have
many archive servers moving files based on time criteria to the Isilon cluster. Having many archive servers access the cluster is a
perfect use case for SmartConnect. SmartConnect will load balance the servers across all nodes of the cluster (while accessing the
same file system for continuity). It is essential for sizing exercises that this practice is implemented so that no single-node bottleneck
is reached artificially.
The potential to have many servers moving data simultaneously to the cluster can cause “cliff” events where the application can run
into a requirement to move thousands of files all at once. Regardless of your deployment scenario, you must use peak values for
your sizing estimate. If, for example, you have 10 archive servers but they are on different vault schedules, you must plan for the
worst-case scenario: that all are undergoing a vault operation simultaneously. That will be the assumption in the following sections.
This type of event requires consideration about the number of connections that can be handled, file system IO, bandwidth, etc. The
following sizing section will address these considerations with Isilon clusters as well as provide general guidelines for elements such
as maximum file counts and replication of the archive, among others.
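The worst-case rule described above amounts to summing the peak values of every archive server as if all of them were vaulting at once. The following Python sketch shows that aggregation; the per-server rates, connection counts, and the 128 KB average file size are illustrative assumptions, and `ArchiveServer` and `worst_case_load` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ArchiveServer:
    name: str
    peak_files_per_sec: float  # peak vaulting rate of this server (assumed)
    connections: int           # connections this server opens to the cluster (assumed)


def worst_case_load(servers, avg_file_kb):
    """Aggregate peak load assuming every server vaults at the same time."""
    files_per_sec = sum(s.peak_files_per_sec for s in servers)
    return {
        "files_per_sec": files_per_sec,
        "throughput_kb_per_sec": files_per_sec * avg_file_kb,
        "connections": sum(s.connections for s in servers),
    }


# Ten hypothetical servers, each peaking at 100 files/s over 4 connections.
servers = [ArchiveServer(f"vault-{i}", 100.0, 4) for i in range(10)]
peak = worst_case_load(servers, avg_file_kb=128)
print(peak)
```

Sizing against this sum rather than against a typical day is what protects the cluster from the "cliff" events mentioned earlier.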
4.1 WHAT TO UNDERSTAND ABOUT SIZING AN ARCHIVE
There are three base scenarios to consider when sizing an archive:
1. You have an existing Isilon cluster and wish to add an archive tier to it through expansion via node addition.
2. You will be deploying a new Isilon scale-out cluster for archive purposes only, and using third-party archiving software such as
Symantec Enterprise Vault or EMC SourceOne for File Systems.
3. You will be deploying a new Isilon scale-out cluster that will be used for both working and archive data sets.
For scenarios one and three above, SmartPools should be implemented. For each of these deployment models, consider the archive
tier or the archive target cluster to be composed of either NL-Series nodes or HD-Series nodes. That’s because:
• Archives are typically composed of capacity-intensive data sets.
• Archives usually have lower performance and throughput requirements than active working data sets (still size for peak vaulting
events).
• Archives generally don’t have a strong metadata caching requirement.
• Archives are backed up on a different schedule than primary data.
Six key metrics are usually gathered for evaluating the number of nodes required to meet a given archive workload:
1. The number of files that will be written to the file system
2. The average file size that will be moved in an archiving operation
3. The number of archive servers
4. The number of connections expected from an archive server
5. The expected capacity of the archive
6. Growth requirements
The combination of I/O and average file size gives a good idea of the aggregate throughput the cluster can expect to receive. This
helps plan SmartConnect zones and which interfaces need to be used. It also helps determine whether to choose nodes with four
1 Gb/s ports or nodes with two 1 Gb/s ports and two 10 Gb/s ports.
Transfers or migrations that occur internally to an Isilon storage cluster use the back-end InfiniBand network, so there is no external
network interface bandwidth consumed. Interface/bandwidth requirements, therefore, are more of a concern for archive target
clusters. For this reason, it is usually best to start with identifying I/O requirements.
The number of archive servers and the expected connection counts help determine whether to consider a quantity of nodes that
might be greater than the archive capacity requirements.
Expected capacity and estimated growth help determine how many nodes and at what density are needed from a raw storage
perspective. All of these factors need to be balanced such that the most stringent requirement is the sizing metric of choice.
Once a profile of performance requirements is available, it becomes a trivial matter to predict future performance values based on
these results because performance grows linearly with node addition, as each Isilon node carries a “unit” of storage/processor/bandwidth.
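One way to express "the most stringent requirement is the sizing metric" is to compute the node count each requirement implies and take the largest. The Python sketch below does this under stated assumptions: the per-node capacity, throughput, and connection figures are placeholders rather than published specifications, capacity is treated as raw (protection overhead is ignored), growth is compounded annually, and `nodes_required` is a hypothetical helper.

```python
import math


def nodes_required(capacity_tb, throughput_mb_s, connections,
                   node_capacity_tb, node_throughput_mb_s, node_connections,
                   annual_growth=0.0, years=0):
    """Return the node count implied by the most demanding requirement."""
    # Project capacity forward with compound annual growth (assumption).
    grown_capacity_tb = capacity_tb * (1 + annual_growth) ** years
    by_capacity = math.ceil(grown_capacity_tb / node_capacity_tb)
    by_throughput = math.ceil(throughput_mb_s / node_throughput_mb_s)
    by_connections = math.ceil(connections / node_connections)
    # Each node adds one "unit" of every resource, so the largest count wins.
    return max(by_capacity, by_throughput, by_connections)


# Placeholder per-node figures: 210 TB, 500 MB/s, 100 connections.
today = nodes_required(200, 1500, 40, 210, 500, 100)
in_three_years = nodes_required(200, 1500, 40, 210, 500, 100,
                                annual_growth=0.5, years=3)
print(today, in_three_years)  # 3 4
```

In this example throughput drives the initial node count, while capacity becomes the limiting factor once three years of 50% growth are included.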
4.2 SIZING EXAMPLES
Using different scenarios based on the metrics mentioned above, we can review a couple of different deployment models to ascertain
how scenarios might play out.
In the first example, we take a look at the number of files, average file size, and how that correlates to bandwidth. Figure 3 depicts a
scenario in which 10,000 files are pushed from three archive servers to a three-node NL410 archive cluster at a rate of 100 files per
second; the average file size for the archive is measured to be 128 KB.
Figure 3. Archive sizing example 1
The bandwidth requirement for a target cluster would be, on average, (100 files/s)*(128 KB/file) = 12,800 KB/s (about 12.5 MB/s). This load is
best balanced over the cluster interfaces, with SmartConnect managing the traffic across the archive target nodes. Three archive servers
are not expected to drain the connection pool for a three-node cluster (the cluster is designed to handle many client connections). An
internal cluster tier migration would not need to take this into consideration because it would be handled by the back-end IB
network, but there might be contention for disk I/O.
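The arithmetic in this first example can be written out as a short sketch. The function name is illustrative, 1 MB is taken as 1,024 KB, and the per-node figure assumes that SmartConnect spreads the load evenly over the three nodes, which is an idealization.

```python
def ingest_bandwidth_kb_s(files_per_second, avg_file_kb):
    """Average ingest bandwidth: file rate multiplied by average file size."""
    return files_per_second * avg_file_kb

total_kb_s = ingest_bandwidth_kb_s(files_per_second=100, avg_file_kb=128)
total_mb_s = total_kb_s / 1024        # assumes 1 MB = 1,024 KB
per_node_kb_s = total_kb_s / 3        # assumes an even split over three nodes
drain_seconds = 10_000 / 100          # time to push all 10,000 files

print(total_kb_s)     # prints: 12800
print(total_mb_s)     # prints: 12.5
print(drain_seconds)  # prints: 100.0
```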
The second example takes a look at capacity, bandwidth based on source file breakdowns, and the relative write capability of the
archive target cluster.
Assumptions:
• 200 TB of archive source data
• File breakdown: 80/20 (80% 10+ MB files and 20% 4 KB files)
• Target cluster is 3 x Isilon NL410 nodes
Figure 4. Archive Sizing Example 2 – 200 TB Distributed Archive Content
Note that 200 TB of archive data is within the capacity of a three-node NL410 cluster. However, the aggregate throughput, based on
some performance numbers for a cluster that size, will probably mean it is undersized for your deployment.
The following values are for demonstration purposes only and do not reflect true performance. Assume a typical mixed workload
profile for the nodes as follows:
• Write 15 MB/s (small files)
• Write 520 MB/s (large files)
The read values are not of great importance here (unless the source is another NL-Series cluster), so the write rates to the target
cluster are the values to use. In this case, the entire archiving event occurs at once. Requirements for writes to the target will, of
course, vary. For example, writing 40 TB of small files at 15 MB/s takes about 31 days. Writing 160 TB of large files to the target at
520 MB/s takes about 3.5 days, for a total transfer time of about 34.5 days. This value reflects the maximum write capacity at the
target and is impractical due to the time involved, the system impact, and, potentially, the network bandwidth requirements. However,
expect the performance of the cluster to grow linearly: adding a fourth node raises aggregate throughput by roughly one third, which
shortens the transfer time by about 25 percent. A fifth node shortens it by a further 20 percent relative to the four-node time.
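A minimal sketch of the transfer-time arithmetic for this second example follows. It assumes decimal units (1 TB = 1,000,000 MB), the demonstration write rates listed above, and strictly linear scaling with node count; the function names are illustrative only.

```python
SECONDS_PER_DAY = 86_400
MB_PER_TB = 1_000_000  # decimal units, an assumption of this sketch

def transfer_days(data_tb, rate_mb_s):
    """Days needed to write data_tb terabytes at a sustained rate_mb_s."""
    return data_tb * MB_PER_TB / rate_mb_s / SECONDS_PER_DAY

def scaled_days(base_days, base_nodes, nodes):
    """Transfer time after scaling, assuming throughput grows linearly with nodes."""
    return base_days * base_nodes / nodes

small_days = transfer_days(40, 15)     # 20% of 200 TB as small files
large_days = transfer_days(160, 520)   # 80% of 200 TB as large files
total_days = small_days + large_days

print(f"small files: {small_days:.1f} days")   # about 30.9
print(f"large files: {large_days:.1f} days")   # about 3.6
print(f"total:       {total_days:.1f} days")   # about 34.4
for nodes in (4, 5):
    print(f"{nodes} nodes: {scaled_days(total_days, 3, nodes):.1f} days")
```

Under these assumptions, the small files dominate the total even though they are only 20 percent of the data, which is why the small-file write rate is the figure to examine first.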
The third example is similar to the second; however, the archive capacity requirement is much higher: in this case, over 2 PB.
Assumptions:
• 2.5 PB of archive source data
• The workload is deep archive (long-term data storage with minimal access)
• Physical space to install the hardware is a concern
• File breakdown: 80/20 (80% 10+ MB files and 20% 4 KB files)
• Target cluster consists of Isilon HD400 nodes
Isilon HD400 nodes are specifically designed for deep archive workloads. They provide massive scalability while lowering
operating costs. Because they hold 60 hard drives in a 4U form factor, they are well suited to sites where physical (datacenter)
space is limited. The deep or cold archive use case in this example can be satisfied using a 12-node HD400 cluster. This would
provide a usable capacity of 3 PB and a storage efficiency of over 83%. A conservative approach of using 85% of the usable
capacity (2.55 PB) will still cover the requirement of 2.5 PB. A comparable usable archive capacity can be achieved using a
20-node NL410 cluster. The relative advantages of the Isilon HD400-based solution in this scenario are shown in Table 4.
Solution Attributes            Isilon HD400 Cluster    Isilon NL410 Cluster
NUMBER OF NODES REQUIRED       12                      20
TOTAL RACK UNITS               48                      80
AGGREGATE POWER MAX (WATTS)    15600 Watts             17000 Watts
AGGREGATE HEAT MAX (BTU)       45000 BTU               58000 BTU
Table 4. Solution Comparison: 3 PB Archive with Isilon HD400 or NL410 Cluster
In this example, the archive solution using Isilon HD400 nodes requires less physical space, less power, and less cooling.
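The capacity check and the Table 4 comparison can be reproduced with a few lines of arithmetic. The sketch below uses only the figures quoted in this example; the helper names are illustrative.

```python
def planned_capacity_pb(usable_pb, planning_fraction):
    """Capacity available when only a fraction of usable space is planned for."""
    return usable_pb * planning_fraction

def percent_saving(baseline, alternative):
    """Percentage by which `alternative` is lower than `baseline`."""
    return (baseline - alternative) / baseline * 100

# 12-node HD400 cluster: 3 PB usable, planned at 85% utilization.
planned = planned_capacity_pb(usable_pb=3.0, planning_fraction=0.85)
meets_requirement = planned >= 2.5   # requirement: 2.5 PB of archive data

# Figures taken from Table 4.
hd400 = {"rack_units": 48, "watts": 15600, "btu": 45000}
nl410 = {"rack_units": 80, "watts": 17000, "btu": 58000}

print(f"planned capacity: {planned:.2f} PB, sufficient: {meets_requirement}")
for key in hd400:
    saving = percent_saving(nl410[key], hd400[key])
    print(f"{key}: HD400 is {saving:.1f}% lower than NL410")
```

On these figures the HD400 option uses 40 percent fewer rack units, roughly 8 percent less maximum power, and roughly 22 percent less maximum heat output.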
4.3 SCALING THE ESTIMATE FOR FUTURE GROWTH AND PLANNING
Isilon clusters deployed with three or more nodes demonstrate performance growth that is mostly linear with node addition for a
given working set. That is, if you size an archive cluster at three nodes (based on performance and/or capacity) and the
performance requirement or expected growth changes by 30 percent, a single node addition will supply the incremental increase,
assuming you are load balancing the cluster with SmartConnect. Similarly, a cluster sized at 10 nodes that needs a 20 percent
increase in performance for the same working set will need an addition of two nodes.
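The rule of thumb in this section reduces to a ceiling calculation. The sketch below assumes the linear scaling described above, with each node contributing an equal share of performance; the function name is illustrative, and integer arithmetic is used to avoid floating-point rounding surprises.

```python
def nodes_to_add(current_nodes, increase_percent):
    """Whole nodes needed to raise performance by increase_percent.

    Assumes every node contributes an equal unit of performance, so the
    increase is current_nodes * increase_percent / 100, rounded up.
    """
    return (current_nodes * increase_percent + 99) // 100

print(nodes_to_add(3, 30))    # prints: 1
print(nodes_to_add(10, 20))   # prints: 2
```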
These rules and guidelines “wrap” at the natural limit of protection level and SmartPools node count (which is on the order of 20 nodes
per pool for 2:1). Starting another pool once these quantities are reached is the logical progression for cluster growth. There is no
issue, for example, with having a 22-node cluster. If desired, however, the customer could divide the cluster into two pools of
11 nodes at a protection level of N+2:1. The cluster would still be protected against two simultaneous node failures and would
additionally be protected against four simultaneous disk failures. This option is detailed in the OneFS data protection
documentation listed in the References section.
4.4 OTHER SIZING CONSIDERATIONS
The following sections briefly discuss other elements that you may want to keep in mind when assessing an archive deployment.
4.4.1 INFORMATION ON THE MAXIMUM NUMBER OF CONNECTIONS
If multiple archive servers experience an archiving cliff event, many connections may be opened against the cluster at once. The
number of SMB sessions to a single node is limited by the amount of memory available to a single process, because each connection
carries overhead. SmartConnect is essential to use cluster memory resources most efficiently.
4.4.2 MAXIMUM NUMBER OF TOTAL FILES IN THE FILE SYSTEM
Archives can grow quite large over time, and each cluster has a limit. The maximum number of files in a single directory is the same
as the maximum number of files for the cluster; however, that is not a practical number. Limits depend on total storage capacity of
the cluster, not the number of nodes (although obviously the use of more nodes typically translates to higher capacity). An Isilon
cluster can scale to file counts in the billions.
4.4.3 ISILON SYNCIQ FOR ARCHIVE DATA REPLICATION
If archive replication is required, there are many considerations needed to determine both the effective replication timelines and
whether it is even viable given the replication bandwidth. Replication assumptions include:
1. An existing EMC Isilon source cluster composed of either single or mixed node types
2. Replication strategy: Disaster Recovery (DR)
2.1 DR allows for dissimilar performance profiles between the source, which is the primary working set, and the target
2.2 The target is typically less highly performing and is used to restore data in the event that the primary is destroyed
2.2.1 Isilon NL-Series nodes can be used here
2.2.2 Isilon HD-Series nodes can be used here
3. Replication strategy: Business Continuance (BC)
3.1 BC requires similar performance on the replication target as on the source, which is the primary working set
3.2 In a BC scenario, the target location is made into the active site during a primary site failure
3.2.1 A run book and cutover process are required
To calculate transfer times, use the second example in section 4.2 as a reference.
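As a rough viability check for replication bandwidth, the same style of calculation can be applied to the replication link. The sketch below is an assumption-laden illustration: the daily change volume, link speed, link efficiency factor, and replication window are all hypothetical inputs, and the function is not part of SyncIQ.

```python
def replication_hours(changed_gb, link_mbit_s, efficiency=0.8):
    """Hours needed to replicate changed_gb over a link of link_mbit_s.

    `efficiency` is an assumed fraction of nominal link speed actually
    achieved; decimal units are used (1 GB = 1,000 MB, 8 bits per byte).
    """
    usable_mb_s = link_mbit_s / 8 * efficiency
    return changed_gb * 1000 / usable_mb_s / 3600

# Hypothetical case: 500 GB of daily change over a 1 Gb/s link,
# with an 8-hour replication window.
hours = replication_hours(changed_gb=500, link_mbit_s=1000)
viable = hours <= 8
print(f"{hours:.2f} hours, fits in window: {viable}")
```

If the computed time exceeds the available window, replication cannot keep pace with the change rate without more bandwidth or a longer window.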
5.0 CONCLUSION
EMC Isilon scale-out NAS for archives is designed to provide the most efficient utilization of capacity, reduce overall storage footprint,
and deliver significant savings. Its support for standard NAS protocols provides flexibility in selecting the right archive applications
for your business. Once data is archived off of the primary storage, Isilon incorporates a number of features to help ensure archived
files are stored as efficiently as possible:
• Single file-system simplicity and single volume simplicity with 80 percent or greater storage utilization. And, unlike most other
systems, an Isilon cluster’s storage efficiency, availability, and performance improve as its capacity scales.
• Isilon SmartPools automatic, policy-based tiering capabilities can be used to store data in either high-performance or
high-capacity storage pools based on the data’s relevance and value to the organization.
• Isilon SmartConnect manages load balancing of many archive servers.
• Isilon’s simplified scale-out architecture enables consolidation of file data, reduces the number of locations across the
organization where data is stored, decreases management overhead, and streamlines archive operations.
Additional features of Isilon archive solutions are summarized in Table 5.
Feature: Massive scalability and high density storage options
Benefit: Archives may start small, but inevitably will grow. Isilon clusters can start as small as 18 TB and grow to over 50 PB in a
single file system, adding additional performance, availability, and storage efficiency as it grows.

Feature: Cost-effective, disk-based solution that offers accessibility and data integrity advantages over tape
Benefit: For long-term data preservation and immediate access, an Isilon disk-based archive solution offers the greatest flexibility
and eliminates risks.

Feature: High-performance NAS access enables easy consolidation of archives across multiple application and compute environments
Benefit: Performance that scales as capacity grows enables high-volume throughput and interoperability and flexibility with
multiprotocol support.

Feature: Automated, policy-based tiering to minimize storage costs
Benefit: SmartPools automatically manages storage tiering to help ensure that data is stored in the most effective tier for maximum
protection and cost savings.

Feature: Automated load balancing
Benefit: Isilon SmartConnect provides a robust and automatic method for all of your archive servers to be load balanced across all
nodes of the cluster to most effectively utilize all the available resources.

Feature: Robust data protection
Benefit: Isilon storage is highly resilient and offers a wide array of data protection and security options to safeguard your data.

Table 5. Key Isilon scale-out NAS features and benefits for archive
EMC Isilon archive solutions provide organizations with a highly efficient, massively scalable, and secure disk-based archive that
protects data for long-term retention while optimizing the use of primary storage resources.
TAKE THE NEXT STEP
Contact your EMC sales representative or authorized reseller to learn more about how EMC Isilon archive solutions can benefit your
organization.
Also see our solutions in the EMC Store at https://store.emc.com/isilon.
6.0 APPENDIX
6.1 ARCHIVE APPLICATION FILE SYSTEM ACL REQUIREMENTS
Many applications will not function “out of the box” with default share ACLs. This is not an EMC Isilon system limitation; it is a
property of secure systems. There are a few share and ACL changes that need to occur for most third-party applications to function
properly with their vaults. See the References section for an ACL link.
6.2 REFERENCES
The following documents provide additional and relevant information.
High Availability & Data Protection with EMC Isilon Scale-Out NAS
• An overview of EMC Isilon data protection technology
EMC Isilon Third-Party Software Compatibility Guide
• A set of matrices listing currently supported third-party software and hardware
EMC Isilon OneFS Operating System
• An overview of the EMC Isilon OneFS operating system
EMC Isilon Share Configuration for Symantec Enterprise Vault
• A technote describing custom SMB settings for a vault share
EMC Isilon HD-Series
• Specification sheet for EMC Isilon HD-Series nodes
EMC Isilon NL-Series
• Specification sheet for EMC Isilon NL-Series nodes
EMC Isilon X-Series
• Specification sheet for EMC Isilon X-Series nodes
EMC Isilon S-Series
• Specification sheet for EMC Isilon S-Series nodes
EMC Isilon SmartConnect
• A white paper discussing the Isilon cluster load-balancing mechanism
Next Generation Storage Tiering with EMC Isilon SmartPools
• A white paper describing Isilon SmartPools functionality and configuration
Best Practices for Data Replication with EMC Isilon SyncIQ
• A white paper describing EMC Isilon replication technology
EMC Isilon OneFS User Guide
• A detailed user guide on the EMC Isilon OneFS operating system
ABOUT EMC
EMC Corporation is a global leader in enabling businesses and service providers to transform their operations and deliver IT as a
service. Fundamental to this transformation is cloud computing. Through innovative products and services, EMC accelerates the
journey to cloud computing, helping IT departments to store, manage, protect and analyze their most valuable asset, information,
in a more agile, trusted and cost-efficient way. Additional information about EMC can be found at www.EMC.com.