ARCHIVE SOLUTIONS WITH EMC ISILON SCALE-OUT NAS

WHITE PAPER
October 2015

ABSTRACT
This white paper outlines the principles and concepts involved in implementing an enterprise-class archive repository using EMC Isilon scale-out storage. It also describes key Isilon technology and features used to implement an efficient and secure archive solution.

To learn more about how EMC products, services, and solutions can help solve your business and IT challenges, contact your local representative or authorized reseller, visit www.emc.com, or explore and compare products in the EMC Store.

Copyright © 2014 EMC Corporation. All Rights Reserved.
EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without notice.
The information in this publication is provided “as is.” EMC Corporation makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose.
Use, copying, and distribution of any EMC software described in this publication requires an applicable software license.
For the most up-to-date listing of EMC product names, see EMC Corporation Trademarks on EMC.com.
EMC², EMC, the EMC logo, Isilon, OneFS, FlexProtect, SmartConnect, SmartLock, SmartPools, SnapshotIQ, and SyncIQ are registered trademarks or trademarks of EMC Corporation in the United States and other countries. All other trademarks used herein are the property of their respective owners.
Part Number H11224.1

TABLE OF CONTENTS
1.0 EXECUTIVE SUMMARY
1.1 About this white paper
1.2 Assumptions
1.3 Industry trends
2.0 ARCHIVE VS. BACKUP
2.1 Archiving objectives
2.2 Archive technologies
3.0 EMC ISILON ARCHIVE SOLUTIONS ARCHITECTURE
3.1 Hardware
3.2 Software
3.3 Archiving with Automated Storage Tiering
3.4 Archive server load balancing with Isilon SmartConnect
3.5 High availability, data protection and security
4.0 ARCHIVE SIZING
4.1 What to understand about sizing an archive
4.2 Sizing examples
4.3 Scaling the estimate for future growth and planning
4.4 Other sizing considerations
4.4.1 Information on the maximum number of connections
4.4.2 Maximum number of total files in the file system
4.4.3 Isilon SyncIQ for archive data replication
5.0 CONCLUSION
TAKE THE NEXT STEP
6.0 APPENDIX
6.1 Archive application file system ACL requirements
6.2 References
About EMC

1.0 EXECUTIVE SUMMARY
As data continues to grow, simply adding more capacity to primary storage no longer makes sense. It’s expensive, inefficient, and doesn’t meet the growing need to correctly classify, manage, and protect data in a way that allows quick, efficient access. As a result, archive systems are an increasingly important business requirement in modern enterprise environments. Archiving applications can efficiently manage growing data stores to contain storage costs while also satisfying both compliance and e-discovery requirements.

EMC® Isilon® scale-out network-attached storage (NAS) provides a disk-based archive that scales from terabytes (TB) to petabytes (PB) of capacity. With Isilon, archival information benefits from massive scalability and the robustness of the EMC Isilon OneFS® operating system, with proven capabilities for protecting data and optimizing the flow of information within an organization.

This white paper outlines the principles and concepts for deploying an Isilon storage cluster as an enterprise-class archive repository. It includes architectural explanations of the technologies and features associated with providing NAS-based file services, as well as technical recommendations for sizing Isilon storage clusters based on appropriate cluster limits for archive applications.

1.1 ABOUT THIS WHITE PAPER
This guide provides technical considerations when planning archive systems based on Isilon storage in an enterprise environment. It assumes the use of third-party archive software.

1.2 ASSUMPTIONS
This guide assumes the reader has an understanding and working knowledge of the following:
• NAS storage protocols
• EMC Isilon scale-out storage architecture and the OneFS operating system
• File-system management concepts and practices

While this guide is intended to provide a consolidated reference point for systems administrators and managers looking to deploy archive software on an Isilon storage cluster, it is not intended to be the authoritative source of information on the technologies and features used to provide and support a file-services platform.

1.3 INDUSTRY TRENDS
Data archiving is the process of identifying and moving inactive data out of current production (primary storage) systems and into long-term archival storage systems. Implementing and maintaining an efficient and secure data archive is increasingly important for organizations across a wide range of industries and public sector agencies.

The rapid growth of unstructured data that most enterprises are experiencing is straining primary storage resources and pressuring IT budgets. Moving inactive data out of primary storage optimizes the performance of primary storage resources, while archival systems store that information far more cost-effectively and keep it online, avoiding the retrieval cost associated with tape media.

Many enterprises are subject to varying regulatory requirements that mandate long-term data retention. Some organizations are also looking to preserve and protect content for historical purposes. Archive retention periods of 10 to 15 years are not uncommon for many businesses, while some industries are mandated to retain specific data upwards of 100 years.
In certain cases, archive data is purposely not deduplicated, compressed, or changed in any manner to effectively meet stringent compliance requirements or to avoid potential litigation issues. Archiving can help e-discovery solutions and is an important way to comply with regulatory and corporate governance requirements.

2.0 ARCHIVE VS. BACKUP
The term “data archiving” is sometimes confused with or used interchangeably with “data backup.” This is often the result of assuming that a backup can be a substitute for an archive if needed. In reality, these two data retention strategies have distinct data protection objectives.

Backups, whether to disk or tape, typically have relatively short lifecycles, measured in days, weeks, or months between full copies. They are primarily used to restore data that may have been lost, corrupted, or destroyed. A data backup must usually be reconstituted from a proprietary backup format to other media at a different location before it can be used.

Unlike a backup that is used to restore recently lost or deleted data, archiving is a systematic approach to providing structure to unstructured data, usually laying it out in a predictable directory format, based on repeatable application algorithms. It enables the storing, management, retrieval, and eventual deletion of data as appropriate throughout its entire lifecycle.

Table 1 summarizes the differences between archive and backup.

Archive | Backup
A primary copy of information | A secondary copy of information
Used for information retrieval | Used for operational recovery
Improves operational efficiency by removing fixed content and duplicate data from the operational environment | Improves availability by enabling applications to be restored to specific points in time
Typically long term (years, decades, or forever) | Typically short term (weeks or months)
Data typically maintained for analysis or compliance as a managed repository | Data typically overwritten on a periodic basis (daily, weekly, monthly, etc.) and stored in an altered state which needs to be restored
Useful for compliance: enforces retention policies by typically storing data in its native form | Does not meet the needs of regulatory compliance, though some are forced to use it this way

Table 1. Archive and Backup Comparison

2.1 ARCHIVING OBJECTIVES
Many enterprises are seeing 50% growth in data each year, but little or no growth in storage hardware budgets. Not all data is equally valuable, and its value can change over time. Many organizations find that their managed data is too performance-sensitive to move to tape storage, but not time-critical enough to justify the cost of keeping it on high-performance storage. As an organization’s data footprint grows, managed data becomes more heterogeneous, with different data subsets having divergent performance and protection requirements.

Data archiving objectives will vary by organization but will typically include one or more of the following:
• Reclaim expensive existing primary storage
• Reduce on-going primary storage acquisition costs
• Move static data out of the recurring backup process
• Secure and protect data for long-term retention
• Satisfy applicable regulatory and governance requirements
• Provide reliable availability of data when needed

Data archive solutions can be used to manage growing data stores efficiently, optimize the use of primary storage resources, protect data for long-term retention, and greatly reduce overall storage costs.

2.2 ARCHIVE TECHNOLOGIES
An effective archiving solution typically comprises a mix of software that will likely include:
• Archiving software that automates the movement of data from primary storage to archival storage based on policies established in the data classification and rationalization process. Data classification and rationalization (policy tuning) are functions of the organization’s data mining requirements and legal compliance needs. Archive software can delete files at the end of their retention period. It may also include features such as data compression or single-instance storage to maximize disk utilization. Archiving application software generally uses the NFS or SMB protocols to move content from primary storage to an archive repository.
• E-discovery software that, using archiving software as a base, employs advanced search features that enable users and administrators to quickly search all files, emails, texts, and other data related to a specific topic, for use in data mining services or in response to legal inquiries. Some applications combine e-discovery and archive into purpose-built platforms such as an email archive solution or document and records management solutions.

Data retention policies can be established and enforced through archiving and e-discovery software by ensuring that all files of significance are flagged appropriately to prevent them from being altered or deleted.

3.0 EMC ISILON ARCHIVE SOLUTIONS ARCHITECTURE
EMC Isilon archive solutions combine Isilon scale-out NAS storage platforms, Isilon software and, in some cases, other EMC and third-party software.

Figure 1. Isilon Archive Solutions Architecture

Each element of the Isilon archive solution is described in the following sections.

3.1 HARDWARE
EMC Isilon offers four tiers of hardware storage platforms that can be combined to deliver the right blend of performance and capacity for a wide range of workloads, including file shares, home directories, archiving, and Big Data analytics.

For primary storage workloads and applications, Isilon offers:
• EMC Isilon S-Series: High-performance NAS storage platform built for highly transactional and IOPS-intensive applications. The Isilon S-Series can scale performance to more than 3.75 million IOPS in a single cluster.
• EMC Isilon X-Series: Highly versatile NAS storage platform that strikes a balance between high performance and large capacity to support a wide range of workloads. Isilon X-Series solutions scale from 18 TB to over 20 PB in a single namespace, with an aggregate throughput of up to 200 GB per second.

For nearline storage, active archiving, and high-density deep archiving, Isilon offers:
• EMC Isilon NL-Series: Designed for highly efficient, large-capacity storage and ideal for nearline storage and active archiving workloads. The Isilon NL-Series product family includes the Isilon NL410, a 4U platform that offers a flexible mix of configuration options and has a capacity that ranges from 36 TB to 210 TB per node.
• EMC Isilon HD-Series: High-density storage platform that is an ideal choice for deep archive and archive use cases with a capacity requirement of over 2.5 PB. The Isilon HD-Series product family includes the Isilon HD400, a 4U platform with a node capacity of 354 TB. The Isilon HD400 can store up to 3.2 PB in a single rack and reduce data center operating costs, including power, cooling and floor space, by up to 50%.

Table 2 provides a summary of design elements and considerations to determine the relative suitability of Isilon NL-Series and Isilon HD-Series nodes.

Design Element | Isilon NL-Series more suitable | Isilon HD-Series more suitable
Type of archive | Active archive: data at rest but accessed periodically | Active archive or deep archive: long-term data retention with minimal access
Archive size | < 2.5 PB | > 2.5 PB
Rack considerations | If normal depth racks preferred over deep racks | If deep racks preferred over normal racks
Weight consideration | When weight of the nodes is a concern | When weight of nodes is not a concern
Power consideration | Low line or high line power requirements | Only high line power is acceptable

Table 2. Isilon NL-Series and Isilon HD-Series Comparison

Isilon NL-Series and Isilon HD-Series nodes can be combined into a single cluster that includes Isilon primary storage platforms (Isilon S-Series and Isilon X-Series nodes). In this way, primary storage workloads and archive workloads can be supported in a single Isilon cluster. This provides an opportunity to implement a policy-based, automated storage tiering solution that automatically moves data to the appropriate storage tier and helps to optimize storage resources and lower costs.

3.2 SOFTWARE
All nodes in an Isilon cluster are powered by the EMC Isilon OneFS® operating system. OneFS combines the three layers of traditional storage architectures (file system, volume manager, and data protection) into one unified software layer, creating a single intelligent file system that spans all nodes within the cluster.
This allows the Isilon cluster to facilitate the archive process automatically through policy-based file placement on specified storage tiers.

With its modular, single file system architecture, OneFS enables Isilon storage systems to provide a massively scalable platform that is highly efficient and simple to manage. Isilon storage solutions can scale from as small as 18 TB to over 50 PB in a single file system. To increase capacity, new nodes can be added to an existing Isilon cluster within a minute and without disruption. With the Isilon AutoBalance™ feature of OneFS, there is no need to move data manually; it is done automatically and transparently to system users.

The Isilon OneFS operating system enables:
• Independent, linear scalability of performance and capacity
• A single point of management for large and rapidly growing data repositories
• High reliability and high availability with state-of-the-art data protection

The OneFS operating system manages aggregation of multiple Isilon node types into one namespace and the combination of all available resources from every node in the cluster. An internal, InfiniBand-based network functions as a virtual backplane between the nodes and enables any node in the cluster to service data for any client in the cluster (i.e., any other node), regardless of where the data actually resides.

A cluster may comprise different types of Isilon nodes (including S-Series, X-Series, NL-Series and HD-Series), with each type of node suited to a particular workflow and participating in different activities within the same, unified OneFS file system. This mixed-node approach is often the preferred choice for environments in which different workloads will coexist on the same storage array. It also enables enterprises to optimize storage resources by implementing a policy-based, automated tiered storage strategy using EMC Isilon SmartPools™ software.

OneFS also provides symmetric multiprocessing (SMP) capabilities that enable the system to move tasks between processors, which results in extremely efficient workload balancing. The combination of these features also results in very high overall bandwidth and performance capabilities for the cluster, an important capability when the Isilon cluster is supporting primary and archive workloads simultaneously.

The EMC Isilon software products used for management and data protection are summarized in Table 3.

DATA PROTECTION SOFTWARE
EMC ISILON SNAPSHOTIQ™ | Protect data efficiently and reliably with efficient snapshots, while incurring little to no performance overhead. Fast data recovery with near-immediate, on-demand snapshot restores.
EMC ISILON SYNCIQ® | Replicate and distribute large mission-critical data sets to multiple shared storage systems in multiple sites for disaster recovery protection. Simple failover and failback to increase data availability.
EMC ISILON SMARTLOCK® | Protect data against accidental, premature, or malicious alteration or deletion with Isilon’s software-based approach to WORM. Also helps meet stringent compliance and governance needs, such as SEC 17a-4 requirements.

MANAGEMENT SOFTWARE
EMC ISILON SMARTPOOLS® | Implement a highly efficient, automated tiered storage strategy to optimize storage performance and efficiency.
EMC ISILON SMARTCONNECT™ | Enable client connection load balancing and the dynamic NFS failover and failback of client connections across storage nodes to optimize the use of cluster resources.
EMC ISILON SMARTDEDUPE™ | Increase efficiency and reduce storage capacity requirements by up to 35 percent with deduplication of redundant data across multiple sources.
EMC ISILON INSIGHTIQ™ | Performance monitoring and reporting tools to maximize the performance of your Isilon cluster.

Table 3. Isilon Data Protection and Management Software

3.3 ARCHIVING WITH AUTOMATED STORAGE TIERING
Many enterprise environments have workloads or applications that require the use of active data while archiving older data. A mechanism is required to migrate stale primary data to archive storage, which should happen without any disruption to users and applications.

Isilon SmartPools software simplifies management and lowers storage costs with a transparent, policy-based, automated tiering approach. It lets organizations optimize storage resources and automatically move older, unused data to economical archive storage. SmartPools is integrated with the Isilon OneFS operating system to allow a single point of management, with a single scalable file system that offers multiple tiers of performance, depending on the data.

With SmartPools, storage administrators can automatically match storage resources with specific data and application requirements. It also simplifies management by eliminating the need for manual data migrations. SmartPools moves data among tiers based on the enterprise requirements without sacrificing data protection, application performance, or uptime. Administrators can also use defined policies to move data from one tier to another based on age, type, owner, location, or other criteria.

SmartPools is tightly integrated with Isilon OneFS, so all data, regardless of physical location, resides in the same single file system. This means that SmartPools data movements are completely transparent to the end-user application, removing the management, backup, and other issues related to stub-based tiering architectures such as those present in hierarchical storage management (HSM) implementations.

With Isilon SmartPools, data protection levels can be set per storage tier, per directory, or even per file. Regardless of which types of EMC Isilon storage nodes are used in the cluster, SmartPools can be used to control the data protection level.
SmartPools also provides an option to configure performance profiles, so different types of data are actually laid out on disk differently to optimize performance for different types of workloads.

Figure 2 illustrates a mixed-node, single file system deployment in an Isilon cluster for online and archive data using SmartPools.

Figure 2. Isilon cluster with SmartPools for automated tiered storage

In summary, tiering provides the following advantages:
• Simplifies management with automatic, policy-based data movement within a single namespace and single file system, without complex links, stubs, or manual data migration.
• Enables storage consolidation by allowing the support of multiple applications and workloads with varying performance requirements on a single storage system.
• Optimizes storage resources by automatically aligning them with application needs.
• Adapts seamlessly to workflow changes.
• Provides workflow isolation.

3.4 ARCHIVE SERVER LOAD BALANCING WITH ISILON SMARTCONNECT
Enterprise archive environments will typically engage several, if not dozens of, archive servers. These servers are often set to move data to the archive vault based on certain file criteria, usually last modified time. While the working sets that the archive software or policies monitor are not likely to all coalesce simultaneously, it is possible. Take this into consideration based on the number of archive servers that will be deployed.

EMC Isilon SmartConnect™ software optimizes network throughput by enabling intelligent client-connection load balancing and is used to automatically manage Isilon cluster access. SmartConnect manages client connection load balancing through a single host name to the cluster nodes. This provides optimal utilization of the cluster’s available network interfaces and network system resources.
By leveraging an organization’s existing DNS infrastructure, SmartConnect provides universal compatibility with all client types, eliminating the need for complicated connection management on the client side. To a client system, the cluster appears as a single network element. SmartConnect automatically balances incoming client connections across all available interfaces on the Isilon storage cluster. This improves performance on the cluster by distributing the workload evenly across multiple network paths and multiple nodes.

For an EMC Isilon storage cluster that hosts multiple concurrent workloads in addition to the organization’s archive, SmartConnect gives administrators the ability to partition workloads, by type, across the available node interfaces in a cluster. By maintaining multiple SmartConnect pools and minimizing the number of pools that overlap on a particular node interface, administrators can maintain sufficient network bandwidth for critical workloads on dedicated interface connections.

3.5 HIGH AVAILABILITY, DATA PROTECTION AND SECURITY
The Isilon OneFS operating system provides scale-out data protection through its EMC Isilon FlexProtect™ feature. FlexProtect utilizes advanced technology to provide redundancy and availability capabilities far beyond those of traditional RAID. FlexProtect uses Forward Error Correction (FEC) to create an n-way, redundant fabric that scales as nodes are added to the cluster, providing 100 percent data availability even with up to four simultaneous node failures. This goes far beyond the maximum level of RAID commonly in use today, which is the double-failure protection of RAID 6. Additional details are provided in the High Availability & Data Protection with EMC Isilon Scale-Out NAS white paper.

For data backup and disaster recovery protection, you can easily copy and replicate data to remote sites.
EMC Isilon SnapshotIQ software enables fast and efficient data backup and allows you to take point-in-time copies of data with a choice of snapshot intervals and RPO time options. For multi-site disaster recovery protection, EMC Isilon SyncIQ can be used to replicate data to your choice of local and remote sites. SyncIQ supports both LAN and WAN networks to replicate over short or long distances, providing protection from both site-specific and regional disasters.

Security and compliance options from Isilon include:
• Role-based access control (RBAC) and secure access zones to limit data access.
• Immutable storage through the write once, read many (WORM) locking capability of Isilon SmartLock software. In this way, SmartLock protects your archived data against accidental, premature, or malicious alteration or deletion.
• Data-at-rest encryption options from Isilon storage platforms that include self-encrypting drives to prevent data theft.
• Integrated file system auditing to identify unauthorized data access attempts.
• Security Technical Implementation Guide (STIG) hardening, CAC/PIV Smartcard authentication, and FIPS OpenSSL support.

Together, these capabilities provide a comprehensive data protection and security solution for your data archive.

4.0 ARCHIVE SIZING
There are many factors to consider in sizing a cluster appropriately for an archive workload. Archive environments will often have many archive servers moving files to the Isilon cluster based on time criteria. Having many archive servers access the cluster is a perfect use case for SmartConnect. SmartConnect will load balance the servers across all nodes of the cluster (while accessing the same file system for continuity). It is essential for sizing exercises that this practice is implemented so that no single-node bottleneck is reached artificially.

The potential to have many servers moving data simultaneously to the cluster can cause “cliff” events where the application can run into a requirement to move thousands of files all at once. Regardless of your deployment scenario, you must use peak values for your sizing estimate. If, for example, you have 10 archive servers but they are on different vault schedules, you must plan for the worst-case scenario: that all are undergoing a vault operation simultaneously. That will be the assumption in the following sections. This type of event requires consideration of the number of connections that can be handled, file system I/O, bandwidth, and so on. The following sizing section will address these considerations with Isilon clusters as well as provide general guidelines for elements such as maximum file counts and replication of the archive, among others.

4.1 WHAT TO UNDERSTAND ABOUT SIZING AN ARCHIVE
There are three base scenarios to consider when sizing an archive:
1. You have an existing Isilon cluster and wish to add an archive tier to it through expansion via node addition.
2. You will be deploying a new Isilon scale-out cluster for archive purposes only, and using third-party archiving software such as Symantec Enterprise Vault or EMC SourceOne for File Systems.
3. You will be deploying a new Isilon scale-out cluster that will be used for both working and archive data sets.

For scenarios one and three above, SmartPools should be implemented. For each of these deployment models, consider the archive tier or the archive target cluster to be composed of either NL-Series nodes or HD-Series nodes. That’s because:
• Archives are typically composed of capacity-intensive data sets.
• Archives usually have lower performance and throughput requirements than active working data sets (still size for peak vaulting events).
• Archives generally don’t have a strong metadata caching requirement.
• Archives are backed up on a different schedule than primary data.
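
The worst-case rule described earlier in this section (size for every archive server vaulting at the same time) can be illustrated with a short calculation. The following Python sketch is only an illustration: the `ArchiveServer` type, the `peak_load` helper, the server names, and every per-server figure are hypothetical values chosen for the example, not measurements or Isilon limits.

```python
from dataclasses import dataclass


@dataclass
class ArchiveServer:
    """Hypothetical description of one archive server's peak vaulting load."""
    name: str
    files_per_second: float  # assumed peak vaulting rate for this server
    avg_file_size_kb: float  # assumed average size of a vaulted file, in KB
    connections: int         # assumed connections opened to the cluster


def peak_load(servers):
    """Return (throughput in KB/s, connection count) for the worst case,
    in which every server runs its vault operation simultaneously."""
    throughput_kb_s = sum(s.files_per_second * s.avg_file_size_kb for s in servers)
    connections = sum(s.connections for s in servers)
    return throughput_kb_s, connections


# Ten servers on different vault schedules are still sized as if all ten
# vault at once. All figures below are illustrative assumptions.
servers = [ArchiveServer(f"vault{i:02d}", 100, 128, 4) for i in range(10)]
throughput_kb_s, connections = peak_load(servers)
print(f"Peak throughput: {throughput_kb_s:,.0f} KB/s")
print(f"Peak connections: {connections}")
```

Summing the per-server peaks, rather than averaging them over the vault schedule, is what keeps the estimate conservative.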

Six key metrics are usually gathered for evaluating the number of nodes required to meet a given archive workload:
1. The number of files that will be written to the file system
2. The average file size that will be moved in an archiving operation
3. The number of archive servers
4. The number of connections expected from an archive server
5. The expected capacity of the archive
6. Growth requirements

The combination of I/O and average file size gives a good idea of the aggregate throughput the cluster can expect to receive. This helps plan SmartConnect zones and which interfaces need to be used. It also helps determine whether to choose nodes with four 1 Gb/s ports or nodes with two 1 Gb/s ports and two 10 Gb/s ports. Transfers or migrations that occur internally to an Isilon storage cluster use the back-end InfiniBand network, so no external network interface bandwidth is consumed. Interface and bandwidth requirements, therefore, are more of a concern for archive target clusters. For this reason, it is usually best to start with identifying I/O requirements.

The number of archive servers and the expected connection counts help determine whether to consider a quantity of nodes that might be greater than the archive capacity requirements. Expected capacity and estimated growth help determine how many nodes, and at what density, are needed from a raw storage perspective. All of these factors need to be balanced such that the most stringent requirement is the sizing metric of choice.

Once a profile of performance requirements is available, it becomes a trivial matter to predict future performance values based on these results because performance grows linearly with node addition, as each Isilon node carries a “unit” of storage/processor/bandwidth.

4.2 SIZING EXAMPLES
Using different scenarios based on the metrics mentioned above, we can review a couple of different deployment models to ascertain how scenarios might play out.
In the first example, we take a look at number of files, average file size, and how that correlates to bandwidth. Figure 3 depicts a scenario in which 10,000 files are pushed from three archive servers to a three-node NL410 archive cluster at a rate of 100 files per second; the average file size for the archive is measured to be 128 KB.

Figure 3. Archive sizing example 1

The bandwidth requirement for a target cluster would be, on average, (100 files/s)*(128 KB/file) = 12,800 KB/s (about 12.5 MB/s). It would best be balanced over the cluster interfaces with SmartConnect managing the traffic across the archive target nodes. Three archive servers are not expected to drain the connection pool for a three-node cluster (the cluster is designed to handle many client connections). An internal cluster tier migration would not need to take this into consideration because it would be handled by the back-end IB network, but there might be contention for disk I/O.

The second example takes a look at capacity, bandwidth based on source file breakdowns, and the relative write capability of the archive target cluster.

Assumptions:
• 200 TB of archive source data
• File breakdown: 80/20 (80% 10+ MB files and 20% 4 KB files)
• Target cluster is 3 x Isilon NL410 nodes

Figure 4. Archive Sizing Example 2 – 200 TB Distributed Archive Content

Note that 200 TB of archive data is within the capacity of a three-node NL410 cluster. However, the aggregate throughput, based on some performance numbers for a cluster that size, will probably mean it is undersized for your deployment. The following values are for demonstration purposes only and do not reflect true performance.

Assume a typical mixed workload profile for the nodes as follows:
• Write 15 MB/s (small files)
• Write 520 MB/s (large files)

The read values are not of great importance here (unless the source is another NL-Series cluster), so writes to the target cluster will be the values to use.
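
For the second example, the time needed to write each portion of the archive follows directly from the demonstration write rates above. The Python sketch below is a rough aid under stated assumptions: it uses decimal units (1 TB = 1,000,000 MB), treats the two write rates as aggregate figures for the three-node target cluster, and applies the linear performance growth with node count noted in section 4.1. The function name `transfer_days` and its parameters are hypothetical, introduced only for this illustration.

```python
SECONDS_PER_DAY = 86_400
MB_PER_TB = 1_000_000  # decimal units assumed: 1 TB = 1,000,000 MB


def transfer_days(size_tb, rate_mb_s, nodes=3, baseline_nodes=3):
    """Estimate the days needed to write size_tb at rate_mb_s.

    rate_mb_s is treated as the aggregate write rate of a cluster with
    baseline_nodes nodes; throughput is assumed to scale linearly with
    the node count.
    """
    scaled_rate = rate_mb_s * nodes / baseline_nodes
    return size_tb * MB_PER_TB / scaled_rate / SECONDS_PER_DAY


archive_tb = 200
small_tb = archive_tb * 0.20  # 20% of the data is in 4 KB files
large_tb = archive_tb * 0.80  # 80% of the data is in 10+ MB files

small_days = transfer_days(small_tb, 15)   # small files at 15 MB/s
large_days = transfer_days(large_tb, 520)  # large files at 520 MB/s
print(f"Small files: {small_days:.1f} days")
print(f"Large files: {large_days:.1f} days")
print(f"Total:       {small_days + large_days:.1f} days")

# The same estimate with one additional node, under the linear assumption.
four_node_total = (transfer_days(small_tb, 15, nodes=4)
                   + transfer_days(large_tb, 520, nodes=4))
print(f"Four nodes:  {four_node_total:.1f} days")
```

Because the estimate is dominated by the small-file write rate, that rate is the figure most worth measuring on real hardware before committing to a node count.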
In this case, the entire archiving event occurs at once. Requirements for writes to the target will, of course, vary. For example, writing 40 TB of small files at 15 MB/s takes about 31 days. Writing 160 TB of large files at 520 MB/s takes about 3.5 days, for a total transfer time of about 34.5 days. This figure assumes the target is written at its maximum rate and is impractical due to the time involved, the system impact, and, potentially, the network bandwidth requirements. However, expect the performance of the cluster to grow linearly: adding a fourth node raises throughput by about a third, which reduces the transfer time by roughly 25 percent, and a fifth node reduces the remaining time by roughly a further 20 percent.

The third example is similar to the second; however, the archive capacity requirement is much higher, in this case over 2 PB.

Assumptions:
• 2.5 PB of archive source data
• The workload is deep archive (long-term data storage with minimal access)
• Physical space to install the hardware is a concern
• File breakdown (80/20): 80% 10+ MB files and 20% 4 KB files
• Target cluster consists of Isilon HD400 nodes

Isilon HD400 nodes are specifically designed for deep archive workloads. They provide massive scalability while lowering operating costs. Because each node holds 60 hard drives in a 4U form factor, they are well suited to sites where data center space is limited.

The deep (or cold) archive use case in this example can be satisfied using a 12-node HD400 cluster. This would provide a usable capacity of 3 PB and a storage efficiency of over 83%. Even a conservative approach of using only 85% of the usable capacity (about 2.55 PB) still covers the 2.5 PB requirement. A comparable usable archive capacity can be achieved using a 20-node NL410 cluster. The relative advantages of the Isilon HD400-based solution in this scenario are shown in Table 4.
Solution Attributes | Isilon HD400 Cluster | Isilon NL410 Cluster
Number of nodes required | 12 | 20
Total rack units | 48 | 80
Aggregate power, max (watts) | 15,600 | 17,000
Aggregate heat, max (BTU) | 45,000 | 58,000

Table 4. Solution Comparison: 3 PB Archive with Isilon HD400 or NL410 Cluster

In this example, the archive solution using Isilon HD400 nodes requires less physical space, less power, and less cooling.

4.3 SCALING THE ESTIMATE FOR FUTURE GROWTH AND PLANNING

Isilon clusters deployed with three or more nodes demonstrate performance growth that is mostly linear with node addition for a given working set. That is, if you size an archive cluster at three nodes (based on performance and/or capacity) and the performance requirement or expected growth changes by 30 percent, a single node addition will supply the incremental increase, assuming you are load balancing the cluster with SmartConnect. Similarly, a cluster sized at 10 nodes that needs a 20 percent increase in performance for the same working set will need two additional nodes.

These rules and guidelines “wrap” at the natural limit of protection level and SmartPools node count (which is on the order of 20 nodes per pool for N+2:1). Starting another pool once these quantities are reached is the logical progression for cluster growth. There is no issue, for example, with having a 22-node cluster. If desired, however, the customer could divide the cluster into two pools of 11 nodes at a protection level of N+2:1 and would still have protection against two simultaneous node failures, with additional protection against four simultaneous disk failures. This option is detailed in the OneFS data protection documentation listed in the References section.

4.4 OTHER SIZING CONSIDERATIONS

The following sections briefly discuss other elements that you may want to keep in mind when assessing an archive deployment.
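Before turning to those elements, the node-addition rule from section 4.3 can be expressed as a small calculation. This is a sketch only: it assumes performance scales strictly linearly per node (the text describes it as "mostly linear"), and the function name and the whole-number percentage parameter are hypothetical.

```python
def nodes_to_add(current_nodes, increase_percent):
    """Nodes to add for a given percentage performance increase,
    assuming each node contributes an equal unit of performance.

    Both arguments are expected to be non-negative integers.
    """
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-current_nodes * increase_percent // 100)

print(nodes_to_add(3, 30))   # 1: three-node cluster, 30 percent more performance
print(nodes_to_add(10, 20))  # 2: ten-node cluster, 20 percent more performance
```

Rounding up reflects the fact that nodes are added in whole units, so a single added node can overshoot a small incremental requirement.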
4.4.1 INFORMATION ON THE MAXIMUM NUMBER OF CONNECTIONS

If multiple archive servers experience an archiving cliff event, many connections may be opened against the cluster at once. The number of SMB sessions to a single node is limited by the amount of memory available to a single process, due to per-connection overhead. SmartConnect is essential for using cluster memory resources efficiently.

4.4.2 MAXIMUM NUMBER OF TOTAL FILES IN THE FILE SYSTEM

Archives can grow quite large over time, and each cluster has a limit. The maximum number of files in a single directory is the same as the maximum number of files for the cluster; however, that is not a practical number. Limits depend on the total storage capacity of the cluster, not the number of nodes (although more nodes typically translate to higher capacity). An Isilon cluster can scale to file counts in the billions.

4.4.3 ISILON SYNCIQ FOR ARCHIVE DATA REPLICATION

If archive replication is required, many considerations determine both the effective replication timelines and whether replication is even viable given the available bandwidth. Replication assumptions include:
1. An existing EMC Isilon source cluster composed of either single or mixed node types
2. Replication strategy: Disaster Recovery (DR)
   2.1 DR allows for dissimilar performance profiles between the source, which is the primary working set, and the target
   2.2 The target is typically lower performing and is used to restore data in the event that the primary is destroyed
      2.2.1 Isilon NL-Series nodes can be used here
      2.2.2 Isilon HD-Series nodes can be used here
3. Replication strategy: Business Continuance (BC)
   3.1 BC requires similar performance on the replication target as on the source, which is the primary working set
   3.2 In a BC scenario, the target location is made into the active site during a primary site failure
      3.2.1 A run book and cutover process are required

To calculate transfer times, use the second example in section 4.2 as a reference.

5.0 CONCLUSION

EMC Isilon scale-out NAS for archives is designed to provide the most efficient utilization of capacity, reduce overall storage footprint, and deliver significant savings. Its support for standard NAS protocols provides flexibility in selecting the right archive applications for your business. Once data is archived off primary storage, Isilon incorporates a number of features to help ensure archived files are stored as efficiently as possible:
• Single file-system and single-volume simplicity with 80 percent or greater storage utilization. And, unlike most other systems, an Isilon cluster’s storage efficiency, availability, and performance improve as its capacity scales.
• Isilon SmartPools automatic, policy-based tiering capabilities can be used to store data in either high-performance or high-capacity storage pools based on the data’s relevance and value to the organization.
• Isilon SmartConnect manages load balancing of many archive servers.
• Isilon’s simplified scale-out architecture enables consolidation of file data, reduces the number of locations across the organization where data is stored, decreases management overhead, and streamlines archive operations.

Additional features of Isilon archive solutions are summarized in Table 5.

Feature | Benefit
Massive scalability and high-density storage options | Archives may start small, but inevitably will grow. Isilon clusters can start as small as 18 TB and grow to over 50 PB in a single file system, adding performance, availability, and storage efficiency as they grow.
Cost-effective, disk-based solution that offers accessibility and data integrity advantages over tape | For long-term data preservation and immediate access, an Isilon disk-based archive solution offers the greatest flexibility and eliminates risks.
High-performance NAS access enables easy consolidation of archives across multiple application and compute environments | Performance that scales as capacity grows enables high-volume throughput and interoperability and flexibility with multiprotocol support.
Automated, policy-based tiering to minimize storage costs | SmartPools automatically manages storage tiering to help ensure that data is stored in the most effective tier for maximum protection and cost savings.
Automated load balancing | Isilon SmartConnect provides a robust and automatic method for all of your archive servers to be load balanced across all nodes of the cluster to most effectively utilize all the available resources.
Robust data protection | Isilon storage is highly resilient and offers a wide array of data protection and security options to safeguard your data.

Table 5. Key Isilon scale-out NAS features and benefits for archive

EMC Isilon archive solutions provide organizations with a highly efficient, massively scalable, and secure disk-based archive that protects data for long-term retention while optimizing the use of primary storage resources.

TAKE THE NEXT STEP

Contact your EMC sales representative or authorized reseller to learn more about how EMC Isilon archive solutions can benefit your organization. Also see our solutions in the EMC Store at https://store.emc.com/isilon.

6.0 APPENDIX

6.1 ARCHIVE APPLICATION FILE SYSTEM ACL REQUIREMENTS

Many applications will not function “out of the box” with default share ACLs. This is not an EMC Isilon system limitation; it is a property of secure systems. There are a few share and ACL changes that need to occur for most third-party applications to function properly with their vaults.
See the References section for an ACL link.

6.2 REFERENCES

The following documents provide additional and relevant information.

High Availability & Data Protection with EMC Isilon Scale-Out NAS
• An overview of EMC Isilon data protection technology

EMC Isilon Third-Party Software Compatibility Guide
• A set of matrices listing currently supported third-party software and hardware

EMC Isilon OneFS Operating System
• An overview of the EMC Isilon OneFS operating system

EMC Isilon Share Configuration for Symantec Enterprise Vault
• A technote describing custom SMB settings for a vault share

EMC Isilon HD-Series
• Specification sheet for EMC Isilon HD-Series nodes

EMC Isilon NL-Series
• Specification sheet for EMC Isilon NL-Series nodes

EMC Isilon X-Series
• Specification sheet for EMC Isilon X-Series nodes

EMC Isilon S-Series
• Specification sheet for EMC Isilon S-Series nodes

EMC Isilon SmartConnect
• A white paper discussing the Isilon cluster load-balancing mechanism

Next Generation Storage Tiering with EMC Isilon SmartPools
• A white paper describing Isilon SmartPools functionality and configuration

Best Practices for Data Replication with EMC Isilon SyncIQ
• A white paper describing EMC Isilon replication technology

EMC Isilon OneFS User Guide
• A detailed user guide on the EMC Isilon OneFS operating system

ABOUT EMC

EMC Corporation is a global leader in enabling businesses and service providers to transform their operations and deliver IT as a service. Fundamental to this transformation is cloud computing. Through innovative products and services, EMC accelerates the journey to cloud computing, helping IT departments to store, manage, protect, and analyze their most valuable asset, information, in a more agile, trusted, and cost-efficient way. Additional information about EMC can be found at www.EMC.com.