White paper
Fujitsu Storage ETERNUS CS8000
Disaster-resilient architectures

ETERNUS CS8000 enables comprehensive disaster recovery architectures.

Content
1 Introduction
2 The Role of Backup
3 The role of ETERNUS CS8000
4 How ETERNUS CS8000 adds robustness to backup and recovery
4.1 Backup application architectures
4.2 How ETERNUS CS8000 Complements Backup Application Architectures
4.2.1 ETERNUS CS8000 Grid Architecture
4.2.2 True Tape Virtualization
4.2.3 Split Site Configurations
4.3 ETERNUS CS8000 as Consolidation Platform
5 Disaster Recovery with ETERNUS CS8000
5.1 Overview
5.2 DR Solutions Limited to One Site
5.3 DR Solutions Meeting More Aggressive Service Levels
5.4 Two Data Centers and One Split Site ETERNUS CS8000
5.4.1 Disaster-Resilient Architecture
5.4.2 Failure
5.4.3 Fallback
5.5 Two Data Centers and Two ETERNUS CS8000 Systems Connected by Long Distance Replication
5.5.1 The Challenge
5.5.2 Disaster-Resilient Architecture
5.5.3 Failure
5.5.4 Fallback
6 General Availability Features of ETERNUS CS8000
7 Conclusion

1 Introduction
Continuous operation, ideally with no downtime, is vital for all companies that depend on IT technology. System outages for whatever reason have an immediate negative impact on the business. Customers who cannot access IT services quickly enough move to other alternatives, and if this happens repeatedly, a company's reputation may be damaged.
In addition to high availability of server and storage hardware as well as data mirrors, snapshots, and replication, data protection is a vital element to ensure appropriate service levels. ETERNUS CS8000 is an extremely powerful solution element in this context and helps backup software-based data protection infrastructures reach previously unknown service levels.
This white paper explains the various possibilities of ETERNUS CS8000 to enable more aggressive service levels for data protection, including disaster resilience. It helps if the reader is already familiar with ETERNUS CS8000 and VTL (Virtual Tape Library) concepts on a general level, because it would go beyond the scope of this paper to explain every feature and function of this product in greater detail. Nevertheless, care was taken to provide at least short explanations for the ETERNUS CS8000 and VTL-specific terminology.
In addition to backup, ETERNUS CS8000 is also a very attractive complement for archiving and second-tier file storage solutions, thereby providing a unified backup and archiving target. Because this white paper covers the topic of disaster resilience mainly from a backup application point of view, archiving is not covered here but is reserved for another document concentrating on that subject.

2 The Role of Backup
As soon as continuous operation is the goal, many people immediately start to think about high availability concepts for server and storage hardware. In addition, mirror, snapshot, and replication techniques are used on a data level to come close to the ultimate goal of RPO = RTO = 0 (RPO, Recovery Point Objective: the point in time you have to go back to in order to find the last valid and saved copy of lost data; RTO, Recovery Time Objective: the amount of time needed to restore the last valid and saved copy of lost data). However, even with these technologies it is not possible to keep this ambitious RTO-RPO promise for all potential threats, even if the promise is a non-zero value in the range of minutes and hours.
Consider for a moment that a software bug in a database is causing data corruption. This data corruption is immediately mirrored or replicated to all data copies and renders these high-availability targets useless. If you also consider that this data corruption is not detected for several days, snapshot technologies will usually not be able to cover this time period, either. Another case could involve an important file being accidentally erased, with the deletion undiscovered for a longer time period. In the end this is the same case as before: Mirror, replication, and snapshot technologies will not help to recover the lost file. Last but not least, major catastrophes can easily destroy all mirrored / replicated / snapshot data copies, which again leads to a situation with no RTO-RPO promise at all.
Common to all these scenarios is the fact that only a traditional backup scheme provides the ability to recover from these types of data losses. Although a backup is not able to meet the very aggressive RPO = RTO = 0 service level, it is able to safeguard the more aggressive promises of mirror / replication / snapshot techniques, because these techniques cannot keep their promise for all possible (and probable) threats (cf. figure 1). In this sense a backup is the final insurance in case the other methods to secure infrastructure and data do not work. And it is in this sense a vital and indispensable element in every attempt to maximize IT operation continuity.

Figure 1: Mix of different technologies

In essence, traditional backup offers two important features to increase data availability:
- The ability to "travel back in time" to access (multiple) older data copies in case the newer ones are damaged or deleted
- The ability to maintain data copies over very long distances
You could argue that these features are also available with an intelligent combination of data mirrors, replication, and snapshots, but in reality these concepts only work within certain boundaries and can become very expensive. Especially in database environments (and these are very often mission-critical), daily change rates can be so high that the ability to cross larger distances and maintain older versions long enough becomes unrealistic or at least economically unfeasible.
This leaves most customers with the (widely accepted) mix of all these technologies (cf. figure 1), i.e. the usage of mirror / replication / snapshot functions to meet aggressive RPO / RTO goals and the usage of traditional backup methods to either assure these aggressive RPO / RTO goals or to have an economically more feasible alternative, especially in case of high daily data change rates and long distances to cross.

3 The role of ETERNUS CS8000
As outlined in the last chapter, a backup infrastructure is an important complement to every IT infrastructure in order to secure a minimum service level for all major threats that could harm continuous operation. One important element in a backup infrastructure is the ability to store multiple versions of primary data for several weeks or months.
This guarantees that recovery of older data versions is still possible in case errors are detected only after some time. In order to survive catastrophes, backup copies must also be available at remote sites. A remote site can easily be several hundred kilometers away from the primary data site.
A question that now arises is whether it is necessary to implement specific robustness in the backup infrastructure itself. At first sight it could be considered enough to just provide a single backup path with very limited redundancies, because the backup path is not part of the productive IT infrastructure and a failure within the backup infrastructure would have no immediate effect on the continuous operation of the primary IT environment. This perception will, however, easily change if mission-critical data needs to be restored that is otherwise no longer available. If the backup system also fails during this outage, no possibility is left to restore mission-critical data. The consequences will be significant. Even the case that backup data is still there but is temporarily inaccessible can have an extremely negative effect, because the recovery time is likely to increase to unacceptably high values.
ETERNUS CS8000 adds scalability, extended capacity, robustness and disaster resilience to the backup repository, which not only enables backup applications to meet more aggressive service levels but also simplifies backup operation by avoiding complex backup policies in the first place.

4 How ETERNUS CS8000 adds robustness to backup and recovery
4.1 Backup application architectures
Before we dive deeper into the ETERNUS CS8000 architecture, we will have a brief look at backup application architectures. Today most backup applications consist of three major elements (cf. figure 2):
- Backup clients: These are the systems that contain the data to be backed up. This includes small workplace systems as well as large database servers.
- Media servers: These are data movers that receive data from backup clients and store it in a backup repository. ETERNUS CS8000 can be such a repository. Depending on the number of clients and their throughput requirements, multiple media servers can be deployed to perform the daily backup. Depending on the backup application, other terms like "storage node" or "media agent" are used to describe the same function. Throughout this document we will refer to this function as "media server", independent of the individual backup application in use.
- Backup server: This instance is responsible for the overall backup management. Very often only one backup server is necessary to manage a complete backup infrastructure consisting of thousands of clients and many media servers. Depending on the backup application, other terms like "master server" or "commserve" are used to describe the same function. Throughout this document we will refer to this function as "backup server", independent of the individual backup application in use.

Figure 2: Major elements of backup applications

It is quite apparent that the media server is the scaling element in this architecture. If backup requirements grow, new media servers can be deployed under the centralized management of a backup server. This media server concept also adds redundancy to the overall backup architecture. If a media server fails, another one can take over the workload and ensure continuous backup operation.
Only the backup server itself is a potential single point of failure, which is the reason that backup servers are very often operated in a clustered setup. Many concepts have been implemented to secure a backup server and its master data, but it would be beyond the scope of this document to describe them all.

4.2 How ETERNUS CS8000 Complements Backup Application Architectures
4.2.1 ETERNUS CS8000 Grid Architecture
A very similar architecture is used in the ETERNUS CS8000. The system as a whole behaves like one very large tape library. The (logical) tape drives of this library, however, are realized via a grid of processors, called ICPs (Integrated Channel Processors), which provide the logical (tape) drives (LDs) for the media servers. As the number of media servers grows, the number of ICPs can also be increased; all ICPs are managed via one centralized instance in ETERNUS CS8000. Just like media servers, ICPs may fail, but the remaining ICPs will still work and be able to take the additional load. It is therefore recommended to connect media servers not only to one ICP but to multiple units. Ideally a media server has access to all ICPs installed inside an ETERNUS CS8000, because that also makes load balancing easier.
However, the grid of ICPs would be useless if it were not able to share data. This data sharing must take place on a scalable storage system that also offers sufficient bandwidth for all ICPs. As in any VTL, the first level of this storage is realized on disk. This disk repository is named TVC (Tape Volume Cache) and consists of a number of RAID systems (RAID level 5 or 6 in ETERNUS CS8000) that can be scaled according to the number of implemented ICPs. All ICPs and all RAID systems are connected via a redundant high-speed internal SAN. Finally, a clustered file system called CAFS (CentricStor Appliance File System; all virtual tapes are stored in the TVC in a number of CAFSs) enables access of all ICPs to the same data. This CAFS is in fact the key element that makes it possible for ICPs to take over the work of others, because it allows all ICPs to access the same data (cf. figure 3).

Figure 3: ETERNUS CS8000 Grid Architecture

In a nutshell, the ETERNUS CS8000 grid architecture is able to meet almost every performance target by growing its number of ICPs and RAID systems according to the number of deployed media servers. And it is able to provide redundant access to the backup repository, which ideally complements the backup application concept of media server failover. If one media server fails, another one can not only take over its work, but also get access to all the backup data the failed media server has already written. As a consequence, data can be restored independent of media server availability. On the other hand, a media server will be connected to multiple ICPs, which means that in case of an ICP failure it can still use (logical) tape drives of other ICPs that still allow access to all previously written backup data (cf. figure 4).

Figure 4: Redundant access to the backup repository
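As a purely illustrative aside, the following short Python sketch models this redundancy principle: a media server that has been given logical drives (LDs) on several ICPs simply switches to an LD on a surviving ICP when one ICP fails, while the shared CAFS keeps all previously written data accessible. This is not ETERNUS CS8000 code; all class and function names are hypothetical.

    # Illustrative model only: a media server with LDs spread across several ICPs
    # keeps working as long as at least one of its ICPs is still reachable.

    class LogicalDrive:
        def __init__(self, name, icp):
            self.name = name      # e.g. "LD01"
            self.icp = icp        # name of the ICP that emulates this drive

    class MediaServer:
        def __init__(self, name, drives):
            self.name = name
            self.drives = drives  # LDs deliberately generated on different ICPs

        def pick_drive(self, failed_icps):
            # Choose any LD whose ICP is still alive; all LDs see the same CAFS data.
            for ld in self.drives:
                if ld.icp not in failed_icps:
                    return ld
            raise RuntimeError("no ICP reachable for " + self.name)

    # A media server connected to LDs on two different ICPs, as recommended above.
    ms = MediaServer("media-server-1",
                     [LogicalDrive("LD01", "ICP-A"), LogicalDrive("LD02", "ICP-B")])
    print(ms.pick_drive(failed_icps=set()).name)      # normal operation -> LD01
    print(ms.pick_drive(failed_icps={"ICP-A"}).name)  # ICP-A lost -> LD02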
4.2.2 True Tape Virtualization
The grid architecture concept provides an elegant method of scaling a system and adding redundancy to individual components like the ICPs or the TVC. But what happens if data is lost, i.e. data in the backup repository is lost and no longer available for restores? Most backup environments today address this by creating additional data copies of the backup images, sometimes also referred to as clones. Clones need to be managed on the backup application level, which can create significant overhead:
- All data needs to be read and written a second time, thereby requiring significant bandwidth of the backup network.
- Every clone will result in new index information stored in the backup server, which might cause problems with some backup applications, especially if backup targets include file systems with many files.
- It might take a while until the generation of clones is started, which leads to a situation in which only one backup copy is available for a significant time.
- The overall backup policy management and scheduling gets more complex. In particular, the definition and management of additional backup copies creates additional administration effort and should not be underestimated.
ETERNUS CS8000 greatly simplifies this form of redundancy management and in addition offers better automation in error situations. As soon as a tape, or more precisely an LV (Logical Volume; the emulation of a tape with a capacity of 0.8 to 200 GB), has finished writing / reading, i.e. is unmounted, it is copied from the TVC to a PV (Physical Volume; a real tape in an automated tape library). Up to three of these physical tape copies (all created in parallel) are supported; the feature is known as Dual Save / Triple Save. Together with an additional disk mirror, which is explained later, up to 5 copies of the LV can exist inside ETERNUS CS8000. These 5 copies, however, are transparent to the backup application, which still only sees one (cf. figure 5). As a result the backup application only needs to take care of one backup copy, without any need to generate additional clones.
Access to all data copies is automated and completely transparent for the backup application. For example, if access to an LV stored on a PV fails, the I/O request is redirected to the next LV copy on another PV; furthermore, an automated self-healing mechanism is initiated that will recover the LV copy on the failed PV. All this is completely transparent to the backup application, which will not notice that there was an error at all. Other backup appliances on the market do not offer this feature and solely depend on the backup application's capability to manage additional clone copies of the backup data. As a consequence ETERNUS CS8000 does not depend on functionality like OST (Open Storage Technology, a Symantec backup interface that helps to better integrate Symantec backup software with backup appliances) to integrate better with backup applications. In addition, all data copies inside ETERNUS CS8000 are created using the internal SAN and do not take away precious bandwidth of the customer's backup network.

Figure 5: The backup application only sees one copy – multiple copies managed by ETERNUS CS8000

In a nutshell:
- No clones need to be created by the backup application, thus saving bandwidth on the backup network. Instead, data copies are created and managed inside ETERNUS CS8000 using its internal network.
- No extra index information is created by the backup application, resulting in a leaner internal database.
- Data copies are created as soon as LVs are unmounted, i.e. parallel to the data ingest, at the earliest possible time.
- The backup application only needs to manage one backup copy, leaving the redundancy task to ETERNUS CS8000. As a result backup policies become simpler.
- Last but not least, the self-healing mechanisms of ETERNUS CS8000 are superior to typical cloning scenarios of backup applications.
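The read redirection and self-healing behavior described above can be pictured with the following deliberately simplified Python sketch. It is not ETERNUS CS8000 code; it only models an LV with several transparent copies, a read that silently falls back to the next copy when one copy is unreadable, and the regeneration of the damaged copy from a surviving one. Names such as read_lv and regenerate are illustrative assumptions.

    # Illustrative model only: one LV with several transparent copies. A damaged
    # copy is skipped, the read succeeds from another copy, and the damaged copy
    # is regenerated - the backup application never notices the error.

    lv_copies = {
        "LV000123": [
            {"location": "PV copy 1 (physical tape)", "healthy": False},  # damaged
            {"location": "PV copy 2 (physical tape)", "healthy": True},
        ]
    }

    def regenerate(lv_name, bad_copy, good_copy):
        # Self-healing: rebuild the damaged copy from an intact one.
        print("regenerating", lv_name, "on", bad_copy["location"],
              "from", good_copy["location"])
        bad_copy["healthy"] = True

    def read_lv(lv_name):
        copies = lv_copies[lv_name]
        for copy in copies:
            if copy["healthy"]:
                # Trigger repair of any damaged sibling copies.
                for other in copies:
                    if not other["healthy"]:
                        regenerate(lv_name, other, copy)
                return "data of " + lv_name + " read from " + copy["location"]
        raise IOError("all copies of " + lv_name + " are unreadable")

    print(read_lv("LV000123"))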
4.2.3 Split Site Configurations
Today many data centers distribute their IT operation across at least two sites. Each site uses the other one as a DR (Disaster Recovery) site. Server and storage infrastructure is available at both sites, and data is mirrored or replicated in metro cluster fashion. Consequently, a major disaster impacting one site can be compensated by the remaining one, because all infrastructure elements including the data are present at the surviving site to continue operation. The corresponding backup infrastructure should be set up in a similar way. From a backup application perspective, media servers would be distributed across the two sites and a failover scenario would be implemented for the backup server, which normally runs at one site.
For these environments ETERNUS CS8000 offers a unique architecture called the "split site configuration". The aim here is to preserve the idea of a single backup repository by simply distributing the grid components of an ETERNUS CS8000 equally across the two sites together with a strong interconnect. In case of a disaster at one site, enough components of ETERNUS CS8000 are still available and will continue to work. In contrast to other backup appliances, which are confined to a single site and will need some kind of failover to another site, an ETERNUS CS8000 system remains one system with internal components at two sites. This has the major advantage that in case of a disaster no failover is necessary. The system simply continues to run. The backup application continues to see the same backup repository and also does not need to fail over its operation to another one. This setup is best combined with a synchronous mirror of the TVC, called CMF (Cache Mirror Feature). CMF together with Dual Save (i.e. a secondary physical tape copy) will assure that not only the ETERNUS CS8000 grid components are available at both sites, but also all data instances on disk (i.e. in the TVC) and on physical tape (cf. figure 6).

Figure 6: ETERNUS CS8000 Split Site Configuration - Components and data instances are available at both sites

All in all, the distribution of a single system across two sites in a split site configuration plus the duplication of all backup data via a synchronous mirror (CMF) and Dual Save (i.e. a second physical tape copy) offers a number of advantages:
- Both sites contain enough components to survive a major disaster at one site.
- If configured right, ETERNUS CS8000 continues to run without interruption even in case of a disaster.
- The concept of a single (virtual tape) library is maintained even in case of a disaster, which makes it easier for a backup application to recover.
The ETERNUS CS8000 split site architecture ideally complements a mission-critical, disaster-resilient IT architecture consisting of highly available server and storage infrastructure distributed across two data centers, including the deployment of state-of-the-art mirror, replication, and snapshot technologies plus a backup application.

4.3 ETERNUS CS8000 as Consolidation Platform
All these technology elements, like the grid architecture, the true tape virtualization, and the split site capability, position ETERNUS CS8000 as an ideal consolidation target. On the one hand, it can easily scale according to capacity (> 1 Exabyte) and throughput (> 100 TB/h) requirements, and on the other hand it can be configured with no single point of failure; as we will see in the next chapters, it can also work as a completely disaster-resilient backup appliance. In order to fully exploit these capabilities, ETERNUS CS8000 supports all major backup applications and is also able to serve them in parallel at the same time. In addition, it is able to support FICON-based mainframe environments in parallel to open systems within the same appliance, which further extends its consolidation capabilities. And last but not least, ETERNUS CS8000 supports all major tape library and drive vendors in its back-end.

5 Disaster Recovery with ETERNUS CS8000
5.1 Overview
In the past chapters we have seen that, despite mirror, replication, and snapshot technologies, traditional backup methods are indispensable for IT environments that contain mission-critical components. We also understood that, in addition to the productive IT environment, backup infrastructures themselves must be kept operational under all circumstances. It was also explained that backup applications are able to maintain a significantly higher level of operability and meet more aggressive SLAs with the help of ETERNUS CS8000. ETERNUS CS8000 features are the basis to reach previously unknown disaster resilience. But before we dive deeper into the special functionality of ETERNUS CS8000, let's have a look at a DR solution limited to one data center site.

5.2 DR Solutions Limited to One Site
In a data center limited to one site, the corresponding DR solution is just able to preserve the data in case of a disaster. Less focus can be put on the recovery time, because the ability to implement any form of continuous operation would require a second site. As a consequence, a simple, straightforward backup architecture is deployed with minimum built-in redundancy, with the primary goal of creating just one backup copy that is somehow kept at a remote site. In its simplest form, physical tapes are created and vaulted (cf. figure 7). This concept will just preserve the data, and in case the primary site is lost it is unknown how long it will take to organize new hardware and software and put them in a state to recover data from the vaulted tapes. This is not a desirable solution for a mission-critical environment, which usually relies on some form of duplicating the data plus its related IT infrastructure (i.e. application, server, storage, and data) across two active sites.
Figure 7: Data center limited to one site - Physical tapes are created and vaulted

Even in environments which are limited to one site, ETERNUS CS8000 can add significant robustness, as already described in the previous chapters (cf. figure 8).

Figure 8: ETERNUS CS8000 adds robustness even in data centers which are limited to one site

ETERNUS CS8000 can provide significant robustness:
- The ETERNUS CS8000 grid architecture allows the scale-out implementation of logical (tape) drives (LDs; emulations of tape drives, e.g. of the LTO type) across multiple nodes (ICPs). Media servers will use LDs belonging to different ICPs and will therefore be able to continue operation in case of the loss of an ICP.
- A clustered file system allows access to the same LVs by multiple ICPs (and their corresponding media servers). Data is always available for restore, even in case of the loss of a media server / ICP. The clustered file system runs on top of the RAID-protected systems forming the TVC.
- In addition, LV data is copied to physical tapes and stored in multiple copies transparently for the backup application, including a self-healing mechanism in case one copy is lost or corrupt. This further increases data availability and keeps the backup application's storage policy setup very simple, i.e. only one copy has to be maintained by the backup application, and ETERNUS CS8000 takes care that this copy is always available.

5.3 DR Solutions Meeting More Aggressive Service Levels
ETERNUS CS8000 usually unfolds its full capabilities in mission-critical environments. These environments require more than just the ability of the data to survive a major outage or disaster. As outlined before, a mission-critical IT architecture involves at least two data center sites that under normal conditions work independently of each other but are able to take over the work of the other site in case of a disaster. In its simplest form, the two sites are realized by splitting one data center into two fire areas. In a more sophisticated environment the sites are connected over longer distances (typically several kilometers). With growing distances, the challenges induced by growing latencies in the I/O chain increase. In practice this leads to two main data center architectures:
For distances up to the double-digit kilometer range, two data center sites are used that are tightly coupled in the sense that primary data is synchronously mirrored. In this architecture the focus is on having all data center infrastructure elements, i.e. server, storage, and data, available at both sites, 100% synchronized at any time, in order to continue IT operation even in case of a complete data center loss at one site. This architecture is ideally complemented by an ETERNUS CS8000 split site configuration. Not only will the primary IT infrastructure seamlessly continue to work with such a split site configuration, but also the backup infrastructure. As a result the whole IT environment will continue to run with optimum data protection, albeit with fewer components, thereby providing enough robustness until complete recovery of the lost site is achieved (cf. figure 9).
Figure 9: Architecture for two data centers with synchronous mirror

For very long distances or limited bandwidth, two sites will still be considered, but they will certainly be less tightly coupled. Asynchronous replication mechanisms will be more appropriate, and backup applications will most likely play a more important role in continuing IT operation at the remote site, as they are another important vehicle to make data available at the remote data center in addition to replication. For these architectures two individual ETERNUS CS8000 systems can be coupled with long distance replication mechanisms (cf. figure 10).

Figure 10: Architecture for two data centers with asynchronous replication

Before we get into the details of these two disaster-resilient architectures, let's summarize the different DR approaches and the corresponding contribution by ETERNUS CS8000:

IT architecture: One data center; vaulting of backup data; general robustness optimized by using the ETERNUS CS8000 grid architecture.
Environment: Only one data center is operated. The backup solution including ETERNUS CS8000 will be the only instance creating data copies that will be stored somewhere outside the data center.
Usage: Not a real DR solution. The focus is on preserving data in case of a disaster. RTO is very often unknown. ETERNUS CS8000 delivers maximum robustness within the framework of this single data center approach and will also create additional backup copies for vaulting.

IT architecture: Two data centers within reasonably small distance; use of the ETERNUS CS8000 split site architecture.
Environment: Two 100% synchronized data centers are operated at all levels (server, storage, data, backup). Continuous operation is key and is consequently not only available for the primary storage and server infrastructure but also for the backup infrastructure via ETERNUS CS8000.
Usage: ETERNUS CS8000 is the vital element to ensure that, together with the primary server and storage infrastructure, a fully functional backup / restore environment is also available at the second data center without interruption, thereby securing mission-critical applications that will lack a mirror in case of a disaster until recovery of the failed site.

IT architecture: Two data centers connected via very long distances; use of two ETERNUS CS8000 systems with the long distance replication feature.
Environment: Due to the increased latencies induced by very long distances, the concept of 100% synchronicity cannot be maintained. Nevertheless, long distance replication of backup data is possible by connecting two ETERNUS CS8000 systems to enable a restart of the IT infrastructure at the remote site in case of a disaster.
Usage: ETERNUS CS8000 is the vital element to ensure that backup data is automatically and efficiently replicated across very long distances to a remote site and is instantaneously available for a quick restore.

5.4 Two Data Centers and One Split Site ETERNUS CS8000
5.4.1 Disaster-Resilient Architecture
The core element of the architecture is one logical ETERNUS CS8000 system which is distributed over two geographically separated sites ("split site configuration"). We will refer to these sites as "Site A" and "Site B".
All vital components, such as
- the nodes which emulate the tape drives (ICPs),
- the Tape Volume Cache (TVC),
- the back-end nodes (IDPs: Integrated Device Processors, which operate the physical tape drives, PDs), which move data back and forth between the TVC and the physical tape library (PTL: Physical Tape Library, a real tape library connected to the ETERNUS CS8000 system), and
- the Virtual Library Processors (VLP and SVLP), which manage the system as a single entity (VLP: Virtual Library Processor, the ETERNUS CS8000 control component; SVLP: Standby Virtual Library Processor, which takes over the work of the VLP in the course of a backup-application-transparent AutoVLP failover),
are present at both sites, so that in case of an outage or a disaster affecting a whole site the other one still contains enough components to continue operation (cf. figure 11).

Figure 11: ETERNUS CS8000 split site configuration

The ETERNUS CS8000 internal and redundant LAN and SAN infrastructure (not visible to the outside world), which connects all components, is thereby extended to a second site. The advantage is obvious: Although distributed across two sites, there is still one single system representing one system state that is always available at both sites. The system state (i.e. the configuration and database) is always synchronously mirrored across the two sites. In contrast to configurations involving two independent systems, there is no classical failover necessary from one system to the other; instead the system simply continues to run with a reduced number of components.
In order to facilitate a smooth continuation, a number of features are available which ensure that not only the essential ETERNUS CS8000 components are available at both sites but also the logical volume data. This is achieved via
- a synchronous mirror of the configuration and internal database data,
- logical volumes (LVs) stored on disk in the TVC, which can be synchronously mirrored via CMF, and
- logical volumes (LVs) stored on physical tapes (PVs), which can be maintained in up to three copies using Dual / Triple Save, with at least one copy being stored at the second site.
All this guarantees that as soon as the backup application has written any data to ETERNUS CS8000, this data is already disaster protected (if written to a mirrored portion of the TVC; data written to a non-mirrored portion of the TVC will only be disaster protected once at least a Dual Save copy on physical tape has been created) and is accessible via a redundant infrastructure spread across two distant sites.

Split site configuration: Benefits and key architectural elements
Benefits:
- Interruption-free continuation of ETERNUS CS8000 operation if one site goes down
- Complete takeover of all tasks of the failed site
- All LVs written to a mirrored portion of the TVC will be available immediately after a disaster
- All LVs written to a non-mirrored portion of the TVC that have already been written (with Dual Save) to PVs are available after a system reconfiguration
- All aspects of the cross-site, redundant data storage (CMF, Dual Save) are handled by ETERNUS CS8000. The backup application is relieved of these tasks.
Key architectural elements are (see also the considerations in the chapter "General Availability Features of ETERNUS CS8000"):
- All ETERNUS CS8000 components are distributed across the sites so that one site contains the minimum set to continue operation.
- The ETERNUS CS8000 internal SAN and LAN are extended from site A to site B.
- Cross-site connections of the media servers to ICPs / logical (tape) drives (LDs) enable continuous access to ETERNUS CS8000 by all media servers even in case of a disaster.
- Remote Dual Save to PTLs set up at both sites guarantees LV copies on PVs at every site.
- TVC RAID systems are evenly distributed over both sites, and a mirror is configured for those LVs that must survive a disaster without interruption.
- VLP and SVLP are distributed over the two sites and linked via redundant SAN and LAN.
- The redundant SAN and LAN connections between the two sites should be routed via alternative paths to avoid a communications failure as far as possible (please refer to a later chapter for details).
To come straight to the point: This architecture represents the optimum in disaster protection, including automatic continuation in case of an outage disabling a complete site. An ETERNUS CS8800 model is a requirement.

5.4.2 Failure
There are three basic disaster scenarios to be distinguished:
- Failure of site A
- Failure of site B
- Total failure of all communications between sites A and B
A site failure is handled by the (backup application transparent) AutoVLP failover mechanism, which will activate the SVLP to become the new VLP in case the VLP site is affected. Because the ETERNUS CS8000 database and the configuration data are synchronously mirrored across the sites, every site has access to all relevant data to continue operation. This also includes immediate access to all LVs that have also been mirrored.
Total communications failure between the two sites constitutes a special situation. A failure of this kind leads VLP and SVLP each to assume that the other site has failed and, for its own site, either to continue to exercise control (VLP) or to take over control (SVLP). This phenomenon, known by the term "split brain", results in a single consistent system dividing into two separate systems, which from this point forward have separate configurations and cannot easily be merged again in a fallback. To avoid this "split brain" behavior, a Tie Breaker Processor (TBP; an ETERNUS CS8000 component involved in resolving critical situations) is introduced at another site, site C (cf. figure 12).

Figure 12: Three basic disaster scenarios

Together with the VLP at site A and the SVLP at site B, the TBP works as a so-called quorum node. All three quorum nodes (VLP, SVLP, and TBP) communicate with each other via LAN to decide in which form the ETERNUS CS8000 system is operated. In a normal situation all three quorum nodes are up and running, resulting in a fully functional ETERNUS CS8000 system at both sites that is managed by the VLP. In the event of a malfunction, ETERNUS CS8000 responds with the behavior described below:
Disaster situation: Failure of site A
Behavior: The VLP quorum node cannot be reached by the TBP and the SVLP. The SVLP and the TBP gain cluster majority. The AutoVLP failover mechanism is executed and the SVLP becomes the new VLP at site B.
Result: Continued operation. Site B takes over operation under the control of the SVLP, which now becomes the new VLP.

Disaster situation: Failure of site B
Behavior: The SVLP quorum node cannot be reached by the TBP and the VLP. The VLP and the TBP gain cluster majority. The VLP continues to manage the remaining ETERNUS CS8000 components at site A.
Result: Continued operation. Site A continues operation.

Disaster situation: Total communications failure
Behavior: The VLP quorum node and the SVLP quorum node cannot reach each other. However, the TBP can still reach the VLP and the SVLP. Based upon its LAN interface configuration, the TBP decides with which other quorum node it will gain cluster majority. If the selected quorum node is the VLP (and this is the case in a normal configuration), it will continue operation at site A; site B will be expelled from the cluster. If the selected quorum node is the SVLP, the AutoVLP failover mechanism is executed and the SVLP becomes the new VLP at site B; site A will go down.
Result: Continued operation. One site continues operation while the other becomes unavailable.

(A simplified sketch of this quorum-majority decision is shown at the end of this section.)

In a nutshell, the TBP is introduced as an additional instance in order to distinguish a total communications failure from a site failure. In the overview above all failure cases were discussed except the failure of the TBP. In this case the VLP and the SVLP quorum nodes can still reach each other and will therefore be able to gain cluster majority and continue operation without any interruption. If, however, in this state the (redundant) LAN connection between VLP and SVLP is also interrupted by any means, there will not be enough connected quorum nodes left to form a new cluster (more than half of the quorum nodes must be up and running and connected to each other). In this case the system will not be able to automatically continue to run but will instead need a small system intervention to restart. If continuous operation is imperative, care should be taken that the TBP
- is located at a 3rd site (site C) independent of sites A and B, and
- maintains LAN connections to the VLP and SVLP that are ideally independent from the connection between VLP and SVLP.
Not in all customer environments is it possible to install the TBP at a 3rd site. As an alternative, the TBP can be installed together with the VLP at one site. As long as the site containing the TBP and the VLP is not affected, the situation will be the same as described before. Only in case the site containing the TBP and the VLP is lost will different behavior occur:

Disaster situation: Failure of the site containing TBP and VLP
Behavior: The VLP and TBP quorum nodes cannot be reached by the SVLP. The SVLP has no majority to manage the cluster.
Result: Manual restart. The other site takes over operation under the control of the former SVLP, which now becomes the new VLP.

The following picture emerges in all cases after the surviving site has resumed operation:
- ETERNUS CS8000 operates with the remaining ICPs / LDs and IDPs / PDs (PD: Physical (tape) Drive, a real tape drive connected to an ETERNUS CS8000 system).
- One control component (the VLP, or the SVLP promoted to the new VLP) is running.
- Previously mirrored components (TVC, ETERNUS CS8000 database and configuration) continue to exist without a mirror.
- LVs belonging to a non-mirrored file system (CAFS) will only be accessible if a (Dual / Triple Save) copy has been written to PVs.
- Surviving resources will be reassigned to make them available again.
- Only the PTLs of one site may have survived. Dual Save is temporarily suspended.
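The quorum behavior summarized above can be captured in a small, purely illustrative Python sketch. It is not ETERNUS CS8000 code; it only models the majority rule among the three quorum nodes (VLP, SVLP, TBP): a site keeps running if it can still reach more than half of the quorum nodes, and a total communications failure is resolved in favor of the partner preferred by the TBP. The function name and parameters are hypothetical.

    # Illustrative model only: majority decision among the quorum nodes VLP, SVLP, TBP.

    QUORUM_NODES = {"VLP", "SVLP", "TBP"}

    def surviving_site(reachable_from_a, reachable_from_b, tbp_prefers="VLP"):
        # reachable_from_a / reachable_from_b: quorum nodes each site can still reach
        # (site A hosts the VLP, site B hosts the SVLP). tbp_prefers stands in for
        # the TBP's LAN interface configuration in the split-brain case.
        majority = len(QUORUM_NODES) // 2 + 1            # 2 of 3 nodes are needed
        a_ok = len(reachable_from_a) >= majority
        b_ok = len(reachable_from_b) >= majority

        if a_ok and b_ok:
            # Total communications failure between A and B while the TBP sees both:
            # the TBP picks one partner, the other site is expelled from the cluster.
            return "site A" if tbp_prefers == "VLP" else "site B (SVLP becomes VLP)"
        if a_ok:
            return "site A"
        if b_ok:
            return "site B (SVLP becomes VLP)"
        return "no automatic continuation - manual restart required"

    # Failure of site A: only the SVLP and the TBP remain reachable from site B.
    print(surviving_site(reachable_from_a=set(), reachable_from_b={"SVLP", "TBP"}))
    # Total communications failure: each site can still reach the TBP.
    print(surviving_site({"VLP", "TBP"}, {"SVLP", "TBP"}))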
5.4.3 Fallback
Once the failed or inactivated site can be put back into operation, the following important steps must be carried out:
- Make the new / recovered ICPs, IDPs, the VLP, and the RAID system known to ETERNUS CS8000.
- Start resynchronization of the mirrored portion of the TVC to resume CMF operation.
- Move non-mirrored LVs that had previously been moved to the surviving site back to their original place.
- Reintegrate the PTL of the failed site. Lost PVs can be regenerated by making new PVs available to the Dual / Triple Save setup, which in turn will automatically recreate them based on the information of the surviving copy.

5.5 Two Data Centers and Two ETERNUS CS8000 Systems Connected by Long Distance Replication
5.5.1 The Challenge
Although the split site architecture described in the last chapter represents the best way to add robustness to backup environments, including disaster resilience, while still meeting very aggressive RTO targets, it will reach its limits with growing distance between the two sites. For distances very much beyond 100 km, the latencies induced by growing round trip times will render this concept useless. This is because in every IT environment at least a group of I/O transactions needs to be acknowledged from time to time. The speed by which these I/Os are acknowledged depends on the time required to send the I/O from the source to the target and the time to confirm the I/O back to the source. Not counting the delays caused by network equipment like routers, the distance between source and target plays an important role. The maximum speed at which an I/O signal can travel is limited by the speed of light, which is, for example, about 200,000 km/s in fiber optics. Although this speed is not a problem within a data center and across distances of a few meters only, it will certainly become one for larger distances. Let's have a look at the example of 100 km: The round trip time from the source to the target and back again to the source will be 200 km / 200,000 km/s = 1 ms (cf. figure 13).

Figure 13: Round trip time – 100 km example

This value already adds a measurable overhead to normal I/O performance. Although you can try to minimize the number of acknowledgments in a total I/O flow (and this is in fact offered by modern network devices today via techniques like "buffer to buffer crediting"), it will in the end remain a problem (cf. figure 14).

Figure 14: WAN round trip time - example with buffer to buffer crediting

That leaves us with the following unusual situation: Although data can still be sent at very high rates like 8 Gb/s FC or 10 Gb/s Ethernet, the speed of a single data stream is slowed down by the above-mentioned latency. You can also rephrase this into a more positive statement, saying that although a single data stream can be slowed down by latency effects, the presence of multiple data streams at the same time can compensate for this, because the bandwidth is still high.
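The arithmetic behind this observation can be illustrated with a short Python calculation. The window size, the distance beyond 100 km and the link speed used below are assumptions chosen for illustration only, not ETERNUS CS8000 parameters; the point is simply that the throughput of a single stream is bounded by the amount of unacknowledged data allowed per round trip, so several parallel streams are needed to fill a fast long-distance link.

    # Illustrative calculation: how the round trip time limits a single data stream
    # and why parallel streams are needed to use the available bandwidth.

    SPEED_IN_FIBER_KM_S = 200_000                  # signal speed in fiber optics

    def round_trip_time_s(distance_km):
        return 2 * distance_km / SPEED_IN_FIBER_KM_S

    print(round_trip_time_s(100) * 1000, "ms")     # the 100 km example: 1.0 ms

    # Hypothetical long-distance case: 1,000 km and 1 MB of unacknowledged data
    # allowed in flight per stream (both values are assumptions for illustration).
    rtt = round_trip_time_s(1000)                  # 10 ms
    window_bytes = 1 * 1024 * 1024
    per_stream_bytes_s = window_bytes / rtt        # roughly 105 MB/s per stream
    link_bytes_s = 10e9 / 8                        # 10 Gb/s Ethernet, about 1.25 GB/s
    streams_needed = link_bytes_s / per_stream_bytes_s
    print(round(per_stream_bytes_s / 1e6), "MB/s per stream;",
          round(streams_needed), "parallel streams to fill the link")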
An ETERNUS CS8000 system based on the split site architecture will not be able to address this situation because it is unaware of it. However, we will see in the next section that two individual ETERNUS CS8000 systems are indeed able to address these circumstances by using the "Long Distance Volume Replication" feature. But before we explain this in more detail, we need to understand a few other aspects that are important to master the challenges of growing distances:
- The network protocols that are used for long distance replication will more likely be TCP/IP-based as opposed to SCSI-over-FC-based. An efficient interconnect between two sites should therefore be built around TCP/IP.
- With TCP/IP being the dominant protocol used for longer distances, the likelihood also increases that data streams might be routed over public lines, or at least lines where it is not sufficiently clear whether third parties might be able to access the data stream. This requires the ability to encrypt data at least during transportation and to implement a robust authentication scheme between the involved systems.
- Sufficient network bandwidth becomes a costly issue with growing distances. Therefore care should be taken to minimize the total amount of data that is replicated on a regular basis.
- Last but not least, the probability that I/O errors will occur and may interrupt long-running I/O processes will also increase with growing distance (and the number of involved network components). This is best addressed by a transport algorithm that allows effective error recovery without the need to restart longer I/O streams from the beginning each time they are interrupted by a less reliable network.

5.5.2 Disaster-Resilient Architecture
As outlined in the last chapter, a number of special requirements exist for larger distances that suggest an architecture other than a split site configuration. In addition, the advantages of a split site configuration should be preserved as much as possible, i.e. the ability to follow the paradigm of continuous backup and restore services. For this purpose, "Long Distance Volume Replication" is designed to effectively connect two ETERNUS CS8000 systems over longer distances. We will refer to the two systems in the following way:
- A source ETERNUS CS8000 system at site A, or the source site
- A target ETERNUS CS8000 system at site B, or the target site
The concept is to integrate the replication mechanism in the normal back-end tape operation of the source ETERNUS CS8000 system and in the normal front-end tape operation of the target ETERNUS CS8000 system (cf. figure 15). This means:
- A number of logical tape drives on the target system are connected to the source system as if they were physical tape drives. In contrast to physical tape drives, the logical tape drive connections do not suffer from the "shoe shining effect", i.e. they don't need to be operated in streaming mode but may also run at low speed.
- Instead of SCSI over FC, TCP/IP over Ethernet is used to connect the back-end of the source system to the front-end of the target system, thereby offering full access to the most common long distance architectures.
- In addition to I/O-latency-induced optimizations on the network component level, the logical drive concept used on the target ETERNUS CS8000 allows the cost-efficient operation of many drives in parallel, so that the available bandwidth is used in an optimal way and latency effects on the single-stream level are compensated by the parallelization of I/O streams.
- Logical volumes on the source site will be copied to the target site using its TCP/IP-based logical drives. Due to the 1:1 copy mechanism that preserves the native tape format on the target site, they become immediately available to the backup application for fast restores.
- Seamless integration in the Dual / Triple Save concept makes it possible to maintain multiple copies of the logical volumes generated on the source site. For example, it is possible to maintain a high-performance copy on disk or physical tape on the source site for fast restores (i.e. the copy remains on disk or tape at the source site), while a second copy is automatically (and transparently) created on the target site using the TCP/IP-connected logical drives over a longer distance.
- Data can be encrypted during transportation, and an "ssh" (secure shell) based authentication mechanism ensures a proper identity check.
- Data is compressed, and a delta transmission algorithm is used on top to minimize the amount of replicated data. This delta transmission technique also ensures that a broken I/O process will not require data transmission to be started over and over again but will instead continue based on the last synchronization point (a simplified sketch of such checkpoint-based resumption follows below).

Figure 15: Long distance volume replication
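The checkpoint-based resumption mentioned in the last bullet can be pictured with the following deliberately simplified Python sketch. This is not the ETERNUS CS8000 replication code; it only illustrates the general idea that a volume is transferred in chunks, the last acknowledged chunk is remembered as a synchronization point, and an interrupted transfer continues from that point instead of starting over. The chunk size and all names are assumptions.

    # Illustrative model only: resumable, chunked transfer with a synchronization point.

    CHUNK_SIZE = 4 * 1024 * 1024                   # 4 MB chunks (assumed value)

    def replicate(volume_data, send_chunk, checkpoint):
        # Send volume_data chunk by chunk; resume from the last checkpoint on retry.
        offset = checkpoint["offset"]
        while offset < len(volume_data):
            chunk = volume_data[offset:offset + CHUNK_SIZE]
            send_chunk(offset, chunk)              # may raise on a network error
            offset += len(chunk)
            checkpoint["offset"] = offset          # acknowledged: new synchronization point
        return "replication complete"

    # Simulated unreliable link: fails once after roughly 20 MB, then recovers.
    state = {"sent": 0, "failed_once": False}
    def flaky_send(offset, chunk):
        if offset > 16 * 1024 * 1024 and not state["failed_once"]:
            state["failed_once"] = True
            raise ConnectionError("link interrupted")
        state["sent"] += len(chunk)

    volume = bytes(64 * 1024 * 1024)               # a 64 MB logical volume (dummy data)
    checkpoint = {"offset": 0}
    try:
        replicate(volume, flaky_send, checkpoint)
    except ConnectionError:
        pass                                       # continue from the checkpoint below
    print(replicate(volume, flaky_send, checkpoint))
    print("bytes sent in total:", state["sent"])   # 64 MB, not the volume twice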
From an overall perspective, the target ETERNUS CS8000 system will work like a physical tape library that is connected via TCP/IP (covering tape drive emulation and library control) and is seen by the source system as an additional back-end target that is able to store additional logical volume copies within the standard Dual / Triple Save scheme. Cf. figure 16 for example concepts.

Figure 16: Examples of cascading concepts

Long distance volume replication with mono site source system
An ETERNUS CS8000 system is installed as a source in a non-split site configuration, maintaining a first data copy on disk / physical tape close to the source system. Another ETERNUS CS8000 system is operated as a target, maintaining a Dual Save copy of the source logical volumes on disk / physical tape.

Long distance volume replication with split site source system
A split site ETERNUS CS8000 system is installed as a source, already maintaining two data copies (via Dual Save) on disk / physical tape in the same manner as described in the previous chapter in order to meet very aggressive RTO targets. A second ETERNUS CS8000 system is installed at a remote site over a very long distance using "Long Distance Volume Replication". This target system can also be a split site system, but will in most cases just be an ordinary mono site system. It will store a third data copy (as seen by the source system) via the Triple Save scheme. Essentially this system configuration is the combination of the two ETERNUS CS8000 architectures described in this paper: It allows maximum DR capabilities by offering the split site architecture (including the support of two sites within reasonable distance) and involving a third data copy via "Long Distance Volume Replication" on a second ETERNUS CS8000 target system, which can be very far away from the split sites and is therefore suited as an additional DR layer in case a catastrophe were to affect the two sites covered by the split site configuration.

Long distance volume replication and vice versa
Of course it is possible to also operate both ETERNUS CS8000 systems as source and target at the same time.
This will result in a system configuration in which a first system replicates its data to a second one, with the second one also replicating data back to the first one. In addition, concepts can be realized where one source system replicates to multiple target systems (i.e. generates multiple DR copies) and where multiple source systems replicate into one central target system (i.e. realizing a branch office backup concept). Mixed environments are also possible that allow "n" source systems to replicate to "m" target systems (cf. figure 17).

Figure 17: Example: Multiple source systems replicate to two target systems

In the following we will concentrate on the first two long distance volume replication concepts, which are in fact just one, because in this scenario there is no real difference between a split site and a non-split site source system.

5.5.3 Failure
In a nutshell, the "Long Distance Volume Replication" architecture consists of one ETERNUS CS8000 system being the active backup target, while the second ETERNUS CS8000 system just works as a data repository holding an additional copy and is not (yet) directly accessed by the backup application. However, with only a short delay caused by the asynchronous nature of the replication mechanism, the second ETERNUS CS8000 system will own a 100% identical copy of all LVs that have been created by the first system. This enables instantaneous access to the additional data copies by the backup application. Please note that, in addition to the content, the names of the LVs will also be identical to the LV names on the first system. This makes it possible for the backup application to access the volumes at the second site in case of a failure. The following procedure will apply in this failure case to ensure a quick restart:
- Verify that the source ETERNUS CS8000 system at site A has really stopped working and no longer accesses its volumes. It is important that only one system can access or modify LVs.
- Now the backup application can access the second ETERNUS CS8000 system at site B and configure it as a replacement for the first system. Because all LVs are the same on the second system at site B, the backup application already knows them and can easily re-catalogue them. It is therefore recommended that the backup application does not see the target library during normal operation, i.e. as long as it uses the source ETERNUS CS8000 system.
- The backup server of the backup application might also be affected by the failure. In this case a failover to another backup server must be organized (how this can be achieved is backup application-specific and cannot be covered in this document). After the backup server failover has been accomplished, it can configure the target ETERNUS CS8000 system at site B as the replacement for the first system.
- After all this has been completed, the existing or new backup server will continue to work with the second ETERNUS CS8000 system at site B in the same way as it used the first system. The main difference is that it will temporarily not use "Long Distance Volume Replication" and will work with one LV copy less than before.
In summary, it can be said that, because all LVs are replicated to the second site B in an identical manner, they can be quickly accessed by the surviving instance of the backup application without major reconfiguration work.
This provides a method for continuing backup and restore operation at a remote site shortly after a major failure at the first site.

5.5.4 Fallback
The following procedure will ensure a proper fallback to the original configuration:
- Bring the source system or its replacement up again at site A (not yet connected to the backup application).
- Stop all backup activities on the target system at site B.
- Identify all LVs that have been changed since the start of operation on the target system at site B.
- Make the system at site B available again as the target for "Long Distance Volume Replication" for site A.
- Physical tapes (e.g. representing the first copy of a Dual Save setup on the source system) might have survived the disaster at site A. As a consequence, you either join or replace these volumes, depending on whether the associated LVs have been updated by the system at site B in the meantime.
- Reconfigure the backup application to again use the source system at site A. This may also require proper steps to fall back to the backup application's original backup server if this one had also been affected by the disaster.
- Make sure that the target system remains untouched by the backup application to avoid confusion with replicated volumes the backup application is unaware of.
In any case the final result will be a new and up-to-date source system at site A, working as the backup repository for the backup application and maintaining all LVs (typically a first copy of the LVs at site A and a second (Dual Save) copy via "Long Distance Volume Replication" on the target machine at site B).

6 General Availability Features of ETERNUS CS8000
ETERNUS CS8000 consists of different components which not only scale exceptionally well within the framework of its grid architecture, but also constitute the basis for redundant and hence highly available systems, which in turn underpin the previously described DR scenarios. This chapter takes a more detailed view of the individual components from the availability perspective (cf. figure 18). Specifically, we cover:
- The Integrated Channel Processor (ICP) and the associated Logical (tape) Drives (LDs)
- The Virtual Library Processor (VLP) and its standby component (SVLP), including the ETERNUS CS8000 database on the RAID system ("/DB")
- The Tape Volume Cache (TVC) and the Logical (tape) Volumes (LVs) stored therein
- The Integrated Device Processor (IDP) and the associated Physical (tape) Drives (PDs)
- The Physical Tape Library (PTL) and the associated Physical (tape) Volumes (PVs)

Figure 18: Components of ETERNUS CS8000

ETERNUS CS8000 is available in various capacity stages. In the smaller systems, multiple components (e.g. ICP, IDP and VLP) are implemented in a single processor. In the large systems, such components are implemented in independent processors. In this way a distinction can be made between three architecture classes which differ in terms of their availability features (cf. figure 19).
ETERNUS CS8200 models are the "scale-up" systems.
6 General Availability Features of ETERNUS CS8000

ETERNUS CS8000 consists of different components which not only scale exceptionally well within the framework of its grid architecture, but also form the basis for the redundant and hence highly available systems on which the previously described DR scenarios rely. This chapter takes a more detailed view of the individual components from the availability perspective (cf. figure 18). Specifically, we cover:
- The Integrated Channel Processor (ICP) and the associated Logical (tape) Drives (LDs)
- The Virtual Library Processor (VLP) and its standby component (SVLP), including the ETERNUS CS8000 database on the RAID system ("/DB")
- The Tape Volume Cache (TVC) and the Logical (tape) Volumes (LVs) stored therein
- The Integrated Device Processor (IDP) and the associated Physical (tape) Drives (PDs)
- The Physical Tape Library (PTL) and the associated Physical (tape) Volumes (PVs)

Figure 18: Components of ETERNUS CS8000

ETERNUS CS8000 is available in various capacity stages. In the smaller systems, multiple components (e.g. ICP, IDP and VLP) are implemented in a single processor; in the large systems, such components are implemented in independent processors. In this way a distinction can be made between three architecture classes which differ in terms of their availability features (cf. figure 19):
- ETERNUS CS8200 models are the "scale-up" systems. The components ICP, IDP and VLP (or SVLP) are implemented on one processor, also called an Integrated Universal Processor (IUP, see footnote 22). These models consist of two IUPs with an external TVC.
- ETERNUS CS8400 models are the "scale-out-single-site" systems. All components are implemented in dedicated processors.
- ETERNUS CS8800 models are the "scale-out-split-site" systems. Here, too, all components are implemented in dedicated processors. Furthermore, the aforementioned "split site configurations" are possible, in which a system can be distributed over two sites. This scenario forms the basis for one of the DR architectures discussed in the previous chapters.

Figure 19: ETERNUS CS8000 - Three architecture classes

22 IUP: Integrated Universal Processor; a processor housing several ETERNUS CS8000 components, e.g. ICP, IDP and VLP, in one processor

The availability features of ETERNUS CS8000 become steadily better with ascending model number, from CS8200 through CS8400 to CS8800. The following overview helps identify and exploit the optimization potential with respect to availability for each component (bearing in mind that, depending on the ETERNUS CS8000 model, several components may be housed in one processor). Wherever possible, the power, LAN and SAN infrastructure should always be configured redundantly. If an entire site goes down due to a disaster, the optimizations described below will not help on their own; this case has been dealt with in the previous chapters.
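As a simple illustration of the redundancy recommendation just stated, the sketch below checks that every component in a small, hypothetical inventory has at least two independent power feeds and two SAN paths (LAN connectivity would be checked analogously). The inventory format and all names are invented for this example and are not an ETERNUS CS8000 interface.

```python
# Illustrative redundancy check for power and SAN connectivity. The inventory
# format is hypothetical and only serves to make the recommendation concrete;
# it is not an ETERNUS CS8000 interface.

INVENTORY = {
    "ICP-1": {"power_feeds": ["A", "B"], "san_paths": ["fabric1", "fabric2"]},
    "ICP-2": {"power_feeds": ["A", "B"], "san_paths": ["fabric1", "fabric2"]},
    "IDP-1": {"power_feeds": ["A"],      "san_paths": ["fabric1", "fabric2"]},
}

def redundancy_findings(inventory: dict) -> list[str]:
    """Report components that violate the 'configure redundantly' recommendation."""
    findings = []
    for name, links in inventory.items():
        if len(set(links["power_feeds"])) < 2:
            findings.append(f"{name}: only one power feed configured")
        if len(set(links["san_paths"])) < 2:
            findings.append(f"{name}: only one SAN path configured")
    return findings

# Example run: IDP-1 is flagged because it has a single power feed.
for finding in redundancy_findings(INVENTORY):
    print(finding)
```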
Component: ICP / LD
Availability architecture / recommendations:
- Configure multiple ICPs in the system.
- Each media server should serve more than one LD.
- The LDs attached to a media server should be generated on different ICPs.
- The media servers' systems should be connected via redundant SAN / FICON or ESCON infrastructures.
Information on failure / recovery:
- If an ICP goes down (and consequently also its LDs), the media server can continue on another LD/ICP combination. The backup application is responsible for redirecting new backup / restore jobs to the remaining LDs.
- Starting with the ETERNUS CS8400, the ICP resides alone on a dedicated processor, i.e. no other components are affected by the failure.
- A new ICP can be installed and commissioned during live operation.

Component: VLP / SVLP
Availability architecture / recommendations:
- An SVLP should always be configured for each VLP. Starting with the ETERNUS CS8400, the SVLP runs on a separate ISP (see footnote 23); on the ETERNUS CS8200 the function is available as an option and is accommodated along with the ICP and IDP functions on one IUP.
- The AutoVLP failover functionality should be activated.
Information on failure / recovery:
- The SVLP is automatically put into operation via the AutoVLP failover if the VLP fails. The SVLP takes over the functions of the VLP including the externally visible IP address.
- On the ETERNUS CS8400 and above the VLP/SVLP is contained on a separate processor, i.e. no other components are affected by the failure.
- A new VLP/SVLP can be commissioned with no interruption in operation.

Component: TVC / LV
Availability architecture / recommendations:
- The TVC is generally implemented using RAID5 or RAID6 technology. On the ETERNUS CS8400 and above the TVC can consist of two or more RAID systems.
- The ETERNUS CS8000 database and configuration data should be mirrored across two RAID systems to maximize system availability. In any case additional metadata copies are kept on the VLP and on the physical tapes.
- A CAFS should consist of at least 2 LUNs (preferably spread across two RAID systems) to enable a mirror of superblock/inode type information (even if the CAFS itself is not mirrored). For performance reasons, 4 LUNs per CAFS (8 in case of CMF) are recommended.
Information on failure / recovery:
- Individual disks are automatically replaced by hot-spare disks in case of a disk crash. The system continues running without interruption. Defective disks can be swapped during online operation.
- Operation continues in case of a complete LUN or even RAID system failure. However, non-mirrored LVs will become temporarily unavailable, and non-mirrored LVs that have not yet been written to PVs will be lost.
- If a CAFS spans more than one LUN or RAID system, information on lost LVs is preserved (the superblock / inode type information is mirrored) and may be used for a quick recovery of LVs already written to physical tape.
- Continuous operation including data availability can be achieved with the Cache Mirror Feature (CMF). CMF can be configured on a CAFS level. A lost mirror is automatically recreated after fixed or replaced hardware is connected to the CAFS again.

Component: IDP / PD
Availability architecture / recommendations:
- Configure multiple IDPs in the system and connect PDs to different IDPs.
- Configure enough PDs so that they can cope with the data throughput rate in order to prevent performance bottlenecks and TVC congestion in the event of a failure.
- The PDs attached to an IDP should be distributed over 2 PTLs, and conversely the PDs of a PTL should be distributed over at least 2 IDPs.
- The SAN infrastructure between IDPs and PDs should also be configured redundantly.
Information on failure / recovery:
- If an IDP or a PD crashes, ETERNUS CS8000 automatically uses the remaining resources.
- IDPs and PDs can be replaced and commissioned during online operation. Operation continues independently of the backup application at the front-end.

Component: PTL / PV
Availability architecture / recommendations:
- Create redundant copies of the LVs on different PVs and PTLs using Dual / Triple Save.
- Connect a 2nd PTL to the ETERNUS CS system via redundant paths (optional for ETERNUS CS8200 and ETERNUS CS8400, otherwise supported as standard).
- Premature aging of PVs can be prevented by automatic "refreshing" based on a suitable refreshing policy.
Information on failure / recovery:
- The advantage of virtualization is that the PTL does not have to be permanently available. This means that problems affecting PTL and PDs no longer have an adverse impact on backup operations.
- With Dual Save, recovery mode is initiated if a damaged PV is accessed. In this mode the redundant copy is accessed immediately. When a damaged PV is detected, its status is set to "faulty" and the system automatically generates a new copy. "Faulty" PVs must be replaced in the medium term as part of routine operational maintenance.

23 ISP: Integrated Standard Processor (i.e. ICP, IDP, VLP, or SVLP)
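To make the Dual Save recovery behaviour described for the PTL / PV component above easier to follow, the sketch below models it as a small function: when a damaged PV is accessed, the redundant copy is delivered immediately, the damaged PV is marked "faulty", and a replacement copy is triggered. All names and data structures are hypothetical placeholders; the real behaviour is handled internally by ETERNUS CS8000.

```python
# Illustrative model of the Dual Save recovery flow described above. All
# classes and functions are hypothetical; real behaviour is handled
# internally by ETERNUS CS8000.

def schedule_new_copy(lv: str, exclude_pv: str) -> None:
    # Placeholder: in reality the system generates the new copy automatically.
    print(f"scheduling new copy of {lv}, avoiding {exclude_pv}")

def read_lv_with_dual_save(lv: str, copies: dict[str, dict]) -> bytes:
    """Try each PV copy of an LV; on a damaged copy switch to the redundant
    one, mark the damaged PV 'faulty' and trigger a fresh copy."""
    for pv_name, pv in copies.items():
        if pv["damaged"]:
            # Recovery mode: mark the PV as faulty and schedule a new copy;
            # the faulty cartridge is replaced later during routine maintenance.
            pv["status"] = "faulty"
            schedule_new_copy(lv, exclude_pv=pv_name)
            continue
        # The redundant copy is accessed immediately.
        return pv["data"]
    raise IOError(f"no readable copy of {lv} left")

# Example: the first copy is damaged, so the second one is delivered.
copies = {
    "PV0107": {"damaged": True,  "status": "ok", "data": b""},
    "PV0233": {"damaged": False, "status": "ok", "data": b"backup block"},
}
print(read_lv_with_dual_save("LV0042", copies))
```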
7 Conclusion

To sum up, a suitably configured ETERNUS CS8000 system can withstand a failure of any component type and is even able to survive a complete site failure to keep your data safe. This is especially true for ETERNUS CS8800 systems, which can be configured with no single point of failure and can additionally be installed as split site configurations. The ETERNUS CS8400 systems can be configured with no single point of failure in single-site configurations. The ETERNUS CS8200 systems, on which the components ICP, IDP and VLP are implemented together on an IUP (Integrated Universal Processor), behave in a similar manner, with the exception that these models cannot be installed as a split site configuration. If one IUP fails, the remaining IUP can continue working, provided that the SVLP function is installed with AutoVLP failover. It is also important to make sure that the PTL infrastructure is connected to both IUPs, so that if one IUP fails, the remaining one can still access the PDs and the PTL in the back-end.

In the cases considered, operation either continues or can be resumed within a short period of time. In addition, new components can be installed while the system continues to run. This also includes software upgrades, which can be performed in a rolling manner, i.e. component by component without system interruption.

Nonetheless, the time in which the original status can be restored plays a key role, since a further failure of the same component type can lead to a system interruption. For this reason ETERNUS CS8000 provides three important mechanisms for quick failure notification:
- Notification of the operator or administrators via the ETERNUS CS8000 console, by SNMP, SMS or email
- Forwarding to a management center via SNMP
- Forwarding to the service provider (via AIS Connect, telephone or email)

The integration of these mechanisms into operational processes is a vital prerequisite for meeting agreed service levels.

Contact
Fujitsu
E-mail: storage-pm@ts.fujitsu.com
Website: www.fujitsu.com/eternus
2013-12-02 WW EN

© Copyright 2013 Fujitsu Technology Solutions GmbH. Fujitsu and the Fujitsu logo are trademarks or registered trademarks of Fujitsu Limited in Japan and other countries. Other company, product and service names may be trademarks or registered trademarks of their respective owners. Technical data subject to modification and delivery subject to availability. Any liability that the data and illustrations are complete, actual or correct is excluded. Designations may be trademarks and/or copyrights of the respective manufacturer, the use of which by third parties for their own purposes may infringe the rights of such owner.