White paper
Fujitsu Storage ETERNUS CS8000
Disaster-resilient architectures
ETERNUS CS8000 enables comprehensive disaster recovery architectures.
Content
1 Introduction
2 The Role of Backup
3 The role of ETERNUS CS8000
4 How ETERNUS CS8000 adds robustness to backup and recovery
4.1 Backup application architectures
4.2 How ETERNUS CS8000 Complements Backup Application Architectures
4.2.1 ETERNUS CS8000 Grid Architecture
4.2.2 True Tape Virtualization
4.2.3 Split Site Configurations
4.3 ETERNUS CS8000 as Consolidation Platform
5 Disaster Recovery with ETERNUS CS8000
5.1 Overview
5.2 DR Solutions Limited to One Site
5.3 DR Solutions Meeting More Aggressive Service Levels
5.4 Two Data Centers and One Split Site ETERNUS CS8000
5.4.1 Disaster-Resilient Architecture
5.4.2 Failure
5.4.3 Fallback
5.5 Two Data Centers and Two ETERNUS CS8000 Systems Connected by Long Distance Replication
5.5.1 The Challenge
5.5.2 Disaster-Resilient Architecture
5.5.3 Failure
5.5.4 Fallback
6 General Availability Features of ETERNUS CS8000
7 Conclusion
1 Introduction
Continuous operation, ideally with no downtime, is vital for all companies that depend on IT technology. System outages, for whatever reason, have an immediate negative impact on the business. Customers who cannot access IT services quickly enough move to alternatives, and if this happens repeatedly a company's reputation may be damaged. In addition to high availability of server and storage hardware as well as data mirrors, snapshots, and replication, data protection is a vital element in ensuring appropriate service levels. ETERNUS CS8000 is an extremely powerful solution element in this context and helps backup software-based data protection infrastructures reach previously unattainable service levels.
This white paper explains how ETERNUS CS8000 enables more aggressive service levels for data protection, including disaster resilience. It helps if the reader is already familiar with ETERNUS CS8000 and VTL (Virtual Tape Library) concepts on a general level, because explaining every feature and function of the product in detail would go beyond the scope of this paper. Nevertheless, care was taken to provide at least short explanations of the ETERNUS CS8000 and VTL-specific terminology.
In addition to backup, ETERNUS CS8000 is also a very attractive complement for archiving and second-tier file storage solutions, thereby
providing a unified backup and archiving target. Because this white paper covers the topic of disaster resilience mainly from a backup
application point of view, archiving is not covered here but will be reserved for another document concentrating on that subject.
2 The Role of Backup
As soon as continuous operation is the goal, many people immediately start to think about high availability concepts for server and storage hardware. In addition, mirror, snapshot, and replication techniques are used on the data level to come close to the ultimate goal of RPO = RTO = 0 (RPO: Recovery Point Objective, the point in time you have to go back to in order to find the last valid and saved copy of lost data; RTO: Recovery Time Objective, the amount of time needed to restore that copy). However, even with these technologies it is not possible to keep this ambitious RTO/RPO promise for all potential threats, even if the promise is a non-zero value in the range of minutes and hours. Consider for a moment that a software bug in a database is causing data corruption. This data corruption is immediately mirrored or replicated to all data copies and renders these high-availability targets useless. If you also consider that this data corruption is not detected for several days, snapshot technologies will usually not be able to cover this time period, either.
Another case could involve an important file being accidentally erased, with the deletion undiscovered for a longer period of time. In the end this is the same case as before: mirror, replication, and snapshot technologies will not help to recover the lost file.
Last but not least, major catastrophes can easily destroy all mirrored / replicated / snapshot data copies, which again leads to a situation with
no RTO-RPO promise at all.
Common to all these scenarios is the fact that only a traditional backup scheme provides the ability to recover from these types of data loss. Although a backup is not able to meet the very aggressive RPO = RTO = 0 service level, it backs up the more aggressive promises of the mirror / replication / snapshot techniques, because these are not able to keep their promise for all possible (and probable) threats (cf. figure 1). In this sense a backup is the final insurance in case the other methods to secure infrastructure and data do not work, and it is therefore a vital and indispensable element in every attempt to maximize IT operation continuity.
Figure 1: Mix of different technologies
In essence, traditional backup offers two important features to increase data availability:
 The ability to “travel back in time” to access (multiple) older data copies in case the newer ones are damaged or deleted
 The ability to maintain data copies over very long distances
You could argue that these features are also available with an intelligent combination of data mirrors, replication, and snapshots, but in
reality these concepts only work within certain boundaries and can become very expensive. Especially in database environments (and these
are very often mission-critical) daily change rates can be so high that the ability to cross larger distances and maintain older versions long
enough becomes unrealistic or at least economically unfeasible. This leaves most customers with the (widely accepted) mix of all these
technologies (cf. figure 1), i.e. the usage of mirror / replication / snapshot functions to meet aggressive RPO / RTO goals and the usage of
traditional backup methods to either assure these aggressive RPO / RTO goals or to have an economically more feasible alternative especially
in case of high, daily data change rates and long distances to cross.
3 The role of ETERNUS CS8000
As outlined in the last chapter, a backup infrastructure is an important complement to every IT infrastructure in order to secure a minimum
service level for all major threats that could harm continuous operation. One important element in a backup infrastructure is the ability to
store multiple versions of primary data for several weeks or months. This guarantees that recovery of older data versions is still possible in
case errors are detected only after some time. In order to survive catastrophes backup copies must also be available at remote sites. A remote
site can easily be several hundred kilometers away from the primary data site.
A question that now arises is whether it is necessary to implement specific robustness in the backup infrastructure itself. At first sight it could be considered enough to just provide a single backup path with very limited redundancy, because the backup path is not part of
the productive IT infrastructure and a failure within the backup infrastructure would have no immediate effect on the continuous operation of
the primary IT environment. This perception will, however, easily change if mission-critical data needs to be restored that is otherwise no
longer available. If the backup system also fails during this outage, no possibility is left to restore mission-critical data. The consequences will
be significant. Even the case that backup data is still there but is temporarily inaccessible can have an extremely negative effect, because the
recovery time is likely to increase to unacceptably high values.
ETERNUS CS8000 now adds scalability, extended capacity, robustness, and disaster resilience to the backup repository, which not only enables backup applications to meet more aggressive service levels but also simplifies backup operation by avoiding complex backup policies in the first place.
4 How ETERNUS CS8000 adds robustness to backup and recovery
4.1 Backup application architectures
Before we dive deeper into the ETERNUS CS8000 architecture, we will have a brief look at backup application architectures. Today most
backup applications consist of three major elements (cf. figure 2):
 Backup clients: These are the systems that contain the data to be backed up. This includes small workplace systems as well as large
database servers.
 Media servers: These are data movers that receive data from backup clients and store it in a backup repository. ETERNUS CS8000 can be such a repository. Depending on the number of clients and their throughput requirements, multiple media servers can be deployed to perform the daily backup. Depending on the backup application, other terms like "storage node" or "media agent" are used to describe the same function. Throughout this document we will refer to this function as "media server", independent of the individual backup application in use.
 Backup server: This instance is responsible for the overall backup management. Very often only one backup server is necessary to manage a complete backup infrastructure consisting of thousands of clients and many media servers. Depending on the backup application, other terms like "master server" or "commserve" are used to describe the same function. Throughout this document we will refer to this function as "backup server", independent of the individual backup application in use.
Figure 2: Major elements of backup applications
It is quite apparent that the media server is the scaling element in this architecture. If backup requirements grow, new media servers can be
deployed under the centralized management of a backup server. This media server concept also adds redundancy to the overall backup
architecture. If a media server fails, another one can take over the workload and ensure continuous backup operation. Only the backup server
itself is a potential single point of failure, which is the reason that backup servers are very often operated in a clustered setup. There have
been many concepts implemented to secure a backup server and its master data, but it would be beyond the scope of this document to
describe them all.
4.2 How ETERNUS CS8000 Complements Backup Application Architectures
4.2.1 ETERNUS CS8000 Grid Architecture
A very similar architecture is used in ETERNUS CS8000. The system as a whole behaves like one very large tape library. The (logical) tape drives of this library, however, are realized via a grid of processors called ICPs (Integrated Channel Processors), which provide the logical (tape) drives (LDs) for the media servers. As the number of media servers grows, the number of ICPs can be increased as well; they are managed via one centralized instance in ETERNUS CS8000. Just as media servers may fail, ICPs may fail as well, but the remaining ICPs will continue to work and take over the additional load. It is therefore recommended to connect media servers not only to one ICP but to multiple units. Ideally a media server has access to all ICPs installed inside an ETERNUS CS8000, because that also makes load balancing easier.
However, the grid of ICPs would be useless if it were not able to share data. This data sharing must take place on a scalable storage system that also offers sufficient bandwidth for all ICPs. As in any VTL, the first level of this storage is realized on disk. This disk repository is named the Tape Volume Cache (TVC) and consists of a number of RAID systems (RAID 5 or 6) that can be scaled according to the number of implemented ICPs. All ICPs and all RAID systems are connected via a redundant high-speed internal SAN. Finally, a clustered file system called CAFS (CentricStor Appliance File System) enables access by all ICPs to the same data; all virtual tapes in the TVC are stored in a number of such CAFS file systems. This CAFS is in fact the key element that makes it possible for ICPs to take over the work of others, because it allows all ICPs to access the same data (cf. figure 3).
Figure 3: ETERNUS CS8000 Grid Architecture
In a nutshell, the ETERNUS CS8000 Grid Architecture is able to meet almost every performance target by growing its number of ICPs and RAID
systems according to the number of deployed media servers. And it is able to provide redundant access to the backup repository that ideally
complements the backup application concept of media server failover. If one media server fails, another one can not only take over its work,
but also get access to all the backup data the failed media server has already written. As a consequence data can be restored independent of
media server availability. On the other hand a media server will be connected to multiple ICPs which means that in case of an ICP failure it
can still use (logical) tape drives of other ICPs that still allow access to all previously written backup data (cf. figure 4).
Figure 4: Redundant access to the backup repository
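To make this failover behavior concrete, the following short Python sketch models media servers mounting logical drives on several ICPs in front of one shared, CAFS-backed volume pool. It is a conceptual illustration only; the class and function names are invented for this example and do not correspond to any ETERNUS CS8000 interface.

```python
# Conceptual sketch: logical drives spread over several ICPs that all
# share one CAFS-backed volume pool, so any ICP can serve any volume.
from dataclasses import dataclass, field

@dataclass
class ICP:
    name: str
    online: bool = True
    mounted: set = field(default_factory=set)    # volumes currently mounted here

class GridLibrary:
    """Toy model of a grid of ICPs in front of a shared TVC/CAFS."""
    def __init__(self, icps):
        self.icps = icps
        self.cafs_volumes = set()                # volumes visible to every ICP

    def write(self, volume: str) -> str:
        icp = self._pick_icp()
        icp.mounted.add(volume)
        self.cafs_volumes.add(volume)            # data lands in the shared file system
        return icp.name

    def restore(self, volume: str) -> str:
        if volume not in self.cafs_volumes:
            raise LookupError(f"{volume} not in the backup repository")
        return self._pick_icp().name             # any surviving ICP can read it

    def _pick_icp(self) -> ICP:
        alive = [i for i in self.icps if i.online]
        if not alive:
            raise RuntimeError("no ICP available")
        return min(alive, key=lambda i: len(i.mounted))   # crude load balancing

grid = GridLibrary([ICP("ICP0"), ICP("ICP1"), ICP("ICP2")])
print(grid.write("LV0001"))      # e.g. ICP0 takes the backup
grid.icps[0].online = False      # ICP0 fails
print(grid.restore("LV0001"))    # restore still works via another ICP
```

Because every ICP reads from the same clustered file system, the restore after the simulated ICP failure succeeds without any data movement, which mirrors the behavior described above.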
4.2.2 True Tape Virtualization
The grid architecture concept provides an elegant method of scaling a system and adding redundancy to individual components like ICP or
the TVC. But what happens if data is lost (i.e. data in the backup repository is lost and no longer available for restores)? Most backup
environments today address this by creating additional data copies of the backup images, sometimes also referred to as clones. Clones need
to be managed on a backup application level which can create significant overheads:
 All data needs to be read and written a second time, thereby requiring significant bandwidth of the backup network
 Every clone will result in new index information stored in the backup server, which might cause problems with some backup applications,
especially if backup targets include file systems with many files.
 It might take a while until the generation of clones is started, which leads to a situation in which only one backup copy is available for a
significant time.
 The overall backup policy management and scheduling gets more complex. Especially the definition and management of additional backup
copies creates additional administration efforts and should not be underestimated.
ETERNUS CS8000 greatly simplifies this form of redundancy management and in addition offers better automation in error situations. As soon as a tape (or better, a logical volume (LV), i.e. the emulation of a tape with a capacity of 0.8 to 200 GB) has finished writing / reading, e.g. is unmounted, it is copied from the TVC to a physical volume (PV), i.e. a real tape in an automated tape library. Up to three of these physical tape copies (all created in parallel) are supported; the feature is known as Dual Save / Triple Save. Together with an additional disk mirror, which is explained later, up to five copies of an LV can exist inside ETERNUS CS8000. These five copies, however, are transparent to the backup application, which still sees only one (cf. figure 5).
As a result the backup application only needs to take care of one backup copy without any need to generate additional clones. Access to all data copies is automated and completely transparent to the backup application. For example, if access to an LV stored on a PV fails, the I/O request is redirected to the next LV copy on another PV, and furthermore an automated self-healing mechanism is initiated that will recover the LV copy on the failed PV. All this is completely transparent to the backup application, which will not notice that there was an error at all.
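The read-redirect and self-healing behavior can be pictured with a few lines of Python. This is purely illustrative; the names used here (PhysicalCopy, read_lv, the repair queue) are invented for the example and are not an ETERNUS CS8000 interface.

```python
# Illustrative sketch of redundant LV copies: if one physical copy fails,
# the read is served from another copy and a repair of the bad copy is queued.
from dataclasses import dataclass

@dataclass
class PhysicalCopy:
    tape: str
    healthy: bool = True

def read_lv(copies: list[PhysicalCopy], repair_queue: list) -> str:
    for copy in copies:
        if copy.healthy:
            return f"LV restored from {copy.tape}"
        repair_queue.append(copy)          # schedule self-healing for the bad copy
    raise IOError("no readable copy left")

repairs: list[PhysicalCopy] = []
copies = [PhysicalCopy("PV0100", healthy=False), PhysicalCopy("PV0200")]
print(read_lv(copies, repairs))            # served from PV0200, PV0100 queued for repair
print([c.tape for c in repairs])           # ['PV0100']
```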
Other backup appliances on the market do not offer this feature and depend solely on the backup application's capability to manage additional clone copies of the backup data. As a consequence, ETERNUS CS8000 does not depend on functionality like OST (Open Storage Technology, a Symantec backup interface that helps to better integrate Symantec backup software with backup appliances). In addition, all data copies inside ETERNUS CS8000 are created using the internal SAN and do not take away precious bandwidth from the customer's backup network.
Figure 5: The backup application only sees one copy – multiple copies managed by ETERNUS CS8000
In a nutshell:
 No clones need to be created by the backup application, thus saving bandwidth on the backup network. Instead, data copies are created and managed inside ETERNUS CS8000 using its internal network.
 No extra index information is created by the backup application, resulting in a leaner internal database.
 Data copies are created as soon as LVs are unmounted, i.e. parallel to the data ingest at the earliest possible time.
 The backup application only needs to manage one backup copy, leaving the redundancy task to ETERNUS CS8000. As a result backup policies become simpler.
 Last but not least, the self-healing mechanisms of ETERNUS CS8000 are superior to typical cloning scenarios of backup applications.
4.2.3 Split Site Configurations
Today many data centers distribute their IT operation across at least two sites. Every site uses the other one as a DR (disaster recovery) site. Server and storage infrastructure is available at both sites and data is mirrored or replicated in a metro cluster type fashion. Consequently a major disaster impacting one site can be compensated by the remaining one, because all infrastructure elements including the data are present at the surviving site to continue operation. The corresponding backup infrastructure should be set up in a similar way. From a backup application perspective, media servers would be distributed across the two sites and a failover scenario would be implemented for the backup server, which normally runs at one site.
For these environments ETERNUS CS8000 offers a unique architecture called the "split site configuration". The aim here is to preserve the idea of a single backup repository by simply distributing the grid components of an ETERNUS CS8000 equally across the two sites together with a strong interconnect. In case of a disaster at one site, enough components of ETERNUS CS8000 are still available and will continue to work. In contrast to other backup appliances, which are confined to a single site and will need some kind of failover to another site, an ETERNUS CS8000 system remains one system with internal components at two sites. This has the major advantage that in case of a disaster no failover is necessary. The system simply continues to run. The backup application continues to see the same backup repository and also does not need to fail over its operation to another one. This setup is best combined with a synchronous mirror of the TVC, called CMF (Cache Mirror Feature). CMF together with Dual Save (i.e. a secondary physical tape copy) will assure that not only the ETERNUS CS8000 grid components are available at both sites, but also all data instances on disk (i.e. in the TVC) and on physical tapes (cf. figure 6).
Figure 6: ETERNUS CS8000 Split Site Configuration - Components and data instances are available at both sites
All in all the distribution of a single system across two sites in a split site configuration plus the duplication of all backup data via a
synchronous mirror (CMF) and Dual Save (e.g. a second physical tape copy) offers a number of advantages:
 Each site contains enough components to survive a major disaster at the other site.
 If configured correctly, ETERNUS CS8000 continues to run without interruption even in case of a disaster.
 The concept of a single (virtual tape) library is maintained even in case of a disaster which makes it easier for a backup application to
recover.
 The ETERNUS CS8000 split site architecture ideally complements a mission-critical setup of a disaster-resilient IT architecture consisting of
highly available server and storage infrastructure distributed across two data centers including the deployment of state-of-the-art mirror,
replication, and snapshot technologies plus a backup application.
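As a rough mental model of what the synchronous TVC mirror (CMF) described above means for the data path, the sketch below shows a write that is acknowledged only after both sites have persisted the block. It is a simplified illustration of the general synchronous-mirror concept, with invented names, and not a description of ETERNUS CS8000 internals.

```python
# Simplified model of a synchronous mirror: an LV block write is acknowledged
# to the writer only after both site copies have been persisted.
class SiteCache:
    def __init__(self, name: str):
        self.name = name
        self.blocks: dict[int, bytes] = {}

    def persist(self, block_id: int, data: bytes) -> bool:
        self.blocks[block_id] = data          # stands in for a RAID-protected write
        return True

def mirrored_write(site_a: SiteCache, site_b: SiteCache, block_id: int, data: bytes) -> bool:
    ok_a = site_a.persist(block_id, data)
    ok_b = site_b.persist(block_id, data)
    return ok_a and ok_b                      # acknowledge only if both sites hold the data

a, b = SiteCache("Site A"), SiteCache("Site B")
assert mirrored_write(a, b, 1, b"backup stream chunk")
# After the acknowledgment, either site alone is sufficient to restore block 1.
assert a.blocks[1] == b.blocks[1]
```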
4.3 ETERNUS CS8000 as Consolidation Platform
All these technology elements like the grid architecture, the true tape virtualization, and the split site capability position ETERNUS CS8000 as
an ideal consolidation target. On the one hand, it can easily scale according to capacity (> 1 Exabyte) and throughput (> 100 TB/h)
requirements and on the other hand it can be configured with no single point of failure, and as we will see in the next chapters, it can also
work as a completely disaster-resilient backup appliance.
In order to fully exploit these capabilities ETERNUS CS8000 supports all major backup applications and is also able to serve them in parallel at
the same time. In addition, it is able to support FICON-based mainframe environments in parallel to open systems within the same
appliance, which further extends its consolidation capabilities. And last but not least ETERNUS CS8000 supports all major tape library and
drive vendors in its own back-end.
5 Disaster Recovery with ETERNUS CS8000
5.1 Overview
In the past chapters we have seen that despite mirror, replication, and snapshot technologies traditional backup methods are indispensable
for IT environments that contain mission-critical components. We also understood that in addition to the productive IT environment backup
infrastructures themselves must also be kept operational under all circumstances. It was also explained that backup applications are able to
maintain a significantly higher level of operability and meet more aggressive SLAs with the help of ETERNUS CS8000, whose features form the basis for a previously unattainable level of disaster resilience. But before we dive deeper into the special functionality of ETERNUS CS8000, let's have a look at a DR solution limited to one data center site.
5.2 DR Solutions Limited to One Site
In a data center limited to one site the corresponding DR solution is just able to preserve the data in case of a disaster. Less focus can be put
on the recovery time, because the ability to implement any form of continuous operation would require a second site. As a consequence
a simple, straightforward backup architecture is deployed with minimum built-in redundancy, with the primary goal of creating just one backup
copy that is somehow kept at a remote site.
In its simplest form, physical tapes are created and vaulted (cf. figure 7). This concept will just preserve the data and in case the primary site
is lost it is unknown how long it will take to organize new hardware and software and put them in a state to recover data from the vaulted
tapes. This is not a desirable solution for a mission-critical environment that usually relies on some form of duplicating the data plus its
related IT infrastructure (i.e. application, server, storage, and data) across two active sites.
Figure 7: Data center limited to one site - Physical tapes are created and vaulted
Even in environments that are limited to one site, ETERNUS CS8000 can add significant robustness, as already described in the previous chapters (cf. figure 8).
Figure 8: ETERNUS CS8000 adds robustness even in data centers which are limited to one site
ETERNUS CS8000 can provide significant robustness
 The ETERNUS CS8000 grid architecture allows the scale-out implementation of logical (tape) drives (LDs, i.e. emulations of tape drives, e.g. of the LTO type) across multiple nodes (ICPs). Media servers will use LDs belonging to different ICPs and will therefore be able to continue operation in case an ICP is lost.
 A clustered file system allows access to the same LVs by multiple ICPs (and their corresponding media servers). Data is always available for restore, even in case of the loss of a media server / ICP.
 The clustered file system runs on top of the RAID-protected systems forming the TVC.
 In addition, LV data is copied to physical tapes and stored in multiple copies transparently for the backup application, including a self-healing mechanism in case one copy is lost or corrupt. This further increases data availability and keeps the backup application's storage policy setup very simple, i.e. only one copy has to be maintained by the backup application, and ETERNUS CS8000 takes care that this copy is always available.
5.3 DR Solutions Meeting More Aggressive Service Levels
ETERNUS CS8000 usually unfolds its full capabilities in mission-critical environments. These environments require more than just the ability of
the data to survive a major outage or disaster. As outlined before, mission-critical IT architecture involves at least two data center sites that
under normal conditions work independently of each other but will be able to take over the work from the other site in case of a disaster. In
its simplest form, the two sites are realized by splitting one data center into two fire areas. In a more sophisticated environment the sites are
connected over longer distances (typically several kilometers). With growing distances challenges increase that are induced by growing
latencies in the I/O chain.
In practice this leads to two main data center architectures:
 For distances up to the double digit km range two data center sites are used that are tightly coupled in the sense that primary data is
synchronously mirrored. In this architecture the focus is on having all data center infrastructure elements, i.e. server, storage, and data,
available at both sites 100% synchronized at any time in order to continue IT operation even in case of a complete data center loss at one
site. This architecture is ideally complemented by an ETERNUS CS Split Site Configuration. Not only will the primary IT infrastructure
seamlessly continue to work with such a split site configuration, but also the backup infrastructure. As a result the whole IT environment will
continue to run with optimum data protection although running with fewer components, thereby providing enough robustness until
complete recovery of the lost site is achieved (cf. figure 9).
Figure 9: Architecture for two data centers with synchronous mirror
 For very long distances or limited bandwidth two sites will still be considered, but they will certainly be less tightly coupled. Asynchronous
replication mechanisms will be more appropriate and also backup applications will most likely play a more important role in continuing IT
operation at the remote site, as they are another important vehicle to make data available at the remote data center in addition to
replication. For these architectures two individual ETERNUS CS8000 systems can be coupled with long distance replication mechanisms (cf.
figure 10).
Figure 10: Architecture for two data centers with asynchronous replication
Before we get into the details of these two disaster-resilient architectures, let’s summarize the different DR approaches and the
corresponding contribution by ETERNUS CS8000:
IT architecture: One data center; backup data is vaulted; general robustness is optimized by using the ETERNUS CS8000 grid architecture.
Environment: Only one data center is operated. The backup solution including ETERNUS CS8000 will be the only instance creating data copies that will be stored somewhere outside the data center.
Usage: Not a real DR solution. The focus is on preserving data in case of a disaster. The RTO is very often unknown. ETERNUS CS8000 delivers maximum robustness within the framework of this single data center approach and will also create additional backup copies for vaulting.

IT architecture: Two data centers within reasonably small distance; use of the ETERNUS CS8000 split site architecture.
Environment: Two 100% synchronized data centers are operated at all levels (server, storage, data, backup). Continuous operation is key and is consequently not only available for the primary storage and server infrastructure but also for the backup infrastructure via ETERNUS CS8000.
Usage: ETERNUS CS8000 is the vital element to ensure that together with the primary server and storage infrastructure a fully functional backup / restore environment is also available at the second data center without interruption, thereby securing mission-critical applications that will lack a mirror in case of a disaster until recovery of the failed site.

IT architecture: Two data centers connected via very long distances; use of two ETERNUS CS8000 systems with the long distance replication feature.
Environment: Due to the increased latencies induced by very long distances the concept of 100% synchronicity cannot be maintained. Nevertheless, long distance replication of backup data is possible by connecting two ETERNUS CS8000 systems to enable a restart of the IT infrastructure at the remote site in case of a disaster.
Usage: ETERNUS CS8000 is the vital element to ensure that backup data is automatically and efficiently replicated across very long distances to a remote site and is instantaneously available for a quick restore.
5.4 Two Data Centers and One Split Site ETERNUS CS8000
5.4.1 Disaster-Resilient Architecture
The core element of the architecture is one logical ETERNUS CS8000 system which is distributed over two geographically separated sites ("split site configuration"). We will refer to these sites as "Site A" and "Site B". All vital components like
 nodes which emulate the tape drives (ICPs)
 the Tape Volume Cache (TVC)
 back-end nodes (IDPs, Integrated Device Processors) which operate the physical tape drives and move data back and forth between the TVC and the physical tape library (PTL)
 Virtual Library Processors (the VLP and its standby, the SVLP, which takes over the work of the VLP in the course of a backup-application-transparent AutoVLP failover) which manage the system as a single entity
are present at both sites, so that in case of an outage or a disaster affecting a whole site the other one still contains enough components to continue operation (cf. figure 11).
Figure 11: ETERNUS CS8000 split site configuration
The ETERNUS CS8000 internal and redundant LAN and SAN infrastructure (not visible to the outside world) which connects all components is
thereby extended to a second site. The advantage is obvious: Although distributed across two sites, there is still one single system
representing one system state that is always available at both sites. The system state (i.e. the configuration and database) is always
synchronously mirrored across the two sites.
In contrast to configurations involving two independent systems, there is no classical failover necessary from one to the other system but
instead the system simply continues to run with a reduced number of components. In order to facilitate a smooth continuation a number of
features are available, which ensure that not only the essential ETERNUS CS8000 components are available at both sites but also the logical
volume data. This is achieved via
 a synchronous mirror of the configuration and internal database data
 Logical volumes (LVs) stored on disk in the TVC that can be synchronously mirrored via CMF
 Logical volumes (LVs) stored on physical tapes (PVs) that can be maintained in up to three copies using Dual / Triple Save with at least one
copy being stored at the second site.
All this guarantees that as soon as the backup application has written any data to ETERNUS CS8000 this data is already disaster protected (if
written to a mirrored portion of the TVC – data written to a non-mirrored portion of the TVC will only be disaster protected when at least a dual
save copy on physical tape has been created) and is accessible via a redundant infrastructure spread across two distant sites.
Split site configuration: Benefits and key architectural elements
Benefits:
 Interruption-free continuation of ETERNUS CS8000 operation if one site goes down
 Complete takeover of all tasks of the failed site
 All LVs written to a mirrored portion of the TVC will be available immediately after a disaster
 All LVs written to a non-mirrored portion of the TVC that have already been written (with Dual Save) to PVs are available after a system reconfiguration.
 All aspects of the cross-site, redundant data storage (CMF, Dual Save) are handled by ETERNUS CS8000. The backup application is relieved of these tasks.
Key architectural elements (see also the considerations in the chapter "General Availability Features of ETERNUS CS8000"):
 All ETERNUS CS8000 components are distributed across the sites so that one site contains the minimum set to continue operation.
 The ETERNUS CS8000 internal SAN and LAN are extended from site A to site B.
 Cross-site connections of the media servers to ICPs / logical (tape) drives (LDs), thereby enabling continuous access to ETERNUS CS8000 by all media servers even in case of a disaster.
 Remote Dual Save to PTLs set up at both sites to guarantee LV copies on PVs at every site.
 TVC RAID systems evenly distributed over both sites and a mirror configured for those LVs that must survive a disaster without interruption.
 VLP and SVLP are distributed over the two sites and linked via redundant SAN and LAN.
 The redundant SAN and LAN connections between the two sites should be routed via alternative paths to avoid a communications failure as far as possible (please refer to a later chapter for details).
To come straight to the point: this architecture represents the optimum in disaster protection, including automatic continuation in case of an outage disabling a complete site. An ETERNUS CS8800 model is a requirement.
5.4.2 Failure
There are three basic disaster scenarios to be distinguished:
 Failure of site A
 Failure of site B
 Total failure of all communications between sites A and B
A site failure is handled by the (backup application transparent) AutoVLP failover mechanism that will activate the SVLP to become the new
VLP in case the VLP site is affected. Because the ETERNUS CS8000 database and the configuration data are synchronously mirrored across the
sites, every site has access to all relevant data to continue operation. This also includes immediate access to all LVs that have also been
mirrored.
Total communications failure between the two sites constitutes a special situation. A failure of this kind leads the VLP and the SVLP each to assume that the other site has failed and, for its own site, either to continue to exercise control (VLP) or to take over control (SVLP). This phenomenon, known as "split brain", results in a single consistent system dividing into two separate systems, which from this point forward have separate configurations that cannot easily be merged again in a fallback. To avoid this "split brain" behavior, a Tie Breaker Processor (TBP) is introduced at another site, site C (cf. figure 12).
Figure 12: Three basic disaster scenarios
Together with the VLP at site A and the SVLP at site B, the TBP works as a so-called quorum node. All three quorum nodes (VLP, SVLP, and
TBP) communicate via LAN with each other to decide in which form the ETERNUS CS8000 system is operated. In a normal situation all three
quorum nodes are up and running resulting in a fully functional ETERNUS CS8000 system at both sites that is managed by the VLP.
In the event of malfunction, ETERNUS CS8000 responds with the behavior described below:
Disaster situation: Failure of site A
Behavior:
 The VLP quorum node cannot be reached by the TBP and the SVLP.
 The SVLP and the TBP gain cluster majority.
 The AutoVLP failover mechanism is executed and the SVLP becomes the new VLP at site B.
Result: Continued operation. Site B takes over operation under the control of the SVLP, which now becomes the new VLP.

Disaster situation: Failure of site B
Behavior:
 The SVLP quorum node cannot be reached by the TBP and the VLP.
 The VLP and the TBP gain cluster majority.
 The VLP continues to manage the remaining ETERNUS CS8000 components at site A.
Result: Continued operation. Site A continues operation.

Disaster situation: Total communications failure
Behavior:
 The VLP quorum node and the SVLP quorum node cannot reach each other.
 However, the TBP can still reach the VLP and the SVLP.
 Based upon its LAN interface configuration, the TBP decides with which other quorum node it will gain cluster majority.
 If the selected quorum node is the VLP (and this is the case in a normal configuration), it will continue operation at site A. Site B will be expelled from the cluster.
 If the selected quorum node is the SVLP, the AutoVLP failover mechanism is executed and the SVLP becomes the new VLP at site B. Site A will go down.
Result: Continued operation. One site continues operation while the other becomes unavailable.
In a nutshell, the TBP is introduced as an additional instance in order to distinguish a total communications failure from a site failure. The table above discusses all failure cases except the failure of the TBP itself. In this case the VLP and the SVLP quorum nodes can still reach each other and will therefore be able to gain cluster majority and continue operation without any interruption. If, however, in this state the (redundant) LAN connection between VLP and SVLP is also interrupted by any means, there will not be enough connected quorum nodes left to form a new cluster (more than half of the quorum nodes must be up and running and connected to each other). In this case the system will not be able to continue to run automatically but will instead need a small system intervention to restart.
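The quorum rule can be summarized in a few lines of Python. This is a conceptual sketch of the majority decision among the three quorum nodes, not ETERNUS CS8000 code; the function and the example partitions are invented for illustration.

```python
# Conceptual sketch of the three-node quorum (VLP, SVLP, TBP): a partition may
# only keep running if it holds more than half of the quorum nodes.
NODES = {"VLP", "SVLP", "TBP"}

def partition_decision(partition: set[str]) -> str:
    """Decide what a group of mutually reachable quorum nodes does."""
    if len(partition) <= len(NODES) / 2:
        return "no majority: this partition stops (manual restart needed)"
    if "VLP" in partition:
        return "majority with the VLP: operation continues under the existing VLP"
    return "majority with the SVLP: AutoVLP failover, the SVLP becomes the new VLP"

print(partition_decision({"SVLP", "TBP"}))  # failure of site A
print(partition_decision({"VLP", "TBP"}))   # failure of site B, or split brain resolved in favor of site A
print(partition_decision({"SVLP"}))         # failure of the site hosting both VLP and TBP
```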
If continuous operation is imperative, care should be taken that the TBP
 is located at a 3rd site (site C) independent of site A and B
 maintains LAN connections to the VLP and SVLP that are ideally independent from the connection between VLP and SVLP.
Not in all customer environments is it possible to install the TBP at a 3rd site. As an alternative the TBP can be installed together with the VLP
at one site. As long as the site containing TBP and the VLP is not affected, the situation will be the same as described before. Only in case the
site containing the TBP and the VLP is lost will different behavior occur:
Disaster situation: Failure of the site containing TBP and VLP
Behavior:
 The VLP and TBP quorum nodes cannot be reached by the SVLP.
 The SVLP has no majority to manage the cluster.
Result: Manual restart. The other site takes over operation under the control of the former SVLP, which now becomes the new VLP.
The following picture emerges in all cases after the surviving site has resumed operation:
 ETERNUS CS8000 operates with the remaining ICPs / LDs and IDPs / PDs (physical tape drives).
 One control component (the VLP, or the SVLP promoted to the new VLP) is running.
 Previously mirrored components (TVC, ETERNUS CS8000 database and configuration) continue to exist without a mirror.
 LVs belonging to a non-mirrored file system (CAFS) will only be accessible if a (Dual / Triple Save) copy has been written to PVs. Surviving resources will be reassigned to make them available again.
 Only the PTLs of one site may have survived. Dual Save is temporarily suspended.
5.4.3 Fallback
Once the failed or inactivated site can be put back into operation, the following important steps must be carried out:
 Make the new / recovered ICPs, IDPs, the VLP, and the RAID system known to ETERNUS CS8000
 Start resynchronization of the mirrored portion of the TVC to resume CMF operation
 Move non-mirrored LVs that had previously been moved to the surviving site back to their original place.
 Reintegrate the PTL of the failed site
– Lost PVs can be regenerated by making new PVs available to the Dual / Triple Save setup, which in turn will automatically recreate them based on the information of the surviving copy.
5.5 Two Data Centers and Two ETERNUS CS8000 Systems Connected by Long Distance Replication
5.5.1 The Challenge
Although the split site architecture described in the last chapter represents the best way to add robustness to backup environments including
disaster resilience while still meeting very aggressive RTO targets, it will reach its limits with growing distance between the two sites. For
distances very much beyond 100 km the latencies induced by growing round trip times will render this concept useless.
This is because in every IT environment at least a group of I/O transactions needs to be acknowledged from time to time. The speed by which
these I/Os are acknowledged depends on the time required to send the I/O from the source to the target and the time to confirm the I/O back
to the source. Not counting the delays caused by network equipment like routers etc., the distance between source and target plays an
important role. The maximum speed at which an I/O signal can travel is limited by the speed of light, which is reduced to about 200,000 km/s in fiber optics. Although this is not a problem within a data center, where distances are only a few meters, it certainly becomes one for larger distances. Let's have a look at the example of 100 km: the round trip time from the source to the target and back again to the source will be 200 km / 200,000 km/s = 1 ms (cf. figure 13).
Figure 13: Round trip time – 100 km example
This value already adds a measurable overhead to normal I/O performance. Although you can try to minimize the number of
acknowledgments in a total I/O flow (and this is in fact offered by modern network devices today via techniques called “buffer to buffer
crediting”, etc.) it will in the end remain a problem (cf. figure 14).
Figure 14: WAN round trip time - example with buffer to buffer crediting
That leaves us with the following unusual situation: although data can still be sent at very high rates like 8 Gb/s FC or 10 Gb/s Ethernet, the speed of a single data stream is slowed down by the above-mentioned latency. You can also rephrase this into a more positive statement, saying that although a single data stream may be slowed down by latency effects, the presence of multiple data streams at the same time can compensate for this, because the bandwidth is still high.
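The arithmetic behind this can be made concrete with a few lines of Python. The numbers used below (a 100 km link, 64 KB acknowledged per round trip, ten parallel streams) are illustrative assumptions for this example, not ETERNUS CS8000 specifications.

```python
# Illustrative latency arithmetic for a long-distance link.
SPEED_IN_FIBER_KM_S = 200_000          # roughly the speed of light in fiber optics

def round_trip_time_s(distance_km: float) -> float:
    return 2 * distance_km / SPEED_IN_FIBER_KM_S

def single_stream_mb_s(distance_km: float, chunk_mb: float) -> float:
    """Throughput if 'chunk_mb' must be acknowledged before the next chunk is sent."""
    return chunk_mb / round_trip_time_s(distance_km)

rtt = round_trip_time_s(100)                            # 0.001 s = 1 ms, as in the 100 km example
one_stream = single_stream_mb_s(100, chunk_mb=0.064)    # ~64 MB/s despite a fast physical link
ten_streams = 10 * one_stream                           # parallel streams fill the available bandwidth
print(f"RTT: {rtt*1000:.1f} ms, 1 stream: {one_stream:.0f} MB/s, 10 streams: {ten_streams:.0f} MB/s")
```

The point of the example is only the ratio: the physical link could carry far more, but a single acknowledged stream is capped by the round trip time, whereas ten concurrent streams come much closer to filling the line.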
An ETERNUS CS8000 system based on split site architecture will not be able to address this situation because it is unaware of it. However, we
will see in the next section that two individual ETERNUS CS8000 systems are indeed able to address these circumstances by using the “Long
Distance Volume Replication” feature. But before we explain this in more detail we need to understand a few other aspects that are
important to master the challenges of growing distances.
 For example, the network protocols that are used for long distance replication will more likely be TCP/IP-based as opposed to SCSI-over-FC-based. An efficient interconnect between two sites should therefore be built around TCP/IP.
 With TCP/IP being the dominant protocol used for longer distances, the likelihood also increases that data streams might be routed over public lines, or at least lines where it cannot be ruled out that third parties are able to access the data stream. This requires the ability to encrypt data at least during transport and to implement a robust authentication scheme between the involved systems.
 Sufficient network bandwidth becomes a costly issue with growing distances. Therefore care should be taken to minimize the total amount
of data that is replicated on a regular basis.
 Last but not least, the probability that I/O errors will occur and may interrupt longer running I/O processes will also increase with growing distance (and the number of involved network components). This is best addressed by a transport algorithm that allows effective error recovery without the need to restart longer I/O streams from the beginning each time they are interrupted by a less reliable network.
5.5.2 Disaster-Resilient Architecture
As outlined in the last chapter, a number of special requirements exist for larger distances that suggest an architecture other than a split site configuration. At the same time, the advantages of a split site configuration, i.e. the paradigm of continuous backup and restore services, should be preserved as far as possible. For this purpose "Long Distance Volume Replication" is designed to
effectively connect two ETERNUS CS8000 systems over longer distances. We will refer to the two systems in the following way:
 A source ETERNUS CS8000 system at site A or the source site
 A target ETERNUS CS8000 system at site B or the target site
The concept is to integrate the replication mechanism in the normal back-end tape operation of the source ETERNUS CS8000 system and in
the normal front-end tape operation of the target ETERNUS CS8000 system (cf. figure 15). This means
 A number of logical tape drives on the target system are connected to the source system as if they were physical tape drives. In contrast to physical tape drives, the logical tape drive connections do not suffer from the "shoe shining" effect, i.e. they don't need to be operated in streaming mode but may also run at low speed.
 Instead of SCSI over FC, TCP/IP over Ethernet is used to connect the back-end of the source system to the front-end of the target system,
thereby offering full access to the most common long distance architectures.
 In addition to I/O latency induced optimizations on the network component level, the logical drive concept used on the target ETERNUS
CS8000 allows the cost-efficient operation of many drives in parallel so that the available bandwidth is used in an optimal way and latency
effects on a single stream level are compensated by parallelization of I/O streams.
 Logical volumes on the source site will be copied to the target site using its TCP/IP-based logical drives. Due to the 1:1 copy mechanism that
preserves the native tape format on the target site, they become immediately available to the backup application for fast restores.
 Seamless integration in the Dual / Triple Save concept makes it possible to maintain multiple copies of the logical volumes generated on the
source site. For example it will be possible to maintain a high performance copy on disk or physical tape on the source site for fast restores
(i.e. the copy remains on disk or tape at the source site), while a second copy is automatically (and transparently) created on the target site
using the TCP/IP connected logical drives over a longer distance.
 Data can be encrypted during transport, and an "ssh" (secure shell) based authentication mechanism ensures a proper identity check.
 Data will be compressed, and a delta transmission algorithm is used on top to minimize the amount of replicated data. This delta transmission technique also ensures that a broken I/O process will not require data transmission to be started over and over again but will instead continue based on the last synchronization point (see the conceptual sketch below).
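To illustrate the kind of resumable, delta-based transfer described in the last bullet, here is a minimal Python sketch. It works on whole chunks with SHA-256 digests and an in-memory target, which are simplifying assumptions made for this example; it is not the actual ETERNUS CS8000 replication protocol, and all names are invented.

```python
# Conceptual sketch of delta replication with a resume point: only chunks that
# differ from the target are sent, and a broken transfer continues where it stopped.
import hashlib

def chunk_digests(chunks: list[bytes]) -> list[str]:
    return [hashlib.sha256(c).hexdigest() for c in chunks]

def replicate(source: list[bytes], target: list[bytes], resume_from: int = 0) -> int:
    """Copy differing chunks from source to target, starting at a synchronization point.
    Returns the index of the next chunk to transfer (the new resume point)."""
    target_digests = chunk_digests(target)
    for i in range(resume_from, len(source)):
        digest = hashlib.sha256(source[i]).hexdigest()
        if i >= len(target):
            target.append(source[i])           # new chunk at the end of the volume
        elif target_digests[i] != digest:
            target[i] = source[i]              # changed chunk: send the delta
        # after every chunk the synchronization point advances
    return len(source)

source_lv = [b"header", b"block-1", b"block-2", b"block-3"]
target_lv = [b"header", b"old-1"]              # partially replicated copy at the target site
next_point = replicate(source_lv, target_lv, resume_from=1)
assert target_lv == source_lv and next_point == 4
```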
Figure 15: Long distance volume replication
From an overall perspective, the target ETERNUS CS8000 system will work like a physical tape library that is connected via TCP/IP (covering
tape drive emulation and library control) and is seen by the source system as an additional back-end target that is able to store additional
logical volume copies within the standard Dual / Triple Save scheme. Cf. figure 16 for example concepts.
Figure 16: Examples for Cascading concepts
Long distance volume replication with mono site source system
An ETERNUS CS8000 system is installed as a source in a non-split site configuration, maintaining a first data copy on disk / physical tape close to the source system. Another ETERNUS CS8000 system is operated as a target, maintaining a Dual Save copy of the source logical volumes on disk / in a physical tape library.

Long distance volume replication with split site source system
A split site ETERNUS CS8000 system is installed as a source, already maintaining two data copies (via Dual Save) on disk / physical tape in the same manner as described in the previous chapter in order to meet very aggressive RTO targets. A second ETERNUS CS8000 system is installed at a remote site over a very long distance using "Long Distance Volume Replication". This target system can also be a split site system, but will in most cases just be an ordinary mono site system. It will store a third data copy (as seen by the source system) via the Triple Save scheme. Essentially this system configuration is the combination of the two ETERNUS CS8000 architectures described in this paper: it allows maximum DR capabilities by offering split site architecture (including the support of two sites within reasonable distance) and involving a third data copy via "Long Distance Volume Replication" on a second ETERNUS CS8000 target system, which can be very far away from the split sites and is therefore suited as an additional DR layer in case a catastrophe were to affect the two sites covered by the split site configuration.

Long distance volume replication in both directions
Of course it is also possible to operate both ETERNUS CS8000 systems as source and target at the same time. This will result in a system configuration in which a first system replicates its data to a second one, with the second one also replicating data back to the first one.
In addition, concepts can also be realized where one source system replicates to multiple target systems (i.e. generates multiple DR copies)
and where multiple source systems replicate into one central target system (i.e. realizing a branch office backup concept). Mixed
environments are also possible that would allow “n” source systems to replicate to “m” target systems (cf. figure 17).
Figure 17: Example: Multiple source systems replicate to two target systems
In the following we will concentrate on the first two long distance volume replication concepts that are in fact just one, because in this
scenario there is no real difference between a split site and a non-split site source system.
5.5.3 Failure
In a nutshell, the “Long Distance Volume Replication” architecture consists of one ETERNUS CS8000 system being the active backup target,
while the second ETERNUS CS8000 system just works as a data repository holding an additional copy and is (not yet) directly accessed by the
backup application.
However, with only a short delay caused by the asynchronous nature of the replication mechanism, the second ETERNUS CS8000 system will
own a 100% identical copy of all LVs that have been created by the first system. This enables instantaneous access to the additional data
copies by the backup application. Please note that not only the content but also the names of the LVs will be identical to the LV names of the first system. This makes it possible for the backup application to access the volumes at the second site in case of a failure.
The following procedure will apply in this failure case to ensure a quick restart:
 Verify that the source ETERNUS CS8000 system at site A has really stopped working and no longer accesses its volumes. It is important that
only one system can access or modify LVs.
 Now the backup application can access the second ETERNUS CS8000 system at site B and configure it as a replacement for the first system.
Because all LVs are the same at the second system on site B the backup application already knows them and can easily re-catalogue them.
It is therefore recommended that the backup application does not see the target library during normal operation, i.e. as long as it uses the
source ETERNUS CS8000 system.
 The backup server of the backup application might also be affected by the failure. In this case a failover to another backup server must be organized (how this can be achieved is backup application-specific and cannot be covered in this document). After the backup server failover has been accomplished, it can configure the target ETERNUS CS8000 system at site B as the replacement for the first system.
After all this has been completed, the existing or new backup server will continue to work with the second ETERNUS CS8000 system at site B
in the same way as it used the first system. The main difference is that it will temporarily not use "Long Distance Volume Replication" and
will work with one LV copy less than before.
In summary, it can be said that because all LVs are replicated to the second site B in an identical manner, they can be quickly
accessed without major reconfiguration work by the surviving instance of the backup application. This provides a method for continuing
backup and restore operation at a remote site shortly after a major failure at the first site.
5.5.4 Fallback
The following procedure will ensure a proper fallback to the original configuration:
 Bring the source system or its replacement up again at site A (not yet connected to the backup application).
 Stop all backup activities on the target system at site B.
 Identify all LVs that have been changed since the start of operation on the target system at site B.
 Make the system at site B again available as target for “Long Distance Volume Replication” for site A
 Physical tapes (e.g. representing the first copy of a Dual Save setup on the source system) might have survived the disaster at site A. As a consequence you either rejoin or replace these volumes, depending on whether the associated LVs have been updated by the system at site B in the meantime.
 Reconfigure the backup application to again use the source system at site A. This may also require proper steps to fall back to the backup
application’s original backup server if this one had also been affected by the disaster. Make sure that the target system remains untouched
by the backup application to avoid confusion with replicated volumes the backup application is unaware of.
In any case the final result will be a new and up-to-date source system at site A, working as the backup repository for the backup application
and maintaining all LVs (typically a first copy of the LVs at site A and a second (Dual Save) copy via “Long Distance Volume Replication” on
the target machine at site B).
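The decisive input for the fallback is the set of LVs that changed at site B while it was the active backup target, because only these need to be reconciled at site A (including any surviving physical tapes that hold older copies of them). A minimal sketch, assuming a hypothetical inventory function that returns LV names with their last-modification timestamps; in practice this information would come from the backup application’s catalogue or the system’s administration interface.

```python
# Sketch: determine which LVs changed at site B while it was the active target.
# lv_inventory() is a hypothetical placeholder; the real data would come from the
# backup application's catalogue or the system's administration interface.
from datetime import datetime, timezone

def lv_inventory(site: str) -> dict[str, datetime]:
    """Placeholder: return {lv_name: last_modified} for the system at 'site'."""
    return {}

def changed_since(failover_time: datetime, site: str = "B") -> set[str]:
    """LVs written or modified at 'site' since the failover."""
    return {lv for lv, mtime in lv_inventory(site).items() if mtime >= failover_time}

if __name__ == "__main__":
    failover_time = datetime(2013, 11, 15, 8, 0, tzinfo=timezone.utc)  # example value
    stale = changed_since(failover_time)
    # These LVs - and any surviving physical tapes at site A that hold older
    # copies of them - must be reconciled before site A becomes the source again.
    print(f"{len(stale)} LVs need reconciliation at site A")
```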
6 General Availability Features of ETERNUS CS8000
ETERNUS CS8000 consists of different components which not only scale exceptionally well within its grid architecture, but also provide the redundancy on which highly available systems, and hence the previously described DR scenarios, are built. This chapter takes a more detailed view of the individual components from the availability perspective (cf. figure 18).
Specifically, we cover:
 The Integrated Channel Processor (ICP) and the associated Logical (tape) Drives (LDs)
 The Virtual Library Processor (VLP) and its standby component (SVLP) including the ETERNUS CS8000 database on the RAID system (“/DB”)
 The Tape Volume Cache (TVC) and the Logical (tape) Volumes (LVs) stored therein
 The Integrated Device Processor (IDP) and the associated Physical (tape) Drives (PDs)
 The Physical Tape Library (PTL) and the associated Physical (tape) Volumes (PVs)
Figure 18: Components of ETERNUS CS8000
ETERNUS CS8000 is available in various capacity stages. In the smaller systems, multiple components (e.g. ICP, IDP and VLP) have been
implemented in a single processor. In the large systems, such components are implemented in independent processors.
In this way a distinction can be made between three architecture classes which differ in terms of their availability features (cf. figure 19).
 ETERNUS CS8200 models are the “scale-up” systems. The components ICP, IDP and VLP (or SVLP) are implemented on one processor, which is also called an Integrated Universal Processor (IUP). The models consist of 2 IUPs with external TVC.
 ETERNUS CS8400 models are the “scale-out-single-site” systems. All the components are implemented in dedicated processors.
 ETERNUS CS8800 models are the “scale-out-split-site” systems. Here too, all the components are implemented in dedicated processors.
Furthermore, the aforementioned “split site configurations” are possible, in which a system can be distributed over two sites. This scenario
forms the basis for one of the DR architectures discussed in the previous chapters.
Figure 19: ETERNUS CS8000 - Three architecture classes
The availability features of ETERNUS CS8000 become steadily better with ascending model number, from CS8200 through CS8400 to CS8800. The following overview helps identify and exploit the optimization potential with respect to availability for each component (bear in mind that, depending on the ETERNUS CS8000 model as described above, several components may be housed in one processor). Wherever possible, the power, LAN and SAN infrastructure should be configured redundantly. If an entire site goes down due to a disaster, the optimizations described below will not help on their own; this case has been dealt with in the previous chapters.
Component: ICP / LD

Availability architecture / recommendations:
 Configure multiple ICPs in the system.
 Each media server should serve more than one LD.
 The LDs attached to a media server should be generated on different ICPs.
 The media servers’ systems should be connected via redundant SAN / FICON or ESCON infrastructures.

Information on failure / recovery:
 If an ICP goes down (and consequently also its LDs), the media server can continue on another LD/ICP combination.
 The backup application is responsible for redirecting new backup / restore jobs to the remaining LDs (see the sketch at the end of this overview).
 Starting with the ETERNUS CS8400, the ICP resides alone on a dedicated processor, i.e. no other components are affected by the failure.
 A new ICP can be installed and commissioned during live operation.

Component: VLP / SVLP

Availability architecture / recommendations:
 An SVLP should always be configured for each VLP.
 Starting with the ETERNUS CS8400, the SVLP runs on a separate ISP (Integrated Standard Processor, i.e. a dedicated processor for an ICP, IDP, VLP, or SVLP); on the ETERNUS CS8200 the function is available as an option and is accommodated along with the ICP and IDP functions on one IUP.
 The AutoVLP failover functionality should be activated.

Information on failure / recovery:
 The SVLP is automatically put into operation via the AutoVLP failover if the VLP fails. The SVLP takes over the functions of the VLP incl. the externally visible IP address.
 On the ETERNUS CS8400 and above the VLP/SVLP is contained on a separate processor, i.e. no other components are affected by the failure.
 A new VLP/SVLP can be commissioned with no interruption in operation.

Component: TVC / LV

Availability architecture / recommendations:
 The TVC is generally implemented using RAID5 or RAID6 technology.
 On the ETERNUS CS8400 and above the TVC can consist of two or more RAID systems.
 The ETERNUS CS8000 database and configuration data should be mirrored across two RAID systems to maximize system availability. In any case, additional metadata copies are kept on the VLP and on the physical tapes.
 A CAFS should consist of at least 2 LUNs (preferably spread across two RAID systems) to enable a mirror of superblock/inode type information (even if the CAFS itself is not mirrored). For performance, 4 LUNs per CAFS (8 in case of CMF) are recommended.

Information on failure / recovery:
 Individual disks are automatically replaced by hot-spare disks in case of a disk crash. The system continues running without interruption. Defective disks can be swapped during online operation.
 Operation continues in case of a complete LUN or even RAID system failure. However, non-mirrored LVs become temporarily unavailable. If non-mirrored LVs have not yet been written to PVs, they are lost.
 If a CAFS spans more than one LUN or RAID system, information on lost LVs is preserved (the superblock / inode type information is mirrored) and may be used for a quick recovery of LVs already written to physical tape.
 Continuous operation including data availability can be achieved with the Cache Mirror Feature (CMF).
 CMF can be configured on a CAFS level.
 A lost mirror is automatically recreated after fixed or replaced hardware is connected to the CAFS again.
Component: IDP / PD

Availability architecture / recommendations:
 Configure multiple IDPs in the system and connect PDs to different IDPs.
 Configure enough PDs to cope with the data throughput rate in order to prevent performance bottlenecks and TVC congestion in the event of a failure.
 The PDs attached to an IDP should be distributed over 2 PTLs, and conversely the PDs of a PTL should be distributed over at least 2 IDPs.
 The SAN infrastructure between IDPs and PDs should also be configured redundantly.

Information on failure / recovery:
 If an IDP or a PD crashes, ETERNUS CS8000 automatically uses the remaining resources.
 IDPs and PDs can be replaced and commissioned during online operation.
 Operation continues independently of the backup application at the front-end.

Component: PTL / PV

Availability architecture / recommendations:
 Create redundant copies of the LVs on different PVs and PTLs using Dual / Triple Save.
 Connect a second PTL to the ETERNUS CS system via redundant paths (optional for ETERNUS CS8200 and ETERNUS CS8400, otherwise supported as standard).
 Premature aging of PVs can be prevented by automatic “refreshing” based on a suitable refreshing policy.

Information on failure / recovery:
 The advantage of virtualization is that the PTL does not have to be permanently available. This means that problems affecting the PTL and PDs no longer have an adverse impact on backup operations.
 With Dual Save, recovery mode is initiated if a damaged PV is accessed; in this mode the redundant copy is accessed immediately. When a damaged PV is detected, its status is set to “faulty” and the system automatically generates a new copy. “Faulty” PVs must be replaced in the medium term as part of routine operational maintenance.
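To illustrate the ICP/LD recommendations above, the following minimal Python sketch spreads a media server’s logical drives over two ICPs and selects drives on a surviving ICP when one fails. The device names, the LD-to-ICP mapping, and the failure detection are assumptions for illustration only; in practice the backup application’s device configuration and its monitoring provide this information.

```python
# Sketch: pick a usable LD on a surviving ICP (hypothetical device names).
from itertools import cycle

# Example device configuration: LDs of one media server spread over two ICPs.
LD_TO_ICP = {
    "LD01": "ICP1", "LD02": "ICP2",
    "LD03": "ICP1", "LD04": "ICP2",
}

def usable_drives(failed_icps: set[str]) -> list[str]:
    """LDs whose ICP is still alive."""
    return [ld for ld, icp in sorted(LD_TO_ICP.items()) if icp not in failed_icps]

def next_drive(failed_icps: set[str]):
    """Round-robin iterator over the remaining LDs for new backup/restore jobs."""
    drives = usable_drives(failed_icps)
    if not drives:
        raise RuntimeError("no logical drives left - all ICPs down")
    return cycle(drives)

if __name__ == "__main__":
    picker = next_drive(failed_icps={"ICP1"})   # assume ICP1 has failed
    for job in ("backup-db", "backup-files", "restore-vm"):
        print(job, "->", next(picker))          # jobs continue on ICP2's drives
```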
7 Conclusion
To sum up, it can be said that a suitably configured ETERNUS CS8000 system can withstand the failure of any component type and is even able to survive a complete site failure, keeping your data safe.
 This is especially true for ETERNUS CS8800 systems, which can be configured with no single point of failure and can additionally be installed as split site configurations.
 The ETERNUS CS8400 systems can be configured with no single point of failure in single-site configurations.
 The ETERNUS CS8200 systems, on which the components ICP, IDP and VLP are implemented together on an IUP (Integrated Universal Processor), behave in a similar manner, with the exception that these models cannot be installed as a split site configuration. If one IUP fails, the remaining IUP can continue working, provided that the SVLP function is installed with AutoVLP failover. It is also important to make sure that the PTL infrastructure is connected to both IUPs so that if one IUP fails, the remaining one can still access the PDs and the PTL in the back-end.
In the cases considered, operation either continues or can be resumed within a short period of time. In addition, new components can be
installed while the system continues to run. This also includes software upgrades which can be performed in a rolling manner, i.e. component
by component without system interruption. Nonetheless, the time in which the original status can be restored plays a key role, since a further failure of the same component type can lead to a system interruption. For this reason ETERNUS CS8000 provides three important mechanisms for quickly reporting failures:
 Notification of the operator or administrators via the ETERNUS CS8000 console, by SNMP, SMS or email
 Forwarding to a management center via SNMP
 Forwarding to the service provider (via AIS Connect, telephone or email)
The integration of these mechanisms into operational processes is a vital prerequisite for meeting agreed service levels.
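As a simple illustration of such an integration, the following Python sketch (standard library only) listens for SNMP trap datagrams on the standard trap port and appends each arrival to an operations log. The port, the log file name, and the idea of opening a ticket are assumptions for illustration; the sketch does not decode trap contents and is not specific to the ETERNUS CS8000 trap format.

```python
# Minimal sketch: log the arrival of SNMP trap datagrams from the appliance.
import logging
import socketserver

logging.basicConfig(filename="ops-notifications.log", level=logging.INFO,
                    format="%(asctime)s %(message)s")

class TrapHandler(socketserver.BaseRequestHandler):
    """Records the arrival of each trap datagram; decoding is left to real tooling."""
    def handle(self):
        data, _sock = self.request              # for UDP: (raw datagram, socket)
        source_ip = self.client_address[0]
        # A real integration would decode the trap and open a ticket in the
        # management center; here we only append an entry to the operations log.
        logging.info("trap datagram (%d bytes) received from %s", len(data), source_ip)

if __name__ == "__main__":
    # UDP port 162 is the standard SNMP trap port (binding usually requires root).
    with socketserver.UDPServer(("0.0.0.0", 162), TrapHandler) as server:
        server.serve_forever()
```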
Contact
Fujitsu
E-mail: storage-pm@ts.fujitsu.com
Website: www.fujitsu.com/eternus
2013-12-02 WW EN
© Copyright 2013 Fujitsu Technology Solutions GmbH. Fujitsu, the Fujitsu logo, are trademarks or registered
trademarks of Fujitsu Limited in Japan and other countries. Other company, product and service names may be
trademarks or registered trademarks of their respective owners. Technical data subject to modification and
delivery subject to availability. Any liability that the data and illustrations are complete, actual or correct is
excluded. Designations may be trademarks and/or copyrights of the respective manufacturer, the use of which
by third parties for their own purposes may infringe the rights of such owner.