Enterprise Survivable Servers (ESS):
Architectural Overview
written by Greg Weber and Aaron Miller
Section 1 - Introduction
1.1 What is ESS ?
Enterprise Survivable Servers (ESS) is an offering which takes an existing Avaya Communication Manager
(CM) system to a higher level of availability and survivability. ESS achieves this by allowing media
servers to be used as alternate controllers within a system by leveraging IP control of port network
gateways and being completely independent of the main servers both functionally and geographically. ESS
protects the communication system against a catastrophic main server failure and, at the same time,
provides service to port network gateways that have been fragmented away from their current controlling
entity.
1.2 How is this paper organized ?
Section #2 – The Background section gives the technical background needed to fully understand and
appreciate ESS by taking a deep dive into how ACM works today in a non-ESS environment. It provides
detailed definitions of some basic infrastructure terms and goes into tremendous depth on how media
servers control port network gateways. If the only objective of reading this paper is to obtain a high level
overview of ESS, then this section can be skipped. However, if the desire is to understand the “whys” of
the ESS operation, then this section builds that foundation.
Section #3 – The Reliability – Single Cluster Environments section covers many of the methods by which
service can be supplied to a port network gateway while still having only a single controlling cluster. The
ESS offering does not replace these methods, but rather builds on top of them. The methods described in
this section attempt to restore service if faults occur within the communication system. Only if all of these
methods, including server interchanges, IP Server Interface (IPSI) interchanges, control fallback, and more,
are unsuccessful in restoring service will the ESS offering begin to take effect.
Section #4 – The Reliability – Multiple Cluster Environments section covers some of the current methods
of protecting the communication system against main server failures. All of these offerings, such as
Survivable Remote Processors (SRP), ATM WAN Spare Processors (ATM WSP), and Manual Back-up
Servers (MBS), are being replaced by ESS. This section provides a brief overview of the operation of each
of these offerings, making it clear that ESS provides at least the same level, and in most cases a higher
level, of protection.
Section #5 – The ESS Overview section introduces the ESS offering by showing, at a high level, how ESS
servers increase overall system availability by providing protection against catastrophic main server
failures and extended network fragmentations. In addition, this section emphasizes that the ESS offering is
one of last resort and that it works in conjunction with all existing recovery mechanisms by reviewing a
number of failure scenarios which are resolved by methods other than ESS.
Section #6 – The How Does ESS Work section takes a deep technical dive into the operational steps
required to make ESS operate. The section describes how ESS servers register with and acquire translation
updates from the main cluster. Furthermore, it also covers, in detail, how IPSIs prioritize the various ESS
clusters within a system and when they will use this ranking to get service. The information in this section
gives the “hows” and “whys” of the ESS offering’s operation.
Section #7 – The ESS in Control section examines what happens to existing calls when port networks
fail over under ESS control. While there is no feature debt incurred or performance degradation when
running under ESS control, there are subtle operational differences. The section concludes with a discussion
of these differences, including possible call-flow changes and access to centralized resources and voicemail.
Section #8 – The ESS Variants Based on System Configuration section inspects the subtle differences that
occur when a system needs to utilize ESS clusters for control under various port network connectivity
configurations. For each type of port network connectivity configuration available, failure examples,
including catastrophic server outages and severe network fragmentations, are simulated and thoroughly
explained.
Section #9 – The ESS in Action - A Virtual Demo section allows the reader to experience the ESS offering
in a real world environment. The demo is given on a communication system that consists of three
geographically separate sites interconnected in various ways. After a thorough explanation of the setup and
how the system operates in a non-faulted environment, a catastrophic server failure and network
fragmentations are introduced. Each stage of the demo is clearly explained, and the viewpoints of the
system from the main servers and the ESS servers are shared through administration status screens. The
demo concludes with an execution of the steps required to restore the PBX switch back to normal operation
after the network is healed and the main cluster fixed.
Section #10 – The More Thoughts and Frequently Asked Questions section covers various topics involving
the ESS offering. These topics include details about license files, re-use of MBS servers as ESS servers,
the differences in alarming between main servers and ESS servers, how non-controlling IPSIs work with
ESS, and much more.
Section 2 - Background
2.1 What is an Avaya Media Server ?
The Switch Processing Element (SPE) is the controlling entity of an Avaya Communication System. The
SPE responsibilities range from making any and all intelligent decisions to control of every endpoint within
the switch. In a traditional DEFINITY PBX, the SPE software resides on proprietary hardware (UN/TN
form factor circuit packs) and runs on a proprietary operating system called Oryx/Pecos. Avaya’s next
generation Communication System needs to have the ability to efficiently integrate faster processors as
they become available. One of the main objectives of the Avaya Communication Manager was to transport
this powerful, feature-rich, reliable SPE onto an off-the-shelf server running an open operating system.
This objective was realized by the migration of the SPE onto media servers running the Linux operating
system. The following table shows the different types of Avaya media servers which are currently
available.
Server Name   Chip Set                         Duplicated   Port Network Support
S8720         AMD Opteron – 2.8 GHz            Yes          Yes
S8710         Intel Xeon – 3.06 GHz            Yes          Yes
S8500         Intel Pentium-4 – 3.06 GHz       No           Yes
S8700         Intel Pentium-3 – 850 MHz        Yes          Yes
S8400         Intel Pentium-M – 600 MHz        No           Yes
S8300         Intel Mobile Celeron – 400 MHz   No           No

Table 1 – Types of Media Servers
In the table above, the server types are listed in descending order of processing power. This is a function of
the chip set on which they run. Duplication is another distinguishing property of the server types. The
S87XX series of media servers are always duplicated, where each server in the server pair has an SPE
running on it. The standby SPE, running on one of the servers, is backing up the active SPE on the other
server in the pair. Refer to the What is a Server Interchange section for more information on server
duplication.
The Avaya product line has many different types of gateways including H.248 gateways, H.323 gateways,
traditional DEFINITY cabinetry, and 19” rack mountable cabinets. Port networks (explained in the section
below), which are made up of traditional DEFINITY cabinets and 19” rack mountable cabinets, are not
supported by all media servers. Since the ESS project is designed to provide survivability for port
networks, the S8300 media server will be omitted throughout the rest of this paper unless specifically noted
since it does not support the ESS offering. Also, the scope of the ESS product does not include support for
the S8400 media server at this time.
2.2 What are Port Networks ?
All endpoints connect to the communication system through circuit packs when utilizing the S8720, S8710,
S8700, or S8500 Media Servers. For example, digital phones are wired directly to digital line circuit packs
(TN2124), PRI trunks terminate at DS-1 circuit packs (TN464), and IP endpoints utilize CLAN circuit
packs (TN799) as their gatekeepers. All of these TN form factor circuit packs are housed in cabinets.
These cabinets, listed in the table below, support circuit packs by providing them power, clocking, and bus
resources. The bus resources, a time division multiplexed (TDM) bus and a packet bus, are used as both a
control conduit to the circuit packs and a bearer interconnectivity medium.
Cabinet Type   Duplicated IPSI Support   EI Support   Rack Mountable
MCC/SCC        Yes                       Yes          No
G600           No                        No           Yes
G650           Yes                       Yes          Yes

Table 2 – Types of Cabinets
A port network is the grouping of cabinets that have physically connected buses or, in other words, a port
network is a collection of cabinets that share the same TDM bus and the same packet bus. In addition,
since all circuit packs within a port network share a TDM bus, they need to be synchronized. For that
reason, each port network contains one active tone clock residing on either a dedicated tone clock board
(TN2182) or an IP Server Interface (IPSI) card (TN2312, described later in this section).
2.3 How are Port Networks Interconnected ?
The interconnected buses, within a set of cabinets making up a port network, provide a direct
communication medium between all endpoints associated with that port network. However, endpoints in
dispersed port networks also need to be able to establish a communication medium when needed or, in other
words, there needs to be support for Port Network Connectivity (PNC). The PNC provides an
interconnection of port network buses when required. A variety of PNC types are supported by ACM, as
shown in the table below.
PNC Type                    Port Network Interconnectivity Device   Ability to Tunnel Control   Resource Manager   Max # of Port Networks
Center Stage Switch (CSS)   EI Board (TN570)                        Yes                         SPE                45
Direct Connect              EI Board (TN570)                        Yes                         SPE                3
ATM                         ATM EI Board (TN2305/6)                 Yes                         ATM Switch         64
IP                          MEDPRO (TN2302)                         No                          IP Network         64

Table 3 – Types of PNC
When a communication path is needed between two port networks, the SPE will set up the path. For the
PNC types which have the SPE as the resource manager such as Center Stage Switch (CSS) and direct
connect, the SPE will not only inform the Expansion Interface (EI) boards which path to use, but also
create the path through the PNC. For example, to set up a communication path between two port networks
using the CSS PNC, the SPE will inform the EI in port network #1 to use path A, set up path A through the
CSS, and inform the EI in port network #2 to also use path A. When the SPE needs to create the actual
communication path, the PNC is referred to as SPE managed. For the PNC types which do not have the
SPE as the resource manager such as ATM and IP, the SPE will only need to tell the ATM EI boards (or
MEDPRO boards) to interface with their peer ATM EI board (or MEDPRO board) and the creation of the
actual path is left to either the ATM switch (if ATM PNC) or the IP network (if IP PNC). For example, to
set up a communication path between two port networks using the ATM PNC, the SPE will inform the
ATM EI in port network #1 to communicate with the ATM EI in port network #2 and the ATM switch
figures out how this is done. When the SPE does not need to create the actual communication path, the
PNC is referred to as self-managed.
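The distinction between SPE-managed and self-managed PNC can be restated in a short sketch. The Python fragment below is purely illustrative; the function name, message wording, and structure are ours and are not part of any Avaya interface.

```python
def set_up_inter_pn_path(pnc_type: str, pn_a: int, pn_b: int) -> list[str]:
    """Contrast SPE-managed and self-managed path setup (illustrative only)."""
    if pnc_type in ("CSS", "direct"):
        # SPE managed: the SPE informs both EI boards AND creates the path itself.
        return [f"tell EI in PN #{pn_a} to use path A",
                "set up path A through the PNC",
                f"tell EI in PN #{pn_b} to use path A"]
    # Self-managed (ATM or IP): the SPE only pairs the interface boards; the
    # ATM switch or the IP network creates the actual path.
    return [f"tell ATM EI/MEDPRO in PN #{pn_a} to interface with its peer in PN #{pn_b}",
            f"tell ATM EI/MEDPRO in PN #{pn_b} to interface with its peer in PN #{pn_a}"]
```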
Figure 1 – Connectivity of Different Types of PNCs (panels: Direct Connect, Center Stage Switch, ATM PNC, IP PNC)
2.4 How are Port Networks Controlled ?
As described above, a port network consists of a number of circuit packs and two transport buses, a TDM
bus and a packet bus. Each circuit pack is locally controlled by resident firmware called an angel.
Therefore, in order to utilize the functions of a circuit pack, the SPE needs to interface with that circuit
pack’s angel. Some circuit packs contain enhanced angels, specifically EI boards and IPSI boards
(described later in this section), which have the capability of becoming an arch-angel when activated. An
activated angel, or arch-angel, provides the communication interface to all other angels in the port network
over the TDM bus for the SPE. Hence, in order for the SPE to control all the circuit packs and the TDM
bus in a port network, the SPE must be able to establish a communication path to the port network’s archangel.
Some other circuit packs in the system, such as CLAN boards for example, need additional control links for
various reasons. These control links traverse the port network’s packet bus in the form of Link Access
Protocol Channel-D (LAPD) links to the circuit pack. The SPE utilizes a Packet Interface (PKTINT) to
manage the port network’s packet bus and terminate the other end of the LAPD links. Therefore, in order
to control these types of circuit packs, a LAPD link needs to be established between the PKTINT and the
circuit pack and the SPE needs a communication path to the PKTINT.
The communication path between the SPE and the port network’s arch-angel takes the form of a special
LAPD link called an Expansion Arch-angel Link (EAL). An EAL begins in the PKTINT and terminates at
the port network’s arch-angel. Consequently, the SPE can control everything it needs to be in command of
in a port network, all control links and both buses, if it has a communication path to a PKTINT that can
serve that port network.
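The control relationship just described can be modeled very simply. The sketch below is only an illustrative data model (the class and field names are ours, not an Avaya API); the point is that SPE control of a port network reduces to having a working EAL from some PKTINT to that port network’s arch-angel.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PktInt:
    """Packet Interface: terminates the near end of LAPD links, EALs included."""
    housed_in_pn: int

@dataclass
class ArchAngel:
    """Activated angel that relays control to every angel in its port network."""
    pn: int
    resides_on: str  # "IPSI" or "EI" board

@dataclass
class EAL:
    """Expansion Arch-angel Link: PKTINT at the near end, arch-angel at the far end."""
    near_end: PktInt
    far_end: ArchAngel

def spe_controls(pn: int, eal: Optional[EAL]) -> bool:
    # The SPE commands a PN's circuit packs and buses only while a functioning
    # EAL terminates at that PN's arch-angel.
    return eal is not None and eal.far_end.pn == pn
```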
In a traditional G3r DEFINITY system, the SPE and the PKTINT both reside in the Primary Port Network
(PPN) and are connected via the carrier’s backplane. The PKTINT serves all of the Expansion Port
Networks (EPN) by having EALs go through the PNC (either CSS, ATM, or direct connect) and terminate
to arch-angels on each port network. EI cards contain these enhanced angels which can be activated into
arch-angels. The figure below shows how the SPE located in a DEFINITY PPN can control its EPNs.
Figure 2 – SPE Control of EPNs in Traditional G3r DEFINITY System
However, in the new CM configuration, the SPE no longer resides on the proprietary hardware in the PPN,
but rather externally on the media servers. Therefore, the SPE cannot communicate with the PKTINT
through the carrier’s backplane. To solve this problem, the Server Interface Module (SIM) was created to
provide a physical connection to the PKTINT and an IP interface to communicate with the SPE. The IP
Server Interface (IPSI) card (TN2312) is a conglomeration of a number of components mentioned thus far:
a PKTINT, an enhanced angel (which can be activated into an arch-angel), a SIM, and a tone clock. The
figure below shows how the SPE located on a media server can control its EPNs.
Figure 3 – SPE Control of EPNs in an ACM Configuration (IPSIs in all EPNs)
While the above figure shows every EPN in the system having an IPSI, this does not necessarily have to be
the case. Every control link in the system goes between a PKTINT and a circuit pack containing an
arch-angel. If all of the control links went through a single PKTINT, as is the case in the traditional G3r
DEFINITY system, a bottleneck would be created. However, the PKTINT resident on the IPSI card has
more than enough capacity to support its own port network control links and control links for other EPNs.
If the PNC supports tunneled control, then not every port network is required to have an IPSI for control.
The figure below shows the control of EPNs with some having IPSIs and others not. Based on IPSI
processing capacity and reliability issues, which are discussed in the next section, it is suggested that there
be one IPSI for every five port networks with a minimum of two per system.
Figure 4 – SPE Control of EPNs in an ACM Configuration (IPSIs in some EPNs)
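The IPSI engineering guideline quoted above (one IPSI for every five port networks, with a minimum of two per system) reduces to a one-line calculation. The function name below is ours, not Avaya’s; it is simply a restatement of the guideline.

```python
import math

def suggested_ipsi_pn_count(total_port_networks: int) -> int:
    """One IPSI-connected PN per five port networks, minimum of two per system."""
    return max(2, math.ceil(total_port_networks / 5))

# Example: a 12-PN system -> max(2, ceil(12 / 5)) = 3 IPSI-connected port networks.
```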
2.5 What are SPE Restarts ?
The SPE can go through many different types of restarts which can be grouped into three categories – cold
restarts, warm restarts, and hot restarts. Each type of restart has different memory state requirements and
different end-user effects.
The most drastic restart of an SPE is a cold restart. The cold restart clears all memory which deals with call
state and re-initializes the entire system. The end-user effect of a cold restart is that every call in the system
is dropped and that all endpoints are reset. A cold restart has virtually no memory requirements in order to
achieve this type of restart.
A much more graceful SPE restart is a warm restart. This restart re-initializes some of the system, but
keeps all memory which deals with call states intact. The end-user effect of a warm restart is much less
severe than during a cold restart; all calls that are stable remain up, however, some stimuli (e.g. end-user
actions or endpoint events) may be lost. This means that an end-user simply communicating on a call over
an established bearer channel is unaffected, but an end-user in the process of dialing may lose a digit.
Therefore, a warm restart has a memory requirement where all the call state information is valid and up-to-date in order to preserve calls.
While a warm restart has some minor effects on end-users, a hot restart of the SPE does not. A hot restart
does not alter or clear any memory and will only restart the SPE software processes using the intact
memory. There is no end-user effect of a hot restart; all calls are preserved and all stimuli occurring during
the restart are received by the SPE and appropriate actions are taken on them. To accomplish this, a hot
restart has the requirement that all of its memory must remain completely intact through the restart.
The following table summarizes all the different types of restarts.
Restart Type   End-user Effect   Stable Calls Survive ?   Approximate Time for Completion   Executable ACM Command
Hot            None              Yes                      < 1 second                        –
Warm           Minimal           Yes                      5-9 seconds                       reset system 1
Cool           Minimal           Yes                      45-60 seconds                     reset system 2
Cold           High              No                       see below                         reset system 3
Reboot         High              No                       see below                         –

Table 4 – SPE Restart Summary
The cool restart introduced in the table above is a form of a warm restart that can be used across different
versions of software. It allows the SPE to be actively running and get reset into a new software version
with all stable calls remaining up through the restart. A cool restart is the mechanism to upgrade the SPE’s
release without losing any calls.
As previously stated, cold restarts clear all call state memory but, based on the level of the cold restart, it
may or may not reload the translations. A low level cold restart clears all call state memory, while keeping
its translation memory intact. A high level cold restart, or a reboot, clears both the call state memory and
reloads translations. This implies that if an administrative action is done on the switch, it will be preserved
through a low level cold restart. However, if the same administrative action is done and the translations are not saved
before a reboot occurs, the changes are lost since the translations are reloaded from the disk.
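The restart levels and their consequences, as summarized in Table 4 and the paragraph above, can be captured in a small lookup table. This is only a restatement of the text in code form; the dictionary keys and field names are ours.

```python
# (end-user effect, stable calls survive, approximate completion time)
RESTART_LEVELS = {
    "hot":    ("none",    True,  "< 1 second"),
    "warm":   ("minimal", True,  "5-9 seconds"),
    "cool":   ("minimal", True,  "45-60 seconds"),
    "cold":   ("high",    False, "minutes; see the restart-time discussion below"),
    "reboot": ("high",    False, "minutes; see the restart-time discussion below"),
}

# A low level cold restart keeps translation memory intact, so unsaved
# administrative changes survive it; a reboot reloads translations from disk,
# so unsaved changes are lost.
TRANSLATIONS_RELOADED_FROM_DISK = {"cold": False, "reboot": True}
```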
Restart times are measured from the beginning of the restart until all endpoints are operational and the SPE
is again processing end-user stimuli. Hot restart, warm restarts, and cool restarts never have any endpoints
go out of service and need only wait for the SPE to begin processing once again. This has the implication
that the restart time for these types of resets does not vary much with the size of the system or the number
of endpoints, and therefore approximate restart times can easily be determined without any knowledge of
the system. Cold restarts and reboots reset all endpoints and cannot be deemed completed until all of them
are operational. Determining an approximate restart time mainly depends on three major factors – the types
of endpoints (analog, digital, BRI/PRI, IP …), how many of each are in the system, and the processing speed
of the server. For very small systems running on S87XX media servers, the restart completion time is on
the order of one to three minutes. For larger systems with a low number of IP & PRI endpoints, the restart
completion is approximately three to six minutes. For systems with a large number of IP & PRI endpoints,
the system may not be fully operational for up to ten to fifteen minutes. During this time, all other endpoint
types will be functioning while the remaining IP & PRI endpoints are coming back into service.
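The cold-restart timing guidance above can be read as a simple three-way classification. In the sketch below the category labels and thresholds are illustrative; the paper gives ranges, not exact numeric cut-offs, so treat this purely as a restatement of the preceding paragraph.

```python
def approximate_cold_restart_time(system_size: str, ip_pri_population: str) -> str:
    """Completion-time bands quoted above for cold restarts/reboots on S87XX servers."""
    if system_size == "very small":
        return "1-3 minutes"
    if ip_pri_population == "low":
        return "3-6 minutes"
    # Large IP & PRI population: other endpoint types return to service first
    # while the remaining IP & PRI endpoints continue coming back.
    return "10-15 minutes"
```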
2.6 What are EPN restarts ?
The SPE restarts, described in the previous section, affect the entire system. For example, a cold restart of
the SPE causes all endpoints in the system to be reset and every call to be dropped. SPE restarts are used as
a recovery method for a subset of failures, such as a server hardware malfunction or a software trap. The
type of failure dictates which restart level is needed for recovery. However, there are other types of failures
which need recovery where an entire system restart is not warranted. For example, if a single port network
has an issue which needs a restart to return to normal operation, then only that PN should be reset as
opposed to the entire system. These types of restarts are referred to as EPN restarts. While EPN restarts
are implemented differently than system restarts, the same classifications exist for them – hot, warm, and
cold.
An EPN cold restart is the most extreme PN restart. In an EPN cold restart all circuit packs within the PN
are reset which causes all associated endpoints to reset and all calls to be dropped.
The basic concept of an EPN warm restart is to resynchronize the SPE and the EPN after some event occurs
that caused them to become slightly out of sync in terms of resource allocation. That said, a warm restart
of a PN preserves all established talk paths that it is supporting and does not cause its associated endpoints
to be reset. However, any end-user actions that took place during the time of the fault through the
completion of the EPN warm restart are lost. If, for example, an IPSI temporarily loses its connectivity
with the SPE for less than 60 seconds, it requires an EPN warm restart to return to normal operation. Any
calls in progress within and through that PN would be unaffected (unless of course the connectivity that
was lost was also being used for the bearer traffic), but any end-user stimuli (e.g. dialing or feature
activation) that happened during that time are lost.
A hot EPN restart is used to interchange IPSIs within a PN in a non-faulted environment. IPSI duplication
and IPSI interchanges are described in the next section in detail, but it is important to understand that there
are no adverse effects on any endpoints or any talk paths through a hot EPN restart.
Section 3 - Reliability – Single Cluster Environments
3.1 What is a Cluster ?
As described in the What is an Avaya Media Server section, some of the available servers can be
duplicated. This duplication allows one server to run in active mode and the other server to run in standby
mode. The What is a Server Interchange section below describes how the standby server could take over
for the active server if needed. If the standby server is in a refreshed state immediately before the failure, it
could take over without affecting any of the stable calls. In other words, the standby server shares call state
with the active server via memory shadowing and can take over the system in a way that would preserve all
stable calls. A cluster is defined to be a set of servers that share call state. In the case of a server which can
only be simplex, the server itself is a cluster. In the case of a server which can be duplicated, the cluster is
the combination of both of the servers.
3.2 How do single cluster environments achieve high availability ?
The basic structure of the Avaya Communication Manager consists of a decision maker (the SPE), an
implementer (the port network gateways), and a communication pathway for the decision maker to inform
the implementer what actions to carry out. If the SPE ceases to operate or fails to communicate with its
port networks, then the PBX will be inoperable since the SPE makes all the intelligent decisions for the
PNs. For example, if the SPE was powered off, then all of the control links to the arch-angels would go
down. If an arch-angel has no communication path to the SPE, then all messages sent to it from angels on
other boards in the port network would be dropped. If the angels on the TN boards cannot send messages
anywhere to be processed, then endpoint stimuli get ignored. If all endpoint stimuli are dropped, then
service is not being supplied to the end user.
Higher system availability is achieved by providing backups to the system’s basic structural components.
In the case of the SPE, if the servers support duplication, then the standby SPE within a cluster can take
over for its active SPE in fault situations. Once a stable active SPE is achieved, a control link is brought up
to each port network. These control links can traverse different pathways to compensate for
communication failures and can terminate at different locations to compensate for TN board failures. IPSI
interchanges and EI-control fallback are two methods of achieving this and are described in detail at the
end of this section.
3.3 What is a Server Interchange ?
In the What is an Avaya Media Server section above, a number of server types were listed along with some
of their characteristics. One of these characteristics was duplication. Understanding the catastrophic nature
of an SPE failure, it is important to be able to provide SPE duplication to avoid a single point of failure. If
the servers are duplicated, then one server runs in an active mode and the one server runs in a standby
mode. The server in active mode is supporting the SPE which is currently controlling the Communcation
System. The server in standby mode’s SPE is backing up the active SPE. If there is a failure of the active
SPE, then the standby SPE takes over the system.
One feature of duplicated servers is the ability to shadow memory from the active SPE to the standby SPE.
One technique of achieving this is by having a dedicated memory board (DAJ or DAL board) in each
server and interconnecting them with a single-mode fiber. This method of memory shadowing is referred
to as hardware memory duplication. Another means for shadowing memory between a pair of servers is to
transmit the information over a high-speed, very reliable IP network. This method of memory shadowing
is referred to as software memory duplication and is not available for all types of media servers. For
performance reasons and bandwidth considerations, only a percentage of the system’s total memory is
duplicated between the servers. During steady-state operation, any changes made to the active server’s call
state memory are transmitted to the standby server. After these transmitted changes are applied to the
current state of the standby server’s memory, the standby server’s memory will be up-to-date with the
active server’s memory. The key, however, is that transmitted changes are applied to the current state of
the standby server’s memory and, for that reason, if the current state was not up-to-date with the active
server before the changes were made, the transmitted changes have no context and therefore are useless. If
a standby server’s memory is up-to-date with the active server, the system is refreshed. If the standby
server’s memory is not up-to-date, the system is in a non-refreshed state. A standby server becomes
non-refreshed if the communication between the servers has a fault or if either SPE goes through a restart. If
the system is non-refreshed, the active and standby servers go through a coordinated refreshing process to
get the standby server up-to-date and prepared to accept transmitted changes.
If the standby SPE is required to take over the system, it needs to transition to become the active SPE. The
process of a standby SPE becoming active is done through an SPE restart. As described earlier, there are
various types of restarts with each having different end user effects. The ideal restart level for an SPE
interchange is a hot restart, whereby the interchange would be completely non-service affecting. However,
this type of restart requires that the standby server have 100% of the memory state of the active server
before the failure. Unfortunately, the system does not achieve 100% memory duplication which implies
that a hot restart interchange is not supported between SPEs running on media servers. However, enough
memory is shadowed to allow a warm restart during the interchange process. This implies that a standby
SPE can take over for an active SPE through a system warm restart if the standby server’s memory is
current (or refreshed). However, if the standby server is not refreshed, the standby SPE can take over for
an active SPE, but only via a cold restart. In other words, if a standby server is refreshed, then it can take
over for an active server without affecting any stable calls in the system. If the standby server is not
refreshed, it can take over the system, but the interchange process will cause all calls in the system to be
dropped.
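The takeover behaviour just described boils down to a single rule, sketched below. The function name is ours; this is simply the text restated as code.

```python
def spe_interchange_restart_level(standby_refreshed: bool) -> str:
    """Restart level used when the standby SPE takes over for the active SPE.

    A hot-restart interchange is not possible because memory duplication is
    never 100%. A refreshed standby allows a warm restart (stable calls
    preserved); a non-refreshed standby forces a cold restart (all calls drop).
    """
    return "warm" if standby_refreshed else "cold"
```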
Figure 5 – SPE Interchange (left: SPE A in active mode; right: SPE B in active mode)
3.4 What is Server Arbitration ?
The decision to determine which server should be active and which server should be standby within a
cluster is the job of the Arbiter. An Arbiter is local to each server within a cluster and instructs its
co-resident SPE which mode to run in based on a number of state-of-health factors. Arbitration heart-beats are
messages passed between two peer arbiters via the IP duplication link and the control networks (for
redundancy) containing the state-of-health vector. The state-of-health vector is primarily comprised of
hardware state-of-health, software state-of-health, and control network state-of-health. The hardware
state-of-health is an evaluation of the media server’s hardware (e.g. fan failures, major UPS warnings, server
voltage spikes, etc.) done by the Global Maintenance Manager (GMM) and reported to the Arbiter. The
software state-of-health is an evaluation of the SPE software processes (e.g. process violations, process
sanity failures, etc.) done by the Software Watchdog (WD) as well as the Process Manager and reported to
the Arbiter. The control network state-of-health is an evaluation of how many IP connections to IPSI
connected port networks the media server has compared to how many IP connections it expects to have.
This information is determined by the Packet Control Driver (PCD) and reported to the Arbiter.
While the decision tree that an Arbiter goes through has many complex caveats (including anti-thrashing
measures, comparison algorithms for state-of-health vectors, tie-breaker procedures, etc.), it can be
summarized by three basic rules, illustrated in the sketch that follows the list:
1) If an Arbiter and its peer determine that both SPEs in a cluster are in active mode, one Arbiter in
the pair, the one which has been active the longest, will instruct its associated SPE to go into
standby mode.
2) If an Arbiter cannot communicate with its peer, it will instruct its associated SPE to go into active
mode.
3) If an Arbiter has a better state-of-health vector than its peer, it will instruct its associated SPE to
go into active mode (if it is not already) and ensure that the peer SPE is no longer running in active
mode.
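The sketch below restates these three rules. The real Arbiter’s decision tree also includes the anti-thrashing measures, vector-comparison algorithms, and tie-breakers mentioned above; the class and field names here are ours, and the state-of-health vector is collapsed into a single number purely for illustration.

```python
from dataclasses import dataclass

@dataclass
class ArbiterView:
    mode: str            # "active" or "standby"
    health: int          # simplified scalar; really HW, SW and network vectors
    active_since: float  # time the co-resident SPE last went active

def arbiter_decision(local: ArbiterView, peer: ArbiterView,
                     peer_reachable: bool) -> str:
    """Return the mode this Arbiter tells its co-resident SPE to run in."""
    if not peer_reachable:
        # Rule 2: no heart-beats from the peer -> go (or stay) active.
        return "active"
    if local.mode == "active" and peer.mode == "active":
        # Rule 1: both SPEs active -> the one active the longest stands down.
        return "standby" if local.active_since <= peer.active_since else "active"
    if local.health > peer.health:
        # Rule 3: the side with the better state-of-health takes the active role.
        return "active"
    return "standby"
```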
3.5 What is an IPSI Interchange ?
In order for the SPE to be able to service a port network, a LAPD control link, an EAL, must be established
and functioning. A port network’s EAL begins at a PKTINT, which may or may not be present in its own
carriers, and terminates at an arch-angel, which must reside within the port network itself. If a failure
occurs on a PKTINT which supports the port network, another one must be found to support the EAL if the
port network is to remain operable. There are two basic ways to find another supporting PKTINT for a port
network - transition to another port network’s PKTINT (which is referred to as fall-back and is discussed in
the next section) or interchange to another PKTINT resident in the port network (which is referred to as an
IPSI interchange and is described below).
In the What are Port Networks section above, a number of cabinet types were listed along with some of
their characteristics. One of the traits mentioned was the ability to support duplicated IPSIs which implies
the ability to support duplicated PKTINTs since the PKTINT is one of the components of an IPSI. The
SPE chooses which IPSI will be active and which IPSI will be standby within a pair. The active PKTINT
supports all of the LAPD links (including the EALs and other control links) that are needed to service the
port networks that it supports. The standby PKTINT is available to take over support of the LAPD links if
needed.
There are two types of IPSI interchanges that exist, each of which has different end user effects - a planned
migration and a spontaneous interchange. A planned IPSI migration will not have any effect on the end
users and can occur during periodic maintenance or via an administrator’s command. A planned migration
occurs in non-fault situations and carefully transitions link support from the active PKTINT to the standby
PKTINT in a method whereby no messages are lost. The standby PKTINT is transitioned to an active
PKTINT via a hot restart. A spontaneous IPSI interchange will have no effect on stable calls, but will
possibly drop end user actions during the switchover. A spontaneous IPSI interchange is an action that
the SPE takes if an active PKTINT failure occurs or a network fault occurs between the SPE and the active
IPSI. A spontaneous IPSI interchange will inform the standby PKTINT of all existing links and then
transition it to the active PKTINT via a warm restart.
The left side of Figure 6 below shows the control links for port networks #1 and #2. Port network #1 (PN
#1) has a resident IPSI and the controlling link for this port network, the EAL, goes from the IPSI’s
PKTINT to the IPSI’s arch-angel. The other port network does not have its own IPSI and therefore is
utilizing another port network’s PKTINT for service (the next section explains how this is possible). PN
#2’s EAL goes from port network #1’s IPSI’s PKTINT to the EI board residing in its own carrier. When a
failure occurs on the active PKTINT, a communication fault in this example, the SPE will transition over to
the standby IPSI’s PKTINT in PN #1. The right side of Figure 6 shows the new instantiations of the
control links after the spontaneous IPSI interchange. Notice that for the non-IPSI controlled port network,
PN #2, the EAL’s near end termination point has shifted from the A-side PKTINT to the B-side PKTINT,
but the far end remains at the same location, on the EI board. However, both termination points of the EAL
for the IPSI controlled port network, PN #1, have shifted. Like the non-IPSI port network’s EAL, the near
end shifted from the A-side PKTINT to the B-side PKTINT, but unlike the non-IPSI port network, the far
end has also shifted from the A-side IPSI to the B-side IPSI. For historical reasons, if both IPSIs and
PKTINTs are healthy, SPE maintenance prefers to interchange back to the A-side IPSI (unless it is locked
down to the B-side) even though there is no performance effect from being on one side or the other.
Figure 6 - IPSI Interchange (left: A-side IPSI active; right: B-side IPSI active)
Also, the determination of which IPSI will be active and which will be standby is completely independent
of which server is in active mode and which is in standby mode. Since there is no longer a tight association
between the active server side and the active IPSI side (e.g. the B-side server can be in active mode and the
A-side IPSI can be in active mode at the same time), the SPE can independently select which side (A or B)
will be active for each IPSI pair (e.g. some IPSI pairs could have the A-side be active while others are
concurrently active on the B-side).
3.6 What is EI Control Fallback ?
In the section above which describes how port networks are interconnected, different types of PNCs are
discussed. Some of the PNC types not only support bearer traffic between port networks, but also provide
a medium for tunneling control from one port network to another. This tunneling ability allows for some
port networks to be supported by a PKTINT which is located in another port network. In these cases, the
port network has an EAL control link which goes from another port network’s PKTINT to an arch-angel
residing on an EI board within the port network itself. Port networks which are being serviced in this
fashion are referred to as EI controlled. Port networks which are being serviced from a co-resident
PKTINT are referred to as IPSI controlled.
EI fallback is a recovery method whereby a port network transitions from IPSI controlled to EI controlled
due to a fault. If the port network has a simplex IPSI, a fault of that board will cause an EI fallback
(provided the PNC type supports it). If the port network has duplex IPSIs, a fault of the active board will
cause an IPSI interchange. However, if the newly activated board has an additional fault and the paired IPSI
board has not returned to an error-free state, the port network will fall back to EI control.
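Put together, the fallback and interchange behaviour described in this and the previous section amounts to the following decision. The parameter names are illustrative, and the final branch (a PN left without control when the PNC cannot tunnel control) is our inference from the "provided the PNC type supports it" caveat above, not an explicit statement in the text.

```python
def pn_control_recovery(active_ipsi_faulted: bool, ipsi_duplicated: bool,
                        standby_ipsi_healthy: bool,
                        pnc_tunnels_control: bool) -> str:
    """Which recovery the SPE applies when a PN's active IPSI/PKTINT faults."""
    if not active_ipsi_faulted:
        return "no action"
    if ipsi_duplicated and standby_ipsi_healthy:
        # Spontaneous IPSI interchange: PN warm restart, stable calls preserved.
        return "IPSI interchange"
    if pnc_tunnels_control:
        # Simplex IPSI, or both IPSIs faulted: EI control fallback through
        # another PN's PKTINT (PN warm restart).
        return "EI control fallback"
    return "PN remains out of service until an IPSI recovers"
```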
The left side of Figure 7 shows three port networks getting controlled before the introduced fault. Port
network #1 and port network #2 have simplex IPSIs and currently have their EALs go from their
co-resident PKTINTs to their co-resident arch-angels (both of which are contained within the IPSI). Port
network #3 does not have an IPSI and is getting its control indirectly from the SPE through PN #2. In this
example, if a fault occurs in PN #2’s PKTINT, then the EALs for PN #2 and PN #3 need to transition off of
the faulted PKTINT. The near end of Port network #3’s EAL shifts from the PKTINT in PN #2 to the
PKTINT in PN #1, but the far end remains at its EI board. Port network #2 moves from IPSI controlled to
EI controlled by having the near end of the EAL shift from PN #2’s PKTINT to PN #1’s PKTINT and
having the far end of the EAL shift from the IPSI’s arch-angel to the EI board’s arch-angel. In order to
shift the port network’s EAL from the IPSI to the EI board, a PN warm restart is required.
Figure 7 - EI Fallback Recovery (left: normal operation; right: faulted fallback operation)
After the fault in the PKTINT is cleared, the SPE attempts to control that port network through its IPSI
once again. The automated process of maintenance shifting a port network’s EAL from the EI board back
to the IPSI is called fall-up. Fall-up, like fall-back, requires a port network warm restart to shift the far end
termination point of the EAL.
Section 4 - Reliability – Multiple Cluster Environments
Thus far, a complete failure of the controlling cluster would render the communication system inoperable.
As shown in the Reliability - Single Cluster Environments section, there are repair methods for many
types of failures that keep the system operational and port networks functioning. For instance, a failure of
an active SPE can be addressed by having the standby SPE, within its cluster, takeover for it. This method
works on the premise that failures of the active SPE and the standby SPE are independent. If there is a long
mean time between failures of SPEs and one SPE is backing up the other one, then statistically there will
always be a functional SPE. Therefore, given a functional SPE and a viable control path to the port
networks, the system will be operational. However, true independence of SPE server failures within a
cluster cannot usually be achieved due to the lack of geographical separation. For example, all the
hardware and software reliability numbers calculated for the SPE servers are completely meaningless if the
data room supporting them has an incident whereby everything in the room is destroyed.
There is a direct correlation between the availability of a port network and the number of different
pathways that can be used by the SPE to control the PN. While there are recoveries which try different
pathways for controlling purposes, there are situations when all of them fail. For example, if a port
network is in a building and all communication pathways into that building are non-functional, then the
system’s controlling cluster is unable to support that isolated port network. In this case, the port network
ceases to provide service to its endpoints and is useless until a control link is re-established.
The ESS offering addresses these types of cataclysmic failures. This section covers the previous offerings
which provide the Communication System protection from catastrophic main cluster failures and port
network isolation using multiple clusters. All of these offerings, including Survivable Remote Processors,
ATM WAN Spare Processors, and Manual Back-Up Servers, are being replaced with ESS.
4.1 How long does it take to be operational ?
The first question always asked in reference to survivability is “How quickly does the PBX provide service
to end users after a fault has occurred?” Before answering this question in terms of an exact time, which is
done in each of the survivable options sections below, the question “What are the critical factors in
determining a recovery timeline?” is addressed. A recovery timeline for any problem can be broken down
into three logical segments – fault detection time, recovery option evaluation time, and recovery action
time.
The fault detection time is the period of time it takes to determine that something is not operating normally
within the system. Some of the survivability options have automated fault detection. For example, the
ATM WAN Spare Processors are continually handshaking keep-alive messages with the main server
complex. If one does not receive a keep-alive response message in an appropriate amount of time, then the
ATM WSP determines there is a fault. Other survivability options, such as Manual Back-up Servers,
require manual detection that a fault has occurred. For automated detections, the time frame for fault
detection is usually on the small order of seconds; however, no such time frame can be easily characterized
for manual detections.
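As an illustration of the automated style of detection, the loop below declares a fault when keep-alive responses stop arriving within a timeout. It is a generic sketch, not Avaya’s implementation; the callable and parameter names are ours.

```python
import time
from typing import Callable

def detect_main_server_fault(keep_alive: Callable[[], bool],
                             timeout_s: float, poll_s: float = 1.0) -> None:
    """Block until keep-alive responses have been absent for timeout_s seconds."""
    last_ok = time.monotonic()
    while True:
        if keep_alive():                 # True if the main complex answered
            last_ok = time.monotonic()
        elif time.monotonic() - last_ok > timeout_s:
            return                       # fault detected; evaluation phase begins
        time.sleep(poll_s)
```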
After a fault is detected, a number of issues need to be resolved before taking any recovery actions. The
first issue that needs to be addressed is determining if the fault detection is a false positive. For example, if
two entities are continually handshaking over an IP network, they could detect something is awry if an
abnormal round trip delay is detected. However, if this is a one time occurrence (which is completely
normal for IP networks), then the “fault” detected is not really a fault and nothing should be done. The
next issue to deal with is weighing recovery speed against the effect on end users. If a port network and its
controlling server have a communication fault, then no service is supplied to end users supported by that
PN. As previously discussed, if the fault is remedied within 60 seconds, then the server and the port
network will continue operation without dropping any active calls. If a Survivable Remote Processor takes
over a port network, it will cold restart that PN and, therefore, drop all calls supported by that PN. The
question becomes, if a real fault is detected between the server and the port network, should the PN be
immediately taken over by the Survivable Remote Processor and drop all calls in the process or should the
main SPE be given the chance to re-establish connectivity with the PN, thereby possibly avoiding dropping
any calls if it is accomplished fast enough? The general rule of thumb is that if the recovery process will not
disrupt service any further, then the recovery option evaluation time can be relatively short as it is with
recovery methods shown in the Reliability - Single Cluster Environments section. If the recovery process
does not preserve calls, as is the case with ESS and the offerings below, then the recovery option evaluation
time should be long enough to give the non-call-affecting recoveries every chance.
The last issue that needs to be addressed is the trade-off between recovery time versus being a fragmented
system. Running in a fragmented mode has some drawbacks discussed at the end of the Reliability -
Multiple Cluster Environments section. If a fault prevents the server from communicating with a PN for an
extended period of time, then that PN goes out of service. All the recovery options at this time involve
resetting the port network and tearing down all existing calls, so fragmentation becomes the only issue. If
it was known that the fault would be fixed within two minutes for example, it may be wise to prevent the
PN from being fragmented and deal with no service during that period of time because if the PN fragments
it will be operating as its own island away from the rest of the PBX and may require another PN restart to
agglomerate it back with the rest of the system. Unfortunately, there is no crystal ball that exists which can
be queried to determine the outage time in advance. At this point, the recovery option evaluation time
becomes a no service time since the PN is out of service longer than it needs to be to prevent fragmentation.
The last part of the recovery timeline is the recovery action time portion. After the fault is detected and the
decision has been made to take a particular recovery, that recovery takes place. The period of time that the
recovery action takes is dependent on what is being done. For example, if a port network goes into fallback
because of an IPSI fault, only a single EPN warm restart is done, which is very fast. However, when a
Manual Back-Up Server takes over a system, it requires a complete system reboot (a cold restart) of the
entire SPE which takes much longer.
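The three-part timeline discussed in this section can be expressed as a simple structure. No numbers here are measured values; the class is just the decomposition restated, with field names of our choosing.

```python
from dataclasses import dataclass

@dataclass
class RecoveryTimeline:
    fault_detection_s: float     # seconds for automated detection; open-ended if manual
    option_evaluation_s: float   # kept long when the chosen recovery drops calls
    recovery_action_s: float     # e.g. one EPN warm restart vs. a full SPE reboot

    @property
    def total_s(self) -> float:
        return (self.fault_detection_s + self.option_evaluation_s
                + self.recovery_action_s)
```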
4.2 What are Survivable Remote Processors (SRP) ?
A communication system configured to use a center-stage switch PNC has its port networks interface with
the CSS through EI boards. Ideally, a point-to-point fiber connects the PN’s EI board directly to a Switch
Node Interface (SNI) board (TN571) in the CSS. However, direct fiber runs have distance limitations, and
geographically dispersed port networks may be desired. Therefore, a port network can utilize the PSTN to
interface its EI board with its peer SNI. This type of EPN, called a DS1C remoted port network, has its EI
board connected to a dedicated DS1 board which communicates with another DS1 board residing in the
CSS. This second DS1 board is connected to the EI board’s peer SNI. This DS1C remoted PN
communicates with the rest of the system through this connection – both for bearer traffic and for tunneled
control.
If the main processing complex ceases to operate or the connectivity between the sites has a fault, then the
DS1C remoted port network has the potential to become a very expensive paper weight. An alternate
controlling complex, called an SRP, can be used to control a DS1C remoted port network in these types of
situations. The maintenance board in the remoted EPN is continuously monitoring the control connection
link through the PSTN. If a fault is detected, it switches the fiber connecting the EI board to the DS1 to
connect the EI board with the SRP. The port network’s arch-angel is still resident on the remoted EPN’s EI
board, but the other terminating side of the EAL is now on the SRP as opposed to the IPSI at the main
controller’s site. On the left side of the figure below, the DS1C remoted EPN is being controlled from the
main site during normal operation. The right side of the figure shows a fragmentation fault and the failover
of the DS1C remoted EPN to a local SRP.
Figure 8 – SRP Control of DS1C Remoted EPN (left: normal operation; right: fragmented operation)
As discussed previously, the SRP will take over the remoted DS1C port network only after all other
recovery methods have been exhausted (server interchanges, IPSI interchanges, PNC interchanges, etc).
There are two main reasons for ordering the recoveries this way – call preservation and port network
isolation.
Before introducing the possibility of multiple controlling clusters, all control links to port networks came
from the main active SPE. If another control pathway is needed, then the control link is transferred to a
new path. If the logical EAL link is transferred onto another physical medium, the port network and the
SPE become slightly out of synch. A port network warm reset is done to re-synch the EPN and the SPE.
This resynchronization consists of the SPE verifying its view of resource allocation with the actual usage in
the EPN. For example, if the SPE believes a call is using some resources on the TDM bus within the port
network and an audit of those resources shows that they are still in use, then the SPE assumes that the call
associated with those resources is still active and continues to allow it to proceed normally. However, if an
audit shows that the resources are no longer in use, the SPE derives the conclusion that the call has been
terminated (e.g. a user went on hook) and updates its resource usage maps.
The What is a Server Interchange section described how a standby SPE can take over for an active SPE
without dropping any stable calls since call state was shadowed between the SPEs within the main cluster.
Unfortunately, there is no call state shadowing between SPEs which are in different clusters (the main SPEs
are in the main cluster and the SPE being used as an SRP is in another cluster). When an SRP assumes
control of an EPN, it needs to synchronize with it in order to assume control. However, since the SRP has
no call state information, it has nothing against which to audit the EPN. Therefore, the only way to
synchronize the SRP SPE and the EPN is via a cold restart which tears down all existing calls.
Once an SRP assumes control of a DS1C remoted port network, that EPN becomes an isolated entity. In
other words, an SRP controlling an EPN becomes a stand-alone PBX separated from the rest of the main
system. The effects of this isolation can be summarized in the following call flow. Assume that user A is
supported by some PN at the main site and that user B is supported by the DS1C remoted port network. In
a non-faulted environment, if user A dials user B’s number, the SPE will attempt to route the call. The SPE
knows the status of user B, on hook in this example, since it controls the DS1C remoted EPN and therefore
can successfully route the call to the end user through the port networks involved and the PNC. However,
if the remoted EPN is currently isolated away under the control of an SRP, the call flow changes
dramatically. When user A dials user B’s number the SPE determines that it cannot route the call. The
SPE unfortunately does not know the status of user B since the supporting EPN is not within its control and
therefore routes the call to an appropriate coverage path. In other words, when a user is not under control
of the SPE, it assumes that the user’s phone is out-of-service even if it is being supported by another
controlling entity such as the SRP.
In addition, the isolated system consisting of the SRP and the EPN no longer has access to centralized
resources at the main location. For example, if the interface to the voicemail system is located on a
different PN at the main site, then the isolated fragment will not have access to it. This will cause intercept
tone to be played to any end user which is supposed to be routed to voicemail coverage. Also, any calls
that are generated or received by the SRP controlled EPN will not be part of the overall system’s call
accounting. This implies that a call made by a user off an EPN being controlled by an SRP will not appear
in the main system’s Call Detail Records (CDR).
The SPE on the SRP is always operational and is continuously attempting to bring a control link up to the
remoted EPN. Under normal operation, there is no pathway from the SRP to the EI, where the control link
terminates, since the maintenance board has chosen to have the fiber connection from the EI go to the
supporting DS1 board. Determining the recovery time for this port network therefore breaks down into the
following time periods – maintenance board detection of a control link failure from the main SPE,
maintenance board no service time waiting for a possible quick network recovery, and SRP bringing up an
EAL after the maintenance board switches over the fiber and cold restarting the port network. The
maintenance board can detect an anomaly in the control link very quickly (within 2-3 seconds), but will
wait in a no service time for approximately two minutes. Once the fiber switchover takes place, the SRP
will create an EAL within 180 seconds since it is a periodic attempt and the pathway now exists. The EPN
cold restart will complete within approximately 1-2 minutes. For these reasons, the complete recovery time
for a DS1C remoted EPN from a catastrophic fault, either due to a network fault or a main server crash, is
approximately 7-10 minutes.
As shown in Figure 8, the SRP controls an EPN via a point-to-point fiber which terminates at the EPN’s EI
board. Since there is no CSS PNC involved with this connection, the SRP can only control one port
network. Therefore, it can be said, an SRP provides survivability for a port network, but not survivability
for an entire system. In fact, since the SRP can only control one port network, it does not need translation
information for the rest of the system. This has the advantage that SRPs can have distinct translations from
the main system which gives the administrator the ability to have the EPN operate differently in
survivability mode if desired (e.g. completely different route patterns or call vectors). However, since the
SRP uses distinct translations, it has the disadvantage of needing double administration in some cases. In
order to add a user to the main system, the translations on the main server need to be edited. Since there is
no auto-synchronization of translations between the main server and the SRP, the same translation edits
have to be done manually on the SRP.
Once the connectivity between the main server and the DS1C remoted EPN has been restored, it is up to
the system administrator to manually put the system back together. The main server has no knowledge that
an SRP has taken over the remoted port network. Therefore, it is continuously attempting to bring that
EPN back into service by instantiating an EAL to it for control. During the network fragmentation, it is
obvious that every EAL creation attempt will fail (no pathway available). After the network fragmentation
has been repaired, the EAL creation attempts will still fail since the fiber connectivity to the EI has been
physically shifted. When the system administrator determines it is time to agglomerate the system, they
will issue a command on the SRP forcing the maintenance board to swing the fiber connecting the EI board
back to the DS1 board. The next periodic EAL creation attempt will now succeed since there is a viable
pathway. It is important to keep in mind that when the main server resumes control over the DS1C
remoted port network all current calls on that EPN will be dropped since the main server has no
information about them. A cold restart of the EPN is required to bring it back online with the rest of the
system.
4.3 What are ATM WAN Spare Processors (ATM WSP) ?
A communication system configured to use an ATM PNC has its port networks interface with an ATM
backbone switch through ATM EI boards (TN2305/6). ATM WSP is an offering for DEFINITY G3r
which provides alternate sources of control for EPNs that are unable to communicate with the main SPE.
This communication failure can occur for one of two reasons – an ATM network fragmentation or a complete
main server complex malfunction. In the first case, an ATM network fragmentation, the control links from
the main server have no viable pathway to get to the port network. It is important to note however, that if
the ATM PNC is duplicated with the critical reliability offer, then both ATM networks need to fragment in
order to have PNs become isolated. For example, if the A-side ATM network is fragmented, the main SPE
would transfer the controlling EAL to traverse the B-side ATM network. The second case, a complete
main server complex failure, also prevents EPNs from receiving service since the controlling entity is
unable to perform its task. If the SPEs are duplicated with the high or critical offer, then both SPEs need to
have complete failures in order to leave the PN out-of-service. For example, if the A-side SPE has incurred
a failure, then the B-side SPE would take over the system through a server interchange.
ATM WAN Spare Processors are added strategically throughout a system to provide alternate sources of
control if needed. They consist of an SPE and an interface into the ATM network. All of the ATM WSPs
in a system are priority ranked and are continuously heart-beating with each other. The figure below shows
an ATM PNC PBX leveraging ATM WSP for higher system availability.
Figure 9 – G3r with ATM WAN Spare Processors
If an ATM WSP loses heart-beats with the main server for 15 minutes, it will then assume control of all
EPNs it can reach if it is the highest ranked spare processor on its fragment. The figure below shows a
catastrophic main server failure and the control shift of all EPNs to ATM WSP #1. The other ATM WSPs in
the system do not take over any port networks since there is a higher ranking spare processor with which
they can communicate. In addition, since all the port networks fail over to the same SPE as they did before,
the system provides 100% equivalent service.
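The takeover decision just described can be sketched as follows. The 15-minute threshold and the "highest ranked spare on the fragment" rule come from the text; the function and parameter names are illustrative assumptions, not part of the DEFINITY software.

```python
HEARTBEAT_LOSS_THRESHOLD_MIN = 15  # an ATM WSP waits 15 minutes without main SPE heartbeats

def should_take_over(minutes_without_main_heartbeat: int,
                     my_rank: int,
                     reachable_spare_ranks: list[int]) -> bool:
    """Return True if this ATM WSP should assume control of the EPNs it can reach.

    Illustrative logic only: take over when heartbeats from the main SPE have
    been absent for 15 minutes and no higher-ranked spare processor is
    reachable on this network fragment (lower rank number = higher priority).
    """
    if minutes_without_main_heartbeat < HEARTBEAT_LOSS_THRESHOLD_MIN:
        return False
    return all(my_rank < other for other in reachable_spare_ranks)

# Example: ATM WSP #2 can still reach ATM WSP #1 (higher ranked), so it stays idle.
print(should_take_over(16, my_rank=2, reachable_spare_ranks=[1, 3]))  # False
```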
Figure 10 – G3r Failure with ATM WSP Takeover
Figure 11 shows an ATM network fragmentation and the failover of EPNs #3 and #4 to ATM WSP #2.
ATM WSP #1 cannot take over those EPNs since no viable pathway exists to communicate with them, and
ATM WSP #3 does not take over those EPNs since it can communicate with ATM WSP #2, which is ranked
higher. While all port networks are receiving service from SPEs that have power equivalent to the main
SPE, the system is running in a handicapped fashion. As discussed in the What are Survivable Remote
Processors section above, some centralized resources may not be accessible by everyone and call flows
between the fragments are disturbed. Each ATM WSP will provide the same service to its EPNs that the
main SPE did, but there is no guarantee that the ATM WSPs will control all system resources.
Figure 11 – ATM Network Fragmentation with ATM WSP Takeover
ATM WSPs are not designed to protect the system against all types of failures, especially since an ATM
WSP does not attempt to assume control of any port networks until it has gone without communication
from the main SPE for 15 minutes. If the communication between the main server and an EPN is severed
for less than 60 seconds, the main SPE should continue to control the EPN once the connectivity is restored.
If the ATM WSP took over the EPN too quickly, an unnecessary fragmentation and EPN cold restart would
occur.
ATM WAN Spare Processors are designed to protect the system against catastrophic failures only. They
are not designed to shield against temporary outages or software glitches. With that said, upon a
catastrophic failure, an EPN is taken over by an ATM WSP within 15 minutes provided connectivity exists
and then it goes through a cold restart which takes approximately 1-2 minutes. Therefore, an EPN which
has its control transferred due to a fault from a main server to an ATM WSP will be operational within
approximately 17 minutes.
At all times the main server attempts to control all EPNs, regardless of whether the EPNs are being
controlled by a spare processor. The control link that the main server is trying to instantiate will be blocked
by the ATM EI board, which only allows one EAL to be terminated to it at a time, and an EAL was already
created when the ATM WSP took over the port network. Therefore, in order to agglomerate the system
back under main server control, the ATM WSPs need to be reset. If an ATM WSP is reset, then all EALs
which it is using to control port networks are dropped. Once an EAL is dropped, the main server's cyclical
attempt to terminate an EAL to that EPN will succeed, allowing the main server to resume control.
4.4 What are Manual Backup Servers (MBS) ?
Communication Manager is the next generation of the DEFINITY PBX whereby the SPE is no longer
resident within the system, but rather controls all EPNs either directly or indirectly over IP. Manual
Backup Servers (MBS) is an offering that provides an alternate source of control to the PNs if a
catastrophic event prevents the primary cluster from providing service by leveraging this IP control
paradigm. MBS was created as an interim offer until ESS became generally available. The figure below
shows how an MBS cluster can be positioned into a system to provide an alternate control source which is
independent of the main cluster. This independence is crucial to increasing system availability because if
the MBS is not independent then the event which rendered the main servers useless may also affect the
MBS servers. While the example in the figure below shows a CSS PNC, the MBS offering works in
conjunction with any of the PNC options available (CSS PNC, ATM PNC, Direct Connect, or IP PNC).
Figure 12 – CM with MBS Servers (two panels: Normal Operation, and MBS Compensating for Main SPE Failure)
Upon a failure which prevents the main servers from controlling the port networks, the end users of the
Communication System will be out-of-service. There are many types of failures, described in the
Reliability - Single Cluster Environments section, which can be resolved by the main servers without any
intervention. However, if the failure is such that the main servers are not going to be operational for an
extended period of time and all other recovery methods fail, then the MBS servers can be manually
activated to take over the system. Under normal situations, the MBS servers are in a dormant state with
translation sets synchronized to it on a demand basis. The activation of an MBS cluster is the process of
taking the dormant SPE into an active state, or, in other words, doing a complete system reboot without any
call state (no memory shadowing exists between the main cluster and the MBS cluster). This implies that
when an MBS takes over control of the system, all EPNs go through a cold restart and any calls supported
by these port networks are dropped.
While the concept of the MBS servers was the basis for the ESS offering, there are many pitfalls with it,
most notably a non-deterministic recovery time. While the actual reboot time of the system (the method
the MBS servers use to recover the system) can be assumed to be on the small order of minutes (see Table
4 in the What are SPE Restarts section), the other parts of the overall recovery time, detection time and no
service time, are based on non-automated procedures and therefore have a tremendous amount of variance
and no upper bound due to the human factor. There are some situations where the communication system
is being tightly monitored at all times and, in this case, the complete system recovery time can hover
around 10 minutes. However, if the system is not being closely monitored, it could take tens of minutes just
to detect that there is a problem, followed by a time-consuming scramble to manually activate the MBS.
Another pitfall is that the MBS servers are not continually proving their viability since MBS servers remain
in a completely dormant state during normal operation. Ideally, the MBS would always be running
diagnostics on its hardware and resident software and checking its connectivity to all of the port networks’
IPSIs. Without this information, there is no way of knowing ahead of time if the MBS servers will be able
to provide service to the port networks if needed and no way of alarming the issues so they can be
addressed before the MBS is ever activated. Furthermore, the MBS offering does not scale well and, in
fact, there can only be one MBS for a main server. This restriction limits the complexity of disaster
recovery plans and the number of different catastrophic scenarios which can be protected against.
MBS servers also have operational pitfalls. The key to the MBS servers successfully taking over the
system is that the main cluster is completely eliminated from the equation. The MBS servers are exact
duplicates, from a software point of view, of the main servers and are unaware of the main servers’
presence in the system. Therefore they cannot operate in conjunction with the main servers for any length of
time. For example, if a situation arose whereby the MBS servers were activated while the main servers
were still operational, then the port networks would consistently thrash back and forth between getting
control from the main cluster and the MBS cluster (this would render the port network inoperable).
Furthermore, suppose the MBS is activated at an appropriate time while the main cluster is inactive, but at
a later time the main cluster is fixed and comes back on-line. This would cause the same thrashing
problem. This leads to the largest downfall of the MBS offering – it does not, although it appears to,
provide protection against network fragmentation. If the network fragments as shown in the diagram
below, the port networks unable to communicate with the main server will be out-of-service (fallback is
unavailable since there is no CSS or ATM PNC in this example). If the MBS is activated, it will assume
control of the port networks it can reach. During the time that the network is fragmented, both the main
and MBS servers are attempting to control all port networks in the system but are blocked from doing so
due to a lack of connectivity. However, this implies that the system is only stable while the fragmentation
continues to exist, because when the network is healed, both clusters will be able to connect to all port
networks and therefore fight over control of all of them. It cannot be stated strongly enough that MBS servers
should never be used to protect against network fragmentation faults. The ESS offering provides all the
advantages of MBS without any of the drawbacks.
Figure 13 – MBS Attempting to Assist in Network Fragmentation Fault (figure note: the configuration is stable only while the network is fragmented; the fragmentation is preventing contention)
Section 5 - ESS Overview
5.1 What are Enterprise Survivable Servers (ESS) ?
ESS is an offering which simultaneously protects Avaya Communication Systems against both catastrophic
main server failures and network fragmentations. This is achieved by having ESS servers placed
strategically within an enterprise as an alternate source of control for port network gateways when control
from the primary source is disrupted for an extended period of time. Unlike the MBS offering, the control
transfer between the primary source and the alternate source is automatic in the ESS offering and multiple
ESS servers may be operational concurrently.
ESS servers consist of media servers executing the SPE software. While the SPE software running on ESS
servers is identical to the SPE software running on the main server, it is licensed differently, which
drastically affects its operational behavior. Some of these behavioral variations include different alarming
rules, capability to register with a main server, and ability to receive translation synchronization updates.
Currently, ESS licensed software can only execute on a subset of the media servers available – S87XX and
S8500 Media Servers.
5.2 What is ESS designed to protect against ?
Upon a main cluster failure, port networks lose their control source. If this failure is catastrophic in nature,
whereby the main servers will not be operational again for an extended period of time, port networks can
search out and find ESS servers that will provide them service. Therefore, ESS protects port networks
against extended failures of the main controlling cluster. For most scenarios, the desired failover behavior
caused by a catastrophic main server failure is to have all surviving port networks find service from the
same ESS server in order to keep the system in a non-fragmented state. The figure below shows a
catastrophic main cluster failure with all of the port networks failing over to a single ESS cluster.
Figure 14 – ESS Protecting Against Catastrophic Server Failure
Upon a connectivity failure to the main cluster, port networks lose their control source. If this
fragmentation continues to exist over an extended period of time, port networks can search out and find
ESS servers on their side of the fragmentation that will provide them service. Therefore, ESS protects port
networks against significantly long fragmentations away from the main controlling cluster. For most
scenarios, the desired failover behavior caused by a network fragmentation is to have all port networks
which can still communicate with each other agglomerate together, forming a single stand-alone system.
The figure below shows a network fragmentation with all of the port networks on the right side of the
fragment grouping together to form a single autonomous system.
Figure 15 – ESS Protecting Against Network Fragmentation
It cannot be emphasized enough that ESS increases the availability of a communication system when faced
with catastrophic server failures or extended network outages. ESS builds upon all of the existing fault
recovery mechanisms which are already integrated into the system. For example, when a network outage is
detected, the PBX attempts to re-establish PN control by interchanging IPSIs (if the port network has
duplicated IPSIs), by falling back control through another PN (if the port network is leveraging CSS or ATM
PNC), and by reconnecting to the PN and warm restarting it (if the outage persists for less than 60
seconds) before ESS servers ever get involved. ESS is the last line of defense to provide service to port
networks in these severely handicapped situations.
5.3 What is ESS NOT designed to protect against ?
The ESS offering is designed to protect systems against major faults which exist over a prolonged period of
time. ESS builds upon all of the existing fault recovery mechanisms which are already integrated into the
communication system. This section will present a number of common faults which can occur and explain
how they are resolved without ESS. Attempting to have ESS address these types of faults would actually
decrease overall system reliability, since it would introduce race conditions between ESS and the existing
recovery mechanisms.
Fault Example: Main Active Server Hardware Failure
Resolution: Server Interchange
Description: When the active server encounters a hardware fault, either the local Arbiter will be informed
of the degraded hardware state of health (SOH) or the standby server’s Arbiter will lose contact with its
peer. Regardless of the detection mechanism, the decision will be made for the standby server to take over
control of the system from the currently active server. In addition, resolving this problem by interchanging
servers will preserve all calls within the system. ESS will not get involved with the recovery unless the
main server is a simplex server or the standby server is in an inoperable state.
Fault Example: Extended Control Network A Outage
Resolution: IPSI Interchange or Control Fall-back (depending on system configuration)
Description: When the control network A outage is detected, the system will attempt to transition its PN
control link from the active IPSI, which is connected through control network A, to the standby IPSI, which
is connected through control network B, provided the port network has duplicated IPSIs. If the IPSI
interchange fails or the standby IPSI does not exist, the system attempts to transition its PN control link
from an IPSI local in the PN to an IPSI which it can communicate with in another PN provided the PNC
supports control link tunneling (CSS or ATM PNC). If the problem is resolved by either of these methods,
all calls within affected port networks will be preserved. Only if both IPSIs experience extended network
faults simultaneously (control network A and B) and fall-back control is not supported through the PNC
(IP), would the system need ESS to take over control for that port network.
Fault Example: Short Control Network Outage
Resolution: EPN Warm Restart
Description: When a control network outage occurs and no other recovery mechanism can work around
the problem (discussed in the fault example above), the controlling server continuously attempts to
establish a new connection to the port networks. If the network outage is less than 60 seconds, the SPE will
reconnect to the PN’s IPSI and resynchronize resource allocation with the port network. This
resynchronization process will preserve all of the calls (both call state and bearer connection) currently
being supported by the PN. If the outage persists over an extended period of time, then the ESS would take over
control of the port network. It is important to ensure that the call preserving recovery has a chance to
complete before allowing ESS servers to assume control.
Fault Example: Software Fault
Resolution: System Restart or Server Interchange
Description: If a software fault occurs on the main active server, the CM application will attempt to initiate
a system restart that it deems necessary to resolve the software fault and, at the same time, inform the local
Arbiter. If the system restart fails to resolve the issue, the Arbiter may choose to interchange servers or to
initiate another system restart at a higher level. Depending on the restart level chosen to resolve the issue,
calls may (warm restart) or may not (cold restart) be preserved. ESS would enter the recovery solution
only if the software faults are severe enough to completely freeze both main servers preventing them from
communicating with IPSIs and issuing self reboot actions.
Fault Example: Complete IPSI Failure
Resolution: IPSI Interchange
Description: An IPSI is made up of a number of components including a PKTINT, an arch-angel, and a
tone clock. If a complete IPSI failure occurs causing everything on it to cease operation, then the system
will attempt to transition the control link to the standby IPSI. However, since the tone clock for the PN
also has failed, the IPSI interchange will not preserve calls. If the port network did not have duplicated
IPSIs then it would become inoperable. The IPSI is in charge of seeking out alternate ESS clusters when
needed, and if the IPSI is completely down, then ESS clusters will not be requested to take over. Also, even
if an ESS took over the IPSI, the tone clock is a critical component of a PN and without it the port network
cannot operate.
Fault Example: Complete Port Network Power Outage
Resolution: None
Description: Without power to the PN, it cannot provide any services to its endpoints. Until power is
restored, the resources provided to the system through the powerless port network are out of service. In
situations like this, ESS does not help. In addition, since no calls could exist within a powered off cabinet,
no calls are preserved when the power is restored to the cabinet.
One of the greatest assets of an Avaya Communication System is its ability to resolve faults that
occur. When the DEFINITY G3r system evolved into Avaya Communication Manager, additional
recovery mechanisms were introduced, but they work as add-ons to, not replacements for, the already
existing ones. ESS provides a new layer of system availability without compromising the tried-and-true
mechanisms that are already built into the system. If ESS is used to protect against the faults mentioned
above, then unnecessary resets will occur along with the possibility of unwanted system fragmentation.
Section 6 - How does ESS Work ?
Except for some basic initial configuration of the ESS servers, all system administration is done on the
main cluster. Once an ESS server informs the main server that it is present (via registration), the main
server synchronizes the entire system translation set to it (via file synch). This synchronization process
continues periodically through maintenance and/or on a demand basis to ensure that the ESS clusters have
updated translations. With these full system translations, the ESS servers have the information required to
connect to all IPSIs such as IP addresses, ports, QOS parameters, and encryption algorithms. Once the ESS
servers connect and authenticate with the IPSIs, they advertise administered factors which the IPSIs use to
calculate a priority score for the ESS cluster. The IPSI is always maintaining an ordered failover
preference list based on these priority scores. This priority list is dynamic based on network and server
conditions and only contains currently connected viable servers. Upon an IPSI losing connectivity to its
controlling cluster, it starts a no service timer. If a re-connection does not occur before the timer expires,
the IPSI will request service from the highest ranking viable ESS clusters in its preference list. This
algorithm prevents knee-jerk and non-fault control shifts, but the IPSI service requesting process can be
overridden by manual or scheduled agglomeration procedures.
The rest of this section examines in detail each of the operational steps of the ESS offering.
6.1 What is ESS Registration ?
When an ESS is installed into a system, it is configured with an IP address (local IP interface), a remote
registration IP address (CLAN gatekeeper), and a unique server ID (SVID). In addition, a license file will
be loaded which provides the ESS cluster with a unique module ID (otherwise known as a cluster ID or
CLID) and a system ID (SID). Concurrently, the ESS is added to the system translations on the main
server by creating a record with all of this information (IP address, CLID, SVID, and SID) along with
associated factors which will be described later.
Upon an initial start-up or reset, an ESS will send a registration packet (an RRQ) containing its configured
data to a remote registration IP address (port 1719). The initial time an ESS attempts to register with the
main server, it only tries at the originally configured registration IP address. However, after the initial
translation synchronization (discussed in the next section), the ESS first tries to register at the configured
registration IP address, but upon a failure, tries all possible registration points, regardless of network
region, administered within the system. Upon the main server receiving the registration request, it attempts
to validate the ESS server by ensuring that the IP address, CLID, and SVID match the administered data for
that ESS and that the SID provided by the ESS matches its own system ID. If it authenticates, the main
server will respond to the ESS server with a confirmation (an RCF) and continue handshaking with it. The
handshaking takes the form of heartbeats (KA-RRQs, a slightly modified version of RRQs) and heartbeat
responses (KA-RCFs, a slightly modified version of RCFs). Keep-alive handshaking is performed by every
ESS server on a periodic basis of once per minute. Since the keep-alive packets are very small, occur only
once per minute, and there are relatively few ESS clusters within a system (maximum of 63), the bandwidth
required for this is negligible (approximately 2.5 Kbps per ESS cluster). Figure #16 below shows the
registration link between the ESS clusters and the main cluster. If the ESS cluster consists of a duplicated
server, an S87XX media server, then only the active server within the cluster will register.
Figure 16 – ESS Registration Pathway
Currently, as the figure shows, CLANs are utilized as the registration entry point to the main servers. It is
important to note that CLANs in a port network controlled by an ESS cannot serve as a registration access
point for another ESS cluster within the system (ESS servers do not register with each other).
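The registration and keep-alive behavior described above can be summarized in pseudocode. The message names, port, and once-per-minute interval follow the text (RRQ/RCF to port 1719, KA-RRQ/KA-RCF heartbeats); the `ess` object and its helper methods are hypothetical stand-ins, not a real API.

```python
import time

KEEPALIVE_INTERVAL_SEC = 60  # KA-RRQ / KA-RCF exchange once per minute

def register_and_keep_alive(ess, registration_points):
    """Illustrative sketch of the ESS registration flow described above.

    Before the first translation synch the ESS only knows its configured
    registration address; after that synch it may try every administered
    CLAN, so `registration_points` holds one or many addresses.
    """
    for address in registration_points:
        # The main server validates IP address, CLID, SVID, and SID before
        # answering with a confirmation (RCF).
        rcf = ess.send_rrq(address, ip=ess.ip, clid=ess.clid,
                           svid=ess.svid, sid=ess.sid)
        if rcf is not None:
            break
    else:
        return False                     # registration failed at every point

    while True:                          # heartbeat handshaking (~2.5 Kbps per cluster)
        ess.send_ka_rrq(address, translation_status=ess.translation_status)
        ess.wait_for_ka_rcf()
        time.sleep(KEEPALIVE_INTERVAL_SEC)
```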
6.2 How are ESS Translations Updated ?
Other than for alarming reasons and providing a medium for issuing some remote commands (discussed
later), the main purpose of ESS registration is to allow the main server to know the translation status of
each ESS. Administration changes made on the main server are not updated in real time to the ESS
servers. Rather, the translation changes are synched to the ESS server on either an automated periodic
basis once per day or on demand with a manually entered command. Since the ESS offering is designed to
only protect against catastrophic failures, taking over port networks with slightly out-of-date translations is
completely valid. For example, if an end user has added a new feature button to his/her phone and a failure
occurs before periodic maintenance has updated the ESS server, then the phone operates as if no
administration actions took place (e.g. no new feature button) when it is taken over by the ESS server.
Whenever synchronization between the main server and the ESS server is requested, the main server
checks to see if it is necessary based on the current translation state of the ESS (which was provided in its
registration heartbeats). If the ESS server has translations that are identical to the main server’s, then the
actual synchronization process is skipped. However, if the ESS translations are not up to date with the
main server, the main server synchronizes the ESS server by supplying the complete system translation file
or by providing the changes since the last synchronization (incremental file synch). Incremental file synchs
require the ESS cluster to be at most one translation synchronization behind since the changes need to be
applied to identical images. This implies that full translation files need to be synched whenever initial
startup occurs or if the ESS server has been unregistered for an extended period of time. Obviously the
preferred method is to only transmit the changes, since the amount of data that needs to be sent is relatively
small (very dependent on the number of changes, but usually in the hundreds of kilobytes) as opposed to
transmitting the entire system translation file, which is relatively large (very dependent on the size of the
system, but usually ranging from 2 to 30 megabytes and compressible at approximately a 5:1 ratio).
The bandwidth required for this process is dictated by the rule that the synch process must be completed
within 5 minutes. Therefore, the bandwidth required in the worst cases (largest systems synching full
translation sets which are compressed) is approximately 200 Kbps over that 5 minute period.
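To see where the roughly 200 Kbps worst-case figure comes from, the arithmetic below uses the largest quoted translation file (30 megabytes), the approximate 5:1 compression ratio, and the 5-minute completion rule; it is an illustration of the quoted numbers, not a measured value.

```python
# Worst-case file synch bandwidth, using the figures quoted above.
full_translation_mb = 30          # largest quoted system translation file
compression_ratio   = 5           # ~5:1 compression
window_seconds      = 5 * 60      # synch must complete within 5 minutes

compressed_bits = full_translation_mb / compression_ratio * 8 * 1_000_000
print(f"~{compressed_bits / window_seconds / 1000:.0f} Kbps")
# ~160 Kbps of payload, i.e. on the order of the 200 Kbps planning figure
# once protocol overhead is included.
```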
ESS servers are operational at all times proving their viability by running the CM software, monitoring
system state-of-health, and handshaking with the IPSIs. After translation synchronization completes, the
ESS server must stop running on the old translation set and begin executing on the new set. In order to
achieve this, the ESS is required to perform a reboot (as described previously in the What are SPE Restarts
section). Therefore, the entire synch process, under normal conditions, can be summarized by the
following steps:
1. The main server creates a compressed file consisting of the admin changes since the last synch.
2. This file is transferred from the main server to the ESS server.
3. The ESS server applies the changes given to it onto its existing translation file.
4. The ESS initiates a self reboot in order to begin running on the new translations.
However, based on the current state of the system, some of these steps may be altered. The first step may
require an entire system file transfer as opposed to just the changes, which causes the third step to swap out
the files rather than applying changes. Rebooting an ESS server which is not currently controlling any port
network gateways does not affect end-users whatsoever, other than the server not being available as an
alternative source of control if needed during the reboot cycle (approximately 2-3 minutes). On the other hand, if the
ESS is actively controlling port networks, then a reboot of an ESS server would cause an outage to all of
the supported end-users. Therefore, the fourth step of the process will be delayed until the ESS server is no
longer controlling any end-users or gateways. That being said, administration changes made on the main
server will not be made available to an ESS which is actively controlling resources even if translation
synchronization occurs. Another consequence of the fourth step is that an excessive number of demand
synchronization requests will cause an ESS to go through many resets. While the resets themselves do not
affect system operation or performance, the more time ESS servers spend resetting, the less time they are
available as an alternate control source. Therefore, demand synchronizations, via the “save translations
ess” command, should only be performed after critical administration has been completed which is
absolutely needed in catastrophic failure modes and cannot wait for the automated synch process.
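A sketch of the synchronization decision and the four steps above, under the assumptions stated in this section: helper names are hypothetical, and the reboot is deferred while the ESS is controlling resources.

```python
def synchronize_ess(main, ess):
    """Illustrative translation synch flow; object and method names are hypothetical."""
    # Skip the synch entirely if the ESS already reports identical translations
    # in its registration heartbeats.
    if ess.translation_version == main.translation_version:
        return

    if ess.translation_version == main.previous_translation_version:
        # Incremental file synch: only the changes since the last synch
        # (typically hundreds of kilobytes).
        payload = main.build_incremental_changes()
        ess.apply_changes(payload)
    else:
        # Full file synch: the complete system translation set (2-30 MB,
        # compressed roughly 5:1), needed after initial startup or after the
        # ESS has been unregistered for an extended period.
        payload = main.build_full_translation_file()
        ess.replace_translations(payload)

    # Step 4: the ESS must reboot to run on the new translations, but only
    # while it is not controlling any port networks or end-users.
    if not ess.is_controlling_resources():
        ess.reboot()          # ~2-3 minutes of unavailability as a backup
```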
The license file installed on the ESS server gives the CM software running on it the ability to receive file
synchronizations from the main server. However, making administration changes locally on the ESS is not
prohibited, although there are three important caveats to take into account. The first caveat is that any
administration changes made on the ESS are lost whenever translation synchronization takes place,
followed by the subsequent reboot. Therefore, administration changes can be made to an ESS server while
it is not in control of any resources and will be effective if the ESS needs to take over any port
networks, but all of the changes are lost when periodic or demand translation synchronization takes place.
The second caveat is that translation changes made on an ESS cannot be re-synchronized back to the main
server. Hence, administration changes made locally to an ESS server which is controlling end-points will
not be reflected in the overall system translations and will only be effective until the main server resumes
control of the end-points again. The last caveat is that translation changes made locally to an ESS cannot
be saved to disk. Since an ESS server is just as powerful as a main server and has identical RTUs, a
customer could conceivably purchase an ESS server and illegally make it into its own stand-alone system at
a substantial discount. To prevent this from happening, two major blockades are introduced: an ESS server
enters "no license" mode whenever it controls resources, which gives the server a limited operational
lifespan, and the ability to save translations to disk on ESS servers is eliminated, which puts all
administration changes at risk if a system reboot occurs (which is what happens when the "no license"
mode timer expires).
The figure below shows the network interconnectivity required to have ESS servers get synchronizations
from the main server. It is important to note that the secure transfer of the translation file or translation
changes file takes a drastically different path than the registration link. File synchs are done directly
between media servers (utilizing port 21874) without going through CLANs. Also, the diagram shows two
separate, distinct networks, the control LAN and the corporate LAN, but they can actually be implemented
as one and the same.
Figure 17 – ESS Translation Synchronization and Registration Pathways
Moreover, the diagram is leveraging a simplex S8500 Media Server as the main cluster and a simplex
S8500 Media Server as the ESS cluster. If either of the clusters consisted of duplex S87XX Media Servers,
then the translation synchronization would only happen between the active media servers. In a duplex media
server cluster, it is the responsibility of the active SPE to update the standby SPE translations.
Unfortunately, there is a negative implication of synchronizing translations on a periodic basis as opposed
to being done in real time – some phone state information will be lost. There is a class of features that
end-users can execute which change the state of their phone and the associated phone’s translation
information. Two primary examples of these types of features are send-all-calls and EC500 activation. If
an end-user changes the state of these features (either enabling or disabling) and a failure occurs causing an
ESS to takeover before the translation changes have been sent to the ESS server, the feature states will
revert to their previous state. For example, imagine a scenario whereby a user hits their send-all-calls
feature button in the morning and then goes on vacation. If a catastrophic failure occurs that afternoon
before a translation synch occurs and the end-user is transferred to an ESS, then the send-all-calls feature
will be deactivated (its previous state). Any calls that come to that user while being controlled by the ESS
are not immediately sent to coverage.
6.3 How do Media Servers and IPSIs Communicate ?
Once an ESS server receives a translation set from the main server, it has all the information needed in
order to connect to all the IPSIs within a system. In other words, once an ESS server has each IPSI's IP
address & listen port and each IPSI's encryption algorithm settings & QOS parameters, it will initiate a
TCP socket connection to all IPSIs (utilizing port 5010). If a connection is unsuccessful, the server will
periodically attempt a new connection every 30 seconds; otherwise the authentication process commences.
The authentication process consists of the validation of key messages based on an encryption method
commonly known between the servers and the IPSIs. Once the server and IPSI can communicate securely
over this established TCP link, the ESS server uniquely identifies itself (via cluster ID and server ID).
Following that, the IPSI will request priority factors (discussed in the next section) from the connecting
servers in order to calculate a priority score (discussed in the next section) for that server which will be
used to create the IPSI’s self-generated priority list (discussed below). At a high level, with the details to
follow, the IPSI receives pre-programmed factors from each ESS server that it will use to calculate a score
for that ESS cluster. The higher the score an IPSI assigns for an ESS cluster, the higher the ESS cluster
resides in the failover preference list. The following flow chart shows the initial message sequence that the
server and the IPSI go through in order to establish a session.
Figure 18 – Message Flow upon Server-IPSI Connection (Open TCP Socket; TCP Socket Up Confirmation; Authentication & Encryption Setup; Authentication & Encryption Completion; ESS Indication Message; Factors Request Message; Factors Message; Notification Update Message, which is sent to all media servers currently connected to the IPSI)
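The session establishment sequence of Figure 18 can be sketched as follows. The port number and the 30-second retry come from the text; the object and method names are illustrative assumptions rather than the actual server or IPSI interfaces.

```python
def connect_to_ipsi(server, ipsi_address):
    """Sketch of the session establishment sequence shown in Figure 18.

    Helper calls are hypothetical; the real exchange runs over TCP port 5010
    with retries every 30 seconds if the connection attempt fails.
    """
    sock = server.open_tcp_socket(ipsi_address, port=5010)
    if sock is None:
        server.retry_in(seconds=30)          # periodic reconnection attempts
        return None

    sock.negotiate_authentication_and_encryption()
    sock.send_ess_indication(clid=server.clid, svid=server.svid)

    # The IPSI asks for the administered factors it needs to compute a
    # priority score for this cluster, then pushes its current priority list
    # to every connected media server.
    factors_request = sock.receive()
    sock.send_factors(server.advertised_factors())
    priority_list = sock.receive_notification_update()
    return priority_list
```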
After the socket has been brought up between the IPSI and a non-controlling ESS server, there are three
tasks that occur. First, the ESS server needs to constantly monitor the health of the IP connection, which is
achieved by doing an application level keep-alive once per second. These heartbeat messages
can easily be seen by monitoring traffic on the control network and have an average size of 10 bytes (8
bytes for server to IPSI heartbeats and 12 bytes for IPSI to server response heartbeats). This implies, since
these messages are sent as minimally sized TCP packets, that there is a static bandwidth of approximately 1
Kbits per second required between the servers and IPSIs for these operations.
Secondly, the IPSI is responsible for keeping every connected server up to date with respect to its own self-generated priority preference list. This is done by sending priority list notification messages whenever a
change occurs (new server connecting, existing server disconnecting, or change in controlling cluster). An
ESS server gets the current priority list upon its initial connection to the IPSI and then gets any subsequent
updates. For example, if an ESS server disconnects from an IPSI for any reason, the IPSI sends updated
priority lists, with that ESS server removed, to all servers which are still connected. If the disconnected
ESS reconnects to the IPSI the same procedure is followed whereby the reconnecting ESS is inserted
appropriately back into the priority list and all connected servers receive this update. The notification
message size varies based on the number of ESS clusters and the ESS cluster types (simplex or duplex),
from a minimum of 8 bytes to a maximum of 68 bytes. Since priority preference lists do not change on a
periodic basis, only on network failures, ESS server failures, or certain ESS maintenance operations and the
notifications are relatively small, virtually no static bandwidth is required for this operation.
The third responsibility is to ensure there is always a socket up between servers and IPSIs. This implies
that upon a connectivity failure, the server attempts to restore a TCP connection. If there is a socket failure
the server will periodically try to bring up a new socket every 30 seconds. As Figure 18 above shows,
there are initial handshakes that occur when a socket is established. This exchange comprises
approximately 512 bytes. However, since socket connections only go down during failure scenarios and
this should not occur with any frequency, there is no static bandwidth required for this operation.
To summarize, the bandwidth required between ESS servers and IPSIs during normal operation (main
server in control of the entire system) is only 1 Kbits per second. During failure scenarios there is a slight
spike in traffic between servers and IPSIs. It is important to realize however, that the 1 Kbits / second
bandwidth requirement is for non-controlling servers only. The bandwidth required between a controlling
server and an IPSI is approximately 64 Kbits / second per 1000 busy hour call completions supported by
that IPSI. When designing a system, enough bandwidth must be available between ESS servers and port
networks in case the ESS is required to take over port network control. In other words, even though only 1
Kbit per second is needed between ESS servers and IPSIs during normal operation, the network
bandwidth requirements must be established assuming the ESS will be controlling port networks.
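A small illustration of the planning rule above: provision for the controlling-server rate (approximately 64 Kbps per 1000 busy hour call completions per IPSI), not the 1 Kbps idle keep-alive rate. The function is a rule-of-thumb sketch using the figures quoted in this section.

```python
def required_bandwidth_kbps(busy_hour_call_completions: int,
                            controlling: bool = True) -> float:
    """Rule-of-thumb bandwidth between a media server and one IPSI.

    Figures are the approximations quoted above: ~1 Kbps for a
    non-controlling (keep-alive only) server, ~64 Kbps per 1000 BHCC
    for a controlling server.
    """
    if not controlling:
        return 1.0
    return 64.0 * (busy_hour_call_completions / 1000.0)

# Plan the WAN link to an EPN handling 5000 BHCC as if the ESS will control it:
print(required_bandwidth_kbps(5000))                      # 320.0 Kbps
print(required_bandwidth_kbps(5000, controlling=False))   # 1.0 Kbps (normal operation)
```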
6.4 What is a Priority Score ?
Any disaster recovery plan worth its salt provides survivability for critical components in a
deterministic manner. ESS addresses this by having each critical component (a port network gateway)
generate, via its resident IPSI, a ranked ordering of alternate control sources (ESS clusters). The IPSI ranks
connecting ESS servers using a priority score calculated from factors provided by the ESS clusters.
The most basic way to have a system react to catastrophic faults is to have all IPSIs share the same failover
preference list. To achieve this, each ESS cluster can be assigned a base score value, between 1 and 100,
which it will advertise to the IPSIs upon a connection. Suppose there are six port network gateways within
a system distributed in three separate locations as shown in the figure below.
Figure 19 – Basic System Layout
If the disaster recovery plan states that all the port networks should attempt to get service from ESS #20,
then from ESS #30, and then finally from ESS #40 in failure situations, the ESS clusters can be given the
following base scores:
ESS CLID
Base Score
20
75
30
50
40
25
Table 5 – ESS Cluster Admin
28
The IPSIs will rank the ESS clusters based on their base scores (the higher the better) and in this example each
IPSI will have identical priority lists as follows:
Port Network #:     PN #1   PN #2   PN #3   PN #4   PN #5   PN #6
1st Alternative:    #20     #20     #20     #20     #20     #20
2nd Alternative:    #30     #30     #30     #30     #30     #30
3rd Alternative:    #40     #40     #40     #40     #40     #40
Table 6 – IPSI Priority Lists
However, the network design may not make this feasible. For example, suppose the WAN links
interconnecting the locations are limited and attempts should be made to avoid sending control traffic over
them unless absolutely necessary. This requirement can be realized if the IPSIs generate the following
lists:
Port Network #:     PN #1   PN #2   PN #3   PN #4   PN #5   PN #6
1st Alternative:    #20     #20     #20     #30     #30     #40
2nd Alternative:    #30     #30     #30     #20     #20     #20
3rd Alternative:    #40     #40     #40     #40     #40     #30
Table 7 – IPSI Priority Lists
To achieve this, ESS introduces two new concepts – communities and local preference. Every ESS server
in the system is assigned to a community (with a default of 1) and every IPSI is assigned to a community
(with a default of 1). By definition, if an ESS server is assigned to the same community as an IPSI, then
the ESS server and IPSI are in the same community or local to each other. The local preference attribute
can be assigned to an ESS server which can boost the priority score calculated for it by an IPSI if it is local
to it. The priority score (PS) calculation is written generically as:
PS = Σ_{i=1..N} (w_i × v_i)
Calculation 1 – Generic Priority Score Calculation
with
Index (i)   Factor Type              Factor Weight (w)   Factor Value (v)
1           base score               1                   Administered base score (1-100)
2           local preference boost   250                 0 – if not within same community or boost is not enabled
                                                         1 – if in the same community and boost is enabled
Table 8 – Factor Definitions
The 250 weight for the local preference boost guarantees that an ESS with this boost enabled will, if it is
local to the IPSI, rank higher than even the ESS with the highest base score. If each ESS cluster in this
example has the local preference boost enabled, as shown in Table 9 below, then the IPSIs will all
generate preference lists as shown above in Table 7.
ESS CLID    Base Score   Locally Preferred   Community
20          75           Yes                 1
30          50           Yes                 2
40          25           Yes                 3
Table 9 – ESS Cluster Admin
With this administration, it is important to realize that upon a catastrophic main server fault three
autonomous systems will be created. All of the IPSIs in community #1 will failover to ESS #20 which is in
community #1, all of the IPSIs in community #2 will failover to ESS #30 which is in community #2, and
the IPSI in community #3 will failover to ESS #40. If an additional disaster recovery plan requirement is
added stating that the system should attempt to remain as one system as long as possible, then ESS #10 can
be added to the system layout.
Figure 20 – System Layout with System Preferred ESS
However, even if ESS #10 is given a base score of 100, the IPSIs' priority lists will unfortunately be as
follows:
Port Network #:     PN #1   PN #2   PN #3   PN #4   PN #5   PN #6
1st Alternative:    #20     #20     #20     #30     #30     #40
2nd Alternative:    #10     #10     #10     #10     #10     #10
3rd Alternative:    #30     #30     #30     #20     #20     #20
4th Alternative:    #40     #40     #40     #40     #40     #30
Table 10 – Undesired IPSI Priority Lists
Even though the desired IPSI priority lists should be:
Port Network #:     PN #1   PN #2   PN #3   PN #4   PN #5   PN #6
1st Alternative:    #10     #10     #10     #10     #10     #10
2nd Alternative:    #20     #20     #20     #30     #30     #40
3rd Alternative:    #30     #30     #30     #20     #20     #20
4th Alternative:    #40     #40     #40     #40     #40     #30
Table 11 – Desired IPSI Priority Lists
The desired IPSI priority lists cause the IPSIs to attempt to transfer over to ESS #10 in the case of a main
cluster failure and then fail over to local service if ESS #10 is unable to provide adequate control. To
achieve this, ESS #10 must be given the system preferred attribute. This attribute boosts the priority
score calculated for a system preferred ESS above any locally preferred ESS clusters that are local to the
IPSI. This factor fits into the factor definitions as:
Index (i)   Factor Type               Factor Weight (w)   Factor Value (v)
1           base score                1                   Administered base score (1-100)
2           local preference boost    250                 0 – if not within same community or boost is not enabled
                                                          1 – if in the same community and boost is enabled
3           system preference boost   500                 0 – if boost is not enabled
                                                          1 – if boost is enabled
Table 12 – Factor Definitions
The 500 weight obviously allows system preferred ESS clusters to rank higher than all other clusters (even
one that has a 100 base score and is locally preferred, 350 < 500). It is important to understand that the
only need for assigning the system preferred attribute is to compensate for local preference attribute usage.
In other words, if no ESS servers are given the locally preferred setting, then there is no need to use the
system preferred attribute.
For simplicity, the IPSI was designed to treat all servers, main and ESS, consistently when calculating
preference scores. This leads to the last factor that needs to be discussed – the main cluster boost. Since
priority preference lists include the main cluster, and the main cluster should always be ranked the highest
(under normal conditions, the IPSIs should be receiving their control from the main cluster), the main
cluster is given a boost large enough to outweigh every other factor.
Index (i)   Factor Type               Factor Weight (w)   Factor Value (v)
1           base score                1                   Administered base score (1-100)
2           local preference boost    250                 0 – if not within same community or boost is not enabled
                                                          1 – if in the same community and boost is enabled
3           system preference boost   500                 0 – if boost is not enabled
                                                          1 – if boost is enabled
4           main cluster boost        1000                0 – if an ESS cluster
                                                          1 – if the main cluster
Table 13 – Factor Definitions
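The full score calculation can be written directly from Calculation 1 and Table 13. The sketch below is an illustration of that arithmetic, not the IPSI firmware; the function and parameter names are assumptions.

```python
def priority_score(base_score: int,
                   same_community: bool,
                   locally_preferred: bool,
                   system_preferred: bool,
                   is_main_cluster: bool) -> int:
    """PS = sum(w_i * v_i) using the weights from Table 13."""
    score = 1 * base_score                                               # base score, weight 1
    score += 250 * (1 if (same_community and locally_preferred) else 0)  # local preference boost
    score += 500 * (1 if system_preferred else 0)                        # system preference boost
    score += 1000 * (1 if is_main_cluster else 0)                        # main cluster boost
    return score

# A system preferred ESS (e.g. ESS #10, base score 100) outranks a locally
# preferred ESS with the maximum base score (100 + 500 = 600 versus
# 100 + 250 = 350), and the main cluster outranks everything.
print(priority_score(100, False, False, True, False))   # 600
print(priority_score(100, True, True, False, False))    # 350
```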
Therefore, the disaster recovery requirements are achieved by the following administration (the main cluster is
always assigned a CLID of 1):
ESS CLID   Base Score   Main   System Preferred   Locally Preferred   Community
1          x            Yes    x                  x                   1
10         100          No     Yes                x                   1
20         75           No     No                 Yes                 1
30         50           No     No                 Yes                 2
40         25           No     No                 Yes                 3
Table 14 – ESS Cluster Admin
Which results in the following lists generated independently by each IPSI:
Port Network #:     PN #1   PN #2   PN #3   PN #4   PN #5   PN #6
1st Alternative:    #1      #1      #1      #1      #1      #1
2nd Alternative:    #10     #10     #10     #10     #10     #10
3rd Alternative:    #20     #20     #20     #30     #30     #40
4th Alternative:    #30     #30     #30     #20     #20     #20
5th Alternative:    #40     #40     #40     #40     #40     #30
Table 15 – IPSI Priority Lists
To throw a final wrench into the picture, suppose the WAN link between locations 2 and 3 in Figure #20 is
extremely limited and there is only enough bandwidth for one control link to traverse it. This implies ESS
#40 cannot be an alternate control source for any port network gateway except for port network #6. This
requirement manifests itself in the following preference lists:
Port Network #:     PN #1   PN #2   PN #3   PN #4   PN #5   PN #6
1st Alternative:    #1      #1      #1      #1      #1      #1
2nd Alternative:    #10     #10     #10     #10     #10     #10
3rd Alternative:    #20     #20     #20     #30     #30     #40
4th Alternative:    #30     #30     #30     #20     #20     #20
5th Alternative:    x       x       x       x       x       #30
Table 16 – IPSI Priority Lists
ESS #40 must be administered to only offer its services to IPSIs within the same community or, stated
differently, to provide service locally only, which is realized by giving ESS #40 the local only attribute. This
attribute is unlike the other factors in that there is no boost in score; rather, if this attribute is
enabled for an ESS, that ESS server does not even attempt to contact IPSIs which are not within its own
community and thereby never gets onto non-local IPSI priority lists. The administration to achieve the
priority lists in Table 16 above is:
ESS CLID   Base Score   Main   Sys Preferred   Locally Preferred   Community   Local Only
1          x            Yes    x               x                   1           x
10         100          No     Yes             x                   1           No
20         75           No     No              Yes                 1           No
30         50           No     No              Yes                 2           No
40         25           No     No              Yes                 3           Yes
Table 17 – ESS Cluster Admin
It is the combination of these basic factors which gives the ESS offering the ability to achieve very
complicated deterministic failovers for IPSIs while remaining easy to administer. While every IPSI
independently generates its own priority list, some commonalities will exist among all of them. The IPSI's
priority list can be broken down generically as follows:
Figure 21 – Generic IPSI Priority Lists (top entry: the main server; followed by a section that is common among all IPSIs; then a section that is possibly different among all IPSIs)
The break line between "common among all IPSIs" and "possibly different among all IPSIs" is based
on the number of system preferred ESS clusters administered and the number of ESS clusters which have
the local preference attribute enabled. If the local preference attribute is not utilized, then the last section of the
priority lists, “possibly different among all IPSIs,” would always be empty.
6.5 How do IPSIs Manage Priority Lists ?
The previous section discussed how IPSIs generate their priority failover lists in non-faulted environments
(every ESS can connect to every IPSI). Whenever a new cluster connects to an IPSI, a priority score is
calculated for that cluster based on the advertised factors. The IPSI then dynamically inserts the cluster
into its priority list based on the priority score and informs all currently connected ESS servers about the
updates. Whenever an IPSI cannot communicate with a media server, due to either a server fault or a network
fragmentation, that server's cluster is dynamically removed from the priority list and all ESS servers still
connected will receive updates.
Figure 22 – IPSI Interfacing with Multiple ESS Clusters (EPN #1 on an IP LAN with the main cluster (CLID #1), ESS #10, ESS #20, and ESS #30)
Using the diagram above, under non-faulted environments, the IPSI’s priority failover list will be:
                    PN #1
1st Alternative:    #1
2nd Alternative:    #10
3rd Alternative:    #20
4th Alternative:    #30
Table 18 – IPSI Priority Lists
If a communication failure is detected (discussed in the next section) between the IPSI and, for this
example, ESS #20, then the IPSI would remove ESS #20 from its priority list.
                    PN #1
1st Alternative:    #1
2nd Alternative:    #10
3rd Alternative:    #30
4th Alternative:    none
Table 19 – IPSI Priority Lists
Once the communication between the IPSI and the ESS is restored, the IPSI will insert it back into the list
appropriately.
                    PN #1
1st Alternative:    #1
2nd Alternative:    #10
3rd Alternative:    #20
4th Alternative:    #30
Table 20 – IPSI Priority Lists
It should be obvious that these concepts are easily extended to the addition/removal and enabling/disabling
of ESS servers. For example, if an ESS server is added to the system (ESS #40 with a base score of 60), it
will be inserted into the IPSI’s list appropriately.
                    PN #1
1st Alternative:    #1
2nd Alternative:    #10
3rd Alternative:    #40
4th Alternative:    #20
5th Alternative:    #30
Table 21 – IPSI Priority Lists
Since the IPSIs do not keep any type of historical records of previously connected ESS clusters, the system
survivability plans can be altered at any time (e.g. move ESS clusters around strategically) or implemented
over time (e.g. no flash cuts are required).
While this list manipulation is very simple, the complexity occurs when port networks have duplicated
IPSIs. It is absolutely critical that a pair of IPSIs in the same PN share identical priority lists. This is
achieved by communication between the IPSI pairs over the carrier’s backplane. In a stable state shown in
Figure 23, the IPSIs will each have priority lists of:
                    A-side IPSI   B-side IPSI
1st Alternative:    #1            #1
2nd Alternative:    #10           #10
3rd Alternative:    #20           #20
Table 22 – IPSI Priority Lists
Figure 23 – Port Network with Duplicated IPSI and Multiple ESS Clusters (EPN #1 with IPSI – A on Control Network A and IPSI – B on Control Network B, plus the main cluster (CLID #1), ESS #10, and ESS #20)
If a failure occurs which causes ESS #10 to be unable to keep a session up with the B-side IPSI, the
following steps take place:
1. Socket failure is detected by the B-side IPSI.
2. The B-side IPSI informs the A-side IPSI and checks if ESS #10 is still connected to the A-side IPSI (which it is).
3. The B-side IPSI keeps ESS #10 in its list because the A-side IPSI is still connected to ESS #10.
                    A-side IPSI   B-side IPSI
1st Alternative:    #1            #1
2nd Alternative:    #10           #10
3rd Alternative:    #20           #20
Table 23 – IPSI Priority Lists
If the A-side IPSI then loses its connection to ESS #10, the following procedural steps take place:
1. Socket failure is detected by the A-side IPSI.
2. The A-side IPSI informs the B-side IPSI and checks if ESS #10 is connected to the B-side IPSI (which it is not).
3. The A-side IPSI removes ESS #10 from its priority list because the cluster is not connected to itself or to its peer.
4. The B-side IPSI removes ESS #10 from its priority list for the same reasons.
                    A-side IPSI   B-side IPSI
1st Alternative:    #1            #1
2nd Alternative:    #20           #20
3rd Alternative:    none          none
Table 24 – IPSI Priority Lists
The conclusion of this should be that an ESS server does not get eliminated from either IPSI's list in a PN
unless it cannot communicate with either of the IPSIs. This concept is further extended to cover the case
where the ESS cluster has duplicated servers. As long as at least one server from an ESS cluster can
communicate with at least one IPSI in a port network, it will remain on the priority lists of both IPSIs in the
PN.
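A sketch of the peer-check rule just described: an ESS cluster is dropped from a PN's priority lists only when neither IPSI in the pair still has a working connection to any server in that cluster. The object and method names are illustrative assumptions, not the IPSI firmware interface.

```python
def handle_socket_failure(local_ipsi, peer_ipsi, ess_clid: int) -> None:
    """Illustrative logic for a duplicated-IPSI port network.

    When one IPSI loses its socket to an ESS cluster, it asks its peer
    (over the carrier backplane) whether that cluster is still connected
    before deciding to remove it from the shared priority list.
    """
    local_ipsi.mark_disconnected(ess_clid)

    if peer_ipsi is not None and peer_ipsi.is_connected(ess_clid):
        # Keep the cluster on both lists; the PN as a whole can still
        # reach at least one server in that ESS cluster.
        return

    # Neither IPSI can reach the cluster: remove it from both priority
    # lists and notify all servers that are still connected.
    local_ipsi.remove_from_priority_list(ess_clid)
    if peer_ipsi is not None:
        peer_ipsi.remove_from_priority_list(ess_clid)
    local_ipsi.send_notification_updates()
```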
The final caveat surrounding the management of the IPSI's priority lists deals with the maximum size of
these lists. If there were no cost associated with an ESS being on an IPSI's list, then it would make sense to
have maximum list sizes of 64 clusters (a maximum of 63 ESS clusters along with the main cluster). However,
there are a number of costs, albeit small, of having an ESS on an IPSI's priority list. The first cost is the
number of heartbeat messages that are continuously being exchanged between the ESS cluster and the IPSI.
While each message is small and only occurs once per second, in a mesh connectivity configuration (every
ESS server connected to every IPSI) unnecessary traffic is introduced onto the network. Secondly, the IPSI
needs to spend precious resource cycles and memory managing the ESS clusters on its list. And finally,
since every connected server gets notified whenever any changes occur, a very large number of ESS clusters
on a list increases the probability of changes occurring and increases the number of notification messages that
are sent out when these changes happen.
is 8 clusters. In other words, the IPSIs will maintain a priority list, under normal conditions, which
contains the main cluster and the 7 best (based on calculated priority scores) ESS alternatives. This leads to
the question of what happens to all the other ESS clusters in the system with respect to that IPSI. The
answer is that if an ESS’s calculated priority score does not rank it high enough to be within the top 8, then
it will be rejected. If an ESS is rejected from an IPSI it will disconnect from it and start a 15 minute timer.
Once the timer expires, the ESS will re-attempt to get onto the IPSI’s priority list.
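The size limit and rejection behavior can be sketched as shown below; the 8-entry cap and the 15-minute retry timer come from the text, while the data structures and function name are illustrative assumptions.

```python
import bisect

MAX_LIST_SIZE = 8            # main cluster plus up to 7 ESS alternatives
REJECT_RETRY_MINUTES = 15    # a rejected ESS retries after this timer expires

def try_insert(priority_list: list[tuple[int, int]], clid: int, score: int):
    """Insert (score, clid) into an IPSI's priority list, highest score first.

    Returns the CLID that ends up rejected (the newcomer or the displaced
    lowest-ranked entry), or None if everything fits. Illustrative only.
    """
    entry = (-score, clid)                   # negate so bisect keeps high scores first
    bisect.insort(priority_list, entry)
    if len(priority_list) <= MAX_LIST_SIZE:
        return None
    rejected = priority_list.pop()           # the lowest ranked alternative is rejected
    return rejected[1]                       # that cluster disconnects and retries later
```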
Figure 24 – IPSI Interfacing with More Than 8 ESS Clusters (EPN #1 on an IP LAN with the main cluster (CLID #1) and ESS #10 through ESS #90)
The figure above shows a system with 9 ESS servers. While there is no correlation between CLID and priority ranking (except for the main cluster, which is always CLID #1), this example will assume administration which causes lower CLIDs to have higher priority scores. This gives the IPSI the following priority list:
                    PN #1
1st Alternative:    #1
2nd Alternative:    #10
3rd Alternative:    #20
4th Alternative:    #30
5th Alternative:    #40
6th Alternative:    #50
7th Alternative:    #60
8th Alternative:    #70
Table 25 – IPSI Priority Lists
Notice that ESS #80 and ESS #90 are not on the IPSI's list even though network connectivity exists; they are in a rejected state with respect to this IPSI. After the reject timer (15 minutes) expires, ESS #80 and #90 will attempt to re-connect to the IPSI only to reach the same result – they get rejected again. However, suppose
that some network fragmentation fault takes place which prevents ESS #30, #40, and #50 from
communicating with the IPSI. This will leave the IPSI’s list looking as follows:
                    PN #1
1st Alternative:    #1
2nd Alternative:    #10
3rd Alternative:    #20
4th Alternative:    #60
5th Alternative:    #70
6th Alternative:    none
7th Alternative:    none
8th Alternative:    none
Table 26 – IPSI Priority Lists
If the network connectivity is not restored within 15 minutes, the reject timers for ESS #80 and #90 will expire and both clusters will attempt to reconnect to the IPSI. Unlike the previous time, however, the IPSI will insert them on the list as the 6th and 7th alternatives respectively.
                    PN #1
1st Alternative:    #1
2nd Alternative:    #10
3rd Alternative:    #20
4th Alternative:    #60
5th Alternative:    #70
6th Alternative:    #80
7th Alternative:    #90
8th Alternative:    none
Table 27 – IPSI Priority Lists
When the network faults are resolved, the isolated ESS clusters will reconnect with the IPSI and the IPSI will insert them onto its lists. When the first ESS returns, it will be inserted as the 4th alternative and all the lower-ranked alternatives will be shifted down. However, when the next ESS server reconnects, the IPSI will already have a full list. This is resolved by having the IPSI reject the lowest-ranked alternative, in this case ESS #90, and then insert the returning ESS cluster appropriately. This procedure continues until the priority list returns to its original state (a sketch of this list management follows).
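To make the bookkeeping above concrete, the following is a minimal, illustrative Python sketch of the list management just described: a maximum of eight entries, rejection of lower-scoring clusters, a 15-minute retry by a rejected cluster, and eviction of the lowest-ranked alternative when a higher-scoring cluster returns. The class, method names, and scoring values are invented for this paper and are not Avaya code.

# Illustrative model of an IPSI's priority list (hypothetical names, not Avaya code).

MAX_LIST_SIZE = 8          # main cluster + 7 best ESS alternatives
REJECT_RETRY_SECONDS = 15 * 60

class IpsiPriorityList:
    def __init__(self):
        # Maps cluster id (CLID) -> calculated priority score; higher score = higher rank.
        self.entries = {}

    def ranked(self):
        """Return CLIDs ordered from 1st alternative downward."""
        return sorted(self.entries, key=lambda clid: self.entries[clid], reverse=True)

    def request_insert(self, clid, score):
        """A cluster asks to be placed on the list.

        Returns True if accepted, False if rejected (the rejected ESS would
        disconnect and retry after REJECT_RETRY_SECONDS)."""
        self.entries[clid] = score
        if len(self.entries) <= MAX_LIST_SIZE:
            return True
        # List is over-full: evict the lowest-ranked entry (possibly the newcomer).
        lowest = self.ranked()[-1]
        del self.entries[lowest]
        return lowest != clid

    def remove(self, clid):
        """Drop a cluster that can no longer reach either IPSI in the PN."""
        self.entries.pop(clid, None)


# Example mirroring Tables 25-27: nine ESS clusters contend for seven alternative slots.
ipsi = IpsiPriorityList()
ipsi.request_insert(1, score=1000)                       # main cluster, always ranked first
for n, clid in enumerate(range(10, 100, 10)):            # ESS #10 ... ESS #90
    accepted = ipsi.request_insert(clid, score=900 - n)  # lower CLID -> higher score, per the example
    if not accepted:
        print(f"ESS #{clid} rejected; will retry in {REJECT_RETRY_SECONDS} s")
print("Priority list:", ipsi.ranked())                   # [1, 10, 20, 30, 40, 50, 60, 70]

Running the example produces the Table 25 list, with ESS #80 and #90 rejected exactly as in the walkthrough above.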
6.6 How are Communication Faults Detected ?
After a media server connects to an IPSI and is inserted onto the priority list, heartbeat handshaking
commences. These heartbeats are generated from the media server once per second and sent to the IPSI.
Upon reception of a media server’s heartbeat, the IPSI will generate a heartbeat response and send it back
to the media server. An application-level keep-alive is used because quick detection of communication failures is needed. A communication failure is therefore detected either by an explicit closure of the TCP socket or by a period of time passing without a heartbeat being received.
From the server’s perspective, a heartbeat response is expected from the IPSI in response to the periodically
generated heartbeats. Under normal conditions, this response is received within a few milliseconds of
sending the heartbeat message. This time delay is a function of the network latency and IPSI processing
time (which is extremely minimal). However, the media server only requires that a response was received
before it needs to send out the next heartbeat message. If the response is not received within one second
(the time interval between heartbeats), a sanity fault is encountered. If three consecutive sanity failures
occur and no other data has been transferred from the IPSI to the server within that time, the server closes
the socket signifying a communication failure and socket recovery (as described previously) begins. If
there are connectivity problems which caused this communication failure, then the IPSI will probably not
receive the TCP socket closure. In this case, the IPSI will have to detect the connectivity problem on its
own.
The IPSI’s perspective is very similar to the server’s perspective. The IPSI expects to receive a heartbeat
from the server once per second and if it does not get one, a sanity failure occurs. If six consecutive sanity
failures occur and no other data has been received from the server during that time, a communication
failure is detected and this causes the IPSI to close the existing TCP socket. Like the previous perspective
above, if there is a connectivity problem, then it is up to the server to self-detect the issue since it probably
will not receive the TCP socket closure from the IPSI.
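The counting rules above can be summarised in a short sketch: heartbeats flow once per second, the server declares a failure after three consecutive sanity failures, the IPSI after six, and any other received data resets the count. The following Python sketch is illustrative only; the class and function names are invented.

# Sketch of the heartbeat sanity-failure logic described above (illustrative only).

class SanityMonitor:
    """Counts consecutive one-second intervals with no heartbeat (and no other data)."""

    def __init__(self, failure_threshold):
        self.failure_threshold = failure_threshold   # 3 on the server side, 6 on the IPSI side
        self.consecutive_misses = 0

    def on_traffic(self):
        """Heartbeat (or any other data) arrived from the peer: reset the count."""
        self.consecutive_misses = 0

    def on_interval_expired(self):
        """Called once per second when nothing was received in that interval.

        Returns True when the peer should be declared failed and the TCP
        socket closed so that socket recovery can begin."""
        self.consecutive_misses += 1
        return self.consecutive_misses >= self.failure_threshold


server_side = SanityMonitor(failure_threshold=3)   # server waiting on heartbeat responses
ipsi_side = SanityMonitor(failure_threshold=6)     # IPSI waiting on server heartbeats

for second in range(1, 4):
    if server_side.on_interval_expired():
        print(f"server closes socket after {second} missed responses")   # fires at second 3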
6.7 Under What Conditions do IPSIs Request Service from ESS Clusters ?
Up to this point in the discussion, how the IPSIs create and manage their priority lists has been covered, but not what those lists are used for. The discussion now shifts gears to examine how IPSIs use these ranked ESS lists to request service when needed. In order to eliminate any ESS contention over controlling an IPSI, the power to select an IPSI's master is given to the IPSI itself. The IPSIs follow a basic rule: if no service is being supplied, request that an ESS cluster take control.
In the Background section, it was shown that there are two different ways that a port network can be
controlled – either directly through a co-resident IPSI or indirectly through an EI board when leveraging an
ATM or a CSS PNC. When a port network is being controlled, an arch-angel has been designated which is
the master of all other angels within the PN. By definition, if an angel has an arch-angel as a master, the
angel is referred to as being scanned. Using this definition, an IPSI is said to be getting service if it has a
master cluster or is being scanned. There are a few implications to this. First and foremost, if an IPSI has a
media server which has declared itself the IPSI’s master and remains connected, it will not request service
from another ESS cluster. This allows the current master server to attempt all other recoveries, as
described in the Reliability - Single Cluster Environments section, without having the port network failover
to another cluster. This is critical as failing over to an ESS server should only occur after all other recovery
methods have been exhausted. For example, the recovery plan to address a failure may consist of IPSI
interchanges or server interchanges and if the IPSI requests service from another cluster it would prevent
these operations. The second implication is that if the port network is being controlled indirectly through
another PN, then the IPSI should not attempt to find its port network another master. One of the recoveries
previously discussed is to have a port network go into fallback mode upon a connectivity failure if possible.
In Figure 25 below, the IPSI in port network #2 cannot communicate with the main cluster and therefore shifts the ESS into the first alternative slot. However, the main cluster gained control of the port network through the PNC. Even though the IPSI in PN #2 does not have a master, it should not request service since the port network is up and functioning.
[Figure, two panels. "Normal Operation": the main cluster (CLID #1) controls EPN #1 and EPN #2 directly over the LAN (EAL #1 and EAL #2), and each IPSI's priority list contains #1 followed by #10. "Fallback Recovery Preferred over ESS Failover": the IPSI in EPN #2 has lost its link to the main cluster, which now controls EPN #2 indirectly through the PNC via the EI boards; IPSI #1's list contains only #1 and IPSI #2's list contains only #10.]
Figure 25 – Fallback Recovery is Preferred to ESS Takeover
Another topic covered in the Reliability – Single Cluster Environments section is the recovery methods
used if an IPSI and server become disconnected. It is stated that if the connectivity is restored within 60
seconds, then the server shall regain control of the port network via a warm restart (which does not affect
stable calls). Therefore, an IPSI gives its previous master some time, referred to as the no service time, to
re-connect after a connectivity failure before requesting service from another ESS cluster. If an IPSI loses
its connection to its master cluster, then it starts a no service timer. If the cluster reconnects within the no
service timer, the IPSI remains under its control. If the no service timer expires and the IPSI is not getting
scanned, then the IPSI requests service from the highest ranked ESS cluster in its list.
Since IPSIs all act independently with respect to requesting service, there are failures that can cause a single system to fragment into smaller autonomous systems. In Figure 26 below, the IPSI in port network #3 requests service from the ESS it can communicate with after the no service timer expires. If the network then heals itself, the main cluster reconnects to PN #3's IPSI. In this case, the IPSI will place the main cluster as the 1st alternative and shift the ESS, which is currently controlling the IPSI, to the 2nd alternative. It is very important to realize that a reconnection of a higher ranked ESS does not imply that the IPSI will shift control over to it. The IPSI in PN #3 will remain under the control of the ESS until
another fault occurs or control is overridden (discussed in the next section). In this situation, every port
network provides service to all of its end users, but does not get all the advantages of being one system
such as availability of centralized resources, simple inter-port network dialing, and efficient resource
sharing. Therefore, there is a trade-off between waiting for a previous controlling cluster to return (during
that time no service is being provided to the port network) and requesting service from another ESS cluster
(which causes system fragmentation). If an outage is only going to last a few minutes, it may be worthwhile to be out of service during that time but remain as one large system when the outage is rectified. Since no crystal ball exists which can be queried when an outage occurs to determine how long the outage will be, the IPSI relies on the no service timer. This no-service timer is configurable by the customer with a range of 3 to 15 minutes. The minimum of three minutes was derived from waiting at least one minute to allow pre-ESS recoveries and then two additional minutes to avoid fragmentation. The decision rule the IPSI applies is sketched below.
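The following minimal Python sketch pulls together the rule described in this section: an IPSI only requests service if it has no master and is not being scanned, and only after the no service timer expires. The function and parameter names are invented for illustration and are not Avaya code.

# Sketch of the IPSI's "when do I ask an ESS for service?" rule (hypothetical names).

NO_SERVICE_TIMER_RANGE = (3 * 60, 15 * 60)   # customer-configurable, 3 to 15 minutes

def getting_service(has_master_cluster, is_being_scanned):
    """An IPSI is considered to be getting service if a cluster has declared
    itself master, or if its angels are being scanned by an arch-angel
    (i.e. the PN is controlled indirectly through the PNC)."""
    return has_master_cluster or is_being_scanned

def next_action(has_master, is_scanned, no_service_timer_expired, priority_list):
    """Decide what the (selected) IPSI should do on each evaluation pass."""
    if getting_service(has_master, is_scanned):
        return "stay put"        # let existing recoveries run their course
    if not no_service_timer_expired:
        return "wait"            # give the previous master time to return
    if priority_list:
        return f"request service from cluster #{priority_list[0]}"   # highest-ranked alternative
    return "wait"                # empty list: nothing to ask


# PN #2 in Figure 25: no master over IP, but the PN is scanned via the PNC -> no request.
print(next_action(has_master=False, is_scanned=True,
                  no_service_timer_expired=True, priority_list=[10]))
# PN #3 in Figure 26: isolated from the main cluster and not scanned -> ask ESS #10.
print(next_action(has_master=False, is_scanned=False,
                  no_service_timer_expired=True, priority_list=[10]))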
[Figure, two panels. "Single System": the main cluster and the ESS connect over the IP LAN to EPN #1, #2, and #3, all controlled by the main cluster. "Fragmented Network – Two Autonomous Systems": the main cluster controls EPN #1 and #2 while the ESS controls the isolated EPN #3.]
Figure 26 – Autonomous Systems Created Upon ESS Takeover
The final piece of the requesting service discussion involves the duplication of IPSIs within a port network.
The previous section covered how the priority lists between the peer IPSIs are guaranteed to be identical,
but the service requests must also be synchronized. The IPSI is a conglomeration of a number of different
components with one being the tone clock. When two IPSIs are within the same PN, one of the IPSI tone
clocks drives the port network and the other IPSI’s tone clock is in a standby mode. The IPSI which has
the active tone clock, otherwise known as the selected IPSI, owns the responsibility of the no service timer
and when the request for service will be sent. Upon a connectivity failure from the master cluster to both
IPSIs in the PN, the selected IPSI starts the no service timer. When the no service timer expires, the
selected IPSI sends out a request for service and also instructs its peer IPSI to do the same via the backplane communication channel. It is important to note, however, that if the master cluster only loses connectivity to the selected IPSI and not the standby IPSI, the controlling server will attempt an IPSI interchange.
interchange, driven completely by the server, causes the previously standby IPSI to become the port
network’s active, or selected, IPSI.
The following figure shows the handshaking that occurs after an IPSI decides to send a request for service
to its highest ranked alternative ESS once the IPSI's no service timer expires. If the handshaking is not successful and the requested ESS does not take over the IPSI, the IPSI will attempt to get service from the next alternative in its priority list.
[Figure, message flow between the IPSI and a media server: the IPSI sends a Service Request; the media server answers with a Control Takeover Message; the IPSI then sends a Notification Update Message (takeover confirmation included), which goes to all media servers currently connected to the IPSI.]
Figure 27 – Request For Service Message Flow
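The handshake in Figure 27, together with the fall-through to the next alternative when a requested ESS does not take over, can be modelled with the short Python sketch below. The message and function names are invented for illustration; only the sequencing follows the description above.

# Sketch of the Figure 27 handshake, with fall-through to the next alternative
# when a requested ESS does not take over (names are invented).

def request_service(priority_list, send_service_request, broadcast_notification):
    """Walk the priority list until some cluster takes control of the IPSI.

    send_service_request(clid) models the Service Request / Control Takeover
    Message exchange and returns True if that cluster took over."""
    for clid in priority_list:
        if send_service_request(clid):
            # Takeover confirmed: tell every connected media server about the new master.
            broadcast_notification(f"cluster #{clid} has taken control")
            return clid
    return None   # nobody answered; keep waiting and retry later


# Example: ESS #10 does not respond, so the IPSI falls through to ESS #20.
took_over = request_service(
    priority_list=[10, 20],
    send_service_request=lambda clid: clid == 20,
    broadcast_notification=print,
)
print("new master:", took_over)   # new master: 20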
6.8 How Does Overriding IPSI Service Requests Work ?
When catastrophic failure situations occur, ESS servers will assume control of all port network gateways
which cannot be given service by the main servers. As shown, this can occur for two reasons – main server
goes out-of-service or the network prevents communication from the IPSI to the server. Eventually, the
failure, of either type, will be fixed and the conditions whereby the IPSI shifts control over to the ESS
server no longer exist. However, since shifting control between clusters is not connection preserving for all
types of calls, the IPSI does not automatically go back to the main server. As covered in the last section,
the IPSI does not request service unless it does not have a current master.
Hence, unless there is an override to the master selection process, the port networks which are being
controlled by ESS servers will never return to be part of the main system again until another fault occurs.
The ESS offering introduced a new command into the already large suite supported by CM to address this
situation. The command “get ipserver-interface forced-takeover <PN # / all>” can be issued from the main
cluster in order to have the IPSI revert control back to it. This command must be used with caution because it causes a service outage on the port network that is being forced back under the main cluster's control. The IPSI, upon
reception of the unsolicited takeover message (all other takeovers are in response to a service request) from
the main cluster, informs its current master cluster it is shifting control back to the main cluster. Along
with the manual agglomeration command introduced here, there are methods to put the system back
together on a scheduled basis. If a fault occurs that requires ESS servers to assume control of some or all
of the port networks, the system can be administered to converge back to the main cluster at a specified
time. However, unlike the manual command, all the port networks under control of ESS servers are shifted
over, and therefore reset, at the same time. There is never a good time to have an outage, but
controlling when it occurs is much better than the alternative – unplanned service outages.
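The distinction between a solicited takeover (in response to a Service Request) and the unsolicited, forced takeover can be sketched as follows. This is an illustrative Python fragment with invented function and parameter names, not Avaya code.

# Sketch of solicited vs. unsolicited takeover handling on the IPSI (invented names).

def on_control_takeover(is_response_to_our_request, current_master, notify):
    """Solicited takeovers arrive in answer to a Service Request the IPSI sent;
    an unsolicited one (the forced-takeover case) means the main cluster is
    reclaiming the PN, so the IPSI tells its current master before shifting."""
    if not is_response_to_our_request and current_master is not None:
        notify(current_master, "control is shifting back to the main cluster")
    return "accept takeover (the port network will be reset)"


# Forced takeover while ESS #10 is the current master:
print(on_control_takeover(False, "ESS #10",
                          lambda master, msg: print(f"to {master}: {msg}")))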
Section 7 - ESS in Control
This section is broken into two topics – how existing calls are affected by failing over to ESS control and
what the operational consequences are if a system becomes fragmented. ESS servers offer service to the resources they assume control of that is equivalent (no feature debt and no performance degradation) to what the main servers provided. However, there are many conditions under which ESS servers take control of only some of the port networks in the system, and that carries a number of caveats. These topics include how
dialing patterns may change in fragmented situations and what happens to centralized resources such as
voicemail and other adjuncts such as CMS and CDR.
7.1 What Happens to Non-IP Phone Calls During Failovers ?
If a failure occurs which requires an ESS to assume control of a port network, calls supported by that PN
are affected drastically. As previously discussed, when the control of a PN shifts between entities that do
not share call state data, the control transfer process requires a cold restart of the PN. This cold restart
causes all TDM resource allocation to be reset and therefore any calls using those resources to be torn
down. Since all analog phones, digital phones, and PRI trunks always use their associated port network’s
TDM bus for a call, the calls they are members of are reset during this failover. IP telephony does not
follow the same paradigm and is discussed in the next section. The following timeline shows the effects on
the calls when recoveries take place.
[Figure, timeline following the loss of the control link: during the first 60 seconds, if the previous master returns, the EPN undergoes a warm restart recovery and all calls within the EPN are preserved; after 60 seconds but before the no service timer expires, a returning master causes an EPN cold restart recovery and all calls within the EPN are dropped; when the no service timer expires, the ESS takes over, which also causes an EPN cold restart recovery and all calls within the EPN are dropped.]
Figure 28 – Call Effects Based on Different Recovery Mechanisms
It is very important to understand what “calls are preserved” means. If a call is preserved, that means the
bearer connection and the call control are both unaffected when the recovery is complete. The next section,
which deals with IP telephony, relies on the concept of “connection preservation.” If only the connection is
preserved, then the bearer connection is unaffected, but the call control is lost. For example, if a user is on
a call and a call preserving recovery takes place, there is no effect on the end user. However, if a user is on
a call and a connection preserving recovery occurs, the communication pathway is undisturbed, but all call
control is lost (e.g. the user will be unable to transfer or conference the call in the future).
Another question that needs to be answered is what happens to the calls while waiting for a recovery.
Stated differently, if a port network fails over to an ESS server after five minutes, what happened to the
existing calls during that time? The answer is that the bearer connection continues to persist until the port
network is restarted, but with zero call control (cannot put the call on hold, retrieve another call appearance,
or even hang-up). This comes from the fact that if a PN loses its control link it preserves the current
resource allocations. The following figure shows five phone calls which are currently active. The phones
in this example are non-IP endpoints such as digital or analog.
[Figure: four port networks (EPN #1 through EPN #4) interconnected by a PNC, two PSTN trunk connections, and five active non-IP phone calls (#1 through #5) spread across the port networks.]
Figure 29 – Effects on Non-IP Phone Calls during Failover
Suppose the two port networks on the right, PN #3 and PN #4, encounter some condition which causes them to fail over to an ESS server. Calls #1 and #2 are not affected by this failover since their supporting port networks do not have a control shift and they are not using any resources on PN #3 or PN #4. Call #3 drops because it is not completely independent of PN #3 (the controlling server of PN #2 tears down call #3 once it loses control of some of the resources involved with the call). Calls #4 and #5 also
drop because the ESS taking over performs an EPN cold restart on port networks #3 and #4.
7.2 What Happens to IP Phone Calls During Failovers ?
A traditional TDM phone (e.g. analog, digital, or BRI) is hardwired to a circuit pack which supports it.
Through this wire, the phone receives signaling (e.g. when to ring or what to display). In addition, the wire
is also used as the medium for bearer traffic. An H.323 IP phone does not have a single dedicated pathway
for both of these functions.
[Figure: a digital/analog phone is wired to a digital line card, which carries both signaling and bearer; two H.323 IP phones receive signaling through the CLAN over the IP network, with bearer for call #1 flowing through the MEDPRO to the digital phone and bearer for call #2 flowing directly between the two IP phones.]
Figure 30 – Control and Bearer Paths for IP and TDM Phones
The H.323 IP phone and the CLAN establish a logical connection to be used for call control signaling. The
MEDPRO and the IP phone exchange RTP streams when certain types of bearer paths are required. For
instance, when an IP phone dials the digital phone’s extension (call #1 in Figure 30), all of the signaling
goes through the CLAN which is the gatekeeper for the SPE controlling the port network. The actual
bearer communication pathway goes from the phone over IP to the MEDPRO, which is a conduit onto the
TDM bus and then to the digital phone. If the IP phone called another IP phone (call #2 in Figure 30), the
signaling would still go through the CLAN, but the bearer path will be directed at the terminating IP phone
rather than the MEDPRO.
Unlike a digital phone, which becomes inoperable if the circuit pack supporting it fails, IP phones have the ability to shift where they get control from. In other words, if an IP phone fails to communicate with its current gatekeeper, the CLAN in this case, it attempts to get service from another one within the system. While the IP phone seeks out an alternate gatekeeper, it keeps up its current bearer path
connection. If the IP phone attempts to register with a CLAN that is being controlled by the same entity
which is controlling its previous CLAN gatekeeper (shown on the left side of Figure 31), a control link is
immediately established and the call is preserved. The IP phone continues to search through gatekeepers until this condition is met. If it cannot meet it, the IP phone will keep the existing connection up until the end user decides to terminate the call (e.g. hang up). After the end user hangs up, the IP phone will register with a gatekeeper without the condition just discussed. This decision is sketched below.
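A minimal Python sketch of the phone's choice, under the behaviour just described: re-register immediately with a CLAN under the same controlling entity (call preserving), otherwise keep the bearer up and defer re-registration until the user hangs up (connection preserving). All names are invented for illustration.

# Sketch of the H.323 phone's gatekeeper-recovery choice (illustrative, invented names).

def recover_gatekeeper(candidate_clans, previous_controller, call_in_progress):
    """candidate_clans: iterable of (clan_name, controlling_cluster) pairs.

    Returns a short description of what the phone does."""
    for clan, controller in candidate_clans:
        if controller == previous_controller:
            # Same controlling entity knows about the call: re-register now, call preserved.
            return f"register with {clan} immediately (call preserving)"
    if call_in_progress:
        # No suitable gatekeeper: keep the RTP bearer up, re-register after the user hangs up.
        return "keep bearer path up; register after hang-up (connection preserving)"
    return "register with the first reachable gatekeeper"


# Left side of Figure 31: the alternate CLAN is controlled by the same server.
print(recover_gatekeeper([("CLAN in EPN #2", "main")], "main", call_in_progress=True))
# Right side of Figure 31: only a CLAN under a different cluster is reachable.
print(recover_gatekeeper([("CLAN in EPN #2", "ESS #10")], "main", call_in_progress=True))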
[Figure, two panels. "Call Preserving": EPN #1 and EPN #2 are controlled by the same server, so when the IP phone's control link to the CLAN in EPN #1 fails, a new control link is established during the call through the CLAN in EPN #2. "Connection Preserving": the EPNs are controlled by different servers (main and ESS), so the new control link is established only after the call is ended.]
Figure 31 – H.323 IP Phone Link Bounce
That being said, if an IP phone is not using any port network resources for the bearer portion of its call, the connection is preserved through a failover. If the port network supporting the three calls in Figure 32 below fails over to an ESS, only calls #1 and #2 are dropped since PN resources are being used for their bearer paths. Call #3 has two IP phones communicating directly with each other (a shuffled call) and is connection preserving through the failover.
[Figure: call #1 connects an H.323 IP phone to a digital/analog phone through the MEDPRO and digital line card, call #2 connects an H.323 IP phone to the PSTN through the MEDPRO and DS-1 card, and call #3 is a shuffled call directly between two H.323 IP phones over the IP network.]
Figure 32 – Effects on IP Phone Calls during Failover
7.3 What Happens to H.248 Gateway Calls During Failovers ?
Up to this point in the paper, the only gateways that have been discussed are port networks. There is another series of gateways that the communication system supports – H.248 gateways (G700, G350, and G250). These gateways, which support digital and analog phones, PRI trunks, and VoIP resources, are controlled by the SPE indirectly through CLAN gatekeepers, as H.323 IP phones are, and also share the ability to seek out alternate gatekeepers. Therefore, since ESS servers provide survivability to the CLAN gatekeepers, they also provide survivability for H.248 gateways. This section covers the effects on calls supported by H.248 gateways as they fail over to the control of an ESS. The ability for H.248 gateways to also utilize Local Spare Processors (LSP) as alternate sources of control is discussed in the LSP/ESS Interaction section of this paper.
[Figure: an H.248 gateway with several attached digital/analog phones connects over IP to the CLAN in EPN #1 for signaling from the SPE; EPN #1 also contains an IPSI, a MEDPRO, a digital line card, and a PSTN trunk. Call #1 uses the EPN's TDM resources, while calls #2 and #3 use only the gateway's own resources for bearer.]
Figure 33 – H.248 Gateway Integrated into the System
The figure above shows how H.248 gateways integrate into the communication system. As with H.323 IP
phones, the H.248 gateway receives its control from the SPE via the CLAN gatekeeper. If the gateway
loses its control link to the CLAN, it seeks out other possible gatekeepers that could provide it service. The
gateway has a pre-programmed fixed failover list of alternate gatekeepers if a failure occurs. The gateway
registers and brings up a control link to the first alternate gatekeeper that it can contact. If the gateway
registers with a CLAN that is controlled by the same SPE which controlled its previous CLAN, then all the
calls on the gateway are preserved (left side of Figure 34 below). If the gateway registers with a CLAN
that has a master which is unaware of its calls, then the failover is connection preserving only (right side of
Figure 34 below).
[Figure, two panels. "Call Preserving": EPN #1 and EPN #2 are controlled by the same server, so when the H.248 gateway's control link to the CLAN in EPN #1 fails, a new control link is established immediately through the CLAN in EPN #2 and the calls are preserved. "Connection Preserving": the EPNs are controlled by different servers; the new control link is still established immediately, but the new master is unaware of the existing calls.]
Figure 34 – H.248 Gateway Control Link Bounce
Figure 33 shows how H.248 gateways integrate into the system and some example calls. If the port network
containing the CLAN gatekeeper fails over to an ESS server, the existing calls are affected. Call #1 would
be dropped because the bearer path of the call traverses the TDM bus of the port network. However, calls
#2 and #3 only use the H.248 gateway resources for the bearer path and therefore do not lose their
connections over the failover to an ESS.
7.4 How are Call Flows altered when the System is Fragmented ?
[Figure: EPN #1 and EPN #2 interconnected by a PNC, four digital phones (Phone #1 through Phone #4), two PSTN trunk connections, and five example calls (#1 through #5).]
Figure 35 – Phone Calls on Non-Faulted System
The system laid out above has four digital phones with extensions 2001, 2002, 2003, and 2004 respectively
and two PSTN trunks. In addition, suppose there is a route pattern which attempts trunk #1 and then trunk
#2. The following table shows the example calls and the steps involved to establish them.
45
Phone Call                        Establishment Steps
#1: Phone #1 → Phone #2           1. Phone #1 dials "2002".
                                  2. SPE identifies "2002" as phone #2's extension.
                                  3. SPE is in control of phone #2 and knows it is in service.
                                  4. SPE rings phone #2 and the user answers.
                                  5. SPE connects a bearer path between the phones.
#2: Phone #3 → Phone #4           Same as previous call.
#3: Phone #2 → Phone #3           Same as previous call.
#4: Phone #1 → External Number    1. Phone #1 dials the external number.
                                  2. SPE determines the route pattern for the dialed number.
                                  3. SPE gets the 1st option of the route pattern (trunk #1).
                                  4. SPE is in control of trunk #1 and knows it is in service.
                                  5. SPE sends the call out to the PSTN through trunk #1.
#5: Phone #4 → External Number    Same as previous call.
Table 28 – Phone Call Establishment Steps (Non-Faulted)
Some of the five call flows just reviewed change drastically if the system becomes fragmented. The
following table examines the steps taken during the same call examples when port network #2 has failed
over to an ESS server. It is very important to note that these flow changes only occur because the system is
fragmented. If the entire system (PN #1 and PN #2) failed over to ESS control, then the call flows would
remain unchanged.
[Figure: the same system with EPN #2 fragmented away from EPN #1; calls #1, #2, #4, and #5 are shown, while call #3 does not succeed.]
Figure 36 – Phone Calls on Fragmented System
Phone Call                        Establishment Steps
#1: Phone #1 → Phone #2           Identical to non-faulted call flow.
#2: Phone #3 → Phone #4           Identical to non-faulted call flow.
#3: Phone #2 → Phone #3           1. Phone #2 dials "2003".
                                  2. SPE identifies "2003" as phone #3's extension.
                                  3. SPE does not control phone #3 and therefore assumes it is out-of-service.
                                  4. SPE sends the call to phone #3's defined coverage path.
                                  5. If an in-service coverage path termination point is found, the call is routed there; otherwise, re-order tone is given to the originator (phone #2). (See below how phone #1 can successfully call phone #3 in a fragmented system.)
#4: Phone #1 → External Number    Identical to non-faulted call flow.
#5: Phone #4 → External Number    1. Phone #4 dials the external number.
                                  2. SPE determines the route pattern for the dialed number.
                                  3. SPE gets the 1st option of the route pattern (trunk #1).
                                  4. SPE does not control trunk #1 and therefore assumes it is out-of-service.
                                  5. SPE then gets the 2nd option of the route pattern (trunk #2).
                                  6. SPE is in control of trunk #2 and knows it is in service.
                                  7. SPE sends the call out to the PSTN through trunk #2.
Table 29 – Phone Call Establishment Steps (Fragmented System)
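Both behavioural changes in Table 29 come down to the same check: a station or trunk that the controlling SPE does not own is treated as out of service. The following minimal Python sketch illustrates that logic; the data structures and names are invented for this paper and are not Avaya routing code.

# Sketch of the two routing decisions that change in a fragmented system (Table 29).

def call_extension(dialed, stations_i_control, coverage_path):
    """Station call: if this SPE does not control the station, it assumes the
    station is out of service and sends the call to coverage (or re-order)."""
    if dialed in stations_i_control:
        return f"ring {dialed}"
    return coverage_path.get(dialed, "re-order tone to originator")

def route_external(route_pattern, trunks_i_control):
    """Trunk call: walk the route pattern and skip trunks this SPE does not control."""
    for trunk in route_pattern:
        if trunk in trunks_i_control:
            return f"send call to the PSTN over {trunk}"
    return "no trunk available; call fails"


# Main-server fragment (EPN #1 only): controls phones 2001/2002 and trunk #1.
print(call_extension("2003", {"2001", "2002"}, {"2003": "route to 2003's coverage point"}))
# ESS fragment (EPN #2 only): controls trunk #2, so the route pattern falls through to it.
print(route_external(["trunk #1", "trunk #2"], {"trunk #2"}))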
If a user desires to call a phone which is under the control of a portion of the system that has been fragmented away, then the user will have to dial differently. Assuming that DID numbers map to trunks
local to the user, if the originating phone can dial the terminating phone’s extension as an external number,
the SPE processes it as such and routes it out through a PSTN trunk under its control. The PSTN would
then route the incoming call to the port network which is fragmented away. The ESS SPE would then route
the incoming PSTN call to the appropriate extension. Using the previous example, call #3 with phone #1
dialing phone #3’s extension as an external number would succeed as follows:
[Figure: phone #1 on EPN #1 places an outgoing PSTN call over its local trunk; the PSTN routes the call to the trunk on the fragmented EPN #2, which delivers it to phone #3 (call #3 completed over the PSTN rather than the PNC).]
Figure 37 – Extension to Extension Call Routed over PSTN
In addition, when a system enters into a fragmented situation, different call routing may be desired. In the How Does ESS Work section, it was discussed that changes can be made to an ESS server's translations, but any changes will be lost when the ESS resets. This behavior can be leveraged to have ESS systems utilize resources differently than the main server does without affecting the permanent system's translation set. For example, if the disaster plan requires that all incoming calls to the part of the system which is fragmented away get rerouted to a recorded announcement, administration changes can be made on the ESS system to accomplish this. However, these changes are not reflected on the main server, and therefore once the port network returns to the main server, the calls are handled as normal. This manual technique can be used if a system is going to stay in a fragmented state for an extended period of time.
7.5 What Happens to Centralized Resources when a System is Fragmented ?
A big advantage to collapsing autonomous systems together is the more efficient usage of resources such as
trunks, tone detectors, and announcements. Another advantage is the centralization of adjuncts. For
example, rather than having two distinct CDR systems serving two distinct systems, one centralized CDR
can serve one larger PBX switch. However, the cost benefit of doing this must be weighed against the risk
of being isolated away from the resource in failure situations. Suppose two separate systems (shown on the left side of Figure 38) are collapsed together and cost savings are realized by the elimination of two PSTN trunks (shown on the right side of Figure 38). The danger of eliminating the trunks in this way (two from site #2 rather than one from each) is that, upon a network fragmentation failure which causes the ESS server to take over EPN #2 as its own stand-alone system, the users under the ESS server's control are functional but have no trunk access.
[Figure, two panels. "Separate Systems": site #1 and site #2 each have their own main server, EPN, and two PSTN trunks. "Collapsed Systems": a single main server and an ESS serve EPN #1 and EPN #2, with the two remaining PSTN trunks both terminating on EPN #1.]
Figure 38 – Two Systems Collapsed Together
Many adjuncts, such as CMS and CDR, interface with the communication system over IP through a CLAN gatekeeper. These adjuncts, unlike H.323 IP phones and H.248 gateways, do not have the ability to find an alternate gatekeeper if a problem arises. For that reason, the adjuncts are tightly associated with a CLAN, or in other words, these adjuncts are bound to the port network. The following diagram shows an adjunct interfacing with the system. As long as the link is up, the SPE continuously transmits pertinent data over the adjunct link. The service state of the link is inherently tied to the state of the port network and the state of the CLAN. To mitigate these shortcomings, the resiliency of the communication between adjuncts and the SPE is increased by having multiple communication links – a primary link and an optional secondary link. The SPE does not treat them as an active/standby pair, but rather as both simultaneously functioning in an active mode (the relevant data is transmitted on both links concurrently). In some cases, both links may terminate on the same physical adjunct (shown below) and in others, the links may terminate on different physical adjuncts.
[Figure, two panels. "Primary Adjunct Link": an adjunct connects over IP through a single CLAN. "Primary & Secondary Adjunct Links": the adjunct connects over IP through CLANs in two different port networks, giving it a primary and a secondary link.]
Figure 39 – Adjuncts Interfacing over IP with PBX
If a catastrophic server failure takes place and the entire system is taken over by an ESS cluster, the ESS's SPE re-establishes the adjunct links and sends the appropriate data. The ESS SPE gains control of an adjunct link once it gains control of the port network with which that link is associated. The interesting cases are when the system fragments. Assume a network failure which causes PN #2 and PN #3 to fail over to the ESS server (left side of Figure 40 below). The main server continues to supply service to PN #1 and therefore still communicates over the adjunct link. The ESS server, on the other hand, controls PN #2, which has the other adjunct link, and therefore transmits the appropriate data over that link. In this situation, both fragments are still interfacing with the adjunct.
[Figure, two panels. "ESS Fragment Interfaces with Adjunct": PN #2 and PN #3 have failed over to the ESS, the main server still controls PN #1 and its adjunct link, and the ESS controls PN #2's adjunct link, so both fragments reach the adjunct. "ESS Fragment Does Not Interface with Adjunct": only EPN #3 is isolated under the ESS, so the main server retains both the primary and secondary adjunct links.]
Figure 40 – Adjuncts Working in Fragmented Situations
The right side of the figure above shows a different network fragmentation whereby EPN #3 is isolated
away from the rest of the system. In this case, the main server still owns both the primary and secondary
links and continues to pump relevant data to the adjuncts over them. The autonomous system that was created, made up of EPN #3 and the ESS server, is unable to interface with or utilize the adjuncts to which it no longer has access. For example, suppose the adjunct links in the example were CDR links. During this fragmented operation, all calls on the main server's portion of the switch would be sent over the primary and secondary CDR links. Unfortunately, the fragmented portion of the system would not have any medium on which to output the call data, and after the SPE's internal CDR buffers filled up, the data would be lost.
7.6 What Happens to Voicemail Access when a System is Fragmented ?
[Figure: a voicemail system connects over trunks to EPN #1; an incoming PSTN call to phone #1 (on EPN #2) goes unanswered and is rerouted over the PNC to the voicemail system.]
Figure 41 – Call Rerouted to Voicemail
The figure above shows a voicemail system interfaced to the PBX over trunks connected to PN #1. Any calls that attempt to terminate to phone #1, but go unanswered, are rerouted to phone #1's coverage point, which is usually voicemail. The SPE sends the call to the voicemail system and informs it who the original call is for, who is calling, and the reason the call is being rerouted to the voicemail system (e.g. busy or ring/no answer). With this information the voicemail system can place the call into phone #1's mailbox and play the appropriate greeting. After a message is left for the phone, the voicemail system instructs the SPE to turn on phone #1's message waiting indicator (MWI). If phone #1 dials the voicemail
system to get its messages, the voicemail system receives the appropriate information to identify phone #1
and place it in its mailbox. If this information about the calling and called parties is not supplied, the
voicemail system provides a generic greeting and allows the calling party to manually identify itself for
message retrieval or whom they are calling in order to leave a message.
Suppose a failure occurs which causes the system to fragment in a way that PN #2 falls under the control of
an ESS server. In this situation, a call needing to go to phone #1’s coverage point will need an alternate
route because the voicemail system is no longer directly accessible from the fragmented portion of the
switch. The SPE on the ESS server is intelligent enough to reroute the call over the PSTN to get to the
voicemail system. In this rerouted call through the PSTN the ESS SPE provides all the information needed
to allow the call to enter the correct mailbox. The same procedure is followed when phone #1 desires to
check its messages and the voicemail system puts phone #1 into the appropriate mailbox. The only drawbacks to operating in this environment are that all calls going to voicemail coverage on the fragmented site traverse the PSTN, and that the message waiting indicators on the phones not on the same part of the system as the voicemail are non-functional (no indication of new messages).
[Figure: with EPN #2 fragmented away under ESS control, the unanswered call to phone #1 is rerouted to the voicemail system over the PSTN instead of over the PNC.]
Figure 41 – Call Rerouted to Voicemail Over PSTN
Section 8 - ESS Variants Based on System Configuration
The ESS offering ensures port networks are able to receive control from alternate sources upon a
catastrophic failure of the main servers or lack of network connectivity to them for an extended period of
time. Up to this point in the paper, recovery descriptions and explanations have been done independently
of the system’s configuration. Unfortunately, there are subtle behavioral differences in port network
operation depending on the type of PNC to which the PN is connected. Therefore, this section simulates
both types of failure scenarios for each type of PNC configuration available and discusses the slight
variations in operational behavior that occur.
8.1 Basic IP PNC Configured System
A port network can be controlled in two different manners – either directly through a resident IPSI or
indirectly from a non-resident IPSI tunneled through the PNC. As the Background section covered, an IP
PNC does not support tunneling of control links, so every port network using an IP PNC must contain an
IPSI in order to be operational. In addition, the point of an IP PNC is to transmit bearer communication
between port networks over an IP network. The figure below shows a system which is leveraging an IP PNC.
[Figure: the main and ESS clusters connect over the control LAN to IPSIs in EPN #1 through EPN #6 (control links EAL #1 through EAL #6); each EPN contains a MEDPRO, and inter-PN bearer traffic flows over the IP PNC.]
Figure 42 – Non-faulted System with IP PNC
In a non-faulted situation, every PN in the system is receiving control from the main server directly over IP
(no tunneled control). In addition, every IPSI has created a priority list consisting of the main cluster
followed by the ESS cluster.
8.2 Catastrophic Server Failure in an IP PNC Environment
The simplest scenario to describe is the failover of an entire system (every port network) to an ESS cluster
that is configured to have its port networks use an IP PNC. The figure below shows a catastrophic failure
leaving the main cluster inoperable and all the IPSIs failing over to the ESS cluster.
[Figure: the main cluster is inoperable; the ESS cluster now holds control links EAL #1 through EAL #6 to the IPSIs in EPN #1 through EPN #6, and inter-PN bearer traffic continues to flow over the IP PNC.]
Figure 43 – Catastrophic Server Failure in an IP PNC Environment
The following list discusses the steps that occur when the non-faulted system (Figure 42) encounters a
server failure causing all the system’s port networks to transition over to an ESS (Figure 43).
1. An event renders the main cluster inoperable.
2. The socket connections between the main cluster and the IPSIs all fail.
3. Each IPSI independently detects the loss of its control link.
4. Each IPSI removes the main cluster from its priority list and moves the ESS cluster into the 1st alternative slot.
5. Each IPSI starts its own no service timer.
6. When each IPSI's no service timer expires, it requests service from its 1st alternative (the ESS).
7. The ESS acknowledges each IPSI's service request by taking control of it.
8. Every port network is then brought back into service via a cold restart.
After the failover is completed, the system operates exactly the same as before the failure since one
autonomous system image is preserved (a non-fragmented system), the IP PNC is still used as the
connectivity medium for inter-port network bearer traffic, and the ESS servers have no feature debt,
identical RTUs, and no performance compromises as compared to the main server. If the main server
comes back into service, it will inform all the IPSIs that it is available to supply them service if needed, but
the IPSIs will remain under the control of the ESS until the system administrator decides to transition them
back.
8.3 Network Fragmentation Failure in an IP PNC Environment
The next scenario examined is the effect on the system if an event occurs that fragments the control network, as shown in the figure below, causing part of the system to fail over to an ESS server.
[Figure: the control network is fragmented; the main cluster retains control links (EALs) to the IPSIs in EPN #1, #2, and #3, while the ESS cluster holds control links to the IPSIs in EPN #4, #5, and #6; inter-PN bearer traffic still flows over the IP PNC.]
Figure 44 – Network Fragmentation in an IP PNC Environment
The following list discusses the steps that occur during the transition of a non-faulted system (Figure 42)
into a fragmented system (Figure 44).
1. An event occurs that fragments the control network, preventing communication between the main server and some port networks.
2. The socket connections between the main cluster and port networks #4, #5, and #6 all fail.
3. The socket connections between the ESS cluster and port networks #1, #2, and #3 all fail.
4. Each IPSI in PN #4, #5, and #6 independently detects the loss of the main server's control link.
5. Each IPSI in PN #1, #2, and #3 independently detects the loss of the ESS's keep-alive link.
6. Each IPSI in PN #4, #5, and #6 removes the main cluster from its priority list and moves the ESS cluster into the 1st alternative slot.
7. Each IPSI in PN #1, #2, and #3 removes the ESS cluster from its priority list, leaving it with no alternatives to the main cluster.
8. Each IPSI in PN #4, #5, and #6 starts its own no service timer.
9. When each IPSI's no service timer expires, it requests service from its 1st alternative (the ESS).
10. The ESS acknowledges each IPSI's service request by taking control of it.
11. Port networks #4, #5, and #6 are brought back into service via a cold restart.
Even though the same ESS server is being used to provide survivability as in the previous example, the end
users do not receive 100% equivalent service. The drawbacks to running in a fragmented mode were
discussed in the ESS in Control section. The important things to note are that every port network is
providing service to its end users and that both fragments still use the IP network for bearer traffic between
the PNs. In the following scenarios, this is not always the case. If the network fragmentation is mended,
the main server will inform port networks #4, #5, and #6 that it is available to supply them service if
needed, but the IPSIs will remain under the control of the ESS until the system administrator decides to
transition them back. If the network experiences the same problem again (e.g. a flapping network), the IPSIs lose contact with the main server, but since they are not being controlled by it, there are no adverse effects on the end users.
8.4 Basic ATM PNC Configured System
Unlike the IP PNC discussed in the previous section, the ATM PNC supports tunneled control links and
therefore not every port network is required to have an IPSI. In order for the SPE to control a non-IPSI
connected PN, it must select an IPSI in another port network to indirectly support it. The SPE prefers, but in no way guarantees, that a non-IPSI connected PN be controlled through an IPSI in its same community. The following diagram, showing a non-faulted ATM PNC configured system, has PN #1
getting service through PN #2 and PN #6 getting service through PN #5. The ATM PNC in this case is
being used as the inter-port network bearer medium and also as a conduit for control links to non-IPSI
connected PNs. In addition, every IPSI has created a priority list consisting of the main cluster followed by
the ESS cluster.
[Figure: the main and ESS clusters connect over the control LAN to the IPSIs in EPN #2 through EPN #5; EPN #1 and EPN #6 have no IPSIs and receive tunneled control links (EAL #1 and EAL #6) through the ATM PNC via the EI boards, while the ATM PNC also carries inter-PN bearer traffic.]
Figure 45 – Non-faulted System with ATM PNC
Be aware however, that the stability of PN #6 is inherently tied to PN #5 and the health of the ATM PNC.
Therefore a reset of PN #5 leads to a reset of PN #6. In addition, the SPE must get port network #5 into
service before attempting to tunnel control through it for PN #6. The implication of this is that upon a
system cold restart, PN #5 becomes operational before PN #6 does. Also, if there is some malfunction of
the ATM PNC whereby the communication between PN #6 is severed from the rest of the port networks,
PN #6 goes out of service since, without an IPSI, there is no way to get a control link to it from the SPE. In
summary, an IPSI connected port network's stability does not rely on the stability of other PNs; it comes into service faster than non-IPSI connected PNs after restarts and is not completely reliant on the ATM PNC for control.
8.5 Catastrophic Server Failure in an ATM PNC Environment
The next scenario described is the failover of an entire system (every port network) to an ESS cluster that is
configured to have its port networks use an ATM PNC. The figure below shows a catastrophic failure
leaving the main cluster inoperable and all the IPSIs failing over to the ESS cluster.
[Figure: the main cluster is inoperable; the ESS cluster holds control links EAL #2 through EAL #5 to the IPSIs in EPN #2 through EPN #5 and tunnels EAL #1 and EAL #6 through the ATM PNC to the non-IPSI connected EPN #1 and EPN #6; the ATM PNC continues to carry inter-PN bearer traffic.]
Figure 46 – Catastrophic Server Failure in an ATM PNC Environment
The following list discusses the steps that occur when the non-faulted system (Figure 45) encounters a
server failure causing all the system’s port networks to transition over to an ESS (Figure 46).
1. An event renders the main cluster inoperable.
2. The socket connections between the main cluster and the IPSIs all fail.
3. Each IPSI independently detects the loss of its control link.
4. Each IPSI removes the main cluster from its priority list and moves the ESS cluster into the 1st alternative slot.
5. Each IPSI starts its own no service timer.
6. When each IPSI's no service timer expires, it requests service from its 1st alternative (the ESS).
7. The ESS acknowledges each IPSI's service request by taking control of it.
8. Every IPSI connected port network is then brought back into service via a cold restart.
9. The SPE tunnels control links to every non-IPSI connected port network and brings them into service via a cold restart.
After the failover is completed, the system operates exactly the same as before the failure since one
autonomous system image is preserved (a non-fragmented system), the ATM PNC is still used as the
connectivity medium for inter-port network bearer traffic, and the ESS servers have no feature debt,
identical RTUs, and no performance compromises as compared to the main server. If the main server
comes back into service, it will inform all the IPSIs that it is available to supply them service if needed, but
the IPSIs will remain under the control of the ESS until the system administrator decides to transition them
back.
8.6 Network Fragmentation Failure in an ATM PNC Environment
This next scenario, network fragmentation failure, has two variations that need to be addressed. The first
and much more common variation is a failure condition that only causes the control network to fragment
while leaving the ATM PNC intact. The second variation is a failure condition that causes both the control
network and the ATM PNC to fragment simultaneously. This can occur if the control network and the ATM PNC network are one and the same, or courtesy of the infamous backhoe. The following figure shows the first variation, whereby only the control network fragments, and how the system restores control to the affected port networks.
[Figure: the control network is fragmented; the main cluster retains direct control of the IPSIs in EPN #2 and #3 and tunnels control links through the ATM PNC (via the EI boards) to EPN #1, #4, #5, and #6, so no port network fails over to the ESS.]
Figure 47 – Control Network Only Fragmentation in an ATM PNC Environment
The following list discusses the steps that occur when the non-faulted system (Figure 45) encounters a
control network fragmentation causing some of the system’s port networks to go into fallback control
(Figure 47). It is important to note that this type of failure is resolved with system capabilities which
existed prior to the ESS offering. This scenario is being reviewed to show how ESS allows other recovery
mechanisms, which usually have a less dramatic effect on end users, to attempt to resolve issues before the
ESS jumps in to provide service.
1. An event occurs that fragments the control network, preventing communication between the main server and some port networks.
2. The socket connections between the main cluster and port networks #4 and #5 both fail.
3. The socket connections between the ESS cluster and port networks #2 and #3 both fail.
4. The main SPE detects the loss of connectivity to the IPSIs in PN #4 and #5.
5. The IPSIs in PN #4 and #5 independently detect the loss of the main server's control link.
6. The IPSIs in PN #2 and #3 independently detect the loss of the ESS's keep-alive link.
7. The IPSIs in PN #4 and #5 remove the main cluster from their priority lists and move the ESS cluster into the 1st alternative slot.
8. The IPSIs in PN #2 and #3 remove the ESS cluster from their priority lists, leaving them with no alternatives to the main cluster.
9. The IPSIs in PN #4 and #5 each start their own no service timers.
10. The main SPE takes control of PN #4, #5, and #6 indirectly over the ATM PNC through the IPSIs in PN #2 and #3.
11. The IPSIs in PN #4 and #5 detect that someone is controlling their associated port networks and cancel their no service timers.
Since the port networks on the right side of the fragmentation are taken over by the main SPE through a fallback recovery (discussed in the Reliability - Single Cluster Environments section), the port networks are brought back into service via a warm restart, implying that no stable calls are affected. In addition, the IPSIs in
PN #4 and #5 do not request service from the ESS server while the main server is controlling their port
networks through the ATM PNC. If the tunneled link has a failure, then the IPSI will begin its no service
timer once again and the process starts over at step #9 above. Once the network fragmentation is fixed, the
main SPE will contact the IPSIs in PN #4 and #5 and immediately take control of them since the SPE
knows it is already controlling (albeit indirectly) their associated port networks. After the SPE deems the
connectivity between itself and the IPSIs is stable, it transitions control links from being tunneled through
the PNC to the IPSI directly (control link fall-up).
Up to this point in the paper, the ATM PNC has been abstractly represented as a simple ATM switched
network cloud. In reality, however, the cloud may represent a single ATM switch or a number of ATM
switches integrated together to form the ATM PNC network. The second fragmentation variation, whereby
the control network and the ATM PNC both fragment concurrently, needs to be looked at in both
frameworks. Figure 48 below shows a system that has a fragmented control network and a fragmented
ATM PNC. Also shown in the figure are the ATM EI boards interfacing with a local ATM switch. The
combination of both of these interconnected ATM switches creates the ATM PNC infrastructure.
[Figure: both the control network and the ATM PNC are fragmented; the left fragment contains the main cluster, EPN #1, #2, and #3, and one ATM switch, while the right fragment contains the ESS cluster, EPN #4, #5, and #6, and the second ATM switch; each ATM switch still carries inter-PN bearer traffic within its own fragment.]
Figure 48 – Control Network and ATM PNC Fragmentation (Multi-switched ATM Network)
The following list discusses the steps that occur when the non-faulted system (Figure 45) encounters a
control network fragmentation and ATM PNC fragmentation causing some of the system to be taken over by an ESS cluster (Figure 48).
1. An event occurs that fragments the control network, preventing communication between the main server and some port networks, and that fragments the ATM PNC, preventing fallback recovery.
2. The socket connections between the main cluster and port networks #4 and #5 both fail.
3. The socket connections between the ESS cluster and port networks #2 and #3 both fail.
4. The IPSIs in PN #4 and #5 independently detect the loss of the main server's control link.
5. The IPSIs in PN #2 and #3 independently detect the loss of the ESS's keep-alive link.
6. The IPSIs in PN #4 and #5 remove the main cluster from their priority lists and move the ESS cluster into the 1st alternative slot.
7. The IPSIs in PN #2 and #3 remove the ESS cluster from their priority lists, leaving them with no alternatives to the main cluster.
8. The IPSIs in PN #4 and #5 each start their own no service timers.
9. When each IPSI's no service timer expires, it requests service from its 1st alternative (the ESS) since no fallback recovery took place.
10. The ESS acknowledges each IPSI's service request by taking control of it.
11. Port networks #4 and #5 are brought back into service via a cold restart.
12. The ESS tunnels a control link through the IPSI in PN #5, for this example, over the ATM PNC to port network #6 and brings PN #6 back into service via a cold restart.
The left side of the system (PN #1, #2, and #3) remains under the control of the main servers and is
operating as a stand-alone system. Since the ATM switch is still interconnecting port networks #1, #2, and
#3, bearer connectivity between them is possible and the tunneled control link to PN #1 is still viable.
Other than being in a fragmented state, the left side of the system is operating equivalently to how it was before the failure. The right side of the system (PN #4, #5, and #6) experiences a cold restart when it is taken over
by the ESS server, but then operates as it did before the failure occurred but in a fragmented environment
(see the ESS in Control section). Since the port networks (PN #4, #5, and #6) are still interconnected via an
ATM switch, inter-port network bearer traffic is still possible along with tunneled control for PN #6.
When the control network is healed, the main server informs the IPSIs in PN #4 and #5 that it is available
to provide service, and the ESS server informs the IPSIs in PN #2 and #3 that it is available to provide service.
All the IPSIs make appropriate changes to their priority lists based on the recovered connectivity, but do
not switch the cluster from which they are currently getting service. Immediately after the fragmentation
fault occurs, the main server continuously attempts to establish tunneled control links to port networks #4,
#5, and #6 through the ATM network, but fails to do so due to the lack of connectivity. After the ATM
PNC is healed, the connectivity is restored, but the tunneled control links still fail because the EI boards, where the control links terminate, reject them. The EI boards in PN #4 and #5 disallow the control link
because the port network is already being controlled through the IPSIs. This is accomplished by a method
discussed in the last section of this paper. The EI board in PN #6 also disallows the control link because
the EI board is designed to only allow one control link at any one time and it already has a control link up
to it through PN #5. Also, the ATM PNC is self-managed (see the Background section) allowing two
intelligent entities (the main SPE and the ESS SPE) to both use its resources at the same time without
conflicting with each other. If this were not the case, the two systems could not act independently while still sharing the ATM PNC for inter-port network bearer communication.
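The rejection behaviour just described can be summarised as a simple admission rule on the EI board: accept a tunneled control link only if the port network is not already being controlled, either directly through its IPSIs or through an existing EI control link. The following Python fragment is an illustrative sketch with invented names, not firmware logic.

# Sketch of the EI board's control-link admission rule described above (invented names).

def accept_tunneled_control_link(controlled_via_ipsi, existing_ei_control_link):
    """An EI board allows at most one control link, and refuses one entirely
    if the PN is already controlled through its IPSI(s)."""
    if controlled_via_ipsi:
        return False          # PN #4 / PN #5 case: the IPSIs already carry control
    if existing_ei_control_link:
        return False          # PN #6 case: a tunneled link through PN #5 is already up
    return True


print(accept_tunneled_control_link(controlled_via_ipsi=True,  existing_ei_control_link=False))  # False
print(accept_tunneled_control_link(controlled_via_ipsi=False, existing_ei_control_link=True))   # False
print(accept_tunneled_control_link(controlled_via_ipsi=False, existing_ei_control_link=False))  # True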
Figure 49 below shows the same fragmentation of the control network and the ATM network as discussed
above and, in this case, the ATM network consists of a single ATM switch which happens to be on the left
side of the fragmentation.
[Figure: the control network and the ATM PNC are fragmented and the single ATM switch sits on the left side, so EPN #1, #2, and #3 remain interconnected under the main cluster; on the right side, the ESS controls EPN #4 and #5 directly, but there is no inter-PN bearer traffic between them and PN #6 is out of service.]
Figure 49 – Control Network and ATM PNC Fragmentation (Single-switched ATM Network)
The following list discusses the steps that occur when the non-faulted system (Figure 45) encounters a
control network fragmentation and ATM PNC fragmentation causing some of the system to be taken
over by an ESS cluster (Figure 49).
1. An event occurs that fragments the control network, preventing communication between the main
   server and some port networks, and that fragments the ATM PNC, preventing fallback recovery.
2. The socket connections between the main cluster and port networks #4 and #5 both fail.
3. The socket connections between the ESS cluster and port networks #2 and #3 both fail.
4. The IPSIs in PN #4 and #5 independently detect the loss of the main server's control link.
5. The IPSIs in PN #2 and #3 independently detect the loss of the ESS's keep-alive link.
6. The IPSIs in PN #4 and #5 remove the main cluster from their priority lists and move the ESS
   cluster into the 1st alternative slot.
7. The IPSIs in PN #2 and #3 remove the ESS cluster from their priority lists, leaving them with no
   alternatives to the main cluster.
8. The IPSIs in PN #4 and #5 each start their own no service timers.
9. When each IPSI's no service timer expires, it requests service from its 1st alternative (the ESS)
   since no fallback recovery took place (a minimal sketch of this behavior follows the list).
10. The ESS acknowledges each IPSI's service request by taking control of it.
11. Port networks #4 and #5 are brought back into service via a cold restart.
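The per-IPSI behavior in the list above can be summarized with a minimal Python sketch. It is an illustration only, with invented names and without real timers, showing the priority-list adjustment, the no service timer, and the request for service from the 1st alternative.

    class IPSI:
        """Toy model of an IPSI's priority list and no service timer handling."""
        def __init__(self, pn, priority_list):
            self.pn = pn
            self.priority_list = list(priority_list)   # e.g. ["main", "ESS"]
            self.controlling = self.priority_list[0]
            self.no_service_timer_running = False

        def lose_link_to(self, cluster):
            # Remove the unreachable cluster; if it was the controller, start the timer.
            self.priority_list.remove(cluster)
            if cluster == self.controlling:
                self.controlling = None
                self.no_service_timer_running = True

        def no_service_timer_expires(self):
            # No fallback recovery took place, so request service from the 1st alternative.
            if self.no_service_timer_running and self.priority_list:
                self.controlling = self.priority_list[0]
                self.no_service_timer_running = False
            return self.controlling               # the PN then comes up via a cold restart

    ipsi_pn4 = IPSI(4, ["main", "ESS"])
    ipsi_pn4.lose_link_to("main")                 # steps 4, 6, and 8
    print(ipsi_pn4.no_service_timer_expires())    # "ESS" - steps 9 and 10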
The left side of the system (PN #1, #2, and #3) operates in exactly the same manner as in the previous
failure. Since the single ATM switch resides on the left side of the fragmentation, port networks #1, #2, and
#3 remain interconnected. The main SPE controls port networks #2 and #3 through their resident IPSIs and
port network #1 via the ATM PNC. The right side of the fragmentation, however, is very different from the
previous example: there is no ATM switch interconnecting port networks #4, #5, and #6, which has two
major implications. The first is that no tunneled control links are possible (there is no viable pathway), so
PN #6 cannot receive control from any SPE. Port networks #4 and #5 have resident IPSIs, so their service
state is independent of the health of the interconnecting ATM PNC. However, without an ATM switch
interconnecting port networks #4 and #5, there is no inter-port network bearer communication. For
example, a call entering through a trunk in PN #4 destined for a user off of PN #5 cannot be completed and
is instead routed to the called party's coverage. In other words, port networks #4 and #5 are in service and
controlled by the same SPE (the ESS server), but function as stand-alone islands without the ATM PNC
connectivity.
When the faulted ATM network is fixed, two major events take place. First, port networks #4 and #5 are
able to communicate over the ATM PNC again; once the SPE recognizes this, it allows inter-port network
bearer traffic again since a viable pathway exists. Second, once PN #6 can communicate over the ATM
PNC, it can receive a tunneled control link to provide it service. The problem, however, is that it is a
non-deterministic race as to which SPE, the main or the ESS, gets a control link up to it first. As discussed
in the previous section, when an SPE is not in control of a PN, it continuously attempts to bring up a control
link through the ATM PNC. This operation usually fails because either there is no pathway for the link to
be established on or the terminating EI rejects the establishment because the PN is already being controlled.
In this case, however, once link establishments are no longer blocked by the ATM PNC fragmentation, and
since the main SPE and the ESS SPE do not coordinate recovery actions, there is a race condition between
them as to which one gets control of PN #6 first. Said differently, once the ATM PNC is healed all the port
networks receive service, but the administrator has no control over which SPE ends up controlling the
non-IPSI connected PN. The only way to ensure that non-IPSI connected port networks receive their
control from a certain SPE is to have that SPE control all the IPSIs in the system. If other SPEs in the
system do not control any IPSIs, they have no entry point into the ATM PNC and therefore never attempt
to establish control links to the non-IPSI connected PNs.
An undesirable scenario, called port network hijacking, can occur when not every port network has a
resident IPSI. Suppose that, in the example above, the main SPE brought up a tunneled control link through
PN #3 to PN #6 before the ESS server did. Port network #6 would be brought into service via a cold restart
and then be part of the partial system controlled by the main SPE. Since the ESS SPE does not control PN
#6, it continuously attempts to bring up a control link to it, but is blocked by the EI board because another
control link is already established from the main SPE. However, if another event causes a short
communication fault between the main SPE and PN #6, then PN #6 would temporarily be uncontrolled (its
control link goes down). Typically, the main SPE re-establishes a control link very quickly and resumes
controlling the PN by bringing it into service via a warm restart (no effect on stable calls). Unfortunately,
there is a chance that the ESS SPE brings up its control link to port network #6 during this window; if this
happens, PN #6 is brought into service under the control of the ESS server via a cold restart (dropping all
active calls) and the main SPE's attempts to restore service are blocked. This type of event is referred to as
port network hijacking.
While a system configured to use an ATM PNC does not require an IPSI in every port network to be
functional and to be protected by the ESS offering, there are a number of drawbacks to omitting them.
First, a non-IPSI port network's service state is inherently tied to the availability of the ATM PNC, and
conditions can arise, as shown in the previous example, whereby non-IPSI port networks do not receive
control from ESS clusters in failure situations. Second, there are no mechanisms that allow the
administrator to dictate where a non-IPSI connected port network gets control from, and it is
non-deterministic where such a port network gets service from during recovery. Finally, recovery of
non-IPSI connected port networks takes longer than that of IPSI connected port networks, since they rely on
IPSI connected port networks being in service before control links can be tunneled. For these reasons, it
can be concluded that a system with IPSIs in every port network has higher availability than one without.
8.7 Basic CSS PNC Configured System
Another configuration covered is a system configured to use a CSS PNC. As in the ATM PNC case, the
CSS PNC supports not only inter-port network bearer connections, but also a medium for tunneling control
links. Therefore, a system leveraging a CSS PNC is not required to have an IPSI in every port network.
The following diagram, showing a non-faulted CSS PNC configured system, has PN #1 getting service
through PN #2 and PN #6 getting service through PN #5. The CSS PNC in this case is being used as the
inter-port network bearer medium and also a conduit for control links to non-IPSI connected PNs. In
addition, every IPSI has created a priority list consisting of the main cluster followed by the ESS cluster.
[Figure diagram: main and ESS clusters with IPSI control links over the control LAN; PN #1 and PN #6 are non-IPSI connected and receive tunneled EALs, with inter-PN bearer traffic carried over the CSS PNC.]
Figure 50 – Non-faulted System with CSS PNC
In the How are Port Networks Interconnected section, the CSS PNC is described as being SPE managed.
The ramification of a PNC being SPE managed is that only one intelligent entity may utilize its resources; if
two or more intelligent entities attempted to use the PNC simultaneously, many resource conflicts would be
encountered, causing complete CSS PNC failure. The IP PNC and ATM PNC are self managed and
therefore do not have this limitation (see Figure 44 and Figure 48 respectively). With many intelligent
entities within the system (the main SPE and ESS SPEs), the only way to guarantee that only one SPE is
utilizing the CSS PNC at a time is to block all other SPEs from using it. The main SPE and the ESS SPEs
do not communicate about call state or resource allocation, so the blocking cannot be done on a dynamic
basis. Instead, the ESS offering imposes a requirement that an ESS SPE never attempts to utilize the CSS
PNC under any circumstances. As the fault scenarios unfold below, this has a major effect on port network
recovery and on the way PNs communicate with each other in a failure mode.
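The rule stated above can be captured in a few lines of Python. This is only a sketch with assumed names: self-managed PNCs (IP and ATM) arbitrate their own resources and may carry traffic for both the main and ESS SPEs, while the SPE-managed CSS PNC is reserved to the main SPE.

    SELF_MANAGED_PNCS = {"IP", "ATM"}   # the PNC arbitrates its own resources
    SPE_MANAGED_PNCS = {"CSS"}          # exactly one controlling SPE allowed

    def spe_may_use_pnc(pnc_type, spe_role):
        """spe_role is 'main' or 'ess'; returns whether that SPE may use the PNC."""
        if pnc_type in SELF_MANAGED_PNCS:
            return True                          # main and ESS SPEs can share it safely
        if pnc_type in SPE_MANAGED_PNCS:
            return spe_role == "main"            # an ESS SPE never touches the CSS PNC
        raise ValueError("unknown PNC type: " + pnc_type)

    print(spe_may_use_pnc("CSS", "ess"))         # False
    print(spe_may_use_pnc("ATM", "ess"))         # True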
8.8 Catastrophic Server Failure in a CSS PNC Environment
The next scenario described is the failover of IPSI connected port networks to an ESS cluster in a system
whose port networks are configured to use a CSS PNC. The figure below shows a catastrophic failure
leaving the main cluster inoperable and all the IPSIs failing over to the ESS cluster.
[Figure diagram: the main cluster is inoperable and the ESS controls all IPSI connected PNs; PN #1 and PN #6 are out-of-service, and inter-PN bearer traffic between the surviving PNs is carried over an IP PNC instead of the CSS PNC.]
Figure 51 – Catastrophic Server Failure in a CSS PNC Environment
The following list discusses the steps that occur when the non-faulted system (Figure 50) encounters a
server failure causing some of the system’s port networks to transition over to an ESS (Figure 51).
1. An event renders the main cluster inoperable.
2. The socket connections between the main cluster and the IPSIs all fail.
3. Each IPSI independently detects the loss of its control link.
4. Each IPSI removes the main cluster from its priority list and moves the ESS cluster into the 1st
   alternative slot.
5. Each IPSI starts its own no service timer.
6. When each IPSI's no service timer expires, it requests service from its 1st alternative (the ESS).
7. The ESS acknowledges each IPSI's service request by taking control of it.
8. Every IPSI connected port network is then brought back into service via a cold restart.
Because ESS servers can never utilize the CSS PNC, a system configured with a CSS PNC cannot operate
in the same manner, or with equivalent service, under an ESS cluster as it does under the main cluster. As
Figure 51 shows, port networks without IPSIs (PN #1 and #6) are not provided service by the ESS server
and therefore are not supplying service to their end users. This is because, without an IPSI, the only way to
control a port network is with a tunneled control link through the CSS PNC, and since the ESS SPE cannot
use any CSS PNC resources, no pathway exists for the control link to traverse. This leads to a very basic
design principle – a port network in a CSS PNC environment must have a resident IPSI if it is to be
provided additional reliability by the ESS offering.
Another side-effect of the ESS SPE's inability to use the CSS PNC is that a different communication
medium between the port networks is needed. If an ESS server controls multiple CSS PNC connected port
networks, it routes inter-port network bearer traffic over an IP PNC. In other words, an ESS SPE ignores
the existence of EI boards in the port networks it controls and treats each PN as if it were configured to use
an IP PNC. This implies that for a port network to communicate with other PNs in survivability mode, it
needs a MEDPRO allowing it to interface with an IP network for VoIP traffic. If the port network does not
have a MEDPRO, the ESS still provides it service, but the PN becomes its own island, unable to
communicate with other PNs controlled by the ESS.
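The bearer-path decision described above for a CSS-connected PN under ESS control can be sketched as follows; the function and its arguments are assumptions for illustration, not product code.

    def bearer_path(controlling_spe, pn_has_medpro):
        """Return the inter-PN bearer medium for a CSS PNC connected port network."""
        if controlling_spe == "main":
            return "CSS PNC"              # normal operation under the main cluster
        # An ESS SPE ignores the EI boards and treats the PN as IP PNC connected.
        if pn_has_medpro:
            return "IP PNC"               # VoIP bearer over the IP network
        return "island"                   # in service, but no inter-PN bearer traffic

    print(bearer_path("ess", pn_has_medpro=True))    # IP PNC
    print(bearer_path("ess", pn_has_medpro=False))   # island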
When the main cluster is repaired, it informs all the IPSIs that it is available to provide service if they
require it. The IPSIs place the main cluster into their priority lists appropriately, but do not shift under the
main cluster's control automatically. The main SPE has the ability to utilize the CSS PNC to provide
service to PN #1 and #6, but it does not yet have an access point to do so. Once control of one IPSI is
transferred to the main SPE, it immediately tunnels control links through that IPSI to port networks #1 and
#6.
8.9 Network Fragmentation Failure in a CSS PNC Environment
This next scenario, network fragmentation failure, has two variations that need to be addressed. The first
and much more common variation is a failure condition that causes only the control network to fragment
while leaving the CSS PNC intact. The second variation is a failure condition that causes both the control
network and the CSS PNC to fragment simultaneously. This variation is very unlikely because the control
network and the CSS PNC are always completely separate networks, so the only realistic way for it to occur
is if they share a single point of failure such as a conduit. If the wiring of both networks is run within the
same physical conduit and that conduit is cut, both networks fragment at the same time. The following
figure shows the first variation, where only the control network fragments, and how the system restores
control to the affected port networks.
[Figure diagram: control network fragmentation only; the CSS PNC remains intact, so the main cluster retains control of the right-side PNs via tunneled EALs and inter-PN bearer traffic stays on the CSS PNC.]
Figure 52 – Control Network Only Fragmentation in a CSS PNC Environment
The following list discusses the steps that occur when the non-faulted system (Figure 50) encounters a
control network fragmentation causing some of the system's port networks to go into fallback control
(Figure 52). It is important to note that this type of failure is resolved with system capabilities that existed
prior to the ESS offering. This scenario is reviewed to show how ESS allows other recovery mechanisms,
which usually have a less dramatic effect on end users, to attempt to resolve issues before the ESS steps in
to provide control.
1. An event occurs that fragments the control network, preventing communication between the main
   server and some port networks.
2. The socket connections between the main cluster and port networks #4 and #5 both fail.
3. The socket connections between the ESS cluster and port networks #2 and #3 both fail.
4. The main SPE detects the loss of connectivity to the IPSIs in PN #4 and #5.
5. The IPSIs in PN #4 and #5 independently detect the loss of the main server's control link.
6. The IPSIs in PN #2 and #3 independently detect the loss of the ESS's keep-alive link.
7. The IPSIs in PN #4 and #5 remove the main cluster from their priority lists and move the ESS
   cluster into the 1st alternative slot.
8. The IPSIs in PN #2 and #3 remove the ESS cluster from their priority lists, leaving them with no
   alternatives to the main cluster.
9. The IPSIs in PN #4 and #5 each start their own no service timers.
10. The main SPE takes control of PN #4, #5, and #6 indirectly over the CSS PNC through the IPSIs
    in PN #2 and #3.
11. The IPSIs in PN #4 and #5 detect that their associated port networks are being controlled and
    cancel their no service timers.
Since the port networks on the right side of the fragmentation were taken over by the main SPE through a
fallback recovery (discussed in the Reliability - Single Cluster Environments section), they were brought
back into service via a warm restart, meaning no stable calls were affected. In addition, the IPSIs in PN #4
and #5 do not request service from the ESS server while the main server is controlling their port networks
through the CSS PNC. If the tunneled link fails, the IPSI begins its no service timer once again and the
process starts over at step #9 above. Once the network fragmentation is fixed, the main SPE contacts the
IPSIs in PN #4 and #5 and immediately takes control of them, since the SPE knows it is already controlling
(albeit indirectly) their associated port networks. After the SPE deems the connectivity between itself and
the IPSIs stable, it transitions the control links from being tunneled through the PNC to terminating on the
IPSIs directly (control link fall-up). It should be apparent that in this failure scenario, systems configured
with an ATM PNC or with a CSS PNC behave identically.
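The timer handling around steps 9 through 11 can be sketched in Python as well; this is a simplification with invented names, showing only that the no service timer is cancelled when fallback control is detected and restarted if the tunneled link later fails.

    class NoServiceTimer:
        """Toy model of an IPSI's no service timer during fallback recovery."""
        def __init__(self):
            self.running = False

        def start(self):                          # step 9: main control link lost
            self.running = True

        def pn_controlled_via_pnc(self):          # step 11: fallback recovery succeeded
            self.running = False                  # no request is sent to the ESS

        def tunneled_link_failed(self):           # process starts over at step 9
            self.running = True

    timer = NoServiceTimer()
    timer.start()
    timer.pn_controlled_via_pnc()
    print(timer.running)                          # False - ESS takeover avoided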
[Figure diagram: two CSS PNC layouts - a single carrier CSS interconnecting EPN #1 through #4, and a multi-carrier CSS (split CSS) spread across Carrier #1 and Carrier #2.]
Figure 53 – CSS PNC Layouts
The CSS PNC, as discussed in the How are Port Networks Interconnected section, can be physically laid
out in two manners – a single carrier format or over multiple carriers (referred to as split CSS), as displayed
above in Figure 53. Since an ESS server never attempts to use the CSS PNC, it is inconsequential for this
discussion which layout is being used. In the case of a single carrier, the port networks on the right side of
the fragment lose the physical connectivity between each other that the CSS carrier provided. In the case of
split CSS, the port networks on the right side of the fragment may still have physical connectivity through
one of the carriers making up the CSS, but the ESS server will not leverage it. Figure 54 shows a system
that has a fragmented control network and a fragmented CSS PNC.
[Figure diagram: both the control network and the CSS PNC are fragmented; the main cluster keeps PN #1-#3 on the CSS PNC, the ESS controls PN #4 and #5 with inter-PN bearer traffic over an IP PNC, and PN #6 is out-of-service.]
Figure 54 – Control Network and CSS PNC Fragmentation
The following list discusses the steps that occur when the non-faulted system (Figure 50) encounters a
control network fragmentation and CSS PNC fragmentation causing some of the system to be taken
over by an ESS cluster (Figure 54).
1. An event occurs that fragments the control network, preventing communication between the main
   server and some port networks, and that fragments the CSS PNC, preventing fallback recovery.
2. The socket connections between the main cluster and port networks #4 and #5 both fail.
3. The socket connections between the ESS cluster and port networks #2 and #3 both fail.
4. The IPSIs in PN #4 and #5 independently detect the loss of the main server's control link.
5. The IPSIs in PN #2 and #3 independently detect the loss of the ESS's keep-alive link.
6. The IPSIs in PN #4 and #5 remove the main cluster from their priority lists and move the ESS
   cluster into the 1st alternative slot.
7. The IPSIs in PN #2 and #3 remove the ESS cluster from their priority lists, leaving them with no
   alternatives to the main cluster.
8. The IPSIs in PN #4 and #5 each start their own no service timers.
9. When each IPSI's no service timer expires, it requests service from its 1st alternative (the ESS)
   since no fallback recovery took place.
10. The ESS acknowledges each IPSI's service request by taking control of it.
11. Port networks #4 and #5 are brought back into service via a cold restart.
The left side of the system (PNs #1, #2, and #3) remains under the control of the main servers and operates
as a stand-alone system. Since the CSS PNC still interconnects port networks #1, #2, and #3, bearer
connectivity between them is possible and the tunneled control link to PN #1 is still viable. Other than
being in a fragmented state, the left side of the system operates just as it did before the failure. The right
side of the system does not recover as smoothly as in the ATM PNC environment. First, PNs #4 and #5 are
supplied service by the ESS server, but their inter-port network traffic is now routed over an IP network. In
addition, port network #6 is not supplied service by either the main SPE or the ESS SPE and therefore is
inoperable. This implies that a CSS PNC connected port network is required to have a resident IPSI (for
control) and a MEDPRO (for inter-port network bearer communication) if it is to be supplied service by an
ESS in failure modes.
When the control network is healed, the main server informs the IPSIs in PN #4 and #5 that it is available
to provide service, and the ESS server informs the IPSIs in PN #2 and #3 that it is available to provide
service. All the IPSIs make the appropriate changes to their priority lists based on the recovered
connectivity, but do not switch the cluster from which they are currently getting service. From the moment
the fragmentation fault occurs, the main server continuously attempts to establish tunneled control links to
port networks #4, #5, and #6 through the CSS PNC, but fails to do so due to the lack of connectivity. After
the CSS PNC is healed, the connectivity is restored, but the tunneled control links to PN #4 and #5 still fail
because the EI boards where the control links terminate reject them. The EI boards in PN #4 and #5
disallow the control link because their port networks are already being controlled through their IPSIs; this
is accomplished by a method discussed in the last section of this paper. However, the EI board in PN #6
accepts the control link because no other control link terminates on it, and the main server takes control of
that port network. Unlike the ATM PNC scenario, in which the main SPE and the ESS SPEs can race to
control non-IPSI connected PNs, there is no race condition here since only the main SPE ever attempts to
tunnel control links through the CSS PNC. After the administrator agglomerates the system by transferring
control of PN #4 and #5 from the ESS server back to the main server, the bearer traffic between the port
networks is sent back over the CSS PNC rather than the IP PNC.
8.10 Mixed PNC Configured System
The final configuration covered in this section is a system designed to use a mixed PNC environment. In
the three previous sections, the PNC configurations discussed were all inclusive, meaning that all port
networks shared the same PNC. In an IP PNC configured system, all of the port networks utilize an IP
network for inter-port network bearer communication. In an ATM PNC configured system, all of the port
networks utilize an ATM network for inter-port network bearer communication. In a CSS PNC configured
system, all of the port networks utilize a center-stage switch for inter-port network bearer communication.
The mixed PNC configuration allows some of the port networks to utilize one instance of either a CSS PNC
or an ATM PNC while the other port networks in the system utilize an IP PNC. Figure 55, in the ESS in
Action – A Virtual Demo section, shows a system using the mixed PNC configuration. How a port network
in a mixed PNC system operates in a failure scenario is determined by the type of PNC it interfaces with.
For example, port networks connected to either a CSS PNC or an ATM PNC operate in survivability mode
as if the entire system were using a CSS PNC or an ATM PNC respectively, and port networks connected
to an IP PNC operate in survivability mode as if the entire system were configured to use an IP PNC.
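A tiny Python sketch (assumed names only) captures the mixed PNC rule: each port network's survivability behavior is dispatched on the PNC family it is attached to rather than on any system-wide setting.

    def survivability_behavior(pn_pnc_type):
        """Return how a PN behaves in survivability mode in a mixed PNC system."""
        behaviors = {
            "IP":  "behaves as if the whole system used an IP PNC",
            "ATM": "behaves as if the whole system used an ATM PNC",
            "CSS": "behaves as if the whole system used a CSS PNC",
        }
        return behaviors[pn_pnc_type]

    for pnc in ("CSS", "IP"):
        print(pnc, "->", survivability_behavior(pnc))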
Section 9 - ESS in Action - A Virtual Demo
9.1 Demo - Setup
To summarize the material presented thus far and to show how it is applied to a real system, this section
takes you through a "virtual demo" of the ESS feature. The example system, shown in Figure 55 below,
supplies phone service to three geographically dispersed campuses that are interconnected through an IP
WAN. The main site, on the left, is an upgraded DEFINITY switch consisting of traditional MCC cabinets
interconnected with a CSS PNC. The other two locations are newer sites using G650 cabinets that are
interconnected to each other and back to the main site via an IP network. This layout is another example of
a system configured to use a mixed PNC. Except for PN #4, all the port networks have a resident IPSI
through which the main server controls them by means of a direct EAL. Port network #4 is a non-IPSI
connected port network and gets its control from the main server indirectly via a tunneled EAL, in this case
through port network #3.
[Figure diagram: the main cluster and ESS #10, ESS #20, and ESS #30 spread across three sites connected by an IP WAN; EPN #1-#4 at the main site share a CSS PNC (PN #4 has no IPSI), while EPN #5 and #6 sit at the second site and EPN #7 at the third, each with its own IPSI.]
Figure 55 – Communication System Layout for Virtual Demo
There are a number of steps that can be taken to increase a system's reliability, including the duplication of
IPSIs within each PN, the duplication of the CSS PNC, and the addition of ESS servers. Since the focus of
this virtual demo is the ESS feature, and to keep the failure cases as simple as possible, standard reliability
(simplex IPSIs and a simplex PNC) is used in this example. Given the following list of objectives, the
strategic placement of the ESS servers can be determined and their prioritization can be derived.
1. All sites must be able to survive in isolation from the rest of the system.
2. All sites must be able to survive a catastrophic main server failure.
3. There is more than adequate WAN bandwidth available for port network control links and
   inter-port network bearer communication.
4. Upon any failure, the number of autonomous systems created should be minimized.
Objective #1 requires an ESS server to exist at every remote location since each site needs to be operational
if it gets fragmented away from the rest of the system. Objective #2 subtly requires that an ESS server also
be placed at the main location because it must be able to survive a catastrophic main server failure. It is
unable to achieve this if it does not have a local control source alternative and becomes isolated from the
other sites. Objective #3 and objective #4 dictate how the priority scores and attributes for the three ESS
clusters should be administered. Since bandwidth is not an issue and the failover objective is to minimize
fragmentation, none of the ESS clusters should utilize the local preference or the local only attributes which
implies they also do not need to override local preference boosts with the system preferred option (see the
What is a Priority Score section for more details). In addition, the objectives do not specify a desired
failover order, so there are no restrictions for the priority score rankings of the ESS clusters. The
administration of the ESS clusters for this demonstration is shown in the SAT screen shot below by
executing the “display system-parameters ess” command.
display system-parameters ess                                              Page 1 of 7
                     ENTERPRISE SURVIVABLE SERVER INFORMATION

     Cl  Plat      Server A             Server B             Pri  Com  Sys  Loc  Loc
     ID  Type      ID  Node Name        ID  Node Name        Scr  Prf  Prf  Prf  Only
     --------------------------------------------------------------------------------
 MAIN SERVERS
     1   Duplex    1   172.16.192.1     2   172.16.192.2

 ENTERPRISE SURVIVABLE SERVERS
     10  Simplex   10  ESS10-10                              75   1    n    n    n
     20  Duplex    20  ESS20-20         21  ESS20-21         100  1    n    n    n
     30  Simplex   30  ESS30-30                              50   1    n    n    n
Screen Shot 1 – ESS Cluster Administration
The first observation from the ESS cluster administration is that ESS #10 and ESS #30 have platform types
(Plat Type) of simplex, implying they are from the S8500 family of media servers and therefore have no
Server B information. ESS #20 has a platform type of duplex, meaning it is from the S87XX family of
media servers. The second observation is that the ranked order of the ESS clusters is ESS #20, then ESS
#10, and finally ESS #30, since they have priority scores of 100, 75, and 50 respectively. It is important to
remember that the cluster IDs assigned come directly from the license file on each server and that the main
cluster always has an ID of #1. Once the ESS clusters are added to the system and installed, they need to
register with the main server to avoid alarming and to get updated translations. The status of the
registration links and the ESS clusters can be seen in the SAT screen shot below using the "status ess
clusters" command.
This command shows a quick view of the status of every ESS cluster in the system. It includes basic
information such as whether the ESS cluster is enabled (Enabled?), whether the ESS cluster is registered
(Registered?), and which server is active within each cluster (Active Server ID). In addition, it also shows
the administrator whether the ESS cluster's translation set is up-to-date (the time stamp under Translations
Updated for the ESS cluster matches the main cluster's time stamp) and what software load each cluster is
currently running (Software Version).
status ess clusters                                                     Cluster ID 1
                              ESS CLUSTER INFORMATION

  Cluster               Active                    Translations        Software
  ID        Enabled?    Server ID    Registered?  Updated             Version
  1         y           1            y            19:30 11/14/2005    R013x.01.0.626.0
  10        y           10           y            19:30 11/14/2005    R013x.01.0.626.0
  20        y           20           y            19:30 11/14/2005    R013x.01.0.626.0
  30        y           30           y            19:30 11/14/2005    R013x.01.0.626.0

Command successfully completed
Command:
Screen Shot 2 – ESS Cluster Registration Status
As described in the How do IPSIs Manage Priority Lists section, IPSIs maintain independent priority
failover lists of ESS clusters. For this reason, the command "status ess port-networks" queries each IPSI in
real time for its priority list and then displays them for the administrator. This command is critical in
verifying that the actual failover behavior of the IPSIs matches the desired failover behavior as
administered, and in determining which clusters are currently in control of which IPSIs.
status ess port-networks                                                Cluster ID 1
                            ESS PORT NETWORK INFORMATION

                          Port  IPSI    Pri/   Pri/      Cntl  Connected
      Com   Intf   Intf   Ntwk  Gtway   Sec    Sec       Clus  Clus(ter)
  PN  Num   Loc    Type   Ste   Loc     Loc    State     ID    IDs
  1   1     1A01   IPSI   up    1A01    1A01   actv-aa   1     1 20 10 30
  2   1     2A01   IPSI   up    2A01    2A01   actv-aa   1     1 20 10 30
  3   1     3A01   IPSI   up    3A01    3A01   actv-aa   1     1 20 10 30
  4   1     4A01   EI     up    3A01
  5   1     5A01   IPSI   up    5A01    5A01   actv-aa   1     1 20 10 30
  6   1     6A01   IPSI   up    6A01    6A01   actv-aa   1     1 20 10 30
  7   1     7A01   IPSI   up    7A01    7A01   actv-aa   1     1 20 10 30

Command successfully completed
Command:
Screen Shot 3 – Non-Faulted Environment IPSI Priority Lists
The first column (PN) on the left shows the port network number to which all the data in that row
corresponds. The next column (Com Num) shows what community the port network and its resident IPSI
are assigned to. The fifth column, Port Ntwk Ste, presents the port network state, either up (in-service) or
down (out-of-service), with respect to the server executing the command. In this case, the command was
run on the main server and the main cluster is currently controlling all of these port networks in a
non-faulted environment.
The third and fourth columns, Intf Loc and Intf Type, along with the sixth column, IPSI Gtway Loc,
describe how the port network is currently being given service by the controlling server. The interface
location is the board on which the EAL terminates and the interface type is the type of board (EI or IPSI)
where the arch-angel resides. The IPSI gateway location field shows the administrator which IPSI is
supporting the control link for this port network. As described in the How are Port Networks Controlled
section, there are two methods to control a port network -- either directly through a resident IPSI (left side
of Figure 56) or indirectly through an IPSI residing in another PN (right side of Figure 56). If a port
network's IPSI gateway location is the same as the port network's interface location, then it can be
concluded that the port network is being controlled directly from a server through its own IPSI. Otherwise,
if they do not match, the port network is being controlled indirectly through another port network's IPSI.
To determine which port network is supporting another port network in these indirect control situations,
figure out which port network the IPSI gateway board resides in via the "list cabinet" command.
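The direct-versus-indirect rule just described can be expressed as a one-line check; the Python helper below is illustrative only, using the field values from the screen above.

    def control_mode(intf_loc, ipsi_gtway_loc):
        """Classify PN control from the Intf Loc and IPSI Gtway Loc fields."""
        if intf_loc == ipsi_gtway_loc:
            return "direct (through the PN's own IPSI)"
        return "indirect (tunneled through another PN's IPSI)"

    print(control_mode("1A01", "1A01"))   # PN #1: direct control
    print(control_mode("4A01", "3A01"))   # PN #4: indirect control via PN #3's IPSI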
[Figure diagram: two panels - Direct Port Network Control, where the SPE's EAL terminates on the PN's own IPSI (interface type IPSI), and Indirect Port Network Control, where the EAL is tunneled through another PN's IPSI gateway and terminates on an EI board (interface type EI).]
Figure 56 – Direct PN Control versus Indirect PN Control
The next two columns, Pri/Sec Loc and Pri/Sec State, describe the port network’s IPSI(s), if they exist.
This screen shot shows that PN #1 has its primary IPSI in board location 1A01 and it is currently active
supporting the arch-angel (actv-aa). If the port network had duplicated IPSIs, a second row would appear
immediately below describing the secondary IPSI. Since this is a stable non-faulted environment, the
primary IPSI location should also be the IPSI gateway location (since this IPSI is supporting the PN) and
the interface location (since this IPSI is supporting the arch-angel). There are failure scenarios for which
these three values may not be equal. For example, if the port network experienced an IPSI outage, control
could be tunneled through the CSS PNC. In this case, the interface location would be the EI board (where
the arch-angel resides) and the IPSI gateway location would be another port network’s IPSI location (where
the hybrid EAL terminates). The primary location of the IPSI would remain unchanged since that is an
administered location, and the primary IPSI state would be active (because it is the active IPSI regardless of
whether it is being used).
The rest of the row contains the results of a real-time query to the IPSIs concerning which cluster is their
master and what their priority lists currently are. The Cntl Clus ID column is the cluster ID returned by the
IPSI identifying its currently controlling cluster. In this example, cluster ID #1 is returned, which is the
main cluster's default ID. The next columns, Connected Clus(ter) IDs, show the IPSI's priority list. In this
example, based on the administration previously done, each IPSI has a preference list headed by the main
cluster (ID #1) followed by ESS clusters #20, #10, and #30.
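For this demo, where no local preference or local only attributes are used, the preference list reported by each IPSI can be reproduced with a short Python sketch (assumed data structures, ignoring the attribute boosts discussed in the What is a Priority Score section): the main cluster always comes first, followed by the ESS clusters in descending order of priority score.

    ess_priority_scores = {20: 100, 10: 75, 30: 50}   # cluster ID -> administered Pri Scr

    def preference_list(main_cluster_id, ess_scores):
        ranked_ess = sorted(ess_scores, key=ess_scores.get, reverse=True)
        return [main_cluster_id] + ranked_ess

    print(preference_list(1, ess_priority_scores))    # [1, 20, 10, 30]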
It is important to note that the entry for port network #4, the non-IPSI connected PN, has significant
differences from the others. First of all, the interface type is always an EI board since there are only two
types of boards which can support arch-angels, EIs and IPSIs, and no IPSIs reside in the port network. The
next difference is that the interface location and the IPSI gateway location are never the same since the port
network is always receiving control indirectly through another port network. And finally, there are no
entries for primary / secondary IPSI locations since no IPSIs exist in the port network.
The last item to analyze before introducing failure into the system is the port network view from the ESS
clusters themselves. The execution of the “status ess port-networks” command from ESS #20 is shown
below.
status ess port-networks                                               Cluster ID 20
                            ESS PORT NETWORK INFORMATION

                          Port  IPSI    Pri/   Pri/      Cntl  Connected
      Com   Intf   Intf   Ntwk  Gtway   Sec    Sec       Clus  Clus(ter)
  PN  Num   Loc    Type   Ste   Loc     Loc    State     ID    IDs
  1   1                   down          1A01   active    1     1 20 10 30
  2   1                   down          2A01   active    1     1 20 10 30
  3   1                   down          3A01   active    1     1 20 10 30
  5   1                   down          5A01   active    1     1 20 10 30
  6   1                   down          6A01   active    1     1 20 10 30
  7   1                   down          7A01   active    1     1 20 10 30

Command successfully completed
Command:
Screen Shot 4 – “status ess port-networks” from ESS #20 in Non-faulted Environment
The most glaring difference between executing the command from ESS #20 and executing it from the main
server is the port network state and the fields associated with it. Since the ESS server does not control the
IPSI, and therefore does not control the port network, from its perspective the port network is not in
service, or down. Also, the ESS server does not attempt to establish EALs to the port network, which
means there is no information to report in the interface location, interface type, and IPSI gateway fields.
Another difference is that port network #4 has been omitted from the list of port networks. Given the rule
that ESS servers cannot tunnel control links through the CSS PNC, and given that PN #4 is a non-IPSI
connected port network, it is impossible for the ESS server, under any conditions, to take control of port
network #4 or provide any relevant status for it. This same viewpoint is shared by all of the ESS clusters in
this non-faulted situation.
9.2 Demo – Catastrophic Main Server Failure
Now that the setup of the demo has been reviewed, it is time to introduce the first major fault into the
system - a catastrophic main server failure. Suppose a cataclysmic event occurs that renders the entire
main cluster inoperable. Within a few seconds of the failure, the IPSIs detect that the connection to the
main cluster is down, which causes them to adjust their priority lists by removing the main cluster's ID and
to start their no service timers. Screen Shot #5 shows the results of "status ess port-networks" executed
from ESS #20 again after the failure, but before the IPSIs' no service timers have expired.
status ess port-networks                                               Cluster ID 20
                            ESS PORT NETWORK INFORMATION

                          Port  IPSI    Pri/   Pri/      Cntl  Connected
      Com   Intf   Intf   Ntwk  Gtway   Sec    Sec       Clus  Clus(ter)
  PN  Num   Loc    Type   Ste   Loc     Loc    State     ID    IDs
  1   1                   down          1A01   active    *     20 10 30
  2   1                   down          2A01   active    *     20 10 30
  3   1                   down          3A01   active    *     20 10 30
  5   1                   down          5A01   active    *     20 10 30
  6   1                   down          6A01   active    *     20 10 30
  7   1                   down          7A01   active    *     20 10 30

Command successfully completed
Command:
Screen Shot 5 – Viewpoint from ESS #20 Immediately after Main Cluster Failure
The results of the command give the administrator two very important pieces of information. For every
IPSI in the system, the controlling cluster ID transitioned from #1 (the main cluster's ID) to '*'. The '*'
indicates that the IPSIs are reporting they do not currently have a controlling cluster. Also, each IPSI's
connected cluster IDs list has been adjusted by removing the main cluster's ID and shifting all the ESS
clusters up one position. Since the main cluster no longer appears on any list, the administrator can
determine that a catastrophic event has taken the main servers out-of-service. All of the other ESS clusters
in the system share the same viewpoint as ESS #20.
After the no service timer for each IPSI expires, the IPSI sends a request for service to the highest ranking
cluster in its priority list – ESS #20. ESS #20 receives the requests, takes over the port networks by
establishing control links, and brings each IPSI's PN back into service via a cold restart. PN #4 does not
have a resident IPSI and, since the ESS servers do not support tunneled control links, port network #4
remains non-functional. From this point forward in the demo, it is important to understand the viewpoints
of the system from both ESS #20 and ESS #30.
status ess port-networks                                               Cluster ID 20
                            ESS PORT NETWORK INFORMATION

                          Port  IPSI    Pri/   Pri/      Cntl  Connected
      Com   Intf   Intf   Ntwk  Gtway   Sec    Sec       Clus  Clus(ter)
  PN  Num   Loc    Type   Ste   Loc     Loc    State     ID    IDs
  1   1     1A01   IPSI   up    1A01    1A01   actv-aa   20    20 10 30
  2   1     2A01   IPSI   up    2A01    2A01   actv-aa   20    20 10 30
  3   1     3A01   IPSI   up    3A01    3A01   actv-aa   20    20 10 30
  5   1     5A01   IPSI   up    5A01    5A01   actv-aa   20    20 10 30
  6   1     6A01   IPSI   up    6A01    6A01   actv-aa   20    20 10 30
  7   1     7A01   IPSI   up    7A01    7A01   actv-aa   20    20 10 30

Command successfully completed
Command:
Screen Shot 6 – Viewpoint from ESS #20 after Failover due to Main Cluster Failure
status ess port-networks                                               Cluster ID 30
                            ESS PORT NETWORK INFORMATION

                          Port  IPSI    Pri/   Pri/      Cntl  Connected
      Com   Intf   Intf   Ntwk  Gtway   Sec    Sec       Clus  Clus(ter)
  PN  Num   Loc    Type   Ste   Loc     Loc    State     ID    IDs
  1   1                   down                 active    20    20 10 30
  2   1                   down                 active    20    20 10 30
  3   1                   down                 active    20    20 10 30
  5   1                   down                 active    20    20 10 30
  6   1                   down                 active    20    20 10 30
  7   1                   down                 active    20    20 10 30

Command successfully completed
Command:
Screen Shot 7 - Viewpoint from ESS #30 after Failover due to Main Cluster Failure
Screen Shots #6 and #7 show the current control state of the system. Since the "status ess port-networks"
command queries the IPSIs in real time for their priority lists, both ESS clusters should report the same
information. They both show that ESS #20 is the controlling cluster for every IPSI and they both show the
same priority lists for each IPSI. The viewpoint of ESS #20, since it is the controlling cluster of every port
network, is that all port networks are up, with their EALs described appropriately. ESS #30's viewpoint is
different: since it does not have any control links established, it reports that, from its perspective, all the
port networks are out-of-service. Figure 57 shows the control links for every IPSI connected port network
after the system recovers from the main cluster outage.
[Figure diagram: with the main cluster failed, ESS #20 holds the IPSI control links for EPN #1-#3 and #5-#7 across the IP WAN; EPN #4, which has no IPSI, is out-of-service.]
Figure 57 – Communication System In-service after Main Cluster Failure
9.3 Demo – Extended Network Fragmentation
The next phase of the demo is the introduction of a severe network fragmentation fault on the WAN link
between sites #2 and #3 on top of the already failed main cluster. Once the network fragmentation occurs,
as shown in Figure 58, the IPSI in port network #7 detects the connectivity loss to its controlling cluster
and starts its no service timer. The IPSIs on the other side of the fragmentation detect the connectivity loss
to ESS #30 and make the appropriate changes to their priority lists, but service to them is uninterrupted.
Unfortunately, unlike the previous situations, the network fragmentation prevents all ESS servers from
being able to contact every IPSI and therefore they are unable to provide the administrator with a complete
system view.
Executing “status ess port-networks” from ESS #20 provides a system view of the left side of the
fragmentation. As Screen Shot #8 shows, port networks #1, #2, #3, #5, and #6 are in service and being
controlled by ESS #20. In addition, the all of these port network’s IPSIs have removed ESS #30 as an
alternate source of control because of the lack of network connectivity. And finally, ESS #20 does not
report any status about PN #7 because it cannot communicate with it.
status ess port-networks                                               Cluster ID 20
                            ESS PORT NETWORK INFORMATION

                          Port  IPSI    Pri/   Pri/      Cntl  Connected
      Com   Intf   Intf   Ntwk  Gtway   Sec    Sec       Clus  Clus(ter)
  PN  Num   Loc    Type   Ste   Loc     Loc    State     ID    IDs
  1   1     1A01   IPSI   up    1A01    1A01   actv-aa   20    20 10
  2   1     2A01   IPSI   up    2A01    2A01   actv-aa   20    20 10
  3   1     3A01   IPSI   up    3A01    3A01   actv-aa   20    20 10
  5   1     5A01   IPSI   up    5A01    5A01   actv-aa   20    20 10
  6   1     6A01   IPSI   up    6A01    6A01   actv-aa   20    20 10
  7   1                   down                 active

Command successfully completed
Command:
Screen Shot 8 – Viewpoint from ESS #20 Immediately after Network Fragmentation
On the other side of the fragmentation, executing the status command from ESS #30 gives a much different
view of system control. Immediately after the network failure occurred, ESS #30 lost communication with
every IPSI except the one residing in port network #7; hence, ESS #30 does not report any control
information or current priority lists for those IPSIs. Additionally, port network #7's IPSI detects the
connectivity failure to its controlling cluster (ESS #20), begins recovery actions by removing the
unreachable clusters from its priority list, and starts its no service timer. The screen shot below is a
snapshot of ESS #30's viewpoint during the no service time interval.
status ess port-networks                                               Cluster ID 30
                            ESS PORT NETWORK INFORMATION

                          Port  IPSI    Pri/   Pri/      Cntl  Connected
      Com   Intf   Intf   Ntwk  Gtway   Sec    Sec       Clus  Clus(ter)
  PN  Num   Loc    Type   Ste   Loc     Loc    State     ID    IDs
  1   1                   down                 active
  2   1                   down                 active
  3   1                   down                 active
  5   1                   down                 active
  6   1                   down                 active
  7   1                   down                 active    *     30

Command successfully completed
Command:
Screen Shot 9 - Viewpoint from ESS #30 Immediately after Network Fragmentation
Once the no service timer expires, PN #7’s IPSI requests service from its best option which happens to be
the only ESS cluster it can communicate with (ESS #30). ESS #30, after receiving the service request,
takes over PN #7 by establishing a control link and then brings it into service by cold restarting it. Figure
58 below shows the new control link from ESS #30 to PN #7 along with the unaffected control links from
ESS #20 to the other port networks.
[Figure diagram: after the WAN fragmentation between sites #2 and #3, ESS #20 continues to control EPN #1-#3, #5, and #6, ESS #30 takes control of EPN #7 on the isolated side, and EPN #4 remains out-of-service.]
Figure 58 – Communication System In-service after Network Fragmentation
Executing the status command from ESS #20 does not reveal any new information beyond what it showed
before ESS #30 took over PN #7. Since the network is fragmented, no pathway exists for ESS #20 to get
status from PN #7's IPSI and report it to the administrator. On the other hand, "status ess port-networks"
executed from ESS #30 shows that it is controlling port network #7 and that port network #7 is in-service.
status ess port-networks                                               Cluster ID 30
                            ESS PORT NETWORK INFORMATION

                          Port  IPSI    Pri/   Pri/      Cntl  Connected
      Com   Intf   Intf   Ntwk  Gtway   Sec    Sec       Clus  Clus(ter)
  PN  Num   Loc    Type   Ste   Loc     Loc    State     ID    IDs
  1   1                   down                 active
  2   1                   down                 active
  3   1                   down                 active
  5   1                   down                 active
  6   1                   down                 active
  7   1     7A01   IPSI   up    7A01    7A01   actv-aa   30    30

Command successfully completed
Command:
Screen Shot 10 - Viewpoint from ESS #30 after Failover due to Network Fragmentation
9.4 Demo – Network Fragmentation Repaired
At this point, the strategic placement of ESS servers within the system has allowed the system to continue
to provide service to all its end-users through both a catastrophic main server failure and an extended
network fragmentation. The rest of the demonstration concentrates on restoring the fragmented system
back into one autonomous system controlled by the main servers. While the network is fragmented, all the
servers attempt to bring up TCP sockets to all of the IPSIs with which they are not communicating, but the
attempts are always unsuccessful since network connectivity is faulted. However, once the network is
repaired, all ESS servers are able to communicate with all IPSIs and all of the socket establishment
attempts succeed. This has two major effects on the system, neither of which adversely affects any
end-users: every ESS cluster is once again able to provide a complete system view, and all of the IPSIs are
able to include all of the ESS clusters in their priority failover lists. As Figure 59 shows, ESS #30 continues
to provide service to PN #7 even though ESS #10 and ESS #20 have informed the IPSI that they are
available. The results of the status commands from both ESS #20 and ESS #30 in Screen Shots #11 and
#12 reflect this situation.
[Figure diagram: the WAN fragmentation is repaired; ESS #20 still controls EPN #1-#3, #5, and #6, ESS #30 still controls EPN #7, and EPN #4 remains out-of-service.]
Figure 59 – Network Fragmentation Fixed
status ess port-networks                                               Cluster ID 20
                            ESS PORT NETWORK INFORMATION

                          Port  IPSI    Pri/   Pri/      Cntl  Connected
      Com   Intf   Intf   Ntwk  Gtway   Sec    Sec       Clus  Clus(ter)
  PN  Num   Loc    Type   Ste   Loc     Loc    State     ID    IDs
  1   1     1A01   IPSI   up    1A01    1A01   actv-aa   20    20 10 30
  2   1     2A01   IPSI   up    2A01    2A01   actv-aa   20    20 10 30
  3   1     3A01   IPSI   up    3A01    3A01   actv-aa   20    20 10 30
  5   1     5A01   IPSI   up    5A01    5A01   actv-aa   20    20 10 30
  6   1     6A01   IPSI   up    6A01    6A01   actv-aa   20    20 10 30
  7   1                   down                 active    30    20 10 30

Command successfully completed
Command:
Screen Shot 11 - Viewpoint from ESS #20 after Network is Repaired
status ess port-networks                                               Cluster ID 30
                            ESS PORT NETWORK INFORMATION

                          Port  IPSI    Pri/   Pri/      Cntl  Connected
      Com   Intf   Intf   Ntwk  Gtway   Sec    Sec       Clus  Clus(ter)
  PN  Num   Loc    Type   Ste   Loc     Loc    State     ID    IDs
  1   1                   down                 active    20    20 10 30
  2   1                   down                 active    20    20 10 30
  3   1                   down                 active    20    20 10 30
  5   1                   down                 active    20    20 10 30
  6   1                   down                 active    20    20 10 30
  7   1     7A01   IPSI   up    7A01    7A01   actv-aa   30    20 10 30

Command successfully completed
Command:
Screen Shot 12 – Viewpoint from ESS #30 after Network is Repaired
9.5 Demo – Main Cluster is Restored
At this point a number of things can be done. The administrator can leave the system running as is,
suffering only from the effects of operating in a fragmented mode (see the How are Call Flows Altered
when the System is Fragmented section). Another approach the administrator could take is to shift port
network #7 under the control of ESS #20 and then operate as one homogeneous switch, with every IPSI
connected port network receiving its control from one source; however, the non-IPSI connected port
network (PN #4) would still be out of service. This approach can and should be used if the main servers are
not going to be operational for an extremely long period of time. This demo takes another approach to
returning the system to normal operation – fix the main cluster and, after it becomes stable again, transition
all of the port networks in a controlled manner back under its control. The transition of the port networks
back to the main servers can be achieved by forcing one port network at a time or all of them at once.
Once the main cluster is operational, it contacts all of the IPSIs, informs them that it is available to provide
service, and gathers the IPSIs' current state information. Upon the main servers' reconnection, the IPSIs
insert the main cluster back into their priority lists, but continue to receive service from their current master
cluster. The following screen shot shows that the main server knows the control status of the IPSI for each
port network and is an alternative for each one, but is not in control of any of them.
status ess port-networks                                                Cluster ID 1
                            ESS PORT NETWORK INFORMATION

                          Port  IPSI    Pri/   Pri/      Cntl  Connected
      Com   Intf   Intf   Ntwk  Gtway   Sec    Sec       Clus  Clus(ter)
  PN  Num   Loc    Type   Ste   Loc     Loc    State     ID    IDs
  1   1                   down                 active    20    1 20 10 30
  2   1                   down                 active    20    1 20 10 30
  3   1                   down                 active    20    1 20 10 30
  4   1                   down
  5   1                   down                 active    20    1 20 10 30
  6   1                   down                 active    20    1 20 10 30
  7   1                   down                 active    30    1 20 10 30

Command successfully completed
Command:
Screen Shot 13 – Viewpoint from Main Cluster after the Main Cluster has been Restored
The next screen shot shows ESS #20's viewpoint of the system after the main cluster becomes operational
once again. ESS #20 is still in control of port networks #1, #2, #3, #5, and #6 even though it is no longer
the best alternative according to the dynamically adjusted priority lists. The viewpoint from ESS #30 (not
shown) is in complete agreement with the main cluster's and ESS #20's viewpoints, and ESS #30 remains in
control of port network #7.
status ess port-networks                                               Cluster ID 20
                            ESS PORT NETWORK INFORMATION

                          Port  IPSI    Pri/   Pri/      Cntl  Connected
      Com   Intf   Intf   Ntwk  Gtway   Sec    Sec       Clus  Clus(ter)
  PN  Num   Loc    Type   Ste   Loc     Loc    State     ID    IDs
  1   1     1A01   IPSI   up    1A01    1A01   actv-aa   20    1 20 10 30
  2   1     2A01   IPSI   up    2A01    2A01   actv-aa   20    1 20 10 30
  3   1     3A01   IPSI   up    3A01    3A01   actv-aa   20    1 20 10 30
  5   1     5A01   IPSI   up    5A01    5A01   actv-aa   20    1 20 10 30
  6   1     6A01   IPSI   up    6A01    6A01   actv-aa   20    1 20 10 30
  7   1                   down                 active    30    1 20 10 30

Command successfully completed
Command:
Screen Shot 14 - Viewpoint from ESS #20 after the Main Cluster has been Restored
To force all of the IPSIs to leave their current controlling cluster and return to the main server, a call
disruptive command, "get forced-takeover ipserver-interface all", has been introduced. This command
causes the SPE on which it is executed to send special messages to the IPSIs, instructing them to shut down
their current control links and allow this new SPE to become their master. After this command is executed,
the viewpoints of all the servers in the system return to their original non-faulted viewpoints, as shown
earlier in Screen Shots #3 and #4, with control links as shown below.
[Figure diagram: the restored main cluster once again holds all IPSI control links across the three sites; EPN #1-#3 and #5-#7 are controlled directly and EPN #4 again receives its control indirectly over the CSS PNC.]
Figure 60 – Control Return to Restored Main Cluster
Section 10 - More Thoughts and Frequently Asked Questions
10.1 How is ESS Feature Enabled ?
The license file is extremely important to the ESS feature. In addition to providing a server with a unique
identifier (module ID, otherwise known as cluster ID) which is required for proper registration, the license
file also includes the necessary values that tell a media server what type it is – either the main server or an
ESS server (which fundamentally changes its behavior). To this end, there are two primary customer
options to check before loading a license file onto a server. FEAT_ESS is the customer option that
determines whether or not the ESS feature itself is enabled. Without this option, the main server rejects all
ESS cluster registrations and therefore ESS clusters cannot be deployed into the system. The other ESS
relevant customer option is FEAT_ESS_SRV. This option informs a given server whether it is an ESS
server or one of the main servers. This particular value should be set to "no" for the main server and "yes"
for every ESS server.
These values in the license file can be checked by issuing the following commands from the CLI interface:
statuslicense -v -f FEAT_ESS
statuslicense -v -f FEAT_ESS_SRV
The matrix below summarizes the customer option settings that are required.
  Customer Option    Main Server    ESS Server
  FEAT_ESS_SRV       No             Yes
  FEAT_ESS           Yes            Yes
Table 30. License File Settings
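Table 30 can be encoded as a small validation helper; the Python below is a hypothetical sketch, not part of any Avaya tooling, that simply checks the two option values for a given server role.

    def license_options_ok(server_role, feat_ess, feat_ess_srv):
        """server_role is 'main' or 'ess'; option values are 'yes' or 'no'."""
        if feat_ess != "yes":
            return False                          # ESS feature not enabled at all
        if server_role == "main":
            return feat_ess_srv == "no"
        if server_role == "ess":
            return feat_ess_srv == "yes"
        raise ValueError("unknown server role: " + server_role)

    print(license_options_ok("main", "yes", "no"))   # True
    print(license_options_ok("ess", "yes", "no"))    # False - license built for a main server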
License files can be loaded onto a media server at any time. For a number of features, the SPE can activate
or deactivate a feature based on the newly uploaded license file without performing a reboot. For example,
if a license file is loaded which increases the number of IP endpoints allowed in the system, the SPE will
allow more IP phones to register immediately. However, the ESS customer option, FEAT_ESS, does not
work that way. If the current license file loaded onto a machine has FEAT_ESS set to “no” and the new
license file has FEAT_ESS set to “yes”, the main SPE needs to be rebooted in order to turn on the ESS
feature after the new license file is loaded. To this end, it is wise to have FEAT_ESS enabled in the initial
installation or upgrade even if no ESS servers are currently being deployed. Doing this prevents a system
reboot in the future when, and if, the ESS feature is set up in the system.
10.2 What are the Minimum EI Vintages Required for the ESS Feature ?
As mentioned previously, a port network is controlled by an enhanced angel, or arch-angel. The arch-angel
is the master of the port network's TDM bus, scanning the other angels in the carrier for information and
feeding that information back to the SPE for processing. If a port network does not have an active
arch-angel, then that PN is completely out of service because no information is being exchanged between
the SPE and the angels residing on the port network. Various recovery methods have been discussed in the
single cluster reliability section which attempt to activate one of the enhanced angels residing on the port
network by bringing up an EAL link. The EAL link can terminate at up to four different locations in a port
network (a-side IPSI, b-side IPSI, a-side EI, or b-side EI). Before the introduction of ESS, there was only
one SPE in the system that could instantiate EAL links, which ensured that only one EAL would be brought
up to a port network at a time. If a software fault or race condition causes two active EAL links to be
brought up to the same port network concurrently, then two arch-angels will be activated. This leads to a
deadlocked PN and is referred to as dueling arch-angels.
Due to the possibility, albeit a very low probability, of software faults and race conditions, hardware level
protection was introduced to avoid this scenario of dueling arch-angels. This hardware level protection is
justified because nothing can be done remotely from the servers to bring the PN back online. The only way
to recover a port network that is suffering from this dueling arch-angel condition is to physically remove
one of the boards supporting one of the activated arch-angels. The hardware level protection took the form
of an arch-angel token. The concept is that an entity can become the arch-angel if, and only if, it possesses
the arch-angel token, and only one entity can hold the token at a time. This arch-angel token capability does
not exist in all vintages of the EI boards, as shown in the table below.
  EI Board    Type of PNC    Vintage    Arch-angel Token Aware
  TN570       CSS            A or B     No
  TN570       CSS            C or D     Yes
  TN2305      ATM            A          No
  TN2305      ATM            B          Yes
  TN2306      ATM            A          No
  TN2306      ATM            B          Yes

Table 31. Arch-Angel Token Support
While dueling arch-angels in a single cluster environment is very rare (which is why arch-angel token
support on all boards is only suggested rather than required there), the conditions that lead to it are almost
guaranteed to arise in ESS deployments. The following figure shows port network #2 failed over to an ESS
server. As mentioned previously, if an IPSI loses its connection to its controlling cluster, it runs a no
service timer before requesting service from an alternate source. During this time, the previously
controlling cluster attempts traditional recovery methods. If, for whatever reason, the fall-back recovery
fails, the IPSI requests service from the alternate ESS server. At this point everything is stable, with PN #1
under the control of the main server and PN #2 under the control of the ESS server. However, as far as the
main server is concerned, PN #2 is out-of-service, and it therefore continuously attempts to restore service
to it by re-connecting to the IPSI or establishing an EAL to the EI board (fall-back). If the connection to the
IPSI finally succeeds, the IPSI allows the main server back into its priority list, but does not instantly switch
control back (see the How Does ESS Work section). On the other hand, if the failure which was preventing
the tunneled EAL from being established is rectified, the main server attempts to activate the arch-angel on
the EI board. Since there is no real-time communication between the main server and the ESS servers, both
servers continually attempt to control the port network, as shown below. This classic case of dueling
arch-angels is avoided by having the IPSI and EI boards token aware: the main SPE can never succeed in
bringing up an EAL to the EI board while the ESS server has an established EAL to the IPSI.
[Figure omitted: diagram of the main and ESS servers, the LAN, the IPSIs, the EI boards, and the EALs into EPN #1 and EPN #2. Annotations: "Fragmented PNC prevents EAL from getting established"; "Dueling arch-angels if EI board in PN #2 is not token aware."]
Figure 61. Dueling Arch-angels in ESS Environment
In summary, arch-angel token aware boards are required in any port network that contains both EI and IPSI boards. There are three different configurations for a port network – no EI boards (a PN using an IP PNC), only EI boards (a PN using a CSS or ATM PNC with no controlling IPSIs), and both EI boards and IPSIs (a PN using a CSS or ATM PNC with controlling IPSIs). Only the final configuration requires the EI boards to be token aware.
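As a rough summary of these three configurations, the short sketch below (illustrative only) encodes the rule that token-aware EI boards are needed only when a port network contains both EI boards and controlling IPSIs:

# Illustrative encoding of the rule from the three port network configurations above.

def ei_token_awareness_required(has_ei_boards: bool, has_controlling_ipsis: bool) -> bool:
    """Token-aware EI boards are required only when both board types share the PN."""
    return has_ei_boards and has_controlling_ipsis

print(ei_token_awareness_required(False, True))   # IP PNC: no EI boards               -> False
print(ei_token_awareness_required(True,  False))  # CSS/ATM PNC, EI control only       -> False
print(ei_token_awareness_required(True,  True))   # CSS/ATM PNC with controlling IPSIs -> True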
10.3 What IPSI Vintages Support the ESS Feature ?
IPSI boards were the main new component of the PBX system, allowing the SPE to control port networks via IP as opposed to only through a private CSS or ATM PNC. The original release of the IPSI, referred to as IPSI-1 or IPSI-AP, was able to reside in the tone clock slots supported by either traditional MCC/SCC cabinets or the G600 cabinetry. From the initial release of ACM up to the release of the ESS offering, changes were made to the IPSI’s firmware and hardware for both new feature support and bug fixes. In order for the IPSI to also reside in the new G650 cabinet, it needed to supply maintenance board functionality and interface with the new carrier’s backplane. This newer IPSI is referred to as the IPSI-2 or IPSI-BP. The functionality needed for the ESS feature was first supported in firmware release 20 (anything previous to this release is called pre-ESS firmware) and could run on either IPSI-AP or IPSI-BP boards. In other words, the ESS feature does not require a particular IPSI hardware version (IPSI-AP or IPSI-BP); only the carrier type dictates which hardware is needed. The table below shows which IPSI firmware versions are compatible with which software versions.
Software                        Pre-ESS IPSI Firmware    ESS IPSI Firmware
Pre-ESS Software                Supported                Supported
ESS Software – Main Server      Supported                Supported
ESS Software – ESS Servers      NOT SUPPORTED            Supported
Table 32. IPSI Firmware Compatibility Table
In short, the new ESS firmware is backwards compatible with pre-ESS release software and fully supported by all server types. The new software running on main servers is likewise backwards compatible with pre-ESS firmware, but ESS servers cannot interface with pre-ESS firmware.
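The compatibility rules in Table 32 reduce to a single restriction. The sketch below (illustrative only; the role and firmware labels are invented shorthand) encodes it: every pairing is supported except an ESS server with pre-ESS IPSI firmware.

# Illustrative encoding of Table 32 -- not product code; labels are shorthand.

def combination_supported(server_role: str, ipsi_firmware: str) -> bool:
    """server_role: 'pre-ess', 'main', or 'ess'; ipsi_firmware: 'pre-ess' or 'ess'."""
    # The only unsupported pairing is an ESS server with pre-ESS IPSI firmware.
    return not (server_role == "ess" and ipsi_firmware == "pre-ess")

for role in ("pre-ess", "main", "ess"):
    for fw in ("pre-ess", "ess"):
        print(role, fw, combination_supported(role, fw))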
10.4 What is the Effect of Non-Controlling IPSIs on ESS ?
Throughout this document, IPSIs have been discussed primarily as an IP control interface into the port networks for the SPE. In addition, the IPSI was also described as a conglomeration of different components – a tone clock, a PKTINT, an arch-angel, and (for IPSI-BP boards only) a cabinet maintenance board. On a typical
upgrade the IPSI boards replace the tone clock board and replicate its functionality. In fact, if an IPSI was
inserted into a port network supported by a G3r SPE in place of a tone clock, it would report itself as a tone
clock and the G3r SPE would be none the wiser.
When the G650 cabinet was released, it needed to have a maintenance board and a tone clock, but only one slot was available for them. Either a new board could have been released to supply these functions, or the IPSI could be used instead. For an IP connected PN, the decision was simple because an IPSI was needed for control anyway. However, if the PN is interconnected via a CSS or ATM PNC, an IPSI is not necessarily required for control (control can be tunneled through the EI boards). Therefore the concept of a
non-controlling IPSI was created which essentially says, “The port network needs a tone clock and a
maintenance board which the IPSI can provide, but do not control the PN through it”. The reason a system
might be configured like this is to avoid having to run an IP network from the SPE servers to the PN itself.
However, the ramification of this is that the system treats non-controlling IPSIs purely as tone clock boards, which prevents any server (main or ESS) from connecting to them for control. As discussed previously, the IPSI builds a failover priority list based on the ESS servers that connect to it. Since a non-controlling IPSI never receives any connection, it has no ability to fail over to ESS servers. Note that an IPSI cannot be administered as non-controlling with respect to the main server but controlling with respect to an ESS server, because the main and ESS servers run identical translation sets.
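The failover ramification can be pictured with the following sketch (purely illustrative; the class and field names are invented): because no server ever connects to a non-controlling IPSI, its failover priority list stays empty and it can never request service from an ESS cluster.

# Illustrative sketch of why non-controlling IPSIs cannot fail over to ESS.

class Ipsi:
    def __init__(self, controlling: bool):
        self.controlling = controlling
        self.priority_list = []  # ESS clusters ranked by priority score

    def accept_connection(self, cluster: str) -> None:
        # Non-controlling IPSIs are treated as plain tone clocks: no server
        # (main or ESS) ever connects to them for control.
        if self.controlling:
            self.priority_list.append(cluster)

    def can_fail_over(self) -> bool:
        return bool(self.priority_list)

ipsi = Ipsi(controlling=False)
ipsi.accept_connection("ESS cluster A")
print(ipsi.can_fail_over())  # False -- the PN stays dependent on its CSS/ATM control path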
Section 11 - Index
Section 1 - Introduction
1.1 What is ESS ?
1.2 How is this paper organized ?
Section 2 - Background
2.1 What is an Avaya Media Server ?
2.2 What are Port Networks ?
2.3 How are Port Networks Interconnected ?
2.4 How are Port Networks Controlled ?
2.5 What are SPE Restarts ?
2.6 What are EPN restarts ?
Section 3 - Reliability – Single Cluster Environments
3.1 What is a Cluster ?
3.2 How do single cluster environments achieve high availability ?
3.3 What is a Server Interchange ?
3.4 What is Server Arbitration ?
3.5 What is an IPSI Interchange ?
3.6 What is EI Control Fallback ?
Section 4 - Reliability – Multiple Cluster Environments
4.1 How long does it take to be operational ?
4.2 What are Survivable Remote Processors (SRP) ?
4.3 What are ATM WAN Spare Processors (ATM WSP) ?
4.4 What are Manual Backup Servers (MBS) ?
Section 5 - ESS Overview
5.1 What are Enterprise Survivable Servers (ESS) ?
5.2 What is ESS designed to protect against ?
5.3 What is ESS NOT designed to protect against ?
Section 6 - How does ESS Work ?
6.1 What is ESS Registration ?
6.2 How are ESS Translations Updated ?
6.3 How do Media Servers and IPSIs Communicate ?
6.4 What is a Priority Score ?
6.5 How do IPSIs Manage Priority Lists ?
6.6 How are Communication Faults Detected ?
6.7 Under What Conditions do IPSIs Request Service from ESS Clusters ?
6.8 How Does Overriding IPSI Service Requests Work ?
Section 7 - ESS in Control
7.1 What Happens to Non-IP Phone Calls During Failovers ?
7.2 What Happens to IP Phone Calls During Failovers ?
7.3 What Happens to H.248 Gateway Calls During Failovers ?
7.4 How are Call Flows altered when the System is Fragmented ?
7.5 What Happens to Centralized Resources when a System is Fragmented ?
7.6 What Happens to Voicemail Access when a System is Fragmented ?
Section 8 - ESS Variants Based on System Configuration
8.1 Basic IP PNC Configured System
8.2 Catastrophic Server Failure in an IP PNC Environment
8.3 Network Fragmentation Failure in an IP PNC Environment
8.4 Basic ATM PNC Configured System
8.5 Catastrophic Server Failure in an ATM PNC Environment
8.6 Network Fragmentation Failure in an ATM PNC Environment
8.7 Basic CSS PNC Configured System
8.8 Catastrophic Server Failure in a CSS PNC Environment
8.9 Network Fragmentation Failure in a CSS PNC Environment
8.10 Mixed PNC Configured System
Section 9 - ESS in Action - A Virtual Demo
9.1 Demo – Setup
9.2 Demo – Catastrophic Main Server Failure
9.3 Demo – Extended Network Fragmentation
9.4 Demo – Network Fragmentation Repaired
9.5 Demo – Main Cluster is Restored
Section 10 - More Thoughts and Frequently Asked Questions
10.1 How is ESS Feature Enabled ?
10.2 What are the Minimum EI Vintages Required for the ESS Feature ?
10.3 What IPSI Vintages Support the ESS Feature ?
10.4 What is the Effect of Non-Controlling IPSIs on ESS ?