Unity Connection 7.0

Cluster/Redundancy

TOI

EDCS – 623130

Ramesh Achuthan

Radha Radhakrishnan

© 2006 Cisco Systems, Inc. All rights reserved.


Agenda

Overview

Deployment

Cluster Behavior

Troubleshooting

Upgrading

Future Enhancements


Overview – Active/Active

User interfaces:
• Voice calls
• Web Admin/CPCA
• IMAP

Load balancing & failover:
• Involve external entities (DNS, CCM, etc.)
• PIMG for legacy integration

Roles:
• Primary & secondary
• SRM manages roles

Runs on top of the CCM Platform Cluster.

[Diagram: Voice calls arrive through CCM with automatic load balancing and failover (SCCP/SIP). Web clients (HTTP/IMAP) arrive through DNS with dynamic load balancing and failover. Each server box shows ServM, SRM, CsMgr, Singletons, DB, and other services. The two servers are linked by a heartbeat, remote writes for MbxDb, and replication of DB/files.
UC-0 Publisher: normally in the primary role (active).
UC-1..N Subscriber: normally in the secondary role (active).]


Some Terminology

CCM Platform Cluster: Publisher and Subscriber
• The Pub is the first node; this is fixed at install.
• Other nodes are Subs.

UC Roles: Primary and Secondary
• The singleton processes run only on the primary server.
– Notifier, MTA, SysAgent tasks, and more.
• UnityMbxDb writes are done only through the primary server.
• Certain master files (encryption key, certificates) are managed on the primary and replicated to the secondary.
• In normal operation, the primary will be the Pub in the cluster.


User access

• TUI/VUI will access servers transparently.
• IMAP/CPCA clients will access servers transparently.
• Admin:
– Transparent server access: administration can be done at either server.
– Voice ports etc. will need node selection.
• Serviceability:
– Trace/Alarm settings will be common to both servers.
– Service start/stop information will need node selection.
– All singleton processes run on the primary server.
• Licensing:
– Voice ports are server-specific and need a server-specific license.
– User licenses are not server-specific and can be installed on any server.
• RTMT will have to access each server explicitly.
• Log files will not be replicated.



Deployment

Load balancing & Failover


Installing UC in Cluster

1. Install the first node. Answer yes to the question “Is it the first node in cluster?”
2. Administer the first node and get it running.
3. Add the second node to the cluster:
   1. Using the Admin GUI, add the secondary node under “System Settings → Cluster”.
   2. Install the second node. Answer no to the question “Is it the first node in the cluster?”
   3. Provide the IP/hostname of the first node.

Once the second node comes up, it will be in the cluster with the first one.


CCM setup - SCCP

Dynamic load balancing and failover with Hunt Pilot, Hunt List, and Line Group (CCM 4 and above).

[Diagram: Hunt Pilot → Hunt List → Line Group → DNs (*) on UC-1 and UC-2. Distribution algorithm: circular, most-idle, etc.]
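As a rough illustration of the circular distribution mentioned on this slide, the sketch below treats a line group as an ordered list of voice-port DNs. The class name, the port labels, and the `is_available` callback are all assumptions made for the example; this is not how CCM implements hunting, only a model of the behavior (rotate through members, skip the ones that cannot take the call).

```python
from typing import Callable, Optional, Sequence


class CircularLineGroup:
    """Toy model of circular distribution over line group members."""

    def __init__(self, members: Sequence[str]) -> None:
        self.members = list(members)
        self._next = 0  # index of the member to try first on the next call

    def route(self, is_available: Callable[[str], bool]) -> Optional[str]:
        """Return the next available member, or None if no member can take the call."""
        for offset in range(len(self.members)):
            index = (self._next + offset) % len(self.members)
            member = self.members[index]
            if is_available(member):
                # The following call starts just after the member chosen now.
                self._next = (index + 1) % len(self.members)
                return member
        return None


# Hypothetical port labels: one voice port on each Unity Connection server.
group = CircularLineGroup(["UC-1/port-1", "UC-2/port-1"])
first = group.route(lambda member: True)
second = group.route(lambda member: True)
```

With both servers up, calls alternate between them (load balancing). When one server's ports stop answering, the same loop skips them, which is the failover case.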


CCM setup - SIP

Approaches:

1. With DNS-SRV
   • Route-Pattern → Sip-Trunk → DNS-SRV FQDN
2. With Route-List (simpler)
   • Route-Pattern → Route-List → Route-Group → Sip-Trunk
   • Uses the distribution algorithm in the Route-Group
3. With SIP gateway DNS-SRV
   • Route-Pattern → Sip-Trunk → Sip-GW → DNS-SRV FQDN
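To make the DNS-SRV approaches more concrete, here is a minimal sketch of how a client might order the targets behind an SRV name: lower priority values are tried first, and weight spreads load among records of equal priority. The record type, the function name, and the host names are hypothetical, and the weighting is a simplification of the standard SRV selection rules rather than CCM's actual behavior.

```python
import random
from typing import List, NamedTuple, Optional


class SrvRecord(NamedTuple):
    priority: int  # lower value is tried first
    weight: int    # relative share among records of equal priority
    target: str    # host name of a server


def order_targets(records: List[SrvRecord],
                  rng: Optional[random.Random] = None) -> List[str]:
    """Return targets in the order a client would try them."""
    rng = rng or random.Random()
    ordered: List[str] = []
    for priority in sorted({r.priority for r in records}):
        pool = [r for r in records if r.priority == priority]
        while pool:
            # Weighted pick without replacement. The +1 is a simplification
            # that keeps zero-weight records selectable.
            weights = [r.weight + 1 for r in pool]
            chosen = rng.choices(pool, weights=weights, k=1)[0]
            ordered.append(chosen.target)
            pool.remove(chosen)
    return ordered


# Hypothetical records: different priorities give failover only.
failover_only = [
    SrvRecord(10, 50, "uc-a.example.com"),
    SrvRecord(20, 50, "uc-b.example.com"),
]
attempt_order = order_targets(failover_only)
```

Giving both records the same priority would instead spread calls across the two servers while still falling through to the other one on failure.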


Other Integrations

PIMG
• PIMG pings a primary UC and can redirect calls to the secondary when the primary fails.
• Load balancing is done at the PBX.
• PIMG failures are handled at the PBX.

[Diagram: a PBX and two PIMG units, with a primary link to UC-A and a secondary link to UC-B.]


IMAP & CPCA Clients

Load balancing and failover are transparent to users.

• DNS – name lookup
• Add A-records in DNS
• Users need to log in again after failover.
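The sketch below shows, under stated assumptions, how a client could use several A-records behind one name: resolve every address, then use the first one that answers. The function names and the addresses in the usage lines are hypothetical; real IMAP and web clients have their own retry logic, so this only illustrates the idea.

```python
import socket
from typing import Callable, List, Optional


def resolve_all(hostname: str, port: int) -> List[str]:
    """Return every IPv4 address (A-record) the resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, port, socket.AF_INET, socket.SOCK_STREAM)
    addresses: List[str] = []
    for _family, _type, _proto, _canonname, sockaddr in infos:
        if sockaddr[0] not in addresses:
            addresses.append(sockaddr[0])
    return addresses


def tcp_probe(address: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to address:port can be opened."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        return False


def pick_server(addresses: List[str], probe: Callable[[str], bool]) -> Optional[str]:
    """Return the first address that passes the probe, or None if none do."""
    for address in addresses:
        if probe(address):
            return address
    return None


# Usage with made-up documentation addresses and a fake probe in which
# the first server is down.
chosen = pick_server(["192.0.2.10", "192.0.2.11"], lambda a: a != "192.0.2.10")
```

In a live setup the two pieces would be combined as `pick_server(resolve_all(name, 143), lambda a: tcp_probe(a, 143))`, where `name` is whatever DNS name the site publishes for the cluster.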



Cluster Behavior


Cluster state displayed:
• None: there is only one node in the cluster.
• Normal: there is more than one node and the cluster has not failed over.
• Failedover: the Publisher is not the primary at that time.

Expected behavior:
• Admin changes made on one server should be visible on the other within a few seconds.
• Messages left on one server should be available from the other server within a few seconds.
• TUI/VUI, CPCA, IMAP, and Admin users shall not notice any login issues when one of the servers is down.
• MWI and other notifications shall continue to work when one of the servers is down.
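The three displayed states can be summarized in a few lines. This is only a restatement of the definitions on this slide; the function and parameter names are made up for the example.

```python
def cluster_state(node_count: int, publisher_is_primary: bool) -> str:
    """Map the cluster situation to the state name shown to the administrator."""
    if node_count < 1:
        raise ValueError("a cluster has at least one node")
    if node_count == 1:
        return "None"          # single node, nothing to fail over to
    if publisher_is_primary:
        return "Normal"        # more than one node, not failed over
    return "Failedover"        # the Publisher is currently not the primary


state_after_failover = cluster_state(node_count=2, publisher_is_primary=False)
```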


Manual failover

[Diagram: the admin client issues make_primary( B ). Node A (primary) fails over to Node B (secondary). Singletons will be started on Node B.]


Cluster management


Manual failback

[Diagram: the admin client issues make_primary( A ). Node B (acting primary) fails back to Node A (acting secondary). Singletons will be started on Node A.]


Manual Deactivate

• Deactivating a server stops all critical services and base services on it.
• Database replication continues in the deactivated state.
• Only secondary servers can be deactivated.
• The Administration and Serviceability GUIs remain available in the deactivated state.
• This state is used for maintenance, during which all calls and web user interactions are directed to the other server.
• A deactivated server can be activated to return it to service.


Manual activate


Auto failover

[Diagram: ServM detects a critical service failure on Node A (primary), which triggers failover to Node B (secondary).]


Acting-Primary failure

[Diagram: Node B (acting primary) crashes. Node A (acting secondary) tries to be primary.]


CPCA servlet failure and redirection

[Diagram: a web client contacts CPCA on Node A; Node A redirects the client to CPCA on Node B.]


Tomcat failure and DNS resolution

[Diagram: a web client performs a DNS name lookup; name resolution points it at Tomcat on Node A or Tomcat on Node B.]


Reasons for failover

• Failover can be caused by a 30-second heartbeat failure.
• Failover can also be initiated manually.
• Conditions for auto-failover:
  • A critical process cannot be started or fails
    • SRM, ServM, DB, DbEventPublisher, CuCsMgr, CuMixer, Notifier, etc.
  • Too many restarts in some interval
    • CuCsMgr: a single restart is allowed, but something like 3 deaths in 5 or 10 minutes exceeds the threshold
• Non-critical processes will not cause failover. ServM will restart them on the same box.
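The "too many restarts in some interval" condition can be pictured as a sliding-window counter. The sketch below is an assumption-laden model, not ServM's actual logic: the class name is invented, and the defaults (3 deaths within 5 minutes) simply pick one of the tentative figures quoted on this slide.

```python
from collections import deque
from typing import Deque


class RestartTracker:
    """Counts deaths of one critical process inside a sliding time window."""

    def __init__(self, max_deaths: int = 3, window_seconds: float = 300.0) -> None:
        self.max_deaths = max_deaths
        self.window_seconds = window_seconds
        self._deaths: Deque[float] = deque()

    def record_death(self, now: float) -> bool:
        """Record a death at time `now`; return True if failover is warranted."""
        self._deaths.append(now)
        # Forget deaths that fell out of the window.
        while self._deaths and now - self._deaths[0] > self.window_seconds:
            self._deaths.popleft()
        return len(self._deaths) >= self.max_deaths


tracker = RestartTracker()
results = [tracker.record_death(t) for t in (0.0, 100.0, 200.0)]
```

A single death leaves the counter below the threshold, so the process is simply restarted in place; only repeated deaths inside the window would escalate to failover.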


Failover

Upon failover (when the primary fails):

• Any existing calls or IP traffic to the primary will likely be lost.
• SRM on the secondary will detect the failure and update the status in the DB.
• SRM on the secondary will instruct ServM to start the singleton processes.
• The switch/PIMG/DNS will determine the failover condition and route incoming call traffic to the secondary box.
• If using DNS, CPCA/IMAP traffic will be sent to the secondary.
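The secondary's side of this sequence can be sketched as a heartbeat timer plus two actions. Everything named here is hypothetical: the class, the function, and the two callbacks stand in for the DB status update and the request to ServM, and the sketch assumes that "30 sec heartbeat failure" means no heartbeat was received for 30 seconds.

```python
from typing import Callable


class HeartbeatMonitor:
    """Tracks when the peer server was last heard from."""

    def __init__(self, timeout_seconds: float = 30.0, now: float = 0.0) -> None:
        self.timeout_seconds = timeout_seconds
        self.last_seen = now

    def beat(self, now: float) -> None:
        """Record a heartbeat received from the peer at time `now`."""
        self.last_seen = now

    def peer_failed(self, now: float) -> bool:
        return now - self.last_seen >= self.timeout_seconds


def secondary_tick(
    monitor: HeartbeatMonitor,
    now: float,
    update_status: Callable[[str], None],
    start_singletons: Callable[[], None],
) -> str:
    """Run one check on the secondary and return the role it holds afterwards."""
    if not monitor.peer_failed(now):
        return "secondary"
    update_status("primary failed")  # stand-in for the status update in the DB
    start_singletons()               # stand-in for instructing ServM
    return "acting-primary"


events = []
monitor = HeartbeatMonitor(timeout_seconds=30.0, now=0.0)
monitor.beat(10.0)
role = secondary_tick(monitor, 45.0, events.append, lambda: events.append("singletons started"))
```

Call routing is deliberately absent from the sketch: as the slide notes, the switch, PIMG, or DNS decides independently where new traffic goes.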


Two Generals’ Problem

(split-brain)

Cause:
• Unreliable communication link between primary and secondary
• Byzantine failure of SRM
• The secondary thinks the primary is dead and assumes the “acting-primary” role, while the primary continues its operation

Issues:
• DB updates will continue on both the primary and the secondary after failover.
• Solution: Split Brain Resolution (SBR), done automatically


Troubleshooting – tip 1

CLI: show cuc cluster status – shows the current status of the cluster.

• Member ID 0 means publisher (i.e., first node).
• Exactly one server must be in the primary role.
• If both servers are primary, they are not talking to each other. Check that the server hostnames are correct and that the servers can communicate.
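The "exactly one primary" rule is easy to check once the roles have been read off the status output. The sketch below does not parse the CLI output, whose exact layout is not shown here; it assumes the roles were copied by hand into a mapping from member ID to role, and the function name and return strings are invented for the example.

```python
from typing import Dict


def check_roles(roles: Dict[int, str]) -> str:
    """Classify a cluster from a hand-built {member_id: role} mapping."""
    primaries = [member for member, role in roles.items()
                 if role.strip().lower() == "primary"]
    if len(primaries) == 1:
        return "ok"
    if not primaries:
        return "no primary"
    # More than one primary: the servers are not talking to each other.
    return "multiple primaries: check hostnames and connectivity"


healthy = check_roles({0: "Primary", 1: "Secondary"})
split = check_roles({0: "Primary", 1: "Primary"})
```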


Tip 2 – Check certain required services

• Check that these services are running on both servers:
– Server Role Manager
– Conversation Manager and Mixer
– File Sync
– DB Event Publisher
• Check that these services are running on the primary server:
– Notifier
– Message Transfer Agent
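The checklist above amounts to a set difference per server. In this sketch the service labels are the shorthand used on the slide (with Conversation Manager and Mixer listed separately), not necessarily the exact names shown in the Serviceability GUI, and the function name is made up.

```python
from typing import Iterable, Set

# Shorthand labels taken from the slide; actual service names may differ.
REQUIRED_ON_BOTH = frozenset({
    "Server Role Manager",
    "Conversation Manager",
    "Mixer",
    "File Sync",
    "DB Event Publisher",
})
REQUIRED_ON_PRIMARY = frozenset({"Notifier", "Message Transfer Agent"})


def missing_services(running: Iterable[str], is_primary: bool) -> Set[str]:
    """Return the required services that are not in the running list."""
    required = set(REQUIRED_ON_BOTH)
    if is_primary:
        required |= REQUIRED_ON_PRIMARY
    return required - set(running)


# A secondary that runs everything in the "both servers" list passes,
# even though the singletons (Notifier, MTA) are not running there.
secondary_gaps = missing_services(REQUIRED_ON_BOTH, is_primary=False)
primary_gaps = missing_services(REQUIRED_ON_BOTH, is_primary=True)
```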


Tip 3: Log files

• Check the Server Role Manager (SRM) logs for cluster issues.
– /var/opt/cisco/connection/log/diag_CuSrm_*.uc
• From RTMT, select the component “Connection Server Role Manager” to download the SRM log.
• Logs are not replicated, so check them on both servers.
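Once the SRM logs from both servers have been downloaded to a workstation, a small helper can collect them for side-by-side review. This sketch assumes each server's logs sit in their own local directory; the function name and directory names are hypothetical, and only the `diag_CuSrm_*.uc` file pattern comes from the slide.

```python
import glob
import os
from typing import Dict, List


def srm_logs(log_dir: str) -> List[str]:
    """Return the SRM diagnostic logs found in log_dir, sorted by file name."""
    pattern = os.path.join(log_dir, "diag_CuSrm_*.uc")
    return sorted(glob.glob(pattern))


def srm_logs_per_server(dirs_by_server: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each server label to its SRM logs, since logs are not replicated."""
    return {server: srm_logs(path) for server, path in dirs_by_server.items()}


# Hypothetical local download folders, one per server.
by_server = srm_logs_per_server({"node-a": "./logs/node-a", "node-b": "./logs/node-b"})
```

A directory that does not exist or holds no SRM logs simply yields an empty list, which is itself a hint that the download for that server was missed.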


Upgrading a cluster

• The upgrade process is very similar to UC 2.x.
• First upgrade the first node (primary); do not switch over.
• Then upgrade the second node (secondary); do not switch over.
• At a convenient time, switch over the first node.
• Then switch over the second node after the first node's switchover is successful.



Future Enhancements


Site Redundancy – Active/Passive

Differences from A/A:
• Deployment model
• No load balancing

[Diagram: Voice calls arrive through CCM with auto failover (SCCP) or SIP DNS SRV, with no load balancing. Web clients (HTTP/IMAP) use auto failover with DNS, with no load balancing. A call or request is delivered to the passive server only if the primary fails. Each server box shows ServM, SRM, CsMgr, Singletons, DB, and other services. The two servers are linked across a WAN by a heartbeat and replication of DB/files.
UC-A (active): primary.
UC-B (passive): secondary.]


Multi-server Cluster (N > 2)

• The current approach implies a single primary to manage singletons and UnityMbxDb updates.
• This means 1 primary + N secondaries in an N + 1 scenario.
• When failover happens, one of the N secondary servers assumes the acting-primary role based on some pre-defined criteria.


Q&A

