Core Network Troubleshooting Guidelines Rev B 211008

Core Troubleshooting Guide

Prepared: Mubasher A Malik, Mobily NSS Team
Checked: Aimal Khan
Date: 23-01-2008    Rev: A
Reference: Core Troubleshooting Guidelines
Contents

1    Scope
2    Introduction
3    Network Handling Requirements
4    Prerequisites for troubleshooting
5    Mandatory Actions for troubleshooting
6    Alarm Specifications
7    Critical Faults Handling / Troubleshooting
7.1  Call Failure or Call Drop
7.2  Congestions
7.3  Link Failure
7.4  System Restart
8    Reference Table
1 Scope
This document is an effort to simplify and standardize fault detection and troubleshooting in GSM / WCDMA Core nodes. Managed Services and Support engineers should use this document together with ALEX, PLEX documentation and functional specifications.
2 Introduction
With the introduction of the Mobile Softswitch Solution (MSS in the remainder of this document; formerly known as the Layered Architecture), fault localization has become more complex.

In vertical-architecture networks it was comparatively easy to detect a faulty node: one MSC is connected to several BSCs. In case of failures (dropped calls), a combination of CDR analysis (fault code, RBS, cause code, location, etc.), AXE trace measures (Test System) and protocol analysis (with a protocol analyzer or with the UPMTI command) could identify the faulty node.

In the MSS architecture, MGWs were added between the MSCs and the RNCs. Because a call from one MSC to one RNC can be routed over different routes through several MGWs, the call path is not easy to determine and detecting the faulty node becomes complicated. As a rule of thumb, once a faulty node has been identified, the standard node-related troubleshooting procedures should be performed. Several supporting documents and procedures have been included in this document.
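
To see why this matters in practice, the sketch below (Python, with hypothetical node names and connectivity) enumerates the candidate MSC -> MGW -> RNC paths for a single call relation; every enumerated path is a fault domain that may have to be cleared:

    # Toy model with hypothetical topology: in MSS, one MSC-RNC
    # relation can ride on several MGWs, so one failing call
    # relation maps to several candidate fault domains.
    msc_to_mgw = {"MSC1": ["MGW1", "MGW2", "MGW3"]}
    mgw_to_rnc = {"MGW1": ["RNC1"], "MGW2": ["RNC1"], "MGW3": ["RNC1"]}

    def candidate_paths(msc: str, rnc: str) -> list[tuple[str, str, str]]:
        """All MSC -> MGW -> RNC routes the call may have taken."""
        return [(msc, mgw, rnc)
                for mgw in msc_to_mgw.get(msc, [])
                if rnc in mgw_to_rnc.get(mgw, [])]

    print(candidate_paths("MSC1", "RNC1"))
    # Three candidate paths instead of one; in a vertical network the
    # same relation would yield exactly one path to check.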
3 Network Handling Requirements
The following information should be known and available, both for normal O&M and for troubleshooting problems:

- Network Diagram
- Numbering Plan (it must include DPC, GT, MTP & SCCP routing information)
- Signalling & Trunking Plan
- HLR User Profile Specification
- MGW capacity and connectivity
- IP addresses for the 3G Core nodes
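
As an illustration only (the field names and sample values are assumptions, not a standard format), this reference data can be kept machine-readable so that DPC / GT / IP lookups are quick during an incident:

    # Illustrative sketch: a machine-readable core-node inventory.
    # All field names and sample values are assumptions.
    from dataclasses import dataclass

    @dataclass
    class CoreNode:
        name: str        # e.g. "MSC1"
        node_type: str   # "MSC-S", "MGW", "HLR", ...
        dpc: str         # destination point code (from the numbering plan)
        gt: str          # global title
        ip: str          # IP address (3G core nodes)

    nodes = [CoreNode("MSC1", "MSC-S", "2-150", "96650000001", "10.0.0.1")]
    by_dpc = {n.dpc: n for n in nodes}   # fast DPC -> node lookup
    print(by_dpc["2-150"].name)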
4 Prerequisites for troubleshooting
- Good knowledge of the network (the network plan must be available).
- Good hands-on experience in O&M of the Core network.
- Close cooperation between the MSC, MGW and UTRAN (BO & CNS) groups.
- Efficient use of ALEX and PRIMUS.
5 Mandatory Actions for troubleshooting
While troubleshooting major problems, the actions below must be performed. The list has been compiled from best practices and on-ground experience; it is not limited to the scenarios already encountered.
1) For all problems / incidents, escalate and coordinate with the Front Office.
2) Understand the problem and its impact.
3) Collect relevant information and analyse it to localise the problem (Area, Node).
4) Draw the network connectivity diagram related to the reported problem.
5) Inform the concerned parties as per standard processes and procedures.
6) Check the alarm list and error logs in the connected nodes.
7) Arrange proper testing for the reported cases and more.
8) Take the traces and analyse them.
9) Check the performance reports and STS data.
10) Check for DT changes made recently in the network and take printouts of all the related parameters.
11) Coordinate with other groups, the support team and management.
12) Recommend and agree with the customer / management on the best possible solution, and always consult ALEX for User Guides and OPIs.
13) Implement the solution as agreed.
14) Repeat steps 5 to 8.
15) Update the Front Office with the status, restoration, actions taken and network stability.
16) Produce the relevant report for the concerned parties.
6 Alarm Specifications
To keep the network healthy, all supervisions must be activated and alarms should be defined with the correct severities. Below are some of the major supervisions that need to be activated:

- Processor Load Supervision & Load Regulation
- Audit Function Threshold Supervision (Memory & SAE)
- Event Reporting & Disturbance Supervision
- Signalling Supervision (C7 disturbance, Link Set, Destination, SCCP_SSN, MTP)
- Route Supervision
Once all supervisions are activated, the alarm definitions should be set properly. All critical alarms must be handled and rectified by following the OPIs (Operational Instructions) and procedures. Below are some critical alarms that need to be attended to immediately.
Processor Alarms (APZ):
CP Fault
CPT Fault
AP Fault
AP Not Available / Redundant
AP Process Stopped
Audit Function Threshold Supervision
EM Fault
RP Fault
Telephony Alarms (APT):
Size Alteration of Data Files Fault / Size Change Required
System Restart
CCITT7 Destination Inaccessible
CCITT7 Signalling Link Failure / Link Set Supervision
Common Charging Output Congestion / Error
Group Switch Fault
HLR State Fault
Media Gateway Unavailable
M3ua Destination Inaccessible
The alarms mentioned above are only a few from the full list of critical alarms; all of them have OPIs (Operational Instructions) available in ALEX for troubleshooting.
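
As an illustration, a small filter that flags lines in a saved alarm-list printout matching the critical alarms named above (the plain-text input format is an assumption; the authoritative printout layout is in ALEX):

    # Illustrative filter: flag critical alarms in a saved alarm-list
    # printout. The input format is an assumption; see ALEX for the
    # real printout layout.
    CRITICAL_ALARMS = [
        "CP FAULT", "AP FAULT", "RP FAULT", "EM FAULT", "SYSTEM RESTART",
        "CCITT7 DESTINATION INACCESSIBLE", "CCITT7 SIGNALLING LINK FAILURE",
        "MEDIA GATEWAY UNAVAILABLE", "M3UA DESTINATION INACCESSIBLE",
    ]

    def flag_critical(alarm_list_text: str) -> list[str]:
        """Return the printout lines that mention a critical alarm."""
        return [line for line in alarm_list_text.upper().splitlines()
                if any(name in line for name in CRITICAL_ALARMS)]

    sample = 'A1/APT "MSC1" 123 CCITT7 DESTINATION INACCESSIBLE'
    print(flag_critical(sample))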
7 Critical Faults Handling / Troubleshooting
In vertical-architecture networks it was much easier to detect a faulty node: one MSC is connected to several BSCs. In case of failures (dropped calls), a combination of CDR analysis (fault code, RBS, cause code, location, etc.), AXE trace measures (Test System) and protocol analysis (with a protocol analyzer or with the UPMTI command) could identify the faulty node. As standard practice, the alarm list and error log must be checked both at the time the problem is reported and once it has been solved.

Troubleshooting and fault localization have become more complicated after the introduction of the Mobile Softswitch Solution (MSS / Layered Architecture). In the MSS architecture, MGWs were added between the MSCs (which become MSC Servers) and the RNCs. Because a call from one MSC to one RNC can be routed in different ways through several MGWs, the call path is not easy to determine and detecting the faulty node becomes a serious issue. When a faulty node is identified, the standard node-related troubleshooting procedures should be performed and the problem handling record form should be filled in (attached in the Reference Table).
Emergency situation handling, and the collection of data from all core nodes (including MSS) after outages, is already described in the attached documents (bulletins); below, some critical faults and troubleshooting methods are discussed further. In the Reference Table you can find documents for handling and troubleshooting different problems, such as processor load control under high load, the health checks needed before and after any upgrade, and the emergency / maintenance bulletins with the important procedures and processes to follow during emergencies.
7.1 Call Failure or Call Drop
A call failure (MO / MT) or call drop is always treated as an emergency and needs proper diagnosis to rectify. Below are the major actions required to handle this problem:

- Collect relevant information to evaluate the network status (details about the call drops: where, when, how often, other symptoms).
- Analyse this information to localise the problem (Area, Node, Number Series, MO or MT, Prepaid or Post-paid, etc.).
- Check the alarm list and error logs in the connected nodes.
- Arrange proper testing for the reported cases and more.
- Take traces to identify the release cause.
- Check STS data.
- Check for DT changes made recently in the network.
- Escalate to management and the support team according to processes & procedures.
If the problem is area / node specific, after confirming with test calls, take traces and analyse them to identify the failure / release cause. Make a health check of the affected node, including the checks mentioned below (a scripted collection sketch follows the list).
- Status of the node
- Processor load
- Size alterations
- Disturbance analysis (the SYRIP, SYFAP, SYFDP, SYFIP, SYFSP, ALLIP, DIRCP and DIRRP printouts should be analyzed)
- Software record analysis (the LASIP, PCORP, LAEIP, PCECP and LAFBP printouts should be analyzed, in order to find out:
  • Are any useful corrections missing?
  • Are any corrections in the dump faulty or suspected faulty?
  • Is there any mismatch in the software record (e.g. part of a correction is passive)?)
- Route utilization
- Event reporting
- Signalling status (MTP & SCCP)
- STS data
- Analysis of the traces taken
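
The printout collection above can be batched; the sketch below is a minimal example in which send_mml() is a hypothetical helper standing in for whatever transport (telnet/SSH session, OSS tool) carries MML commands to the node:

    # Sketch: batch-collect the health-check printouts listed above.
    # send_mml() is a hypothetical, site-specific helper; it is NOT a
    # standard API. All commands in the list appear in this guide.
    HEALTH_CHECK_COMMANDS = [
        "ALLIP;",                  # alarm list
        "SYRIP:SURVEY;",           # software error survey
        "SYFAP:MINUTES=30;",       # active forlopps, last 30 minutes
        "DIRCP;",                  # CP event records
        "DIRRP:RP=ALL;",           # RP event records
        "PLLDP;",                  # processor load data
        "STRSP:R=ALL;",            # route status
    ]

    def send_mml(command: str) -> str:
        raise NotImplementedError("connect to the node's MML interface here")

    def collect_health_check(logfile: str = "health_check.log") -> None:
        with open(logfile, "w") as log:
            for cmd in HEALTH_CHECK_COMMANDS:
                log.write(f"=== {cmd} ===\n{send_mml(cmd)}\n")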
If the problematic node is an MGW, then the following checks should be made:

- Check the MGW selection in the MSC-S.
- Check the availability of the MGW and the resources available in it.
- Check the alarm and event logs in the MGW.
- Take GCP traces in the MGW.

The document attached in the Reference Table can help in checking the MGW and collecting data from it.

As mentioned before, a call failure or call drop should always be treated as an emergency. On-duty staff should keep a record and track of all troubleshooting done locally or remotely by the support teams, and confirm the restoration by testing as well as by statistics.
7.2 Congestions
In a live network, congestion can disturb services in many ways. Three types of congestion are discussed here: route congestion, signalling congestion and load congestion.
7.2.1 Route Congestion
Route congestion can be reported by the FO (if the alarm is activated) or by the performance team, or observed during fault localization (e.g. of call failures or high processor load). Below are the instructions to follow for handling this problem.
Check the following (a sketch for the ASR calculation follows the list):

- Alarm list and error logs
- Route utilization
- Traffic measurements on routes
- Traffic type measurements
- Service quality statistics (SQS)
- End-of-selection measurements
- Answer-to-seizure ratio (ASR)
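
ASR can be computed from route traffic counters; a minimal sketch with made-up values (take the real counters from STS):

    # Minimal sketch: answer-to-seizure ratio (ASR) per route. The
    # route names, counter values and the 40% threshold are
    # illustrative assumptions only.
    def asr(answered: int, seizures: int) -> float:
        """ASR in percent = answered calls / seizure attempts * 100."""
        return 100.0 * answered / seizures if seizures else 0.0

    routes = {"ROUTE1": (4200, 5600), "ROUTE2": (150, 900)}
    for name, (answered, seizures) in routes.items():
        value = asr(answered, seizures)
        flag = "  <-- investigate" if value < 40.0 else ""
        print(f"{name}: ASR = {value:.1f}%{flag}")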
Route congestion can be due to high traffic, transmission failure, hangings or a software fault. Rectifying it requires proper fault localization and coordination with other groups, which becomes easy if the above-mentioned checks have already been made.
The ALEX OPIs mentioned below can be useful during troubleshooting of route congestion:

- Disturbance Supervision of Trunk Routes
- Blocking Supervision
- Traffic Limitations on Routes, Change
7.2.2 Signalling Congestion
Signalling congestion can cause major disturbances in the network and can result in call failures as well. It can be either MTP signalling congestion or SCCP signalling congestion; below, the troubleshooting for both types is discussed.
- Check the alarm list and error logs in the connected nodes.
- Check STS data for C7 traffic measurements on signalling links and destinations; if congestion is found, arrange for link expansion and load balancing.
- Check event reporting for the following ENUMs related to MTP congestion:
  • ATM: C7ERP:ENUM=12&13&14&15&16&17&18&19&20;
  • IP: EREPP:ENUM=1030&1031&1032&1034;
- Check the status of the Signalling Point in the SCCP network (C7NCP) and, as a remedy, arrange some rerouting.
- Check event reporting for the following ENUMs related to SCCP congestion:
  • C7ERP:ENUM=106&156&157&158
In the case of an MGW, congestion can be checked with the following PM counters (a delta-calculation sketch follows the list):

- ATM:
  [MO] Mtp3bSlItu.pmNoOfLocalLinkCongestRec
  [MO] Mtp3bSlItu.pmNoOfLocalLinkCongestCeaseRec
- IP:
  [MO] M3uAssociation.pmNoOfSconRec
  [MO] M3uAssociation.pmNoOfSconSent
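
These PM counters are cumulative, so congestion activity shows up as the difference between two readings; a sketch assuming the counter values have already been collected from the MGW:

    # Sketch: congestion events per collection interval from
    # cumulative MGW PM counters (readings assumed already fetched).
    COUNTERS = [
        "Mtp3bSlItu.pmNoOfLocalLinkCongestRec",
        "M3uAssociation.pmNoOfSconRec",
    ]

    def congestion_deltas(prev: dict, curr: dict) -> dict:
        """Events in the interval = current reading - previous reading."""
        return {c: curr.get(c, 0) - prev.get(c, 0) for c in COUNTERS}

    prev = {"Mtp3bSlItu.pmNoOfLocalLinkCongestRec": 10}
    curr = {"Mtp3bSlItu.pmNoOfLocalLinkCongestRec": 42}
    print(congestion_deltas(prev, curr))
    # A non-zero delta means congestion occurred during the interval.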
Signalling congestion can be due to high traffic, transmission failure, wrong definitions (loops), hangings or a software fault. Rectifying it requires proper fault localization, which becomes easy if the above-mentioned checks have already been made; as a remedy, temporary rerouting can be arranged.
7.2.3 Load Congestion or System Overload
If the APZ carries more load than it should, the node will not behave as usual, and experience is needed to handle a high processor load situation. In the AXE system, the processor load functions are handled by the function block LOAS, which controls the intensity of calls accepted by the system through buffering, thereby controlling the processor load and preventing disturbance or failure of the system. In addition, LOAS also handles functions such as supervision of exchange input load, observation of processor load and measurement of processor load.
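
As a much-simplified illustration only (this is not the actual LOAS algorithm), load regulation can be pictured as a buffer from which at most a per-interval acceptance limit of calls is fetched:

    # Much-simplified illustration of load regulation; NOT the real
    # LOAS implementation. Offered calls are buffered, and per control
    # interval at most `acceptance_limit` of them are fetched.
    from collections import deque

    class LoadRegulator:
        def __init__(self, acceptance_limit: int):
            self.acceptance_limit = acceptance_limit
            self.buffer: deque[int] = deque()

        def offer(self, call_id: int) -> None:
            self.buffer.append(call_id)          # an "offered" call

        def fetch_interval(self) -> list[int]:
            """Forward at most acceptance_limit buffered calls."""
            n = min(self.acceptance_limit, len(self.buffer))
            return [self.buffer.popleft() for _ in range(n)]  # "fetched"

The "offered" and "fetched" terms reappear in the PLLDP printout described later in this section.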
The following checks should be made in case of load congestion, to verify whether the high processor load is due to high traffic only or something else is causing the congestion:

- Check the alarm list and error logs in the connected nodes
- Check the route status (STRSP:R=ALL;)
- Check if any size alteration action is required (DBTSP:TAB=SAACTIONS;)
- Check the processor load (PLLDP;)
- Check the application protocol link data (ARLDP:APCLNK=ALL;)
- Check the SAE values for block LAD (SAAEP:SAE=ALL,BLOCK=LAD;)
- Check software / application error data (SYRIP:SURVEY;)
- Check active forlopps (SYFAP:MINUTES=10; or SYFAP:MINUTES=30;)
- Check CP event records (DIRCP;)
- Check RP event records (DIRRP:RP=ALL;)
- Check SAE faults (DBTSP:TAB=SAEFAULTS;)
On-duty staff should check and fully understand the processor load and the values shown in the printout produced by the command PLLDP. The command PLLDP requests a printout of load data for the twelve control intervals preceding the time the command is received. A control interval is 5 seconds long and is used in the load control process. The processor load data is presented in the answer printout PROCESSOR LOAD DATA.
The data included in the printout, for each control interval, is as follows:
• The average processor load.
• The call acceptance limit.
• The number of offered originating calls.
• The number of offered incoming calls.
• The number of fetched originating calls.
• The number of fetched incoming calls.
• The number of offered high priority miscellaneous tasks and printout requests.
• The number of offered low priority miscellaneous tasks and printout requests.
• The number of fetched high priority miscellaneous tasks and printout requests.
• The number of fetched low priority miscellaneous tasks and printout requests.
The term "offered" means requests for processor capacity in the load control function (LOAS).
The term "fetched" means the number of calls forwarded by the load control function for further
processing.
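
The offered/fetched relation gives a quick view of how hard load regulation is throttling traffic; a sketch over figures read manually from a PLLDP printout (the sample values are made up):

    # Sketch: per-interval call acceptance ratio from PLLDP figures.
    # Each tuple: (avg load %, offered calls, fetched calls); the
    # sample values are made up.
    intervals = [
        (62, 410, 410),
        (97, 520, 390),   # regulation active: fetched < offered
    ]
    for load, offered, fetched in intervals:
        ratio = 100.0 * fetched / offered if offered else 100.0
        print(f"load={load}%  offered={offered}  fetched={fetched}  "
              f"accepted={ratio:.0f}%")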
During normal operation of the system, the following functions should be active to monitor the processor load and its effects. Detailed descriptions of all the following commands and printouts are available in the O&M documentation (ALEX), which should be consulted before using the commands.
- Processor Load Control Data (PLCDP)
- Processor Load Observation Data (PLOBP)
- Exchange Input Load Observation Data (PLEIP)
- Exchange Input Load Supervision Data (PLSVP)
- System Data Change (PLSDP)
- Subscriber Priority Data (PLSUP)
- Processor Load Scheduled Measurement (PLSMP)
Handling of overload situations and regulation of traffic are described briefly in the attached document, which should be studied, understood and used.
7.3 Link Failure
Link failures normally experienced are of two types: signalling link failure and traffic link (DIP / SNT) failure. In both cases a link failure can cause congestion or traffic loss in the network (if redundancy is not available). Below are the OPIs (available in ALEX) that need to be followed in case of a link failure:

- CCITT7 SIGNALLING LINK FAILURE
- DIGITAL PATH FAULT SUPERVISION
- SYNCHRONOUS DIGITAL PATH FAULT SUPERVISION
- ATM PORT REMOTE DEFECT INDICATION FAULT
- ATM PORT ALARM INDICATION SIGNAL FAULT
End-to-end testing must be done to confirm the link status, and as a workaround re-routing can be done to reduce the impact.
7.4 System Restart
A system restart can be caused by a hardware (HW) fault, a software fault or a handling fault, or the restart can be initiated by command.

A system restart returns the entire system to a predetermined state, which might be needed if a fault occurs in a non-FORLOPP-adapted part of the system. The following system restart levels exist:
• Small restart is the first restart level used. Stable transactions, e.g. calls in speech position, are retained.

• Large restart is selected if the restart situation occurs a second time within a specified period, generally about 4 minutes. The application will then normally release all switching-network connections.

• Large restart with reload is selected if two or more restarts have occurred within the last 4 minutes. It involves the automatic reload of programs and data, either from a portion of the main store reserved for storage of the system backup copy or from a disk storage unit. Reloading from the main store is extremely fast, and since the reload information is automatically updated during normal traffic execution, information loss is minimal during the reload. Following the reload, the same actions are taken as in the case of a large system restart. If a reload attempt is considered unsuccessful, the system automatically attempts to load from an older system backup copy.
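
The escalation between these levels can be modelled as a simple function of the recent restart history (an illustrative model only; the 4-minute window is the "generally about 4 minutes" figure above):

    # Illustrative model of restart-level escalation; the 4-minute
    # window follows the description above.
    WINDOW_S = 4 * 60

    def restart_level(earlier_restarts_s: list[float], now_s: float) -> str:
        """earlier_restarts_s: timestamps (s) of previous restarts."""
        recent = [t for t in earlier_restarts_s if now_s - t <= WINDOW_S]
        if len(recent) >= 2:
            return "large restart with reload"
        if len(recent) == 1:
            return "large restart"
        return "small restart"

    print(restart_level([], 0.0))               # small restart
    print(restart_level([10.0], 120.0))         # large restart
    print(restart_level([10.0, 130.0], 200.0))  # large restart with reload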
The criteria for initiating the system restart function described above are evaluated by a so-called selective restart function. This function implies that, if a software fault is detected in a block that is of minor importance for the traffic process (according to its defined block category), the system restart can be either suppressed or delayed until a time of low traffic.
A system restart can be caused by a HW fault. One indication of this is the printout CP FAULT being received in connection with a system restart. Other indications of HW faults are:

• Printout RESTART DATA with REASON = LATE SIDE INDICATION
• Printout RESTART DATA with REASON = FAULT IN EX
• Cyclic system restarts
The condition in which the system restart is caused by a HW fault on the CP side that was changed from Executive (EX) to Standby (SB) in connection with the system restart is called Late Side Indication.

If the system automatically restores to parallel or normal operation in the CP after the restart, the fault is regarded as temporary.
Repair attempts must be made either when the fault has manifested itself as a temporary fault so many times that the alarm CP FAULT is received, or when the fault has changed in such a way that it is now regarded as a permanent fault (the alarm CP FAULT is also received in this case).
Cyclic System Restarts

A HW fault resulting in a restart at a certain working state in the CP can cause cyclic system restarts. The support team must be consulted immediately in this case. To prevent additional restarts, stop the SB side of the CP with the command DPHAS (automatic change of working state is then blocked).
Data Collection
A part of the recovery data, or all recovery data for a recovery, is automatically printed in the printout RESTART DATA. Which recovery data is printed automatically is set with the command SYRFC, as part of the exchange data.
A recovery data printout must be enclosed when the system restart is reported. The recovery data printout can be either RESTART DATA or RESTART INFORMATION:

RESTART DATA:
If sufficient recovery data is printed in the printout RESTART DATA and the printout is available, it is enclosed with the report. The command SYRIP with parameter NOPRINT=ALL is used; the printout initiated by SYRIP is used to determine the restart reason and to mark the data as printed.

RESTART INFORMATION:
If sufficient recovery data is not already available, the recovery data must be printed by command. The command SYRIP with parameter PRINT=ALL is normally used; the printout initiated by SYRIP is used to determine the restart reason and is included as an enclosure to the report.
If the restart occurred in an HLR / FNR, an extra check is the status of the RPs used for authentication: it has been experienced that AUC RPs get blocked and cause authentication failures (no location updates). The AUC RPs must be checked and reset if needed.
The SYSTEM RESTART OPI is normally followed for all types of restarts. The alarm list must be checked soon after the system restart, and again after all alarms have been cleared, to confirm the stability of the node.

For further analysis, data must be collected; the data collection guidelines are attached in the Reference Table.
8 Reference Table

Reference                                 Documents
Critical Alarm handling & Record table    Alarm Handling.zip; Problem Handling Record.xls
Call Failure: MGW checklist               MGW Checks.doc
Congestion: Load Control                  Ramadan_1427_processor load control_PA
System Restart: Data Collection           Data collection in MGW.doc; Data collection in MSC-S.doc
Feedback:
All users are urged to contribute to this document, to make it even more comprehensive, and to update it regularly so that Managed Services and Support Engineers across all AUs can benefit from it.