Core Troubleshooting Guide
Prepared (also subject responsible if other): Mubasher A Malik, Mobily NSS Team
Approved: Aimal Khan
Date: 23-01-2008
Rev: A

Core Troubleshooting Guidelines

Contents
1 Scope
2 Introduction
3 Network Handling Requirements
4 Prerequisites for troubleshooting
5 Mandatory Actions for troubleshooting
6 Alarm Specifications
7 Critical Faults Handling / Troubleshooting
7.1 Call Failure or Call Drop
7.2 Congestions
7.3 Link Failure
7.4 System Restart
8 Reference Table

1 Scope
This document is an effort to simplify and standardize fault detection and troubleshooting in GSM / WCDMA Core nodes. Managed Services and Support engineers should use this document together with ALEX, PLEX documentation and functional specifications.
2 Introduction
With the introduction of the Mobile Switch Solution (referred to as MSS in the rest of this document; formerly Layered Architecture), fault localization has become more complex. In vertical architecture networks it was comparatively much easier to detect a faulty node, because one MSC is connected to several BSCs. In case of failure (dropped calls), a combination of CDR analysis (fault code, RBS, cause code, location…), AXE trace measures (Test system) and protocol analysis (with an analyzer or with the UPMTI command) could help to identify the faulty node.

In the MSS architecture, MGws were added between MSCs and RNCs. Because a call from one MSC to one RNC can be routed on different routes through several MGws, the call path is not easy to determine and detection of the faulty node becomes complicated. As a rule of thumb, when a faulty node is identified, standard node-related troubleshooting procedures should be performed. Several supporting documents and procedures have been included in this document.

3 Network Handling Requirements
The following information should be known and available for normal O&M as well as for troubleshooting problems:
- Network Diagram
- Numbering Plan (it must have DPC, GT, MTP & SCCP routing information)
- Signaling & Trunking Plan
- HLR User Profile Specification
- MGw capacity and Connectivity
- IP address for 3G Core nodes

4 Prerequisites for troubleshooting
- Good knowledge of the network (the Network plan must be available).
- Good hands-on experience in O&M of the Core network.
- Close cooperation between MSC, MGW and UTRAN (BO & CNS) groups.
- Efficient use of ALEX and PRIMUS.

5 Mandatory Actions for troubleshooting
While troubleshooting major problems, the actions below must be performed.
The list below has been compiled from best practices and on-ground experience, but it is not exhaustive and may not cover scenarios we have not come across.
1) For all problems / incidents, escalate to and coordinate with Front Office.
2) Understand the problem and its impact.
3) Collect relevant information and analyse it to localise the problem (Area, Node).
4) Draw the network connectivity diagram related to the reported problem.
5) Inform the concerned parties as per standard processes and procedures.
6) Check the alarm list and error logs in the connected nodes.
7) Arrange proper testing for the reported cases and more.
8) Take the traces and analyse them.
9) Check the performance reports and STS data.
10) Check for DT changes made recently in the network and take printouts of all the related parameters.
11) Coordinate with other groups, the support team and management.
12) Recommend the best possible solution and agree on it with the customer / management; always consult ALEX for User Guides and OPIs.
13) Implement the solution as agreed.
14) Repeat steps 5 to 8.
15) Update Front Office with the status, restoration, actions taken and network stability.
16) Produce the relevant report for the concerned parties.

6 Alarm Specifications
To keep the network healthy, all supervisions must be activated and alarms should be defined with the correct severities. The major supervisions that need to be activated are listed below.
- Processor Load Supervision & Load Regulations.
- Audit Function Threshold Supervision (Memory & SAE).
- Event Reporting & Disturbance Supervision.
- Signalling Supervision (C7 disturbance, Link Set, Destination, SCCP_SSN, MTP).
- Route Supervision

Once all the supervisions are activated, the alarm definitions should be done properly.
All critical alarms must be handled and rectified by following the OPIs (Operational Instructions) and procedures. Some critical alarms that need to be attended to immediately are listed below.

Processor Alarms (APZ):
- CP Fault
- CPT Fault
- AP Fault
- AP Not Available / Redundant
- AP Process Stopped
- Audit Function Threshold Supervision
- EM Fault
- RP Fault

Telephony Alarms (APT):
- Size Alteration of Data Files Fault / Size Change Required
- System Restart
- CCITT7 Destination Inaccessible
- CCITT7 Signalling Link Failure / Link Set Supervision
- Common Charging Output Congestion / Error
- Group Switch Fault
- HLR State Fault
- Media Gateway Unavailable
- M3ua Destination Inaccessible

The alarms mentioned above are only a few from the actual list of critical alarms, and all of them have OPIs (Operational Instructions) available in ALEX for troubleshooting.

7 Critical Faults Handling / Troubleshooting
In vertical architecture networks it was much easier to detect a faulty node, because one MSC is connected to several BSCs. In case of failure (dropped calls), a combination of CDR analysis (fault code, RBS, cause code, location…), AXE trace measures (Test system) and protocol analysis (with an analyzer or with the UPMTI command) could help to identify the faulty node. As a practice, the alarm list and error log must be checked both when the problem is reported and once it is solved.

Troubleshooting and fault localization have become more complicated after the introduction of the Mobile Switch Solution (MSS / Layered Architecture). In the MSS architecture, MGws were added between MSCs (which become servers) and RNCs. Because a call from one MSC to one RNC can be routed in different ways through several MGws, the call path is not easy to determine and detecting the faulty node becomes a serious issue.
When a faulty node is identified, standard node-related troubleshooting procedures should be performed and the problem handling record form should be filled in (attached in the reference table).

Emergency situation handling and collection of data from all core nodes, including MSS, after outages is already described in the attached documents (Bulletins). The rest of this section discusses some critical faults and troubleshooting methods.

The reference table contains documents for handling and troubleshooting different problems, such as Processor Load Control in high load, the health checks needed before and after any upgrade, and Emergency / maintenance bulletins that describe the important procedures and processes to follow during emergencies.

7.1 Call Failure or Call Drop
Call failure for MT / MO, or call drop, is always declared and treated as an emergency which needs proper diagnosis to rectify. The major actions required to handle this problem are listed below.
- Collect relevant information to evaluate the NW status (details about call drops: where, when, how often, other symptoms).
- Analyse this information to localise the problem (Area, Node, Number Series, MO or MT, Prepaid or Post-paid etc.).
- Check the alarm list and error logs in the connected nodes.
- Arrange proper testing for the reported cases and more.
- Take the traces to identify the release cause.
- Check STS data.
- Check for DT changes made recently in the network.
- Escalate to the management and support team according to processes and procedures.

If the problem is area / node specific, then after confirming with test calls, take the traces and analyse them to identify the failure / release cause. Make a health check of the affected node, including the checks listed below.
- Status of the Node
- Processor Load
- Size Alterations
- Disturbance Analysis (SYRIP, SYFAP, SYFDP, SYFIP, SYFSP, ALLIP, DIRCP, DIRRP printouts should be analyzed)
- Software record analysis (LASIP, PCORP, LAEIP, PCECP, LAFBP printouts should be analyzed) in order to find out:
  • Are some useful corrections missing?
  • Are some corrections in the dump faulty or suspected faulty?
  • Is there any mismatch in the software record (if part of a correction is passive)?
- Route Utilizations
- Event reporting
- Signalling status (MTP & SCCP)
- STS data
- Analyze the traces taken

If the problematic node is an MGW, the following checks should be made.
- Check MGW selection in MSC-S.
- Check the availability of the MGW and the available resources in it.
- Check the alarm and Event log in the MGW.
- Take GCP traces in the MGW.

The attached document in the reference table can help in checking the MGW and collecting data in it. As mentioned before, call failure or call drop should always be treated as an emergency. On-duty staff should keep a record and track of all troubleshooting done locally or remotely by support teams, and confirm the restoration by testing as well as by statistics.

7.2 Congestions
In a live network, congestion can disturb the services in many ways. Three types of congestion are discussed here: Route Congestion, Signalling Congestion and Load Congestion.

7.2.1. Route Congestion
Route Congestion can be reported by FO (if the alarm is activated), reported by the performance team, or observed during fault localization of problems such as call failure or high processor load. The instructions below should be followed to handle this problem.
Check the following:
- Alarm list and error logs
- Route utilization
- Traffic measurements on routes
- Traffic type measurements
- Service quality statistics (SQS)
- End-of-selection measurements
- Answer to seizure ratio (ASR)

Route congestion can be due to high traffic, transmission failure, hangings or a software fault. Rectifying this problem requires proper fault localization and coordination with other groups, which becomes easy if the checks mentioned above have already been made. The following ALEX OPIs can be useful when troubleshooting Route Congestion.
- Disturbance Supervision of Trunk Routes
- Blocking Supervision
- Traffic Limitations on Routes, Change

7.2.2. Signalling Congestion
Signalling congestion can cause major disturbance in the network and can result in call failure as well. It can be MTP signalling congestion or SCCP signalling congestion; troubleshooting for both types is described below.
- Check the alarm list and error logs in the connected nodes.
- Check STS data for C7 traffic measurements on Signalling links and Destinations; if congestion is found, arrange for link expansion and load balancing.
- Check event reporting for the following ENUMs related to MTP congestion.
  • ATM: C7ERP:ENUM=12&13&14&15&16&17&18&19&20;
  • IP: EREPP:ENUM=1030&1031&1032&1034;
- Check the status of the Signalling Point in the SCCP network (C7NCP) and, as a remedy, arrange some rerouting.
- Check event reporting for the following ENUMs related to SCCP congestion.
  • C7ERP:ENUM=106&156&157&158
- In case of an MGW, congestion can be checked using the counters mentioned below.
  • ATM: [MO] Mtp3bSlItu.pmNoOfLocalLinkCongestRec, [MO] Mtp3bSlItu.pmNoOfLocalLinkCongestCeaseRec
  • IP: [MO] M3uAssociation.pmNoOfSconRec, [MO] M3uAssociation.pmNoOfSconSent

Signalling congestion can be due to high traffic, transmission failure, wrong definitions (loop), hangings or a software fault. Rectifying this problem requires proper fault localization, which becomes easy if the checks mentioned above have already been made. As a remedy, temporary rerouting can be arranged.

7.2.3. Load Congestion or System Overload
If the APZ carries more load than it should, the node will not behave normally, and experience is needed to handle the situation under high processor load. In the AXE system the processor load functions are handled by the function block LOAS, which controls the intensity of calls accepted by the system by buffering, thereby controlling the processor load and preventing disturbance or failure of the system. In addition, LOAS handles functions such as supervision of exchange input load, observation of processor load and measurement of processor load.

The following checks should be made in case of load congestion, to verify whether the high processor load is due to high traffic only or whether something else is causing the congestion.
- Check the alarm list and error logs in the connected nodes
- Check the Route status (STRSP:R=ALL;)
- Check if any size alteration action is required (DBTSP:TAB=SAACTIONS;)
- Check Processor Load (PLLDP;)
- Check the application protocol link data (ARLDP:APCLNK=ALL;)
- Check SAE values for block LAD (SAAEP:SAE=ALL,BLOCK=LAD;)
- Check software / application errors data (SYRIP:SURVEY;)
- Check active forlopps (SYFAP:MINUTES=10;) (SYFAP:MINUTES=30;)
- Check CP event records (DIRCP;)
- Check RP event records (DIRRP:RP=ALL;)
- Check SAE faults (DBTSP:TAB=SAEFAULTS;)

On-duty staff should check and fully understand the processor load and the values shown in the printout for command PLLDP. The command PLLDP requests a printout of load data information for the twelve control intervals preceding the time of reception of the command. A control interval is 5 seconds and is used in the load control process, so the printout covers the preceding 60 seconds. The processor load data is presented in the answer printout PROCESSOR LOAD DATA. The data included in the printout, for each control interval, is as follows:
• The average processor load.
• The call acceptance limit.
• The number of offered originating calls.
• The number of offered incoming calls.
• The number of fetched originating calls.
• The number of fetched incoming calls.
• The number of offered high priority miscellaneous tasks and printout requests.
• The number of offered low priority miscellaneous tasks and printout requests.
• The number of fetched high priority miscellaneous tasks and printout requests.
• The number of fetched low priority miscellaneous tasks and printout requests.

The term "offered" means requests for processor capacity in the load control function (LOAS). The term "fetched" means the number of calls forwarded by the load control function for further processing.

During normal operation of the system, the following functions should be active in order to monitor the processor load and its effects. Detailed descriptions of all the following commands and printouts are available in the O&M documentation (ALEX), which should be referred to before using the commands.
- Processor Load Control Data (PLCDP)
- Processor Load Observation Data (PLOBP)
- Exchange Input Load Observation Data (PLEIP)
- Exchange Input Load Supervision Data (PLSVP)
- System Data Change (PLSDP)
- Subscriber Priority Data (PLSUP)
- Processor Load Scheduled Measurement (PLSMP)

Handling of overload situations and regulation of traffic is described briefly in the attached document, which should be studied, understood and used.

7.3 Link Failure
Link failures normally experienced are of two types: signalling link failure and traffic link (DIP / SNT) failure. In both cases a link failure can cause congestion or traffic loss in the network (if redundancy is not available). The OPIs (available in ALEX) to follow in case of a link failure are listed below.
- CCITT7 SIGNALLING LINK FAILURE
- DIGITAL PATH FAULT SUPERVISION
- SYNCHRONOUS DIGITAL PATH FAULT SUPERVISION
- ATM PORT REMOTE DEFECT INDICATION FAULT
- ATM PORT ALARM INDICATION SIGNAL FAULT

End-to-end testing must be done to confirm the link status and, as a workaround, re-routing can be done to reduce the impact.

7.4 System Restart
A system restart can be caused by a hardware (HW) fault, a software fault or a handling fault, or the restart can be initiated by a command. System Restart returns the entire system to a predetermined state, which might be needed if a fault occurs in a non-FORLOPP adapted part of the system. The following system restart levels exist:
• Small restart is the first restart level used. Stable transactions, e.g. calls in speech position, are retained.
• Large restart is selected if the restart situation occurs a second time within a specified period of time, generally about 4 minutes. The application will then normally release all switching-network connections.
• Large restart with reload is selected if two or more restarts have occurred within the last 4 minutes. This involves the automatic reload of programs and data from either a particular portion of the main store, which is reserved for the storage of the system backup copy, or from a disk storage unit. Reloading from main store is extremely fast. Since reload information is automatically updated during normal traffic execution, minimal information loss is ensured during the reload process. Following the reload, the same actions are taken as in the case of a large system restart. If a reload attempt is considered unsuccessful, the system automatically attempts to load from an older system backup copy.

The criteria for initiating the system restart function as described above are evaluated by a so-called selective restart function. This function implies that, if a software fault is detected in a block that is of minor importance for the traffic process (according to its defined block category), the system restart can be either suppressed or delayed until a time of low traffic.

A system restart can be caused by a HW fault. One indication of this is when printout CP FAULT is received in connection with a system restart. Other indications of HW faults are:
• Printout RESTART DATA with REASON = LATE SIDE INDICATION
• Printout RESTART DATA with REASON = FAULT IN EX
• Cyclic system restarts

The condition in which the system restart is caused by a HW fault in the CP side that was changed from Executive (EX) to SB in connection with the system restart is called Late Side Indication. If the system automatically restores to parallel or normal operation in the CP after the restart, the fault is regarded as temporary.
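The two rules given so far in this section, escalation between restart levels and the indications of a HW fault, can be sketched in Python as below. This is only an illustration of the text, not how the AXE system implements it: the function names, the use of timestamps in seconds and the fixed 4-minute window are assumptions, and the selective restart function (suppression or delay of a restart) is deliberately left out.

```python
from enum import Enum

# Assumption: the text says "generally about 4 minutes"; the real period
# is exchange-specific and may differ.
RESTART_WINDOW_SECONDS = 4 * 60

# REASON values of printout RESTART DATA that the text lists as HW indications.
HW_FAULT_REASONS = {"LATE SIDE INDICATION", "FAULT IN EX"}


class RestartLevel(Enum):
    SMALL = "small restart"
    LARGE = "large restart"
    LARGE_WITH_RELOAD = "large restart with reload"


def select_restart_level(earlier_restarts, now, window=RESTART_WINDOW_SECONDS):
    """Pick a restart level from the times (in seconds) of earlier restarts.

    Hypothetical helper: counts the earlier restarts that fall inside the
    window ending at `now` and escalates accordingly.
    """
    recent = [t for t in earlier_restarts if 0 <= now - t <= window]
    if not recent:
        return RestartLevel.SMALL              # first restart situation
    if len(recent) == 1:
        return RestartLevel.LARGE              # second time within the window
    return RestartLevel.LARGE_WITH_RELOAD      # two or more earlier restarts


def hardware_fault_suspected(cp_fault_printed, restart_reason, cyclic_restarts):
    """Return True if any HW fault indication listed in the text is present."""
    reason = restart_reason.strip().upper()
    return bool(cp_fault_printed or cyclic_restarts or reason in HW_FAULT_REASONS)
```

For example, a restart at t = 1000 s with one earlier restart at t = 900 s would be classified as a large restart, while an earlier restart at t = 100 s falls outside the window and leaves the level at small restart.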
Repair attempts must be made either when the fault has manifested itself as a temporary fault so many times that the alarm CP FAULT is received, or when the fault has changed in such a way that it is now regarded as a permanent fault (alarm CP FAULT is also received).

Cyclic System Restarts
A HW fault resulting in a restart at a certain working state in the CP can cause cyclic system restarts. The support team must be consulted immediately in this case. To prevent additional restarts, stop the SB side of the CP with command DPHAS (automatic change of working state is blocked).

Data Collection
A part of the recovery data, or all recovery data for a recovery, is automatically printed in printout RESTART DATA. The recovery data that is automatically printed is set by command SYRFC, as a part of the exchange data. A recovery data printout must be enclosed when the system restart is reported. The recovery data printout can be either RESTART DATA or RESTART INFORMATION:

RESTART DATA: If sufficient recovery data is printed in printout RESTART DATA and the printout is available, it is enclosed with the report. Command SYRIP with parameter NOPRINT = ALL is used. The printout initiated by SYRIP is used to determine the restart reason, and to mark data as printed.

RESTART INFORMATION: If sufficient recovery data is not already available, recovery data must be printed by command. Command SYRIP with parameter PRINT = ALL is normally used. The printout initiated by SYRIP is used to determine the restart reason and is included as an enclosure to the report.

If the restart occurred in an HLR / FNR, an extra check is the status of the RPs for authentication, as it has been observed that AUC-RPs get blocked and cause authentication failure (no location update). AUC-RPs must be checked and reset if needed.
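The choice between the two recovery data printouts described under Data Collection can be summarised in a small Python sketch. The parameter names NOPRINT and PRINT and the value ALL are taken from the text; the exact command layout with colon and semicolon is an assumption made by analogy with commands such as SYRIP:SURVEY; quoted earlier in this document, and must be verified in ALEX before use. The function name is hypothetical.

```python
def recovery_data_plan(restart_data_sufficient):
    """Return (printout to enclose with the report, SYRIP command to run).

    restart_data_sufficient: True if printout RESTART DATA is available
    and already contains enough recovery data.
    """
    if restart_data_sufficient:
        # RESTART DATA is enclosed; SYRIP only determines the restart
        # reason and marks the data as printed.
        return ("RESTART DATA", "SYRIP:NOPRINT=ALL;")  # assumed syntax
    # Otherwise the recovery data has to be printed by command and the
    # resulting printout is enclosed with the report.
    return ("RESTART INFORMATION", "SYRIP:PRINT=ALL;")  # assumed syntax
```

Keeping the decision in one place makes it easy for on-duty staff to see which printout belongs in the restart report in each case.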
The SYSTEM RESTART OPI is normally followed for all types of restart. The alarm list must be checked soon after the system restart, and again after clearing all alarms, to confirm the stability of the node. For further analysis, data must be collected; data collection guidelines are attached in the reference table.

8 Reference Table
Index | Reference Documents
Critical Alarm handling & Record table | Alarm Handling.zip, Problem Handling Record.xls
Call Failure: MGW checklist | MGW Checks.doc
Congestion: Load Control | Ramadan_1427_processor load control_PA
System Restart: Data Collection | Data collection in MGW.doc, Data collection in MSC-S.doc

Feedback: All users are urged to contribute to this document, to make it even more comprehensive and to update it regularly, so that Managed Services and Support Engineers can benefit from it across all Aus.